Once more unto the breach, dear friends, once more;
– The Life of King Henry the Fifth, Act III, Scene 1
When last we met, I tried to use Shakespeare’s Romeo and Juliet to help explain genome sequencing. In particular, we looked at how to find differences by comparing the genomes of individual people, either to each other or to a “gold standard” reference. It might be worth giving that article a read, if you haven’t already. It’s full of useful information, like why reading a human genome is like diving into the most boring 46-volume, 857,000-page epic you can imagine.
Although I didn’t use the term there, the process of finding a short DNA sequence in that magnum opus is usually referred to as “mapping”. If you find out where it came from, you’ve “mapped” it to its proper genomic location. By analogy, we can take an isolated phrase (like the famous “Romeo, Romeo, wherefore art thou…” speech) and locate it by comparing with the whole text of the play. Doing this, we’d find it near the beginning of Act II, Scene 2.
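For the programmers in the audience, here is a minimal sketch of mapping in Python. It assumes exact matching only, and both the `reference` string and the `map_read` function are made up for illustration; real mapping software uses indexed data structures and tolerates mismatches.

```python
def map_read(read: str, reference: str) -> list[int]:
    """Return every 0-based position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping hits are found too.
        start = reference.find(read, start + 1)
    return positions


# A tiny, invented "reference genome" and a short read taken from it.
reference = "ACGTTAGCCGATTACAGGCTA"

print(map_read("GATTACA", reference))  # one hit: the read maps uniquely
print(map_read("TA", reference))       # several hits: too short to place
```

A read with exactly one hit has been mapped; a read with several hits is ambiguous, a problem that comes up again further on.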
That works well for organisms that have had their genomes sequenced, like humans, mice, cattle, or bacteria – hundreds of species. There’s a nice, although I think incomplete, list at the GNN – Quick Guide to Sequenced Genomes. But what about the millions of species with no available genome sequence?
Imagine, if you will, that you are a researcher who has just discovered a brand-new species of beetle (there are lots of species of beetles, so this is actually pretty likely). You might want to see what’s in its DNA that makes it different from other beetles, or find some genetic evidence to help work out how it’s related to other species.
Most DNA sequencing specialists would take the approach of extracting the poor beetle’s DNA, fragmenting it up into manageable pieces, and sequencing these small fragments en masse. There are a number of good technologies for doing this, and they all give you the same result: millions and millions of short “words”, made up of the letters A, C, G and T, the four chemical bases that make up DNA. These words are randomly derived from all over the hapless insect’s genome. Putting them together to make the beetle version of that 857,000-page book (beetle genomes are generally a bit smaller – so maybe 43,000 pages) is like doing a huge jigsaw puzzle. One with really tiny pieces, all of more or less the same colour – and without the picture on the box lid to help you.
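To make the fragmenting step concrete, here is a rough simulation in Python. It is only a sketch under simplifying assumptions: every read has the same length, start positions are uniformly random, and there are no sequencing errors. The function names and the toy genome are hypothetical.

```python
import random


def shotgun_reads(genome: str, read_len: int, n_reads: int, seed: int = 0) -> list[str]:
    """Draw `n_reads` random substrings of length `read_len` from `genome`."""
    if read_len > len(genome):
        raise ValueError("read_len must not exceed the genome length")
    rng = random.Random(seed)  # seeded so the example is repeatable
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def mean_coverage(genome: str, read_len: int, n_reads: int) -> float:
    """Average number of reads covering each position of the genome."""
    return n_reads * read_len / len(genome)


toy_genome = "ACGTTAGCCGATTACAGGCTAACGGTCA"  # invented sequence, not a real beetle
reads = shotgun_reads(toy_genome, read_len=8, n_reads=12)
print(reads)
print(mean_coverage(toy_genome, read_len=8, n_reads=12))
```

Going the other way, from the shuffled reads back to the original text, is the jigsaw puzzle described above.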
This process, called “assembly”, relies on finding overlaps between sequences. Remember, these are all random fragments, starting and ending at different places, so if we have enough DNA (and thus enough copies of the genome) to begin with, by chance some of these fragments will have recognizable places where they overlap. And this is where Julius Caesar comes in.
Imagine, if you will, that you’ve never read the play (I had to in high school, and have seen it performed once; your experience may differ). Think of this set of text fragments as being like a handful of short DNA sequences:
ds, Romans, count
ns, countrymen, le
Friends, Rom
end me your ears;
trymen, lend me
If you didn’t know the play, you might have trouble making sense of this – until you realize that some of them overlap. When you shuffle them around and line them up, you get something like this:
Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     trymen, lend me
                              end me your ears;
Which gives you the consensus:
Friends, Romans, countrymen, lend me your ears;
If you’ve read the play, you know this as the beginning of Antony’s famous speech from Act III, Scene 2. If not, you might still recognize it as “real” English – it has syntax and meaning. And this is just how genome assembly works. Sophisticated computer programs use pattern recognition to look for overlaps, and assemble clusters of sequences, commonly referred to as “contigs” (for “contiguous assemblies”, I suppose). Smaller contigs are joined together to make larger ones, usually using other information about how they might fit together. These are generally referred to as “scaffolds”. Ultimately, all of these are put together into the whole genome sequence.
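The shuffle-and-line-up exercise with Antony's speech can be sketched in a few lines of Python. This is a deliberately naive greedy approach, assuming error-free fragments and exact overlaps; it is not how any particular assembler works (real ones typically build overlap or de Bruijn graphs and cope with errors). The names `overlap` and `greedy_assemble` and the `min_len` threshold are my own inventions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to join: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


fragments = [
    "ds, Romans, count",
    "ns, countrymen, le",
    "Friends, Rom",
    "end me your ears;",
    "trymen, lend me",
]
print(greedy_assemble(fragments))
```

With these five fragments the result is a single contig, the consensus line. The `min_len` threshold matters: set it too low and chance one-letter matches start gluing unrelated pieces together.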
However, it’s not necessarily quite that simple. All but the simplest genomes (viruses and bacteria, for example) are riddled with pieces of DNA sequence that all look like each other, the dreaded “repetitive elements”. In our play analogy, you can think of them as common words that appear multiple times. These can play havoc with the assembly process. Consider this, one of my favourite passages of Shakespearean dialogue, from Act IV, Scene 2:
CASSIUS: Stand, ho!
BRUTUS: Stand, ho! Speak the word along.
First Soldier: Stand!
Second Soldier: Stand!
Third Soldier: Stand!
Imagine you were trying to re-assemble the play’s script from many tiny pieces, and you came across a fragment, “Stand,”. Where does it go? Did Cassius speak it, or does it belong to Brutus?
Even worse, what if there was an ambiguity in the last character? Perhaps it was blurred due to water damage, so that you only knew it was “Stand”, followed by something. Is that something a comma, an exclamation mark (both of which would fit in the passage above), or something else (a letter “s”, for example)? This kind of ambiguous character happens all the time in DNA sequencing, and the result in our example above is that now our piece of text could go in one of five places, spoken by no fewer than five different characters.
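Here is a small Python sketch of that ambiguity, using the passage quoted above as the "genome". The `?` wildcard standing in for the water-damaged character is my own convention, and `candidate_lines` is a hypothetical helper rather than anything from a real sequencing tool.

```python
import re

passage = (
    "CASSIUS: Stand, ho!\n"
    "BRUTUS: Stand, ho! Speak the word along.\n"
    "First Soldier: Stand!\n"
    "Second Soldier: Stand!\n"
    "Third Soldier: Stand!\n"
)


def candidate_lines(fragment: str, text: str, wildcard: str = "?") -> list[str]:
    """Lines of `text` where `fragment` could sit; `wildcard` matches any one character."""
    pattern = re.compile(
        "".join("." if ch == wildcard else re.escape(ch) for ch in fragment)
    )
    return [line for line in text.splitlines() if pattern.search(line)]


def speakers(lines: list[str]) -> list[str]:
    """Pull the speaker name off the front of each line."""
    return [line.split(":")[0] for line in lines]


print(speakers(candidate_lines("Stand,", passage)))  # readable last character
print(speakers(candidate_lines("Stand?", passage)))  # blurred last character
```

The clean fragment fits two lines; the blurred one fits all five.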
We can fix this, at least some of the time, by asking for longer pieces of sequence (or, getting back to the jigsaw analogy, bigger pieces). If we have a longer text fragment like “Stand, ho! Speak the”, then we know exactly where it’s supposed to go, even though each of the four words individually occurs more than once in the play, and even if there are a few ambiguities in it. This is how we get around really short repeats in DNA sequence – by using sequence reads long enough to either encompass them entirely, or to make a unique sequence. But for long words or phrases that appear more than once, sometimes we just can’t find the right location, no matter what we do.
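The effect of read length can be checked directly. The sketch below, in Python, counts which "words" of a given length occur more than once in a sequence and then looks for the shortest length at which every word is unique. The toy genome is invented, with the repeat TGATTACA planted in it twice, and both function names are hypothetical.

```python
from collections import Counter


def repeated_kmers(genome: str, k: int) -> dict[str, int]:
    """All length-k substrings of `genome` that occur more than once, with counts."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n > 1}


def shortest_unique_read_length(genome: str) -> int:
    """Smallest k for which every length-k substring occurs exactly once."""
    for k in range(1, len(genome) + 1):
        if not repeated_kmers(genome, k):
            return k
    return len(genome)


# Invented sequence containing the 8-letter repeat "TGATTACA" twice.
toy_genome = "AAGCTGATTACACCGTGATTACATTG"

print(repeated_kmers(toy_genome, 8))
print(shortest_unique_read_length(toy_genome))
```

Reads one letter longer than the planted repeat are enough here, because each such read must include at least one letter of unique flanking sequence. When the repeat is longer than any read the sequencer can produce, no read length within reach will do, which is the stubborn case mentioned above.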
Except – there is one more commonly used trick, called “paired-end”, or “mate pair” sequencing (these are slightly different things, but for our purpose it doesn’t really matter). This takes advantage of deriving DNA sequences from both ends of a much larger fragment, and using the location information from one to help position the other. Here’s how it works.
The phrase “Caesar’s house” is spoken twice, once by a servant in Act III, Scene 2, and again by Antony in Act IV, Scene 1. Now, imagine we have a fragment of the play of, say, half a page or so in length, and we’ve found “Caesar’s house” at one end of it. If we look at the other end, we might find the rather unusual phrase, “slanderous loads”. The only place this occurs is in Act IV, almost exactly half a page away from one of the instances of “Caesar’s house”. Now, we can be reasonably certain that we’ve properly “mapped” the first phrase onto the play, and that it’s the one Antony speaks. The key here is that we don’t need to know all the text in between – in other words, we don’t need the whole half-page of text to help in mapping the ambiguous end. We just need to know that the two ends are about a half a page apart. This method of using paired sequences from the ends of large DNA fragments to map non-unique sequences is very useful in assembling unknown genomes.
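The same trick can be sketched in Python. In this toy version an ambiguous read is kept only at positions where its mate turns up about the expected distance downstream. Measuring distance from the start of one read to the start of the other, ignoring strand orientation and sequencing errors, and the names `find_all` and `place_with_mate` are all simplifying assumptions of mine.

```python
def find_all(pattern: str, text: str) -> list[int]:
    """Every 0-based position where `pattern` occurs in `text`."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def place_with_mate(read: str, mate: str, reference: str,
                    distance: int, tolerance: int) -> list[int]:
    """Positions of `read` whose `mate` starts about `distance` characters downstream."""
    mate_hits = find_all(mate, reference)
    return [
        p for p in find_all(read, reference)
        if any(abs((m - p) - distance) <= tolerance for m in mate_hits)
    ]


# Invented reference: the repeat appears twice, the mate sequence only once.
repeat = "GATTACA"
unique_mate = "ACGTTGCA"
reference = "TT" + repeat + "CCGGTTAACC" + repeat + "TTCCGGAAGG" + unique_mate

print(find_all(repeat, reference))                                   # two candidates
print(place_with_mate(repeat, unique_mate, reference, distance=17, tolerance=2))
```

On its own the repeat could sit in either of two places; knowing roughly how far away its mate lies leaves only one. Note that the code never looks at the letters between the two reads, just as in the half-page example.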
Genome assembly, of course, is much more complex and fraught with problems than I’ve led you to believe, but that’s the basic idea. At the end of the article are a few more frightening technical references, if you’d like to dig deeper. In the meantime, I can’t find a suitable quotation to finish off with, so I’ll just leave you with Octavius’ closing words:
So call the field to rest; and let’s away,
To part the glories of this happy day.
If you’re really interested in beetle genomes, the description of the Red Flour Beetle one is a good place to start. Of course, if you are really interested in beetle genomes, you’ve probably read it already. You’ll need a subscription to the journal, though.
- Tribolium Genome Sequencing Consortium, et al. The genome of the model beetle and pest Tribolium castaneum. Nature, April 2008, vol. 452 issue 7190, pp. 949-55.
Alignment is a technical and tricky process, but is somewhat de-mystified in some excellent reviews. I recommend the following:
- Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nature Methods, November 2009, vol. 6 issue 11 supplement, pp. S6-S12. Somewhat technical, but a nice introduction from two experts from the European Bioinformatics Institute. You will, unfortunately, need a subscription to access it.
- Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends in Genetics, March 2008, vol. 24 issue 3, pp. 142-149. A technical, but accessible review of the assembly problem and other challenges of high throughput genome sequencing. Freely accessible online.
- The NCBI Handbook (Editors: Jo McEntyre and Jim Ostell; National Center for Biotechnology Information, Bethesda, MD, USA: 2002-present) has some good information on genome assembly, but is very technical indeed. Start with Chapter 14, available online here.
Finally, if you’re in a silly mood, you could try these articles, which have nothing at all to do with sequence assembly.
- Hook EB. Shakespeare, genetics, malformations, and the Wars of the Roses: hereditary themes in Henry VI and Richard III. Teratology, February 1987, vol. 35 issue 1, pp. 147-55. I believe this one is freely available online here.
- Berg JM. Shakespeare as a geneticist. Clinical Genetics, March 2001, vol. 59 issue 3, pp. 165-70. Unfortunately, you’ll need a subscription for this one too.
Last but not least – if you’re really interested in stretching the Shakespeare-medical science connection, try this blog post of mine, discussing a very interesting paper about anal fistula and All’s Well That Ends Well. You just can’t make this stuff up.