Genome Assembly – a primer for the Shakespeare fan | The Occam's Typewriter Irregulars

Once more unto the breach, dear friends, once more;
– The Life of King Henry the Fifth, Act III, Scene 1

When last we met, I tried to use Shakespeare’s Romeo and Juliet to help explain genome sequencing. In particular, we looked at how to find differences by comparing the genomes of individual people, either to each other or to a “gold standard” reference. It might be worth giving that article a read, if you haven’t already. It’s full of useful information, like why reading a human genome is like diving into the most boring 46-volume, 857,000-page epic you can imagine.

Although I didn’t use the term there, the process of finding a short DNA sequence in that magnum opus is usually referred to as “mapping”. If you find out where it came from, you’ve “mapped” it to its proper genomic location. By analogy, we can take an isolated phrase (like the famous “Romeo, Romeo, wherefore art thou…” speech) and locate it by comparing with the whole text of the play. Doing this, we’d find it near the beginning of Act II, Scene 2.
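For readers who like to see things in code, here is a minimal Python sketch of mapping, assuming the simplest possible case: an exact match against a short stand-in for the reference. The function name `map_fragment` and the four-line excerpt are illustrative assumptions only. Real read mappers tolerate mismatches and use indexed data structures rather than a linear scan.

```python
def map_fragment(fragment, reference):
    """Return every 0-based position where fragment occurs exactly in reference."""
    positions = []
    start = reference.find(fragment)
    while start != -1:
        positions.append(start)
        # Continue searching one character further along, so overlapping hits are found too.
        start = reference.find(fragment, start + 1)
    return positions


# A tiny stand-in for the "reference": a few lines from Act II, Scene 2.
reference = (
    "O Romeo, Romeo! wherefore art thou Romeo? "
    "Deny thy father and refuse thy name; "
    "Or, if thou wilt not, be but sworn my love, "
    "And I'll no longer be a Capulet."
)

print(map_fragment("wherefore art thou", reference))  # [16]: a unique placement
print(map_fragment("Romeo", reference))               # [2, 9, 35]: three candidate placements
```

A phrase that maps to exactly one position is "uniquely mapped". A short word like "Romeo" maps to several, which is a preview of the repeat problem discussed further on.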

That works well for organisms that have had their genomes sequenced, like humans, mice, cattle, or bacteria – hundreds of species. There’s a nice, although I think incomplete, list at the GNN – Quick Guide to Sequenced Genomes. But what about the millions of species with no available genome sequence?

Imagine, if you will, that you are a researcher who has just discovered a brand-new species of beetle (there are lots of species of beetles, so this is actually pretty likely). You might want to see what’s in its DNA that makes it different from other beetles, or find some genetic evidence to help work out how it’s related to other species.

This isn’t an unknown species of beetle. In fact, it’s not a beetle at all.

Most DNA sequencing specialists would take the approach of extracting the poor beetle’s DNA, fragmenting it up into manageable pieces, and sequencing these small fragments en masse. There are a number of good technologies for doing this, and they all give you the same result: millions and millions of short “words”, made up of the letters A, C, G and T, the four chemical bases that make up DNA. These words are randomly derived from all over the hapless insect’s genome. Putting them together to make the beetle version of that 857,000-page book (beetle genomes are generally a bit smaller – so maybe 43,000 pages) is like doing a huge jigsaw puzzle. One with really tiny pieces, all of more or less the same colour – and without the picture on the box lid to help you.
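As a rough illustration of what "random fragments from all over the genome" means, the Python sketch below draws short reads from random positions in a made-up 200-letter sequence. Everything here is an assumption for demonstration: the name `shotgun_reads`, the toy genome, the read length of 25 and the fixed random seeds. Real reads also contain errors and vary in length, which this ignores.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample n_reads substrings of length read_len from random start positions."""
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


# A made-up "genome" of 200 random bases (not real beetle DNA).
genome_rng = random.Random(42)
toy_genome = "".join(genome_rng.choice("ACGT") for _ in range(200))

reads = shotgun_reads(toy_genome, n_reads=40, read_len=25)

# Average depth: total letters sequenced divided by genome length.
coverage = len(reads) * 25 / len(toy_genome)
print(len(reads), "reads, average coverage", coverage)
```

With 40 reads of 25 letters over a 200-letter genome, each position is covered about five times on average, which is what gives neighbouring reads a chance to overlap.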

Some high-throughput sequencing reactions. Each dot is one, growing up out of the screen toward you.

This process, called “assembly”, relies on finding overlaps between sequences. Remember, these are all random fragments, starting and ending at different places, so if we have enough DNA (and thus enough copies of the genome) to begin with, by chance some of these fragments will have recognizable places where they overlap. And this is where Julius Caesar comes in.

A Roman wall that has nothing to do with Julius Caesar.

Imagine, if you will, that you’ve never read the play (I had to in high school, and have seen it performed once; your experience may differ). Think of this set of text fragments as being like a handful of short DNA sequences:

ds, Romans, count
ns, countrymen, le
Friends, Rom
end me your ears;
trymen, lend me

If you didn’t know the play, you might have trouble making sense of this – until you realize that some of them overlap. When you shuffle them around and line them up, you get something like this:


Friends, Rom
     ds, Romans, count
             ns, countrymen, le
                     trymen, lend me
                              end me your ears;

Which gives you the consensus:

Friends, Romans, countrymen, lend me your ears;

If you’ve read the play, you know this as the beginning of Antony’s famous speech from Act III, Scene 2. If not, you might still recognize it as “real” English – it has syntax and meaning. And this is just how genome assembly works. Sophisticated computer programs use pattern recognition to look for overlaps, and assemble clusters of sequences, commonly referred to as “contigs” (for “contiguous assemblies”, I suppose). Smaller contigs are joined together to make larger ones, usually using other information about how they might fit together. These are generally referred to as “scaffolds”. Ultimately, all of these are put together into the whole genome sequence.
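The overlap-and-merge idea can be sketched in a few lines of Python using the five fragments above. This is a deliberately naive "greedy" approach (always merge the pair with the longest overlap first), not how any particular assembler is implemented. The function names and the minimum overlap of three characters are assumptions chosen for the example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b (0 if under min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the two fragments with the longest overlap until none overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to join: the remaining pieces are separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


fragments = [
    "ds, Romans, count",
    "ns, countrymen, le",
    "Friends, Rom",
    "end me your ears;",
    "trymen, lend me",
]

print(greedy_assemble(fragments))
# ['Friends, Romans, countrymen, lend me your ears;']
```

If two groups of fragments shared no overlap, the function would return more than one string, which is the toy equivalent of ending up with several contigs rather than one finished sequence.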

However, it’s not necessarily quite that simple. All but the simplest genomes (viruses and bacteria, for example) are riddled with pieces of DNA sequence that all look like each other, the dreaded “repetitive elements”. In our play analogy, you can think of them as common words that appear multiple times. These can play havoc with the assembly process. Consider this, one of my favourite passages of Shakespearean dialogue, from Act IV, Scene 2:

CASSIUS: Stand, ho!
BRUTUS: Stand, ho! Speak the word along.
First Soldier: Stand!
Second Soldier: Stand!
Third Soldier: Stand!

Imagine you were trying to re-assemble the play’s script from many tiny pieces, and you came across a fragment, “Stand,”. Where does it go? Did Cassius speak it, or does it belong to Brutus?

Even worse, what if there was an ambiguity in the last character? Perhaps it was blurred due to water damage, so that you only knew it was “Stand”, followed by something. Is that something a comma, an exclamation mark (both of which would fit in the passage above), or something else (a letter “s”, for example)? This kind of ambiguous character happens all the time in DNA sequencing, and the result in our example above is that now our piece of text could go in one of five places, spoken by no fewer than five different characters.

We can fix this, at least some of the time, by asking for longer pieces of sequence (or, getting back to the jigsaw analogy, bigger pieces). If we have a longer text fragment like “Stand, ho! Speak the”, then we know exactly where it’s supposed to go, even though each of the four words individually occurs more than once in the play, and even if there are a few ambiguities in it. This is how we get around really short repeats in DNA sequence – by using sequence reads long enough to either encompass them entirely, or to make a unique sequence. But for long words or phrases that appear more than once, sometimes we just can’t find the right location, no matter what we do.
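To make the repeat problem concrete, here is a small Python sketch that counts how many places a fragment could go in the "Stand, ho!" passage. It treats a "?" in the fragment as an unreadable character that matches anything. The function name `placements` and the "?" convention are assumptions made for this example, and it only works because the passage itself contains no question marks.

```python
import re

passage = (
    "CASSIUS: Stand, ho!\n"
    "BRUTUS: Stand, ho! Speak the word along.\n"
    "First Soldier: Stand!\n"
    "Second Soldier: Stand!\n"
    "Third Soldier: Stand!\n"
)


def placements(fragment, text, wildcard="?"):
    """Start positions where fragment fits in text; wildcard stands for any one character."""
    pattern = "".join("." if ch == wildcard else re.escape(ch) for ch in fragment)
    # The lookahead lets overlapping placements be counted as well.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", text, flags=re.DOTALL)]


print(len(placements("Stand,", passage)))                # 2: Cassius or Brutus
print(len(placements("Stand?", passage)))                # 5: any of the five speakers
print(len(placements("Stand, ho! Speak the", passage)))  # 1: only Brutus
print(len(placements("Stand, h?! Spe?k the", passage)))  # 1: still unique despite two blurred letters
```

The longer fragment stays uniquely placeable even with a couple of ambiguous characters, which is the argument for longer reads in miniature.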

Except – there is one more commonly used trick, called “paired-end”, or “mate pair” sequencing (these are slightly different things, but for our purpose it doesn’t really matter). This takes advantage of deriving DNA sequences from both ends of a much larger fragment, and using the location information from one to help position the other. Here’s how it works.

The phrase “Caesar’s house” is spoken twice, once by a servant in Act III, Scene 2, and again by Antony in Act IV, Scene 1. Now, imagine we have a fragment of the play of, say, half a page or so in length, and we’ve found “Caesar’s house” at one end of it. If we look at the other end, we might find the rather unusual phrase, “slanderous loads”. The only place this occurs is in Act IV, almost exactly half a page away from one of the instances of “Caesar’s house”. Now, we can be reasonably certain that we’ve properly “mapped” the first phrase onto the play, and that it’s the one Antony speaks. The key here is that we don’t need to know all the text in between – in other words, we don’t need the whole half-page of text to help in mapping the ambiguous end. We just need to know that the two ends are about a half a page apart. This method of using paired sequences from the ends of large DNA fragments to map non-unique sequences is very useful in assembling unknown genomes.
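The paired-end trick can also be sketched in Python. Since reproducing half the play here would be unwieldy, the "play" below is a toy stand-in made of filler dots with the two phrases dropped in at made-up positions. The function names, the expected gap of 1,200 characters and the tolerance of 200 are all illustrative assumptions, not measurements from the real text.

```python
def find_all(fragment, text):
    """Every 0-based position where fragment occurs exactly in text."""
    positions, start = [], text.find(fragment)
    while start != -1:
        positions.append(start)
        start = text.find(fragment, start + 1)
    return positions


def place_with_mate(ambiguous, mate, text, expected_gap, tolerance):
    """Keep placements of `ambiguous` that have a `mate` hit about expected_gap characters downstream."""
    mate_hits = find_all(mate, text)
    return [
        pos
        for pos in find_all(ambiguous, text)
        if any(abs((m - pos) - expected_gap) <= tolerance for m in mate_hits)
    ]


# Toy stand-in for the play: filler dots, NOT Shakespeare's text.
toy_play = (
    "." * 500 + "Caesar's house"      # first occurrence (the servant's)
    + "." * 3000 + "Caesar's house"   # second occurrence (Antony's)
    + "." * 1200 + "slanderous loads"
    + "." * 500
)

print(find_all("Caesar's house", toy_play))
# [500, 3514]: two candidate placements on its own
print(place_with_mate("Caesar's house", "slanderous loads", toy_play,
                      expected_gap=1200, tolerance=200))
# [3514]: only the occurrence about "half a page" before the unique mate survives
```

Note that nothing between the two ends is ever read. The only inputs are the two end phrases and a rough idea of how far apart they should be.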

This wasn’t Caesar’s house. He’d already been dead for a while when it was built.

Genome assembly, of course, is much more complex and fraught with problems than I’ve led you to believe, but that’s the basic idea. At the end of the article are a few more ~~frightening~~ technical references, if you’d like to dig deeper. In the meantime, I can’t find a suitable quotation to finish off with, so I’ll just leave you with Octavius’ closing words:

So call the field to rest; and let’s away,
To part the glories of this happy day.

Further Reading

I have relied on MIT’s excellent archive of the complete works of William Shakespeare for the quotations here and in the previous post.

If you’re really interested in beetle genomes, the description of the Red Flour Beetle one is a good place to start. Of course, if you are really interested in beetle genomes, you’ve probably read it already. You’ll need a subscription to the journal, though.

Tribolium Genome Sequencing Consortium, et al. The genome of the model beetle and pest Tribolium castaneum. Nature, April 2008, vol. 452 issue 7190, pp. 949-55.

Alignment is a technical and tricky process, but is somewhat de-mystified in some excellent reviews. I recommend the following:

Flicek P, Birney E. Sense from sequence reads: methods for alignment and assembly. Nature Methods, November 2009, vol. 6 issue 11 supplement, pp. S6-S12. Somewhat technical, but a nice introduction from two experts from the European Bioinformatics Institute. You will, unfortunately, need a subscription to access it.
Pop M, Salzberg SL. Bioinformatics challenges of new sequencing technology. Trends in Genetics, March 2008, vol. 24 issue 3, pp. 142-149. A technical, but accessible review of the assembly problem and other challenges of high throughput genome sequencing. Freely accessible online.
The NCBI Handbook (Editors: Jo McEntyre and Jim Ostell; National Center for Biotechnology Information, Bethesda, MD, USA: 2002-present) has some good information on genome assembly, but is very technical indeed. Start with Chapter 14, available online here.

Finally, if you’re in a silly mood, you could try these articles, which have nothing at all to do with sequence assembly.

Hook EB. Shakespeare, genetics, malformations, and the Wars of the Roses: hereditary themes in Henry VI and Richard III. Teratology, Feb. 1987, vol. 35 issue 1, 147-55. I believe this one is freely available online here.
Berg JM. Shakespeare as a geneticist. Clinical Genetics, March 2001, vol. 59 issue 3, pp. 165-70. Unfortunately, you’ll need a subscription for this one too.

Last but not least – if you’re really interested in stretching the Shakespeare-medical science connection, try this blog post of mine, discussing a very interesting paper about anal fistula and All’s Well That Ends Well. You just can’t make this stuff up.

25 Responses to Genome Assembly – a primer for the Shakespeare fan

Cath@VWXYNot? says:

January 4, 2011 at 8:14 pm

BRAVO! Excellent post! I especially love the photo captions, and the sneaky use of the word “primer”.
rpg says:

January 4, 2011 at 9:43 pm

Awesome stuff, Richard.
Stephen says:

January 4, 2011 at 9:45 pm

Another gem Richard – only sorry it took me so long to get around to reading it!

However, I still have to question your judgement if that passage is your favourite!

Oh, I see. Irony?
ricardipus says:

January 4, 2011 at 10:18 pm

Thanks, all – I confess I was a bit disappointed that this seemed to have sunk without trace, but I guess posting it right before the new year was probably not the most strategic move, readership-wise.

Cath – you know, I really hadn’t meant to use “primer” so cleverly. Totally didn’t occur to me, so thanks for pointing it out. Note to non-molecular-biologist readers: a “primer” is a short piece of DNA, usually made synthetically, that acts as a starting point for a DNA sequencing reaction.

Richard – ta. That preformatted courier is ugly though isn’t it? It looked much better at one relative point size larger, but I was worried about it spilling into the sidebar, or wrapping.

Stephen – a bit of irony, yes, although I really do like how Shakespeare can effortlessly switch between long, flowery monologues and quick, pithy dialogue like this. I also like the whole “Do you bite your thumb at me, sir? / I bite my thumb…” etc. bit in the opening scene of Romeo and Juliet.
rpg says:

January 4, 2011 at 10:40 pm

Courier is ugly. That’s a feature.

(Although I’m happy to take suggestions for making it better.)
Cath@VWXYNot? says:

January 4, 2011 at 10:44 pm

Courier’s ugly, but I’ve never found a better [what’s that word that describes fonts where you can align different letters properly?] one. It’s in all of my papers that needed to show sequence alignments – just short ones, luckily!
ricardipus says:

January 5, 2011 at 12:48 am

I think there used to be a Mac alternative (“even-spaced font”? I think that’s the term) that was better looking, but I don’t remember what it’s called. My main beef is that on my screen, anyway, it looks really pixelated when I used the <pre> tags for formatting. But it would be churlish to complain too much I think.
Cath@VWXYNot? says:

January 5, 2011 at 1:13 am

According to Wikipedia, “Courier is a monospaced slab serif typeface”. But I think there’s another word other than monospaced.

I must be getting old. Either that or I’ve killed my brain with too much alcohol and hockey over Christmas.

I bet Eva knows.
- ricardipus says:
  
  January 5, 2011 at 2:10 am
  
  Also according to Wikipedia, “fixed-pitch” or “non-proportional”. And Monaco is the one I was trying to remember, although there are some other reasonably attractive options:
  
  Click here for Wikipedia entry.
- Eva says:
  
  January 5, 2011 at 5:56 pm
  
  Huh, thought I commented this morning.
  
  “Fixed-width font” is what I’d call Courier and the like.
Eva says:

January 5, 2011 at 5:57 pm

(Oh, moderation, I see. Have I not commented on the OTIrrs before?)
ricardipus says:

January 5, 2011 at 7:41 pm

Hm, puzzling – I’m not seeing anything in the moderation or spam queues, Eva.
Ken says:

January 5, 2011 at 8:59 pm

Excellent post! Will there be a third?
ricardipus says:

January 6, 2011 at 2:47 pm

Ken, thanks – and yes, I am planning an exome sequencing post, this time using Hamlet. I just need to find a character that speaks roughly 1.5% of the words in the play. 😉
- chall says:
  
  January 9, 2011 at 12:16 am
  
  1.5% of the total text? There are so much monologue with odd words in there so maybe Rosencrantz or Guildenstern might work…. 😉 After all, they aren’t really the big speakers – to say the least….
  - ricardipus says:
    
    January 9, 2011 at 12:44 am
    
    Hm, good idea, I’ll see about those two. Ideally, to make the analogy work best, it would be someone who speaks very little, but all through the play. I have no idea who that might be. I need to find a summary of number of words spoken by each character… which might not exist, of course… but then again, it is the internet I’m searching on.
chall says:

January 9, 2011 at 12:14 am

Wow, this is awesome Ricardipus. Lovely analogies and explanation. (was away for the holidays and didn’t get internet until back.)

I wondered briefly earlier why you’d choose Julius Caesar but I understand now (brought back anymore fun school memories? 😉 )

As a former contig-assembler and trying to get sequences together (especially without any assembled genome to compare it to) I actually like courier – as previously stated; the fixed-width for all letters makes it easy to compare with the eyes…
- ricardipus says:
  
  January 9, 2011 at 12:46 am
  
  Thank you chall. There is much about you I don’t know, clearly… 😉
  
  I appreciate the uses of courier (and used it for alignment figures in my thesis and whatnot), I just think it’s a really ugly font.
  
  I did have to memorize the “Friends, Romans…” speech and re-write it from memory, with correct punctuation and all. At least, I was supposed to be able to.
  - chall says:
    
    January 9, 2011 at 1:35 am
    
    haha, I had to memorize the ending monologue by Puck in “Midsummer night’s dream” (“If we shadows have offended….” ) and the famous (?) Henry V St Crispin’s day (“We few….”)
  - Frank says:
    
    January 9, 2011 at 8:29 am
    
    Do you mean “Friends, New Romans…” or “Friends, Couriers..”?
    - ricardipus says:
      
      January 12, 2011 at 11:07 pm
      
      Frank – argh. Well done, but argh.