Our “genome” is the DNA in the cells of our body. It spends most of its time as an unruly-looking blob in the nucleus of the cell, but packages itself up nicely into chromosomes when cells divide. It’s the “genetic code”, the material of heredity that passes on traits from parents to children.
The science of “genomics”, which is what I spend much of my time thinking about, is about making sense of the three billion or so letters of the genetic code that is written in this DNA. It’s helpful to think of it as text – DNA is a long, thin molecule that is made up of four different “letters”. Imagine a string, strung with four types of beads. Each has a single letter on it, and they’re all mixed up together. These make a one-letter shorthand, based on the names of the chemical units that make up DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). When genome scientists talk about “reading the DNA sequence”, this is all they mean: what is the order of those beads on the string? We use very sophisticated equipment to read it, but really, that’s all it comes down to in the end.
DNA sequence is, take my word for it, terribly boring to look at. Here’s an example – in this case, a piece of a gene that is responsible for making salivary amylase, an enzyme that digests sugars in your food:
Doesn’t look like much, does it? Now, imagine three billion letters of this, arranged in forty-six enormous volumes. Those volumes are chromosomes; most people have one each of chromosomes 1-22, and two X chromosomes if they’re female, an X and a Y if male. That three billion letters is roughly equivalent to 857,000 pages of text, or about 28,000 copies of a medium-sized Shakespeare play (say, Romeo and Juliet).
The problem of understanding the genome is that while Shakespeare is written in a language that we understand, using familiar concepts (love, jealousy, betrayal), and words that we can look up in a dictionary, the genome sequence is not. It’s a featureless plain of those four letters. It’s got a great deal of meaning embedded in it though, and much has been done to understand it. While a lot of that information came from complicated, specialized biology, some can be found by comparing one genome sequence to another – in other words, looking at variability between individual people. Just as our outward anatomy (hair and eye colour, height, the shape of your nose) varies from person to person, so does the genome sequence. So how do we find this variation?
Returning to Shakespeare, suppose we have a modern edition of Romeo and Juliet, and we suspect that it might have some typographical errors in it. To find them, we could compare it to a “gold standard” – perhaps the first printed edition, or maybe better yet, one of Shakespeare’s original manuscripts. By comparing the language, we could find errors that change the meaning. Of course, some of them will be obvious. Here’s a very famous line from Act II, Scene 2:
O Romeo, Romeo! wherefore art thou Rodeo?
You don’t need the original to compare with, or even know the play, to infer that there’s probably an error in the last word. Genomics researchers can do the same thing – if you show me part of a gene’s sequence, I might be able to guess that one of those A, C, G, or T changes is a problem. That comes with experience, just like reading and speaking English provides you with the experience to guess that “Rodeo” should read “Romeo”.
Reading the rest of the play would make you even more confident that it’s a typo – there are no references to “rodeos” anywhere else in its nearly 26,000 words. Genome scientists use this approach too, relying on computer programs to find things that just “don’t belong”. Rather than rodeos in Shakespeare, we look for changes in DNA that just don’t occur much (like a “STOP” signal in the middle of a gene). Even without knowing what that gene is supposed to look like, we might infer that such a genetic “typo” would be bad.
Other errors might be a lot tougher to spot, though. Consider this quotation, from right after the first one:
Deny the father and refuse thy name;
Without knowing the play, you’d never be able to guess there’s an error there – the first “the” is supposed to read “thy”. It’s just one little letter that changes the meaning a bit, but it’s hard to spot because either “the” or “thy” makes sense. To find it, you need that “gold standard” to compare with.
This is essentially the same as sequencing my genome, and comparing it to yours. They’re both editions of the same book, and tiny differences can have impacts that are huge (a mutation that makes me sick), modest (a change that gives me a higher risk of being sick), or inconsequential. Recent studies suggest that among the three billion or so letters of our genomes, each of us differs by something like three million single-letter typos, and another 45 million that are rearranged in big chunks (in the wrong place, the wrong order, duplicated, or completely missing). Fortunately, almost all of these don’t seem to have much impact on our health.
We can stretch this analogy even further. Our “gold standard” Shakespeare script is likely to have been pieced together from at least five different Quartos and Folios, which is also how the first human genome reference sequence was made. This reference, still used by most genome scientists, was assembled from sequences of DNA from nearly 750 different sources. It’s still extremely useful, but it’s only recently that complete sequences from individual humans have become available instead. And just as we use annotations in the margins to tell us what Shakespeare meant by “in choler”, or how one might go about hoisting a “petard”, so also do genome scientists use annotations to describe different pieces of that three billion character book – where the genes are, for example.
So there you go. Genomes are like Shakespeare, and variation between people is like typographical errors. Sometimes they’re invisible (suppose I switched the places of the two letter “o”s in the word “too”), sometimes they don’t change the meaning much (“the” and “thy”), and sometimes they’re disastrous (where is that rodeo, anyway?). Using modern genome science, we can find them, if, as Romeo says, we “know the letters and the language”.
Some technical reading
- Feuk L, et al. (2006). Structural variation in the human genome. Nature Reviews Genetics, vol. 7 no. 2, pp. 85-97. An older review, but still an interesting discussion of variation between people. Fairly technical. You’ll need a subscription to the journal.
- Khaja R, et al. (2006). Genome assembly comparison identifies structural variants in the human genome. Nature Genetics, vol. 38 no. 12, pp. 1413-1418. One method of comparing two human genome sequences to each other. Very technical. Article freely available here.
- Levy S, et al. (2007). The diploid genome sequence of an individual human. Public Library o Science Biology, vol. 5 no. 10, article e254. The first individual human genome sequence, in this case belonging to Dr. J. Craig Venter. Quite technical. Article freely available here.
- Pang AW, et al. (2010). Towards a comprehensive structural variation map of an individual human genome. Genome Biology, vol. 11 no. 5, article R52. This is one paper that shows peoples’ genomes differ from each other in millions of different places. Quite technical. Article freely available here.
- Wheeler DA, et al. (2008). The complete genome of an individual by massively parallel DNA sequencing. Nature, vol. 452 no. 7189, pp. 872-876. The genome sequence of Dr. James Watson, one of the discoverers of the famous “double helix” structure of DNA. Quite technical, although the box about ethical issues of genome sequencing is an interesting and easy read. You’ll need a subscription for this one, too.