{"id":32,"date":"2010-12-11T17:58:37","date_gmt":"2010-12-11T17:58:37","guid":{"rendered":"http:\/\/occamstypewriter.org\/irregulars\/?p=32"},"modified":"2010-12-31T21:30:51","modified_gmt":"2010-12-31T21:30:51","slug":"genome-sequencing-shakespeare-style","status":"publish","type":"post","link":"https:\/\/occamstypewriter.org\/irregulars\/2010\/12\/11\/genome-sequencing-shakespeare-style\/","title":{"rendered":"Genome sequencing, Shakespeare style"},"content":{"rendered":"<p><a href=\"http:\/\/www.flickr.com\/photos\/ricardipus\/4510683905\/\" title=\"Sanger sequence by Ricardipus, on Flickr\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/farm5.static.flickr.com\/4055\/4510683905_ce4d76b310.jpg\" width=\"500\" height=\"333\" alt=\"Sanger sequence\" \/><\/a><br \/>\n<em>Some DNA sequence. Each column is one sample, and the four colours are those DNA \u201cbuilding blocks\u201d \u2013 A, C, G and T.<\/em><\/p>\n<p>Our \u201cgenome\u201d is the DNA in the cells of our body. It spends most of its time as an unruly-looking blob in the nucleus of the cell, but packages itself up nicely into chromosomes when cells divide. It\u2019s the \u201cgenetic code\u201d, the material of heredity that passes on traits from parents to children.<\/p>\n<p>The science of \u201cgenomics\u201d, which is what I spend much of my time thinking about, is about making sense of the three billion or so letters of the genetic code that is written in this DNA. It\u2019s helpful to think of it as text \u2013 DNA is a long, thin molecule that is made up of four different \u201cletters\u201d. Imagine a string, strung with four types of beads. Each has a single letter on it, and they\u2019re all mixed up together. These make a one-letter shorthand, based on the names of the chemical units that make up DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). 
When genome scientists talk about \u201creading the DNA sequence\u201d, this is all they mean: what is the order of those beads on the string? We use very sophisticated equipment to read it, but really, that\u2019s all it comes down to in the end.<\/p>\n<p><a href=\"http:\/\/www.flickr.com\/photos\/ricardipus\/4452215186\/\" title=\"Applied Biosystems SOLiD by Ricardipus, on Flickr\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/farm5.static.flickr.com\/4016\/4452215186_b399ec7781.jpg\" width=\"500\" height=\"375\" alt=\"Applied Biosystems SOLiD\" \/><\/a><br \/>\n<em>A piece of fancy DNA sequencing equipment.<\/em><\/p>\n<p>DNA sequence is, take my word for it, terribly boring to look at. Here\u2019s an example \u2013 in this case, a piece of a gene that is responsible for making salivary amylase, an enzyme that digests starch in your food:<\/p>\n<p>TGGTATCTGTACATACCTTTGATGTCAGTGTTTAGTACACGTGGCTTGGTCACTTCATGGCTAA<\/p>\n<p>Doesn\u2019t look like much, does it? Now, imagine three billion letters of this, arranged in forty-six enormous volumes. Those volumes are chromosomes; most people have two each of chromosomes 1-22, and two X chromosomes if they\u2019re female, an X and a Y if male. That three billion letters is roughly equivalent to 857,000 pages of text, or about 28,000 copies of a medium-sized Shakespeare play (say, <em>Romeo and Juliet<\/em>).<\/p>\n<p><a href=\"http:\/\/www.flickr.com\/photos\/ricardipus\/2715441290\/\" title=\"46, XY by Ricardipus, on Flickr\"><img loading=\"lazy\" decoding=\"async\" src=\"http:\/\/farm4.static.flickr.com\/3194\/2715441290_20c791ea62.jpg\" width=\"500\" height=\"354\" alt=\"46, XY\" \/><\/a><br \/>\n<em>Chromosomes. Mine, in fact.<\/em><\/p>\n<p>The problem of understanding the genome is that while Shakespeare is written in a language that we understand, using familiar concepts (love, jealousy, betrayal), and words that we can look up in a dictionary, the genome sequence is not. 
It\u2019s a featureless plain of those four letters. It\u2019s got a great deal of meaning embedded in it though, and much has been done to understand it. While a lot of that information came from complicated, specialized biology, some can be found by comparing one genome sequence to another \u2013 in other words, looking at variability between individual people. Just as our outward anatomy (hair and eye colour, height, the shape of your nose) varies from person to person, so does the genome sequence. So how do we find this variation?<\/p>\n<p>Returning to Shakespeare, suppose we have a modern edition of <em>Romeo and Juliet<\/em>, and we suspect that it might have some typographical errors in it. To find them, we could compare it to a \u201cgold standard\u201d \u2013 perhaps the first printed edition, or maybe better yet, one of Shakespeare\u2019s original manuscripts. By comparing the language, we could find errors that change the meaning. Of course, some of them will be obvious. Here\u2019s a very famous line from Act II, Scene 2:<\/p>\n<p><em>O Romeo, Romeo! wherefore art thou Rodeo?<\/em><\/p>\n<p>You don\u2019t need the original to compare with, or even know the play, to infer that there\u2019s probably an error in the last word. Genomics researchers can do the same thing \u2013 if you show me part of a gene\u2019s sequence, I might be able to guess that one of those A, C, G, or T changes is a problem. That comes with experience, just like reading and speaking English provides you with the experience to guess that \u201cRodeo\u201d should read \u201cRomeo\u201d.<\/p>\n<p>Reading the rest of the play would make you even more confident that it\u2019s a typo \u2013 there are no references to \u201crodeos\u201d anywhere else in its nearly 26,000 words. Genome scientists use this approach too, relying on computer programs to find things that just \u201cdon\u2019t belong\u201d. 
Rather than rodeos in Shakespeare, we look for changes in DNA that just don\u2019t occur much (like a \u201cSTOP\u201d signal in the middle of a gene). Even without knowing what that gene is supposed to look like, we might infer that such a genetic \u201ctypo\u201d would be bad.<\/p>\n<p>Other errors might be a lot tougher to spot, though. Consider this quotation, from right after the first one:<\/p>\n<p><em>Deny the father and refuse thy name;<\/em><\/p>\n<p>Without knowing the play, you\u2019d never be able to guess there\u2019s an error there \u2013 the first \u201cthe\u201d is supposed to read \u201cthy\u201d. It\u2019s just one little letter that changes the meaning a bit, but it\u2019s hard to spot because either \u201cthe\u201d or \u201cthy\u201d makes sense. To find it, you need that \u201cgold standard\u201d to compare with.<\/p>\n<p>This is essentially the same as sequencing my genome, and comparing it to yours. They\u2019re both editions of the same book, and tiny differences can have impacts that are huge (a mutation that makes me sick), modest (a change that gives me a higher risk of being sick), or inconsequential. Recent studies suggest that among the three billion or so letters of our genomes, each of us differs by something like three million single-letter typos, and another 45 million that are rearranged in big chunks (in the wrong place, the wrong order, duplicated, or completely missing). Fortunately, almost all of these don\u2019t seem to have much impact on our health.<\/p>\n<p>We can stretch this analogy even further. Our \u201cgold standard\u201d Shakespeare script is likely to have been pieced together from at least five different Quartos and Folios, which is also how the first human genome reference sequence was made. This reference, still used by most genome scientists, was assembled from sequences of DNA from nearly 750 different sources. 
It\u2019s still extremely useful, but it\u2019s only recently that complete sequences from individual humans have become available instead. And just as we use annotations in the margins to tell us what Shakespeare meant by \u201cin choler\u201d, or how one might go about hoisting a \u201cpetard\u201d, so also do genome scientists use annotations to describe different pieces of that three-billion-character book \u2013 where the genes are, for example.<\/p>\n<p>So there you go. Genomes are like Shakespeare, and variation between people is like typographical errors. Sometimes they\u2019re invisible (suppose I switched the places of the two letter \u201co\u201ds in the word \u201ctoo\u201d), sometimes they don\u2019t change the meaning much (&#8220;the&#8221; and &#8220;thy&#8221;), and sometimes they\u2019re disastrous (where <em>is<\/em> that rodeo, anyway?). Using modern genome science, we can find them, if, as Romeo says, we <em>&#8220;know the letters and the language&#8221;<\/em>.<br \/>\n<\/p>\n<p><strong>Some technical reading<\/strong><\/p>\n<ul>\n<li> <strong>Feuk L, <em>et al.<\/em> (2006). Structural variation in the human genome. <em>Nature Reviews Genetics<\/em>, vol. 7 no. 2, pp. 85-97.<\/strong> An older review, but still an interesting discussion of variation between people. Fairly technical. You&#8217;ll need a subscription to the journal.<\/li>\n<li> <strong>Khaja R, <em>et al.<\/em> (2006). Genome assembly comparison identifies structural variants in the human genome. <em>Nature Genetics<\/em>, vol. 38 no. 12, pp. 1413-1418.<\/strong> One method of comparing two human genome sequences to each other. Very technical. Article freely available <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC2674632\/?tool=pubmed\">here<\/a>.<\/li>\n<li> <strong>Levy S, <em>et al.<\/em> (2007). The diploid genome sequence of an individual human. <em>Public Library of Science Biology<\/em>, vol. 5 no. 
10, article e254.<\/strong> The first individual human genome sequence, in this case belonging to Dr. J. Craig Venter. Quite technical. Article freely available <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC1964779\/?tool=pubmed\">here<\/a>.<\/li>\n<li> <strong>Pang AW, <em>et al.<\/em> (2010). Towards a comprehensive structural variation map of an individual human genome. <em>Genome Biology<\/em>, vol. 11 no. 5, article R52.<\/strong> This is one paper that shows people&#8217;s genomes differ from each other in millions of different places. Quite technical. Article freely available <a href=\"http:\/\/www.ncbi.nlm.nih.gov\/pmc\/articles\/PMC2898065\/?tool=pubmed\">here<\/a>.<\/li>\n<li> <strong>Wheeler DA, <em>et al.<\/em> (2008). The complete genome of an individual by massively parallel DNA sequencing. <em>Nature<\/em>, vol. 452 no. 7189, pp. 872-876.<\/strong> The genome sequence of Dr. James Watson, one of the discoverers of the famous \u201cdouble helix\u201d structure of DNA. Quite technical, although the box about ethical issues of genome sequencing is an interesting and easy read. You&#8217;ll need a subscription for this one, too.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>What is a genome sequence, and how does mine differ from everyone else&#8217;s? We turn to William Shakespeare to help us find out. 
<a href=\"https:\/\/occamstypewriter.org\/irregulars\/2010\/12\/11\/genome-sequencing-shakespeare-style\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":14,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,20],"tags":[5,11,12,10],"class_list":["post-32","post","type-post","status-publish","format-standard","hentry","category-education","category-guestposts","tag-dna","tag-genomics","tag-sequencing","tag-shakespeare"],"_links":{"self":[{"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/posts\/32","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/users\/14"}],"replies":[{"embeddable":true,"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/comments?post=32"}],"version-history":[{"count":0,"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/posts\/32\/revisions"}],"wp:attachment":[{"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/media?parent=32"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/categories?post=32"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/occamstypewriter.org\/irregulars\/wp-json\/wp\/v2\/tags?post=32"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}