Wednesday, October 7, 2020

Digitizing the genome


by Cam Lamoureux, UC San Diego bioengineering PhD candidate 

The genome has historically been known as life’s instruction manual. Indeed, the genome sequence of any organism contains all of the information needed to specify its form and function, from the simplest single-celled bacterium to complex organisms such as humans. But with rapidly developing sequencing technology, the genome is taking the stage as a new type of hard drive, nature’s way of storing information.

Understanding exactly how the genome represents an organism’s information remains a challenge for scientists. Any given DNA base (A, T, C or G) in the genome sequence can be involved in multiple different functions. As part of a gene, for example, a DNA base codes for a particular building block, known as an amino acid, of the protein that the gene specifies. That amino acid, in turn, may be part of a particular shape in the final protein. The DNA base may also be part of a sequence on the opposite side of the DNA double helix that is involved in controlling another gene’s activity. With so many different functions, information encoded by the genome sequence is convoluted and overlapping, yet it is critical to understanding an organism’s behavior.

Our work in bioengineering professor Bernhard Palsson’s Systems Biology Research Group at UC San Diego addresses this challenge. We introduce a completely new way of representing this information. For every DNA base, we can ask a simple yes/no question about every type of information the sequence can encode: does this DNA base encode that information? Borrowing from computer science, we realized that the answer to this question can be thought of as a “bit,” a binary digit. Framed this way, we can scan across the entire genome of any organism, ask this question at each base, and tabulate the answer as 1 for “yes” and 0 for “no.”

With this approach, we can construct a clean, quantitative record of the bits of information that an entire genome encodes. We call this method of genome annotation the “Bitome.”
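
The yes/no tabulation can be sketched in a few lines of Python. This is a minimal illustration, not the code from the publication: the function name `build_bitome`, the feature names, and the toy coordinates are all hypothetical, and features are assumed to be given as 0-based, half-open intervals.

```python
def build_bitome(genome_length, features):
    """Return {feature_name: [0 or 1 per base]} for a genome of the given length.

    `features` maps a feature name to a list of (start, end) intervals,
    0-based and half-open. Names and coordinates are illustrative only.
    """
    bitome = {}
    for name, intervals in features.items():
        row = [0] * genome_length
        for start, end in intervals:
            # Clip each interval to the genome, then mark every covered base.
            for pos in range(max(start, 0), min(end, genome_length)):
                row[pos] = 1  # "yes": this base encodes this feature
        bitome[name] = row
    return bitome


# Toy 12-base "genome" with two overlapping hypothetical features.
toy_features = {
    "gene": [(2, 8)],
    "regulatory_site": [(6, 10)],
}
toy_bitome = build_bitome(12, toy_features)
for name, row in toy_bitome.items():
    print(name, "".join(map(str, row)))
# gene 001111110000
# regulatory_site 000000111100
```

Each feature becomes one row of bits, so a base that serves two functions (positions 6 and 7 here) simply carries a 1 in two rows.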


We envision that the Bitome will serve as a key foundational tool for genome engineering, with applications in the sustainable production of industrial and medical compounds. For example, bioprocess engineers who reprogram bacterial genomes to sustainably produce chemical compounds can use our method to quickly assess which parts of an organism’s genome sequence are important for their application, and which are less important. They can make predictions about how proposed changes to the genome sequence will affect the organism.

While the Bitome’s capability mirrors that of traditional genome browsers, our approach offers more utility and flexibility: because we have digitized genome information, we can perform computations directly on those bits of information.

As a test case, we studied the E. coli genome and showed that DNA bases that contain fewer bits of information are more likely to be mutated during adaptive evolution. Because this observation is based on information that can be encoded by any genome sequence, not just that of E. coli, it could be used to predict genes that are more likely to mutate in cancerous tissues, for example.
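
As a rough sketch of the kind of computation this allows, the bits at each base can be summed across features and then compared between mutated and unmutated positions. Everything below is invented for illustration: the helper names `bits_per_base` and `mean_bits`, the feature rows, and the mutation positions are hypothetical and are not data or results from the study.

```python
def bits_per_base(bitome_rows):
    """Sum the feature rows column-wise: bits of information at each base."""
    return [sum(column) for column in zip(*bitome_rows)]


def mean_bits(bit_counts, positions):
    """Average bit count over the given positions (None if there are none)."""
    positions = list(positions)
    if not positions:
        return None
    return sum(bit_counts[p] for p in positions) / len(positions)


# Toy 8-base example: three hypothetical feature rows, invented for illustration.
rows = [
    [0, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
]
counts = bits_per_base(rows)  # [0, 1, 1, 2, 3, 2, 1, 0]

mutated = {0, 2, 7}  # made-up mutation positions, not real data
unmutated = set(range(len(counts))) - mutated

print(mean_bits(counts, mutated))
print(mean_bits(counts, unmutated))
```

In this toy input the mutated positions were deliberately placed at low-information bases, so the first mean is smaller than the second. A real analysis would need actual annotations, observed mutations, and a statistical test.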

The Bitome’s digitized representation facilitates prediction with machine learning. In part of our study, we applied machine learning to pinpoint the use of a particular stop codon as a predictor of mutability. This result is significant because it provides a deeper understanding of how genes mutate during adaptive evolution, which is itself a key tool for genome engineering. We also used machine learning to predict gene essentiality directly from the genome, another key capability for engineering genomes.

We are excited by the potential future applications of the Bitome as a way of analyzing genome sequences. This concept is inherently extensible to any organism’s genome, and we expect it to prove useful both for deeply understanding the information encoded in a genome and for predicting behavior based on that information. With this work, we hope to further bridge the gap between genome sequence information and the complex, critical functions that it encodes.

Publication: Lamoureux, C. et al. (2020) The Bitome: digitized genomic features reveal fundamental genome organization. Nucleic Acids Res. https://doi.org/10.1093/nar/gkaa774
