DNA Can Encode Data – So Scientists Uploaded a Movie Into It

16/10/2017

They encoded information within bacterial DNA, making it a living hard-disk that underwent an automatic backup every 20 minutes as the bacteria reproduced.

Neha C.V. is a master’s student in molecular and cell biology at the Jawaharlal Nehru Centre for Advanced Scientific Research, Bengaluru.

The idea of using living organisms as hard-disks may seem outlandish. Not so to some scientists who are trying to meld the disparate fields of biology and information technology.

Over the last few years, the idea of using DNA for digital data storage has caught the fancy of many scientists. The idea became mainstream in 2011, thanks to Nick Goldman and his colleagues at the European Bioinformatics Institute (EBI), Cambridge. Goldman and his group develop tools for analysis of genome data and were trying to tackle a problem they faced often: shortage of data storage space.

Though DNA-sequencing technology is a handy tool, its advent led to an explosion of data that had become a burden to store. So Goldman & co. thought of a sci-fi-esque solution to this problem: why not use DNA itself?

This thought was brought to fruition by a collaborative effort between groups led by Goldman and Ewan Birney, who works on sequence algorithms for genome analysis at EBI. They were able to encode 739 kb’s worth of information, including a part of Martin Luther King’s ‘I have a dream’ speech and some of Shakespeare’s sonnets, into a DNA sequence they’d synthesised in the lab. This feat was closely followed by George Church and his group at Harvard University, who independently demonstrated the information-storing ability of DNA. Church is a pioneer in sequencing and engineering of genomes and synthetic biology.

The DNA super-molecule contains our genetic blueprints. While magnetic hard-drives use binary inputs – 0s and 1s – to encode information, DNA usually uses quaternary inputs, i.e. inputs of four kinds. DNA is made up of smaller molecular blocks called nucleotides. Each nucleotide has a sugar (deoxyribose), a phosphate group and a nitrogenous base. While the sugar and the phosphate group are common to all nucleotides, the bases are unique to each nucleotide and form the digits of the code: adenine, guanine, cytosine and thymine (A, G, C and T).
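To make the binary-versus-quaternary comparison concrete, here is a minimal sketch of how bits could be written as bases. Since a base can take one of four values, it can carry two bits. The particular mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are assumptions made for illustration, not necessarily the scheme used in any of the studies described here.

```python
# Hypothetical two-bits-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bytes_to_dna(data: bytes) -> str:
    """Write each byte as four bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_dna_to_bytes(sequence: str) -> bytes:
    """Reverse the mapping: four bases back into one byte."""
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    dna = encode_bytes_to_dna(b"DNA")
    print(dna, decode_dna_to_bytes(dna))
```

Under this toy mapping every byte becomes exactly four bases, so a message is four times as long in bases as it is in bytes.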

Its ability to make copies of itself that are passed on to daughter cells, and to remain unchanged over time, makes DNA an ideal information storage molecule. A recent study, published in the journal Nature, took advantage of these possibilities to encode a short movie into the genome of bacteria. The researchers were led by Church, and exploited the intrinsic ability of bacteria to acquire and archive invading DNA sequences to ‘write’ a short digital movie into their genetic material.

Bacteria – similar to some cells within us – remember pathogens (usually viruses called bacteriophages) that have visited them before. While these memories are stored by special cells in our immune system, single-celled bacteria do it differently. They have specialised sentinels called the Cascade complex (made up of proteins called Cas proteins) that recognise the invaders’ genetic material based on some signatures they contain (called protospacer adjacent motifs – PAMs) and chop it up. The fragments are then incorporated within the bacteria’s genetic material at a particular site, called the ‘clustered regularly interspaced short palindromic repeats’ (CRISPR) locus, similar to the archives in a library. This site is maintained and passed on to the bacteria’s progeny. The next time the same invader visits the bacteria, it is immediately recognised and neutralised with the help of the previously acquired piece of DNA. This is called Cas interference.

Taking advantage of this system, the group at Harvard bombarded the bacteria with sets of short virus-like chemically synthesised DNA sequences called oligonucleotides. Each oligonucleotide encoded one of the five frames of Eadweard Muybridge’s galloping mare, which were immediately recognised and archived by the CRISPR system of the host. This information was later successfully retrieved by sequencing the bacteria’s genomes.

‘The Horse in Motion’ by Eadweard Muybridge. Credit: Wikimedia Commons

The information encoded within the bacterial DNA is continually propagated, making it a living hard-disk that undergoes an automatic backup every 20 minutes (since bacteria divide about once every 20 minutes). But it is obviously not as simple as it sounds. Whether a given DNA sequence enters a bacterial cell is a matter of probability, not a certainty. So across the bacterial population, different fragments will be acquired randomly. For example, if you encode each word of the phrase ‘Make hay while the sun shines’, you will have ‘Make hay’ in some cells, ‘hay while the’ in other cells and ‘the sun shines’ in still others – but almost never the entire sentence in all the cells.
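The ‘Make hay while the sun shines’ example can be played out in a small simulation. This is only a sketch: it assumes each cell picks up each word independently with the same probability, and the function names, the probability of 0.5 and the population size are all made up for illustration. Real acquisition rates are not modelled here.

```python
import random
from typing import List, Set


def simulate_population(fragments: List[str], n_cells: int,
                        p_acquire: float, seed: int = 0) -> List[Set[str]]:
    """Give each cell a random subset of the fragments (assumed independent)."""
    rng = random.Random(seed)
    return [{f for f in fragments if rng.random() < p_acquire}
            for _ in range(n_cells)]


def pooled_fragments(cells: List[Set[str]]) -> Set[str]:
    """Combine what every cell holds, as sequencing many cells would."""
    pooled: Set[str] = set()
    for cell in cells:
        pooled |= cell
    return pooled


if __name__ == "__main__":
    words = "Make hay while the sun shines".split()
    cells = simulate_population(words, n_cells=200, p_acquire=0.5)
    complete = sum(1 for cell in cells if cell == set(words))
    print("cells holding the whole sentence:", complete)
    print("words recovered from the pool:", sorted(pooled_fragments(cells)))
```

Few individual cells end up with the full sentence, yet the pooled population holds every word, which is why the information has to be read from many cells at once.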

Moreover, when you retrieve the information, you need to know how to align it to make sense of it. This problem is a lot more complicated in the case of encoding a movie because there are multiple frames and, within each frame, multiple pixels. The researchers overcame this problem by barcoding the set of pixels into the DNA sequence itself, and the five frames were introduced sequentially over five days. Since newly acquired sequences are present closer to the initiation site – the point where archiving begins – than the older ones, the order of frames could be reconstructed by aligning the sequences based on an algorithm.
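The two ordering tricks described above can be sketched in a few lines. This is a simplified illustration, not the algorithm the researchers used: the representation of a sequence as a (barcode, pixel values) pair and both function names are hypothetical. Within a frame, the barcode says where a set of pixels belongs; across frames, distance from the initiation site says which was acquired first.

```python
from typing import List, Tuple

# Hypothetical model of one archived sequence: (pixel-set barcode, pixel values).
Spacer = Tuple[int, List[int]]


def reassemble_frame(spacers: List[Spacer]) -> List[int]:
    """Sort the pixel sets by barcode and join them into one frame."""
    frame: List[int] = []
    for _, pixels in sorted(spacers, key=lambda spacer: spacer[0]):
        frame.extend(pixels)
    return frame


def chronological_order(from_initiation_site: List[str]) -> List[str]:
    """Newest entries sit nearest the initiation site, so reverse to get oldest first."""
    return list(reversed(from_initiation_site))


if __name__ == "__main__":
    reads = [(2, [5, 6]), (0, [1, 2]), (1, [3, 4])]  # retrieved in arbitrary order
    print(reassemble_frame(reads))
    print(chronological_order(["day 3", "day 2", "day 1"]))
```

In practice no single cell carries every sequence, so a real reconstruction would have to combine ordering evidence from many cells; this sketch only shows the two ordering rules themselves.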

The authors of this study have also managed to identify the critical parameters within certain motifs that dictate which DNA sequences will be incorporated more often than others. This finding also opens up questions of an evolutionary nature: whether there exists an evolutionary arms-race between the invaders and the hosts, and how this selection pressure shapes the recognition sequences of the viral genome.

Jeff Nivala, a post-doctoral fellow at Harvard University and one of the authors of this study, said, “It has previously been shown that bacteriophages will mutate the protospacer adjacent motifs or the protospacer region to escape recognition by the Cas interference proteins (e.g., Cas9, Cascade-Cas3, etc.). However, I haven’t seen any evidence to suggest that bacteriophage genomes have evolved to exclude [other motifs]. I think it is much more likely that bacteriophage genomes mutate the PAM sites, as the PAM affects acquisition efficiency much more strongly than [another] motif does.”

This is an example of how a study within the realm of basic biology can have applications in seemingly unrelated fields. Nivala agrees: “I like to think of it as a positive feedback loop: by building technology out of biological parts, you end up with a better understanding of the biology itself, and the more you understand the biology, the better the chances you can harness it in technology.”

With this advancement in utilising biomolecules as devices for information storage, there have been reports of the technique being used to encode malware into bacteria and hack the “decoding” computers. Nivala said, “Though I think we are a long way from DNA-based malware taking over conventional computer systems, scientists and engineers working with this type of technology need to recognise that something like bio-based malware may be possible one day. All the good guys need to be at the forefront of exploring the technological possibilities and coming up with ways of preventing these types of problems before they happen in the future.”