Picture this: You are a researcher in the Molecular Information Systems Lab housed at the Paul G. Allen School. You and your labmates have developed a system for storing and retrieving digital images in synthetic DNA to demonstrate how this extremely dense and durable medium might be used to preserve the world’s growing trove of data. And you have a particular fondness for cats.
If you wanted to sift through photos of frisky felines — and really, what better way to spend an afternoon — how would you pick out the relevant files floating around in your test-tube database without having to sequence the entire pool?
In a paper published recently in the journal Nature Communications, a team of MISL scientists presented the first technique for performing content-based similarity search among digital image files stored in DNA molecules. The approach is akin to that of a modern search engine, albeit in a much smaller form factor than your average server farm and with the potential to be much more energy efficient.
”Content-based search enables us to type a word or phrase into a hundred-page document and be taken to the exact page it appears, or upload a photo of a daisy and get flower images in return,” said co-author Yuan-Jyue Chen, senior researcher at Microsoft and an affiliate professor at the Allen School. “We don’t know the specific page numbers or files we’re looking for, so that computation saves us the trouble of reading the entire document or scrolling through every photo on the internet. Our team took that same idea and applied it to data stored in molecular form.”
Files, whether stored in digital or molecular form, make use of a process known in database parlance as key-based retrieval. In an electronic database, it is typically a file name; in a molecular database, it is a unique sequence, reminiscent of a barcode, that is encoded in the snippets of DNA associated with a particular file. Items with this barcode can be amplified via polymerase chain reaction (PCR) to reassemble a file in its entirety, since a single digital file might be split among hundreds — possibly even thousands — of DNA oligonucleotides, depending on its total size. Generally speaking, key-based retrieval works great when you know the contents of the files and can pick out which ones you want to retrieve; if you don’t and your data is stored as the As, Ts, Cs and Gs of DNA instead of 0s and 1s, the entire database would have to be sequenced in order to perform a content-based search.
To move beyond the limitations of key-based retrieval, the researchers leveraged DNA’s natural hybridization behavior along with machine learning techniques to enable similarity search to be performed on the stored data. In a conventional digital database, similarity search relies on a set of feature vectors that are stored separately from the original data. When a search is executed, its results point to the location of each data file associated with a particular feature vector. For the molecular version, the MISL team developed an image-to-sequence encoding scheme that employs a convolutional neural network to designate image feature vectors as “similar” or “not similar” and then maps them to DNA sequences that will predictably hybridize — or bind — with the reverse complement of a query feature vector processed by the same neural network during execution of a search. The technique can be applied to new images not seen during training, and the entire process is easily extended to other types of data such as videos and text.
The researchers created an experimental database by running a collection of 1.6 million images through their encoder, which converted them to DNA sequences incorporating their feature vectors, and tacked on a unique barcode identifier for each file. They then performed a similarity search for three photos — including one of a tuxedo cat named Janelle — using the reverse complement of each query image’s encoded feature sequence against a sample of the database. After filtering out the hybridized target/query pairs for high-throughput sequencing, they found the most frequently sequenced oligos did, indeed, corresponded to images in the database that were visually similar to the query images.
The team found that its molecular-based approach was comparable to that of in silico algorithms representing the state of the art in similarity search. Unlike those algorithms, however, the team points out that DNA-based search has the potential to scale to significantly larger databases without a correspondingly significant increase in processing time and energy consumption due to its inherently parallel nature. In this way, researchers have barely scratched the surface of what DNA computing can do.
“As DNA data storage is made more practical by advances in fast and low-cost synthesis, sequencing and automation, the ability to perform computation within these databases will pave the way for hybrid molecular-electronic computer systems. Starting with similarity search is exciting because that is a popular primitive in machine learning systems, which are quickly becoming pervasive,” Allen School professor and co-corresponding author Luis Ceze said.
In addition to Ceze and Chen, contributors to the paper include lead author and Allen School alumna Callista Bee (Ph.D., ‘20), Ph.D. students Melissa Queen and Lee Organick, former research scientist Xiaomeng (Aaron) Liu, lab manager David Ward, Allen School and UW Electrical & Computer Engineering professor Georg Seelig, and co-corresponding author Karin Strauss, an affiliate professor in the Allen School and senior principal research manager at Microsoft. Bee initially presented the team’s proof of concept and precursor to this work at the 24th International Conference on DNA Computing and Molecular Programming (DNA 24), for which she earned a Best Student Paper Award.
Read the team’s latest paper, “Molecular-level similarity search brings computing to DNA data storage,” in Nature Communications.