A revolutionary DNA search engine accelerates genetic discoveries
It is now possible to detect rare genetic diseases in patients and identify tumor-specific mutations – a feat made possible by DNA sequencing, which has transformed biomedical research for decades. In recent years, the introduction of new sequencing technologies (next generation sequencing) has led to a wave of breakthroughs. During 2020 and 2021, for example, these approaches enabled rapid decoding and global surveillance of the SARS-CoV-2 genome.
At the same time, an increasing number of researchers are making their sequencing results publicly available. This has led to an explosion of data stored in major databases such as the American SRA (Sequence Read Archive) and the European ENA (European Nucleotide Archive). Together these archives now contain about 100 petabytes of information, which is roughly equivalent to the total amount of text on the entire Internet, with one petabyte equaling one million gigabytes.
Until now, biomedical scientists have needed vast computing resources to search these huge genetic repositories and compare them with their own data, making comprehensive searches nearly impossible. Researchers at ETH Zurich have now developed a way to overcome this limitation.
Full text search instead of downloading entire datasets
The team created a tool called MetaGraph, which greatly simplifies and speeds up the process. Instead of downloading entire datasets, MetaGraph allows for direct searches within raw DNA or RNA data – just like using an Internet search engine. Scientists simply enter the genetic sequence of interest into a search field, and within seconds or minutes depending on the query, they can see where that sequence appears in global databases.
“It’s like the Google search engine for DNA,” explains Professor Gunnar Rach, a data scientist at the Department of Computer Science at ETH Zurich. Previously, researchers could only search for metadata and then had to download the full datasets to access the raw sequences. This approach was slow, incomplete and expensive.
According to the study authors, MetaGraph is also remarkably cost-effective. Representing all publicly available biological sequences would require only a few computer hard drives, and large queries would cost no more than about $0.74 per megabase.
Because the new DNA search engine is fast and accurate, it could dramatically speed up research — especially in identifying emerging pathogens or analyzing genetic factors associated with antibiotic resistance. The system may also help locate beneficial viruses that destroy harmful bacteria (phages) hidden within these massive databases.
Compression by a factor of 300
In their study published on October 8 naturethe ETH team demonstrated how MetaGraph works. The tool organizes and compresses genetic data using advanced mathematical graphs that organize information more efficiently, similar to the way a spreadsheet program arranges values. “Mathematically, it’s a huge matrix with millions of columns and trillions of rows,” Rach explains.
Creating indexes to make large datasets searchable is a familiar concept in computer science, but ETH’s approach stands out in how it links metadata to metadata while achieving an exceptional compression rate of about 300 times. This reduction works much like summarizing a book – it removes redundancy while maintaining the narrative and essential relationships, keeping all the relevant information in a much smaller format.
“We are pushing the boundaries of what is possible in order to keep the datasets as compact as possible without losing necessary information,” says Dr. Andre Kallis, who, like Rach, is a member of the Biomedical Informatics Group at ETH Zurich. In contrast to other DNA masks currently being researched, the ETH researchers’ approach is scalable. This means that the larger the amount of data that is queried, the less additional computing power the tool requires.
Half of the data is already available now
MetaGraph was first introduced in 2020, and has been steadily improved. The tool is now generally available for searches (https://metagraph.ethz.ch/search) and already indexes millions of DNA, RNA, and protein sequences from viruses, bacteria, fungi, plants, animals, and humans. Currently, approximately half of all available global sequencing datasets are included, with the remainder expected to follow by the end of the year. Because MetaGraph is open source, it may also attract interest from pharmaceutical companies that manage large amounts of internal research data.
Callis thinks it’s possible that a DNA search engine could one day be used by individuals: “In the early days, even Google didn’t know exactly what a search engine was for. If rapid development in DNA sequencing continues, it may become common to identify your balcony plants more accurately.”














Post Comment