Genetic genealogy websites enable people to upload their results from consumer DNA testing services like Ancestry.com and 23andMe to explore their genetic makeup, familial relationships, and even discover new relatives they didn’t know they had. But how can you be sure that the person who emails you claiming to be your Uncle Phil really is a long-lost relation?
Based on what a team of Allen School researchers discovered when interacting with the largest third-party genetic genealogy service, you may want to approach plans for a reunion with caution. In their paper “Genotype Extraction and False Relative Attacks: Security Risks to Third-Party Genetic Genealogy Services Beyond Identity Inference,” they analyze how security vulnerabilities built into the GEDmatch website could allow someone to construct an imaginary relative or obtain sensitive information about people who have uploaded their personal genetic data.
Through a series of highly controlled experiments using information from the GEDmatch online database, Allen School alumnus and current postdoctoral researcher Peter Ney (Ph.D., ’19) and professors Tadayoshi Kohno and Luis Ceze determined that it would be relatively straightforward for an adversary to exploit vulnerabilities in the site’s application programming interface (API) that compromise users’ privacy and expose them to potential fraud. The team demonstrated multiple ways in which they could extract highly personal, potentially sensitive genetic information about individuals on the site. They also showed that existing familial relationships could be used to create false new ones by uploading fake profiles that indicate a genetic match where none exists.
Part of GEDmatch’s attraction is its user-friendly graphical interface, which relies on bars and color-coding to visualize specific genetic markers and similarities between two profiles. For example, the “chromosome paintings” illustrate the differences between two profiles on each chromosome, accompanied by “segment coordinates” that indicate the precise genetic markers that the profiles share. These one-to-one comparisons, however, can be used to reveal more information than intended. It was this aspect of the service that the researchers were able to exploit in their attacks. To their surprise, they were able not only to determine the presence or absence of various genetic markers at certain segments of a hypothetical user’s profile, but also to reconstruct 92% of the entire profile with 98% accuracy.
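To see why a one-to-one comparison can leak more than intended, consider a deliberately simplified Python sketch. It is a toy model, not the procedure from the paper and not GEDmatch's actual interface: it assumes each marker has a genotype encoded as 0, 1, or 2 copies of an alternate allele, and that a comparison reports a full, half, or no match at every marker. All function names are hypothetical.

```python
# Toy model only: genotypes are counts of the alternate allele (0, 1, or 2)
# at each marker. The three-way comparison result is an assumption made
# for illustration, not the real site's output format.

GENOTYPES = (0, 1, 2)

def compare_marker(a: int, b: int) -> str:
    """Hypothetical per-marker comparison result for two genotypes."""
    if a == b:
        return "full"   # both alleles match
    if abs(a - b) == 1:
        return "half"   # exactly one allele is shared
    return "none"       # 0 vs 2: no allele is shared

def narrow(probe: int, result: str, candidates: set) -> set:
    """Keep only the target genotypes consistent with one comparison."""
    return {g for g in candidates if compare_marker(probe, g) == result}

def extract(target: list, probes: list) -> list:
    """Intersect the evidence from several probe profiles, marker by marker."""
    recovered = []
    for i, hidden in enumerate(target):
        candidates = set(GENOTYPES)
        for probe in probes:
            # Stands in for reading the comparison output for this marker.
            result = compare_marker(probe[i], hidden)
            candidates = narrow(probe[i], result, candidates)
        recovered.append(candidates)
    return recovered

target = [0, 1, 2, 1]
one_probe = extract(target, [[1, 1, 1, 1]])
two_probes = extract(target, [[1, 1, 1, 1], [0, 0, 0, 0]])
print(one_probe)   # some markers remain ambiguous
print(two_probes)  # every marker is pinned down
```

In this toy model a single probe leaves some markers ambiguous, while a second, differently constructed probe resolves them. That is one intuition for why an attacker might upload several crafted profiles rather than one.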
As a first step, Ney and his colleagues created a research account on GEDmatch, to which they uploaded artificial genetic profiles generated from data contained in anonymous profiles from multiple, publicly available datasets designated for research use. By assigning each of their profiles a privacy setting of “research,” the team ensured that their artificial profiles would not appear in public matching results. Once the profiles were uploaded, GEDmatch automatically assigned each one a unique ID, which enabled the team to perform comparisons between a specific profile and others in the database — in this case, a set of “extraction profiles” created for this purpose. The team then performed a series of experiments. For the total profile reconstruction, they uploaded and ran comparisons between 20 extraction profiles and five targets. Based on the GEDmatch visualizations alone, they were able to recover just over 60% of the target profiles’ data. Based on their knowledge of genetics, specifically the frequency with which possible DNA bases are found within the population at a specific position on the genome, they were able to determine another 30%. They then relied on a genetic technique known as imputation to fill in the rest.
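The frequency-based step can be illustrated with a small Python sketch. It is not the researchers' code. It assumes biallelic markers, genotypes encoded as the number of alternate alleles, and Hardy-Weinberg proportions for turning a population allele frequency into genotype probabilities. The function names and the sample frequencies are hypothetical.

```python
from typing import List, Optional

def most_likely_genotype(alt_freq: float) -> int:
    """Most probable genotype (0, 1, or 2 alternate alleles) at one marker.

    Assumes Hardy-Weinberg proportions: with alternate-allele frequency p,
    P(0) = (1 - p)^2, P(1) = 2p(1 - p), P(2) = p^2.
    """
    p = alt_freq
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return max(range(3), key=lambda g: probs[g])

def fill_unknown(partial: List[Optional[int]], alt_freqs: List[float]) -> List[int]:
    """Keep directly recovered genotypes; guess the rest from frequencies."""
    return [
        g if g is not None else most_likely_genotype(f)
        for g, f in zip(partial, alt_freqs)
    ]

# None marks a position that the comparisons alone did not reveal.
partial_profile = [0, None, 2, None]
frequencies = [0.1, 0.5, 0.3, 0.9]  # made-up alternate-allele frequencies
completed = fill_unknown(partial_profile, frequencies)
print(completed)
```

A guess of this kind is most reliable where one genotype dominates in the population. Where the candidates are closer to evenly split, a statistical technique such as imputation, which draws on neighboring markers, is the better tool.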
Once they had constructed nearly the whole of a target’s profile, the researchers used that information to create a false child for one of their targets. When they ran the comparison between the target profile and the false child profile through the system, GEDmatch confirmed that the two were a match for a parent-child relationship.
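The idea behind a false child can also be sketched in Python under strong simplifying assumptions. A biological child inherits one allele from each parent at every marker, so a forged profile that copies one of the target's two alleles at each position will share at least one allele with the target everywhere. The code below is an illustration only: the allele encoding, the function names, and the sharing check are hypothetical, and real matching services use more involved criteria than this toy test.

```python
import random
from typing import List, Tuple

Genotype = Tuple[int, int]  # two alleles per marker, each 0 or 1

def make_false_child(parent: List[Genotype], alt_freqs: List[float],
                     rng: random.Random) -> List[Genotype]:
    """Build a profile that takes one allele per marker from `parent`.

    The second allele is drawn from a made-up population frequency,
    standing in for an unknown other parent.
    """
    child = []
    for (a, b), p in zip(parent, alt_freqs):
        inherited = rng.choice((a, b))
        other = 1 if rng.random() < p else 0
        child.append((inherited, other))
    return child

def shares_allele_everywhere(x: List[Genotype], y: List[Genotype]) -> bool:
    """Toy parent-child test: at least one shared allele at every marker."""
    return all(set(gx) & set(gy) for gx, gy in zip(x, y))

rng = random.Random(0)  # fixed seed so the example is repeatable
parent = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
freqs = [0.2, 0.5, 0.7, 0.4, 0.1]
child = make_false_child(parent, freqs, rng)
unrelated = [(1, 1), (0, 0), (0, 0), (1, 1), (1, 1)]

print(shares_allele_everywhere(parent, child))
print(shares_allele_everywhere(parent, unrelated))
```

Because the forged profile is built directly from the reconstructed target, it passes this sharing check by construction, while a profile that conflicts with the target at even one marker does not.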
While it is true that an adversary would have to have the right combination of programming skills and knowledge of genetics and genealogy to pull it off, the process isn’t as difficult as it sounds — or, to a security expert, as it should be. To acquire a person’s entire profile, Ney and his colleagues performed the comparisons between extraction and target profiles manually. They estimate the process took 10 minutes to complete — a daunting prospect, perhaps, if an adversary wanted to compare a much greater number of targets. But if one were to write a script that automatically performs the comparisons? “That would take 10 seconds,” said Ney, who is the lead author of the paper.
Consumer-facing genetic testing and genetic genealogy are still relatively nascent industries, but they are gaining in popularity. And as these databases grow, so does the interest of law enforcement looking to crack criminal cases for which the trail has gone cold. In one high-profile example from last year, investigators arrested a suspect alleged to be the Golden State Killer, whose identity remained elusive for more than four decades before genetic genealogy yielded a breakthrough. Given the prospect of using genetic information for this and other purposes, the researchers’ findings raise important questions about how to ensure the security and integrity of genetic genealogy results, now and into the future.
“We’re only beginning to scratch the surface,” said Kohno, who co-directs the Allen School’s Security and Privacy Research Lab and previously helped expose potential security vulnerabilities in internet-connected motor vehicles, wireless medical implants, consumer robotics, mobile advertising, and more. “The responsible thing for us is to disclose our findings so that we can engage a community of scientists and policymakers in a discussion about how to mitigate this issue.”
Echoing Kohno’s concern, Ceze emphasizes that the issue is made all the more urgent by the sensitive nature of the data that people upload to a site like GEDmatch — with broad legal, medical, and psychological ramifications — in the midst of what he refers to as “the age of oversharing information.”
“Genetic information correlates to medical conditions and potentially other deeply personal traits,” noted Ceze, who co-directs the Molecular Information Systems Laboratory at the University of Washington and specializes in computer architecture research as a member of the Allen School’s Sampa and SAMPL groups. “As more genetic information goes digital, the risks increase.”
Unfortunately for those who are not prone to oversharing, the risks extend beyond the direct users of genetic genealogy services. According to Ney, GEDmatch contains the personal genetic information of a sufficient number and variety of people across the U.S. that, should someone gain illicit possession of the entire database, they could potentially link genetic information with identity for a large portion of the country. While Ney describes the decision to share one’s data on GEDmatch as a personal one, some decisions appear to be more personal — and wider reaching — than others. And once a person’s genetic data is compromised, he notes, it is compromised forever.
So whether or not you’ve uploaded your genetic information to GEDmatch, you might want to ask Uncle Phil for an additional form of identification before rushing to make up the guest bed.
“People think of genetic data as being personal — and it is. It’s literally part of their physical identity,” Ney said. “You can change your credit card number, but you can’t change your DNA.”
To learn more, read the UW News release and the FAQ on security and privacy issues associated with genetic genealogy services. Also check out related coverage by MIT Technology Review, OneZero, ZDNet, GeekWire, McClatchy, and Newsweek.