Mining data from 100,000 people builds a genetic research tool

June 30, 2015 − by Suzanne Elvidge − in Data mining

By collecting saliva samples from tens of thousands of Californians and combining genomic data with medical records, researchers are creating a medical research tool that could help scientists understand more about the genetics of disease.

The work is part of the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH) and its Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort. RPGEH is an ongoing study of more than 200,000 members of the Kaiser Permanente Medical Care Plan who have consented to share data from their electronic medical records with researchers, along with answers to survey questions on their behaviour and background. The records include clinical, pharmacy, and laboratory test information. GERA was created in 2009 as a collaboration between RPGEH and the Institute for Human Genetics at the University of California, San Francisco (UCSF), and comprises saliva samples from more than 100,000 people in the RPGEH study.

“This is an incredible treasure trove of data. The information collected during medical care is much more comprehensive than the isolated measurements we would make in a traditional research study,” says Neil Risch, UCSF. “By linking these clinical records with genomic data from each person, we now have the power to track down many genetic and environmental contributions to disease.”

The data have already been used to investigate many diseases, finding genetic variants linked to prostate cancer, allergies, glaucoma, macular degeneration, diabetes, high cholesterol, and many more.

“No matter which disease we’ve looked at, we found genetic variants that influence it. And the beauty of this dataset is that it covers countless diseases and traits, and the medical records are constantly being updated as the cohort grows older,” says Risch.

RPGEH has made the data available to the research community through an application and review process, and via the NIH’s database of Genotypes and Phenotypes (dbGaP).

“The goal was to create a resource that many research groups could mine for genetic insights into a broad range of diseases,” says Catherine Schaefer, GERA co-principal investigator and executive director of the RPGEH at Kaiser Permanente.

In a study published in the journal Genetics, researchers from UCSF and Kaiser Permanente looked at the length of telomeres (the caps at the tips of chromosomes), a biomarker of aging. Over two years, the team processed more than 100,000 samples, covering 70 billion genetic variants. The sheer number of samples meant the team had to develop a high-throughput robotic system, which completed the laboratory tests in four months. The volume of data also required new processes for analysing data in real time, alerting the researchers to problems as soon as they arose, and improved analysis quality to make the best use of the data.

“This is the largest telomere length database ever constructed from a single study population,” says Elizabeth Blackburn of UCSF. “At the start, some were sceptical that we could get reliable data from saliva. But we had a 96 percent success rate, and the results are in fact highly consistent with conclusions from studies of blood.”

According to the analysis, telomeres get shorter with age up to the age of 75. After 75, however, the trend reverses, with a link between longer telomeres and longer survival in older people. Women generally had longer telomeres than men, but the difference was significant only after the age of 50. The next step is to look at how telomere length relates to disease, behaviour and environmental factors.

Around 20% of the individuals were from minority groups, with over 50 different race/ethnicity identities represented. The researchers developed four separate ethnicity-specific gene analysis arrays (gene chips), for non-Hispanic whites, African Americans, East Asians, and Latinos.

“We were particularly interested in those who checked off more than one box,” says Risch. “More and more people are identifying as multi-ethnic, which can pose some technical challenges for genomic studies. At the same time, it also presents opportunities for analysing genetic and social contributions to disease differences between groups.”
