Working with case-control data – GenoKey’s combinatorial data mining platform

January 09, 2014 − by Suzanne Elvidge − in Big data, Data analytics, Data mining, GenoKey news, GenoKey's big data analyses

Case-control data is important for longitudinal studies, but the huge sets of complex data can make analysis difficult. GenoKey’s combinatorial data mining algorithms, run on standard computer hardware, can cut through the complexity in minutes, finding correlations with a speed and accuracy unmatched by other tools currently available.

Case-control studies

Case-control studies look at sets of data collected over long periods, comparing cases (people, plants or animals with a particular disease or disorder) with controls who do not develop it. This could be a group of smokers, comparing the electronic medical records of those who develop cancer (cases) with those who don't (controls); or a group of trees, finding the genetic differences between those that do (cases) and don't (controls) develop a certain infectious disease, such as sudden oak death. Case-control studies can use historical data from existing longitudinal studies, or current results generated from ongoing observational studies or investigational trials.

Case-control studies can generate a lot of information – even a small proof-of-concept study, with only a few hundred to a thousand patients and controls and a few hundred biomarkers such as SNPs, will create a complex dataset. Larger studies, involving tens of thousands of patients and controls and millions of markers, will generate massive amounts of data, including genetic information (SNPs, DNA and sequence data, genetic tests), clinical data, and information from electronic medical records and blood test results.

The GenoKey workflow

GenoKey’s combinatorial data mining technology is powerful enough to help researchers to find the valuable information in these huge data sets, identifying patterns that could otherwise be missed in the noise of big data.

The workflow:

1. Validate and structure the data as tables of case and control results

  • GenoKey’s technology can handle any structured data format, including CSV files, Excel files, relational databases and others
  • The data can be anonymised if required, by excluding patient IDs

2. Create a hypothesis
3. Use the GenoKey pattern miner to find the patterns and correlations within the data that support or refute the hypothesis

  • GenoKey’s innovative pattern miner scales from standard computer hardware to powerful large-scale cloud systems, and computes in minutes or hours rather than days or weeks – essential when high-quality output (for example, permutation tests) is required

4. Use the results or, if required, feed the information back into the system and run the analysis again.
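To make step 3 more concrete: GenoKey's pattern miner itself is proprietary, but the permutation testing it mentions can be illustrated with a generic sketch. The example below (plain Python, with made-up marker data – not GenoKey's actual code or API) tests whether a single binary marker is more frequent in cases than in controls, by repeatedly shuffling the case/control labels and asking how often a label shuffle produces a difference as large as the one observed.

```python
import random

def permutation_test(cases, controls, n_perm=2000, seed=42):
    """Two-sided permutation test for a difference in marker frequency
    between cases and controls (each list holds 0/1 marker values)."""
    rng = random.Random(seed)
    n_cases, n_controls = len(cases), len(controls)
    observed = sum(cases) / n_cases - sum(controls) / n_controls
    pooled = list(cases) + list(controls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly relabel subjects as case/control
        diff = sum(pooled[:n_cases]) / n_cases - sum(pooled[n_cases:]) / n_controls
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)

# Hypothetical marker: present in 70% of cases but only 30% of controls
cases = [1] * 70 + [0] * 30
controls = [1] * 30 + [0] * 70
p_value = permutation_test(cases, controls)
print(f"p = {p_value:.4f}")  # a small p-value suggests an association
```

Real analyses run tests like this across thousands or millions of markers (and combinations of markers) at once, which is why the compute time mentioned above matters.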

[Image: GenoKey workflow]
