
Data mining and machine learning approaches for the integration of genome-wide association and methylation data: methodology and main conclusions from GAW20

Darst, Burcu ; Engelman, Corinne D. ; Tian, Ye ; Bermejo, Justo

In: BMC Genetics, 19 (2018), No. S.1:76, pp. 1-8. ISSN 1471-2156

PDF, English (886kB) | License: Creative Commons Attribution 3.0 Germany


Abstract

Background: Multiple layers of genetic and epigenetic variability are being simultaneously explored in an increasing number of health studies. We summarize here different approaches applied in the Data Mining and Machine Learning group at the GAW20 to integrate genome-wide genotype and methylation array data.

Results: We provide a non-intimidating introduction to some frequently used methods to investigate high-dimensional molecular data and compare the different approaches tried by group members: random forest, deep learning, cluster analysis, mixed models, and gene-set enrichment analysis. Group contributions were quite heterogeneous regarding the data sets investigated (real vs. simulated), the data quality control conducted, and the phenotypes assessed (e.g., metabolic syndrome vs. relative differences of log-transformed triglyceride concentrations before and after fenofibrate treatment). However, some common technical issues were detected, leading to practical recommendations.

Conclusions: Different sources of correlation were identified by group members, including population stratification, family structure, batch effects, linkage disequilibrium, and correlation of methylation values at neighboring cytosine-phosphate-guanine (CpG) sites, and the majority of applied approaches were able to take the identified correlation structures into account. The ability to deal efficiently with high-dimensional omics data and the model-free nature of the approaches, which did not require detailed model specifications, were clearly recognized as the main strengths of the applied methods. A limitation of random forest is its sensitivity to highly correlated variables. The parameter setup and the interpretation of results from deep learning methods, in particular deep neural networks, can be extremely challenging. Cluster analysis and mixed models may need some prior dimension reduction based on existing literature, data filtering, and supplementary statistical methods, and gene-set enrichment analysis requires biological insight.

Document type: Article
Journal title: BMC Genetics
Volume: 19
Number: S.1:76
Publisher: BioMed Central
Place of publication: London
Date created: 23 Oct. 2018 11:08
Year of publication: 2018
ISSN: 1471-2156
Page range: pp. 1-8
Institutions: Medizinische Fakultät Heidelberg und Uniklinikum > Institut für Medizinische Biometrie
DDC subject group: 610 Medicine