GlycomeDB: Integration von Kohlenhydratstrukturdatenbanken

Ranzinger, René

PDF, German

Translation of abstract (English)

René Ranzinger
Dr. sc. hum.
GlycomeDB: Integration of carbohydrate structure databases
Born on 20 September 1976 in Zittau
Diplom/Master's degree in Computer Science, awarded on 3 November 2004 by the Fachhochschule Darmstadt
Doctoral subject: Medical Biometry and Informatics
Doctoral supervisor: Prof. Dr. rer. nat. Thomas Wetter
The biggest problem in working with carbohydrate structure databases is that structural data and annotation information, such as taxonomic annotations or experimental details, are distributed across several databases worldwide. Most of these databases store the same structures and have overlapping data content. For each of these databases, a specific carbohydrate sequence format and monosaccharide naming scheme was developed, which complicates both the comparison of data content and the exchange of data between the databases. As a result, each existing database is an isolated island with almost no connectivity to the other databases.
The aim of the GlycomeDB project is to counteract this isolation of databases and the scattering of information. For this purpose, the carbohydrate structures and their taxonomic annotations from the seven largest freely available databases were integrated into a new database named GlycomeDB. The seven source databases are BCSDB, CarbBank, the CFG database, GLYCOSCIENCES.de, GlycoBase (Dublin), GlycoBase (Lille) and KEGG.
The most significant effort in the project was the standardization of the stored information. Carbohydrate structures are stored in the GlycoCT sequence format, and taxonomic annotations follow the NCBI Taxonomy. A Java program library was developed for the project that can load, process and save carbohydrate structures. To import structures from the sequence formats used by the different databases, these formats were analyzed and a grammar was defined for each format.
Based on these grammars, import routines were implemented that parse the sequences and store the carbohydrate structure information in Java objects. Using predefined monosaccharide dictionaries, the carbohydrate sequences were translated into the GlycoCT namespace. In the same way, predefined dictionaries are used to translate the taxonomic annotations of each carbohydrate structure into NCBI Taxonomy IDs.
Based on the Java library and the dictionaries, the Java program GlycoUpdateDB was implemented. GlycoUpdateDB downloads and standardizes the information from the seven databases and stores the data in GlycomeDB in an automated run. The only human intervention required is the curation of the dictionaries. The program runs on a weekly basis, synchronizing GlycomeDB with the newest structures from the source databases. Sequences that contain errors or cannot be translated to GlycoCT are recorded in the database. Based on these records, error reports were written and sent to the database providers, who used them to reduce the number of errors in their databases.
To make the data stored in GlycomeDB available to all interested scientists, the database and the program GlycoUpdateDB can be downloaded free of charge. In addition, a web portal (http://www.glycome-db.org) was implemented that provides online access to the data. The portal offers several structure- and taxon-based search routines for finding structures in GlycomeDB and in the seven integrated databases. A search result shows all information about the carbohydrate, including references to the source databases. With these references, the user can switch to the web pages of the source database and obtain further data. Additionally, a unique complex query system was implemented in GlycomeDB that allows structures to be found based on various combined search criteria.
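The dictionary-based standardization described above (source residue names translated into one namespace, taxon names translated into NCBI Taxonomy IDs, untranslatable input recorded for error reports) can be illustrated with a minimal sketch. It is not the thesis code: the class DictionarySketch, its methods, and the "std:" residue names are hypothetical placeholders and do not reproduce real GlycoCT notation. Only the NCBI Taxonomy IDs 9606 (Homo sapiens) and 10090 (Mus musculus) are real values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.OptionalInt;

final class DictionarySketch {

    // Hypothetical residue dictionary: several source spellings map to one
    // placeholder name. The "std:" names are invented, NOT real GlycoCT.
    static final Map<String, String> RESIDUES = Map.of(
            "Glc", "std:hexose-glc",
            "D-Glcp", "std:hexose-glc",
            "Gal", "std:hexose-gal",
            "D-Galp", "std:hexose-gal");

    // Taxon dictionary: free-text source annotations mapped to NCBI Taxonomy IDs.
    static final Map<String, Integer> TAXA = Map.of(
            "human", 9606,
            "Homo sapiens", 9606,
            "mouse", 10090);

    // Translates every residue name. If one name is missing from the
    // dictionary, the problem is recorded and no partial result is returned.
    static Optional<List<String>> translateResidues(List<String> names, List<String> errors) {
        List<String> translated = new ArrayList<>();
        for (String name : names) {
            String mapped = RESIDUES.get(name);
            if (mapped == null) {
                errors.add("unknown residue: " + name);
                return Optional.empty();
            }
            translated.add(mapped);
        }
        return Optional.of(translated);
    }

    // Looks up the NCBI Taxonomy ID for a source annotation, recording misses.
    static OptionalInt taxonId(String sourceName, List<String> errors) {
        Integer id = TAXA.get(sourceName.trim());
        if (id == null) {
            errors.add("unknown taxon: " + sourceName);
            return OptionalInt.empty();
        }
        return OptionalInt.of(id);
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        System.out.println(translateResidues(List.of("Glc", "D-Galp"), errors));
        System.out.println(translateResidues(List.of("Glc", "Xyz"), errors));
        System.out.println(taxonId("human", errors));
        System.out.println(errors); // collected entries for an error report
    }
}
```

In this sketch the error list stands in for the records that the abstract says are stored in the database and later turned into error reports. Curating the two maps corresponds to the manual dictionary curation mentioned above.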
Complementing the web portal, two web service interfaces were implemented that allow other databases and programs to retrieve data from GlycomeDB and to search it in an automated way.
The result of my thesis is the standardization and storage of structural data and taxonomic annotations from the seven major databases in a new database. This database now contains the most complete index of all available carbohydrate structures. With the web portal and the web service interfaces, the information in GlycomeDB is freely accessible. The provided information and program code have been integrated into other programs and used for statistical analysis.
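Once all sequences share one standardized format, entries for the same structure from different source databases can be collapsed into a single index entry that keeps references back to each source. The following sketch illustrates that idea under the assumption that identical standardized sequences denote identical structures. The class StructureIndexSketch, the entry IDs and the sequence strings are hypothetical placeholders, not data or code from GlycomeDB.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StructureIndexSketch {

    // One entry as delivered by a source database, after standardization.
    static final class SourceEntry {
        final String database;         // e.g. "CarbBank" or "KEGG"
        final String sourceId;         // invented placeholder ID
        final String standardSequence; // placeholder for a standardized sequence

        SourceEntry(String database, String sourceId, String standardSequence) {
            this.database = database;
            this.sourceId = sourceId;
            this.standardSequence = standardSequence;
        }
    }

    // Groups entries by standardized sequence, so each structure appears once
    // together with the references to every source database that contains it.
    static Map<String, List<String>> buildIndex(List<SourceEntry> entries) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (SourceEntry entry : entries) {
            index.computeIfAbsent(entry.standardSequence, key -> new ArrayList<>())
                 .add(entry.database + ":" + entry.sourceId);
        }
        return index;
    }

    public static void main(String[] args) {
        List<SourceEntry> entries = List.of(
                new SourceEntry("CarbBank", "cb-1", "seq-A"),
                new SourceEntry("KEGG", "kg-7", "seq-A"),
                new SourceEntry("BCSDB", "bc-3", "seq-B"));
        // Prints one line per distinct structure with its source references.
        buildIndex(entries).forEach((sequence, refs) ->
                System.out.println(sequence + " -> " + refs));
    }
}
```

Keying the index on the standardized sequence is what makes the overlap between source databases visible; with database-specific formats, the two "seq-A" entries would not compare as equal.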

Document type:	Abstract of a medical dissertation
Supervisor:	Wetter, Prof. Dr. T.
Date of thesis defense:	7 October 2010
Date Deposited:	17 May 2011 15:55
Date:	2009
Faculties / Institutes:	Medizinische Fakultät Heidelberg > Dekanat der Medizinischen Fakultät Heidelberg
DDC-classification:	610 Medical sciences Medicine
Uncontrolled Keywords:	Medical Biometry and Informatics