For those who follow the RSS feed of thesis presentations at the Department of Computer and Information Science at Linköping University, it may not come as news that I presented a thesis titled »An XML-based Database of Molecular Pathways« exactly twelve hours ago.
Research on protein-protein interactions produces vast quantities of data, and a large number of databases hold data from this research. Many of these databases offer the data for download on the web in a number of different formats, many of them XML-based.

With the arrival of these XML-based formats, and especially the standardized formats such as PSI-MI, SBML and BioPAX, there is a need for searching in data represented in XML. We wanted to investigate the capabilities of XML query tools when it comes to searching in this data. Due to the large datasets we concentrated on native XML database systems, which in addition to searching XML data also offer storage and indexing specially suited for XML documents.

A number of queries were tested on data exported from the databases IntAct and Reactome using the XQuery language. Both simple and advanced queries were performed. The simpler queries included listing information on a specified protein and counting the number of reactions.

One central issue with protein-protein interactions is finding pathways, i.e. series of interconnected chemical reactions between proteins. This problem involves graph searches, and since we suspected that the complex queries it required would be slow, we also developed a C++ program using a graph toolkit. The simpler queries were performed relatively fast. Pathway searches in the native XML databases took a long time even for short searches, while the C++ program achieved much faster pathway searches.
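To give a feel for why pathway finding is a graph problem, here is a minimal sketch of a breadth-first search for a shortest chain of interactions between two proteins. It is not the C++ program from the thesis, which used a graph toolkit; it is a Python illustration only. The function name `find_pathway`, the adjacency-dictionary representation and the protein names `A` to `E` are all assumptions made up for this example, not data from IntAct or Reactome.

```python
from collections import deque


def find_pathway(graph, source, target):
    """Return a shortest chain of proteins from source to target, or None.

    `graph` maps each protein to the proteins it leads to (hypothetical
    representation chosen for this sketch).
    """
    if source == target:
        return [source]
    previous = {source: None}  # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, ()):
            if neighbour in previous:
                continue  # already reached by an equally short or shorter route
            previous[neighbour] = current
            if neighbour == target:
                # Walk the predecessor links back to the source.
                path = [neighbour]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None  # no pathway connects the two proteins


# Made-up toy interaction data, for illustration only.
interactions = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}

print(find_pathway(interactions, "A", "E"))
```

With the graph held in memory, each protein and each interaction is visited at most once. A query language working on XML documents has to express the same traversal as repeated lookups or recursive functions, which may be part of why such searches were slow.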
A link to the thesis itself will probably appear at the end of next week, once the report has been submitted for printing.