Describing clinical data using the research data ontology
Submitted by mmarkov on January 15, 2009 - 11:55pm
Abstract
The research data ontology (hereafter RDO) is intended to describe all kinds of clinical data. We are particularly interested in clinical data because applying knowledge bases to it is more innovative than applying them to other scientific data. All the same, the ontology should be general enough to incorporate non-clinical data, since most clinical studies also use genotype data and the like. We have successfully represented subsets of two different data sets in RDF using the RDO, with Virtuoso server software as the triple store.
Data Sets & Tools Used
To store the data sets, a local copy of Virtuoso server was used. Notes on setting it up (a tedious process, but one that only needs to be done once) are summarized under Applications > Triple stores > Virtuoso.
Earlier in the project, Jena was used as a triple store. Out-of-memory errors were encountered while reasoning with Pellet.
Java is to be used for accessing the data, since many libraries, such as Jena, are written in or for Java.
Perl is currently the language of choice for importing data. Previously this was done with Java, but Perl is more efficient at text processing, and as a scripting language it is better suited to a database administrator, who is usually responsible for running bulk imports of data.
Methods
Notes on setting up Virtuoso can be found under Applications > Triple stores > Virtuoso. Setup is tedious but only needs to be done once.
A Perl script (main.pl, see attached) is used to import the Biomarkers data into Virtuoso. It deserves a description of its own, since it is particularly well organized for this purpose and should serve as a prototype for future imports.
Import Script
The script iterates through a data file, creates SPARUL INSERT INTO queries for the data, and runs them against a Virtuoso graph. (SPARUL, also called SPARQL/Update, is an update extension of the SPARQL query language that Virtuoso supports.)
The import script presupposes another script that puts the data into a particular format, namely a .csv file in which each row has the following layout:
[patient id | day | hr | min | data point | ... | data point]
where day, hour and minute are offsets from the patient's first data point(s) and everything is comma-separated. The format may seem strange, but it is necessary: 'absolute' time stamps (e.g., Jan 1, 2001) are personal data that should never be disclosed and should be discarded from any study. Our use of such data would never be approved by an ethics board. An example best illustrates what is meant. Say Joe Average had his body temperature and blood pressure measured at St Paul's on Feb 27 at 4PM, Feb 28 at 5PM and Feb 29 at 5:30AM, 2008. Then the resulting day, hour, minute values would be:
0,0,0 on Feb 27 at 4PM
1,1,0 on Feb 28 at 5PM
1,13,30 on Feb 29 at 5:30AM
and the .csv might look like this, assuming Joe's ID was 42:
42,0,0,0,37.0,100
42,1,1,0,36.9,110
42,1,13,30,37.7,105
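The offset calculation can be sketched as follows. This is a minimal illustration in Python, not the Perl used by the project's scripts; the function name `offsets` and the dates are assumptions chosen for the example (valid 2008 calendar dates spaced 25 hours and 37.5 hours after the first measurement, matching the offsets in the example above).

```python
from datetime import datetime


def offsets(timestamps):
    """Return (day, hour, minute) offsets of each timestamp from the earliest one."""
    first = min(timestamps)
    result = []
    for ts in timestamps:
        # Whole minutes elapsed since the patient's first data point
        total_minutes = int((ts - first).total_seconds() // 60)
        days, rest = divmod(total_minutes, 24 * 60)
        hours, minutes = divmod(rest, 60)
        result.append((days, hours, minutes))
    return result


# Hypothetical measurement times for one patient
measured = [
    datetime(2008, 2, 27, 16, 0),
    datetime(2008, 2, 28, 17, 0),
    datetime(2008, 2, 29, 5, 30),
]

for day, hour, minute in offsets(measured):
    # Only the offsets are written out; the absolute dates are discarded
    print(f"42,{day},{hour},{minute}")
```

Only the offsets leave this step, so the absolute dates never reach the .csv file or the triple store.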
Another requirement of the script is that the data must be ordered by patient ID. This is not difficult to arrange, but it affects how the script works.
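A pre-import check for this requirement might look like the sketch below. It assumes that "ordered by patient ID" means all rows of one patient are adjacent in the file; whether main.pl also needs the IDs sorted is not stated here. The function name `check_grouped_by_patient` is hypothetical, and the example is in Python rather than Perl.

```python
import csv
import io


def check_grouped_by_patient(rows):
    """Return the number of patients; raise ValueError if a patient's rows are split."""
    seen = set()
    current = None
    for line_no, row in enumerate(rows, start=1):
        patient_id = row[0]
        if patient_id != current:
            # A new run of rows starts; the ID must not have appeared before
            if patient_id in seen:
                raise ValueError(f"row {line_no}: patient {patient_id} is not contiguous")
            seen.add(patient_id)
            current = patient_id
    return len(seen)


# Hypothetical input: two patients, each in one contiguous run of rows
sample = "42,0,0,0,37.0,100\n42,1,1,0,36.9,110\n7,0,0,0,36.5,120\n"
print(check_grouped_by_patient(csv.reader(io.StringIO(sample))))
```

Running such a check before the import lets a badly ordered file fail early, before anything is inserted into the graph.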
The script makes good use of Perl hashes to store things that change from data set to data set, namely namespaces, protocols, datatypes (float, int, etc.; they are assumed to be XSD datatypes) and units of measure for each column (i.e., each type of data point).
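The idea of keeping per-data-set settings in lookup tables can be sketched as below, with Python dictionaries standing in for the Perl hashes. Everything specific is an assumption made for illustration: the `http://example.org/` namespace and graph URI, the property names, the column names and the units are all hypothetical, and none of them are taken from main.pl or the RDO. The `INSERT INTO <graph> { ... }` form follows the SPARUL syntax and has not been checked against a particular Virtuoso version.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical per-data-set settings; a different data set would change only these
NAMESPACES = {"ex": "http://example.org/rdo-demo#"}
GRAPH = "http://example.org/graph/biomarkers-demo"
COLUMNS = [
    {"name": "bodyTemperature", "datatype": "float", "unit": "degC"},
    {"name": "bloodPressure", "datatype": "int", "unit": "mmHg"},
]


def typed(value, datatype):
    """Format a value as an RDF literal with an XSD datatype."""
    return f'"{value}"^^<{XSD}{datatype}>'


def row_to_insert(row):
    """Build one SPARUL INSERT INTO statement for one .csv row."""
    patient_id, day, hour, minute, *values = row
    ex = NAMESPACES["ex"]
    triples = []
    for column, value in zip(COLUMNS, values):
        name, unit = column["name"], column["unit"]
        # One observation node per patient, time offset and column
        obs = f"<{ex}obs/{patient_id}/{day}-{hour}-{minute}/{name}>"
        triples += [
            f"{obs} <{ex}patient> <{ex}patient/{patient_id}> .",
            f"{obs} <{ex}day> {typed(day, 'int')} .",
            f"{obs} <{ex}hour> {typed(hour, 'int')} .",
            f"{obs} <{ex}minute> {typed(minute, 'int')} .",
            f"{obs} <{ex}value> {typed(value, column['datatype'])} .",
            f'{obs} <{ex}unit> "{unit}" .',
        ]
    return f"INSERT INTO <{GRAPH}> {{\n  " + "\n  ".join(triples) + "\n}"


print(row_to_insert(["42", "0", "0", "0", "37.0", "100"]))
```

The sketch does not validate or escape the values, which a real import would need to do before sending the statement to the endpoint.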
The rest of the script should be sufficiently well commented to be understandable upon reading.
Some useful SPARQL queries
The Virtuoso SPARQL endpoint is your friend. I have found these queries to be handy.
//TODO put in queries
1. The blank insert query -- can be used to check whether Virtuoso is set up properly
2. The triple count query -- how large is the data set?
3. The patient list query -- which patients are in our data set?
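As a rough idea of what queries 2 and 3 could look like, here is a sketch that builds them as strings in Python. These are not the author's queries: the graph URI and the patient class URI are hypothetical placeholders (the actual RDO class name is not given here), and the count uses the SPARQL 1.1 aggregate syntax, which may differ from what a given Virtuoso version accepts. The blank insert query is left out because its exact form is not described.

```python
def triple_count_query(graph):
    """SPARQL query counting all triples in one named graph."""
    return f"SELECT (COUNT(*) AS ?triples) WHERE {{ GRAPH <{graph}> {{ ?s ?p ?o }} }}"


def patient_list_query(graph, patient_class):
    """SPARQL query listing every resource typed as a patient in one named graph."""
    return (
        f"SELECT DISTINCT ?patient WHERE {{ GRAPH <{graph}> "
        f"{{ ?patient a <{patient_class}> }} }}"
    )


# Hypothetical URIs, for illustration only
graph = "http://example.org/graph/biomarkers-demo"
print(triple_count_query(graph))
print(patient_list_query(graph, "http://example.org/rdo-demo#Patient"))
```

The printed strings can be pasted into the SPARQL endpoint's query form once the placeholder URIs are replaced with the real ones.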
Results
Virtuoso seems to be a good triple store to use and appears faster than the others tried, although no numerical comparison has been made.
A benchmark of FaCT++ and Pellet shows that FaCT++ is faster and uses less memory. (See Fig. ?) This may be the end of the out-of-memory errors but further tests need to be done.
The SIRS data set is too disorganized and poorly curated to use. There have been no problems with the Biomarkers data set so far, but ideally we would like more data sets.
Future Directions
The remaining work consists of importing a sufficiently large portion of some data set to show that the technology scales. Ideally we would like to reproduce the analysis carried out by the researchers who generated the data set and, if the ethics board and other bodies approve, to do new analyses or at least new classifications of the data using our superior reasoning capabilities.
The script cannot yet handle non-flat data (e.g., data spread over two different tables). Adding this functionality should be an easy modification, and it will be needed in the future.