Problem: What is the extent of the overlap between four biomedical ontologies? How can merged terms show provenance?
Huge databases with thousands of classes - how do we get the synonyms?
Which terms are similar to each other in the different databases?
Straightforward enough to compare two lists… but comparing term labels alone is not enough; terms will have synonyms
Matching on synonyms should find an initial overlap
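A minimal sketch of the synonym-based comparison described above, assuming each ontology can be exported as a mapping from term ID to a list of labels (preferred name plus synonyms). The term IDs and labels below are hypothetical, and the normalization is deliberately naive.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(label.lower().split())


def label_index(terms: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized label (preferred name or synonym) to the term IDs using it."""
    index: dict[str, set[str]] = {}
    for term_id, labels in terms.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(term_id)
    return index


def candidate_overlap(a: dict[str, list[str]], b: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Return (id_a, id_b) pairs that share at least one normalized label."""
    index_a, index_b = label_index(a), label_index(b)
    pairs: set[tuple[str, str]] = set()
    for label in index_a.keys() & index_b.keys():
        for id_a in index_a[label]:
            for id_b in index_b[label]:
                pairs.add((id_a, id_b))
    return pairs


# Hypothetical terms, for illustration only.
ontology_a = {"A:001": ["Myocardial infarction", "Heart attack"], "A:002": ["Femur fracture"]}
ontology_b = {"B:17": ["heart  attack", "MI"], "B:18": ["Blood glucose"]}
pairs = candidate_overlap(ontology_a, ontology_b)
```

The pairs are only candidates: a shared label does not establish that two terms denote the same concept.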
Provenance of terms
Where it came from? History? Was it refined?
Looking at versions over time, timeline interface
Language for how something has changed - ontology versioning
Temporal database that captures changes with timestamps
Provenance can be very shallow or very deep
Some datasets track evolution, others just show changes
PROV-O might be nice to leverage here
Agent and activity (was influenced by…)
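One way to picture the timestamped change capture discussed above is an append-only change log that can be queried "as of" a date. This is a sketch only: the field names loosely echo the agent and activity notions from PROV-O but are not the PROV-O vocabulary itself, and the term, labels, and curators are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Change:
    term_id: str
    label: Optional[str]  # None marks the term as retired
    activity: str         # what happened, e.g. "created", "refined"
    agent: str            # who or what made the change
    at: datetime          # when the change took effect


def history(log: list[Change], term_id: str) -> list[Change]:
    """All recorded changes for one term, oldest first."""
    return sorted((c for c in log if c.term_id == term_id), key=lambda c: c.at)


def label_as_of(log: list[Change], term_id: str, when: datetime) -> Optional[str]:
    """Label in force at `when`; None if the term did not exist yet or was retired."""
    current = None
    for change in history(log, term_id):
        if change.at <= when:
            current = change.label
    return current


# Hypothetical term and curators, for illustration only.
log = [
    Change("T1", "Heart attack", "created", "curator-a", datetime(2010, 1, 1)),
    Change("T1", "Myocardial infarction", "refined", "curator-b", datetime(2012, 6, 1)),
    Change("T1", None, "retired", "release-script", datetime(2014, 3, 1)),
]
```

Keeping only the latest state would be the "shallow" end of provenance; keeping the full log is closer to the "deep" end.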
Bioportal to see biomedical/healthcare ontologies
Use to see overlapping mappings between ontologies (problem space already addressed)
Bioportal mappings have a source - computer vs. users
Can see mappings between two, but what about three or four? What goes in the middle of the Venn diagram?
Know overlap, know terms, go back and see if they're the same… is it really the same, or is it some other relation?
Overlaps
LOINC/SNOMED: 29,676
LOINC/RxNORM: 5,714
LOINC/ICD10: 1,291
SNOMED/RxNORM: 23,594
SNOMED/ICD10: 21,568
Solution: CUI number for same concept
Identify the unique concepts and collect CUI numbers
Identify which terms in datasets do not have CUI numbers
Once you know which terms don't have CUI numbers, you know the size/scope of the problem
Assign CUI numbers to those terms
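The first steps above (collect CUIs, find terms without one, size the problem) can be sketched as follows, assuming each dataset is available as a mapping from term ID to CUI, with None where no CUI is assigned. The term IDs and CUI values are placeholders, not real identifiers.

```python
from typing import Optional


def split_by_cui(terms: dict[str, Optional[str]]) -> tuple[dict[str, str], list[str]]:
    """Separate terms that already carry a CUI from those that do not."""
    mapped = {term: cui for term, cui in terms.items() if cui}
    unmapped = sorted(term for term, cui in terms.items() if not cui)
    return mapped, unmapped


def summarize(terms: dict[str, Optional[str]]) -> dict[str, int]:
    """Counts that indicate the size and scope of the remaining mapping work."""
    mapped, unmapped = split_by_cui(terms)
    return {
        "unique_cuis": len(set(mapped.values())),
        "mapped_terms": len(mapped),
        "unmapped_terms": len(unmapped),
    }


# Placeholder term IDs and CUIs, for illustration only.
dataset = {"X:1": "CUI-0001", "X:2": "CUI-0001", "X:3": None, "X:4": "CUI-0002"}
mapped, unmapped = split_by_cui(dataset)
summary = summarize(dataset)
```

The `unmapped` list is what would go to human review for CUI assignment.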
Some string searches on descriptions, but really a human activity
Semantic similarity/proximity analysis instead of string search
UMLS as an upper ontology to map to (either you create something or you reuse something… Bioportal used UMLS)
CUIs are well maintained
With two versions, need a named graph recording which version each CUI came from
UMLS has a vocabulary change file released with each update; it seems to be in a proprietary format, but it may be possible to translate it into RDF for versioning
Tracks CUIs that disappear
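A sketch of what tracking CUIs across two versions could look like, assuming each release has been reduced to a plain set of CUI strings. This does not parse the UMLS change file; the CUI values and term IDs are placeholders.

```python
def diff_releases(old: set[str], new: set[str]) -> dict[str, set[str]]:
    """Compare the CUI sets of two releases."""
    return {"removed": old - new, "added": new - old, "kept": old & new}


def stale_mappings(mappings: dict[str, str], removed: set[str]) -> dict[str, str]:
    """Term-to-CUI mappings that point at a CUI no longer present in the new release."""
    return {term: cui for term, cui in mappings.items() if cui in removed}


# Placeholder CUIs and term IDs, for illustration only.
release_1 = {"CUI-0001", "CUI-0002", "CUI-0003"}
release_2 = {"CUI-0001", "CUI-0003", "CUI-0004"}
changes = diff_releases(release_1, release_2)
stale = stale_mappings({"X:1": "CUI-0001", "X:2": "CUI-0002"}, changes["removed"])
```

Recording which release each set came from is the part a named graph per version would cover.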
*Utility for better tools*
MetaMap used to index PubMed, biomedical journals
NLM has phrase disambiguation, entity resolution… spits out a CUI for various parts of processed text
Solution: Automated approach
Identify what automation and tools already exist
Compare two concepts and do a text string similarity check; anything above a certain % similar will be mapped under the same CUI
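A minimal sketch of such a string similarity check, using `difflib.SequenceMatcher` from the Python standard library. The 0.8 threshold and the sample labels are assumptions for illustration; the notes do not specify a measure or cutoff.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_matches(source: list[str], target: list[str],
                    threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Label pairs whose similarity meets the threshold, best first."""
    hits = []
    for s in source:
        for t in target:
            score = similarity(s, t)
            if score >= threshold:
                hits.append((s, t, score))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Sample labels, for illustration only.
matches = propose_matches(
    ["Fracture of femur"],
    ["Fracture of the femur", "Blood glucose"],
)
```

Pairs that clear the threshold are proposals for a shared CUI, not confirmed mappings.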
Structural similarities
Need anchor points for structure - if you know two terms are similar…
Have anchor points in a couple places
Topic modeling algorithm as a better way to match up terms
Latent semantic indexing (LSI)
Wikipedia -> run a latent semantic indexing algorithm against any term of interest
LSI will find how close two terms are
Runs against textual definitions
Another technique to use in cases of huge text
The text already reflects how humans have used the terms
One term in LOINC, one term in another, apply LSI to see correspondence
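A toy sketch of latent semantic indexing over textual definitions, assuming NumPy is available. It builds a term-document count matrix, takes a truncated SVD, and compares definitions by cosine similarity in the reduced space. The definitions and keys are invented for illustration; a real run would use a much larger corpus (the notes suggest Wikipedia) and some term weighting.

```python
import numpy as np


def term_document_matrix(docs: list[str]) -> np.ndarray:
    """Rows are vocabulary words, columns are documents, cells are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    row = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for word in doc.lower().split():
            matrix[row[word], j] += 1.0
    return matrix


def lsi_document_vectors(docs: list[str], k: int = 2) -> np.ndarray:
    """Project each document into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(term_document_matrix(docs), full_matrices=False)
    return vt[:k].T * s[:k]  # one row per document


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Invented definitions, for illustration only.
definitions = {
    "LOINC:a": "glucose measurement serum blood",
    "SNOMED:b": "blood glucose level measurement",
    "ICD10:c": "fracture femur bone injury",
    "SNOMED:d": "broken femur bone fracture",
}
keys = list(definitions)
vectors = lsi_document_vectors(list(definitions.values()), k=2)
same_topic = cosine(vectors[keys.index("LOINC:a")], vectors[keys.index("SNOMED:b")])
different_topic = cosine(vectors[keys.index("LOINC:a")], vectors[keys.index("ICD10:c")])
```

In this toy corpus the two glucose definitions land close together and far from the fracture definitions; on real data the scores are much less clean and would still need human validation.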
New meanings would be very difficult to find
Look at mappings between ICD9 and ICD10 - major differences
If you update an ontology you might lose the mapping… unless you have provenance
Solution: SWRL rules can do mapping as well
If it changes, you modify the rule (extensive work)
Provides more complex mappings, if you want to find details
Mapping and solution is context dependent
Problems with granularity in ontologies
See ICD10 injuries
Absence of axioms that constrain, QA difficult in these huge ontologies
Semantic cotopy - comparing similarities
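As a sketch of how semantic cotopy can be used for comparison: the cotopy of a concept is taken here as the concept together with all of its super- and sub-concepts, and two ontologies are compared by the Jaccard overlap of the cotopies they assign to the same concept. This assumes concept names have already been aligned across the two ontologies (for example via CUIs); the tiny hierarchies below are invented.

```python
def reachable(edges: dict[str, set[str]], start: str) -> set[str]:
    """All nodes reachable from `start` by following `edges` (start excluded)."""
    seen: set[str] = set()
    stack = list(edges.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def invert(parents: dict[str, set[str]]) -> dict[str, set[str]]:
    """Turn a child -> parents map into a parent -> children map."""
    children: dict[str, set[str]] = {}
    for child, ps in parents.items():
        for parent in ps:
            children.setdefault(parent, set()).add(child)
    return children


def semantic_cotopy(parents: dict[str, set[str]], concept: str) -> set[str]:
    """The concept plus all of its ancestors and descendants."""
    return {concept} | reachable(parents, concept) | reachable(invert(parents), concept)


def taxonomic_overlap(p1: dict[str, set[str]], p2: dict[str, set[str]], concept: str) -> float:
    """Jaccard overlap of the concept's cotopy in two hierarchies."""
    c1, c2 = semantic_cotopy(p1, concept), semantic_cotopy(p2, concept)
    return len(c1 & c2) / len(c1 | c2)


# Invented child -> parents hierarchies, for illustration only.
ontology_a = {"femur fracture": {"fracture"}, "fracture": {"injury"}}
ontology_b = {"femur fracture": {"fracture"}, "fracture": {"disorder"}}
overlap = taxonomic_overlap(ontology_a, ontology_b, "fracture")
```

Here the two hierarchies agree on the sub-concept but place "fracture" under different parents, so the overlap is only partial.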
Venn Diagram Solution
Look at the CUI numbers (from Bioportal) and put them into "buckets" (with the right motivation)
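The bucketing can be sketched as grouping each CUI by the exact set of ontologies it appears in, so that every bucket corresponds to one region of the Venn diagram. This assumes the CUIs per ontology have already been collected as sets; the CUI values below are placeholders.

```python
def venn_buckets(cuis_by_ontology: dict[str, set[str]]) -> dict[frozenset[str], set[str]]:
    """Group CUIs by the exact combination of ontologies that contain them."""
    membership: dict[str, set[str]] = {}
    for ontology, cuis in cuis_by_ontology.items():
        for cui in cuis:
            membership.setdefault(cui, set()).add(ontology)
    buckets: dict[frozenset[str], set[str]] = {}
    for cui, ontologies in membership.items():
        buckets.setdefault(frozenset(ontologies), set()).add(cui)
    return buckets


# Placeholder CUIs, for illustration only.
cuis_by_ontology = {
    "LOINC": {"c1", "c2", "c3"},
    "SNOMED": {"c1", "c2", "c4"},
    "RxNORM": {"c1", "c5"},
    "ICD10": {"c1", "c4"},
}
buckets = venn_buckets(cuis_by_ontology)
center = buckets[frozenset(cuis_by_ontology)]  # CUIs present in all four
```

Each bucket is an exclusive region, so a pairwise overlap count is the sum of every bucket whose key contains both ontologies.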
Provenance of Terms Solution
The table could show the distance between nodes when measuring similarity. Need to define what "similarity" is, in the case of highly similar terms. UMLS-Similarity/UMLS-Interface allows you to enter two terms and returns semantic similarity.
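One possible definition of "distance between nodes" is the shortest path length in the hierarchy, with similarity taken as its inverse. The sketch below is an assumption for illustration and is not the UMLS-Similarity interface; the edge list is invented.

```python
from collections import deque
from typing import Optional


def shortest_path_edges(edges: list[tuple[str, str]], start: str, goal: str) -> Optional[int]:
    """Number of edges on the shortest undirected path, or None if unconnected."""
    neighbours: dict[str, set[str]] = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def path_similarity(edges: list[tuple[str, str]], a: str, b: str) -> float:
    """1 / (1 + shortest path length); 0.0 when the terms are not connected."""
    dist = shortest_path_edges(edges, a, b)
    return 0.0 if dist is None else 1.0 / (1 + dist)


# Invented is-a edges, for illustration only.
edges = [("femur fracture", "fracture"), ("fracture", "injury"), ("burn", "injury")]
table = {
    (a, b): path_similarity(edges, a, b)
    for a in ("femur fracture", "burn")
    for b in ("fracture", "burn")
}
```

A table like `table` could then be shown alongside each proposed mapping so a reviewer sees how close the two terms sit in the hierarchy.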
Overview
Existing tools and prior work can be leveraged. Bioportal has already completed the mappings using CUI numbers. The biggest challenge is provenance, which extends all the way to ontology maintenance.
Theoretically, if the mappings had to be done, an automated approach with latent semantic indexing is possible, but the mappings would still need to be validated by a human. Metadata is also necessary in order to understand the intent, context, and perspective of terms in an ontology. A term that seems to be the same concept as another may be different in a different context.
For Addition in the Communique
Other pressing issues were revealed during the medical challenge discussion. Specifically, these concern the application of ontology terms in hospitals and the ability of doctors to use them in treatment. There is a disconnect between similar terms based on context and intent. The focus of the ontologies is on definitions, not on the delivery of medical services. Definitions written for scientific use in the research domain may not be accessible to doctors. There is also a disconnect with terms for clinical decision support; there are no rules that validate the correlation of a treatment procedure with a diagnosis. Finally, the same ontology is used for different use cases in the medical domain (treatment, research, and finance) and may not be appropriate for all of them.
There are fundamental problems with the structures of the ontologies in the biomedical domain. The structures differ vastly from one ontology to another, and at the upper level similar terms may not match up at all. As a result, the structural context (cotopy) of the same term differs greatly across ontologies.