Cracking the Code: Mapping the Complex World of Rare Diseases
Hey there! Let’s chat about something pretty important, especially if you or someone you know has a rare disease. It’s a bit technical, I’ll admit, but stick with me because it makes a huge difference in how we understand, diagnose, and ultimately, treat these conditions. We’re talking about the language we use to describe illnesses, specifically those that are, well, *rare*.
The Rare Disease Riddle
So, imagine this: you’ve got a health issue, but it’s not one of the common ones. It’s something that affects maybe only a handful of people in your country, or even fewer. This is the reality for folks with rare diseases. There are thousands of these distinct illnesses out there – somewhere between 5,000 and 8,000, apparently! Many are genetic, some are infectious, others are cancer-related. Individually, they’re rare birds, affecting less than 1 in 2,000 people in the EU or 1 in 1,250 in the US. But collectively? Wow, they impact a massive number of people globally – up to 446 million!
The journey for these patients is often called the “Diagnostic Odyssey.” Sounds adventurous, right? Not so much. It’s more like a frustrating, often heartbreaking, trek. Diagnosis can be painfully slow. I read that 40% initially get the wrong diagnosis, and a staggering 50% remain undiagnosed altogether. This delay isn’t just inconvenient; it leads to more severe illness and sadly, higher mortality rates.
And even when a diagnosis is made, specific tests and treatments are available for less than 10% of rare diseases. Plus, they can be super expensive. Add to that the fact that experts are few and far between, often scattered geographically, and you’ve got a real challenge on your hands. A huge part of the problem? A serious lack of reliable data and consistent information.
Enter the Terminologies
Historically, documenting these diseases often relied on systems like ICD-10 (the 10th version of the International Statistical Classification of Diseases). It’s great for things like reimbursement and basic reporting, but here’s the kicker: ICD-10 can only accurately describe about 500 of those potential 8,000 rare diseases with a specific code. That leaves a massive gap! We desperately need better ways to talk about and record these conditions.
This is where different terminologies, classifications, and ontologies come into play. Think of them as different languages or dictionaries for describing medical concepts. In the rare disease world, we encounter several:
- ORPHAcode: This is Orphanet’s system, giving a unique ID to each rare disease. It’s organized hierarchically and links to tons of other medical databases.
- SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms): A huge, detailed terminology standard for medicine, using unique concepts for diseases, procedures, etc.
- ICD-10 (WHO and GM): The international standard (WHO) and its German modification (GM). Still used, but limited for rare diseases.
- Alpha-ID-SE: A German system linking diagnostic terms to ICD-10-GM and ORPHAcodes, acting a bit like a bridge.
- HPO (Human Phenotype Ontology): Describes phenotypic characteristics (symptoms, findings), linking ORPHAcodes to clinical presentation and genetics.
- OMIM (Online Mendelian Inheritance in Man): Focuses on human genes and genetic phenotypes.
Right now, it’s common practice to use both ICD-10 and ORPHAcode together, especially in places like Europe. This helps cover both billing needs (ICD-10) and more precise medical coding (ORPHAcode).
Why Mappings Matter
Okay, so we have all these different languages. How do we translate between them? That’s where “mappings” come in. A mapping simply shows the relationship between concepts in two different systems. Orphadata provides mappings from ORPHAcode to ICD-10-WHO or OMIM. The German Federal Institute for Drugs and Medical Devices (BfArM) offers mappings between ORPHAcode, Alpha-ID, and ICD-10-GM.
Why are these translations so important? Because they enable semantic interoperability. Fancy term, I know, but it just means getting different computer systems and different people (doctors, researchers, data scientists) to understand the same medical concept, no matter which coding system they’re using. This is absolutely essential for sharing data across hospitals, regions, and even countries. It turns raw data into usable information for research.
Here’s the catch: existing mappings were often created for specific purposes. A mapping designed for billing might be different from one designed for detailed clinical documentation or research. Combining these different mappings can be tricky – you risk losing information or creating misalignments. However, comparing them is super valuable! It helps us expand the existing translations and, crucially, improve their quality and consistency. Getting this right is vital for reliable research and better decision-making.

Our Mission: Harmonizing the Chaos
So, this study I’m telling you about? Its main goal was to dive deep into these rare disease terminologies, harmonize and compare the existing mappings, and then extend them, specifically focusing on SNOMED, ICD-10-GM, and ICD-10-WHO. Why? For two big reasons:
- To create a better foundation for clinical documentation.
- To build vocabularies ready for research data repositories.
There are already some cool initiatives out there, like a parameterized form at a German university hospital that helps document rare diseases using ORPHAcodes. The data from Orphanet is being integrated into a “Transition Database” there. But that database mainly uses ICD-10-WHO and doesn’t fully account for the German modifications (ICD-10-GM) or other systems like SNOMED. Our work aimed to quality-assure and expand this data foundation.
The ultimate vision is to use this data for retrospective observational studies based on electronic health records (EHRs). This is where the Observational Medical Outcomes Partnership (OMOP) Common Data Model from the OHDSI community comes in. OMOP relies on standardized terminologies (they call them “vocabularies”), like SNOMED, to describe medical concepts clearly. OHDSI has a tool called ATHENA that provides these vocabularies and mappings, but ORPHAcode wasn’t fully integrated yet. A key part of our study was preparing ORPHAcodes to be added to ATHENA as a new vocabulary.
The Eight-Step Journey
To tackle this, we followed a structured process, broken down into eight steps, bundled into three main phases:
- Harmonize and Structure: Getting all the necessary datasets and mappings together and organized.
- Identify: Finding where the overlaps, mismatches, and gaps are between the different sources.
- Combine and Prepare: Merging the mappings and getting them ready for practical use.
We used some neat tools to make this happen, like Pentaho Data Integration for transforming the data and SQL scripts for evaluating it. It’s a bit like being a data detective, cleaning up messy information and making connections.
Diving into the Data
Our data came from three main places:
- Orphadata: The official source from Orphanet. We used their Nomenclature Pack (Jan 2024) for ORPHAcode to ICD-10-WHO mappings and their SNOMED map package (July 2023) for ORPHAcode to SNOMED mappings.
- BfArM: The German source, providing mappings between ORPHAcode, Alpha-ID, and ICD-10-GM (using the 2024 version).
- OHDSI/ATHENA: Providing concepts and relationships for ICD-10-WHO, ICD-10-GM, and SNOMED.
We focused only on the active, currently valid ORPHAcodes and relevant mappings from ATHENA (specifically, valid “Condition” concepts with a “Maps to” relationship). We had to deal with different file formats and structures from these sources, which brings us to the next steps.
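Just to make that filter a bit more concrete: against the standard OMOP vocabulary tables delivered by ATHENA (`CONCEPT` and `CONCEPT_RELATIONSHIP`), a query in that spirit could look like the sketch below. The table and column names are the standard OMOP ones, but the exact filters are my illustration, not the study's actual script.

```sql
-- Sketch: valid ICD-10 "Condition" concepts from ATHENA that carry
-- a "Maps to" relationship into a standard SNOMED concept.
SELECT c.concept_code  AS icd10_code,
       c.vocabulary_id AS icd10_vocabulary,   -- 'ICD10' or 'ICD10GM'
       s.concept_code  AS snomed_code,
       s.concept_name  AS snomed_name
FROM   concept c
JOIN   concept_relationship cr
       ON  cr.concept_id_1 = c.concept_id
       AND cr.relationship_id = 'Maps to'
       AND cr.invalid_reason IS NULL          -- relationship still valid
JOIN   concept s
       ON  s.concept_id = cr.concept_id_2
       AND s.vocabulary_id = 'SNOMED'
       AND s.standard_concept = 'S'           -- standard SNOMED concept
WHERE  c.vocabulary_id IN ('ICD10', 'ICD10GM')
  AND  c.domain_id = 'Condition'
  AND  c.invalid_reason IS NULL;              -- concept currently valid
```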

Getting the Data Ready
Step three involved getting the data into a usable format. Some sources, like Orphadata’s XML files, have nested structures. We had to “flatten” these into simple tables, like you’d see in a spreadsheet or a relational database. Then, in step four, we imported these flattened mappings into a database (PostgreSQL). The goal was to organize the data super cleanly, following something called Boyce-Codd normal form (BCNF). Don’t worry too much about the name, but the idea is to make sure the data is consistent, avoids redundancy, and is easy to update without causing problems.
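To give you a rough feel for what "flattened and normalized" means in practice, here's a minimal PostgreSQL sketch. The table and column names are made up for illustration (the study's real schema will differ), but the principle is the same: one fact per row, clear keys, no nested structures.

```sql
-- Hypothetical, simplified tables for the flattened mappings.
CREATE TABLE orpha_concept (
    orphacode      integer PRIMARY KEY,   -- active ORPHAcode
    preferred_term text    NOT NULL
);

CREATE TABLE orpha_icd10_map (
    orphacode     integer REFERENCES orpha_concept (orphacode),
    icd10_code    text    NOT NULL,       -- e.g. 'H20.2'
    icd10_variant text    NOT NULL CHECK (icd10_variant IN ('WHO', 'GM')),
    source        text    NOT NULL,       -- 'Orphadata' or 'BfArM'
    PRIMARY KEY (orphacode, icd10_code, icd10_variant, source)
);

CREATE TABLE orpha_snomed_map (
    orphacode   integer REFERENCES orpha_concept (orphacode),
    snomed_code text    NOT NULL,
    source      text    NOT NULL,
    PRIMARY KEY (orphacode, snomed_code, source)
);
```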
Comparing Apples and Oranges (and Finding the Similarities)
With the data cleaned up and structured, we could start comparing. Step five was joining the different mapping tables together based on the ORPHAcode. Step six was the comparison itself. We calculated a “diff-score” for each ORPHAcode where we had multiple mapping sources. This score told us:
- `diff=null`: At least one mapping source was missing for that ORPHAcode.
- `diff=1`: The mappings from different sources *differed* for that ORPHAcode.
- `diff=0`: The mappings from different sources were *exactly the same* for that ORPHAcode.
Creating an overview of these diff-scores was a big help for quality assurance. When we saw a difference (`diff=1`), it flagged that ORPHAcode for review, potentially by medical experts, to figure out why the mappings weren’t consistent.
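If you like to see things in code, here's a hedged sketch of how such a diff-score could be computed, reusing the hypothetical tables from the sketch above. It compares one ICD-10-GM code against one ICD-10-WHO code per ORPHAcode; the study's real logic, which also has to cope with ORPHAcodes carrying several codes, is more involved.

```sql
-- Per ORPHAcode: NULL = a source is missing, 1 = sources differ, 0 = identical.
SELECT o.orphacode,
       CASE
           WHEN gm.icd10_code IS NULL OR who.icd10_code IS NULL THEN NULL
           WHEN gm.icd10_code <> who.icd10_code                 THEN 1
           ELSE 0
       END AS diff_icd
FROM   orpha_concept o
LEFT JOIN orpha_icd10_map gm
       ON gm.orphacode = o.orphacode AND gm.icd10_variant = 'GM'
LEFT JOIN orpha_icd10_map who
       ON who.orphacode = o.orphacode AND who.icd10_variant = 'WHO';
```

Grouping by the resulting `diff_icd` value then gives exactly the kind of overview that's handy for quality assurance.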
Building the Master Map
Step seven was all about enrichment and combination. We took all the active ORPHAcodes and joined the mappings from Orphadata (ORPHA-to-SNOMED, ORPHA-to-ICD-10-WHO) and BfArM (ORPHA-to-ICD-10-GM). If an ORPHAcode had an ICD-10 mapping from one of these sources, we then looked up the corresponding SNOMED code using the mappings available in ATHENA. If *no* mapping was available from any source for a particular ORPHAcode, we rolled up our sleeves and did some manual mapping using a tool called Usagi, which helps link source codes (our ORPHAcodes) to standard vocabularies (like SNOMED).
We also added some extra polish:
- If an ORPHAcode had a SNOMED mapping but no ICD-10 mapping listed in our sources, we checked ATHENA to see if that SNOMED code mapped to an ICD-10 code, and added it if it did.
- If an ORPHAcode only had an ICD-10 code for one variant (GM or WHO), we checked if that code could also be used for the other variant.
This cascading approach, using available data first and then filling gaps manually, helped us build a much more complete picture.
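Here's a simplified sketch of that cascade, again using the hypothetical tables from above plus an imagined table `athena_icd10_snomed` holding the ATHENA-derived ICD-10-to-SNOMED pairs (so this is an illustration, not the authors' actual code). It prefers a direct Orphadata SNOMED mapping and falls back to a SNOMED code reached via an ICD-10 code; whatever still comes out empty is what went to manual mapping in Usagi.

```sql
-- Cascade: direct ORPHA->SNOMED first, then ORPHA->ICD-10->SNOMED via ATHENA.
-- An ORPHAcode can appear more than once if it has several candidate codes.
SELECT DISTINCT
       o.orphacode,
       COALESCE(sn.snomed_code, ath.snomed_code) AS snomed_code,
       CASE
           WHEN sn.snomed_code  IS NOT NULL THEN 'Orphadata SNOMED map'
           WHEN ath.snomed_code IS NOT NULL THEN 'via ICD-10 + ATHENA'
           ELSE 'needs manual mapping (Usagi)'
       END AS provenance
FROM   orpha_concept o
LEFT JOIN orpha_snomed_map sn
       ON sn.orphacode = o.orphacode
LEFT JOIN orpha_icd10_map icd
       ON icd.orphacode = o.orphacode
LEFT JOIN athena_icd10_snomed ath
       ON ath.icd10_code = icd.icd10_code;
```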
Putting it to Work
The final step, utilization, was about getting the data ready for its intended homes. For the Transition Database, having all the mappings together is useful. But for transferring ORPHAcodes to ATHENA to become an OMOP-compliant vocabulary, we needed to format the data specifically for the OMOP standard tables (like `CONCEPT` and `SOURCE_TO_CONCEPT_MAP`). Our final, harmonized Transition Database served as the base for this. The great news is, these OMOP-conformant mappings are now publicly available on Zenodo!
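To make "formatted for OMOP" a bit more tangible, here's what a single row in the standard `SOURCE_TO_CONCEPT_MAP` table could look like for an ORPHAcode pointing at a SNOMED concept. The column names are the real OMOP ones; every value below is a placeholder for illustration, not an actual mapping from the study.

```sql
-- Hypothetical example row: one ORPHAcode mapped to a standard SNOMED concept.
INSERT INTO source_to_concept_map (
    source_code, source_concept_id, source_vocabulary_id,
    source_code_description, target_concept_id, target_vocabulary_id,
    valid_start_date, valid_end_date, invalid_reason
) VALUES (
    '123456',            -- placeholder ORPHAcode
    0,                   -- no OMOP concept_id assigned to the source code yet
    'ORPHA',             -- placeholder name for the new vocabulary
    'Some rare disease', -- placeholder label
    4012345,             -- placeholder concept_id of the SNOMED target concept
    'SNOMED',
    DATE '2024-01-01', DATE '2099-12-31', NULL
);
```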

What We Found (The Numbers Game)
So, what were the results of all this work? First off, we successfully converted the source files into comparable tables. Looking at the initial sources, they contained different numbers of mapped ORPHAcodes:
- BfArM (ICD-10-GM to ORPHAcode): Mappings to 7,009 unique ORPHAcodes.
- Orphadata (ICD-10-WHO to ORPHAcode): Mappings to 7,307 unique ORPHAcodes.
- Orphadata (SNOMED to ORPHAcode): Mappings to 6,555 unique ORPHAcodes.
Our harmonized database, following BCNF, captured all these different mappings. We successfully joined them and calculated the diff-scores. The comparison showed that the mappings from *all* sources matched for only 368 ORPHAcodes. For 389 ORPHAcodes, there was no SNOMED code in the sources, but an ICD-10 code was available. A significant chunk – 2,447 ORPHAcodes – lacked mappings from *any* source and required manual mapping.
Real-World Examples
Let’s look at a few examples to see these differences in action:
- Phacoanaphylactic uveitis: BfArM maps this to *two* ICD-10-GM codes (H20.2 for the inflammation and H40.5 for secondary glaucoma). Orphadata maps it to only *one* ICD-10-WHO code (H20.2). This shows a coding inconsistency (`Diff of ICD code = 1`). The SNOMED mappings also differed based on which ICD-10 code they were derived from (`Diff of SNOMED code = 1`). This highlights how different systems capture disease complexity differently.
- Congenital pulmonary veins atresia or stenosis: Only ICD-10-GM codes were available in the sources. No ICD-10-WHO or SNOMED mapping was provided, leaving the difference as undefined (`Diff of ICD code is null; Diff of SNOMED code is null`). This is a clear gap.
- Autosomal dominant Charcot-Marie-Tooth Disease Type 2U: Here, both ICD-10-GM and ICD-10-WHO codes were the same (G60.0), and the SNOMED mapping was also consistent. No discrepancies here (`Diff of ICD code = 0; Diff of SNOMED code = 0`). Nice when that happens!
- Wound myiasis: Another example where the name and mappings across all systems (ICD-10-GM, ICD-10-WHO, SNOMED) were identical (`Diff of ICD code = 0; Diff of SNOMED code = 0`).
During the manual mapping phase, we managed to map about 31% of the previously unmapped ORPHAcodes (757 out of 2,424). Some were exact matches, others were broader concepts. About 156 still need expert validation, and roughly 69% (1,667) couldn’t be mapped at this time.
Expanding Our Reach
The combination of existing mappings and our manual work significantly boosted the number of ORPHAcodes we could map. The SNOMED mapping was expanded by 1,321 codes, meaning 82.7% of active ORPHAcodes (7,876) are now included. The ICD-10-GM mapping increased by 654 codes (80.5% mapped, 7,663 codes), and ICD-10-WHO increased by 354 codes (also 80.5% mapped, 7,661 codes). Importantly, 100% of the active ORPHAcodes (9,520) were prepared for integration into OHDSI/ATHENA.

Quality is Key
To ensure trustworthiness, we recorded the provenance for every mapping – essentially, where each piece of translation information came from. This is super important for quality assurance and also a requirement for adding the data to OHDSI/ATHENA.
The Big Picture and the Road Ahead
So, we did it! We successfully harmonized, compared, and expanded the existing mappings for rare disease terminologies. This process is a solid foundation for managing these complex codes. It helps us:
- Organize and structure datasets and mappings.
- Spot overlaps, mismatches, and gaps.
- Combine mappings and get them ready for sharing and use.
By expanding the data, we can now map about 83% of active ORPHAcodes to SNOMED, ICD-10-GM, or ICD-10-WHO. Most of the ORPHAcodes that *couldn't* be mapped represent "Groups of Diseases" rather than specific conditions. These aren't usually recommended for coding anyway, so this gap isn't a major issue for practical use.
We also learned more about *why* mappings differ. For instance, the German ICD-10-GM allows for more specific localization descriptions or even using two primary codes, which isn’t possible with ICD-10-WHO. The BfArM file itself is structured differently – it links a diagnostic term (with an Alpha-ID) to *both* an ORPHAcode and an ICD-10-GM code, rather than being a direct mapping *between* ORPHAcode and ICD-10-GM. These nuances are fascinating but also challenging when trying to combine everything.
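To picture that BfArM structure: each Alpha-ID record ties a diagnostic term to an ICD-10-GM code and, where one exists, an ORPHAcode, so the ORPHAcode-to-ICD-10-GM relationship has to be derived through the term. A hedged sketch with made-up table and column names:

```sql
-- Hypothetical, heavily simplified shape of an Alpha-ID-SE record.
CREATE TABLE alpha_id_se (
    alpha_id       text PRIMARY KEY,   -- identifier of the diagnostic term
    diagnosis_text text NOT NULL,      -- the term itself
    icd10gm_code   text,               -- ICD-10-GM code, if any
    orphacode      integer             -- linked ORPHAcode, if any
);

-- Deriving an ORPHAcode <-> ICD-10-GM mapping indirectly, via the term.
SELECT DISTINCT orphacode, icd10gm_code
FROM   alpha_id_se
WHERE  orphacode IS NOT NULL
  AND  icd10gm_code IS NOT NULL;
```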
Our Transition Database structure is similar to the "Simple Standard for Sharing Ontological Mappings" (SSSOM), which uses tables for structured storage. A current limitation is that we can't always specify the *accuracy* of a mapping (like whether it's a "wider" or "narrower" concept match), because some sources don't provide this detail. However, even mappings that aren't a perfect "equal" match are still valuable, especially for identifying patients based on the codes used in their records.
Looking Forward
What’s next? We’re thinking about creating a graphical tool (a GUI) to make it easier for data scientists and medical experts to explore these terminologies and mappings. This tool could support common research tasks like defining patient groups (phenotyping), maybe even integrating with tools like ATLAS from the OHDSI community.
The coolest part? Our entire process is automated. If the source data from Orphanet, BfArM, or OHDSI gets updated, we can quickly refresh our Transition Database. This significantly reduces the need for tedious manual work and keeps the information current.
Why This Matters for Rare Diseases
Ultimately, this work is about making data on rare diseases more usable and understandable. By harmonizing and standardizing how we code and map these conditions, we lay the groundwork for better data modeling and management. This means German data, often coded with ICD-10-GM and ORPHAcode, can be more easily used in international studies that might rely on SNOMED. Using international standards simplifies defining patient cohorts for research – everyone is speaking the same language about patient characteristics.
This is crucial for EHR-driven phenotyping – using electronic health records to identify and characterize patients, which is essential for observational studies. However, I gotta be real with you, combining mappings created with different methods and for different purposes *does* have shortcomings. It’s super important to be aware of these differences when analyzing data that comes from combined sources. Especially in rare diseases, where patient numbers are small, even tiny errors in data usage can lead to significantly false results. Knowing the limitations of the combined mappings is key to interpreting results correctly.
Despite the challenges, using Real World Data from EHRs is incredibly powerful for rare diseases. It helps bridge gaps between countries, medical professions, and data sources, maximizing the use of the limited data available. This increased visibility helps shine a light on the “Health Orphans” – those patients whose conditions are so rare they often get overlooked – and the overall burden of rare diseases. A consistent approach to terminology, mapping, and quality assurance amplifies the positive impact of using standards in data-driven research. It’s a big step towards making sure no rare disease patient gets left behind in the data revolution.
Source: Springer
