Vocabulary Setup Guide

How to prepare and index standard clinical vocabularies for local concept mapping. This guide covers downloading OMOP vocabularies from Athena, building knowledge layer indexes, and configuring Portiere to use them.


Overview

Portiere maps clinical codes (ICD-10, SNOMED, LOINC, etc.) to standard OMOP concepts using a knowledge layer -- a searchable index of vocabulary data. The knowledge layer requires pre-indexed vocabulary data to function.

Three search backends are available:

| Backend | Best For | Requirements |
|---|---|---|
| BM25s | Quick setup, keyword matching | None (pure Python) |
| FAISS | Semantic similarity, higher accuracy | pip install portiere-health[faiss] |
| Hybrid | Best accuracy (combines both) | FAISS + BM25s |

The SDK provides build_knowledge_layer() to automate the entire setup from an OHDSI Athena download.


What Are Vocabularies?

Clinical vocabularies are standardized coding systems used in healthcare. OMOP CDM uses these as target vocabularies:

| Vocabulary | Full Name | Domain | Example Codes |
|---|---|---|---|
| SNOMED | SNOMED CT | Clinical findings, procedures, anatomy | 73211009 (Diabetes mellitus) |
| LOINC | LOINC | Lab tests, clinical observations | 4548-4 (Hemoglobin A1c) |
| RxNorm | RxNorm | Medications, drug ingredients | 860975 (Metformin 500mg) |
| ICD10CM | ICD-10-CM | Diagnosis codes (US billing) | E11.9 (Type 2 diabetes) |
| CPT4 | CPT-4 | Procedure codes (US) | 99213 (Office visit) |
| HCPCS | HCPCS | Healthcare supplies and services | J0170 (Adrenalin injection) |
| NDC | National Drug Code | Drug packaging identifiers | 0378-4000-01 |

When Portiere maps a source code like E11.9, it searches the knowledge layer for the matching standard concept (e.g., OMOP concept ID 201826, "Type 2 diabetes mellitus", from the SNOMED vocabulary).


Quick Start: Build a Local Knowledge Layer

If you already have an Athena download, you can build a knowledge layer with a single function call:

from portiere.knowledge import build_knowledge_layer

# Parse Athena CSVs and create a BM25s index
paths = build_knowledge_layer(
    athena_path="./data/athena/",
    output_path="./data/vocab/",
    backend="bm25s",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
# Returns: {"bm25s_corpus_path": "./data/vocab/concepts.json"}

Then use the returned paths in your project config:

import portiere
from portiere.config import PortiereConfig, KnowledgeLayerConfig
from portiere.engines import PolarsEngine

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(backend="bm25s", **paths)
)
project = portiere.init(
    name="My Project",
    engine=PolarsEngine(),
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
    config=config,
)

If you don't have an Athena download yet, follow the steps below.


Step 1: Download Vocabularies from Athena

OHDSI Athena (athena.ohdsi.org) is the official source for OMOP standard vocabularies.

1. Create an Account

Go to athena.ohdsi.org and register for a free account.

2. Select Vocabularies

Click "Download" in the top navigation. You'll see a list of available vocabularies. Check the ones you need:

  • SNOMED CT -- Required for clinical findings and procedures. Requires a UMLS license (free for US users via uts.nlm.nih.gov).
  • LOINC -- Required for lab test mappings.
  • RxNorm -- Required for medication mappings.
  • ICD10CM -- Required for US diagnosis code mappings.
  • CPT4 -- Optional, for procedure codes. Requires an AMA license.
  • HCPCS, NDC, ATC -- Optional, for additional coverage.

3. Download the Bundle

Click "Download Vocabularies". Athena bundles your selection into a zip file and sends a download link to your registered email. This may take a few minutes.

4. Extract the Download

mkdir -p ./data/athena
unzip vocabulary_download_*.zip -d ./data/athena/

5. Verify the Contents

Your ./data/athena/ directory should contain these key files:

| File | Description |
|---|---|
| CONCEPT.csv | All concept records (concept_id, concept_name, domain_id, vocabulary_id, ...) |
| CONCEPT_SYNONYM.csv | Alternative names for concepts |
| CONCEPT_RELATIONSHIP.csv | Relationships between concepts (Maps to, Is a, etc.) |
| CONCEPT_ANCESTOR.csv | Hierarchical ancestry |
| VOCABULARY.csv | Vocabulary metadata |

Note: These are tab-delimited CSV files, not comma-delimited. The SDK handles this automatically.
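
If you want to check the download yourself before building an index, the following is a minimal sketch using only the Python standard library. It is not part of the Portiere SDK: the helper names `missing_athena_files` and `read_header` are hypothetical, and it assumes the files are tab-delimited with unquoted fields, so quote characters inside concept names are read as plain text.

```python
import csv
from pathlib import Path

# File names taken from the "Verify the Contents" list in this guide.
EXPECTED_FILES = [
    "CONCEPT.csv",
    "CONCEPT_SYNONYM.csv",
    "CONCEPT_RELATIONSHIP.csv",
    "CONCEPT_ANCESTOR.csv",
    "VOCABULARY.csv",
]


def missing_athena_files(athena_dir):
    """Return the expected file names that are absent from athena_dir."""
    root = Path(athena_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def read_header(csv_path):
    """Return the column names of a tab-delimited Athena file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # Assumption: fields are unquoted, so disable quote handling.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return next(reader)


if __name__ == "__main__":
    missing = missing_athena_files("./data/athena/")
    print("Missing files:", missing or "none")
    if "CONCEPT.csv" not in missing:
        print(read_header("./data/athena/CONCEPT.csv"))
```

Opening one of these files with a default comma-delimited reader would return each row as a single field, which is a quick way to tell that the delimiter is wrong.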


Step 2: Build a Knowledge Layer Index

Use the SDK's build_knowledge_layer() function to parse the Athena CSV files and create backend-specific indexes.

Option A: BM25s (Keyword Search -- Quick Setup)

BM25s uses keyword-based search. No external services or GPU required.

from portiere.knowledge import build_knowledge_layer

paths = build_knowledge_layer(
    athena_path="./data/athena/",
    output_path="./data/vocab/",
    backend="bm25s",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
# paths = {"bm25s_corpus_path": "./data/vocab/concepts.json"}
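
To give a feel for what keyword-based scoring means, here is a minimal sketch of textbook Okapi BM25 in plain Python. It is an illustration only, not the bm25s library's or Portiere's implementation: the function names, the whitespace tokenizer, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and the input list is assumed to be non-empty.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_counts:
                continue
            df = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = term_counts[term]
            # Length normalization dampens scores for long documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


names = [
    "Type 2 diabetes mellitus",
    "Type 1 diabetes mellitus",
    "Essential hypertension",
]
for name, score in zip(names, bm25_scores("type 2 diabetes", names)):
    print(f"{score:.3f}  {name}")
```

A concept name that shares no tokens with the query scores zero, which is why keyword search alone can miss synonyms that a semantic backend would catch.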

Option B: FAISS (Semantic Search -- Higher Accuracy)

FAISS uses dense vector embeddings for semantic similarity matching. Requires pip install portiere-health[faiss].

from portiere.knowledge import build_knowledge_layer

paths = build_knowledge_layer(
    athena_path="./data/athena/",
    output_path="./data/faiss/",
    backend="faiss",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
# paths = {"faiss_index_path": "...", "faiss_metadata_path": "..."}

This encodes all concept names using the SapBERT biomedical embedding model (768-dimensional vectors). The first run downloads the model (~400 MB) and may take several minutes for large vocabularies.
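
Conceptually, semantic search compares a query vector with each concept vector and returns the closest ones. The sketch below shows that idea with cosine similarity over hand-made three-dimensional vectors. The vectors are toy values, not SapBERT output, the helper names are hypothetical, and the similarity metric Portiere configures in FAISS is not stated in this guide, so treat cosine similarity here as an assumption.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def top_k(query_vec, concept_vecs, k=2):
    """Return (name, cosine similarity) pairs for the k closest concepts."""
    q = normalize(query_vec)
    scored = []
    for name, vec in concept_vecs.items():
        v = normalize(vec)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((name, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings for illustration only; real embeddings have 768 dimensions.
toy_vectors = {
    "Type 2 diabetes mellitus": [0.9, 0.1, 0.0],
    "Essential hypertension": [0.1, 0.9, 0.0],
    "Hemoglobin A1c": [0.6, 0.0, 0.8],
}
for name, similarity in top_k([1.0, 0.2, 0.1], toy_vectors):
    print(f"{similarity:.3f}  {name}")
```

A vector index such as FAISS performs the same kind of nearest-neighbor lookup, but with data structures designed to stay fast across millions of vectors.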

Option C: Hybrid (Best Accuracy)

Combines BM25s (keyword) and FAISS (semantic) search via Reciprocal Rank Fusion (RRF). This is the recommended production setup.

from portiere.knowledge import build_knowledge_layer

paths = build_knowledge_layer(
    athena_path="./data/athena/",
    output_path="./data/hybrid/",
    backend="hybrid",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
)
# paths = {
#     "bm25s_corpus_path": "./data/hybrid/concepts.json",
#     "faiss_index_path": "./data/hybrid/concepts.index",
#     "faiss_metadata_path": "./data/hybrid/concepts.meta.json",
# }
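
Reciprocal Rank Fusion itself is simple: each result list contributes 1 / (k + rank) to an item's score, and items are re-sorted by the summed score. The sketch below is a generic illustration of that formula, not Portiere's internal code. The function name and the concept IDs are made up for the example, and `k=60` mirrors the `rrf_k` value used in this guide.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of concept IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, concept_id in enumerate(ranking, start=1):
            # Each list adds 1 / (k + rank) to the item's fused score.
            scores[concept_id] = scores.get(concept_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Made-up concept IDs standing in for keyword and semantic search results.
keyword_hits = [101, 102, 103]
semantic_hits = [103, 101, 104]
print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
# -> [101, 103, 102, 104]
```

Items that appear in both lists rise to the top, and a larger `k` flattens the difference between adjacent ranks.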

Step 3: Use the Knowledge Layer in Your Project

Pass the returned paths to KnowledgeLayerConfig:

import portiere
from portiere.config import PortiereConfig, KnowledgeLayerConfig
from portiere.engines import PolarsEngine

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="hybrid",    # or "bm25s", "faiss"
        **paths,             # Paths returned from build_knowledge_layer()
        fusion_method="rrf",
        rrf_k=60,
    )
)

project = portiere.init(
    name="Hospital Migration",
    engine=PolarsEngine(),
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
    config=config,
)

# Concept mapping now uses your local vocabulary index
source = project.add_source("patients.csv")
concept_map = project.map_concepts(source=source)

You can also persist this configuration in portiere.yaml:

knowledge_layer:
  backend: hybrid
  bm25s_corpus_path: ./data/hybrid/concepts.json
  faiss_index_path: ./data/hybrid/concepts.index
  faiss_metadata_path: ./data/hybrid/concepts.meta.json
  fusion_method: rrf
  rrf_k: 60

Cross-Vocabulary Mapping with VocabularyBridge

Beyond the files used to build a search index, the Athena download includes CONCEPT_RELATIONSHIP.csv, which contains pre-computed relationships between concepts across vocabularies. The VocabularyBridge class uses these relationships for direct cross-vocabulary mapping -- no search or embedding required.

When to Use VocabularyBridge vs Knowledge Layer

| Use Case | Tool | Why |
|---|---|---|
| Map source terms to standard concepts | Knowledge Layer (BM25s/FAISS/Hybrid) | Fuzzy search, handles typos and synonyms |
| Map known concept IDs between vocabularies | VocabularyBridge | Direct lookup via Athena relationships |
| Build crosswalk tables (e.g., ICD10 → SNOMED) | VocabularyBridge | Complete mapping from relationships |
| Cross-standard mapping transforms | VocabularyBridge | Translates concept IDs in field transforms |

Setup

VocabularyBridge uses the same Athena download directory -- specifically CONCEPT.csv (for concept lookups) and CONCEPT_RELATIONSHIP.csv (for cross-vocabulary relationships).

from portiere.knowledge import VocabularyBridge

bridge = VocabularyBridge(
    athena_path="./data/athena/",
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],  # optional filter
)

The vocabularies parameter is optional. When provided, only concepts from those vocabularies are loaded into memory, reducing memory usage for large Athena downloads.

Key Files Used

| File | Used By | Purpose |
|---|---|---|
| CONCEPT.csv | Both Knowledge Layer and VocabularyBridge | Concept metadata (ID, name, vocabulary, domain) |
| CONCEPT_SYNONYM.csv | Knowledge Layer only | Alternative names for search |
| CONCEPT_RELATIONSHIP.csv | VocabularyBridge only | Cross-vocabulary relationships |
| CONCEPT_ANCESTOR.csv | Not yet used | Hierarchical ancestry (future) |

Relationship Types

VocabularyBridge indexes these relationship types:

  • Maps to / Mapped from -- Equivalence mappings (default for map_concept())
  • Is a / Subsumes -- Hierarchical relationships (used for broader/narrower lookups)
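
To see what such a relationship index looks like, the following sketch streams CONCEPT_RELATIONSHIP.csv row by row and keeps only one relationship type. It is independent of VocabularyBridge and does not describe how it is implemented. The function name is hypothetical, and the column names (`concept_id_1`, `concept_id_2`, `relationship_id`, `invalid_reason`) are assumed from the OMOP CDM table definition.

```python
import csv
from collections import defaultdict


def load_relationships(relationship_path, relationship_id="Maps to"):
    """Build {source concept_id: [target concept_ids]} for one relationship type."""
    mapping = defaultdict(list)
    with open(relationship_path, newline="", encoding="utf-8") as handle:
        # Assumption: tab-delimited, unquoted fields, header row present.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip other relationship types and rows flagged as invalid.
            if row["relationship_id"] != relationship_id or row["invalid_reason"]:
                continue
            mapping[int(row["concept_id_1"])].append(int(row["concept_id_2"]))
    return dict(mapping)


if __name__ == "__main__":
    maps_to = load_relationships("./data/athena/CONCEPT_RELATIONSHIP.csv")
    print(f"{len(maps_to)} concepts have at least one 'Maps to' target")
```

Because the file is read one row at a time, only the rows that match the filter are held in memory.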

Examples

# Map an OMOP concept to SNOMED
results = bridge.map_concept(4329847, target_vocabulary="SNOMED")

# Build a full ICD10CM → SNOMED crosswalk
crosswalk = bridge.get_crosswalk("ICD10CM", "SNOMED")

# Convert to FHIR CodeableConcept
fhir_cc = bridge.concept_to_codeable_concept(201826)

# Convert to openEHR DV_CODED_TEXT
ehr_ct = bridge.concept_to_dv_coded_text(201826)

Memory Considerations

CONCEPT_RELATIONSHIP.csv can be very large (39M+ rows for a full Athena download). To manage memory:

  1. Filter vocabularies: Pass vocabularies=["SNOMED", "LOINC"] to only load relevant concepts and relationships
  2. Lazy loading: VocabularyBridge loads data on first use, not at initialization
  3. Subset your download: Only select the vocabularies you need from Athena

See Knowledge Layer -- VocabularyBridge for the complete API reference.


Advanced: Loading Concepts Programmatically

For custom processing or inspection, use load_athena_concepts() to parse Athena CSVs into structured records without building an index:

from portiere.knowledge import load_athena_concepts

concepts = load_athena_concepts(
    athena_path="./data/athena/",
    vocabularies=["SNOMED", "LOINC"],
)

print(f"Loaded {len(concepts)} concepts")
print(concepts[0])
# {
#     "concept_id": 201826,
#     "concept_name": "Type 2 diabetes mellitus",
#     "vocabulary_id": "SNOMED",
#     "domain_id": "Condition",
#     "concept_class_id": "Clinical Finding",
#     "standard_concept": "S",
#     "synonyms": ["diabetes type 2", "DM2", "T2DM"],
# }

You can then filter, transform, or index these concepts manually:

# Filter to conditions only
conditions = [c for c in concepts if c["domain_id"] == "Condition"]

# Index into a specific backend
from portiere.knowledge.bm25s_backend import BM25sBackend

backend = BM25sBackend(corpus_path="./data/conditions.json")
backend.index_concepts(conditions)

Concept Record Format

Each concept in the knowledge layer corpus has these fields:

{
    "concept_id": 201826,
    "concept_name": "Type 2 diabetes mellitus",
    "vocabulary_id": "SNOMED",
    "domain_id": "Condition",
    "concept_class_id": "Clinical Finding",
    "standard_concept": "S",
    "synonyms": ["diabetes type 2", "DM2", "T2DM"]
}

| Field | Type | Required | Description |
|---|---|---|---|
| concept_id | int | Yes | OMOP concept ID |
| concept_name | str | Yes | Display name for the concept |
| vocabulary_id | str | Yes | Source vocabulary (SNOMED, LOINC, etc.) |
| domain_id | str | Recommended | Clinical domain (Condition, Drug, Measurement, etc.) |
| concept_class_id | str | Optional | Concept type (Clinical Finding, Lab Test, Ingredient, etc.) |
| standard_concept | str | Optional | "S" for standard concepts |
| synonyms | list[str] | Optional | Alternative names (improves search recall) |
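
The field rules above can be checked before indexing. The sketch below is a hypothetical helper, not an SDK function. It reports missing required fields and wrong types for a single record, using the field names and types listed in this section.

```python
REQUIRED_FIELDS = {"concept_id": int, "concept_name": str, "vocabulary_id": str}
OPTIONAL_FIELDS = {
    "domain_id": str,
    "concept_class_id": str,
    "standard_concept": str,
    "synonyms": list,
}


def validate_concept(record):
    """Return a list of problems found in one concept record (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing required field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field, expected in OPTIONAL_FIELDS.items():
        if field in record and not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    synonyms = record.get("synonyms", [])
    if isinstance(synonyms, list) and not all(isinstance(s, str) for s in synonyms):
        problems.append("synonyms should contain only strings")
    return problems


record = {
    "concept_id": 201826,
    "concept_name": "Type 2 diabetes mellitus",
    "vocabulary_id": "SNOMED",
    "synonyms": ["diabetes type 2", "DM2", "T2DM"],
}
print(validate_concept(record))
# -> []
```

Running a check like this over custom records catches string-typed IDs and missing names before they reach the index.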

Adding Custom Vocabularies

To add a custom vocabulary (e.g., institution-specific codes), create a JSON file with your concepts in the standard format:

import json

custom_concepts = [
    {
        "concept_id": 2000000001,
        "concept_name": "Hospital Admission Score",
        "vocabulary_id": "CUSTOM_HOSPITAL",
        "domain_id": "Observation",
        "concept_class_id": "Clinical Observation",
        "standard_concept": "S",
    },
    # ... more concepts
]

with open("./data/custom_vocab.json", "w") as f:
    json.dump(custom_concepts, f, indent=2)

Then merge with existing vocabularies:

from portiere.knowledge import load_athena_concepts
from portiere.knowledge.bm25s_backend import BM25sBackend

# Load standard concepts
standard = load_athena_concepts("./data/athena/", vocabularies=["SNOMED", "LOINC"])

# Add custom concepts
all_concepts = standard + custom_concepts

# Build index
backend = BM25sBackend(corpus_path="./data/merged_vocab.json")
backend.index_concepts(all_concepts)
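
Plain list concatenation does not catch ID collisions. A guarded merge might look like the sketch below. The helper `merge_concepts` is hypothetical, and the 2,000,000,000 floor reflects the common OMOP convention of reserving concept IDs above two billion for local concepts, which the example ID above follows; treat it as an assumption for your own deployment.

```python
def merge_concepts(standard, custom, custom_id_floor=2_000_000_000):
    """Return standard + custom after checking the custom concept IDs."""
    seen = {concept["concept_id"] for concept in standard}
    for concept in custom:
        cid = concept["concept_id"]
        # Assumed convention: local concept IDs sit above the floor.
        if cid <= custom_id_floor:
            raise ValueError(f"custom concept_id {cid} is not above {custom_id_floor}")
        if cid in seen:
            raise ValueError(f"duplicate concept_id {cid}")
        seen.add(cid)
    return standard + custom


standard_sample = [
    {"concept_id": 201826, "concept_name": "Type 2 diabetes mellitus", "vocabulary_id": "SNOMED"},
]
custom_sample = [
    {"concept_id": 2000000001, "concept_name": "Hospital Admission Score", "vocabulary_id": "CUSTOM_HOSPITAL"},
]
print(len(merge_concepts(standard_sample, custom_sample)))
# -> 2
```

The returned list can be passed to `index_concepts()` in place of the concatenated list.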

Include your custom vocabulary ID in the vocabularies parameter:

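import portiere
from portiere.engines import PolarsEngine

# `config` is a PortiereConfig built as in Step 3; its knowledge layer
# should point at the merged corpus (./data/merged_vocab.json).
project = portiere.init(
    name="My Project",
    engine=PolarsEngine(),
    vocabularies=["SNOMED", "LOINC", "CUSTOM_HOSPITAL"],
    config=config,
)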

Backend Comparison

| Feature | BM25s | FAISS | Hybrid |
|---|---|---|---|
| Setup | Easiest | Moderate | Moderate |
| Dependencies | None | faiss-cpu, sentence-transformers | Both |
| Search Type | Keyword (BM25) | Semantic (vector) | Both (RRF fusion) |
| Best For | Exact code matches | Conceptual similarity | Production accuracy |
| Speed | Fast | Fast (after model load) | Moderate |
| Disk Space | Small (JSON only) | Larger (vectors + metadata) | Largest |
| Accuracy | Good | Better | Best |

Recommendation: Start with BM25s for development, switch to Hybrid for production.

