Docs/Elasticsearch Backend

Elasticsearch Backend

The Elasticsearch backend connects Portiere to an external Elasticsearch cluster for concept search. It uses BM25 lexical matching with optional vocabulary and domain filtering, and supports batch search via the multi-search API (_msearch).


Table of Contents

  1. When to Use
  2. Installation
  3. Connecting to Elasticsearch
  4. Indexing Concepts
  5. Searching Concepts
  6. Batch Search
  7. Concept Lookup by ID
  8. Full Pipeline Integration
  9. Hybrid Mode: Elasticsearch + FAISS
  10. Authentication Options
  11. Index Schema
  12. Configuration Reference
  13. Performance Notes

When to Use

  • Your team already runs an Elasticsearch cluster
  • You need horizontal scalability for large vocabularies (millions of concepts)
  • You need production-grade infrastructure with monitoring, replication, and backups
  • BM25 lexical search is sufficient, or you plan to combine it with FAISS for hybrid search

When NOT to Use

  • You need a zero-dependency, offline-first setup → use BM25s instead
  • You only need semantic (dense vector) search → use FAISS instead
  • You want the simplest possible setup for a notebook → use BM25s or FAISS

Installation

pip install portiere-health[elasticsearch]

This installs the elasticsearch Python client (8.x). You also need a running Elasticsearch instance (8.x recommended).

Quick Start with Docker

docker run -d --name es-portiere \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

Verify:

curl http://localhost:9200

Connecting to Elasticsearch

from portiere.knowledge.elasticsearch_backend import ElasticsearchBackend

# Local development (no auth)
backend = ElasticsearchBackend(
    url="http://localhost:9200",
    index_name="portiere_concepts",
    verify_certs=False,
)

The index_name controls which Elasticsearch index is used for concept storage and search. You can run multiple projects against different indices on the same cluster.


Indexing Concepts

Before searching, you need to load your vocabulary concepts into the Elasticsearch index. Concepts should be a list of dictionaries with standard OMOP fields:

import json

# Load concepts from a JSON file
with open("concepts.json") as f:
    concepts = json.load(f)

# Each concept should have at minimum:
# {
#     "concept_id": 201826,
#     "concept_name": "Type 2 diabetes mellitus",
#     "vocabulary_id": "SNOMED",
#     "domain_id": "Condition",
#     "concept_class_id": "Clinical Finding",
#     "standard_concept": "S"
# }

backend.index_concepts(concepts)
print(f"Indexed {len(concepts)} concepts")

The index_concepts() method:

  1. Creates the index with the appropriate mapping if it does not exist
  2. Bulk-indexes all concepts using the Elasticsearch helpers.bulk API
  3. Refreshes the index so documents are immediately searchable

Re-indexing

Calling index_concepts() on an existing index adds documents. If you need a clean re-index, delete the index first:

backend.es.indices.delete(index="portiere_concepts", ignore_unavailable=True)
backend.index_concepts(concepts)

Searching Concepts

The search() method executes a BM25 query across concept names and descriptions:

results = backend.search("diabetes", limit=5)

for r in results:
    print(f"{r['concept_name']:45s}  {r['vocabulary_id']:10s}  score={r['score']:.2f}")

Vocabulary Filtering

Restrict results to specific vocabularies:

results = backend.search("hypertension", vocabularies=["SNOMED"], limit=5)

Domain Filtering

Restrict results to a specific domain (Condition, Drug, Measurement, etc.):

results = backend.search("aspirin", domain="Drug", limit=5)

Combined Filters

results = backend.search(
    "glucose",
    vocabularies=["LOINC"],
    domain="Measurement",
    limit=10,
)

For mapping pipelines that need to search many terms at once, batch_search() uses the Elasticsearch multi-search API (_msearch) to execute all queries in a single round-trip:

queries = ["diabetes", "hypertension", "headache", "metformin", "glucose"]
batch_results = backend.batch_search(queries, limit=3)

for query, results in zip(queries, batch_results):
    print(f"Query: '{query}'")
    for r in results:
        print(f"  → {r['concept_name']:40s}  score={r['score']:.2f}")

This is significantly faster than calling search() in a loop, especially with network latency.


Concept Lookup by ID

Retrieve a single concept directly by its concept ID:

concept = backend.get_concept(201826)
print(f"{concept['concept_id']}: {concept['concept_name']} ({concept['vocabulary_id']})")

Full Pipeline Integration

Use Elasticsearch as the knowledge layer in the full Portiere pipeline by setting backend="elasticsearch" in the knowledge layer config:

import portiere
from portiere.config import (
    PortiereConfig,
    KnowledgeLayerConfig,
    RerankerConfig,
)
from portiere.engines import PolarsEngine

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="elasticsearch",
        elasticsearch_url="http://localhost:9200",
        elasticsearch_index="portiere_concepts",
    ),
    reranker=RerankerConfig(provider="none"),
)
# Portiere infers: effective_mode="local", effective_pipeline="local"

project = portiere.init(
    name="ES Pipeline Demo",
    engine=PolarsEngine(),
    target_model="omop_cdm_v5.4",
    config=config,
)

# Add data and map as usual
patients = project.add_source("patients.csv")
diagnoses = project.add_source("diagnoses.csv")

schema_map = project.map_schema(patients)
concept_map = project.map_concepts(source=diagnoses)

Hybrid Mode: Elasticsearch + FAISS

Combine Elasticsearch BM25 with FAISS dense vectors for maximum recall and precision. The hybrid backend runs both searches in parallel and merges results using Reciprocal Rank Fusion (RRF):

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="hybrid",
        # BM25 via Elasticsearch
        elasticsearch_url="http://localhost:9200",
        elasticsearch_index="portiere_concepts",
        # Dense vectors via FAISS
        faiss_index_path="faiss/concepts.index",
        faiss_metadata_path="faiss/metadata.json",
        # Fusion settings
        fusion_method="rrf",
        rrf_k=60,
    ),
    embedding=EmbeddingConfig(
        provider="huggingface",
        model="cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    ),
    reranker=RerankerConfig(provider="none"),
)

How RRF Works

Reciprocal Rank Fusion combines ranked lists from multiple retrieval methods:

RRF_score(d) = Σ  1 / (k + rank_i(d))

Where k is a smoothing parameter (default 60) and rank_i(d) is the rank of document d in retrieval method i. Higher k values give more weight to lower-ranked results.

When to Use Hybrid

  • Source terms are a mix of formal codes and free text
  • You need both exact term matching (BM25) and semantic similarity (FAISS)
  • Highest accuracy is more important than simplicity or latency

See also: 05-knowledge-layer.md for a full comparison of all backends.


Authentication Options

No Authentication (Local Development)

backend = ElasticsearchBackend(
    url="http://localhost:9200",
    index_name="portiere_concepts",
    verify_certs=False,
)

Basic Authentication

backend = ElasticsearchBackend(
    url="https://es.mycompany.com:9200",
    index_name="portiere_concepts",
    basic_auth=("elastic", "changeme"),
    verify_certs=True,
)

API Key Authentication

backend = ElasticsearchBackend(
    url="https://es.mycompany.com:9200",
    index_name="portiere_concepts",
    api_key="base64-encoded-api-key",
)

Elastic Cloud

backend = ElasticsearchBackend(
    url="https://my-deployment.es.us-east-1.aws.cloud.es.io:9243",
    index_name="portiere_concepts",
    api_key="cloud-api-key",
)

Index Schema

The Elasticsearch backend creates an index with the following mapping:

FieldTypeAnalyzerNotes
concept_idintegerUnique identifier
concept_nametextstandardBoosted 2x in search
vocabulary_idkeywordUsed for filtering
domain_idkeywordUsed for filtering
concept_class_idkeyword
standard_conceptkeywordS, C, or empty
concept_codekeywordSource vocabulary code
descriptiontextstandardBoosted 1x in search

The search query uses multi_match across concept_name (boosted) and description fields.


Configuration Reference

KnowledgeLayerConfig Fields for Elasticsearch

FieldTypeDefaultDescription
backendstr"bm25s"Set to "elasticsearch"
elasticsearch_urlstr"http://localhost:9200"Elasticsearch cluster URL
elasticsearch_indexstr"portiere_concepts"Index name for concepts

ElasticsearchBackend Constructor

ParameterTypeDefaultDescription
urlstrrequiredElasticsearch URL
index_namestr"portiere_concepts"Index name
verify_certsboolTrueVerify TLS certificates
basic_authtupleNone(username, password) for basic auth
api_keystrNoneAPI key for token-based auth

Methods

MethodSignatureDescription
searchsearch(query, limit=10, vocabularies=None, domain=None)BM25 search with optional filters
batch_searchbatch_search(queries, limit=10, vocabularies=None, domain=None)Multi-search in single round-trip
get_conceptget_concept(concept_id)Lookup by concept ID
index_conceptsindex_concepts(concepts)Bulk index a list of concept dicts

Performance Notes

MetricElasticsearchBM25sFAISS
Index sizeScales to billions<1M recommended<10M recommended
Search latency5-20ms per query1-5ms per query2-10ms per query
Batch latency~50ms for 100 queries~100ms~30ms
InfrastructureRequires ES clusterNoneNone
Offline capableNoYesYes
Best accuracyGood (lexical)Good (lexical)Better (semantic)
Hybrid capableYes (as BM25 arm)Yes (as BM25 arm)Yes (as dense arm)

For production deployments with large vocabularies, Elasticsearch with FAISS hybrid is the recommended configuration.


See Also