Elasticsearch Backend

The Elasticsearch backend connects Portiere to an external Elasticsearch cluster for concept search. It uses BM25 lexical matching with optional vocabulary and domain filtering, and supports batch search via the multi-search API (_msearch).

When to Use
Installation
Connecting to Elasticsearch
Indexing Concepts
Searching Concepts
Batch Search
Concept Lookup by ID
Full Pipeline Integration
Hybrid Mode: Elasticsearch + FAISS
Authentication Options
Index Schema
Configuration Reference
Performance Notes

When to Use

Your team already runs an Elasticsearch cluster
You need horizontal scalability for large vocabularies (millions of concepts)
You need production-grade infrastructure with monitoring, replication, and backups
BM25 lexical search is sufficient, or you plan to combine it with FAISS for hybrid search

When NOT to Use

You need a zero-dependency, offline-first setup → use BM25s instead
You only need semantic (dense vector) search → use FAISS instead
You want the simplest possible setup for a notebook → use BM25s or FAISS

Installation

pip install "portiere-health[elasticsearch]"

This installs the elasticsearch Python client (8.x). You also need a running Elasticsearch instance (8.x recommended).

Quick Start with Docker

docker run -d --name es-portiere \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  elasticsearch:8.11.0

Verify:

curl http://localhost:9200
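
The same check can be done from Python before any indexing code runs. The sketch below is a minimal illustration, assuming the cluster is reachable without authentication (as in the Docker command above). The function name cluster_version and the injectable opener parameter are hypothetical helpers, not part of Portiere. The function reads the version.number field that the Elasticsearch root endpoint returns.

```python
import json
import urllib.request


def cluster_version(url="http://localhost:9200", opener=urllib.request.urlopen, timeout=5):
    """Return the version string reported by the Elasticsearch root endpoint.

    `opener` is injectable so the function can be exercised without a live
    cluster; by default it performs a real HTTP GET.
    """
    with opener(url, timeout=timeout) as response:
        info = json.load(response)
    return info["version"]["number"]


# Usage against a running cluster (raises URLError if nothing is listening):
# print(cluster_version())
```

A connection error here usually means the container is still starting; retrying after a few seconds is typically enough.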

Connecting to Elasticsearch

from portiere.knowledge.elasticsearch_backend import ElasticsearchBackend

# Local development (no auth)
backend = ElasticsearchBackend(
    url="http://localhost:9200",
    index_name="portiere_concepts",
    verify_certs=False,
)

The index_name controls which Elasticsearch index is used for concept storage and search. You can run multiple projects against different indices on the same cluster.

Indexing Concepts

Before searching, you need to load your vocabulary concepts into the Elasticsearch index. Concepts should be a list of dictionaries with standard OMOP fields:

import json

# Load concepts from a JSON file
with open("concepts.json") as f:
    concepts = json.load(f)

# Each concept should have at minimum:
# {
#     "concept_id": 201826,
#     "concept_name": "Type 2 diabetes mellitus",
#     "vocabulary_id": "SNOMED",
#     "domain_id": "Condition",
#     "concept_class_id": "Clinical Finding",
#     "standard_concept": "S"
# }

backend.index_concepts(concepts)
print(f"Indexed {len(concepts)} concepts")
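
Before calling index_concepts(), it can help to confirm that every record carries the fields listed in the comment above. The following is a small sketch of such a check. The helper name find_invalid_concepts is hypothetical and not part of Portiere, and the second sample record is deliberately incomplete for illustration.

```python
REQUIRED_FIELDS = (
    "concept_id",
    "concept_name",
    "vocabulary_id",
    "domain_id",
    "concept_class_id",
    "standard_concept",
)


def find_invalid_concepts(concepts):
    """Return (position, missing_fields) pairs for records lacking required keys."""
    problems = []
    for position, concept in enumerate(concepts):
        missing = [field for field in REQUIRED_FIELDS if field not in concept]
        if missing:
            problems.append((position, missing))
    return problems


sample = [
    {
        "concept_id": 201826,
        "concept_name": "Type 2 diabetes mellitus",
        "vocabulary_id": "SNOMED",
        "domain_id": "Condition",
        "concept_class_id": "Clinical Finding",
        "standard_concept": "S",
    },
    # Deliberately incomplete record, for illustration only
    {"concept_id": 1, "concept_name": "Incomplete record"},
]

problems = find_invalid_concepts(sample)
for position, missing in problems:
    print(f"Record {position} is missing: {', '.join(missing)}")
```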

The index_concepts() method:

Creates the index with the appropriate mapping if it does not exist
Bulk-indexes all concepts using the Elasticsearch helpers.bulk API
Refreshes the index so documents are immediately searchable
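
To make the bulk step concrete, here is a sketch of the action dictionaries that the elasticsearch client's helpers.bulk function accepts. This illustrates the general shape only; it is not Portiere's actual implementation, and the helper name build_bulk_actions is hypothetical.

```python
def build_bulk_actions(concepts, index_name="portiere_concepts"):
    """Yield one bulk action per concept in the shape helpers.bulk accepts."""
    for concept in concepts:
        yield {
            "_index": index_name,  # target index for this document
            "_source": concept,    # the concept dict becomes the document body
        }


concepts_sample = [
    {
        "concept_id": 201826,
        "concept_name": "Type 2 diabetes mellitus",
        "vocabulary_id": "SNOMED",
        "domain_id": "Condition",
        "concept_class_id": "Clinical Finding",
        "standard_concept": "S",
    },
]

actions = list(build_bulk_actions(concepts_sample))
print(actions[0]["_index"])

# With a live client, the actions would be sent like this:
# from elasticsearch import helpers
# helpers.bulk(es_client, build_bulk_actions(concepts_sample))
```

No "_id" is set in this sketch, so Elasticsearch would generate document IDs. Adding an explicit "_id" (for example the concept ID) would instead cause a repeated index call to overwrite the matching document.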

Re-indexing

Calling index_concepts() on an existing index adds documents to it; existing documents are not removed first. For a clean re-index, delete the index before indexing again:

backend.es.indices.delete(index="portiere_concepts", ignore_unavailable=True)
backend.index_concepts(concepts)

Searching Concepts

The search() method executes a BM25 query across concept names and descriptions:

results = backend.search("diabetes", limit=5)

for r in results:
    print(f"{r['concept_name']:45s}  {r['vocabulary_id']:10s}  score={r['score']:.2f}")

Vocabulary Filtering

Restrict results to specific vocabularies:

results = backend.search("hypertension", vocabularies=["SNOMED"], limit=5)

Domain Filtering

Restrict results to a specific domain (Condition, Drug, Measurement, etc.):

results = backend.search("aspirin", domain="Drug", limit=5)

Combined Filters

results = backend.search(
    "glucose",
    vocabularies=["LOINC"],
    domain="Measurement",
    limit=10,
)

Batch Search

For mapping pipelines that need to search many terms at once, batch_search() uses the Elasticsearch multi-search API (_msearch) to execute all queries in a single round-trip:

queries = ["diabetes", "hypertension", "headache", "metformin", "glucose"]
batch_results = backend.batch_search(queries, limit=3)

for query, results in zip(queries, batch_results):
    print(f"Query: '{query}'")
    for r in results:
        print(f"  → {r['concept_name']:40s}  score={r['score']:.2f}")

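Because all queries travel in one HTTP request, this avoids paying network latency once per term and is typically much faster than calling search() in a loop, especially when the cluster is remote.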
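
For readers unfamiliar with _msearch, the request is a flat list that alternates a header (which index to search) with a query body. The sketch below builds such a list. It is an assumption about what batch_search() might send, not a copy of its implementation. The helper name build_msearch_body is hypothetical, and the field boosts are taken from the Index Schema section.

```python
def build_msearch_body(queries, index_name="portiere_concepts", limit=3):
    """Build the alternating header/body list expected by the _msearch API."""
    searches = []
    for text in queries:
        searches.append({"index": index_name})  # header: which index to search
        searches.append(
            {
                "size": limit,
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": ["concept_name^2", "description"],
                    }
                },
            }
        )
    return searches


queries = ["diabetes", "hypertension", "headache"]
body = build_msearch_body(queries)
print(f"{len(queries)} queries -> {len(body)} msearch entries")

# With the 8.x Python client, this list would be sent in one request:
# response = es_client.msearch(searches=body)
# response["responses"] holds one result set per query, in the same order.
```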

Concept Lookup by ID

Retrieve a single concept directly by its concept ID:

concept = backend.get_concept(201826)
print(f"{concept['concept_id']}: {concept['concept_name']} ({concept['vocabulary_id']})")

Full Pipeline Integration

Use Elasticsearch as the knowledge layer in the full Portiere pipeline by setting backend="elasticsearch" in the knowledge layer config:

import portiere
from portiere.config import (
    PortiereConfig,
    KnowledgeLayerConfig,
    RerankerConfig,
)
from portiere.engines import PolarsEngine

config = PortiereConfig(
    knowledge_layer=KnowledgeLayerConfig(
        backend="elasticsearch",
        elasticsearch_url="http://localhost:9200",
        elasticsearch_index="portiere_concepts",
    ),
    reranker=RerankerConfig(provider="none"),
)
# Portiere infers: effective_mode="local", effective_pipeline="local"

project = portiere.init(
    name="ES Pipeline Demo",
    engine=PolarsEngine(),
    target_model="omop_cdm_v5.4",
    config=config,
)

# Add data and map as usual
patients = project.add_source("patients.csv")
diagnoses = project.add_source("diagnoses.csv")

schema_map = project.map_schema(patients)
concept_map = project.map_concepts(source=diagnoses)

Hybrid Mode: Elasticsearch + FAISS

Combine Elasticsearch BM25 with FAISS dense vectors to improve recall and precision over either method alone. The hybrid backend runs both searches in parallel and merges their ranked results using Reciprocal Rank Fusion (RRF):

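# EmbeddingConfig is assumed to live alongside the other config classes
from portiere.config import (
    EmbeddingConfig,
    KnowledgeLayerConfig,
    PortiereConfig,
    RerankerConfig,
)

config = PortiereConfig(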
    knowledge_layer=KnowledgeLayerConfig(
        backend="hybrid",
        # BM25 via Elasticsearch
        elasticsearch_url="http://localhost:9200",
        elasticsearch_index="portiere_concepts",
        # Dense vectors via FAISS
        faiss_index_path="faiss/concepts.index",
        faiss_metadata_path="faiss/metadata.json",
        # Fusion settings
        fusion_method="rrf",
        rrf_k=60,
    ),
    embedding=EmbeddingConfig(
        provider="huggingface",
        model="cambridgeltl/SapBERT-from-PubMedBERT-fulltext",
    ),
    reranker=RerankerConfig(provider="none"),
)

How RRF Works

Reciprocal Rank Fusion combines ranked lists from multiple retrieval methods:

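RRF_score(d) = Σ_i  1 / (k + rank_i(d))

Where the sum runs over the retrieval methods i, k is a smoothing constant (default 60), and rank_i(d) is the rank of document d in retrieval method i. Higher k values flatten the score curve, so lower-ranked results carry relatively more weight; lower k values favor documents ranked near the top of either list.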
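
The formula can be sketched in a few lines of Python. This is a generic illustration of RRF, not Portiere's fusion code. It assumes ranks start at 1 and that a document absent from a list contributes nothing for that method. Apart from 201826, the concept IDs are placeholders.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked ID lists; returns (doc_id, score) pairs, best first."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder rankings: first list from BM25, second from dense retrieval
bm25_ranking = [201826, 1001, 1002]
dense_ranking = [1002, 201826, 1003]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=60)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here 201826 ranks first because it appears near the top of both lists (1/61 + 1/62), ahead of 1002 (1/63 + 1/61) and of the documents that appear in only one list.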

When to Use Hybrid

Source terms are a mix of formal codes and free text
You need both exact term matching (BM25) and semantic similarity (FAISS)
Highest accuracy is more important than simplicity or latency

See also: 05-knowledge-layer.md for a full comparison of all backends.

Authentication Options

No Authentication (Local Development)

backend = ElasticsearchBackend(
    url="http://localhost:9200",
    index_name="portiere_concepts",
    verify_certs=False,
)

Basic Authentication

backend = ElasticsearchBackend(
    url="https://es.mycompany.com:9200",
    index_name="portiere_concepts",
    basic_auth=("elastic", "changeme"),
    verify_certs=True,
)

API Key Authentication

backend = ElasticsearchBackend(
    url="https://es.mycompany.com:9200",
    index_name="portiere_concepts",
    api_key="base64-encoded-api-key",
)

Elastic Cloud

backend = ElasticsearchBackend(
    url="https://my-deployment.es.us-east-1.aws.cloud.es.io:9243",
    index_name="portiere_concepts",
    api_key="cloud-api-key",
)
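
In shared or production environments it is usually preferable to keep credentials out of source code. The sketch below reads connection settings from environment variables and builds keyword arguments matching the constructor calls shown above. The PORTIERE_ES_* variable names and the helper connection_kwargs_from_env are hypothetical; Portiere does not necessarily read these variables itself.

```python
import os


def connection_kwargs_from_env(env=None):
    """Build ElasticsearchBackend keyword arguments from environment variables."""
    env = os.environ if env is None else env
    kwargs = {
        "url": env.get("PORTIERE_ES_URL", "http://localhost:9200"),
        "index_name": env.get("PORTIERE_ES_INDEX", "portiere_concepts"),
    }
    api_key = env.get("PORTIERE_ES_API_KEY")
    username = env.get("PORTIERE_ES_USERNAME")
    password = env.get("PORTIERE_ES_PASSWORD")
    if api_key:
        kwargs["api_key"] = api_key  # API key takes precedence when both are set
    elif username and password:
        kwargs["basic_auth"] = (username, password)
    # Only verify certificates when the connection actually uses TLS
    kwargs["verify_certs"] = kwargs["url"].startswith("https://")
    return kwargs


example_env = {
    "PORTIERE_ES_URL": "https://es.mycompany.com:9200",
    "PORTIERE_ES_USERNAME": "elastic",
    "PORTIERE_ES_PASSWORD": "example-password",
}
kwargs = connection_kwargs_from_env(example_env)
print(sorted(kwargs))

# backend = ElasticsearchBackend(**connection_kwargs_from_env())
```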

Index Schema

The Elasticsearch backend creates an index with the following mapping:

Field	Type	Analyzer	Notes
`concept_id`	`integer`	—	Unique identifier
`concept_name`	`text`	standard	Boosted 2x in search
`vocabulary_id`	`keyword`	—	Used for filtering
`domain_id`	`keyword`	—	Used for filtering
`concept_class_id`	`keyword`	—	—
`standard_concept`	`keyword`	—	`S`, `C`, or empty
`concept_code`	`keyword`	—	Source vocabulary code
`description`	`text`	standard	Boosted 1x in search
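
Expressed as an Elasticsearch index body, the table above would look roughly like the dictionary below. This is a reconstruction from the table, not the exact mapping Portiere ships. Settings such as shard and replica counts are omitted because the source does not state them.

```python
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "concept_id": {"type": "integer"},
            "concept_name": {"type": "text", "analyzer": "standard"},
            "vocabulary_id": {"type": "keyword"},
            "domain_id": {"type": "keyword"},
            "concept_class_id": {"type": "keyword"},
            "standard_concept": {"type": "keyword"},
            "concept_code": {"type": "keyword"},
            "description": {"type": "text", "analyzer": "standard"},
        }
    }
}

# Fields analyzed for full-text search, as opposed to exact-match keyword fields
text_fields = sorted(
    name
    for name, spec in INDEX_MAPPING["mappings"]["properties"].items()
    if spec["type"] == "text"
)
print(text_fields)
```

Keyword fields are stored unanalyzed, which is what makes exact-match filtering on vocabulary_id and domain_id possible.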

The search query uses multi_match across concept_name (boosted) and description fields.
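
As a rough sketch of how such a query could be assembled, the function below combines a multi_match clause with the vocabulary and domain filters described under Searching Concepts. The helper name build_search_body is hypothetical, and the exact query Portiere generates may differ. The 2x boost on concept_name is taken from the table above.

```python
def build_search_body(text, vocabularies=None, domain=None, limit=10):
    """Build a bool query: scored multi_match plus non-scoring filters."""
    filters = []
    if vocabularies:
        filters.append({"terms": {"vocabulary_id": list(vocabularies)}})
    if domain:
        filters.append({"term": {"domain_id": domain}})
    return {
        "size": limit,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # "^2" doubles the weight of matches in concept_name
                            "fields": ["concept_name^2", "description"],
                        }
                    }
                ],
                "filter": filters,
            }
        },
    }


search_body = build_search_body(
    "glucose", vocabularies=["LOINC"], domain="Measurement", limit=10
)
print(search_body["query"]["bool"]["filter"])
```

Placing the vocabulary and domain conditions under filter rather than must keeps them from influencing the BM25 score; they only narrow the candidate set.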

Configuration Reference

`KnowledgeLayerConfig` Fields for Elasticsearch

Field	Type	Default	Description
`backend`	`str`	`"bm25s"`	Set to `"elasticsearch"`
`elasticsearch_url`	`str`	`"http://localhost:9200"`	Elasticsearch cluster URL
`elasticsearch_index`	`str`	`"portiere_concepts"`	Index name for concepts