Docs/Database Connections

Database Connections

Connect Portiere directly to your database to map clinical data without exporting to files. This guide covers connection setup, supported databases, and full pipeline examples.


Table of Contents


Quick Start

import portiere
from portiere.engines import PolarsEngine

project = portiere.init(name="Hospital Migration", engine=PolarsEngine())

# Connect to a PostgreSQL database
source = project.add_source(
    connection_string="postgresql://user:pass@localhost:5432/ehr_db",
    table="patients"
)

# Map schema and concepts as usual
schema_map = project.map_schema(source)
concept_map = project.map_concepts(source=source)

Supported Databases

DatabaseURI SchemeExample
PostgreSQLpostgresql://postgresql://user:pass@host:5432/dbname
MySQLmysql://mysql://user:pass@host:3306/dbname
SQLitesqlite:///sqlite:///path/to/database.db
SQL Servermssql://mssql://user:pass@host:1433/dbname
Oracleoracle://oracle://user:pass@host:1521/service

Any database supported by your engine's connector can be used. The connection string follows standard SQLAlchemy URI format.


Connection String Format

Connection strings use the format:

dialect://username:password@host:port/database

Examples

PostgreSQL:

# Standard connection
connection_string = "postgresql://admin:secret@db.hospital.internal:5432/ehr_production"

# With SSL
connection_string = "postgresql://admin:secret@db.hospital.internal:5432/ehr_production?sslmode=require"

MySQL:

connection_string = "mysql://reader:password@mysql-host:3306/clinical_data"

SQLite (local file):

connection_string = "sqlite:///./data/local_ehr.db"

SQL Server:

connection_string = "mssql://sa:password@sqlserver-host:1433/EHR_DB?driver=ODBC+Driver+17+for+SQL+Server"

Reading Tables

Use the table parameter to read an entire database table:

source = project.add_source(
    connection_string="postgresql://user:pass@localhost:5432/ehr_db",
    table="patients"
)

The table is read by the compute engine (Polars, Pandas, or Spark) and produces the same Source object as file-based sources. Column names, types, and row counts are detected automatically.


Custom SQL Queries

Use the query parameter for filtered or joined data:

# Filter to recent admissions
source = project.add_source(
    connection_string="postgresql://user:pass@localhost:5432/ehr_db",
    query="SELECT * FROM patients WHERE admission_date >= '2024-01-01'",
    name="recent_patients"
)

# Join tables before mapping
source = project.add_source(
    connection_string="postgresql://user:pass@localhost:5432/ehr_db",
    query="""
        SELECT p.patient_id, p.gender, p.birth_date,
               d.diagnosis_code, d.diagnosis_description,
               m.medication_code, m.medication_name
        FROM patients p
        JOIN diagnoses d ON p.patient_id = d.patient_id
        JOIN medications m ON p.patient_id = m.patient_id
    """,
    name="patient_clinical_data"
)

When using query, the name parameter is required to identify the source in the project.


Multiple Sources from One Database

You can add multiple sources from the same database to map different clinical domains:

conn = "postgresql://user:pass@localhost:5432/ehr_db"

# Source 1: Patient demographics
patients = project.add_source(connection_string=conn, table="patients")

# Source 2: Lab results
labs = project.add_source(
    connection_string=conn,
    query="SELECT * FROM lab_results WHERE result_date >= '2024-01-01'",
    name="lab_results"
)

# Source 3: Medications
meds = project.add_source(connection_string=conn, table="prescriptions")

# Map each source independently
for source in [patients, labs, meds]:
    schema_map = project.map_schema(source)
    concept_map = project.map_concepts(source=source)

Full Pipeline with Database Source

A complete end-to-end example connecting to a hospital EHR database:

import portiere
from portiere.config import PortiereConfig, LLMConfig
from portiere.engines import PolarsEngine

# Configure with LLM for concept verification
config = PortiereConfig(
    llm=LLMConfig(
        provider="openai",
        api_key="sk-...",
        model="gpt-4o",
    )
)

project = portiere.init(
    name="Hospital OMOP Migration",
    engine=PolarsEngine(),
    vocabularies=["SNOMED", "LOINC", "RxNorm", "ICD10CM"],
    config=config,
)

# Ingest from database
source = project.add_source(
    connection_string="postgresql://etl_user:password@ehr-db:5432/hospital_ehr",
    query="""
        SELECT patient_id, gender, date_of_birth,
               diagnosis_code, diagnosis_description,
               lab_code, lab_value, lab_unit
        FROM clinical_encounters
        WHERE encounter_date >= '2024-01-01'
    """,
    name="clinical_encounters"
)

# Profile
profile = project.profile(source)
print(f"Columns: {len(profile.get('columns', []))}")

# Schema mapping
schema_map = project.map_schema(source)
for item in schema_map.items:
    if item.status.value == "needs_review":
        item.approve()

# Concept mapping
concept_map = project.map_concepts(source=source)
summary = concept_map.summary()
print(f"Auto-mapped: {summary['auto_mapped']}/{summary['total']}")

# ETL
etl = project.run_etl(
    source,
    output_dir="./omop_output",
    schema_mapping=schema_map,
    concept_mapping=concept_map,
)

# Validate
result = project.validate(etl_result=etl)
print(f"Validation: {'PASSED' if result['all_passed'] else 'FAILED'}")

Security Best Practices

Use Environment Variables for Credentials

Never hardcode database credentials. Use environment variables:

import os

connection_string = (
    f"postgresql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:{os.environ.get('DB_PORT', '5432')}"
    f"/{os.environ['DB_NAME']}"
)

source = project.add_source(connection_string=connection_string, table="patients")

Or use a .env file with python-dotenv:

from dotenv import load_dotenv
load_dotenv()

connection_string = os.environ["DATABASE_URL"]

Use Read-Only Database Users

Create a dedicated read-only user for ETL extraction:

-- PostgreSQL
CREATE USER etl_reader WITH PASSWORD 'secure_password';
GRANT CONNECT ON DATABASE ehr_db TO etl_reader;
GRANT USAGE ON SCHEMA public TO etl_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO etl_reader;

Filter Data at Query Time

Use SQL queries to extract only the data you need, minimizing data exposure:

# Only extract the columns and rows needed for mapping
source = project.add_source(
    connection_string=connection_string,
    query="SELECT patient_id, diagnosis_code FROM encounters WHERE year = 2024",
    name="encounters_2024"
)

See Also