Deployment
Defacto is a Python library that runs the same code at every scale. The difference between development and production is configuration, not code. This guide covers the deployment progression from local development to sharded production.
Development
The default backend is SQLite. You don't need to configure anything.
from defacto import Defacto
d = Defacto("definitions/")
d.ingest("app", events, process=True)
d.table("customer").execute()
SQLite stores everything in a single file inside a .defacto/ directory in your project folder. This is ideal for developing definitions, running tests, prototyping queries, and learning the framework.
SQLite is single-writer, so it won't work for concurrent multi-process deployments. For that, you need Postgres.
Production with Postgres
To move to Postgres, pass a connection string. Everything else stays the same.
d = Defacto("definitions/", database="postgresql://user:pass@host:5432/mydb")
Postgres gives you durable storage, concurrent reads, JSONB indexing, and multi-process safe identity resolution. A single process against a properly tuned Postgres handles a high volume of events per day, which is sufficient for most production deployments.
Batch processing
If your events arrive in batches (from an orchestrator, a file drop, or a scheduled pipeline), use the append-then-build pattern.
d.ingest("app", events)
d.build()
This works well with orchestrators like Dagster or Airflow. A pipeline step calls ingest() with a batch of events, then build() processes them. The build auto-detects what mode to use based on whether definitions changed.
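A minimal sketch of such a step, callable from any orchestrator task (the function name, connection string, and how events reach the task are placeholders, not part of defacto's API):

from defacto import Defacto

def ingest_and_build(events):
    # Orchestrator task body: append the batch, then process everything
    # appended since the last build.
    d = Defacto("definitions/", database="postgresql://user:pass@host:5432/mydb")
    d.ingest("app", events)  # append only; no interpretation yet
    return d.build()         # mode is auto-detected based on whether definitions changed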
Streaming
For real-time processing, use process=True on ingest. Events are interpreted through the state machine as they arrive.
d = Defacto("definitions/", database="postgresql://...")
d.ingest("app", events, process=True)
To propagate state changes downstream in real time, add Kafka. Defacto publishes entity snapshots to a Kafka topic, and separate consumer processes write state history to analytical stores.
d = Defacto("definitions/",
            database="postgresql://...",
            kafka={"bootstrap_servers": "localhost:9092", "topic": "entity-state"})
Consumers
A DefactoConsumer reads entity snapshots from Kafka and writes SCD Type 2 history to a store. Consumers are stateless and recover from Kafka offsets.
consumer = Defacto.consumer(
    kafka={"bootstrap_servers": "localhost:9092"},
    database="postgresql://...",
    store="postgresql://analytics-db/...")
consumer.run()  # blocks and consumes continuously
You can run multiple consumers writing to different stores; each manages its own Kafka offsets independently. For example, one consumer writes to Postgres for operational queries while another writes to DuckDB for local analytics, as sketched below.
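A second consumer alongside the Postgres one above, as a sketch; it assumes the store parameter accepts a DuckDB file path the same way the default backend accepts a SQLite path:

# Runs in its own process, separate from the Postgres consumer above,
# because run() blocks.
duck_consumer = Defacto.consumer(
    kafka={"bootstrap_servers": "localhost:9092"},
    database="postgresql://...",
    store="analytics.duckdb")  # assumed: DuckDB store addressed by file path
duck_consumer.run()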
Sharding
When a single process isn't enough throughput, you can shard across multiple processes. Each shard owns a deterministic subset of entities via SHA-256 hash partitioning.
d = Defacto("definitions/",
            database="postgresql://...",
            shard_id=0, total_shards=4)
All shards share the same Postgres ledger and identity table. Each shard has its own database connections and processes only its owned entities. This parallelizes the I/O, which is the bottleneck.
Cross-shard identity resolution works automatically. If a merge is detected on one shard, other shards pick it up through the shared merge log and correct their caches.
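The ownership rule is plain deterministic hashing. The snippet below illustrates the idea (it is not defacto's internal code): every process hashes the entity ID the same way, so ownership needs no coordination.

import hashlib

def owns_entity(entity_id: str, shard_id: int, total_shards: int) -> bool:
    # Map the SHA-256 hash of the entity ID onto the shard ring.
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards == shard_id

owns_entity("customer:42", shard_id=0, total_shards=4)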
Shard assignment is currently manual (you set shard_id on each process). Automated shard claiming with heartbeats and failover is planned as a separate defacto-cluster library.
Tiered storage
For long retention or cost-sensitive deployments, the TieredLedger combines hot Postgres with cold Delta Lake on S3.
d = Defacto("definitions/",
            database="postgresql://...",
            cold_ledger="s3://bucket/cold/")
New events go to the hot tier (Postgres) for real-time operations. Periodically, events are flushed to the cold tier and pruned from Postgres. Replays stitch hot and cold storage transparently.
Cold storage compresses significantly compared to Postgres row storage and can hold years of event history cost-effectively.
Storage backends
Ledger
The ledger is shared across all processes. All shards read and write to the same ledger, which is what makes identity resolution and merge detection work across the cluster.
| Backend | Use case |
|---|---|
| SQLite | Development, testing, single-process |
| Postgres | Production, multi-process, sharding |
| Postgres + Delta Lake | Long retention, cost-sensitive, compliance |
| Aurora / CockroachDB | Distributed writes, high shard counts (future) |
State history
State history is the output of the pipeline. Your state history backend should match the scale and access pattern of your queries. With Kafka, you can fan out to multiple stores simultaneously, so you don't have to pick just one.
| Backend | Use case |
|---|---|
| SQLite | Development, testing |
| Postgres | Production, operational queries, moderate volume |
| DuckDB | Local analytics, embedded, single-user |
| BigQuery / Snowflake | Heavy analytics, large teams, warehouse-scale (future) |
Identity
| Backend | Use case |
|---|---|
| SQLite | Development, testing |
| Postgres | Production, multi-process |
| Redis | Extreme scale, microsecond lookups (future) |
Namespace isolation
If you need multiple independent defacto environments on the same Postgres database (for example, staging and production), use the namespace parameter.
d = Defacto("definitions/", database="postgresql://...", namespace="staging")
Infrastructure tables go in the {namespace} schema and state history goes in {namespace}_{version} schemas. The default namespace is defacto.
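For example, staging and production can share one Postgres instance (schema names follow the pattern above):

# Two isolated environments on the same database
staging = Defacto("definitions/", database="postgresql://...", namespace="staging")
prod = Defacto("definitions/", database="postgresql://...", namespace="production")
# staging's infrastructure tables live in the "staging" schema;
# its state history lands in "staging_{version}" schemas.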
Configuration
The Defacto constructor accepts several configuration parameters.
| Parameter | Default | Description |
|---|---|---|
| database | SQLite (auto) | Database URL or file path |
| batch_size | 100 | Events per processing batch |
| workers | 1 | Rust thread pool size for parallelism |
| shard_id | None | Shard index (0-based) |
| total_shards | None | Total number of shards |
| namespace | "defacto" | Postgres schema prefix |
| kafka | None | Kafka config dict (bootstrap_servers, topic) |
| cold_ledger | None | S3 path for Delta Lake cold storage |
| dead_letter | None | Dead letter sink config (type, path or topic) |
| log_level | "INFO" | Logging level |
| log_format | "console" | Log format: "console" or "json" |
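Putting several of these together, a sharded streaming process might be configured like this (all values are illustrative):

d = Defacto(
    "definitions/",
    database="postgresql://user:pass@host:5432/mydb",
    batch_size=500,
    workers=4,
    shard_id=0,
    total_shards=4,
    kafka={"bootstrap_servers": "localhost:9092", "topic": "entity-state"},
    log_level="INFO",
    log_format="json")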
Dead letter sinks
When an event fails normalization or interpretation, defacto captures the failure in the result object (IngestResult.failures, BuildResult.failures). If you also want failed events written to a durable store for later inspection or reprocessing, configure a dead letter sink.
# Write failures to a JSONL file
d = Defacto("definitions/", dead_letter={"type": "file", "path": "/var/log/defacto/failures.jsonl"})
# Write failures to a Kafka topic
d = Defacto("definitions/", dead_letter={"type": "kafka", "bootstrap_servers": "...", "topic": "dead-letter"})
Each failure includes the raw event, the error message, which pipeline stage it failed at (normalization, interpretation, publishing), and whether it's recoverable. File sinks append one JSON line per failure. Kafka sinks publish to the topic with the entity ID or source name as the partition key.
If no dead letter sink is configured, failures are still available in result objects. They just aren't persisted anywhere beyond the caller's code.
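With a file sink, the JSONL output can be read back for inspection or reprocessing; a sketch (the field names are illustrative, so check one record for the exact keys your version emits):

import json

# Load failures captured by the file dead letter sink.
with open("/var/log/defacto/failures.jsonl") as f:
    failures = [json.loads(line) for line in f]

# Filter to the ones marked recoverable for a retry pass.
retryable = [rec for rec in failures if rec.get("recoverable")]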
Logging
Defacto uses structured logging with hierarchical loggers. In console mode, logs are formatted for readability. In JSON mode, all context fields are included for log aggregation.
d = Defacto("definitions/", log_level="DEBUG", log_format="json")
Loggers are organized by subsystem: defacto.pipeline, defacto.lifecycle, defacto.storage, defacto.dead_letter.
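Assuming these are standard library loggers (an assumption, not stated above), verbosity can be tuned per subsystem:

import logging

# Quiet the storage layer while keeping pipeline logs verbose.
logging.getLogger("defacto.storage").setLevel(logging.WARNING)
logging.getLogger("defacto.pipeline").setLevel(logging.DEBUG)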
Recommendations
Starting out. Use SQLite. Develop your definitions, test with sample data, iterate on queries. No infrastructure needed.
First production deployment. Single Postgres process. Choose batch or streaming depending on your latency requirements. This handles most workloads.
Scaling up. Add sharding when single-process throughput isn't enough. Start with a small number of shards and monitor. The bottleneck is shared Postgres write capacity, so the ceiling depends on your Postgres instance.
Multiple downstream consumers. Add Kafka when you need real-time state propagation or when different teams need entity state in different stores. Each consumer is independent.
Heavy analytics. Don't run heavy analytical queries against your operational Postgres. Fan out via Kafka and use a separate consumer writing to an analytical store like BigQuery, Snowflake, or DuckDB.
Long retention. Add TieredLedger when Postgres storage costs become a concern or compliance requires years of event history.
Start simple, measure, scale when you need to. The same code runs at every scale.