A DataFusion extension for reading and writing DuckLake catalogs. DuckLake is an integrated data lake and catalog format that stores metadata in a SQL database and data as Parquet files on disk or object storage.
The goal of this project is to make DuckLake a first-class, Arrow-native lakehouse format inside DataFusion.
This project is maintained by Hotdata with support from the community. Come talk to us on the Hotdata Discord.
- π¦ crates.io: https://crates.io/crates/datafusion-ducklake
- π API docs: https://docs.rs/datafusion-ducklake
- π§© Feature & backend support: see COMPATIBILITY.md
- π¬ Project chat: DataFusion+DuckLake Discord β development and usage discussion
- π§‘ Meet the team: Hotdata Discord
Add the crate:
cargo add datafusion-ducklakeThe default build includes the DuckDB catalog backend, statically bundled. Other backends and write support are opt-in via feature flags β see COMPATIBILITY.md for the full matrix.
# Cargo.toml β read PostgreSQL catalogs
# (for the experimental multi-catalog write path, use features = ["write-postgres"])
[dependencies]
datafusion-ducklake = { version = "0.3", features = ["metadata-postgres"] }The examples below also use datafusion, object_store, and url directly β add them
to your [dependencies] as well (this crate does not re-export them). The write example
additionally uses sqlx (with its postgres and runtime-tokio features) to open the
connection pool.
Run a query against an existing PostgreSQL catalog with the bundled example:
cargo run --example basic_query --features metadata-postgres -- \
"postgresql://user:password@localhost:5432/database" "SELECT * FROM main.users"(The example also accepts DuckDB, SQLite, and MySQL connection strings with the matching
metadata-* feature β see COMPATIBILITY.md.)
Register a DuckLakeCatalog with a SessionContext and query it with normal SQL as
catalog.schema.table:
use datafusion::execution::runtime_env::RuntimeEnv;
use datafusion::prelude::*;
use datafusion_ducklake::{DuckLakeCatalog, PostgresMetadataProvider};
use object_store::ObjectStore;
use object_store::aws::AmazonS3Builder;
use std::sync::Arc;
use url::Url;
// (inside an async fn)
// Read metadata from a PostgreSQL catalog
let provider = PostgresMetadataProvider::new("postgresql://user:pass@localhost:5432/db").await?;
// Register object stores for any non-local data (S3 / MinIO)
let runtime = Arc::new(RuntimeEnv::default());
let s3: Arc<dyn ObjectStore> = Arc::new(
AmazonS3Builder::new()
.with_endpoint("http://localhost:9000") // MinIO endpoint
.with_bucket_name("ducklake-data")
.with_access_key_id("minioadmin")
.with_secret_access_key("minioadmin")
.with_region("us-west-2") // any region works for MinIO
.with_allow_http(true) // required for http:// endpoints
.build()?,
);
runtime.register_object_store(&Url::parse("s3://ducklake-data/")?, s3);
let catalog = DuckLakeCatalog::new(provider)?;
let ctx = SessionContext::new_with_config_rt(
SessionConfig::new().with_default_catalog_and_schema("ducklake", "main"),
runtime,
);
ctx.register_catalog("ducklake", Arc::new(catalog));
let df = ctx.sql("SELECT * FROM ducklake.main.my_table").await?;
df.show().await?;PostgreSQL writes go through the experimental multi-catalog layout described in the
next section β treat them as a preview. Tables
are created through the writer API; once a table exists, you append to it with SQL
INSERT INTO. (SQL CREATE TABLE / CTAS is not supported on this path β DataFusion
cannot create the schema, so the first write goes through DuckLakeTableWriter.)
use datafusion::prelude::*;
use datafusion_ducklake::metadata_writer::MetadataWriter; // set_data_path
use datafusion_ducklake::{
DuckLakeCatalog, DuckLakeTableWriter, MulticatalogManager, MulticatalogProvider,
PostgresMetadataWriter, initialize_multicatalog_schema,
};
use sqlx::postgres::PgPoolOptions;
use std::sync::Arc;
let pool = PgPoolOptions::new().connect("postgresql://user:pass@localhost:5432/db").await?;
// One-time: bootstrap the multi-catalog tables, then create a named catalog
initialize_multicatalog_schema(&pool).await?;
let catalog_id = MulticatalogManager::new(pool.clone()).create_catalog("my_catalog").await?;
// Create a table by writing the first batch through the table writer
let writer = Arc::new(PostgresMetadataWriter::with_pool(pool.clone(), catalog_id).await?);
writer.set_data_path("/abs/path/to/data")?;
let object_store: Arc<dyn object_store::ObjectStore> =
Arc::new(object_store::local::LocalFileSystem::new());
let table_writer = DuckLakeTableWriter::new(writer.clone(), object_store)?;
table_writer.write_table("public", "events", &[batch]).await?; // `batch` is your RecordBatch
// Now append with SQL, reading the same catalog back through MulticatalogProvider
let provider = MulticatalogProvider::with_pool(pool.clone(), "my_catalog").await?;
let catalog = DuckLakeCatalog::with_writer(Arc::new(provider), writer)?;
let ctx = SessionContext::new();
ctx.register_catalog("ducklake", Arc::new(catalog));
ctx.sql("INSERT INTO ducklake.public.events VALUES (1, 'a')").await?.collect().await?;
ctx.sql("SELECT count(*) FROM ducklake.public.events").await?.show().await?;Writer output is configurable (Parquet compression, row-group sizing by row count and
byte size). See DuckLakeTableWriter for the
writer options.
Writing to a standard, single-catalog DuckLake store (the spec-compliant layout) is supported today for SQLite via
SqliteMetadataWriter(featurewrite-sqlite), where SQLCREATE TABLE AS SELECTandINSERT INTOboth work. Seetests/sql_write_tests.rs.
A single PostgreSQL metadata store can host multiple independent DuckLake catalogs β useful for multi-tenant deployments or keeping many logical lakehouses in one database.
β οΈ Experimental and library-specific. This multi-catalog layout is not part of the DuckLake specification and is not (yet) supported or accepted upstream. Catalogs written this way are read back only through this crate'sMulticatalogProviderβ they are not interchangeable with a standard single-catalog DuckLake store. The API and on-disk/in-catalog layout may change. PostgreSQL write support currently depends on this path, so treat it as a preview.
- Create and manage catalogs with
MulticatalogManager(featurewrite-postgres):initialize_multicatalog_schemabootstraps the shared tables, thencreate_catalog, anddrop_table_in_catalogmanage their contents. - Read a specific catalog with
MulticatalogProvider::with_pool(pool, "name")(featuremulticatalog-postgres), which plugs into aDuckLakeCataloglike any other metadata provider.
See examples/multicatalog_write.rs for an end-to-end
walkthrough (bootstrap β create catalogs β write β read back).
The maintenance API handles lakehouse upkeep from Rust: expiring old snapshots,
cleaning up superseded files, and reclaiming orphaned files. The concrete entry points
are backend-gated (write-sqlite / write-postgres). DROP TABLE is available through
MetadataWriter. See
examples/maintenance_demo.rs and
examples/orphan_cleanup_demo.rs.
For the full breakdown of catalog backends, object stores, types, capabilities, and current limitations, see COMPATIBILITY.md.
A few highlights worth knowing up front:
- Reads work on DuckDB, SQLite, PostgreSQL, and MySQL; writes are SQLite/PostgreSQL only.
- Object stores: local filesystem and S3-compatible (S3, MinIO).
- Snapshots can be selected programmatically (
DuckLakeCatalog::with_snapshot), but there is no SQL-level time travel (AS OF) yet, and no partition-based file pruning. - Data inlined by DuckDB's ducklake extension is not read β see COMPATIBILITY.md for
the
COUNT(*)undercount caveat and how to avoid it.
This project is in alpha and evolving alongside DataFusion and DuckLake. APIs may change as core abstractions are refined. See CHANGELOG.md for release history. Feedback, issues, and contributions are welcome.