DuckDB

DuckDB is an open-source, in-process SQL OLAP (Online Analytical Processing) database management system. Designed for analytical workloads, it aims to be the SQLite for analytics, offering high performance for complex queries directly within the application's process. Unlike traditional client-server databases, DuckDB runs embedded within your application, eliminating the need for a separate server process or network communication.

Key features of DuckDB include:
- In-process and Serverless: It runs directly within your application (e.g., Python script, R session), making it lightweight and easy to integrate without deployment overhead.
- OLAP Optimized: Built from the ground up for analytical queries, it utilizes a columnar-vectorized execution engine, which is highly efficient for aggregations, joins, and scans on large datasets.
- SQL Standard Compliance: Supports a rich set of SQL features, including complex joins, subqueries, window functions, and common table expressions (CTEs).
- High Performance: Vectorized execution over columnar storage processes data in batches rather than row by row, which typically makes analytical queries much faster than on row-oriented transactional databases.
- ACID Transactions: Provides full ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring data integrity.
- Direct Data Integration: Seamlessly integrates with various data formats and sources. It can directly query data from Parquet files, CSVs, JSON, Apache Arrow tables, and Pandas DataFrames without prior loading.
- Python Integration: Has a first-class Python API, making it a popular choice for data scientists and analysts working with Python.
- Memory Efficiency: Designed to handle datasets larger than memory by efficiently spilling to disk when necessary.

Common use cases for DuckDB include:
- Local Data Analysis: Performing quick analytical queries on local datasets without spinning up a full database server.
- ETL (Extract, Transform, Load) Processes: Transforming and loading data efficiently.
- Data Science and Machine Learning Pipelines: As a fast intermediary for data manipulation.
- Prototyping and Ad-hoc Queries: Quickly testing ideas and exploring data.
- Edge Computing: Deploying analytical capabilities in resource-constrained environments.

DuckDB fills a niche similar to SQLite but for analytical workloads, offering simplicity and speed where a full-fledged data warehouse might be overkill.

Example Code

import duckdb
import pandas as pd

# 1. Connect to an in-memory DuckDB database
# You can also connect to a file: duckdb.connect('my_database.duckdb')
con = duckdb.connect(database=':memory:', read_only=False)

print("--- Creating a table and inserting data ---")
# 2. Create a table
con.execute("""
    CREATE TABLE sales (
        product VARCHAR,
        region VARCHAR,
        sales_amount DECIMAL(10, 2),
        sale_date DATE
    );
""")

# 3. Insert data
con.execute("""
    INSERT INTO sales VALUES
    ('Laptop', 'North', 1200.50, '2023-01-15'),
    ('Mouse', 'South', 25.99, '2023-01-16'),
    ('Keyboard', 'North', 75.00, '2023-01-15'),
    ('Monitor', 'East', 300.75, '2023-01-17'),
    ('Laptop', 'West', 1150.00, '2023-01-18'),
    ('Mouse', 'North', 28.50, '2023-01-19'),
    ('Keyboard', 'South', 70.00, '2023-01-20'),
    ('Monitor', 'West', 320.00, '2023-01-21');
""")

# 4. Query data and fetch the result as a Pandas DataFrame
print("\n--- All sales records ---")
result = con.execute("SELECT * FROM sales;").fetchdf()
print(result)

# 5. Perform an analytical query (e.g., total sales per region)
print("\n--- Total sales per region ---")
result = con.execute("""
    SELECT
        region,
        SUM(sales_amount) AS total_sales
    FROM
        sales
    GROUP BY
        region
    ORDER BY
        total_sales DESC;
""").fetchdf()
print(result)

# 6. Query a Pandas DataFrame directly, without importing it into a table first
print("\n--- Querying directly from a Pandas DataFrame ---")
df_products = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Apple', 'Banana', 'Orange', 'Grape'],
    'price': [1.20, 0.50, 0.75, 2.10]
})

# Register the DataFrame as a view in DuckDB
con.register('products_df', df_products)

# Now you can query it like a regular table
result_df_query = con.execute("SELECT name, price FROM products_df WHERE price > 1.0;").fetchdf()
print(result_df_query)

# 7. Close the connection
con.close()
print("\nDuckDB connection closed.")
