DataFrame Processing with Polars

Polars is a high-performance, Rust-powered DataFrame library for Python (and other languages like Rust, Node.js, R, and Ruby) designed for efficient in-memory and out-of-core data processing. It distinguishes itself from libraries like Pandas primarily through its focus on speed, memory efficiency, and parallel processing capabilities, often outperforming Pandas on large datasets.

Key features and advantages of Polars:

1. Performance: Built on Rust, Polars leverages its speed and safety features. It's often significantly faster than Python-native DataFrame libraries, especially for CPU-bound operations on large datasets, and its lazy streaming engine can process datasets that do not fit in RAM.
2. Memory Efficiency: Polars uses memory-efficient data structures and optimizes operations to minimize memory footprint, making it suitable for larger datasets that might strain systems using other libraries.
3. Parallel Processing: Many Polars operations are multi-threaded by default, taking full advantage of modern multi-core CPUs without requiring explicit user intervention for parallelization.
4. Lazy Execution (LazyFrame): Polars supports both 'eager' (immediate execution) and 'lazy' (deferred execution) modes. LazyFrames allow users to define a sequence of operations, which Polars then optimizes and executes only when results are requested (e.g., via `.collect()`). This enables query optimization, predicate pushdown, and projection pushdown, leading to much faster execution and reduced memory usage for complex pipelines (a minimal sketch follows this list).
5. Expressive API: Polars provides an intuitive and functional API for data manipulation, often using 'expressions' which are optimized computational units.
6. Immutable Data Structure: Operations in Polars typically return new DataFrames rather than modifying existing ones in-place, which can lead to more predictable code and easier debugging.
7. Data Types: Supports a wide range of data types, including robust handling for numerical, string, boolean, date/time, categorical, and list types.
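
To make the lazy mode concrete, here is a minimal sketch of a LazyFrame pipeline. The file name sales.csv and its columns are hypothetical; scan_csv, filter, select, group_by, agg, explain, and collect are all part of the public Polars API:

import polars as pl

# Build a query plan lazily; nothing is read or computed yet.
lazy_query = (
    pl.scan_csv("sales.csv")           # hypothetical input file
    .filter(pl.col("price") > 100)     # candidate for predicate pushdown
    .select(["category", "quantity"])  # candidate for projection pushdown
    .group_by("category")
    .agg(pl.sum("quantity").alias("total_quantity"))
)

print(lazy_query.explain())    # inspect the optimized query plan
result = lazy_query.collect()  # execute the plan and materialize a DataFrame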

Common DataFrame processing operations with Polars include:
- Creating DataFrames: From various sources like Python lists, dictionaries, NumPy arrays, or by reading files.
- Reading/Writing Data: Efficiently handles formats like CSV, Parquet, JSON, and Arrow.
- Selecting Columns: Accessing specific columns by name or index.
- Filtering Rows: Subsetting rows based on conditional expressions.
- Adding/Modifying Columns: Creating new columns or transforming existing ones using expressions.
- Grouping and Aggregation: Performing split-apply-combine operations (e.g., sum, mean, count) based on one or more grouping keys.
- Joining/Merging DataFrames: Combining DataFrames based on common keys (e.g., inner, left, outer joins).
- Handling Missing Data: Filling, dropping, or interpolating null values.
- Reshaping Data: Pivoting, unpivoting (melting), and stacking operations (a short sketch follows this list).
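
As a sketch of reshaping, the snippet below pivots a small hypothetical table from long to wide format and back. pivot and unpivot are the current Polars method names (unpivot was called melt in releases before 1.0):

import polars as pl

sales = pl.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 110],
})

# Long -> wide: one revenue column per quarter.
wide = sales.pivot(on="quarter", index="region", values="revenue")

# Wide -> long: back to one row per (region, quarter) pair.
long = wide.unpivot(index="region", variable_name="quarter", value_name="revenue")
print(wide)
print(long)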

Polars is an excellent choice for data scientists and engineers working with medium to large datasets where performance and memory efficiency are critical, offering a modern and powerful alternative to traditional DataFrame libraries.

Example Code

import polars as pl
import datetime

# 1. Creating a DataFrame
data = {
    "product_id": [101, 102, 103, 101, 104, 102, 105],
    "category": ["Electronics", "Books", "Electronics", "Electronics", "Home", "Books", "Books"],
    "price": [120.50, 25.00, 300.75, 120.50, 45.99, 25.00, 15.20],
    "quantity": [2, 1, 1, 3, 2, 4, 1],
    "order_date": [
        datetime.date(2023, 1, 10),
        datetime.date(2023, 1, 15),
        datetime.date(2023, 1, 10),
        datetime.date(2023, 1, 20),
        datetime.date(2023, 1, 22),
        datetime.date(2023, 2, 1),
        datetime.date(2023, 2, 5)
    ],
    "is_discounted": [True, False, True, False, True, False, True]
}
df = pl.DataFrame(data)

print("Original DataFrame:\n", df)
print("\nDataFrame Schema:\n", df.schema)

# 2. Selecting Columns
# Select specific columns
selected_df = df.select(["product_id", "price", "quantity"])
print("\nSelected Columns (product_id, price, quantity):\n", selected_df)

# 3. Filtering Rows
# Filter for products with price > 100
expensive_products = df.filter(pl.col("price") > 100)
print("\nProducts with price > 100:\n", expensive_products)

# Filter for products in 'Electronics' category and quantity > 1
electronics_high_qty = df.filter(
    (pl.col("category") == "Electronics") & (pl.col("quantity") > 1)
)
print("\nElectronics products with quantity > 1:\n", electronics_high_qty)

# 4. Adding/Modifying Columns
# Add a 'total_revenue' column
df_with_revenue = df.with_columns(
    (pl.col("price") - pl.col("quantity")).alias("total_revenue")
)
print("\nDataFrame with 'total_revenue':\n", df_with_revenue)

# Add an 'is_expensive' column based on price
df_with_flags = df_with_revenue.with_columns(
    (pl.col("price") > 150).alias("is_expensive")
)
print("\nDataFrame with 'is_expensive' flag:\n", df_with_flags)

# 5. Grouping and Aggregation
# Calculate total quantity and average price per category
aggregated_df = df_with_revenue.group_by("category").agg(
    pl.sum("quantity").alias("total_quantity_sold"),
    pl.mean("price").alias("average_price")
)
print("\nAggregated by Category:\n", aggregated_df)

# Calculate total revenue per order date
revenue_by_date = df_with_revenue.group_by("order_date").agg(
    pl.sum("total_revenue").alias("daily_total_revenue")
).sort("order_date")
print("\nTotal Revenue by Order Date:\n", revenue_by_by_date)

# 6. Joining DataFrames
# Create another DataFrame for product details
product_details_data = {
    "product_id": [101, 102, 103, 104, 105, 106],
    "product_name": ["Laptop", "Novel", "Monitor", "Vacuum Cleaner", "Cookbook", "Headphones"]
}
product_details_df = pl.DataFrame(product_details_data)

print("\nProduct Details DataFrame:\n", product_details_df)

# Join the two DataFrames on 'product_id'
joined_df = df.join(product_details_df, on="product_id", how="left")
print("\nJoined DataFrame (left join):\n", joined_df)

# 7. Handling Missing Data (example with a hypothetical scenario)
# Let's intentionally introduce some nulls for demonstration
df_with_nulls = df.with_columns(
    pl.Series(["A", "B", None, "C", "D", "E", None]).alias("customer_rating")
)

print("\nDataFrame with potential nulls:\n", df_with_nulls)

# Fill nulls in 'customer_rating' with a default value 'N/A'
df_filled_nulls = df_with_nulls.with_columns(
    pl.col("customer_rating").fill_null("N/A").alias("customer_rating_filled")
)
print("\nDataFrame with nulls filled:\n", df_filled_nulls)

# Drop rows where 'customer_rating' is null
df_dropped_nulls = df_with_nulls.drop_nulls(subset=["customer_rating"])
print("\nDataFrame with nulls dropped:\n", df_dropped_nulls)