PyIceberg is an effective Python library for managing large tabular datasets using the Iceberg open table format, providing robust table metadata management and schema evolution capabilities. A common challenge arises, however, when users want to process the data in these large tables.
The common path for many Python developers is `table.scan(...).to_pandas()`. This presents two significant problems:

- The `to_pandas()` method loads all data into a single node's limited memory, resulting in an `OutOfMemoryError` once the data is large enough.
- The alternative, a distributed framework like Apache Spark, is designed for this scale but requires a complex setup and, crucially, a complete rewrite of existing Pandas code against a new API.
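To see why eager materialization fails while streaming does not, here is a minimal pure-Python illustration (toy data and names, not the PyIceberg or Bodo APIs): the eager version builds the full filtered dataset in memory, while the streaming version folds each record into a running aggregate and holds only a handful of counters at a time.

```python
# Toy illustration: eager vs. streaming aggregation (hypothetical data).
from collections import defaultdict

# Pretend this generator is a scan over a huge Iceberg table: it yields
# one (vendor_id, trip_distance, total_amount) record at a time.
def scan_records(n):
    for i in range(n):
        yield (f"V{i % 3}", float(i % 20), 10.0 + i % 7)

# Eager approach (like to_pandas()): materialize everything first.
rows = [r for r in scan_records(100_000) if r[1] > 10.0]  # whole list in RAM

# Streaming approach (like a distributed engine): constant memory.
sums, counts = defaultdict(float), defaultdict(int)
for vendor, distance, amount in scan_records(100_000):
    if distance > 10.0:            # filter applied per record
        sums[vendor] += amount     # running sum per vendor
        counts[vendor] += 1

means = {v: sums[v] / counts[v] for v in sums}
print(means)  # same result as aggregating the eager list, without the big list
```

On a real table the eager list is what exhausts memory; the streaming loop's footprint stays constant regardless of input size.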
The PyIceberg 0.10 release introduces a new first-class integration to address this: `Table.to_bodo()`. With this method, PyIceberg plugs directly into Bodo's high-performance distributed DataFrame engine, providing parallel execution across multiple cores and nodes and support for datasets larger than memory, all without leaving the Pandas API.
Here is a typical single-node workflow that finds the mean `total_amount` for high-distance trips in NYC Taxi data. This code will fail on any dataset that doesn't fit in RAM.
```python
# Pulls all filtered data into a Pandas DataFrame in memory,
# causing an OutOfMemoryError on large tables.
df = table.scan(row_filter="trip_distance > 10.0").to_pandas()

# Aggregate in memory on a single core.
result = df.groupby('vendor_id')['total_amount'].mean()
result.to_iceberg("namespace.table", catalog_name)
```

To scale this to multiple cores, multiple nodes, and datasets larger than memory, you only need to change the `to_pandas()` call to `to_bodo()`:
```python
# 'df' is a lazy Bodo DataFrame.
df = table.to_bodo()

# Still lazy. Bodo is building a query plan.
df_filtered = df[df['trip_distance'] > 10.0]
df_agg = df_filtered.groupby('vendor_id')['total_amount'].mean()

# Bodo's engine optimizes and executes the entire plan in parallel *now*.
# Data is streamed from the Iceberg table, processed, and aggregated
# across all cores/nodes without running out of memory.
print(df_agg)
```

This second script scales to process terabytes of data, simply by changing one line.
Bodo provides a high-performance, parallel DataFrame engine that is compatible with the standard Pandas API. The `to_bodo()` method creates a lazy Bodo DataFrame that behaves like a regular Pandas DataFrame.
The data processing workflow follows these steps:
1. `table.to_bodo()` is a lazy operation. No data I/O occurs; it creates a `bodo.pandas.DataFrame` object that holds a reference to the Iceberg table.
2. As you chain standard Pandas operations (e.g., `df[...]`, `.groupby()`), Bodo constructs an internal query plan.
3. When a result is actually needed (e.g., `print(df_agg)` or `df.to_iceberg()`), Bodo leverages a robust database-grade query optimizer to analyze and rewrite the entire plan for maximum efficiency. It recognizes the `df['trip_distance'] > 10.0` filter and pushes the predicate down to the Iceberg scan, potentially avoiding reading entire partitions of data.
4. The remaining operations, such as `groupby` and `mean`, are executed by Bodo's parallel operators. The engine streams data through this pipeline, minimizing memory footprint and enabling the processing of datasets much larger than available RAM.
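The mechanism in the steps above can be sketched in pure Python. This is a toy model of lazy plan building, filter pushdown, and streaming execution; every name in it is hypothetical and it is not Bodo's actual implementation:

```python
# Toy lazy-DataFrame sketch: record operations, push the filter into the
# scan, then execute in a streaming fashion. Not Bodo's real internals.
from collections import defaultdict

class LazyFrame:
    def __init__(self, scan, plan=None):
        self.scan = scan            # callable yielding dict records
        self.plan = plan or []      # recorded (not yet executed) operations

    def filter(self, column, threshold):
        # Record the filter instead of applying it (lazy).
        return LazyFrame(self.scan, self.plan + [("filter", column, threshold)])

    def groupby_mean(self, key, value):
        return LazyFrame(self.scan, self.plan + [("groupby_mean", key, value)])

    def execute(self):
        # "Optimizer": push all recorded filters down into the scan so
        # non-matching records are dropped as early as possible.
        filters = [op for op in self.plan if op[0] == "filter"]
        aggs = [op for op in self.plan if op[0] == "groupby_mean"]

        def pushed_scan():
            for rec in self.scan():
                if all(rec[col] > thr for _, col, thr in filters):
                    yield rec

        # Streaming aggregation: one record at a time, constant memory.
        (_, key, value), = aggs
        sums, counts = defaultdict(float), defaultdict(int)
        for rec in pushed_scan():
            sums[rec[key]] += rec[value]
            counts[rec[key]] += 1
        return {k: sums[k] / counts[k] for k in sums}

# Usage on toy records standing in for an Iceberg scan:
def scan():
    for i in range(1000):
        yield {"vendor_id": f"V{i % 2}", "trip_distance": i % 20,
               "total_amount": 5.0 + i % 3}

df = LazyFrame(scan)                                # lazy, no I/O yet
df = df.filter("trip_distance", 10.0)               # still lazy
df = df.groupby_mean("vendor_id", "total_amount")   # still lazy
print(df.execute())                                 # the whole plan runs here
```

The design point this mirrors is that nothing touches the data until `execute()`, so the engine sees the entire plan at once and can reorder it, which is what makes predicate pushdown into the Iceberg scan possible.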
To install PyIceberg with the Bodo integration:

```shell
pip install "pyiceberg[bodo]"
```
By leveraging Bodo's lazy evaluation, query optimization, and parallel execution engine, developers can overcome common memory limitations and achieve substantial performance gains while maintaining the familiar Pandas API.
To get started using Bodo yourself: