Bodo Native Integration in PyIceberg 0.10: Bringing Scalability to PyIceberg with Pandas APIs

Date: November 12, 2025
Author: Isaac Warren

PyIceberg is an effective Python library for managing large tabular datasets using the Apache Iceberg open table format, providing robust table metadata and schema evolution capabilities. A common challenge arises, however, when users want to process the data in these large tables.

The common path for many Python developers is table.scan(...).to_pandas(). This presents two significant problems:

  1. It's single-core: data transformations run on a single thread, which is very slow for large datasets and computations.
  2. It's memory-bound: to_pandas() loads all of the data into a single node’s memory, resulting in an OutOfMemoryError once the data is large enough.

The alternative, a distributed framework like Apache Spark, is designed for this scale but requires a complex setup and, crucially, a complete rewrite of existing Pandas code to a new API.

The PyIceberg 0.10 release introduces a new first-class integration to address this: Table.to_bodo(). With this method, PyIceberg now plugs directly into Bodo’s high-performance distributed DataFrame engine, providing:

  • Automatic multi-core and multi-node parallelism with an HPC/MPI-based C++ backend
  • Optimized query plans from a database-grade optimizer
  • Streaming operator execution for datasets larger than cluster memory
  • A JIT compiler for accelerating UDFs (sketched below)

All without leaving the Pandas API.
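
As an example of that last point, a row-wise UDF written in plain Python can be JIT-compiled and run in parallel by Bodo. Here is a minimal sketch, assuming df is a lazy Bodo DataFrame over the NYC Taxi table used later in this post (the tip_rate function and its handling of zero totals are illustrative, not from the release):

# Plain-Python UDF; Bodo JIT-compiles it rather than interpreting it row by row.
def tip_rate(row):
    if row['total_amount'] <= 0:
        return 0.0
    return row['tip_amount'] / row['total_amount']

# Standard Pandas apply; recorded lazily as part of Bodo's query plan
# and executed in parallel when a result is needed.
rates = df.apply(tip_rate, axis=1)
print(rates.mean())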

From Single-Core to Scalable: A Code Comparison

Here is a typical single-node workflow that finds the mean total_amount for high-distance trips in the NYC Taxi data. This code will fail on any dataset that doesn't fit in RAM.

Standard Eager Processing (to_pandas())

# Pulls all filtered data into a Pandas DataFrame in memory,
# causing OutOfMemoryError on large tables.
df = table.scan(row_filter="trip_distance > 10.0").to_pandas()

# Aggregate in-memory on a single core.
result = df.groupby('vendor_id')['total_amount'].mean()

# Write the result back to an Iceberg table (to_iceberg is a DataFrame
# method, so convert the Series first).
result.reset_index().to_iceberg("namespace.table", catalog_name)

To scale this to multiple cores, multiple nodes, and datasets larger than memory, you only need to change the to_pandas() call to to_bodo() and express the filter in regular Pandas:

Bodo Parallel Processing (to_bodo())

# 'df' is a lazy Bodo DataFrame.
df = table.to_bodo()

# Still lazy. Bodo is building a query plan.
df_filtered = df[df['trip_distance'] > 10.0]
df_agg = df_filtered.groupby('vendor_id')['total_amount'].mean()

# Bodo's engine optimizes and executes the entire plan in parallel *now*.
# Data is streamed from the Iceberg table, processed, and aggregated
# across all cores/nodes without running out of memory.
print(df_agg)

This second script scales to process terabytes of data, simply by changing one line.

How it Works: From Pandas API to Parallel Execution

Bodo provides a high-performance, parallel DataFrame engine that is compatible with the standard Pandas API. The to_bodo() method returns a lazy Bodo DataFrame that behaves like a regular Pandas DataFrame.

The data processing workflow follows these steps:

  1. Lazy Query Construction: table.to_bodo() is a lazy operation; no data I/O occurs. It creates a bodo.pandas.DataFrame object that holds a reference to the Iceberg table. As you chain standard Pandas operations (e.g., df[...], .groupby()), Bodo builds an internal query plan.
  2. Plan Optimization: When an action is triggered (e.g., print(df_agg) or df.to_iceberg()), Bodo leverages a database-grade query optimizer to analyze and rewrite the entire plan for maximum efficiency (see the timing sketch after this list).
  3. Automatic Filter Pushdown: The optimizer identifies operations, such as the df['trip_distance'] > 10.0 filter, and pushes these predicates down into the Iceberg scan, potentially avoiding reading entire partitions of data.
  4. Optimized, Parallel I/O: Bodo uses PyIceberg's metadata to identify only the specific data files and Parquet row groups that match the predicates. This data is then read in parallel across all available cores and cluster nodes.
  5. Parallel Compute: Subsequent operations, like groupby and mean, are executed by Bodo's parallel operators. The engine streams data through this pipeline, minimizing the memory footprint and enabling the processing of datasets much larger than available RAM.
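
To make the lazy-then-execute behavior concrete, here is a minimal sketch reusing the table object from the example above (the timing is illustrative, not a benchmark):

import time

df = table.to_bodo()  # Step 1: lazy, no I/O yet

t0 = time.time()
df_agg = df[df['trip_distance'] > 10.0].groupby('vendor_id')['total_amount'].mean()
print(f"Plan built in {time.time() - t0:.4f}s")  # near-instant: nothing has executed

t0 = time.time()
print(df_agg)  # Steps 2-5: optimize, push the filter into the scan, read and compute in parallel
print(f"Executed in {time.time() - t0:.2f}s")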

Usage and Requirements

  • Requires: PyIceberg 0.10.0 or later.
  • Installation: The Bodo integration is an optional dependency of PyIceberg.
pip install "pyiceberg[bodo]"
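
For a minimal end-to-end quickstart, the sketch below assumes a configured catalog named "default" and a table nyc.taxis (both hypothetical; substitute your own names):

from pyiceberg.catalog import load_catalog

# Loads catalog settings from ~/.pyiceberg.yaml or environment variables.
catalog = load_catalog("default")
table = catalog.load_table("nyc.taxis")

# Lazy Bodo DataFrame; no data is read until a result is needed.
df = table.to_bodo()
print(df[df['trip_distance'] > 10.0]['total_amount'].mean())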

Wrapping Up

By leveraging Bodo's lazy evaluation, query optimization, and parallel execution engine, developers can overcome common memory limitations and achieve substantial performance gains while maintaining the familiar Pandas API.

To get started with Bodo yourself, or to see it in action on your own workload, schedule a demo with a Bodo expert.