Pandas 3 is one of the most significant releases in the project’s history. It modernizes core behavior, improves developer ergonomics, and expands interoperability with the broader data ecosystem.
Among its headline features are two new integrations: native acceleration of user-defined functions (UDFs) through the Bodo JIT engine, and built-in support for Apache Iceberg tables. In this post, we’ll walk through both integrations and show how to get started.
User-defined functions (UDFs) are very often a performance bottleneck in Pandas. When you use operations like DataFrame.apply() and Series.map(), Pandas executes your function in pure Python. This bypasses Pandas’ optimized C/NumPy vectorized routines and introduces per-element iteration and interpreter overhead, which can dramatically slow down large workloads.
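To see this overhead in isolation, compare a row-wise apply with its vectorized equivalent (a minimal illustration with made-up column names, not the example from the post):

```python
import pandas as pd

df = pd.DataFrame({"a": range(100_000), "b": range(100_000)})

# Row-wise apply: the Python lambda is invoked once per row,
# paying interpreter overhead on every call.
slow = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Vectorized equivalent: a single call into optimized C/NumPy routines.
fast = df["a"] + df["b"]

assert slow.equals(fast)
```

Both produce identical results, but the vectorized version avoids the per-row Python function calls entirely, which is exactly the cost a JIT engine eliminates for UDFs that cannot be expressed as vectorized operations.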
Pandas 3 introduces an “engine” parameter in DataFrame.apply() that allows you to plug in Bodo JIT as the execution backend for UDFs. With Bodo, your function is JIT-compiled to optimized native code and executed in parallel across available CPU cores. The result can be orders-of-magnitude performance improvements, depending on the workload.
The following example (from Marc Garcia’s excellent Pandas 3 blog post) transforms room descriptions such as:
"Superior Double Room with Patio View"
into a structured string like:
"property_type=hotel, room_type=superior double, view=patio"
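The parsing relies only on str.split and str.removesuffix; here is a quick standalone check of that logic on the sample description:

```python
# Parse one room description by hand, mirroring the UDF's string logic.
desc = "Superior Double Room with Patio View".lower()

# Split once on " with " to separate the room type from the extras.
before, after = desc.split(" with ", 1)
room_type = before.removesuffix(" room")
view = after.removesuffix(" view")

assert room_type == "superior double"
assert view == "patio"
```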
On a Mac laptop with 25 million rows, this example runs 7× faster with the Bodo JIT engine.
import pandas as pd
import bodo

def format_room_info(row):
    result = "property_type=" + row["property_type"]
    desc = row["name"].lower()
    if " with " not in desc:
        return result + ", room_type=" + desc.removesuffix(" room")
    before, after = desc.split(" with ", 1)
    result += ", room_type=" + before.removesuffix(" room")
    if after.endswith(" view"):
        result += ", view=" + after.removesuffix(" view")
    elif after.endswith(" bathroom"):
        result += ", bathroom=" + after.removesuffix(" bathroom")
    return result

df = pd.read_parquet("rooms.parquet")
df2 = df.apply(format_room_info, axis=1, engine=bodo.jit())

That’s it: no rewrites, no refactoring. Just add the engine argument.
It’s also possible to use the Numba JIT engine with DataFrame.apply(), but it is limited to numerical code, doesn’t support Pandas data structures, and doesn’t parallelize the computation. The example above fails with the Numba engine because of its string data types.
JIT compilation is powerful, but there are a few considerations to keep in mind, such as compilation overhead on the first call and the subset of Python features the compiler supports.
See the Bodo JIT documentation for full details on supported features and best practices.
While the native Pandas integration with Bodo JIT simplifies acceleration of UDFs, you can also scale all of your Pandas code by replacing

import pandas as pd

with

import bodo.pandas as pd

This enables automatic parallel execution and scalable performance across CPUs and clusters, without rewriting your code.
Apache Iceberg is a modern open table format designed to provide a robust foundation for managing complex data at scale, and it has become the table format of choice for many data teams. It brings database-like features to data lakes, such as ACID transactions, time travel, and fast querying. Pandas 3’s native Iceberg support substantially simplifies working with Iceberg data in Pandas.
Pandas 3 provides pd.read_iceberg() and DataFrame.to_iceberg() for reading and writing Iceberg tables. For example, the code below writes an Iceberg table and reads it back using these APIs. Before running it, create a warehouse directory with mkdir /tmp/warehouse and make sure the PyIceberg package is installed (available through pip and conda).
import pandas as pd
from pyiceberg.catalog import load_catalog

warehouse_path = "/tmp/warehouse"
catalog_properties = {
    "type": "sql",
    "uri": f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
    "warehouse": f"file://{warehouse_path}",
}
catalog = load_catalog("default", **catalog_properties)
catalog.create_namespace_if_not_exists("test")

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
df.to_iceberg("test.test_table", "default", catalog_properties=catalog_properties)
df2 = pd.read_iceberg("test.test_table", "default", catalog_properties=catalog_properties)
print(df2)
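As an alternative to passing catalog_properties in code, PyIceberg can read the same settings from a .pyiceberg.yaml configuration file in your home directory. A sketch of the equivalent configuration (same SQLite catalog and warehouse paths as above):

```yaml
catalog:
  default:
    type: sql
    uri: sqlite:////tmp/warehouse/pyiceberg_catalog.db
    warehouse: file:///tmp/warehouse
```

With this file in place, PyIceberg resolves the "default" catalog from configuration, so the catalog properties don’t need to be repeated in every script.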
The Bodo DataFrames library provides compatible APIs that accelerate and scale Iceberg reads and writes across any number of available CPU cores, from laptops to clusters. In addition, Bodo supports writing tables with Iceberg’s partition spec and sort order features, enabling predicate pushdown for large datasets. Moreover, Bodo supports a simple filesystem catalog so you can get started quickly without any catalog setup (not recommended for production).
Here is the same example written with Bodo DataFrames, using the filesystem catalog:
import bodo.pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})
df.to_iceberg("test_table", location="./warehouse/")
df2 = pd.read_iceberg("test_table", location="./warehouse/")
print(df2)

This version scales seamlessly from a laptop to a distributed cluster. See the Bodo DataFrames documentation for more information.
Pandas 3 is a major leap forward for data processing with native UDF acceleration through Bodo JIT and native Iceberg integration. Bodo’s full compatibility with Pandas enhances the performance and scalability of Pandas code seamlessly and without code rewrites.
To get started using Bodo yourself, check out the Bodo documentation.