DataFrames have become the go-to abstraction for working with structured data in Python—and for good reason. Lightweight, intuitive, and versatile, libraries like Pandas have empowered millions of developers to explore and transform data easily. But as the scale and complexity of workloads have grown, a new question has emerged:
Can DataFrames evolve to meet the demands of modern, large-scale data processing—without losing what made them great in the first place?
The reality is that today's DataFrame libraries form a fractured ecosystem full of trade-offs and compromises. If you've ever tried to scale Pandas on a large dataset, you've likely hit this frustration: the API is great, but the performance and scalability aren't there. So you reach for other libraries like Dask, Modin, Polars, or Spark, but each comes with trade-offs. You might lose full Pandas compatibility. You might have to learn a new API. You might have to wrestle with unpredictable scheduler behavior. The ecosystem today is fragmented, and no solution nails usability, scalability, and performance all at once.
And why should scaling data processing mean leaving behind the tools people actually want to use?
We believe it’s time for a new kind of DataFrame library—one that combines the ease and elegance of Pandas, the performance of data warehouses, and the scalability of high-performance computing.
In this post, we outline the essential design features of a modern scalable DataFrame library, evaluate current solutions (such as PySpark, Dask, Polars, and Daft), and introduce the design of the upcoming Bodo DataFrame library, built specifically to bridge these gaps.
To meet the demands of modern data workloads, a DataFrame library must go far beyond the single-threaded, in-memory tools of the past. It needs to feel as intuitive as Pandas, think like a query planner, and run like a high-performance compute engine. In our view, a robust DataFrame library must be built around three foundational components:

1. A Pandas-compatible API layer for usability
2. A database-grade query optimizer for efficient query plans
3. A high-performance, scalable execution backend
Let’s break each of these down and see how they come together in the Bodo DataFrame library.
The API layer is key to usability. Pandas’ intuitive APIs have become the gold standard in tabular data manipulation—known and loved by Python developers worldwide. We believe a high-performance DataFrame library must offer full Pandas API compatibility, enabling users to effortlessly transition without rewriting significant code.
That’s why the Bodo DataFrame library is fully compatible with the Pandas API. It’s designed to be a drop-in replacement. You should be able to switch from:
import pandas as pd
df = pd.read_parquet("s3://bucket/path/to/file.parquet")
df = df[df["A"] > 0]
df["B"] = df.apply(lambda x: x.B + 1, axis=1)
df.to_parquet("s3://bucket/path/to/output.parquet")
to:
import bodo.pandas as pd
df = pd.read_parquet("s3://bucket/path/to/file.parquet")
df = df[df["A"] > 0]
df["B"] = df.apply(lambda x: x.B + 1, axis=1)
df.to_parquet("s3://bucket/path/to/output.parquet")
and have your code just work.
This compatibility allows developers to immediately take advantage of Bodo’s scalability and performance, without refactoring their entire codebase or learning a new API.
But it’s not just about matching Pandas line-for-line—it’s about feeling truly Pythonic. That means no awkward wrappers around non-Python runtimes, no JVM dependency hell, and no cryptic errors.
And while lazy evaluation is essential for performance, it shouldn’t be something the user has to think about. With Bodo, the APIs behave as if they’re eager: you write your code normally, and execution happens automatically at the right time—no explicit .compute() calls needed.
Under the hood, Bodo lazily collects operations for optimization. But when it’s time to run, it does so automatically. If a feature isn’t implemented yet, Bodo gracefully falls back to Pandas or JIT execution to maintain full compatibility—so your workflow isn’t interrupted.
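To see the difference in practice, here is a small sketch contrasting the two models. The Dask half uses Dask's real API; the Bodo half simply repeats the example from above and assumes the drop-in compatibility this post describes:

```python
# Dask: lazy by default, and you must trigger execution explicitly.
import dask.dataframe as dd

ddf = dd.read_parquet("s3://bucket/path/to/file.parquet")
filtered = ddf[ddf["A"] > 0]
result = filtered.compute()  # nothing runs until this call

# Bodo (as described above): the same code as Pandas, no .compute() needed.
import bodo.pandas as pd

df = pd.read_parquet("s3://bucket/path/to/file.parquet")
df = df[df["A"] > 0]
print(df.head())  # execution happens automatically when results are needed
```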
The result: a DataFrame experience that feels just like Pandas, but scales like an HPC system.
Efficient processing of large datasets requires the sophisticated optimizations pioneered by databases and data warehouses, such as column pruning, filter pushdown, and join reordering. These optimizations are critical: they can change performance by several orders of magnitude.
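To make the first two concrete: without an optimizer, you have to apply them by hand. The sketch below shows manual column pruning and filter pushdown using standard Pandas/PyArrow read options (the path and column names are illustrative); a query optimizer derives the same savings automatically from the rest of your program.

```python
import pandas as pd

# Manual column pruning and filter pushdown at the Parquet scan.
# An optimizer infers both from the program automatically.
df = pd.read_parquet(
    "s3://bucket/path/to/file.parquet",
    columns=["A", "B"],        # column pruning: read only the columns used
    filters=[("A", ">", 0)],   # filter pushdown: skip non-matching row groups
)
```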
Industrial-grade query optimizers are complex and often require over five years of development. To accelerate development, the Bodo DataFrame library strategically leverages DuckDB’s mature, database-grade optimizer, which employs a phased approach with cost-based join reordering and has demonstrated high performance in practice (producing join orderings similar to Snowflake’s in our testing).
Here is part of DuckDB’s optimized query plan for TPC-H Q5, where it decided to join the small nation and region tables first and use the output to essentially filter the much larger customer table. This is an example of a great decision by the optimizer.
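If you want to inspect a plan like this yourself, here is a minimal sketch using the DuckDB Python package and its bundled tpch extension (an illustration of DuckDB tooling, not part of Bodo's API):

```python
import duckdb  # pip install duckdb

con = duckdb.connect()
con.sql("INSTALL tpch")
con.sql("LOAD tpch")
con.sql("CALL dbgen(sf=0.1)")  # generate a small TPC-H dataset in memory

# Fetch the text of TPC-H Q5 and print DuckDB's optimized physical plan.
q5 = con.sql("SELECT query FROM tpch_queries() WHERE query_nr = 5").fetchone()[0]
print(con.sql("EXPLAIN " + q5).fetchall()[0][1])
```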
The backend execution engine must deliver superior performance at all scales, from local laptops to massive cloud clusters. Rather than the task-based parallelism common in current libraries, HPC-style data parallelism with the Message Passing Interface (MPI) significantly boosts efficiency and scalability [1].
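For intuition, here is a minimal SPMD-style sketch with mpi4py (our own illustration; Bodo's actual engine is far more sophisticated): every process owns a slice of the data, and a single collective combines the partial results, with no central task scheduler in the loop.

```python
# Run with: mpiexec -n 4 python sketch.py
# Each rank processes only its own partition; a collective combines results.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

data = np.arange(1_000_000, dtype=np.float64)  # stand-in for a column
chunk = np.array_split(data, size)[rank]       # this rank's partition

local_sum = chunk.sum()                         # compute on local data only
total = comm.allreduce(local_sum, op=MPI.SUM)   # one collective, no scheduler

if rank == 0:
    print(total)
```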
To manage large-scale data without frequent out-of-memory (OOM) errors, streaming execution (termed "vectorized execution" in the database literature [2]) and spill-to-disk mechanisms are essential.
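The streaming half of this idea can be sketched with PyArrow (illustrative only; spill-to-disk is not shown): the file is processed one batch at a time, so memory use is bounded by the batch size rather than the dataset size.

```python
import pyarrow.parquet as pq

# Stream a Parquet file batch by batch instead of loading it whole.
pf = pq.ParquetFile("file.parquet")  # hypothetical input file
running_total = 0.0
for batch in pf.iter_batches(batch_size=64_000, columns=["A"]):
    col = batch.column("A").to_numpy(zero_copy_only=False)
    running_total += col[col > 0].sum()  # operate on the small batch, then discard it
print(running_total)
```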
The Bodo DataFrame library incorporates:

- HPC-style, MPI-based data parallelism for efficiency and scalability
- Streaming (vectorized) execution that processes data in batches
- Spill-to-disk support to avoid OOM failures on large datasets
We evaluated existing solutions against our three criteria: ease of use, query optimization capability, and execution backend performance. While many tools excel in one or two areas, none fully meets the demands of modern, large-scale analytics across all three. The table below summarizes these findings and illustrates why we are building the Bodo DataFrame library—to deliver a solution that combines the simplicity of Pandas with the power and scalability of a data warehouse.
The upcoming Bodo DataFrame library uniquely combines the simplicity and usability of Pandas with database-grade optimization and HPC-class performance. By fundamentally rethinking DataFrame libraries, we’re enabling data scientists and AI practitioners to seamlessly scale their workflows.
Interested in trying it out? Follow our GitHub repository for updates, and join our Slack community to engage directly with our team.
[1] https://www.youtube.com/watch?v=DJ1sGQryoAc
[2] https://15445.courses.cs.cmu.edu/fall2024/slides/13-queryexecution1.pdf