Bodo DataFrames vs PySpark and Dask on TPC-H Benchmarks

Date: February 5, 2026
Authors: Todd A. Anderson and Scott Routledge

In our last blog on the NYC Taxi benchmark, we showed that Bodo DataFrames was much faster than other dataframe systems—all while requiring almost no changes to existing Pandas code. That benchmark highlighted a common tension in the Python analytics ecosystem: systems that preserve the Pandas programming model often struggle to scale efficiently, while systems that scale well typically require adopting new APIs and execution semantics.  

In this post, we evaluate that trade-off using the TPC-H benchmark, a widely used standard for large-scale analytical and decision-support workloads. TPC‑H stresses systems with complex joins, aggregations, and filtering across gigabytes of synthetic relational data. These characteristics make it a useful benchmark for understanding how different dataframe systems handle optimization, memory, and distributed execution.

To focus specifically on distributed behavior, we ran the full TPC‑H query suite at scale factor 1000 (~1000 GB). At this scale, single-node dataframe systems typically exhaust memory. In distributed execution, by contrast, performance is dominated by join costs, which depend heavily on join order, communication between nodes, and intermediate materialization. We compare Bodo DataFrames, Dask, and PySpark on a cluster of four Amazon EC2 instances (128 physical cores). All benchmark code, configurations, and TPC-H query implementations are available in our GitHub repository.

Executive summary

At scale factor 1000:

  • Bodo DataFrames completed all 22 TPC-H queries in a total runtime of 930 seconds. All Bodo runs used the standard Pandas-based TPC-H implementations without modification. 
  • PySpark (Pandas on Spark API) completed all the queries in 5,000 seconds, over 5x slower than Bodo. Individual queries were 1.4x–17x slower, with the largest slowdowns in join-heavy queries, particularly those with multiple join predicates or complex join conditions. These results reflect overheads in PySpark's execution model, including JVM–Python serialization and intermediate materialization under complex query plans.
  • Dask completed all the queries in ~114,000 seconds, over 122x slower than Bodo. The primary driver of this slowdown was the cluster configuration, which was not a good fit for Dask's architecture. Dask typically performs best on clusters with many smaller workers; our setup used four large EC2 nodes for an apples-to-apples comparison with Bodo and PySpark. On these large, multi-core nodes, Dask's default deployment model of a single Python worker per node with many threads became a bottleneck. A smaller factor was Dask's lack of a global, cost-based relational optimizer and the fact that it was not designed for communication-intensive, multi-table analytics workloads like TPC-H.

For Pandas developers looking to scale their applications without the cost and risk of a full refactor, Bodo DataFrames offers a compelling alternative: HPC-level speed with the lowest migration effort, achieved by pairing a database-grade query planner with a streaming C++ backend built on MPI.

Here we break down the times per individual query.

The scale of the Dask results can make it hard to see the difference between Bodo and PySpark, so below we repeat the graph with the Dask data removed.

Setup

Cloud Provider: AWS
Number of Nodes: 4
EC2 Instance Type: r6i.16xlarge
Total Physical Cores: 128
Network Performance: 25,000 Megabit (25 Gbps)
Memory per Node: 512 GiB
Instance Cost (On-Demand): $4.032/hour

Each system was run using its typical distributed deployment on AWS:

  • Bodo DataFrames was executed on the Bodo platform, which provisions and manages the distributed environment automatically. All compute nodes participate directly in execution, with no separate scheduler node required.
  • Dask DataFrame was deployed using Dask CloudProvider, which creates a Dask cluster consisting of worker nodes and a separate scheduler instance (see the sketch after this list). The scheduler runs on a c6i.xlarge instance and does not participate in computation, but is required to coordinate task execution.
  • PySpark was run on Amazon EMR, using Spark's standard driver/executor model with JVM-based execution. The cluster manager runs on a single c6i.xlarge instance and does not participate in computation.
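
For Dask, provisioning looked roughly like the sketch below. The arguments are illustrative, not our exact settings; the actual configuration, including the scheduler instance type, is in our GitHub repo:

from dask.distributed import Client
from dask_cloudprovider.aws import EC2Cluster

# Provision a Dask cluster on EC2: four worker nodes plus a separate
# scheduler instance that coordinates task execution but does not compute.
cluster = EC2Cluster(instance_type="r6i.16xlarge", n_workers=4)
client = Client(cluster)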

Software Versions

Bodo: 2025.12.2
PySpark: 3.5.2
Dask: 2025.11.0

Benchmark Methodology

TPC‑H consists of 22 fixed SQL queries over a normalized relational schema modeling customers, orders, suppliers, and line items. The queries are designed to stress:

  • Multi-table joins
  • Large aggregations
  • Sorting and filtering across multiple large tables

Because these queries combine relational complexity with large data volumes, performance is highly sensitive to query planning, join strategy selection, memory management, and data movement.

Strictly speaking, any implementation that departs from the original, unchanged SQL queries cannot be considered true TPC-H. However, we refer to each translation of the SQL originals into the equivalent APIs of the various dataframe systems as if it were TPC-H. The source code and measurement scripts are available in our GitHub repo. All of the Dask TPC-H code was obtained from an official release by the Dask team. For PySpark, we used the Bodo queries with the Pandas on Spark API but had to modify 7 of the queries to work around gaps in that API. For example, Pandas on Spark does not support the Series.isin() operator, so we replaced isin() with an equivalent join. We also had to replace join types unsupported in Pandas on Spark with supported joins plus additional operations (e.g., a filter).
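
As an illustration, here is roughly what the isin() workaround looks like in Pandas on Spark. This is a simplified sketch with hypothetical table and column names, not the exact code from the repo:

import pyspark.pandas as ps

def filter_by_keys(orders: ps.DataFrame, selected_keys: ps.Series) -> ps.DataFrame:
    # Equivalent of orders[orders["o_custkey"].isin(selected_keys)]:
    # turn the key Series into a one-column, deduplicated DataFrame...
    keys = selected_keys.drop_duplicates().to_frame(name="o_custkey")
    # ...then an inner join keeps exactly the rows whose o_custkey appears
    # in the key set, matching isin() semantics for non-null keys.
    return orders.merge(keys, on="o_custkey", how="inner")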

Each query was executed independently, with no shared state or cached data between runs. We measured end-to-end runtime for all 22 queries, including the I/O of reading Parquet files from S3. This approach captures not only computation but also data movement costs, which are often excluded from benchmarks but frequently dominate real-world analytical workloads at scale.
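
The shape of the per-query measurement is sketched below; the function and argument names are illustrative, and the actual scripts are in the repo:

import time

def time_query(query_fn, data_root):
    # The timer covers the full query end to end: reading the Parquet
    # inputs from S3, executing the computation, and materializing the
    # final result.
    start = time.perf_counter()
    result = query_fn(data_root)  # e.g., query_fn = tpch_q01
    elapsed = time.perf_counter() - start
    return result, elapsed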

DataFrame Library Overview

Although all three systems expose DataFrame-style APIs, their execution models differ fundamentally. The observed performance differences in TPC-H can largely be explained by how each system represents computations, optimizes query plans, and executes distributed workloads.

Bodo DataFrames

Bodo DataFrames delivers HPC-level speed and out-of-core, multi-node scale with a one-line import and no refactoring. Key architectural characteristics:  

  • Lazy capture of Pandas operations: Pandas operations are intercepted and represented as a logical relational plan rather than being executed eagerly. This enables whole-query optimization across chained operations, similar to a SQL engine.
  • Cost-based relational optimization: The logical plan is optimized using a database-style optimizer capable of reordering joins, pushing down filters and projections, selecting join strategies, and eliminating unnecessary intermediate materialization. These optimizations are particularly important for TPC-H queries, which often involve long operator chains and multiple joins over large tables.
  • C++, streaming execution backend: The optimized plan is lowered to a compiled C++ execution engine. Operators execute in a streaming, vectorized fashion, consuming and producing data incrementally rather than materializing full intermediate DataFrames. This significantly reduces memory pressure and improves cache locality.

Because execution is compiled and streaming, joins and aggregations execute quickly, and large intermediate results are avoided whenever possible. This execution model maps well to TPC-H, where joins and aggregations dominate runtime.
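
To make this concrete, the sketch below shows a hypothetical TPC-H-style fragment written as ordinary Pandas; the paths and the exact query shape are illustrative, not one of the benchmark queries:

import bodo.pandas as pd

# Ordinary Pandas code; under bodo.pandas these calls build a lazy
# relational plan instead of executing eagerly.
lineitem = pd.read_parquet("s3://bucket/tpch/lineitem")
orders = pd.read_parquet("s3://bucket/tpch/orders")

# The optimizer can push this filter down into the Parquet scan...
recent = orders[orders["o_orderdate"] >= "1995-01-01"]

# ...and stream the join and aggregation, avoiding materialization of
# the full joined table in memory.
joined = lineitem.merge(recent, left_on="l_orderkey", right_on="o_orderkey")
revenue = joined.groupby("o_orderpriority")["l_extendedprice"].sum()
print(revenue)  # execution is triggered when results are consumed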

PySpark

PySpark is a widely adopted cluster workhorse, built on the Apache Spark distributed processing engine, which runs on the Java Virtual Machine (JVM). Its strength lies in its maturity and its ability to scale to massive workloads. However, its Pandas on Spark API is distinct from the standard Pandas ecosystem, which raises the porting cost for developers. Migrating from Pandas to PySpark can require substantial code rewrites to adopt Spark's execution semantics and API; in this benchmark, 7 of the 22 TPC-H queries needed modification, and such ports often demand specialized knowledge.

Once ported, PySpark automatically manages the distribution of data and computations across the cluster. Performance differences in benchmarks like TPC-H largely stem from its execution model and inherent overheads. Spark's core execution model is based on Resilient Distributed Datasets (RDDs) and an optimized DataFrame/SQL engine (Catalyst Optimizer). While this provides sophisticated query planning, the system often incurs overhead due to:

  • JVM/Python Interoperability: Data must be serialized and deserialized between the Python process (where PySpark code runs) and the JVM (where the actual Spark engine runs), which adds latency.
  • Intermediate Materialization: Spark can be prone to materializing large intermediate results, especially under complex query plans, leading to increased memory pressure and I/O costs. This overhead is particularly noticeable in join-heavy TPC-H queries, such as multi-predicate or complex-condition joins, where PySpark's individual query runtimes were observed to be 1.4–17× slower than Bodo DataFrames.

In summary, PySpark offers power and scale through a mature, distributed SQL engine, but it comes with a higher developer migration cost and performance overheads related to cross-language serialization and memory-intensive intermediate data handling, which was evident in the TPC-H results.
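
One way to observe this execution model is to inspect the plan that Spark's Catalyst optimizer builds for a Pandas on Spark expression; the exchange (shuffle) stages it shows are where serialization and intermediate materialization costs arise. A minimal sketch with illustrative paths:

import pyspark.pandas as ps

orders = ps.read_parquet("s3://bucket/tpch/orders")
lineitem = ps.read_parquet("s3://bucket/tpch/lineitem")
joined = lineitem.merge(orders, left_on="l_orderkey", right_on="o_orderkey")

# Print the physical plan Spark will execute for this expression.
joined.spark.explain()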

Dask

Dask DataFrame’s primary strength is largely preserving the Pandas API while distributing computations across partitions. This makes it easy for Python developers to scale familiar workflows. However, Dask is not a relational query engine and lacks the global, cost‑based optimization that systems like Bodo and PySpark rely on for multi‑table joins and large aggregations. TPC‑H stresses exactly those areas, and Dask’s architecture is not designed for them.

On our hardware configuration—four large EC2 nodes with many cores and large memory—Dask’s default deployment model becomes a bottleneck. A single Python worker per node with many threads increases scheduler overhead and exposes Python‑level limitations, including the GIL in parts of the execution path. These effects compound during TPC‑H’s shuffle‑heavy join patterns.
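
For intuition, the worker-shape knob looks like this with dask.distributed on a single node; the values are illustrative, not our benchmark settings:

from dask.distributed import Client, LocalCluster

# Default-like shape on a big node: one worker process with many threads,
# which concentrates Python-level work (and the GIL) in one process:
#   cluster = LocalCluster(n_workers=1, threads_per_worker=32)

# Many smaller worker processes spread Python-level work across
# processes, the shape Dask typically performs best with:
cluster = LocalCluster(n_workers=16, threads_per_worker=2)
client = Client(cluster)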

While Dask can perform well on clusters with many smaller workers or with careful tuning, our goal was an apples‑to‑apples comparison across identical hardware. Under these conditions, the results highlight the gap between a general‑purpose parallel Pandas system and engines explicitly designed for distributed SQL analytics.

Thus, Bodo DataFrames is the best fit for teams with large Pandas applications: it delivers HPC-level performance and out-of-core, multi-node scale with no refactor, while the alternatives either demand rewrites (PySpark) or leave performance on the table through shallow optimization and Python-bound execution (Dask).

Conclusion

The TPC-H benchmark is a demanding test of distributed analytical capabilities, and the results highlight a critical trade-off: Bodo DataFrames delivers the best combination of high performance and low migration effort among the tested cluster-computing systems. PySpark, while a cluster workhorse, was significantly slower and required code rewrites. Conversely, Dask preserved the low-effort Pandas API but was orders of magnitude slower on this cluster configuration.

Bodo DataFrames uniquely resolves this dilemma. By providing HPC-level speed and out-of-core, multi-node scale through a simple one-line import, Bodo lets teams with extensive legacy Pandas code access high-performance analytics immediately, without the cost, risk, and time sink of a complete refactor. This combination of top-tier distributed performance with the lowest friction makes Bodo a production-ready choice for serious analytical workloads in the cloud.

If you want to evaluate Bodo DataFrames on your own workloads, you can start by running existing Pandas code with Bodo by replacing:

import pandas as pd

with:

import bodo.pandas as pd

The TPC-H benchmark code and other examples are available in our GitHub repository, and we’d love to hear what you’re working on or answer questions in our community Slack!

Ready to see Bodo in action?
Schedule a demo with a Bodo expert
