Fast and efficient parallel data processing is more important than ever as the scale of datasets grows every day. As a result, data engineers are always looking for better compute solutions for their ETL, data prep, and featurization needs. Still, it is not easy to find the best compute platform since all technologies claim their platforms are “fast” and “scalable”, often without enough evidence.
Benchmarks can often help demonstrate the performance and applicability of various technologies for target use cases, though they are admittedly imperfect. Actual deployments are what matter most and where we at Bodo put our focus. Having said that, benchmarks and basic economic evaluations can often indicate real-world economics and performance. Thus, we'd like to share our results to give an idea of what real-world users can expect.
We are regularly asked how Bodo compares to Spark, Dask, and Ray, particularly for large-scale workloads, e.g., terabytes of data and billions of rows, and requiring 100’s (or more) of compute cores.
As-of December, 2021, we developed a workload, derived from the TPC-H benchmark, using all 22 queries to compare Bodo’s economic and speed performance to Spark, Dask, and Ray for data processing workloads. (N.B. The authors of this blog have no affiliation to the official TPC Council.This performance and cost comparison is derived from TPC-H, and as such, is not comparable to published TPC results.) Cost calculations were based on using the Sept. 2021 list price of AWS EC2 instances for compute only, omitting I/O costs (assumed to be roughly equivalent) as well as software licensing.
Overall, we found that Bodo provided a 22.9x median speedup over Spark with an associated 95%+ compute cost reduction and 148x median speedup over Dask with an associated 99% compute cost reduction. We also found Ray could not handle these workload sizes yet, so we did not include it in our results. Note that we did not measure the time to create the cluster and data I/O time, as the cluster and the data source are the same and are independent of the data processing time.
At Bodo, we believe that transparent and reproducible benchmarking is an objective way to evaluate computational performance and scalability for the target use cases. Benchmarks are never perfect, but they help start a more informed conversation about a comparison of options. Well-chosen benchmarks are easy to understand, represent the target workloads well, and are easy to replicate.
We are indeed biased and recognize that we are not experts in Spark, Dask, or Ray. But we made our best effort to follow appropriate procedures and, we also made all the benchmark code available open-source here for others to replicate and build on. In addition, we plan to publish more third-party benchmarks of Bodo (see this recent post as an example). We invite all interested parties to report your findings and help us improve the accuracy of our benchmarking process.
Baseline: What is Bodo? Why is it Faster?
Bodo is a compute engine that automatically provides true parallel computing with associated High-Performance Compute (HPC) style speed. This is in contrast to other scaling technologies that follow a driver-executor task scheduling paradigm, a less resource-efficient distributed computing approach. Because Bodo utilizes the HPC paradigm for data operations, it is most often orders of magnitude faster for a class of very large data processing workloads, and an order of magnitude more cost-efficient from a computing resources perspective.
In contrast to using additional libraries and frameworks, Bodo is a new type of inferential compiler offering automatic parallelism and high efficiency, surpassing 10,000+ cores. The Bodo engine can be used with standard Python libraries such as Pandas, SciKit Learn and NumPy and popular cloud-based data lakes and warehouses. Anyone can try Bodo for free for up to 8 cores; contact us if you need to scale further.
Why Derive a Workload from TPC-H?
Data processing workloads such as ETL are complex, messy, and use case dependent, making it challenging to develop representative and straightforward comparisons for them. There are some benchmarks such as 1.1 Billion NYC Taxi and db-benchmark that are occasionally used. Still, they are too simplistic (e.g., just a single join or groupby operation) and they do not provide scalable dataset generation. We believe that some of the industry-standard TPC benchmarks, such as TPC-H, are much better starting points.
Although TPC benchmarks are traditionally used for SQL database use cases and do not capture many of the challenges of today’s analytics and ML data pipelines, some of the same database-like patterns also exist in more complex data workloads. We chose to derive our workload from TPC-H since its queries are simpler to convert to Python than others like TPCxBB and TPC-DS, but it still provides representative computations. According to the TPC website, TPC-H “illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions”. We wrote equivalent Python versions of TPC-H SQL queries to evaluate Python systems (Bodo, Dask, Ray/Modin) for this blog (PySpark DataFrames are just wrappers around SQL, so we use SQL to avoid surprises). We plan to evaluate other TPC benchmarks such as TPCx-BB and TPC-DS, and derive new and more representative data processing workloads in the future.
As an example, below is TPC-H query 18 in SQL and Python for comparison. It ranks customers based on their large quantity orders:
Our Test Setup
The software versions we used are:
- Bodo 2021.10
- AWS EMR 6.3.0
- PySpark 3.1.1
- Dask 2021.9.1
- Dask-MPI 2.21.0
- Ray 1.6.0
- Modin 0.11.0
For our comparison testing, we used a cluster of 16 c5n.18xlarge AWS instances with 576 physical CPU cores and 3 TB of total memory. This “compute-optimized” cluster provides high computing power and high network bandwidth which are critical for effective parallel computations. We chose on-demand instances for simpler cost calculations, but similar conclusions can be expected from other instances such as Spot as well. Our benchmark setup cluster details are as follows:
Bodo vs. Spark
Apache Spark is a well-known framework for large-scale data processing. Bodo targets the same large-scale data processing workloads such as ETL, data prep, and feature engineering. We benchmarked Bodo vs. Spark using the scale factor 1,000 of TPC-H (~1 TB dataset). For the Spark testing, we used default AWS EMR without autoscaling to spawn the 16 c5n.18xlarge cluster (1 driver and 15 workers) equivalent compute step to Bodo. AWS EMR default also allocates 1152GB of EBS storage for Spark, but we omitted the EBS cost since it is small compared to compute cost.
The graph below illustrates the compute times of Bodo and Spark for our workload derived from TPC-H (excluding I/O, which has roughly similar performance). We ran each sets of queries thrice and report the average of those three runs in the table below. Across the queries we found that Bodo consumed 83%-98% less computing time than Spark (based on equivalent compute clusters), and was 6x-65x faster than Spark with a median 22.9x overall speedup. We found that in general, Bodo’s speed-up is larger for more complex queries and smaller for the simpler queries. Corresponding compute cost on AWS EC2 was approximately $14.73 for Spark, and approximately $0.68 for Bodo.
Bodo vs. Dask
Dask is a Python library that provides “Pandas-like” and other APIs for scaling Python workloads through a centralized task scheduler. By their own admission, Dask still needs to resolve several of its shortcomings to handle large-scale data-intensive workloads such as ETL (better explained by Dask contributors). It is not a direct replacement for pandas. Nevertheless, Dask is sometimes considered for large-scale data workloads by developers, making our quantitative comparison necessary.
We first tried running Dask with the same benchmark setup as the previous Bodo/Spark comparison (16 c5n.18xlarge cluster, 1TB input data). Dask failed to load the scaling factor 1,000 data (even though the cluster’s memory capacity is three times the input data). Therefore, we switched to using scaling factor 100 for input data (~100GB). We deploy Dask with 1 scheduler and 576 workers using dask-mpi.
The graph below illustrates the compute times of Bodo and Dask when using the workload derived from TPC-H (excluding I/O). Dask ran out of memory for query 21, even though the cluster memory capacity is 30 times larger than input data. (note: While Dask can be configured to spill into disk, this significantly impacts performance. So we omitted this config altogether to avoid negatively skewing Dask results.)
Across all queries, Bodo was found to be 8x to 496x faster than Dask, with a median of 148x median speedup (we excluded Q21 from the Dask runtime). Corresponding compute costs on AWS EC2 were $23.14 for Dask, and $0.12 for Bodo.
Based on these results, we found that Dask is currently not suitable for large-scale data processing workloads. We further validated with 3rd party Dask experts to confirm that our code follows standard Dask dataframe practices. Further optimization can deliver an additional 2-3x speed increase, but would break the code away from standard Pandas semantics and is not trivial to implement for anyone who is not a Dask expert. Since Dask’s architecture is fundamentally similar to Spark (driver-executor task scheduling), we believe that Dask can, at best, match Spark’s performance (after significant development effort) but not Bodo’s.
Bodo vs. Ray/Modin
Ray is a Python framework for scaling Python workloads, originally developed for reinforcement learning applications. Ray can support data processing through the Modin package, which claims to scale Pandas by “changing a single line of code”. We tried to run our workload derived from TPC-H on Ray/Modin on the same 16 c5n.18xlarge cluster, but it ran out of memory for any dataset larger than 10 GB (even though the cluster has 3 TB of memory). This is highly unexpected since Modin claims to scale to 1TB+ datasets. For the 10 GB dataset, Ray/Modin was slower than Pandas in many cases. We checked the Modin logs to ensure it doesn’t fall back to Pandas for large computations (only one query did).
Therefore, we found Ray/Modin not to be currently capable of processing large datasets for the target use cases, and we plan to try again later. Architecturally, Ray uses similar task scheduling mechanisms as Spark and Dask, so we don’t expect it to reach Bodo’s performance levels.
Simple-but-representative benchmarks can help compare and contrast data processing system performance for workloads like ETL and data prep. We derived our workload from the TPC-H benchmark, translated to Python to compare the performance of Bodo’s HPC-based inferential compiler approach to distributed task scheduling libraries like Spark, Dask, and Ray. Because Bodo utilizes a truly parallel HPC-style paradigm for data operations, it is often orders of magnitude faster for classes of very large data processing workloads.
Indeed, our benchmark efforts found Bodo to be on median 23x (in total 21.6x) faster than Spark and on median 148x (in total 191.3x) faster than Dask on our workload derived from TPC-H benchmark, which represents large-scale data processing workloads such as ETL. The speed-ups also translate to over 95% infrastructure cost savings as well. We invite the data engineering community to replicate our results and educate us about their most critical workloads.