Have you implemented a data science project in Python, but the input data is too big for your laptop, or have you faced issues scaling your model on larger machines with bigger data sets? I have recently been trying out Bodo.ai, a novel scalable analytics platform that offers extreme performance and the simplicity of native Python. The Bodo engine includes a compiler that automatically optimizes and parallelizes Python code without adding new API layers.
Previously, I’ve benchmarked Bodo using the popular example: The Monte Carlo approximation of Pi. In this post, I wanted to test how Bodo performs on another popular data analytics example benchmark: Word Count of Beer Review. The beer review data are downloaded from this Kaggle link. I use the code in this git repository to count the words in the “text” column of the dataset stored in reviews.csv. The input data is 2.3 GB and slightly modified as it had some character issues in some rows. The table below shows an example row of the dataframe:
Table 1. Sample input data for word count benchmark. Source: beer reviews.csv on Kaggle.
I used an AWS EC2 instance type c5.9xlarge with 18 physical cores to run the benchmark, running one thread per core. As shown in Figure 1, running this code with pandas only uses one CPU, and the rest are idle, which is a waste of hardware resources. One of the main issues with using Python for big data and parallelizing code is not trivial due to the global interpreter lock. However, when you parallelize the code with Bodo and command it to use all 18 CPUs, you can see that all of the processors are executing the job.
Figure 1. CPU utilization in word count example using AWS c5.9xlarge (top) pandas with 610 sec run time (bottom) Bodo running with 18 CPUs and 30 sec run time
In terms of computation run time, similar to the Monte Carlo example, I found it is inversely proportional to the number of CPUs used (see Figure 2). For example, running the code with regular pandas took 610s while parallelizing it with Bodo to run on two cores cut the run time in half. Consequently, running with all 18 cores only takes 34s.
Figure 2. Word count benchmark results, Pandas vs. Parallelized Pandas using the Bodo engine.
Another critical observation here is that parallelization with Bodo scales linearly with negligible loss in efficiency. Unlike other driver-executor parallelization methods like Spark and Dask that would suffer diminishing returns with a larger number of cores (as benchmarked here), Bodo’s true potential to scale linearly beyond thousands of cores, as demonstrated here. I’m anxious to see more for myself.
Bodo is a breakthrough in the field of Data Science and Data Engineering. Most developers in these fields use Python, but they struggle to scale their applications to process big data. Having to rewrite the Python application code to Spark to enable it to scale adds a layer of difficulty that often slows down the data science development process. With Bodo, Python code no longer needs to be converted to Scala or PySpark. It can stay as is or with just minor refactoring for type stability. My amazing experience with this benchmark made me want to put Bodo to the test with some more advanced benchmarks, and I will share those in future blog posts.
About the Author: Ali Reza Farhidzadeh is an Enterprise Artificial Intelligence Architect at Wipro Limited with 12 years of experience in data science, machine learning, business intelligence, and numerical computations. He is also a Former Professor of Probability and Statistics at the University of Buffalo.