Natively Integrated Python and SQL with Extreme Performance and Scale: Introducing BodoSQL

May 25, 2022

Nick Riasanovsky

Python is the language of choice for AI and machine learning (ML), but SQL has been used for data processing for decades. Many data applications are often a mix of the two languages, which makes development and deployment cumbersome for data teams. While Bodo makes it possible to use native Python/Pandas for large-scale data processing without the need for SQL in many cases, legacy code often uses SQL and rewriting to Python is not practical. Furthermore, many developers prefer SQL for certain use cases for a variety of reasons, and developer productivity is key for effective data organizations.

To address this need, we are announcing Beta access for BodoSQL: a SQL engine with a supercomputing-style parallel architecture that is fully integrated with high-performance Python (with Bodo). BodoSQL is designed to bridge the gap between the Python and SQL worlds by providing type checking, error checking, end-to-end optimization, and parallelization across the two languages. This addresses the “two-language” problem, eliminating the need for code rewrites or additional developer training.

Most distributed SQL engines typically focus on database-style optimization but may use a suboptimal distributed backend with high parallel overheads. While this approach may work for smaller data sizes and clusters, parallel efficiency becomes critical for 500gb+ scale datasets on 100+ core clusters. BodoSQL uses a high performance computing (HPC) parallel architecture with MPI for execution, delivering extreme performance and scalability for SQL workloads. It’s efficient parallel architecture ensures scalability as datasets and core counts grow in data applications.

‍

Using BodoSQL

The example function below demonstrates using Python and SQL together with BodoSQL. It reads parquet files into Pandas dataframes inside a Bodo JIT function (which means that the dataframes are distributed), and creates a BodoSQL context with them. The context is an immutable object that holds a mapping between SQL table names and Pandas dataframes. The program runs SQL code with “bc.sql” and returns the results as a dataframe to Python for further processing.

‍

This example also takes advantage of cross language optimization capabilities of BodoSQL. The input dataframes have over 40 columns combined, but only 4 input columns are actually used in the program. Therefore, Bodo eliminates the unnecessary columns from computations like table joins, and avoids reading them from storage altogether. This optimization is standard across SQL engines but BodoSQL extends this optimization across the SQL and Python computation boundary, which is not possible in other SQL engines.

The Bodo and BodoSQL combination allows end-to-end error checking at compilation time which improves robustness. For example, if we introduce a typo in the previous example by replacing "ss_customer_sk” with “ss_customer_si” in the Python code (see below), BodoSQL throws the right error at compilation time: “groupby(): invalid_key [‘ss_customer_si’] for ‘by’ (not available in columns (‘ss_item_sk’, ‘ss_customer_sk’, ‘i_item_sk’, ‘i_class_id’, ‘i_category’)'”. However, other solutions would throw an error during runtime. This could potentially cause production jobs of large applications to fail if the error is not captured during testing and even worse waits for the entire SQL query to execute, which may result in additional costs.

‍

BodoSQL Performance

Our goal is to make sure BodoSQL is faster and more scalable than other SQL engines at large scale, and can match performance of efficient Python codes using Bodo. While there is a long way to go to handle all complex cases efficiently, initial results are encouraging.

In May 2022, we evaluated BodoSQL using Query 10 derived from the standard TPC-H benchmarks and compared it to Python code in Bodo and SQL code using SparkSQL (AWS EMR service). According to the TPC organization, Query 10 is used to determine “the top 20 customers, in terms of their effect on lost revenue for a given quarter, who have returned parts”. The Python version using Bodo is a manual translation of SQL into Python. We generate and use scale factor 1000 (corresponding to roughly 1TB dataset in original TPC file format) for benchmarking. We use a 4 node cluster on AWS with a total of 192 physical CPU cores. Note: This performance and cost comparison is derived from TPC-H, and as such, is not comparable to any published TPC results.

The graph below demonstrates the performance results. BodoSQL provides similar performance to Bodo code but is almost 10x faster than SparkSQL. The performance advantage of BodoSQL over others usually widens with more complex queries on larger datasets and clusters.

Figure 2: Results derived from TPC-H Q10 for BodoSQL, Bodo, and SparkSQL (excluding I/O, lower is better).

‍

BodoSQL Internal Design

BodoSQL supports SQL by converting queries into Python functions for Bodo to compile and parallelize. At a high level, BodoSQL has three phases:

Logical Plan Generation
Bodo Code Generation
Bodo Execution

This figure illustrates these phases:

‍

‍

Logical Plan Generation

BodoSQL starts by parsing the SQL query into relational algebra and generates a logical representation of the query using Apache Calcite. BodoSQL then optimizes this logical plan inside Calcite to enable generation of efficient code.

‍

Bodo Code Generation

Once BodoSQL has produced an optimized logical plan, it generates a physical plan that Bodo can compile. Currently, this is done by generating a Python function which is equivalent to the query (but may change in the future). Typically this generated Python code is just standard Pandas except when SQL behavior is different (e.g. null handling). In these situations, we use specialized library functions that provide matching SQL behavior (written in Python and JIT compiled).

‍

Bodo Execution

The final step in BodoSQL is to compile the generated code in Bodo and execute it. The generated function is inlined in the outer JIT function to enable end to end optimization and error checking. Essentially, this is as if the whole function was written in Python which eliminates language boundaries.

‍

Next Steps

BodoSQL is in early stages but is already showing great potential to address major data processing pain points. It provides extreme performance and scalability, full integration with Python, and simple deployment at the same time. Our customers have started migrating various SQL workloads to it, which are typically long-running data engineering applications.

We plan to optimize BodoSQL’s performance further using more database-style query optimization techniques. Furthermore, we plan to develop integration with BI/visualization tools and storage backends to provide a full SQL solution for a variety of use cases.

BodoSQL Beta can be installed from Conda (conda install bodosql -c bodo.ai -c conda-forge). Please contact us if you want to discuss a production workload. We welcome any feedback that helps us improve BodoSQL!

‍

₁_{: The code can be found}_here

₂_{: We use Bodo 2022.4, BodoSQL 2022.4 SQL. The Bodo/BodoSQL runs were on Bodo Cloud Platform using 4 r5n.24xlarge instances (EFA enabled). The SparkSQL runs were on AWS EMR 6.6.0 (default configuration) using a r5dn.24xlarge master and 3 r5n.24xlarge workers.}

‍