Bodo & Iceberg: The Simple and Fast Open Data Warehouse of the Future

August 4, 2022

Ehsan Totoni

Open data warehouse architectures combine the benefits of data lakes and traditional data warehouses. They are flexible and low-cost, like a data lake, while supporting structured data management and governance, like a traditional data warehouse. A single storage infrastructure can therefore hold structured, semi-structured, and unstructured data and serve data science, machine learning, and business intelligence (BI) workloads at the same time. In this post, we announce the first alpha version of our Bodo-Iceberg connector, the first step toward building the simpler and faster open data warehouses of the future!

Apache Iceberg is a leading open source table metadata format that allows building an open data warehouse on standard low-cost and scalable storage like AWS S3. Here we outline our vision for Bodo/Iceberg integration and explain why we think Bodo on Iceberg will make data management simple and fast at scale for data engineers.

Apache Iceberg

Iceberg is a table format originally developed at Netflix to handle its petascale datasets stored on S3. A table format provides a metadata layer on top of data files to bring managed tables to big data. Iceberg tables hide the details of data storage (e.g. Parquet files) while allowing nearly all data operations, such as SQL queries and ACID (atomic, consistent, isolated, durable) transactions, at the table level.

Iceberg uses metadata files internally to track the state of the Parquet files for all snapshots of the data. Every update results in a new snapshot that simply modifies the metadata, providing:

Reliability: Iceberg supports many concurrent writers with serializable isolation.
Data warehouse evolution: end users are able to add, drop, or rename columns without breaking any existing data (schema evolution).
Hidden Partitioning: end users can query the data efficiently without knowing data partitioning details. Users don’t need to write extra or special filters while reading and there is no need to precompute values (such as precomputing year for a timestamp or date column) when writing. Also, partitioning can change at any point (partitioning evolution).
Advanced query planning: Iceberg allows advanced planning and filtering in high-performance queries on large data sets.
Time travel: Iceberg allows reading or rolling back to previous versions of the table (e.g. if a transaction has introduced bad data).
Multi-engine access: the same Iceberg dataset can be accessed by multiple engines like Bodo, Spark, Flink, and Trino simultaneously, allowing maximum infrastructure flexibility.
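
To make the snapshot idea behind these features more concrete, here is a minimal sketch in plain Python of how snapshot-based metadata can enable time travel and rollback. The names `Snapshot` and `ToyTable` and the file names are hypothetical, invented for illustration. This is not Iceberg's actual metadata layout or API, only a toy model of the idea that each commit records a new snapshot while existing data files stay untouched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    data_files: tuple  # paths of all data files visible in this snapshot


@dataclass
class ToyTable:
    """Toy stand-in for table metadata (hypothetical, not the Iceberg format)."""
    snapshots: dict = field(default_factory=dict)
    current_id: int = 0  # 0 means "empty table, no snapshot yet"

    def files_as_of(self, snapshot_id):
        # Time travel: return the file list exactly as that snapshot recorded it.
        if snapshot_id == 0:
            return []
        return list(self.snapshots[snapshot_id].data_files)

    def current_files(self):
        return self.files_as_of(self.current_id)

    def append(self, new_files):
        # A commit never edits old snapshots; it records a new one.
        new_id = len(self.snapshots) + 1
        files = tuple(self.current_files()) + tuple(new_files)
        self.snapshots[new_id] = Snapshot(new_id, files)
        self.current_id = new_id
        return new_id

    def rollback_to(self, snapshot_id):
        # Rollback only moves the "current" pointer; no data is rewritten.
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
good = table.append(["part-0.parquet"])
bad = table.append(["bad-data.parquet"])
table.rollback_to(good)  # readers now see only part-0.parquet again
```

In this sketch the snapshot with the bad file is still recorded after the rollback, so it can be inspected later with `files_as_of(bad)`.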

You can learn more from the Iceberg docs.

Iceberg is built for large data in the cloud from the ground up. For example, unlike Hive partitioning, it does not rely on directory structures of files. Therefore, it does not require operations like file listing that are expensive in cloud object storage like S3. In addition, Iceberg can enable full data warehouse functionality and does not have the limitations of “lakehouse” solutions. Overall, Iceberg is well-designed and can support demanding data workloads at scale.
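
As a rough illustration of planning a scan from metadata instead of from a directory listing, the sketch below keeps an explicit list of data files together with a partition value derived from a timestamp column. The `MANIFEST` structure, the `ts_day` field, and the file paths are assumptions made up for this example, not Iceberg's real manifest format. The point is that the reader filters on the timestamp itself, and the engine maps that filter onto the stored partition values without listing any files in storage.

```python
from datetime import date, datetime

# Toy "manifest": every data file is listed explicitly with its partition value.
# The value is derived from the ts column by a day transform at write time, so
# users never store or filter on a separate "day" column (hypothetical layout).
MANIFEST = [
    {"path": "data/00001.parquet", "ts_day": date(2022, 8, 1)},
    {"path": "data/00002.parquet", "ts_day": date(2022, 8, 2)},
    {"path": "data/00003.parquet", "ts_day": date(2022, 8, 3)},
]


def day_transform(ts: datetime) -> date:
    """Partition transform: map a timestamp to its calendar day."""
    return ts.date()


def plan_scan(manifest, ts_from: datetime, ts_to: datetime):
    """Return paths of files that may hold rows with ts_from <= ts <= ts_to."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    # Pure metadata lookup: no directory listing against object storage.
    return [f["path"] for f in manifest if lo <= f["ts_day"] <= hi]


selected = plan_scan(
    MANIFEST,
    datetime(2022, 8, 2, 9, 0),
    datetime(2022, 8, 3, 17, 0),
)
```

Here `selected` holds only the files for August 2 and August 3; the file for August 1 is skipped without ever being opened.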

We believe Iceberg is the table format of the future because, in addition to its future-proof design, it is a community-driven open source project (see Bodo’s open source approach). Iceberg is an Apache-governed project with a diverse committee of maintainers from Apple, LinkedIn, Salesforce, AWS, Tabular, and others. Several companies, including Snowflake, AWS, Starburst, Alibaba, Tencent, Cloudera, Tabular, and Dremio, already support Iceberg commercially. As a result, Iceberg is fostering a vibrant new ecosystem of tools and services and avoids vendor lock-in, benefiting all users.

Bodo with Iceberg

Iceberg needs a capable compute engine to implement the actual data operations for tables. Bodo is a great match due to its speed, resource efficiency, and flexible parallel architecture. For example, when reading a table, Iceberg provides the engine with a list of data files and the transformations to apply to them to form the table in memory. This structure takes advantage of Bodo’s efficiency for both parallel I/O and compute.

Furthermore, while other engines like Spark read data files into rigid data partitions, Bodo’s parallel architecture allows fast communication between cores to balance the table data automatically during read, which enables faster and more scalable computation afterward. Therefore, the user doesn’t need to tune data partitions and other parameters to scale their application and achieve higher performance.
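
The effect of balancing data during read can be pictured with a small single-process simulation. The `rebalance` function below is a hypothetical helper written for this illustration. It is not Bodo's actual algorithm and performs no real inter-core communication; it only shows the end state in which every core (rank) holds an almost equal share of the rows, regardless of how unevenly the source files were sized.

```python
def rebalance(chunks):
    """Redistribute rows so each rank holds a near-equal share, keeping row order.

    `chunks` is a list with one list of rows per rank, as it might look right
    after each rank has read its assigned files (toy model, single process).
    """
    rows = [row for chunk in chunks for row in chunk]
    n_ranks = len(chunks)
    base, extra = divmod(len(rows), n_ranks)
    balanced, start = [], 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional row each.
        size = base + (1 if rank < extra else 0)
        balanced.append(rows[start:start + size])
        start += size
    return balanced


# Three ranks that read files of very different sizes: 7, 1, and 2 rows.
skewed = [[0, 1, 2, 3, 4, 5, 6], [7], [8, 9]]
balanced = rebalance(skewed)
sizes = [len(chunk) for chunk in balanced]
```

With the skewed input above, `sizes` comes out as 4, 3, and 3 rows, so later computation is not held back by one overloaded rank.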

Overall, Bodo on Iceberg substantially simplifies data management for users by supporting native Python in addition to SQL. Bodo also eliminates the need for an involved setup process and parameter tuning, so data engineers and data scientists can solve business problems instead of fighting data infrastructure.

Bodo-Iceberg Connector

We are excited to announce the availability of the first alpha version of the Bodo-Iceberg connector! It supports basic reading of Iceberg tables with automatic filter pushdown. Some features, such as schema changes and tables with deleted rows, are not supported yet. See Bodo in action.

To read an Iceberg table:

This code connects to an Iceberg catalog such as Hive Metastore to get the table metadata, applies the filters in the code at the Iceberg metadata level (e.g. date filtering here), and loads the table as a distributed dataframe automatically. Our goal is to automate as many details as possible and provide a simple Pandas experience. BodoSQL will support reading Iceberg in upcoming releases as well.
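
The three steps described above (catalog lookup, metadata-level filtering, distributed load) can be sketched as a toy in plain Python. This is not the connector's code or API: `CATALOG`, `plan_read`, the table name, and the per-file `date_min`/`date_max` bounds are all hypothetical, and the round-robin split at the end is a naive placeholder for however the engine actually assigns work.

```python
from datetime import date

# Hypothetical catalog: maps a table name to per-file metadata. A real catalog
# such as Hive Metastore would point to Iceberg metadata files instead.
CATALOG = {
    "db.events": [
        {"path": "f1.parquet", "date_min": date(2022, 1, 1), "date_max": date(2022, 3, 31)},
        {"path": "f2.parquet", "date_min": date(2022, 4, 1), "date_max": date(2022, 6, 30)},
        {"path": "f3.parquet", "date_min": date(2022, 7, 1), "date_max": date(2022, 8, 4)},
    ]
}


def plan_read(catalog, table, min_date, n_ranks):
    """Plan a read of rows with date >= min_date across n_ranks workers."""
    # 1. Catalog lookup: fetch the table's file-level metadata.
    files = catalog[table]
    # 2. Filter pushdown: drop files whose upper bound rules out any match.
    kept = [f["path"] for f in files if f["date_max"] >= min_date]
    # 3. Distribution: naive round-robin assignment of files to ranks.
    return [kept[rank::n_ranks] for rank in range(n_ranks)]


plan = plan_read(CATALOG, "db.events", date(2022, 5, 1), n_ranks=2)
```

In this example `f1.parquet` is pruned using metadata alone, and the two remaining files are split between the two ranks. Rows inside the kept files would still need the date filter applied after loading.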

We have a robust roadmap to support all Iceberg features in the Bodo engine and platform over the upcoming months. For example, the next few versions of the connector will provide write support for Iceberg tables, as well as merging new data into tables (“merge into”). In addition, we will support various catalogs such as Tabular, AWS Glue, Dremio Arctic, Nessie, and the Iceberg REST catalog.

Summary

Iceberg is an emerging open source standard table metadata format that provides managed big data tables on top of standard cloud storage (e.g. Parquet files in S3) as an alternative to traditional data warehouses. We believe Bodo on Iceberg can simplify big data management for data engineers and data scientists in an efficient and cost-effective way. The Bodo engine’s simplicity, speed, and flexibility are necessary to take full advantage of Iceberg tables at scale and achieve high performance. Our current alpha connector is the first step in our roadmap to realize this vision, and we look forward to your feedback to guide us.

Ready to simplify big data management? See Bodo in action!

Questions or feedback? Get in touch with our team!
