How the Data ‘Lakehouse’ Might Usurp the Warehouse and the Lake
by Joel Hans
This article originally appeared on RTInsights’ sister publication CloudDataInsights.com.
Let’s assume that you’re well-off enough to have an entire lake in your possession. What do you do next? Build a lakehouse, of course.
You know, an open architecture for managing your organization’s data that combines the scale of data lakes and the ACID-friendly queries of data warehouses on a single, flexible, and cost-effective platform.
We’re talking about a platform to handle the vast quantities of an organization’s data here, not your second (or third) house where you store your pontoon boat and only visit two weekends every year.
The data lakehouse is a growing market segment, with companies like Dremio, Databricks, and Onehouse already elbowing for the best cloud implementation of open frameworks like Apache Hudi, Apache Iceberg, and Delta Lake. But before jumping straight into the supposed benefits of the lakehouse, let’s talk about how the industry got here, to a new product category, just as it seemed like data lakes were catching on.
Years ago, the data warehouse was the standard for business intelligence and analytics. Organizations stored their structured data in an ACID-compliant environment, meaning one that guarantees the atomicity, consistency, isolation, and durability of transactions against the warehouse’s data. For all the benefits warehouses created in terms of data quality and business analytics, they were costly, and their inflexibility tended to create silos.
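To make the "A" in ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse. The accounts table, its rows, and the transfer helper are all hypothetical, made up for illustration. The point is only that a multi-statement change either lands in full or not at all.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # This debit violates the CHECK constraint when src lacks funds,
            # which rolls back the credit made just above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("ops", 100), ("sales", 50)]
    )

ok = transfer(conn, "ops", "sales", 30)       # succeeds
failed = transfer(conn, "ops", "sales", 500)  # fails and leaves no trace
print(ok, failed, balances(conn))
```

The second transfer credits one row before the debit fails, yet the final balances show no sign of that half-finished work. That is the guarantee analysts came to rely on in warehouses.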
The data lake was developed as an answer to these problems. As a central, “flat” repository of all raw structured and unstructured data in object form, the data lake was designed to make data more accessible to more employees without the risk of siloing. Data lakes tend to run cheaper than warehouses since most public clouds support the object storage model.
But many organizations, especially those at the leading edge of data storage and analysis, started to notice problems with data warehouses and lakes, even after trying to solve their individual cons by combining them into a single management and analysis infrastructure.
Back in 2014, Uber was struggling with their data warehouse, according to Vinoth Chandar, who managed the company’s data team at the time. They realized that different business units had different “versions” of the company’s data. Some analyses included the most recent updates, while others didn’t, which meant their people made critical decisions based on false or outdated assumptions.
Uber’s engineers started building a custom Hadoop infrastructure around their warehouse, effectively combining their data warehouse with a data lake, to help different teams run analytics and make decisions based on the data they were paying handsomely to collect and store. Internally, they called this project “Hoodie.”
In parallel with Uber, developers from Netflix, Apple, and Salesforce started working on a different open-source framework for democratizing the enormous volume of data they were all collecting about their customers. With both warehouses and lakes, these companies often needed to copy data to other systems to help their employees run analytics in comfortable, ACID-compliant environments where they didn’t have to worry about affecting durability. They were being overrun with complexity.
They started building what’s now called Iceberg, an open-source format for big data analytics that lets multiple engines work on the same tables, at the same time, with the “reliability and simplicity of SQL tables.”
Developers behind both projects eventually released them into open source, following a trend long established among Silicon Valley tech giants. Back in 2011, Yahoo spun its Hadoop engineering team out into its own company, Hortonworks, and in 2014, Kafka’s creators left LinkedIn to do the same with Confluent. Both Hoodie (now called Hudi) and Iceberg are part of the Apache Software Foundation, where they’re maintained and built by a global network of volunteer contributors.
Hudi is now supported on AWS, Google Cloud, and Microsoft Azure and is used by companies like Disney, Twitter, Walmart, and more.
Hudi and Iceberg are also now the foundation of the data lakehouse industry. When deployed into production against new or existing data sets, these tools let organizations store all their structured and unstructured data on low-cost storage, just like data lakes do. They also add the data structure and management features of warehouses, like ACID-compliant transactions and simpler query development.
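How can cheap file storage behave like a transactional table? One common idea is to keep data files immutable and publish changes through a small commit log. The sketch below is a toy version of that idea on a local temp directory. It is not how Hudi, Iceberg, or Delta Lake are actually implemented, and ToyTable and every name in it are hypothetical. It also skips real-world concerns such as readers seeing a half-written log entry.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy table: immutable data files plus a numbered commit log (hypothetical)."""

    def __init__(self, root):
        self.data_dir = os.path.join(root, "data")
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.data_dir, exist_ok=True)
        os.makedirs(self.log_dir, exist_ok=True)

    def latest_version(self):
        """Highest committed version number, or -1 for an empty table."""
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def snapshot(self, version=None):
        """List the data files visible at a version (latest by default)."""
        if version is None:
            version = self.latest_version()
        if version < 0:
            return []
        with open(os.path.join(self.log_dir, f"{version:08d}.json")) as f:
            return json.load(f)["files"]

    def commit(self, base_version, file_name, rows):
        """Write a data file, then try to publish it as base_version + 1."""
        with open(os.path.join(self.data_dir, file_name), "w") as f:
            json.dump(rows, f)
        files = self.snapshot(base_version) + [file_name]
        log_path = os.path.join(self.log_dir, f"{base_version + 1:08d}.json")
        try:
            # Mode "x" fails if the entry already exists, so only one
            # writer can claim a given version number.
            with open(log_path, "x") as f:
                json.dump({"files": files}, f)
            return True
        except FileExistsError:
            return False  # another writer got there first; caller should retry

    def read(self, version=None):
        """Read all rows from the files listed in the chosen snapshot."""
        rows = []
        for name in self.snapshot(version):
            with open(os.path.join(self.data_dir, name)) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp())
first = table.commit(-1, "a.json", [{"trip": 1}])
# Two writers both start from version 0; only one can publish version 1.
winner = table.commit(0, "b.json", [{"trip": 2}])
loser = table.commit(0, "c.json", [{"trip": 3}])
print(first, winner, loser, table.read())
```

The losing writer's file still sits in storage, but no snapshot references it, so readers never see it. Because old log entries are kept, every team reading a given version sees the same data, which is the "different versions of the truth" problem Uber was trying to escape.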