How To Avoid Drowning In Your Data Lake

A few years ago, Gartner warned that significant risks in data lakes could eventually turn them into “data swamps.” For many companies, that prediction was all too accurate. Many that have invested millions are still looking for business value, frustrated that data lakes have not delivered on the original promise of enabling actionable analytics on huge amounts of data.

The reality is that data lakes are useful as low-cost storage and for managing a variety of unstructured and semi-structured data, but they struggle as a true real-time analytics environment. Despite repeated attempts by open source and commercial solutions (e.g., Apache Hive, Apache Impala, Greenplum, and so on), most Hadoop- or cloud-based data lakes can’t support thousands of concurrent analytics users, sophisticated ad hoc queries, data-intensive reports, or any of the other demands of a true real-time analytics system.

Instead, the right answer is to augment the data lake’s cheap storage with a fully modern analytics environment that is purpose-built to support sub-second ANSI SQL queries, even for the most complex workloads and for up to thousands of concurrent users working in their favorite BI and data science tools. That environment needs to understand the required file formats (ORC, Parquet, JSON, etc.), ingest data extremely quickly in batch or as a real-time stream, and make it all queryable instantly. It should also simplify and streamline data management and eliminate the need for specialized data engineering skills.

Finally, the word “modern” implies that you should also have the flexibility to run workloads wherever it makes the most sense: in an on-premises data center, in the cloud, or both. Today, neither traditional data warehouses, nor SQL-on-Hadoop engines, nor cloud-native data warehouses check all those boxes—but Yellowbrick does.

Read the white paper “Unlocking Data Lake Value with Hybrid Cloud Analytics,” which explains the design principles behind Yellowbrick that make it an ideal solution for augmenting (or even replacing) data lakes as described above.
