Data Brew: Redshift Realities & Yellowbrick Capabilities – Part 1

My recent venture has involved a lot of conversations with Redshift users where they shared a wealth of experiences. These discussions, often accompanied by a comforting cup of coffee, revealed interesting patterns unique to Redshift. Some of these were surprising – try not to spill your coffee!  

Redshift has done an incredible job on AWS of establishing itself as the go-to choice, offering native AWS integration and a cost-effective entry point that resonates with startups and established businesses alike. A common theme I heard was, ‘If you are moving to AWS, nobody gets fired for choosing Redshift.’ But what happens as they scale? In this blog post, I aim to unpack their collective experience, offering a glimpse into its pitfalls and gotchas.  

Stay tuned for Part 2, where I will discuss how Yellowbrick’s architecture compares and how we solved these challenges for the users who tried our data warehouse. 

 

The Redshift Riddles

1. The 60-Database Question: Redshift’s Instance Limits

Redshift caps each instance at 60 databases, which, while ample for some businesses, can present a growth challenge for others. For example, a SaaS company specializing in marketing analytics saw this limit approaching as their user base expanded: their multi-tenancy model required isolation at the database level, so despite plenty of spare capacity, they needed additional instances earlier than anticipated. This experience illustrates the importance of considering future growth when choosing a data warehousing solution that can adapt to a company’s scaling needs.  

Similarly, in the realm of ‘Data Products,’ where distinct teams manage their independent databases, the database cap can introduce complexity and potential cost considerations.  

While extra instances offer a solution, they also bring added layers of management and the need for additional data movement to satisfy cross-instance queries. Such scenarios highlight the value of a data warehousing solution that can provide both scalability and administrative simplicity to accommodate evolving business requirements. 
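To make the trade-off concrete, here is a minimal capacity-planning sketch. The 60-databases-per-instance cap comes from the discussion above; the tenant counts and the number of reserved shared databases are hypothetical examples, not figures from any specific customer:

```python
import math

REDSHIFT_DB_CAP = 60  # databases per instance, as described above

def instances_needed(tenants: int, reserved_dbs: int = 2) -> int:
    """Instances required when each tenant gets its own database.

    reserved_dbs models shared/system databases kept on every instance
    (an assumption for illustration).
    """
    usable = REDSHIFT_DB_CAP - reserved_dbs
    return math.ceil(tenants / usable)

print(instances_needed(50))   # 1 — a single instance still suffices
print(instances_needed(200))  # 4 — growth forces extra instances well before capacity runs out
```

The point is that the instance count is driven by the database cap, not by compute or storage utilization, which is exactly the mismatch the marketing-analytics company ran into.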

2. Riding the Performance Wave: Understanding Redshift’s Query Inconsistencies

Redshift’s advanced features, such as the automatic creation of materialized views and session-based result caching, are designed to optimize query performance across a broad range of analytical needs. However, workloads that require real-time (or near-real-time) analytics may experience variability in query times. For instance, a global financial reporting and performance management software provider found that query execution times could vary wildly, negatively impacting customer experience: some queries ranged from a minute down to under a second, and a job that took 5 minutes on one run finished in just 5 seconds on the next.  

Such performance variability, which also afflicts other cloud data platforms, emphasizes the importance of aligning architecture with business needs. Ensuring a consistent and reliable product experience was paramount for the financial software provider. When testing, it’s essential to look at P95/P99 timings and standard deviation over several iterations rather than relying on the fastest or average times. Recognizing the challenges in predictability, they sought ways to enhance performance consistency to uphold their service standards. This narrative underscores the need for solutions that offer both the agility of cloud scaling and the stability necessary for critical business operations. 
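As a minimal sketch of that testing advice, the snippet below summarizes a set of query timings with mean, standard deviation, and nearest-rank P95/P99. The latencies are synthetic, shaped like the variability described above (mostly ~1 s with minute-scale outliers), not measurements from any real cluster:

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Synthetic latencies (seconds) for 10 runs of the same query.
latencies = [0.8, 0.9, 1.1, 0.7, 1.0, 0.9, 58.0, 1.2, 0.8, 300.0]

print(f"mean : {statistics.mean(latencies):.1f}s")  # dragged up by the outliers
print(f"stdev: {statistics.stdev(latencies):.1f}s")
print(f"P95  : {percentile(latencies, 95):.1f}s")   # exposes the worst runs
print(f"P99  : {percentile(latencies, 99):.1f}s")
```

Here the average alone (~36 s) misrepresents both the typical ~1 s run and the 5-minute worst case; the tail percentiles and standard deviation are what reveal the inconsistency a customer would actually feel.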

3. Concurrency Challenges: Uncovering Redshift’s Read-Write Limits

Redshift’s columnar storage can encounter challenges in scenarios demanding high levels of concurrency, especially for mixed read-and-write workloads. For instance, when a customer tested a scenario with 11 concurrent queries (Redshift supports up to 50 per cluster), Redshift processed a decent number of insertions (~64K) but could not complete a single read query. This can be attributed to an architectural design in which concurrent reads and writes on the same database object acquire conflicting locks.

Such findings are crucial for businesses – understanding data patterns and concurrency requirements is critical. It’s essential to recognize that columnar databases like Redshift have limitations in handling use cases that demand high levels of concurrent reads and writes. While Redshift is a good option for analytical needs, businesses with specific requirements for concurrent operations should be aware of these limitations. This insight is vital for aligning a data warehouse solution with businesses’ operational and real-time analytics needs, ensuring that the chosen technology effectively supports their business scenarios. 

4. Cost Consideration @Scale: Addressing $40K/customer/year Challenge in Redshift’s Model

The journey of scaling a data warehouse like Redshift involves not only initial investments but also understanding the operational and scaling costs tied to its architecture. While Redshift provides a robust starting point, certain use cases, particularly around data isolation and performance, may lead to an increased number of instances, each adding to the overall cost footprint. 

A case in point is a B2C apps company evaluating Redshift. Their assessment uncovered that while Redshift efficiently supports user-specific views and small databases, the platform offers limited isolation below the instance level: users can see all the schemas within an instance. The need for extra instances therefore arises not from a lack of resource efficiency but from the architecture itself. Although manageable in the early stages, they estimated this model could lead to $40K/year per instance for every new customer they add. Companies in such situations should weigh the long-term implications of this scaling approach, considering both the direct costs of additional instances and the indirect impact on administrative complexity. 
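The arithmetic behind that estimate is straightforward. In the back-of-envelope sketch below, the ~$40K/year-per-instance figure is the one cited above; the customer counts are hypothetical, and the model assumes (as in the B2C case) that every isolated customer requires a dedicated instance:

```python
ANNUAL_COST_PER_INSTANCE = 40_000  # approximate figure from the evaluation above

def isolation_cost(customers: int) -> int:
    """Annual instance cost when each isolated customer needs its own instance."""
    return customers * ANNUAL_COST_PER_INSTANCE

for n in (10, 50, 100):
    print(f"{n} customers -> ${isolation_cost(n):,}/year")
```

Even at 50 customers, instance-level isolation alone implies a $2M/year run rate before any query workload is considered, which is why the cost curve, not the entry price, is what these evaluations kept surfacing.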

Yellowbrick’s Edge over Redshift

So how does Yellowbrick stand up in the face of the challenges we’ve explored with Redshift? Here is a high-level overview: 

| Features | AWS Redshift | Yellowbrick |
| --- | --- | --- |
| Database limits | Limited to 60 databases/instance | Supports up to 1K databases/instance |
| Storage architecture | Column-only store | Hybrid row-column store |
| Concurrency | Up to 50 concurrent queries per cluster | Up to 150 concurrent queries per cluster |
| Data streaming | Limited capability | Built-in support for data streaming (Kafka, Spark, etc.) |
| Deployment flexibility | AWS-only, serverless, optional “in-your-VPC” | Kubernetes-native, on AWS & Azure, deployed “in-your-VPC” |

In the next blog post, “Part 2: Yellowbrick’s Performance Edge”, we’ll dive deeper into Yellowbrick’s architecture and user experience, highlighting the performance outcomes observed by customers who assessed Yellowbrick over Redshift. Will it be the strong espresso shot the data warehousing world needs?  

Let’s find out! 

 

 
