My recent venture has involved a lot of conversations with Redshift users where they shared a wealth of experiences. These discussions, often accompanied by a comforting cup of coffee, revealed interesting patterns unique to Redshift. Some of these were surprising – try not to spill your coffee!
Redshift has done an incredible job of establishing itself as the go-to data warehouse on AWS, offering native AWS integration and a cost-effective entry point that resonates with startups and established businesses alike. A common theme I heard was, ‘If you are moving to AWS, nobody gets fired for choosing Redshift.’ But what happens as they scale? In this blog post, I aim to unpack these users’ collective experience, offering a glimpse into Redshift’s pitfalls and gotchas.
Stay tuned for Part 2, where I will discuss how Yellowbrick’s architecture compares and how we addressed these challenges for users who tried our data warehouse.
The Redshift Riddles
1. The 60-Database Question: Redshift’s Instance Limits
Redshift offers a starting point for many businesses with its cap of 60 databases per instance, which, while ample for some, may present a growth challenge for others. For example, a SaaS company specializing in marketing analytics found itself approaching this limit as its user base expanded. Its multi-tenancy model required isolation at the database level, so despite having plenty of spare capacity, the company needed additional instances earlier than anticipated purely because of the cap. This experience shows why future growth matters when choosing a data warehousing solution: it has to adapt to a company’s scaling needs.
Similarly, in the realm of ‘Data Products,’ where distinct teams manage their independent databases, the database cap can introduce complexity and potential cost considerations.
While extra instances offer a solution, they also bring added layers of management and the need for additional data movement to satisfy cross-instance queries. Such scenarios highlight the value of a data warehousing solution that can provide both scalability and administrative simplicity to accommodate evolving business requirements.
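To make the arithmetic concrete, here is a minimal sketch of how the cap translates into instance count under a database-per-tenant model. It assumes all 60 databases on an instance are available for tenants (in practice some may be taken by system or shared databases), and the function name and tenant counts are purely illustrative.

```python
import math

# Cap cited in this post; assumed here to be fully usable for tenant databases.
DATABASES_PER_INSTANCE = 60


def instances_needed(tenant_count: int, cap: int = DATABASES_PER_INSTANCE) -> int:
    """Minimum number of instances when every tenant gets its own database."""
    if tenant_count < 0:
        raise ValueError("tenant_count must be non-negative")
    if cap <= 0:
        raise ValueError("cap must be positive")
    return math.ceil(tenant_count / cap)


if __name__ == "__main__":
    # Hypothetical tenant counts, chosen only to show where the step changes occur.
    for tenants in (45, 60, 61, 150):
        print(f"{tenants} tenants -> {instances_needed(tenants)} instance(s)")
```

The jump from 60 to 61 tenants is the point the SaaS company above ran into: one extra tenant forces a whole new instance, regardless of how much compute or storage headroom remains on the first.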
2. Riding the Performance Wave: Understanding Redshift’s Query Inconsistencies
Redshift’s advanced features, such as the automatic creation of materialized views and session-based result caching, are designed to optimize query performance across a broad range of analytical needs. However, some use cases, particularly those requiring real-time (or near-real-time) analytics, may experience variability in query times. For instance, a global financial reporting and performance management software provider found that query execution times could vary wildly, negatively impacting customer experience: some queries ranged from a minute to under a second, and a job could take 5 minutes on one run but just 5 seconds on the next.
Such performance variability, which also afflicts other cloud data platforms, emphasizes the importance of aligning architecture with business needs. Ensuring a consistent and reliable product experience was paramount for the financial software provider. When testing, it’s essential to look at P95/P99 timings and standard deviation over several iterations rather than relying on the fastest or average times. Recognizing the challenges in predictability, they sought ways to enhance performance consistency to uphold their service standards. This narrative underscores the need for solutions that can offer both the agility of cloud scaling and the stability necessary for critical business operations.
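As a rough illustration of why tail percentiles matter more than the best or average run, the sketch below summarizes a set of repeated timings for one query. The nearest-rank percentile method, the function names, and the sample timings are all assumptions made for the example; they are not measurements from any of the users mentioned here.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(latencies_s):
    """Summarize repeated runs of the same query (timings in seconds)."""
    return {
        "min": min(latencies_s),
        "mean": statistics.mean(latencies_s),
        "stdev": statistics.stdev(latencies_s) if len(latencies_s) > 1 else 0.0,
        "p95": percentile(latencies_s, 95),
        "p99": percentile(latencies_s, 99),
    }


if __name__ == "__main__":
    # Made-up timings: 19 runs at 5 seconds and one outlier at 5 minutes.
    runs = [5.0] * 19 + [300.0]
    for name, value in summarize(runs).items():
        print(f"{name}: {value:.2f} s")
```

With timings shaped like these, the fastest run and even P95 look healthy, while P99 and the standard deviation expose the outlier that a customer would actually notice.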
3. Concurrency Challenges: Uncovering Redshift’s Read-Write Limits
Redshift’s columnar storage can encounter challenges in scenarios demanding high levels of concurrency, especially for mixed read-and-write workloads. For instance, when a customer tested a scenario with 11 concurrent queries (Redshift can support up to 50 per cluster), Redshift processed a decent number of insertions (~64K) but could not execute a single read query. This can be attributed to its architectural design, in which concurrent reads and writes on the same database object lead to conflicting locks.
Such findings are crucial for businesses – understanding data patterns and concurrency requirements is critical. It’s essential to recognize that columnar databases like Redshift have limitations in handling use cases that demand high levels of concurrent reads and writes. While Redshift is a good option for analytical needs, businesses with specific requirements for concurrent operations should be aware of these limitations. This insight is vital for aligning a data warehouse solution with businesses’ operational and real-time analytics needs, ensuring that the chosen technology effectively supports their business scenarios.
4. Cost Considerations at Scale: Addressing the $40K/Customer/Year Challenge in Redshift’s Model
The journey of scaling a data warehouse like Redshift involves not only initial investments but also understanding the operational and scaling costs tied to its architecture. While Redshift provides a robust starting point, some use cases, particularly those around data isolation and performance, may lead to an increased number of instances, each adding to the overall cost footprint.
A case in point is a B2C apps company evaluating Redshift. Their assessment found that while Redshift efficiently supports user-specific views and small databases, the platform offers limited isolation below the instance level. The resulting need for a separate instance per customer arises not from a lack of resource efficiency but from the architecture, which allows users to see all the schemas within an instance. Although manageable in the early stages, they estimated this model could cost about $40K per year, in the form of a dedicated instance, for every new customer they add. Companies in such situations should weigh the long-term implications of this scaling approach, considering both the direct costs of additional instances and the indirect impact on administrative complexity.
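Because the cost is per customer, it grows linearly with the customer base. A back-of-the-envelope projection, assuming the company’s own estimate of roughly $40K per year for one dedicated instance per customer (the function name and customer counts below are hypothetical):

```python
# Estimate quoted in this post: roughly $40K per year for each dedicated instance.
COST_PER_INSTANCE_PER_YEAR = 40_000


def annual_isolation_cost(customers: int, cost_per_instance: int = COST_PER_INSTANCE_PER_YEAR) -> int:
    """Yearly spend when each customer requires its own instance for isolation."""
    if customers < 0:
        raise ValueError("customers must be non-negative")
    return customers * cost_per_instance


if __name__ == "__main__":
    # Hypothetical customer counts, for illustration only.
    for customers in (5, 25, 100):
        print(f"{customers} customers -> ${annual_isolation_cost(customers):,} per year")
```

The figure ignores administrative overhead and cross-instance data movement, which the section above flags as additional indirect costs.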
Yellowbrick’s Edge over Redshift
So how does Yellowbrick stand up in the face of the challenges we’ve explored with Redshift? Here is a high-level overview:
| Features | AWS Redshift | Yellowbrick |
|---|---|---|
| Database Limits | Limited to 60 databases/instance | Supports up to 1K databases/instance |
| Storage Architecture | Column-only store | Hybrid row-column store |
| Concurrency | Up to 50 concurrent queries per cluster | Up to 150 concurrent queries per cluster |
| Data streaming | Limited capability | Built-in support for data streaming (Kafka, Spark, etc.) |
| Deployment Flexibility | AWS-only, serverless, optional “in-your-VPC” | Kubernetes-native, on AWS & Azure, deployed “in-your-VPC” |
In the next blog post, “Part 2: Yellowbrick’s Performance Edge”, we’ll dive deeper into Yellowbrick’s architecture and user experience, highlighting the performance outcomes observed by customers who assessed Yellowbrick over Redshift. Will it be the strong espresso shot the data warehousing world needs?
Let’s find out!