Overview of Yellowbrick WLM
In Yellowbrick Data Warehouse, rich workload management (WLM) functionality is critical for delivering reliable, near-real-time performance at scale. Whether your company is an enterprise with a business-critical data warehouse or a SaaS provider whose product depends on reliable large-scale data processing (a necessity in sectors such as financial services), leveraging robust workload management tools to ensure near-real-time response across hundreds or even thousands of concurrent queries is non-negotiable for a good user experience.
In this post, we’ll explore the life of a Yellowbrick query: what Yellowbrick WLM is, what it’s used for, how it works, and how understanding query execution can inform a good WLM strategy.
What is Workload Management?
In Yellowbrick Data Warehouse, resources can be shared in various ways to optimize workload management. The main resources that can be allocated to database queries and other operations are CPU, memory, and temporary spill space. For example, on a system with a high degree of concurrency, you can allocate more CPU to specific queries by increasing their priority. Complex queries that require more memory or spill space can request more of these resources at different points during execution. To keep performance predictable, long-running queries can be queued behind very fast queries so that short-running queries are not starved of adequate resources.
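To make the priority-and-queueing idea concrete, here is a minimal Python sketch of a dispatcher that favors high-priority and short queries over long-running ones. This is an illustration of the scheduling concept only; the class, attribute names, and thresholds are invented, and none of this reflects Yellowbrick's internal implementation.

```python
import heapq

# Illustrative sketch: a dispatcher that favors short, high-priority queries
# so they are never queued behind long-running work. All names here are
# hypothetical, not Yellowbrick internals.

class Dispatcher:
    def __init__(self):
        self._queue = []   # min-heap ordered by (rank, query id)
        self._arrival = 0

    def submit(self, query_id, expected_seconds, priority):
        # Lower rank dispatches first: explicit priority dominates,
        # then shorter expected runtime, then arrival order as a tie-break.
        rank = (-priority, expected_seconds, self._arrival)
        heapq.heappush(self._queue, (rank, query_id))
        self._arrival += 1

    def next_query(self):
        return heapq.heappop(self._queue)[1] if self._queue else None

d = Dispatcher()
d.submit("nightly_rollup", expected_seconds=3600, priority=1)
d.submit("dashboard_tile", expected_seconds=0.2, priority=1)
d.submit("admin_flush", expected_seconds=5, priority=9)
# The superuser query dispatches first, then the sub-second query,
# then the long report -- even though the report was submitted first.
```

The ordering encodes the two policies described above: raising a query's priority moves it ahead of everything less urgent, and among equal priorities, short queries jump ahead of long ones.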
A workload is a set of queries or other database requests that is to some extent a known quantity. For example, if a group of users run ad hoc queries against the same set of tables every day, that set of queries, though somewhat arbitrary, may be thought of as a known and expected workload. A resource-intensive report that is run by one user at the same time every morning may also be anticipated as a separate (and probably high-priority) workload. A third example is database administration work, such as bulk loads and backups, which may occur during a maintenance window when end users have minimal access to the system.
Workloads may be defined across many different dimensions: in terms of when they are run, the application or user that runs them, the type of work, their expected duration, whether they are resource-intensive, and so on. These variables are typical for MPP database systems, which are rarely used for one type of query or by one type of user.
Query Execution Use Cases
To dive deeper into the question “what is workload management?”, it is helpful to explore some familiar query execution use cases:
- Runaway queries: identify and stop long-running queries that, for example, select all of the rows from a very large table (whether issued naively, by mistake, or at a “bad time”)
- Short-query bias: give priority to queries that run very fast (sub-second speed) and prevent them from being queued behind longer-running queries for which an instant response is neither expected nor required
- Ad hoc queries: place “browsing” or “discovery” queries at a lower priority in the queue than more critical queries that are needed to run the business
- Time-sensitive queries: apply different rules at different times of the day or week. For example, weekly business roll-ups have the highest priority until they are done. All other queries have lower priority.
- Admin queries: allocate resources to run superuser queries immediately, especially internally generated queries that maintain the database (for example, operations that flush and analyze new table rows).
- Loads and updates: write queries that do batch loads, deletes, and updates must not starve read queries.
- Logging, auditing, and reporting: log user-defined messages and tag queries as they are executed; learn about system usage in order to adjust future workload management behavior; create audit trails for separate applications and user groups.
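Several of these use cases boil down to the same pattern: test a query's attributes against a condition, then take an action. The following Python sketch shows that pattern for three of the cases above. The attribute names, thresholds, and action strings are all invented for illustration; in Yellowbrick these decisions are expressed through WLM objects, not application code.

```python
# Hypothetical sketch: each rule is a predicate plus an action, evaluated in
# order. Attribute names and thresholds are invented for illustration.

RULES = [
    # Runaway queries: cancel anything scanning an unreasonable number of rows.
    (lambda q: q["estimated_rows"] > 1_000_000_000, "cancel"),
    # Short-query bias: sub-second queries route to a high-priority pool.
    (lambda q: q["expected_seconds"] < 1, "route:interactive"),
    # Ad hoc "browsing" queries run at low priority.
    (lambda q: q["application"] == "ad_hoc", "route:low_priority"),
]

def classify(query):
    for predicate, action in RULES:
        if predicate(query):
            return action
    return "route:default"

# A full-table scan over billions of rows is stopped outright:
classify({"estimated_rows": 5_000_000_000,
          "expected_seconds": 900, "application": "etl"})   # -> "cancel"
```

Rule order matters in a scheme like this: the runaway-query check runs first so that even a query tagged as ad hoc is cancelled if it is clearly out of bounds.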
To optimize resource allocation based on workloads, operators create workload management objects called rules, resource pools, and profiles. These objects define a flexible set of heuristics to translate typical workload management use cases into an optimal strategy for resource allocation and scheduling. You can set up workload management objects either in the Yellowbrick console or by using SQL commands.
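Conceptually, the three object types fit together as follows: a profile bundles a set of resource pools with the rules that route queries into them. The Python sketch below models that relationship only; the field names and values are illustrative assumptions, and the authoritative definitions are the ones in the Yellowbrick console and SQL reference.

```python
from dataclasses import dataclass, field

# Illustrative model of how the three WLM object types relate.
# Field names and example values are assumptions, not Yellowbrick's schema.

@dataclass
class ResourcePool:
    name: str
    max_concurrency: int       # queries allowed to run at once
    memory_per_query_mb: int   # memory granted to each running query
    spill_space_mb: int        # temporary spill space per query

@dataclass
class Rule:
    name: str
    condition: str   # e.g. "expected runtime < 1s"
    action: str      # e.g. "route to pool 'interactive'"

@dataclass
class Profile:
    name: str
    pools: list = field(default_factory=list)
    rules: list = field(default_factory=list)

# A profile bundles pools and rules; activating the profile applies them all.
daytime = Profile(
    name="daytime",
    pools=[ResourcePool("interactive", 24, 512, 1_024),
           ResourcePool("batch", 4, 8_192, 65_536)],
    rules=[Rule("short_first", "expected runtime < 1s",
                "route to pool 'interactive'")],
)
```

The key design point is the indirection: queries never name resources directly. Rules place them into pools, and pools own the CPU, memory, and spill allocations, so operators can retune resources without touching the queries themselves.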
An important facet of making decisions about system resource allocation is understanding query execution. Yellowbrick queries move through a finite set of states from submission to completion. A basic understanding of this process will help you develop an effective WLM strategy, especially with respect to the creation of WLM rules.
As a query passes through each state in its life cycle, runtime statistics are captured and logged. These statistics provide a measure of the time spent in each phase of query execution, giving administrators a means of monitoring and analyzing query performance. Wait times and actual processing times are measured at each stage.
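The kind of per-state timing described here can be pictured as a simple timeline: record when the query enters each state, and attribute the elapsed time to the state it just left. The sketch below is purely illustrative; the state names are examples, and in practice these statistics come from Yellowbrick's own instrumentation, not client-side code like this.

```python
import time

# Illustrative sketch: attribute wall-clock time to the state a query just
# left. State names are examples only; real statistics come from the
# database's own instrumentation.

class QueryTimeline:
    def __init__(self):
        self._state = None
        self._entered = None
        self.durations = {}   # state name -> seconds spent in that state

    def enter(self, state, now=None):
        now = time.monotonic() if now is None else now
        if self._state is not None:
            self.durations[self._state] = (
                self.durations.get(self._state, 0.0) + now - self._entered)
        self._state, self._entered = state, now

t = QueryTimeline()
t.enter("PLAN", now=0.0)
t.enter("QUEUE", now=0.003)   # 3 ms spent planning
t.enter("RUN", now=0.250)     # 247 ms queued -- a sign of contention
t.enter("DONE", now=1.250)    # 1 s executing on the workers
```

Separating wait time (such as time spent queued) from processing time is exactly what makes these statistics useful: a query that queues for 247 ms but runs for 1 s points at contention, not at a slow plan.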
The following diagram shows the life cycle of a query. Each query passes through several states on the manager node while it is being prepared for execution; it then starts running (executing) on the worker nodes.
The diagram also identifies when Yellowbrick queries can be cancelled or restarted. Once submitted, a query runs to completion, is cancelled, or fails with an error (DONE, CANCEL, and ERROR states). If a query is restarted or returns an error, it may re-enter the cycle in the ASSEMBLE state, but ultimately, all queries finish in one of the three completion states. (If a query is cancelled, it cannot be restarted.)
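The completion logic just described can be captured in a few lines: every query ends in one of three terminal states, a restart re-enters the cycle at ASSEMBLE, and cancellation is final. This Python sketch models only those rules from the text, with the intermediate states and transitions simplified away.

```python
# Minimal model of the completion rules described above: three terminal
# states, restarts re-enter at ASSEMBLE, and a cancelled query can never
# be restarted. Intermediate states are deliberately omitted.

TERMINAL = {"DONE", "CANCEL", "ERROR"}

class QueryLifecycle:
    def __init__(self):
        self.state = "ASSEMBLE"

    def finish(self, outcome):
        assert outcome in TERMINAL, "queries finish in one of three states"
        self.state = outcome

    def restart(self):
        if self.state == "CANCEL":
            raise RuntimeError("cancelled queries cannot be restarted")
        self.state = "ASSEMBLE"   # re-enter the cycle for another attempt

q = QueryLifecycle()
q.finish("ERROR")   # e.g. a worker failed mid-query
q.restart()         # back to ASSEMBLE for another attempt
q.finish("DONE")    # every query ultimately lands in a terminal state
```

This is the property that makes automatic restarts safe to reason about: no matter how many times a query loops back through ASSEMBLE, it can only ever exit through DONE, CANCEL, or ERROR.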
Workload Management Tools
When researching and evaluating workload management tools, it is critically important to go beyond conventional offerings and focus on products that provide advanced functionality. For example, Yellowbrick’s next-generation WLM tools deliver self-healing and automatic query restarts. Below are some common use cases that highlight the significant advantages of this advanced functionality:
- Self-healing cluster: in scenarios where a cluster experiences a hardware failure, Yellowbrick WLM enables the cluster to recover and restart all impacted queries without users ever being aware of an issue or having to write a single rule. (If desired, users still have sub-second control over query flow and execution, including cancellation and restarts, via scriptable rules.)
- Penalty boxing: in scenarios where several groups run short, interactive queries that cannot be interrupted, users can define rules for what constitutes a long-running query and govern how those queries are automatically moved and their resources adjusted (for example, limiting how many such queries can run at any one time, or setting lower priorities and specific limits on memory or other system resources).
- Improved resource mapping: in a shared environment where each database belongs to an organizational unit, users can set disk quotas that control how much space an individual database, schema, or table may consume. As a result, the DBA can prevent any single database from consuming the full space of the cluster.
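The penalty-boxing idea in particular reduces to a small, mechanical rule: once a query's elapsed time crosses a user-defined threshold, move it to a restricted pool. The sketch below illustrates that rule in Python; the threshold and pool names are invented, and a real WLM rule would also adjust priority, memory, and concurrency as described above.

```python
# Hypothetical sketch of the "penalty box" rule: queries whose elapsed
# runtime exceeds a user-defined threshold are moved to a restricted pool.
# The threshold and pool names are invented for illustration.

def apply_penalty_box(running_queries, threshold_seconds=30):
    """Map each query id to a pool based on its elapsed runtime in seconds."""
    return {qid: ("penalty" if elapsed > threshold_seconds else "interactive")
            for qid, elapsed in running_queries.items()}

# A 2-minute query is demoted; the 5-second queries keep their fast pool:
apply_penalty_box({"q1": 5, "q2": 120, "q3": 3})
```

Because the demotion is automatic, the interactive pool stays responsive for the short queries that cannot be interrupted, without an administrator having to watch for and kill offenders by hand.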
As you can see, workload management is an important reason why Yellowbrick Data Warehouse is as fast and reliable as it is. To learn more and see a demo of workload management in action, watch this on-demand webinar.