How Yellowbrick Stores Data – And Why It Makes Your Queries Fast

Rosa Lear
5 Min Read

Under the hood, Yellowbrick uses a hybrid of columnar storage on worker nodes and rowstore storage on manager nodes to deliver high performance on analytic workloads. Understanding this architecture helps DBAs and data engineers design tables that query faster and use space more efficiently. For a broader look at what makes Yellowbrick’s engine unique, see Secrets of Yellowbrick Database Architecture.

Manager Nodes vs. Worker Nodes: Different Jobs, Different Storage

Yellowbrick separates system responsibilities from user data.

  • Manager node storage holds the operating system, system catalogs, temporary rowstore data, and result‑set spool. Critical portions are protected by disk replication and failover.
  • Worker node storage holds user data, spill space for queries, metadata, and parity data using a Yellowbrick‑specific RAID‑6‑style erasure coding scheme.

Worker storage consists of SSDs, and the appliance is effectively “out of space” as soon as any single disk fills, so balancing space usage across workers matters.

Columnar vs. Rowstore: The Best of Both Worlds

Databases in Yellowbrick use two complementary storage formats.

  • Columnar storage on workers: User data is organized into columnar blocks, optimized for large analytic scans, subset‑of‑columns access, and bulk data load.
  • Rowstore on managers: Used for system catalogs and temporary data from small inserts or loads under about 30 MB.

A background process (yflush) automatically flushes rowstore data to columnar storage once tables reach a threshold (roughly 30 MB) and no conflicting write locks exist. That means DBAs typically don’t need to micro‑manage this flush behavior. This hybrid approach is central to how Yellowbrick supports streaming analytics — real-time inserts land in the rowstore and become immediately queryable while the column store handles large analytic scans.
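As a rough mental model only (this is not Yellowbrick’s actual implementation), the flush decision described above can be sketched in a few lines of Python. The ~30 MB threshold and the write-lock check come from the text; the function name and signature are hypothetical:

```python
FLUSH_THRESHOLD_BYTES = 30 * 1024 * 1024  # ~30 MB threshold from the text


def should_flush(rowstore_bytes: int, has_conflicting_write_lock: bool) -> bool:
    """Sketch of the background flush decision: move a table's buffered
    rowstore data to columnar storage once it reaches the threshold and
    no conflicting write lock is held."""
    return rowstore_bytes >= FLUSH_THRESHOLD_BYTES and not has_conflicting_write_lock
```

The point is simply that the decision is automatic and size-driven, which is why small streaming inserts can land in the rowstore and stay queryable without DBA intervention.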

Shards and Blocks: How Data Is Physically Organized

At the physical level, Yellowbrick uses shards and blocks to organize table data.

  • A shard is the top‑level storage unit for a table on a worker. Each shard holds up to about 1 GB of uncompressed data plus metadata such as row count and min/max values.
  • Within a shard, each column is stored in blocks, with each block containing up to about 32k rows of a single column, along with min/max and null counts.

This structure enables Yellowbrick to skip entire shards and blocks when queries filter on ranges, drastically reducing I/O. To see how these internals translate to real query execution, read Life of a Yellowbrick Query.
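The block-skipping idea above can be sketched with a toy scanner. The min/max metadata per block is from the text; the `Block` class and scan function are illustrative, not Yellowbrick internals:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Per-block metadata: min/max values let the scanner decide whether
    a block can possibly contain matching rows before reading it."""
    min_val: int
    max_val: int
    rows: List[int]  # up to ~32k values of a single column in the real system


def scan_with_pruning(blocks: List[Block], lo: int, hi: int) -> List[int]:
    """Return rows matching lo <= x <= hi, skipping any block whose
    [min, max] range cannot overlap the predicate -- no I/O for that block."""
    out = []
    for b in blocks:
        if b.max_val < lo or b.min_val > hi:
            continue  # entire block pruned using metadata alone
        out.extend(v for v in b.rows if lo <= v <= hi)
    return out
```

The same min/max check applies one level up at the shard, which is what lets range-filtered queries avoid most of the table’s I/O entirely.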

Automatic Maintenance: Garbage Collection and Rowstore Cleanup

Yellowbrick continuously maintains its storage structures so DBAs don’t have to run manual reorg jobs.

  • A garbage collection process runs roughly every five minutes, freeing space from deleted rows, reclaiming shards, and merging small shards.
  • Manager‑node rowstore data is periodically flushed to the column store, and system catalog tables are vacuumed on a separate schedule.

Because Yellowbrick doesn’t update rows in place, updates mark old rows as deleted and write new rows to fresh shards, which GC then cleans up when safe. This automatic maintenance is one reason enterprise data warehouse deployments can run with minimal DBA overhead.

Monitoring Space and Skew

The System Management Console (SMC) and system views give visibility into space usage and data distribution.

  • High‑level dashboards summarize total user data, temp space, and system spill space usage.
  • Detailed views such as sys.tablestorage expose per‑table compressed and estimated uncompressed sizes, row counts, and worker‑level distribution.

This makes it straightforward to spot skewed tables or workers before they become performance or capacity problems. For more on observability and tuning, see Workload Analytics and 7 Practical Ways to Improve Yellowbrick Performance.
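One simple way to flag skew from per-worker sizes (such as those exposed by the storage views above) is a max-to-average ratio. The metric and sample numbers here are illustrative, not a Yellowbrick-defined formula:

```python
def skew_ratio(bytes_per_worker):
    """Largest worker's share divided by the average share. A value near
    1.0 means even distribution; well above 1.0 means one worker will
    fill up (and become 'out of space') long before the others."""
    avg = sum(bytes_per_worker) / len(bytes_per_worker)
    return max(bytes_per_worker) / avg


# Hypothetical per-worker compressed sizes for one table, in MB:
even = [100, 100, 100, 100]
skewed = [100, 110, 95, 400]  # one hot worker holds most of the data
```

Because total usable capacity is gated by the fullest single disk, even a modest ratio on a large table is worth investigating before it becomes a capacity problem.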
