In our previous post about using TPC-DS queries to illustrate how the Yellowbrick query optimizer works, we introduced TPC-DS Query 1 (Q01). This post tackles Query 2 (Q02), focusing on the subtle tricks required to handle the fact tables in TPC-DS.
TPC-DS splits its sales information into three different fact tables, an approach typical in the real/non-benchmark world, too. In TPC-DS, these tables are
web_sales
catalog_sales
—twice the size ofweb_sales
store_sales
—twice the size ofcatalog_sales
(obviously, the schema was designed pre COVID-19)
Many queries in TPC-DS will UNION
these tables together in various ways. Q02 is the first example:
WITH wscs AS (
SELECT sold_date_sk, sales_price
FROM (
SELECT ws_sold_date_sk sold_date_sk
, ws_ext_sales_price sales_price
FROM web_sales
UNION ALL
SELECT cs_sold_date_sk sold_date_sk
, cs_ext_sales_price sales_price
FROM catalog_sales
) AS x
)
, wswscs AS (
SELECT
d_week_seq
, SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END) sun_sales
, SUM(CASE WHEN (d_day_name = 'Monday') THEN sales_price ELSE NULL END) mon_sales
, SUM(CASE WHEN (d_day_name = 'Tuesday') THEN sales_price ELSE NULL END) tue_sales
, SUM(CASE WHEN (d_day_name = 'Wednesday') THEN sales_price ELSE NULL END) wed_sales
, SUM(CASE WHEN (d_day_name = 'Thursday') THEN sales_price ELSE NULL END) thu_sales
, SUM(CASE WHEN (d_day_name = 'Friday') THEN sales_price ELSE NULL END) fri_sales
, SUM(CASE WHEN (d_day_name = 'Saturday') THEN sales_price ELSE NULL END) sat_sales
FROM wscs
INNER JOIN date_dim ON d_date_sk = sold_date_sk
GROUP BY d_week_seq
)
SELECT
d_week_seq1
, ROUND(sun_sales1 / sun_sales2, 2)
, ROUND(mon_sales1 / mon_sales2, 2)
, ROUND(tue_sales1 / tue_sales2, 2)
, ROUND(wed_sales1 / wed_sales2, 2)
, ROUND(thu_sales1 / thu_sales2, 2)
, ROUND(fri_sales1 / fri_sales2, 2)
, ROUND(sat_sales1 / sat_sales2, 2)
FROM (
SELECT
wswscs.d_week_seq d_week_seq1
, sun_sales sun_sales1
, mon_sales mon_sales1
, tue_sales tue_sales1
, wed_sales wed_sales1
, thu_sales thu_sales1
, fri_sales fri_sales1
, sat_sales sat_sales1
FROM wswscs
INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq
WHERE d_year = 2001
) y
, (
SELECT
wswscs.d_week_seq d_week_seq2
, sun_sales sun_sales2
, mon_sales mon_sales2
, tue_sales tue_sales2
, wed_sales wed_sales2
, thu_sales thu_sales2
, fri_sales fri_sales2
, sat_sales sat_sales2
FROM wswscs
INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq
WHERE d_year = 2001 + 1
) z
WHERE d_week_seq1 = d_week_seq2 - 53::INT
ORDER BY d_week_seq1
LIMIT 100
;
Estimating UNION ALL
Consider this part of the query:
WITH wscs AS (
SELECT ws_sold_date_sk sold_date_sk
, ws_ext_sales_price sales_price
FROM web_sales
UNION ALL
SELECT cs_sold_date_sk sold_date_sk
, cs_ext_sales_price sales_price
FROM catalog_sales
)
Here, we construct a Common Table Expression (CTE) we will later join to date_dim
. But how should we estimate the cardinality of this join?
FROM wscs
INNER JOIN date_dim ON d_date_sk = sold_date_sk
With statistics, it’s easy to estimate the cardinality of the UNION ALL
. However, to estimate the cardinality of the join itself, we have a few choices.
- We know the number of distinct
sold_date_sk
in each of the union branches. But even with very advanced statistics information, it is difficult to say how many distinct values theUNION ALL
of them has and what the overlap with thedate_dim
actually is. - We know there is a foreign key from
ws_sold_date_sk
tod_date_sk
and another foreign key fromcs_sold_date_sk
tod_date_sk
. Because of this, we can know that the join todate_dim
of theUNION
must yield:- The
SUM
of the two inputs MINUS
the null rows on eitherws_sold_date_sk
andcs_sold_date_sk
- The
The second choice obviously provides the superior estimation method. Using primary/foreign key relationships such as this one is a key (pardon the pun) optimization for query planners: It rewards database designers who build good schemas and properly declare keys with a better query plan. This is why proper database design is very important for getting the most out of relational databases.
Estimating GROUP BY
From an estimation perspective, this join is of particular interest:
FROM wswscs
INNER JOIN date_dim ON date_dim.d_week_seq = wswscs.d_week_seq
WHERE d_year = 2001
Here, we are joining on something that is not a key (d_week_seq
)—which in general means there is plenty of opportunity to get the estimate wrong.
How well we estimate this join depends a lot on what we can say about the following construct:
wswscs AS (
SELECT
d_week_seq
, SUM(CASE WHEN … etc..
FROM wscs
JOIN date_dim ON d_date_sk = sold_date_sk
GROUP BY d_week_seq
)
Estimating aggregation is generally a difficult problem (because the cardinality of the output depends on the correlation of the input columns). However, for this particular join, we can be very confident that the cardinality is likely to be the same as the cardinality of d_week_seq
, because we are grouping on only a single column.
The second part of the join requires us to reason about the cardinality of d_week_seq
within date_dim
, given the filter of d_year = 2001
. Here, a two-dimensional histogram of (d_week_seq, d_year
) would be a great help. Unfortunately, the number of possible histograms you’d have maintain in this manner would grow so fast that you’d often run out of memory/disk to hold them. However, we can say that the number of distinct values can’t possibly exceed the number of rows estimated for d_year = 2001
—i.e., 365. That gives us a reasonable cap.
CTE materialization
In Q01, we saw that although the same CTE is used multiple times in a query, it may not be a good idea to materialize the result. However, for Q02, we are looking at some very significant savings. Consider the following construct:
SELECT
d_week_seq
, SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END)
, … more columns …\
FROM wscs
INNER JOIN date_dim ON d_date_sk = sold_date_sk
GROUP BY d_week_seq
The output size of something grouped by d_week_seq
is obviously a tiny fraction of the input size of wcsc
(which you may recall is the UNION
of catalog_sales
and web_sales
). Even with a high-speed column store, there is a bit to be gained by materializing this intermediate result, particularly because expressions such as the following are very expensive to evaluate:
SUM(CASE WHEN (d_day_name = 'Sunday') THEN sales_price ELSE NULL END)
Human beings really like strings; we find it easy to remember and work with them. However, machines pay a significant price when comparing strings. For a CPU, it’s orders of magnitude faster to compare integers than it is to compare strings. There are seven of these string compares for every row in wcsc
—which becomes billions of compares during query execution. So, it’s worthwhile to save the overhead of calculating these multiple times by materializing the intermediate result.
JOIN order
In this query, we don’t have a lot of choices with regard to join ordering. The dependencies between the joins force an order. The order is:
- Join
wscs
todate_dim
—materialize the result ofwswscs
- Join the two copies of
wswscs
(for 2001 and 2002) - Join the two above copies with each other on
d_week_seq