One of the things I have noticed about engineers is that there’s a certain gravity that pulls them to a particular type of work. It’s not that there’s always a particular interest in doing a specific type of development – it’s more complex. It might be a skill set that is suited for a particular job that is hated, but that serves the project so well that person MUST do that work. Or perhaps the person loves hacking C++ so much that they shepherd every problem down the road to the point where they’re hacking where they are happiest. The attraction of these dynamics tends to pull them to these tasks. Gravity.
What I noticed about myself over the years I have been working is that I am essentially floating in space. There isn’t any gravity where I live. No matter where I have worked, after a few years of being there I look around me and I have my hands all over everything. Gravity doesn’t pull me in any particular direction; I land everywhere. I haven’t met anyone else like me; actually, I think my managers and coworkers probably get seriously annoyed because I end up making messes all over the place. On balance, I hope the benefits of having me on the team are positive, but I think it’s probably a PITA for everyone involved!
When I started at Yellowbrick, Neil Carson and I sort of had this gentleman’s bet. After Neil hired me, he felt like I would tend to work on UI mostly at Yellowbrick, and I told him I would end up working on a lot more than UI. I think I won this bet (but I think I’ve lost more than I’ve won over the years) because I spend probably about 20% of my time doing UI work. The rest of the time I think I am floating around all over the place, working on what I think will make the product better, all the while focusing on Workload Management, or WLM, as a cross-cutting concern.
What is Workload Management?
So what is WLM? From one simplistic perspective, Workload Management is the set of knobs you give an administrator to control the resources on an MPP database: CPU, memory, and disk. At Yellowbrick, our WLM lets an administrator do these things to establish a policy that governs what resources a query is given. What’s this look like?
We think of WLM as:
- Query grammar.
- Query parsing.
- Query planning.
- Query optimization, tree reshaping, plan rebuilding.
- Query analysis.
- Query code generation, including transforms from an AST to C++.
- Query compilation.
- Query resource planning.
- Query policy.
- Query execution.
- Query distribution.
- Query concurrency.
- Access Control and privileges, role management, identity management.
- Encryption, compression.
- Network stack development, including Infiniband, DPDK, codegen-ed RPC, etc.
- Kubernetes orchestration.
- AWS/Azure/GCP hardware optimizations.
- Data ingest, unload, backup, restore, custom ingest formats.
- Operating system customization.
- Linux kernel customization.
- Custom switch development.
- Firmware and embedded systems development.
- HA stack.
- Network driver customization.
- Private networking.
- Data replication.
Above all that lives your typical middleware, database, and front end stack. And yet at Yellowbrick, we have no one single engineer that does a bit of ALL of that above or below the line where the database lives. There is no “full stack” engineer because no one at Yellowbrick can do all those jobs. It’s really, what coordinate space(s) do you live in at Yellowbrick? Because I don’t think we have much in classic computer science and systems development that we DON’T do.
Cross-Cutting Concerns
Statistics have to be measured EVERYWHERE in Yellowbrick for the duration of a query. WLM is all about aggregating and building a picture of an overall workload, and letting you drill into any session, transaction, or query within the workload to understand where the bottlenecks are and where queries take the wrong path, spend too much time spilling, or use too much memory. But in an MPP data warehouse like Yellowbrick’s, where we obsessively engineer the execution pipeline down to machine code, we aren’t allowed to measure these statistics the way other products, or perhaps enterprise databases, do.
I remember the day and I remember where I was sitting when I had my first conversation about statistics and how we would collect them with Thomas Kejser. We were sitting in the back of Neil’s car coming from a meeting at one of our original investors, and Thomas and I were in the middle of an argument over how we would measure statistics to build the picture of a query’s lifecycle.
For Thomas and me, this was probably our first encounter, so we were arguing from different viewpoints. From my perspective, we were going to install some agents on our executors to measure CPU usage and memory usage, because I came from a classic background in enterprise management, and that’s what you do. You measure. You sample. And you build a picture.
Thomas told me we couldn’t afford the switch in execution context to measure performance, because we would perturb our query’s performance. We were obsessively building this bare-metal machine executor that was using AVX instructions and attempting to avoid thrashing the L1/2/3 cache pipelines of the CPU. I learned that performance was king at Yellowbrick. It was a way of thinking I hadn’t dealt with since I was at University doing systems programming coursework. I was an application engineer, and Thomas wasn’t.
The seminal moment came when I told Thomas you HAD to have an agent on the worker to measure performance, because how could you know and alert if your CPU utilization went above a certain threshold? There, I had him. I didn’t know Thomas that well at this point, but I knew I was going to be able to install that agent and move on to measuring CPU performance and other stuff we needed. Thomas smiled. He said, “We consider it a bug if your CPU utilization drops below 100%.”
I didn’t understand. Normally, when your database starts spiking, you call someone to go fix it. I had built an entire company as the CTO around the notion that you take such alerts and aggregate them at a massive scale to build a better picture. It was called “Business Service Management,” and I knew a lot about it.
So of course I kept pushing back. But being the engineer that I am, I started to fear. A lot. I told Thomas to explain himself, and he told me the bad news: Yellowbrick was going to be its own operating system, and we were going to fully utilize every drop of CPU of every core on every place the query was running. There wasn’t even a single spare compute cycle we would devote to ANYTHING that would perturb our query’s performance when it was given the go to execute, and it wasn’t going to be something that we debated.
So if a query was using less than 100% of the CPUs across all our query executors at any point in its lifecycle, we had done something wrong as engineers, and we would consider it a bug, or perhaps a deficiency to go address. The CPU had to be running at 100% or we didn’t succeed. The ship had sailed, and this wasn’t a debatable position to argue over.
This was my Maxell Moment at Yellowbrick:
Well, f— me. I think I said to Thomas, “I am Jack’s lack of happiness.” I remember him laughing in my face over this. He got it, and he got that I got it. This news washed over me like a hurricane blast.
I ended up with so many questions about this. How would I show a picture over time of the utilization of our product? How would I show when memory was over-used, or under-used? I had already started to think about WLM, and this was 2015, before we even started on WLM. But this is the point: your big-picture visualizations support WLM, and the understanding they give an administrator of an MPP database’s resources leads to the WLM tools you use to tune it. At Yellowbrick, we were going to do this VERY differently than I had in the past.
And here’s a hint about where we landed: yes we do collect statistics, LOTS of them. We don’t do it the same way. More on that in another blog.
Build a Picture to Understand Your Problem
At some point around 2017, Neil decided he needed to focus on the Yellowbrick business more, and I was given the responsibility to own WLM. I had been building UI and delegating a lot of requirements to other engineers, but I had floated all over the place and gravity hadn’t kept me in the UI much. I had been working a lot on our Java code, and helping the engineers with the Maven shit, the open source component shit, how to organize our IoC container, why we needed it, etc. Neil knew I could hack Java, so it made sense.
When I took over WLM, I started looking at our statistics and the data started to trouble me. We hadn’t announced ourselves to the world yet (that wouldn’t happen until 2018), so we didn’t have customers at the time. But I had started to feel uneasy about our concurrency model. It felt like there were big gaps in concurrency where we would be bumping along, and BLAM everything would stop being concurrent, and then things would return to normal. I didn’t have proof, but the data was starting to speak to me and say that we had a problem.
I told Neil about this, and he said, “No way, there’s no problem with concurrency.” So I shut up and kept building WLM features. I came back to him once in a while and said, “We have this concurrency issue.” Neil was annoyed with me because I wouldn’t shut my pie hole about this, and I bet him that there was a problem, and I could prove it to him. (See the theme here, there’s a lot of betting going on!)
I set about trying to build a picture of concurrency. There’s a lot of research on this, but it tends to be graphed. I like line graphs, and I could build the picture that way, but I wanted to SEE concurrency. It felt to me that knowing that you had 10 queries executing at one moment followed by 1 query executing would only prove you had the problem, not that you knew where the problem was.
I ended up looking at the D3 site https://bl.ocks.org/ a lot. Mike Bostock is such an amazing engineer to have built D3…it was incredible stuff. One particular example of over-time visualization was this amazing event plot D3 viz called Event Drops. It was super cool to plot when things happened over time. I felt that if I could hack a visualization like event drops, where queries plotted over time in the execution lanes on our database, I would see where the concurrency problems were.
Event drops don’t consider event durations, so it wasn’t easy. And it wasn’t animated in real-time. But I did hack that picture with it, and I will blog about our re-imagining of that visualization on HTML canvas at a later date. It has served us well.
I don’t have a screenshot of how I won this bet, but when I built our Execution Timeline feature, you could see queries running over time, including their durations, and right in the middle of the picture, every so often, we would see a big “hole” in time. The visualization showed me that the Yellowbrick database was massively concurrent until something happened and concurrency dropped to exactly ZERO. Everything stopped, did something, then started doing concurrent things again.
So I won this bet. We had a concurrency issue. The issue had to do with the query lifecycle, and not to bore you with too much detail, there was a place where we had to compile queries into machine code that made every query completing off our executors wait for a query compiling. It was a silly bug that was an easy fix, but I don’t think I would have gone hunting for it until I had that picture. And that picture has served Yellowbrick customers well over the years. Our engineers and customers use it a ton.
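The “hole” the timeline made visible can also be found numerically from the same start and end times. What follows is a minimal sketch, assuming each query has been reduced to a start/end pair in milliseconds; `TimelineGaps`, `Interval`, and `findHoles` are hypothetical names for illustration and not Yellowbrick’s actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TimelineGaps {
    /** One query's execution window, in milliseconds (hypothetical shape). */
    static final class Interval {
        final long start;
        final long end;
        Interval(long start, long end) { this.start = start; this.end = end; }
    }

    /**
     * Returns the windows, between the first start and the last end,
     * during which no query was executing at all.
     */
    static List<Interval> findHoles(List<Interval> queries) {
        List<Interval> sorted = new ArrayList<>(queries);
        sorted.sort(Comparator.comparingLong((Interval q) -> q.start));

        List<Interval> holes = new ArrayList<>();
        boolean seenAny = false;
        long coveredUntil = 0; // latest end time among the queries seen so far
        for (Interval q : sorted) {
            if (seenAny && q.start > coveredUntil) {
                // Nothing was running between coveredUntil and this start.
                holes.add(new Interval(coveredUntil, q.start));
            }
            coveredUntil = seenAny ? Math.max(coveredUntil, q.end) : q.end;
            seenAny = true;
        }
        return holes;
    }

    public static void main(String[] args) {
        List<Interval> queries = List.of(
            new Interval(0, 50), new Interval(10, 60), new Interval(20, 55),
            new Interval(90, 140), new Interval(95, 120));
        for (Interval h : findHoles(queries)) {
            // prints "no queries running from 60 to 90"
            System.out.println("no queries running from " + h.start + " to " + h.end);
        }
    }
}
```

A list like this proves that a hole exists and when it happened. Seeing which queries sit on either side of it still needs the picture.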
The Query Lifecycle and Another (Spiky) Bet
Back in January of 2020, the world was very different. A lot changed. My pandemic journey led me to a complete rewrite of our query lifecycle in Yellowbrick – an all-new WLM that also began with a bet. We were sitting around our Palo Alto office talking about some of the more challenging high-end customer issues we were facing, and we were dealing with a lot of mismatches in how the Java JVM ergonomics work for different parts of our query lifecycle.
At one end of the query lifecycle, where we have a query being dispatched to multiple execution workers, our Java code was very I/O bound. It was using synchronous I/O, and we had a lot of threads sitting, waiting for a compute eternity for responses to their requests to set up a run, wait for the run to complete, or tear the run down. In that sense, the query was doing nothing.
At the other end of the query lifecycle, where a query would transform its AST into machine code, we had a very GC-intense process to code-gen a bunch of C++ code, compile that code, cache that code, and it was thrashing the Java GC to hell. If we had a bunch of activity in this space, it was starting to cause problems for the execution pipeline, the place where most of our code was I/O bound on network dispatches. Those network dispatches, however, were super sensitive and we were having problems maintaining reliable communications.
So I posited that we really should break this into two microservices, one that was solely dedicated to the ad-hoc and unpredictable nature of codegen and AST analysis, that was GC-intensive and “spiky,” while the rest of our main execution engine in Java should be in a separate JVM, to make it more predictable and reliable.
Neil threw down the gauntlet to see if I couldn’t turn that around quickly, and we ended up doing exactly this: making our “spiky” component separate and walling it off from the super-critical time-sensitive high-latency parts of the Java stack we have. This ended up making the product much more reliable, and for the most part, has been a huge win.
Another bet we made was that we decided to re-do our product’s query governance component, our state manager or query context manager, if you will. The reason was that we weren’t getting enough visibility into every pocket of a query at all parts of its lifecycle. For example, we couldn’t tell the time spent parsing vs. planning vs. locking vs. waiting for the CPU to do all those things. And we also didn’t have complete control over all those phases of the query lifecycle, such as if you were an administrator and you wanted to gate the number of queries that were allowed into the planner. We needed more fine-grained visibility and control.
This was a big project. Another engineer at Yellowbrick, Matt Ripley, had been working hard on the systems side of this problem by rewriting our RPC layer to be asynchronous, based on Netty, instead of using synchronous I/O. This seemed like a good time to cut over to using that, but little did I know HOW MUCH the world would change. Not only was the world dealing with a raging pandemic, but in my world, I was going to go down a road I hadn’t yet traveled.
The Asynchronous Development Model is a Virus
What I learned about development with Netty, and asynchronous programming using Java’s CompletableFuture model, is that you can’t stick your toe in the async world and get a taste of it. It’s viral. Once you start doing async, making only part of your stack async doesn’t cut it. It’s an all-in bet.
I mean, what’s the use of doing this?
CompletableFuture.allOf(connections.stream().map(c -> c.makeSomeRemoteRPCCall()).toArray(CompletableFuture[]::new)).join();
All you have done is reduce the number of synchronous threads down to one, but you still have a blocked thread. And that one blocked thread can still cause you LOTS of trouble:
- It can hold a lock.
- It can hold a limited resource.
- It can (and will) deadlock with something else doing one of the two things above.
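For contrast, here is a minimal sketch of the same fan-out with no blocked thread at all. It assumes each call returns a `CompletableFuture<String>`; the `Connection` interface and the `callAll` helper are hypothetical stand-ins, not a real RPC API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class FanOutSketch {
    /** Stand-in for a remote worker connection (hypothetical, not a real API). */
    interface Connection {
        CompletableFuture<String> makeSomeRemoteRPCCall();
    }

    /**
     * Fans out one call per connection and returns a future of all replies.
     * No thread waits here: the replies are gathered in a callback that
     * runs once the last call has completed.
     */
    static CompletableFuture<List<String>> callAll(List<Connection> connections) {
        List<CompletableFuture<String>> calls = connections.stream()
            .map(Connection::makeSomeRemoteRPCCall)
            .collect(Collectors.toList());

        return CompletableFuture
            .allOf(calls.toArray(new CompletableFuture[0]))
            // join() cannot block in this callback: every call is already complete.
            .thenApply(done -> calls.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        CompletableFuture<String> slow = new CompletableFuture<>();
        List<Connection> connections = List.of(
            () -> CompletableFuture.completedFuture("worker-1 ok"),
            () -> slow);

        CompletableFuture<List<String>> replies = callAll(connections);
        System.out.println(replies.isDone()); // false: still pending, but nobody is blocked
        slow.complete("worker-2 ok");         // the simulated network reply arrives
        System.out.println(replies.join());   // [worker-1 ok, worker-2 ok]
    }
}
```

The caller gets a future back and attaches the next step to it, so there is never a thread parked while holding a lock or a limited resource.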
You see, writing asynchronous code is super unforgiving. You can’t do things like you used to do:
- I will acquire some resource.
- I will call some remote service.
- I will continue.
Your code can’t do this straight-line anymore. It has to be reactive and composed of a bunch of things that are sequenced. During this project I learned something very critical in Java: don’t use locks to sequence steps of a program. The corollary to this is: only use locks to protect state, and never hold a lock if you’re doing something more than protecting state. Breaking these rules will always lead to resource starvation, deadlocks, and unreliability. Instead, compose asynchronous stages of a pipeline using CompletableFuture.thenCompose().
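As a rough illustration of both rules, here is a minimal sketch of the acquire, call, continue sequence written as composed stages. The class and method names (`PipelineSketch`, `acquireSlot`, `callRemote`, `releaseSlot`) are hypothetical, the remote call is simulated with `supplyAsync`, and the lock guards only a counter; none of this is Yellowbrick’s actual code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;

class PipelineSketch {
    // The lock protects the `running` counter and nothing else.
    // It is never held across a remote call.
    private final ReentrantLock lock = new ReentrantLock();
    private int running = 0;

    /** Stage 1 (hypothetical): record that a query holds a slot. */
    CompletableFuture<String> acquireSlot(String query) {
        lock.lock();
        try { running++; } finally { lock.unlock(); }
        return CompletableFuture.completedFuture(query);
    }

    /** Stage 2 (hypothetical): a remote call, simulated on another thread. */
    CompletableFuture<String> callRemote(String query) {
        return CompletableFuture.supplyAsync(() -> "result of " + query);
    }

    /** Stage 3: give the slot back, whether the call succeeded or failed. */
    void releaseSlot() {
        lock.lock();
        try { running--; } finally { lock.unlock(); }
    }

    int running() {
        lock.lock();
        try { return running; } finally { lock.unlock(); }
    }

    /** The stages are sequenced by composition, not by holding a lock. */
    CompletableFuture<String> run(String query) {
        return acquireSlot(query)
            .thenCompose(q -> callRemote(q)
                .whenComplete((result, error) -> releaseSlot()));
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch();
        // join() is used only so this demo can print before the JVM exits.
        System.out.println(pipeline.run("select 1").join()); // result of select 1
        System.out.println(pipeline.running());              // 0
    }
}
```

The release is attached inside the `thenCompose` stage so that it only runs for a slot that was actually acquired.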
After some time of building the new async pipeline, we had to wire it up to the rest of the database, which was using RPCs, and grafting in the RPC async library with Netty was a dream. In the end, if you were to run jstack (which dumps the running thread stacks) against our Java code that governs the query pipeline, you wouldn’t find ANY blocked threads ANYWHERE, and if you did, it would likely be a bug. It’s a big reactor of things waiting and reacting and then waiting and reacting, forever.
Tests Are Your Savior
At Yellowbrick we could make a big bet to change our product in this fundamental way because we invested an absolute monster amount of engineering effort over the years in tests that measure the correctness of our database. So if we decided to change something fundamental about how we orchestrate a query’s lifecycle, and we got it wrong, the bet was still pretty safe to make because our tests would find our faults. Of course, we’re always making more tests, and engineering new ways to error-inject problems into our code and make sure everything reacts well. Part of our re-imagining of the query lifecycle was to build test infrastructure.
So in 2020, it was high time to make the changes we needed to improve the visibility, control, and reliability of our query lifecycle, and the results have been great. Our 5.0 product shipped and has had very few of the issues we used to face with prior generations of our product.
WLM…Needs More Cowbell
While much of the Yellowbrick database self-tunes, the need for workload management tools and features that elevate and explain more parts of the query lifecycle seems unbounded. It doesn’t seem like it will be a journey with an ending. We are always improving.
For that reason, I feel like I will continue to float in zero-gravity at Yellowbrick, and that’s why I love working on WLM in general. I get the dream job of building cool visualizations, pulling together data from a huge number of places, and providing our administrators the insight they need to set that knob to 11.