Data Warehouse on Kubernetes

A “Direct” Path to Analytics: Cloud-Native Data Warehousing



In today’s data-driven world, businesses are grappling with vast amounts of data that require sophisticated management and analysis. As data volumes continue to grow, traditional data warehousing methods are increasingly becoming inadequate to handle the complexity and scale of modern data processing. In response to these challenges, cloud-native data warehousing has emerged, offering improved performance, scalability, and cost-effectiveness.

In this DM Radio interview, Yellowbrick CTO Mark Cusack and 7wData Founder Yves Mulkers join Bloor Group CEO Eric Kavanagh to discuss cloud-native data warehousing trends and innovations and explore how it’s transforming the analytics landscape.

They discuss the benefits of direct data access, columnar data storage, and high-speed interconnects that optimize the path between CPUs and storage. The panelists also provide real-world examples of how cloud-native data warehousing can help businesses gain faster insights from their data, improve decision-making processes, and drive business outcomes.

Transcript:

Eric Kavanagh:

Ladies and gentlemen, it’s time once again for DM Radio. And we’re going to kick off with a five-minute drill with our good friend, Mark Cusack, CTO of Yellowbrick, a direct path to analytics. And there you go. Take it away, Mark.

Mark Cusack:

Thank you very much, Eric. And hi everybody. What I’d like to talk to you about today over the next couple of slides very quickly is about some of the innovations and trends that we are seeing in data warehousing. Yellowbrick is a data warehousing company. We provide a SQL MPP data warehouse, which we’ll get into maybe a little later on in the discussions.

But first I want to talk a little bit about how data warehousing architectures have evolved over time. Over the past 10, 20 years, a lot of vendors evolved to the picture on the left, where you see the massively parallel processing, shared-nothing picture of data warehousing. There you’ve got compute very closely bound to the storage, typically block-level storage. It provides great scale at a massive level, but in quite a constrained way, with that compute and storage very much fixed, so there’s a degree of inflexibility there.

Now over the last five or so years, a lot of vendors, particularly as they’ve deployed their data warehousing platforms in the cloud, have evolved to the model on the right-hand side, this so-called MPP, logically shared-everything idea, where the compute is separated from the storage, the data within your data warehouse is persisted within cheap and deep object storage, and then you introduce caching layers.

But you do have this separated compute and storage and you have elasticity, and there are these compromises that you always make when you go from the model on the left that’s very fixed and optimized to one that’s more flexible on the right. And I think we’ll get into that kind of discussion on what those compromises mean from a price and performance perspective. But most vendors in the space today have a model like that on the right-hand side, and Yellowbrick’s included in that.

Another evolution that I see in the industry is around how these things are priced, how they’re consumed, how customers spend on data warehousing. And the traditional picture is somewhat on the left-hand side where we bought a fixed capacity of data warehousing, compute, and storage. The green line there. So the cloud spend or the spend over time would be on the Y axis, time obviously on the X.

And in some cases, you’d be under-provisioned for what you needed at that point in time, and you’d go through some CapEx exercise to procure more data warehousing capacity and spend more there. And then over the last few years, we had the arrival of more consumption-based, on-demand pricing, which is that red line, which more closely and more accurately fits the way that you’re using your data warehousing and generating analytics benefit from that.

But things become a little bit less predictable in that kind of model. And I think what we’re seeing as an industry as a whole is how we can blend these two together. And so you get the benefits of discounted rates at fixed capacity when you know your workload’s running 24/7, but know you can go outside of that fixed box when you need to. And so I think a lot of companies, including Yellowbrick, are now evolving to a cloud pricing model around data warehousing that is blended in this capacity consumption way.
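Editor’s illustration (not part of the interview): the blended capacity-plus-consumption model described here can be sketched with simple arithmetic. The function name, usage figures, and rates below are all hypothetical, chosen only to show how a committed baseline at a discounted rate combines with on-demand overflow. They are not any vendor’s actual pricing.

```python
def blended_cost(hourly_usage, committed_units, committed_rate, on_demand_rate):
    """Total cost when a fixed commitment is paid every hour and any
    usage above the commitment is billed at the on-demand rate.

    All parameters are hypothetical illustration values.
    """
    total = 0.0
    for used in hourly_usage:
        overflow = max(0, used - committed_units)  # usage outside the "fixed box"
        total += committed_units * committed_rate + overflow * on_demand_rate
    return total


# Made-up workload: steady at 4 compute units, with one spike to 10.
usage = [4, 4, 4, 10, 4, 4]

# Fixed capacity sized for the peak, at a discounted committed rate.
fixed_for_peak = blended_cost(usage, committed_units=10, committed_rate=1.0, on_demand_rate=2.0)
# Pure consumption pricing: no commitment, everything at the on-demand rate.
pure_on_demand = blended_cost(usage, committed_units=0, committed_rate=1.0, on_demand_rate=2.0)
# Blended: commit to the steady baseline, burst on demand for the spike.
blended = blended_cost(usage, committed_units=4, committed_rate=1.0, on_demand_rate=2.0)

print(fixed_for_peak, pure_on_demand, blended)
```

With these invented numbers the blended plan comes out cheapest, because the steady load gets the committed rate and only the spike pays the on-demand premium. Different usage shapes or rates could favor a different plan.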

And finally, I just want to talk about where I see us going next and what’s really important for cloud data warehousing. I think there are three strands, three different evolutionary axes if you like, that are driving where data warehousing is going. And it’s partly driven by the economic climate we find ourselves in, where we still need to deliver fast answers, but we need to do it at a lower price point as well.

We’re seeing the emergence at the hardware level of more efficient CPUs. Moore’s law really no longer counts. It starts to become performance per watt being the new metric here. And as CPU technology advances, we’re seeing smaller and smaller incremental performance improvements. But we’re starting to see the arrival within Intel chipsets, for example, like Sapphire Rapids, the next generation of new accelerator technologies built into those chips that data warehousing vendors are starting to take advantage of outside of the general-purpose computing.

I think the second strong trend that I see that’s really important to help drive the cost of cloud data warehousing down is looking at where we optimize our software here. Most vendors obviously concentrate on their database software and optimize at that level. You’ve got to get smarter and more creative when you are optimizing for the cloud because, as you can imagine, every vendor is now running on the same cloud hardware, right?

So to really differentiate them from one another, you’ve got to look at more than the software, because most vendors use the same tricks from 50 years’ worth of relational data warehouse ideas for optimizing a query. So you’ve got to look at where else in that stack you can optimize. And that’s something we really focus on at Yellowbrick.

And then finally, I think, to help improve the amount of automation in the stack and reduce the DBA overheads, the operational expense, we’re now looking for a nicer, cleaner self-service user experience driven by containerization, by Kubernetes, and at how we can get a huge level of automation to help us manage these more complex data warehousing components.

And we’re starting to see real importance in a move away from a pure SaaS model of consuming data warehousing to looking at how you can reduce your costs by running it yourself. And the idea of introducing automation is critical here. Nobody wants to run it themselves if it’s complex, but containerization and Kubernetes are helping us drive down costs and allowing people to run it themselves effectively.

Eric Kavanagh:

Yeah, that’s fantastic. That is a fantastic five-minute drill. Look these folks up online: Yellowbrick Data. That’s Mark Cusack, the CTO, right here on DM Radio.

Speaker:

You’ve heard AM, you’ve heard FM. Now tune in to DM Radio, the world’s longest-running show about data. Each week, host Eric Kavanagh interviews the brightest minds in the world of information management. Want to be on a show? Send an email to info@DMradio.biz. Now here’s your host, Eric Kavanagh.

Eric Kavanagh:

All right, ladies and gentlemen, hello and welcome back once again to the longest-running show in the world about data. It’s called DM Radio. Yours truly, Eric Kavanagh, here. And it’s always a pleasure to talk to experts in the field, folks. And I have to say we have one more all-star cast with us today. We’ve got Mark Cusack of Yellowbrick Data, the CTO over there, and our good buddy Yves Mulkers from 7wData is on the line.

And we’re going to talk all about cloud-native data warehousing and a direct path to analytics. We’re going to talk about some really interesting hardware innovations and really architectural innovations that we’re seeing in the marketplace and it’s really quite fascinating. I’ve been following data warehousing for 22 years now since the earliest days of 2001. It was February. In fact, I started working for a consultancy and back then it was all about ETL.

And of course, Teradata was the 800-pound gorilla of the industry. There was no Snowflake, for example. They weren’t even a twinkle in someone’s eye yet back in 2001. And it was all very, very expensive and it took a whole lot of time and effort to put together, to engineer, to pull the data out of source systems to get it into a data warehouse. And of course, the whole point back then was you couldn’t query operational systems.

And so we couldn’t really get a strategic view of what’s happening in the business except just by thinking about things and running reports and so forth. And that’s not the same experience as really interacting with an analysis of the data to understand where is your business today. What’s happening in the marketplace, where are you going? Are these efforts you’re doing to market working, all these kinds of fun things.

That’s what we want from analytics. We want to understand what is the relationship between our activities in the business and the results that we see on the frontline in sales and operations and efficiencies and so forth. And things have really, really changed. So of course we went to the cloud and I was just chatting with Mark before the show about a revelation I had not too long ago, about this whole scale-out world of cloud computing and how we really sacrificed state at the altar of scale.

So we wanted to be able to scale out, but then state, which is where things stand right now in a transaction and a business process, we have to manage that somewhere else. But cloud is still fantastic stuff. However, guess what happens, whether it’s Amazon or Snowflake, or other cloud-based services, one of the jokes I’ve heard is that the largest item on your bill is the AWS item because you didn’t know it was going to happen.

All of a sudden it spikes and there’s a lot of effort being put into containing that cost, right? I mean, one thing we can say about AWS is they’ve done a great job of really building into their DNA the movement of bringing down prices. They understand that you want to be optimizing all the time in the background, and they do that for you, but they don’t do everything for you, of course. That’s why we have all sorts of players in the space.

And what I think is fascinating here is what Yellowbrick has done at an architectural level, but also at a hardware level. And maybe, Mark, I’ll bring you in to comment on this. We were talking before the show: when I first saw Yellowbrick, which is maybe four or five years ago, it was a real hardware play, and you had done some very interesting things with solid-state drives and sending the data straight to L1, L2 cache on the motherboard, and this is all interesting stuff. And then you had sort of an “aha moment.” Tell us a bit about that and why it matters.

Mark Cusack:

And this really speaks to the heart of the title here, Direct Path. So about five years ago, the problem at hand was what’s the quickest most direct way I can take data at rest from my SSDs and move it straight into the CPU? And one of the innovations that we did at Yellowbrick was really to devise a way of minimizing that path.

Traditionally, databases use CPU instructions to take data from the SSDs and push it into a buffer cache in DRAM; the CPU instructions then access the data in DRAM, do your SQL processing, and give you results at the end of the day. And so the revelation that we had at Yellowbrick was: what if we cut out that middle layer? What if data just went straight from the NVMe drives into the L3 and L2 caches in the CPUs and we cut out that DRAM step?

And that’s really what’s at the heart of the direct data path, because what that means is you can effectively address petabytes of NVMe data stored there as if it were an in-memory resource.

Eric Kavanagh:

That’s really interesting for lots of different reasons. One of which, and maybe I’ll bring Yves in to comment on this is because when you’re doing analysis, you need this very snappy experience. You need to be able to click on something and re-render something or take a different look at something and get deeper and deeper. And the longer that anything takes to get done, the more chance there is of just truncating that thought process, which is what you want. That’s what the business person is supposed to be doing, is connecting the dots in his or her head to understand what really makes sense and to dig into things.

And analytics are great, but you have to interpret the analytics and you have to find and explore and really optimize the analytics to get the business value from it. And to me, that’s a very compelling storyline that you have here about expediting that path so it happens in near real time. So you don’t risk losing that aha moment. Yves, what do you think?

Yves Mulkers:

Yeah, exactly. Like you say, in business it’s about the span of attention. You see something in your data and you make a reflection and you want to do the next test or the next assumption, you want to have it validated. And with queries running 20 seconds and so on, you lose that. You need a different way of interacting with your data, and that’s very crucial for getting those insights.

Definitely if you’re looking at websites and you want to look at the path that a visitor is going through to offer some product or so, that’s why you need that real-time insight in your data. But not only on the online data: like you said, Eric, when you are exploring data and you want to have the insights, there it’s very crucial to have that direct insight. And I think all the manipulations that come, like you explained in the beginning, in the traditional ways of moving your data into a next hub and into another hub and into a new transformation, that all takes time. And that’s time that we don’t want to spend anymore, or don’t have these days anymore.

Eric Kavanagh:

Yeah, exactly. You have to move quickly and the business has to move quickly. I mean, look at the market dynamics that we’re all dealing with right now. I saw it personally, I saw it last year frankly around July, that everyone in our industry just kind of took a deep breath and we’re all thinking, okay, what’s really going to happen here? Because you have all these extraneous circumstances like the war in Ukraine for example, affecting gas prices and so forth and supply chains that we already had COVID to deal with.

So we’re seeing just a tremendous amount of change, but that is really sort of a forcing function to bring about innovation and to force new ways of doing business. Mark, is that kind of what you found at Yellowbrick, that you took the handwriting on the wall and said, “Hey, we got to sharpen our pencils here and come up with something a bit more compelling”? Can you maybe walk us through that?

Mark Cusack:

Yeah, absolutely. And I think to Yves’ point, it’s not only about faster access to near real-time analytics from what we see amongst our customer base here, and I think as an industry sometimes we miss just how important price performance is as a primitive deliverer of business value.

And I’ll give you one example. We have a customer in the insurance space. They’re interested in getting real-time insights on the one part of their business. The other part of their business is, “I must run my claims ratio processing and it must be completed in a known set timeframe.” We have other customers in the hedge fund markets as well who have to do daily FINRA reporting on the trades that they’ve made there. If they don’t have timely answers – a timely summary that they can give to the regulators – they will get fined.

So I think on the one hand, there’s a “make you money by getting faster answers” aspect, but there’s also this kind of “save you money” aspect to faster analytics. And as I said, sometimes I think we forget just how important speed and pure performance are as we chase business value higher up in the stack, or in the value chain, if you will.

Eric Kavanagh:

Yeah, that’s… Go ahead.

Yves Mulkers:

You bring up a nice point, Mark. It’s the predictability of your queries, what you’re running, and having an exact timeframe where you can really trust that it is finished within that timeline. I think that’s very important as well.

And what I was thinking through is this: if you see how the type of data has changed, and when you’re cutting out the memory and can directly load into the CPU as you were explaining, that’s something which came along with the different kinds of data that we have these days. Before, it was really optimized either for analytics or for operational types of data.

Mark Cusack:

I think that’s absolutely true. And going back to our history, we had an offering that was an appliance kind of thing where everything was locked down. You knew you wouldn’t have noisy neighbors, you knew the performance of every single component of the data warehouse platform there. Now think of what happens again when companies transition to the cloud. You’re working in AWS, you’re no longer in a noise-free environment. There are other AWS users using the object storage, the networking, potentially the same compute instances.

And so what we’ve had to do, as we’ve transitioned our software from an on-premises solution in a company’s own data center running on our hardware to pure software in the public cloud, is figure out how we extend that direct data path notion to kind of smooth away, eliminate a lot of that noise and improve the signal-to-noise ratio when you’re running analytics in one of these environments.

Eric Kavanagh:

That’s a really, really good point. And so one thing I think a lot of folks don’t realize is when you go to the cloud, well, think about what happened in the Hadoop world. So we had Hadoop for a number of years present itself as a viable alternative for analytics. My theory was always they were just trying to become the next Teradata and they thought they could use commodity hardware to do that. It turned out to be a bit kludgy, to say the least. MapReduce, which is the activator underneath, or the active element if you will, was designed to index the web and it does that very well, but it’s not designed to do the kind of analytics that many business people want to do. So that was a bit of a challenge. But then all of a sudden, what do we run into? The network.

So with your network, even when you’re on-prem, you’re pulling data around, and there’s all sorts of network behavior that you can monitor. Well, you can’t really do anything about the network at AWS or Microsoft or Google or all those guys. And what you’ve done that I think is very interesting is you’ve used this innovation of this direct path, this accelerator if you will, to sort of inject speed into the cloud, because you’re also able to circumvent some of the networking issues and some of the OS issues. Can you talk about that for a second?

Mark Cusack:

Yeah, absolutely. So if you extend the idea of minimizing the path data takes from persistent storage into the CPU, you’ve got to take more of a holistic approach in the cloud. You’ve also got to understand that in a multi-node parallel data warehouse, data moves between the processing nodes, and you often need to move very, very high volumes, and you don’t necessarily have the bandwidth and latency characteristics in cloud networking that you get in a controlled on-prem environment. And the same goes for introducing this additional degree of indirection by getting data out of object storage as well. And so one of the things that we did was really take a close look at how networking works in the various cloud providers and what we needed to do to reduce the processing overhead of moving data around. So we extend this direct data path out to networking.

Now if you look at most applications on the web, or in the cloud rather, they use standard TCP/IP networking, nothing unusual there. You have to start to look at how you use new ideas like Intel’s DPDK, and you need to look at new, more efficient network protocols that are bespoke and tailored to data warehousing operations, which is the kind of thing that we did. And at the same time, you look at the standard off-the-shelf libraries that cloud providers give developers to integrate their applications with S3 or Azure Blob Storage, and you find out you can make huge efficiencies and performance improvements in those as well if you do things in a little bit of a smarter way. But actually underpinning all of this, Eric, is the idea of the Linux kernel. This is the heart, the beating heart of any modern cloud data warehouse.

It is not meant to do data warehousing exclusively. It’s a general-purpose operating system. And the trap most vendors fall into today is really not optimizing at the operating system level. They optimize at the database software stack level above, but they’re all running on the same OS and the same hardware at the end of the day.

So what we did at Yellowbrick was totally rip that model up and run all of our operations in what’s called a user space bypass mode, where we basically cut out a lot of unnecessary Linux operations and overhead there. And so we find ourselves getting an order of magnitude better networking performance by writing our own networking drivers, another order of magnitude from writing our own NVMe device drivers, and adding network protocols that are giving us apex bandwidth improvement over what most other vendors can offer in there. All of these things are important. It’s critically important to get the foundation right and then worry about how you optimize the database stack, if that makes sense.

Eric Kavanagh:

That makes complete sense and that’s absolutely fascinating. So to explain to our audience here, Linux is the de facto operating system of the cloud. It’s the de facto operating system of most large enterprises these days and has been for quite some time. And if you get all the way down to the kernel and just think what does the OS do? The OS does everything. The OS is the middle person, if you will, between all the apps and the hardware and what you’re actually dealing with here.

So to optimize for the Linux kernel … and I took some notes here as you were talking, so what did you say? We run our own networking drivers. We rip that model up and run all our operations in a user space bypass mode. That’s cool, man. Yves, it’s like they set up a special off-ramp and on-ramp to get around all the traffic jams in the middle of the city, right? Real quick, go ahead.

Yves Mulkers:

Yeah, I was thinking, making the analogy back to the days of really embedded systems, where you really were programming on the chip and everything and then had your own operating systems to optimize all the hardware that was available, especially in production plants. So that’s the kind of thing that you did, but really targeted to the data crunching and the databases of the data warehouses, where we already saw with analytics that we had column store databases, and we had different ways of storing your data, not typically in CSV files, but in the Parquet or Avro format, which is optimized to do this type of analytical compute on the data.

So I think these are the things and the insights that we gained on just the data layer, which you pulled all the way through to the chipset, if not to the CPU, if I completely understand what you did in optimizing and bypassing some of the Linux and operating system functions, which are a kind of overhead in the real data crunching that you are doing or trying to do.
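Editor’s illustration (not part of the interview): the benefit of column-oriented storage for analytics that Yves mentions can be shown with a toy in-memory example. The table, column names, and helper functions below are invented for illustration. Real column stores and file formats add compression, encoding, and metadata that this sketch ignores.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 200.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(table):
    """Sum one field, but walk every full record to reach it."""
    return sum(record["amount"] for record in table)


def total_amount_column_store(table):
    """Sum one field by reading only that column's values."""
    return sum(table["amount"])


print(total_amount_row_store(rows))        # same answer either way
print(total_amount_column_store(columns))
```

Both functions return the same total. The point of the columnar layout is that an aggregate over one column does not have to touch the other columns at all, which matters when a table has many wide columns stored on disk.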

Mark Cusack:

Yeah, that’s spot – that’s actually spot on Yves. Yeah. Yeah.

Eric Kavanagh:

This is very, very clever and it takes time to do this folks. So the people at Yellowbrick, they saw this opportunity and what I love is that you took this innovation that you had come up with of solid state drives with the original architecture and then said, “Okay, everything is moving to the cloud. We need to get on that train and figure out some way to optimize what we’re doing for a cloud-native environment,” and cloud native, just to explain to the audience what that really means. It means something that can run in any cloud is really what that ideally means.

You can take it from one place and put it in another and there’s a good chance that it’s going to run. And that gets to the whole Kubernetes conversation. We’ve talked about this on the show many times. One of the absolute biggest innovations in my lifetime, I would say, out of Google, because they realized we need to abstract away this challenge, because we can’t always be worrying about what OS is over here and which OS is over there and what version are you on and all this stuff. And that’s just a nightmare for integration, especially when you’re in the cloud and you want everything to work together and everybody to play nice with each other. That’s the dream at least of cloud computing. Well folks, don’t touch that dial. We’ll be right back. We’re talking to Yves Mulkers of 7wData and Mark Cusack of Yellowbrick Data. Very, very cool stuff. All about cloud data warehousing. We’ll be right back. You are listening to DM Radio.

Mark Cusack:

So I just want to talk a little bit more about the direct path approach here and really what it means at the end of the day, excuse me, from a performance perspective. So as I kind of talked about, on the left-hand side here, most databases on the planet have this three-step path of accessing data when they’re running an SQL query: first pulling data off the disk via the CPU into a buffer cache in main memory, and then making memory calls and lookups against the data in the buffer cache. What we’ve done differently with this kind of Direct Data Accelerator idea is just shuffle data from the NVMe drive straight into the CPU, and then memory really gets relegated to dealing with intermediate steps within your query or spilling and spooling data. It’s not in the direct path, as it were.

And that does have tangible benefits directly when we compare how our customers have used Yellowbrick versus other alternatives on the market like Snowflake. And what we found is it really has benefits in two areas. There’s a raw performance benefit: your queries are typically five to 10 times faster. But also the concurrency scaling characteristics of Yellowbrick are better, which is really what the first graph shows you here: as Yellowbrick scales to five, 10, 20 users all hitting the system or running queries at the same time, what we find is our scaling characteristics are better.

The response times of customers’ queries are more even, more linear if you like. And we also find as well that the core part of the direct data path means that in the cloud the same kind of operations are being done, or the same kind of SLAs are being met, but with a fraction of the cloud hardware being used. So this is a real bottom-line difference to the bill you’re paying to AWS at the end of every month. If we can do in four nodes what might take 32 nodes with another cloud provider, or data warehouse provider rather, that ends up being a massive cost saving at the end of the day.
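Editor’s illustration (not part of the interview): the four-node versus 32-node comparison is a hypothetical Mark raises, and the effect on a monthly bill is simple arithmetic. The per-node hourly rate and the 730-hour month below are assumptions made up for this sketch, and it assumes both clusters use identically priced nodes running around the clock.

```python
HOURS_PER_MONTH = 730  # assumed average month length, for illustration only


def monthly_compute_cost(nodes, rate_per_node_hour, hours=HOURS_PER_MONTH):
    """Monthly compute bill for a cluster that runs continuously."""
    return nodes * rate_per_node_hour * hours


# Hypothetical rate; real cloud instance prices vary by type and region.
rate = 3.0

small_cluster = monthly_compute_cost(4, rate)
large_cluster = monthly_compute_cost(32, rate)

print(small_cluster, large_cluster, large_cluster / small_cluster)
```

Under these assumptions the bill scales directly with node count, so the same workload on one eighth of the nodes costs one eighth as much, whatever the actual hourly rate is.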

Eric Kavanagh:

Yeah, that’s a really big deal. And I thought to myself of an analogy I can use, I’ll probably do it when we go back to the radio show live, about what this really means. But that’s very clever that now you’re using main memory for the intermediary steps, and not for the sort of heavy lifting, which is going straight from the disk over to the CPU, right?

Mark Cusack:

Right. That’s exactly it. Yeah.

Eric Kavanagh:

Yeah, that’s interesting. And then of course, the cost-saving side, I’ll make sure to make a point about that too, because that’s very, very interesting, because the costs do just explode. If you’re not paying attention to what you’re doing, you’ll get a bill that’s twice what it was last month. You’re like, oh, geez, what am I going to do? And in times when there are layoffs everywhere, you don’t want to be jeopardizing that –

Mark Cusack:

And it’d be great to talk a little bit hopefully around the differences between offering data warehousing as a SaaS and how efficiencies are not being passed onto the end customer if there’s a temptation for a lot of SaaS vendors to keep that as margin as technology improves. Yep.

Speaker:

Welcome back to DM Radio. Here’s your host, Eric Kavanagh.

Eric Kavanagh:

All right, folks, back here on DM Radio. We’re talking all things analytics today. We’ve got Mark Cusack from Yellowbrick Data and Yves Mulkers of 7wData on the line. And we just took a little brief break there to take a look at this technology and it’s really quite fascinating.

And I’ll just kind of explain with an analogy here what these folks have done. So most people are familiar, of course, with storage, with your hard drive. Now we have solid-state storage or the little flash drives, so it’s not a spinning disc anymore, it’s just a solid piece of material, which is one reason why it’s so fast. Another reason, right? Doesn’t break as much, right? There’s no needle to kind of break and ruin the disc. But then how do you take advantage of this? So when you’re talking about data warehousing, there’s so much data that you’re trying to crunch, you’re trying to process, you have to pull it in, analyze it, run functions on it, et cetera.

Well, what they’ve done is create this direct path, this accelerator, from the storage to the compute, so they’re bypassing main memory to a large degree and leaving main memory open for intermediary steps, which is important.

So think about if you’re ever out and trying to remember someone’s phone number and you don’t have a piece of paper or anything to write it down, what do you do? You kind of put it like 4, 4, 3, 6, 6, 5, 2, 4, 4, 2, 1. And you’ll actually put the first number. One of the tricks is put the first number in a box and just remember that number and then focus on remembering the longer number. And then you kind of put them together. And that’s kind of what main memory is doing in this equation, right? Because it’s got intermediary steps, because you’re doing these massive calculations in the compute, but then it’s going back to main memory going, okay, now add this one. Now add that one. Is that a decent analogy, Mark?

Mark Cusack:

Yeah, it’s very good. I think this is exactly what we’re using it for as well. And it’s interesting, again, when you move to the cloud, what we’re doing is not only thinking about that path from SSD into CPUs, we’re also looking at having a direct path from the network devices themselves, the NICs, into the CPU as well. So again, we’ve got data flowing in across the network and across from storage, and we’re taking advantage of this direct path approach in both cases.

Eric Kavanagh:

Yeah, that’s really cool. And that’s going to lower your bill too, because on the other side of the equation, when you get up into the cloud, well, they’ve got systems for monitoring how much compute is being used, how much storage is being accessed, all that kind of fun stuff. And it tick, tick, tick, tick, it ticks up. We’re getting better and better as an industry about knowing when and how it ticks up and being able to track all that. Because just a couple of short years ago you didn’t have that capacity. And Yves, I’ll throw it over to you. That’s a really big deal because now you’re, again, not just speeding up the analytic process, but you’re actually taming that bill a little bit on the other side. What do you think? Go ahead, Yves.

Yves Mulkers:

Yeah, I mean, I hear about so many surprises with the monthly bills for cloud data warehouse services. One of the issues is designing these systems in the traditional way, where you think it’s always available, but you need to architect it in a different way. That’s one of the issues. On the other hand, it’s a matter of not being experienced with the new types of loads you’re running, and that increases the bills. You don’t control it, you don’t see what is happening. So a lot of observability is still lacking in the systems, so keeping tabs on your bill is very hard.

And especially if you think in the traditional way, you are crunching too much of your data with too many resources. So definitely there is a lot of optimization that you can do. And when I hear in our discussion with Yellowbrick, yeah, I’m still surprised that you can have a tenfold performance gain at a fraction of the hardware needs in that respect. So that’s very inspiring, to look at the product and go for that.

Eric Kavanagh:

Yeah, it’s very interesting stuff. But it kind of speaks to just being able to analyze the different component parts of the equation and look for ways to optimize. And there’s always a way to build a better mousetrap. I mean, the fact that you’ve gone all the way down to the Linux kernel and really taken a hard look, because as you said it’s an OS that is multipurpose. It’s for all the different things that are going on out there. And you took a really clever approach to just focus on how to use this one piece of functionality to get one job done and tell everything else to just be quiet. Now we’re going to focus on this right now, right? It’s like closing the door, putting your headphones on, saying, “All right, no one bother me. I got to get some data processing done.” Right.

Mark Cusack:

Yeah, I mean, I think building a better mousetrap is quite a nice analogy. You could take it a step further. The way I like to characterize what we are doing here is that it’s the difference between a piston-engine aircraft and a jet-engine aircraft. You have to take a totally different approach here. You can tinker around and try and optimize and get the best performance out of your piston engine, but you’re ultimately limited by that technology, and you’ve got to look elsewhere.

And I think as an industry, we are doing a disservice to a lot of customers out there, a lot of industries by offering and doing the same things across different vendors, the same kinds of optimizations, and not thinking outside the box. And what does that give you? At the end of the day, if you do performance characterizations across different cloud warehouse vendors, they may all be within 10 or 20% performance of each other because they’re all using the same hardware, all using the same cloud operating systems, et cetera, et cetera, all have built-in margins because they’re offering software as a service and have a managed services component that they’re passing on. So the price and performance of a lot of these vendors out there is very similar.

Eric Kavanagh:

That’s it. I mean, you bring up a really good point here in that everyone in the cloud is using pretty much the same setup. Most are in AWS. I mean, Azure is certainly coming along. Google is getting serious about this stuff, but AWS I think is still the lion in the cage, if you will. But if everyone’s using the same system, well then you have to find some other way to differentiate yourselves, and that’s exactly what you’re doing.

I’ve heard from various sources in the industry that some of the other big players in this business actually took the whole same exact code base and just put it up into the cloud. I was kind of surprised to learn that, but I guess it was probably a driver of time and wondering how long it would take to re-architect everything. And that’s not the case with you folks because you did re-architect specifically for the cloud, specifically for this particular use case. And I think that is a pretty key differentiator. What do you think, Mark?

Mark Cusack:

Yeah, I mean that’s fair. And I think it makes a lot of sense because if you look at the value add of a lot of the cloud service providers that might, say, put a Postgres service into their cloud, for example, the value they’re really providing is all the managed services around it: making it self-service, making it easy to manage, really doing away with the complexity that in the past was tied into having a very complex data warehouse that was capable of operating at petabyte scale and that needed a lot of care and feeding.

But the more we get into containerization and orchestration services like Kubernetes, we can start to take a lot of that management complexity away these days with these new architectures. So that’s really what we focused on. We’ve deliberately not gone down the line of offering Yellowbrick as a managed service within AWS, for example.

We want customers to run it themselves in their own cloud account, which has huge benefits in terms of owning their own data. They can get discounts on the infrastructure they use to run Yellowbrick because they’re consuming their own credits, for example, with AWS and Azure and so on and so forth. And we are not marking up AWS hardware and selling it back to a customer, which is basically what a SaaS vendor in data warehousing does.

And you could take this to the kind of nth degree and think, well, what motivation does a SaaS provider for data warehousing have to take new efficiencies in chipset capabilities or new instance types coming out in the cloud and pass those price-performance benefits on to the customer, when they’re eroding their margins by doing that? And I think that’s another reason why you see a lot of homogeneity in the performance and price of a lot of cloud vendors out there today. They’re trying to maintain those margins, and we think that’s crazy. We think you can dispense with all of that margin stacking and run it yourself with ease.

Eric Kavanagh:

Well, and too, let’s think about the jobs, right? Because what we’re seeing with the cloud, in general, is that a lot of the traditional IT-oriented jobs are changing: setting up systems, for example, monitoring systems, also optimizing, especially in the data warehousing world where you would have DBAs optimize with certain indices or optimize in various other ways manually. And these days, that’s a much different storyline. You don’t have to worry about that as much. It’s done dynamically for you. So the jobs themselves are changing. And I think that it’s not that those jobs necessarily go away, it’s that the people who are DBAs are probably going to be running some SaaS in the cloud now instead of doing their old job. They’ll just kind of pivot to something that is adjacent, which is leveraging what’s available today. What do you think, Mark?

Mark Cusack:

Yeah, I think that’s fair. I think if you look at the characteristics of DBAs today, they’re definitely veering in terms of skillset towards SRE, site reliability engineering, and cloud operations, which are becoming more important. Most vendors in the data warehousing space are automating a lot of the housekeeping tasks in data warehouses, collection of statistics, blah, blah, blah, a lot of these kinds of things that traditionally you’d have to manually orchestrate. And so as an industry, I think we’re doing well at that, but where I think we’ve failed in many cases is making it really easy for companies that don’t necessarily have a huge cloud ops team to be able to realize the savings of running that themselves. And yeah, sorry.

Eric Kavanagh:

Yeah, that’s a good point too. And also when I first started tracking data warehousing, it was for the Fortune 2000, and that’s about it. I mean, you didn’t go out and build a data warehouse if you were a small to mid-size company. It was just way, way too expensive. And those dynamics have all changed, as has the whole world of data. Right now we have just so much more data to deal with and unwieldy data too. Not just your traditional structured data that you would have in a relational data warehouse. There’s data flying off the shelves these days, and you want to be able to manage all that and to kind of leverage it. Yves, maybe I’ll throw this one over to you. There are so many more companies now interested in data warehousing and hats off to Snowflake for really energizing the market in that sense, right?

Because they made it easy. They went back to the basics. And as we discussed earlier, the industry went down this road, a very interesting road with Hadoop, and I think we learned a whole heck of a lot about parallel processing, about federated systems, about scale-out architectures, which we can now leverage in the world of data warehousing. Because I remember there was this whole argument of, oh, schema on read, we don’t have to do schema on write anymore. Everything’s going to be great. It’ll be thrown into a giant data lake and it’ll be fine. And that movie did not end very well. And so there’s a lot of efforts to remediate all that and still get some value from it. But Yves, I think there’s nowhere to go but up in this business in terms of leveraging data to get analysis to run your business better. The addressable market is so much bigger than it was 10 years ago, don’t you think?
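
To make the schema-on-read versus schema-on-write contrast concrete, here is a minimal Python sketch. The column names, the `SCHEMA` mapping, and the helper functions are all hypothetical and are not taken from any product discussed in the interview. It only illustrates where the validation happens: at load time or at query time.

```python
import json

# Hypothetical two-column schema, used only for illustration.
SCHEMA = {"customer_id": int, "amount": float}


def apply_schema(record):
    """Coerce a raw record to the schema; raises if a column is missing or malformed."""
    return {col: typ(record[col]) for col, typ in SCHEMA.items()}


def write_with_schema(table, record):
    """Schema on write: validate before storing, so bad records are rejected at load time."""
    table.append(apply_schema(record))


def read_with_schema(raw_lines):
    """Schema on read: raw text is stored untouched and the schema is applied per query."""
    for line in raw_lines:
        try:
            yield apply_schema(json.loads(line))
        except (KeyError, ValueError, TypeError):
            continue  # the bad record is only discovered (and skipped) at query time


raw_lines = [
    '{"customer_id": 1, "amount": "19.99"}',
    '{"customer_id": "oops"}',
]

warehouse = []
rejected = 0
for line in raw_lines:
    try:
        write_with_schema(warehouse, json.loads(line))
    except (KeyError, ValueError, TypeError):
        rejected += 1

queried = list(read_with_schema(raw_lines))
```

In the schema-on-write path the malformed record is counted as rejected when it is loaded. In the schema-on-read path it sits in storage until a query trips over it, which is one way a data lake can end up holding data nobody can use.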

Yves Mulkers:

Yeah, exactly. I mean, if you look at the number of solutions offered these days in the market, it’s gigantic. And you see there is a niche in every little part of what we need to do. You have companies focusing on logging. You have companies focusing on the number crunching, on observability, on privacy, on monitoring, on all these aspects. And I was thinking through where we had a discussion about DBAs. Back in the day, every company or bigger company had a DBA to run the Oracle system, to optimize it. These jobs, they move to the platform. So that means we have a shared knowledge of this expertise within the platforms. So if you are a small company, you already have a DBA at hand because they can do it for you. And that’s the shared knowledge. A disadvantage is the disconnect that you have.

If you see a performance issue, you don’t talk directly to the DBA anymore as such to optimize it. This is something where you were saying, we don’t offer Yellowbrick as a managed service, but we try to give it to the people so they can optimize across the complete stack. That means you need to have a bigger skillset in your workforce available to be able to execute, but it allows you to really go for maximum optimization of your complete stack.

And what you see now is that the cloud has become more complex in all the various layers that you have available. So back in the day we talked about DBAs, and I recently had a discussion. I said it’s the same as the platform engineers. No, it’s not exactly like that, because the platform engineers, they look as well at the hardware, really at the infrastructure level, and they look at the network on top of that, whereas the DBA was only looking at the database, how the data was stored, and optimally stored, and the query performance as such. So these responsibilities really have broadened over time with the cloud platforms, because you connect so many more services together, which have to talk together to get that working in an optimal way for best performance.

Mark Cusack:

I think that’s one of the problems we try to solve. We want to make this a Snowflake-like experience. And to go back to the point earlier, you’re right, they absolutely revolutionized the user experience around data warehousing, which had been pretty poor up until Snowflake kind of changed the game there. And you’re absolutely right, one of the problems is if you start to roll up your sleeves and try and stitch all this stuff together, you are stitching load balancing services together with IAM roles and EBS and S3, and you’re trying to knit all these together. And of course, that’s a disaster.

So our approach is to completely mask all of that detail away from the DBA. So our automated stack is responsible for stitching these services together and putting the observability framework in place so that you can monitor this thing and surface up issues and alerts, and provide effectively a Snowflake-like experience, but running in your own cloud account. That’s really our design decision and goal at the end of the day.

Eric Kavanagh:

Yeah, it’s a very, very good point. And it’s the point to be made, quite frankly, because what we’re seeing with the cloud is this standardization, this democratization of functionality, and that’s what people want. I think in the next segment we’ll get into a bit of how we’re really transitioning the roles and responsibilities of your team members from keeping the lights on, which is what IT has done traditionally: get the lights to work first of all, then get them to work well and keep them working well. That’s typically been what they do. Now, it’s a much different story when you’re in the cloud environment, but you do need the proper tools to be able to optimize what you have in the cloud. We’ll pick that up after the break. We’ll be right back. You are listening to DM Radio.

Mark Cusack:

Hey. Yeah. I want to pick up on this whole Kubernetes and containerization kind of revolution that’s really happened and how we’re seeing our customers start to take advantage of how we’ve containerized our data warehouse to provide a container, or a platform rather, for containerized analytics and analytic apps, which is really what this somewhat complicated set of arrows is really getting at here.

And if you kind of go from the left to the right, you’ve got your standard sources of data upstream, whether it’s from data lake open source file formats, or whether it’s legacy ETL processing from Informatica tasks or DataStage or so on and so forth. And obviously new real-time streams coming into your business that you want to take advantage of. And Yellowbrick’s architecture is completely containerized within a customer’s account. We hide the details of Kubernetes away from the administrators and the end users, but we provide scalable services for when data loads elastically expand.

We persist those within object storage, within our multi-node data warehouse instances. There are scalable compute clusters attached that we manage as part of this entire installation. We have things like load balancers. So when you’ve got a differing number of concurrent queries from the right-hand side hitting your system, we can distribute those queries to the least busy compute that you have available. So we can do some nice load balancing.

This is really important actually. And we are working with customers that may be starting really small in terms of data volume. They’re not enterprise level, petabyte level, but they’re undergoing enormous business growth. And what they’re looking for is a solution that will scale their data loading characteristics and their reporting characteristics as their business grows. And we’ve built an architecture to do that.

And then finally, on the right-hand side, the idea of supporting containerized data apps. And we are seeing a future where data warehouses aren’t being consumed by a few hundred business analysts within a company, but they’re being consumed by end users, providing real-time actionable analytics into the hands of consumers. So there are concepts like Bank of the Future that underpin this idea, and I think that’s where data warehousing will be important in the future.
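
The idea of distributing queries "to the least busy compute" can be sketched in a few lines of Python. This is only an illustration of least-busy routing in general. The `ComputeCluster` class, the `route_query` function, and the use of an active-query count as the measure of busyness are assumptions for the example, not a description of how Yellowbrick's load balancer is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComputeCluster:
    """Hypothetical compute cluster tracked by the number of queries it is running."""
    name: str
    active_queries: int = 0


def route_query(clusters: List[ComputeCluster]) -> ComputeCluster:
    """Assign the next query to the cluster with the fewest active queries."""
    if not clusters:
        raise ValueError("no compute clusters available")
    # min() returns the first cluster with the lowest load, so ties go to the earliest entry.
    target = min(clusters, key=lambda c: c.active_queries)
    target.active_queries += 1
    return target


clusters = [
    ComputeCluster("cluster-a", active_queries=3),
    ComputeCluster("cluster-b", active_queries=1),
    ComputeCluster("cluster-c", active_queries=2),
]

# Route four incoming queries and record where each one lands.
chosen = [route_query(clusters).name for _ in range(4)]
```

After the four queries are routed, all three hypothetical clusters carry a similar load. A real balancer would also need to decrement the count when a query finishes and would likely weigh more than a simple query count.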

Eric Kavanagh:

Yeah, that’s such an excellent point. And you have to have a scale-out architecture underneath to be able to pull that off. I mean, you’re not going to pull that off in the usual way with some on-prem solution. And I think that’s absolutely brilliant and just an excellent point because what happens is when you enable your consumers, the people all the way at the end of your pipeline, if you will, to play around with their data, to learn from their data, guess what? They’re learning things. They’re more engaged with you as a brand now. They’re going to stick with you. I mean I’ve watched, and I haven’t seen a whole lot with our vendors that provide us technologies along these lines, but you’re starting to see it in the banks and other places. And I remember when I first, I mean, I have a funny story, but I won’t do it for the live show here, but someday, I’ll tell you the story of when I was on the road in, like, 2001 or 2002, and I logged into my bank account to see how much money I had, because I never balanced my checkbook.

And the check is in there, I see a check number and it’s underlined. I’m like, oh, that check number is underlined. Usually you can click on that. What is that? I clicked on it and boom, there’s a scanned version of my check. I was like, oh, ba-da-bing, this is a small bank in Texas, all right. This wasn’t Chase or Wells Fargo, any of the big guys, it was a small regional bank in Texas that figured that out. And I was like, OM freaking G. That is a fantastic innovation and it just changed the game. All right, so hold on. We’ll be back here in just a second. Okay, good. 10 seconds. Standby.

Speaker:

Welcome back to DM Radio. Here’s your host, Eric Kavanagh.

Eric Kavanagh:

All right, folks, back here on DM Radio, talking to Mark Cusack, CTO of Yellowbrick Data, and also our good friend Yves Mulkers of 7wData, talking about cloud-native data warehousing, and there are multiple clouds. I always say I’m thankful, oddly enough, ironically to Microsoft for saving us from the monopoly of Amazon Web Services because if they hadn’t come along, I don’t know. And of course, Google is out there too, and you folks run on a couple different clouds, right? You’re on Microsoft and you’re on AWS, is that right?

Mark Cusack:

That’s right. Today with Google to come later on in the year, that’s absolutely right, Eric. Yeah.

Eric Kavanagh:

So you can be in any of these environments and there will be differences between them, whether it’s pricing or usability or whatever, and you want to have some elasticity. But in the break there, you just mentioned something that I thought was very interesting, this whole concept of extending analytics out not just to your partners, for example, but to your consumers, to your clients out there. And in the banking world, that gets very interesting when I can play around with my financial numbers and you’re starting to see a whole bunch of apps come around to help you manage your money and to look for these recurring expenses, for example, because with software as a service, it’s death by a thousand cuts, man.

You buy a bunch of stuff for 39 bucks a month, and before you know it, you’re spending some real money out there. But when you can empower end users to analyze their own data, that creates a fantastic user experience, they’re going to be more connected to your brand, and they’re not going to want to live without that functionality. I mean, I guess that’s a good way to tee off this last segment here, in that when you give someone analytic capability, you don’t ever want to take it back, because they’re going to be very unhappy about that. What do you think, Mark and Yves?

Mark Cusack:

Yeah, I mean, it’s funny. I was chatting to a CTO at a company the other day that provides identity verification services to banks, to the extent that when you want to create a new bank account, you can take a selfie on your iPhone, whatever, submit that, and then within seconds be authorized to open a bank account. And so they have an enormous amount of image processing, underpinned actually at the backend by data warehouses that are managing a lot of the metadata around this.

But that will effectively give you a government-certified authentication of your identity. And really, actually, this service really came out of COVID, of course, where no one could go to a branch or anything like that to open a bank account. So now we’re getting to the point of having these kinds of services making the ability to acquire new customers so much more seamless than it has been in the past.

Eric Kavanagh:

Yeah, that’s good stuff. And speaking of real-time connectivity, my web just switched to the backup there. Wow, that was interesting. I lost you for a second and came back. But Yves, I’ll throw it over to you again. The analytic experience is one that everyone, I think every human being intuitively gets it. Once they start playing around with something, they’re going to figure it out real quick and they’re not going to want to give it up because once you can see, browse around, and play around with your data and understand it, pivot the data, et cetera, you’re not going to want to go back to not being able to do that.

It’s almost like losing your cell phone or something and walking around without a cell phone. I’m like, what am I doing in this world? So I think that this is a real move in the right direction to go cloud native, to optimize the cost, to get the analytics in the hands of business people because they’re just going to have a whole heck of a lot of fun with it. And I swear you’re going to have an analytic culture, and that is the key to success in this world. Yves Mulkers, what do you think?

Yves Mulkers:

Yeah, exactly. Like you say, analytical culture, but it’s educating people in what you can do with the data and how you should do that. But if it’s so cumbersome, yeah, people tend to just take what they know. If you look at simple solutions, building a pivot table in Excel, which feels very natural to me because you just dream it, but a lot of people, they don’t know how you can analyze the data, what you should put together to get some kind of insight. So that’s the first step in leveraging that and helping people understand what you can do with your data, and giving that at a certain speed to give the insights. If you have to wait, if it’s too complex to put everything together, that’s very hard. Another thing, what you just mentioned, is sharing that data or a way of access to certain systems, which makes it easy to tap into your data.

I still look at my bank. I mean, it’s so hard to get the data out of my transactions and put that in any kind of system to get the insights. Some of the banks, they tried for a while to categorize my expenses and so on and so forth, but not in the way I want to follow up on my expenses. So that’s kind of, “Hey guys, this is so simple, why can’t you give that to me?” It’s like I’m talking to my accountant and I’m explaining all the time: if I’m looking just at my accounts, I know better how far I am in my financial situation compared to how you do your accounting and stuff like that. It’s really amazing. He says, “You can tap into the accounting system.” Yes, but it’s not the analytics I want to look at. I want to do the kind of predictions in a very sophisticated way, not by looking back a year and a half and then understanding that you were in trouble, that type of thing.

And the thing of sharing your data, I think in Belgium, they already did it in a very good way. We have a system which has been established by the government, which is a kind of identification system. It’s called itsme, and you can connect to the banks with your itsme. So it’s a standard way of authenticating and accessing any kind of data, and that builds trust as well in connecting the various systems together. So I think we’re definitely on a good way and a good path on bringing access to data in a very commercial way, and I mean making it available to everybody.

Eric Kavanagh:

Yeah. We’ve got a question from an audience member at our virtual studio audience here. Is this the same architecture, the true real-time data platform that has been discussed for so many years? Seems like Kubernetes containerization has helped or given you a leg up to rethink the design. I mean, there’s a lot of similarity, and in fact, Yves, you even mentioned this earlier in the show, the old days of embedded systems, and there are different ways of doing things. I mean, Kubernetes is the new foundation and containerization is the new way, it’s the way forward of building applications. And so to get into the weeds and understand how it is operating and how you can leverage that engine is really pretty important stuff, right, Mark?

Mark Cusack:

Well, yeah, and as a data warehouse platform vendor, we don’t want to focus our investments on trying to figure out how to scale these fundamental compute and storage components ourselves. And that’s what Kubernetes buys us. It allows us to focus on the direct path stuff, the database technology itself, and leave elastic scaling in response to new higher volume real-time streams coming in on the Kubernetes level to be handled there. It abstracts a lot of that detail so we can focus our investments at a data warehousing level.
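
As one illustration of what "leaving elastic scaling to Kubernetes" can look like, the Kubernetes Horizontal Pod Autoscaler computes a desired replica count as the ceiling of the current replica count times the ratio of the observed metric to its target. The Python sketch below reproduces that calculation. The interview does not say which scaling mechanism Yellowbrick relies on, so the choice of the autoscaler, the function name `desired_replicas`, the rows-per-second metric, and the replica bounds are all assumptions for the example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the Horizontal Pod Autoscaler scaling ratio, clamped to bounds.

    desired = ceil(current_replicas * current_metric / target_metric)
    The min/max bounds here are hypothetical defaults for the example.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# A stream triples in volume: 2 loaders each seeing 900 rows/s against a 300 rows/s target.
scale_up = desired_replicas(2, 900.0, 300.0)

# Traffic drops off: 4 loaders each seeing 100 rows/s against the same target.
scale_down = desired_replicas(4, 100.0, 300.0)

# A very large spike is capped by the configured maximum.
capped = desired_replicas(5, 3000.0, 300.0)
```

The real autoscaler adds a tolerance band and stabilization windows so that replica counts do not flap. The sketch leaves those out to keep the core ratio visible.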

Eric Kavanagh:

Yeah, that’s such a good point too. So you’re not reinventing the wheel. That’s the other nice thing about the cloud is that before the cloud, everyone was always reinventing wheels all day long and you weren’t sharing things. But the cloud, and really, and I should tip my hat to open source, of course, because Kubernetes is open source, and open source itself has done a tremendous job of leveling the playing field. So we can all sit on top of all that. And to your point, focus on what makes sense, and I’ll throw one last question over to Mark. One thing I’m realizing more and more every day is that we do have these different generations who are working in the professional world these days. You got the millennials, you got Gen Z, Gen X, and old guys like us on the call here. And the younger people are much better at just consuming new kinds of technology and new interfaces and getting it pretty quickly.

And a lot of the stuff they can get when it’s in a visual environment, you’re playing around with your data. That’s a pretty cool place to be as a younger person as opposed to the old days of systems design and setting up clusters and doing all that stuff. There are people who like to do that. There aren’t many people who like to do that. They do it for a lot of money because they’re really smart. But where I’m going with this is there’s a whole new generation of people coming into the business world right now who are going to get their hands in this stuff and they’re going to freaking love it. And I think that’s going to take this whole industry to another level. But what do you think, Mark?

Mark Cusack:

Well, yeah, they’re the developers in the business today, right? They’re the coders, they’re the data engineers, data scientists. They’re influencers today, but they’re going to be decision makers around platforms later on as their careers change. And so we are very much focused on enabling and talking about what we do to developers as well, and the kind of benefits you can get from running and developing applications against Yellowbrick, for example.

Eric Kavanagh:

Yeah. Good stuff. Final thoughts from you, Yves? One minute, 60 seconds.

Yves Mulkers:

Sixty seconds. I want to reflect back on the containerization and the Kubernetes, and definitely what I see in the data space. What has changed and been optimized in the last years, and with the cloud coming, is that you can now automate a lot of this infrastructure stuff. I mean, if you’re looking at Terraform to deploy your systems, that’s already a big relief. You tune it once, one server, one Kubernetes, one container, and then you can do it a thousandfold or 10,000-fold. So for scalability, that’s a big improvement compared to back in the days when you had to buy a new server, you had to install it, like you were explaining, Eric, really cumbersome. You had these standard operating procedures to follow to make sure that they were configured in the right way. I think this is really where software engineering is coming together with the data engineering. That’s really the sweet spot where we are at and really the way forward for the future.

Eric Kavanagh:

Yeah, this is good stuff. Well, folks, hop online, look up Yellowbrick Data, and we had a question come in from the audience last minute. Are you going to go to the Oracle cloud? So maybe Yellowbrick will take a gander at the Oracle cloud next after Google. Oracle is still kicking butt. With that, folks, we will bid you farewell. Send me an email at info@DMradio.biz. You have been listening to DM Radio.
