Lack of access to good data is the biggest inhibitor of AI success, or at least that was my experience working for automated machine learning pioneer DataRobot. This blog presents a few thoughts on the role data engineering plays in setting a path for AI success. If Data Science is the Sexiest Job of the 21st Century, Data Engineers will be the unsung heroes.
Garbage in, garbage out:
Data-driven decision-making is only as good as the underlying data. Traditionally, when customers had problems, they called a Customer Service Rep who could navigate the systems, apply business policy, and get an answer. Today, self-service gives direct access to meaningful data without expert interpretation but within narrow constraints. Yesterday, I visited the customer portal of my insurance provider to check some basic information about my auto policy. I struggled to find the answers, so I struck up a relationship with their chatbot, which could not understand even simple sentences and was poles apart from what we’re now accustomed to with ChatGPT. The insurer’s chatbot is presumably AI-driven, so based on the context of our conversation, I assume the problem is not a shortcoming of Large Language Model (LLM) tuning but rather a lack of access to the operational data needed to offer the right response.
At Yellowbrick, we are seeing an increase in the number of people looking for a data engineering platform to drive AI initiatives. Data engineering collects, processes, and transforms raw data into a structured and usable format for AI analysis and decision-making. It involves designing and maintaining data pipelines, data domains, and infrastructure to ensure data is reliable, accessible, optimized for analytics, and, above all, secure and governed. Data engineers work at the intersection of data science and data architecture, playing a crucial role in enabling organizations to harness the power of data for insights and informed business decisions. Without a reliable data platform, AI will not move much beyond garbage-in, garbage-out.
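To make the collect, process, and transform steps concrete, here is a minimal Python sketch of a pipeline stage that turns raw rows into structured records and sets aside anything unusable. The field names (policy_id, holder, premium, renewal) and the PolicyRecord type are hypothetical, chosen only to echo the insurance example above; a real pipeline would add governance, lineage, and access controls on top.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PolicyRecord:
    """Structured, typed form of one raw row (hypothetical schema)."""
    policy_id: str
    holder: str
    premium: float
    renewal: date


def clean(raw: dict) -> Optional[PolicyRecord]:
    """Return a structured record, or None if the raw row is unusable."""
    try:
        policy_id = raw["policy_id"].strip()
        holder = raw["holder"].strip().title()
        premium = float(raw["premium"])
        renewal = date.fromisoformat(raw["renewal"].strip())
    except (KeyError, AttributeError, TypeError, ValueError):
        # Missing field, wrong type, or unparseable value: reject the row.
        return None
    if not policy_id or premium < 0:
        return None
    return PolicyRecord(policy_id, holder, premium, renewal)


def run_pipeline(rows: Iterable[dict]) -> Tuple[List[PolicyRecord], List[dict]]:
    """Split raw rows into clean records and rejected rows kept for review."""
    good: List[PolicyRecord] = []
    rejected: List[dict] = []
    for row in rows:
        record = clean(row)
        if record is None:
            rejected.append(row)
        else:
            good.append(record)
    return good, rejected
```

Keeping the rejected rows, rather than silently dropping them, is the design choice that matters here: it gives data engineers something to inspect when the AI downstream starts returning poor answers.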
Why not just let GenAI loose on your existing data warehouse?
You can and should. Business Intelligence (BI) is likely a killer first use case for GenAI because it is performed in a controlled environment, where GenAI plays an assisting role that enhances data analysis and decision-making. BI is based on a very structured and predictable view, whereas GenAI offers more flexibility in its research, including access to non-database sources such as emails, documents, log data, and external data, which we all use today. A key value of GenAI is finding new insights, so the key point of my blog is: don’t let the data you already have constrain you. Hence the emphasis on data engineering. For example, in healthcare, patient data is highly sensitive and is often stored internally. Lack of access to this data can limit the effectiveness of AI in diagnosis and treatment. Or in finance, access to transaction data is essential for accurate fraud detection. Without it, the system may struggle to identify unusual patterns. This granular detail may or may not be in the data warehouse.
Another consideration when using GenAI on an existing data warehouse is the additional workload it creates. The impact will be use-case specific, depending on query complexity, data volumes analyzed, concurrency, and so on. Workload management is critical to keep cloud consumption costs from skyrocketing out of control.
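One simple form of workload management is capping how many GenAI-generated queries run at once. The Python sketch below does this on the client side with a semaphore. QueryGate, fake_query, and run_batch are hypothetical names used for illustration only; this is not Yellowbrick's workload manager or any warehouse's API, and fake_query merely stands in for a real warehouse call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class QueryGate:
    """Admits at most max_concurrent queries at a time (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._running = 0
        self.peak = 0  # highest number of queries seen running together

    def run(self, query_fn, *args):
        # Block here until a slot is free, then run the query.
        with self._slots:
            with self._lock:
                self._running += 1
                self.peak = max(self.peak, self._running)
            try:
                return query_fn(*args)
            finally:
                with self._lock:
                    self._running -= 1


def fake_query(n: int) -> int:
    """Stand-in for a warehouse query; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return n * 2


def run_batch(gate: QueryGate, inputs, workers: int = 8):
    """Submit many queries from a thread pool; the gate limits concurrency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: gate.run(fake_query, n), inputs))
```

With a gate of QueryGate(2), eight worker threads can be submitting queries but no more than two reach the warehouse at the same moment, which puts a ceiling on the compute a burst of AI-driven requests can consume.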
GenAI roll-out will be experimental, iterative, and resource-hungry.
Because AI is still an emerging technology in business, its rollout will be iterative. Organizations will learn about AI use cases, LLM training, strengths, shortcomings, and cost/benefit, and adapt deployment accordingly. I suspect there will be some spectacular and embarrassing public screw-ups along the way, which will raise barriers for the risk-averse. Who can forget Amazon’s sexist hiring algorithm? This creates a challenge for data engineering, which will have to operate in a fast-moving yet always-secured universe, constantly tuning the scope and nature of data available to the AI. Most large organizations are not geared up for this. Their systems have been designed around rigid data structures supporting very repeatable processes, with ad hoc queries from Analysts who understand the data.
A technology stack supporting data engineering should be scalable, integrate data from diverse sources, handle data transformation, correlation, and extrapolation, store data efficiently, offer reliability and fault tolerance, prioritize data security, ensure high performance, provide robust monitoring and logging capabilities, seamlessly integrate with existing tools, be cost-effective, and be easy to manage. Above all, this technology should support a fail-fast approach, but within a tightly controlled and secure environment that minimizes the impact of failure.
Yellowbrick exhibits these capabilities to allow data engineers to ensure effective AI deployment. Its high concurrency, high-performance architecture, and patented acceleration technology allow data engineers to consolidate and manipulate vast amounts of data as easily as an Excel spreadsheet (but probably easier!). Yellowbrick also supports hybrid deployment with the same capability operating simultaneously on-premises, in the cloud, or both. Support for Kubernetes further simplifies the management of containerized applications across diverse environments.
With business use of GenAI still in its infancy, data sovereignty is back on the agenda. C-Suite executives are reluctant to let this technology loose on the cloud (discussed in Breaking Analysis: Cloud vs. On-Prem Showdown – The Future Battlefield for Generative AI Dominance). Data sovereignty is critical for AI roll-out because it ensures that data is handled in a manner that respects legal, privacy, and security considerations. Organizations that deal with AI and sensitive data must be aware of and adhere to data sovereignty regulations to maintain trust, minimize risks, and avoid legal complications. Unsurprisingly, an emerging best practice is to start with AI in the carefully controlled confines of an on-premises environment, then move to the cloud once problems have surfaced and been ironed out and the organization has become more confident with this new technology.
Summary
In the evolving landscape of AI adoption, organizations are embarking on iterative rollouts, constantly adapting AI deployment strategies based on use cases and learning experiences. This poses a challenge for data engineering, which is often ill-suited to rapid adaptation. An effective technology stack supporting data engineering in the AI context should be scalable, able to integrate diverse data sources, and proficient in data handling, storage, security, performance, monitoring, and seamless tool integration. Yellowbrick provides these capabilities. Data sovereignty, crucial in AI, ensures compliance with legal and privacy regulations. It is prompting a trend to start AI projects on-premises before considering cloud deployment, which allows organizations to maintain trust, mitigate risks, and navigate legal complexities while experimenting with AI.
I acknowledge the help of ChatGPT in writing this blog!