Why Does Data Quality Matter?

Tags: Thoughts
Author: Josh

The Obvious

Why does data quality matter?

It doesn't — until you want to learn anything.

Let’s be honest: data within organizations is really hard to use. Even when leaders say they are data-driven, data teams are often expected to deliver massive results with little input. There is too much data in too many formats, spread across too many tools, without structure or understanding.

But AI? Can’t it solve the world’s problems? Maybe. This past week alone, LinkedIn couldn’t decide whether LLMs are better used for text-to-SQL or for text-to-semantic-layer queries. Some teams believe that asking an LLM to write SQL directly against a table does not give it adequate context, which leads to hallucinations and incorrect answers. A semantic layer, by contrast, is filled with definitions and metrics that describe how a data team works and models its data, so the LLM has the context it needs to provide the right answer. On the other hand, the semantic layer hasn’t proven to be bulletproof either. The debate is still ongoing.

Diagram from Cube.Dev (a semantic layer startup)
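
To make the two sides of that debate concrete, here is a minimal Python sketch of the difference in what the LLM actually gets to see. Everything in it is hypothetical: the `orders` table, its columns, the `Metric` class, and both metric definitions are made up for illustration, and no real semantic layer product or LLM API is being called. It only builds the two prompts side by side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """One governed metric definition, as a semantic layer might store it."""
    name: str
    sql: str
    description: str


# Hypothetical definitions; in practice the data team would own these.
METRICS = [
    Metric("active_customers", "COUNT(DISTINCT customer_id)",
           "Customers with at least one paid order"),
    Metric("net_revenue", "SUM(amount) - SUM(refund_amount)",
           "Revenue after refunds"),
]


def text_to_sql_prompt(question: str, table: str, columns: list[str]) -> str:
    """Raw text-to-SQL: the model sees only a table name and column names."""
    return (
        f"Table {table} has columns: {', '.join(columns)}.\n"
        f"Write SQL that answers: {question}"
    )


def semantic_layer_prompt(question: str, metrics: list[Metric]) -> str:
    """Semantic-layer style: the model chooses from predefined metrics."""
    lines = [f"- {m.name}: {m.description} (defined as {m.sql})" for m in metrics]
    return (
        "Answer using only these metrics:\n"
        + "\n".join(lines)
        + f"\nQuestion: {question}"
    )


question = "What was net revenue last month?"
raw_prompt = text_to_sql_prompt(
    question, "orders", ["customer_id", "amount", "refund_amount", "created_at"]
)
layer_prompt = semantic_layer_prompt(question, METRICS)
```

In the first prompt the model has to guess what "net revenue" means from column names alone. In the second, the definition travels with the question, which is the context argument semantic layer advocates make. Neither prompt guarantees a correct answer.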

The data world is looking to integrate AI, but how? And, more importantly, how trustworthy are these new tools?

Trevor Fox shared this line, which sums up much of the current kerfuffle.

“LLMs require predictable contexts to provide reliable outputs. But... data exists on a spectrum from orderly to chaotic.”

Imagine you are trying to learn physics, but rather than giving you an excellent, clean, synthesized textbook, a teacher hands you every notebook, idea, study, and hypothesis written by every physicist, good and bad, and then asks you to find what is valuable and learn why. This is the way data teams are treated. We have access to data. Moving data isn’t the issue. The issue is knowing what to do with it and how to wrangle it into a usable data structure.

As a data engineer told me last week, the data structures within organizations today can’t scale as fast as we hope they will.

In years past, teams cared about data quality, but no one was responsible for actually ensuring data quality. AI changes this.

The VC firm Greylock had this to say in a post regarding Vertical AI.

“With reduced barriers to building AI applications on LLMs, data is arguably the most important currency in building a differentiated position.”

Marc Benioff, the CEO and founder of Salesforce, had this to say

These are powerful statements. Both internal and external use cases depend on high-quality data.

What does this mean?

AI is making data teams’ workloads explode. Everyone wants insights, wants to ask questions of their data, and wants to derive value. The reality is more complex. First, we need to synthesize the information. You can only begin to understand it once it is in the proper format.

At Artemis, we help teams ensure data quality through our three-layer platform. Each layer has a simple job.

  1. The data layer brings structure to your current data platform.
  2. The logic layer allows you to apply business logic and concepts to your datasets.
  3. The automation layer automates data workflows across your tools.

We will dive more into this in our next post!