Fragmentation in the Data Stack and Cost Structure with Chris Riccomini
Chris Riccomini shares his thoughts on the fragmented data stack, dbt, and how to get data platform costs down.
I recently had the chance to interview Chris Riccomini. Chris currently manages Materialized View Capital. Before that, he spent over 15 years working on infrastructure at major tech companies such as PayPal, LinkedIn, and WePay, and was involved in open-source projects like Apache Samza and Apache Airflow.
JG: How do you see the relationship between data engineers and analytics engineers (I am curious how your position has shifted since you tweeted in Nov 2022)?
CR: Good question. I realize that these definitions are quite fluid.
To me, a data engineer is responsible for the E and L in the data pipeline—extracting and loading data. Analytics engineers are responsible for the T—transformation. A data engineer’s job is to build a data integration layer that reliably gets data into the systems where it’s needed. One of those systems is a data warehouse. An analytics engineer’s job is to transform the raw (loaded) data into something more usable for product managers, sales ops, business analysts, and others. Note that I did not mention deriving value and insight from the data; that’s the job of a business analyst.
In practice, I find that most people actually do more than one of these roles in an organization. For example, someone with an “analytics engineer” title might end up doing business analyst stuff, too: creating reports, data visualization, building business review dashboards, fielding questions from the executive team, and so on.
JG: We’ve spoken about how data warehouse spending became an issue at your last company. What happened, and how much work did it take to get it under control?
CR: The short answer is that we allowed the data warehouse to grow organically for several years with a focus on convenience instead of cost. At first, the data engineering team was just me. For many years, it was just two or three people. With the number of pipelines we had, we simply didn’t have time to optimize for cost.
As time went on, though, we grew—not just in team size but also in data and query size. We had more pipelines and more people running queries. Cost slowly crept up. It wasn’t overnight, but over a year or two. Then, ZIRP ended, and the CFO came knocking.
Fortunately for us, there was a lot of low hanging fruit. For example, we exposed views rather than tables to users. When a user queried a view, the view would de-duplicate rows to make sure that users only saw the latest version of each row in a table (we were using Debezium and CDC to load data, so a new table record was inserted for each update to a row). For frequently queried tables, it was cheaper to simply materialize those views rather than keep de-duplicating the entire table each time a user queried the data. Another more basic example was that we loaded all tables from each database into our data warehouse by default. Many of the tables were rarely used. We could remove such tables to save some money. There are many such examples, but you get the idea.
JG: What are your thoughts on dbt? What is it missing? How can it be better?
CR: dbt is a very useful tool. We built something internally at my previous job that looked similar, but was much more basic. Nearly every company I interacted with had some version of a Python script that could auto-generate tables and views, and nearly every company had some kind of data workflow orchestrator. The use case for something like dbt was self-evident. Moreover, there was real value in standardizing such a tool: it benefited from network effects as developers contributed integrations and packages, and it quietly snuck SDLC best practices into the workflow so analytics engineers could build better transformations.
As for how dbt could be improved, I think the folks at SQLMesh, sdf, Coalesce, and Dagster would be better positioned to answer that. What I can say is that—going back to your very first question—what I’ve never been able to understand is why we split the transform tool (dbt) out from the extract and load tools (workflow orchestrators, CDC, and such). These two things seem very much related (so much so, we keep them in a single acroynm—ELT). I think Dagster was early in recognizing this. Another signal of this oddity is that we now have dlt as well as dbt; they’re the same idea but one is for extract/load and the other is for transform. Unifying extract, load, and transform tools back together is something that needs to happen, though we’re in the infancy of this transition.
JG: What challenges have you experienced or seen data teams experience when optimizing or maintaining their data platform?
CR: Oh gosh, it’s really endless. Data quality issues, schema compatibility, data loss, cost, managing Python runtime environments, keeping pipelines within latency SLAs, difficulty getting access to observability data, dealing with an explosion of tools (the so-called modern data stack, or MDS), and so on. And that’s just on the data pipeline side.
Straddling the production and non-production domains presents some unique challenges, too. Many data teams have to act as a support team, security team, and ops team as well. They end up having to help users fix queries, are on-call when the platforms break, and often need to gatekeep access, monitor for sensitive information, and so on. While production engineers often have a single (usually standard) process to manage such things, data teams often get left behind for various reasons.
JG: In your opinion, why does dbt increase spending on data warehouses?
CR: dbt decreases the friction of running transformations in the data warehouse. This is a good thing, provided these new transformations are providing business value. However, given the way cloud data warehouses bill, this necessarily means you’re spending more money. dbt has attempted to help with this by providing incremental models, and the community has tried to do some transformation outside of the cloud data warehouse (in DuckDB, for example). Still, the default is that things run in an expensive way.
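As a rough illustration of the incremental idea, the sketch below (again Python with DuckDB) transforms only the rows that arrived since the last run instead of rebuilding the whole model, which is the principle behind dbt’s incremental models. The tables and columns (raw_payments, daily_revenue) are made up for the example.

```python
import duckdb

con = duckdb.connect()

# Toy "raw" table standing in for whatever the EL side of the pipeline loads.
con.execute("""
    CREATE TABLE raw_payments AS
    SELECT * FROM (VALUES
        (DATE '2024-01-01', 100.0),
        (DATE '2024-01-02',  40.0),
        (DATE '2024-01-02',  60.0)
    ) AS t(loaded_on, amount)
""")
con.execute("CREATE TABLE daily_revenue (txn_date DATE, revenue DOUBLE)")

# Only transform rows newer than what the target already holds, rather than
# rebuilding the whole model on every run.
high_water_mark = con.execute(
    "SELECT coalesce(max(txn_date), DATE '1970-01-01') FROM daily_revenue"
).fetchone()[0]

con.execute(
    """
    INSERT INTO daily_revenue
    SELECT loaded_on AS txn_date, SUM(amount) AS revenue
    FROM raw_payments
    WHERE loaded_on > ?
    GROUP BY 1
    """,
    [high_water_mark],
)

print(con.sql("SELECT * FROM daily_revenue ORDER BY txn_date"))
```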
JG: What are your thoughts on increasing the focus on optimizing data warehouse costs, especially in relation to dbt workloads?
CR: I think the experience I discussed above is relevant to this question. We had to choose whether to optimize our data architecture for simplicity or cost when rolling out our data pipeline. There was no simple, cost-optimal way to build the pipeline at the time.
The ideal scenario is one in which we can build both simple and cost-effective data pipelines. We need to lower the cost of lowering the cost, so to speak. To do so, I think we need optimal-by-default tooling. Incremental ETL pipelines, integrated EL and T tools that can figure out where transformations should occur (i.e. not just in the data warehouse), cost observability tools, and so on.
JG: How do you see tools like Artemis changing how data engineers work?
CR: I think Artemis has an opportunity to address the stuff I just mentioned: make it easy for me to build a simple, cost-optimal data pipeline. Leveraging AI to figure out how to clean up wasteful dbt pipelines is a great first step. I expect a lot more from you guys, though. As I said, I can imagine query monitoring and optimization with AI, query rewriting to move transformations out of the cloud data warehouse and into local (cheaper) engines like DuckDB, and a lot more.
JG: How do you think tools like Artemis will impact how data teams perform maintenance on their data stack?
CR: I see two approaches to solving the fragmentation in the modern data stack (MDS) space. The first is to provide a fully integrated, opinionated solution for managing your data warehouse or data lakehouse. The second is to provide AI-based solutions that help manage the MDS fragmentation and reduce toil for data engineers. The nice thing about the latter approach is that it meets organizations where they are. That’s what I like about Artemis: you don’t need to move off your data warehouse, you don’t need to move off of dbt, and you don’t need to adopt some vertically integrated solution. I’m excited to see more tools like this pop up. We’ve dug ourselves into a pretty big hole with the MDS tech stack, and tools like Artemis will play a role in helping us wrangle all this stuff.
JG: What advice would you give to organizations looking to implement cost optimization strategies for their data warehouses without sacrificing agility and innovation?
CR: Every organization has 7±2 tables that are really business critical and 7±2 tables that are really expensive. Most organizations know which tables are business critical, but it might not be obvious which tables (and query patterns) are most expensive. So job #1 is to set up some observability to figure out what is costing you the most.
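As a concrete illustration of that first step, here is a toy sketch in Python that aggregates per-query cost by table to surface the few tables worth attacking first. The log records and dollar figures are placeholders; in practice this data would come out of your warehouse’s query-history or billing views.

```python
from collections import defaultdict

# Imagine these rows were exported from your warehouse's query history.
query_log = [
    {"table": "events_raw",    "cost_usd": 310.5},
    {"table": "orders_latest", "cost_usd": 42.0},
    {"table": "orders_latest", "cost_usd": 58.0},
    {"table": "dim_customers", "cost_usd": 3.2},
]

# Aggregate spend per table to see where the money actually goes.
cost_by_table = defaultdict(float)
for record in query_log:
    cost_by_table[record["table"]] += record["cost_usd"]

# The handful of tables at the top of this list is where the 80-20 rule applies.
for table, cost in sorted(cost_by_table.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{table:20s} ${cost:9.2f}")
```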
Once you’ve got a clear picture of the business-critical and expensive tables, take the union of these two sets and apply the 80-20 rule: figure out how to optimize each of them just enough to reap 80% of the savings with 20% of the effort.
Next, you need to set up some monitoring to alert you when future tables and queries go astray. Once you’ve established this beachhead, go back and ask yourself what (if anything) needs to be done longer term to manage costs. Do you need to fundamentally re-architect your pipeline? Do you need to change certain tools or eliminate query patterns? You don’t need to be perfect with this stuff, so it’s quite possible that the playbook I just outlined will get you where you need to be. Don’t go overboard.
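And for the monitoring step Chris describes, a toy sketch along these lines would flag any table whose latest daily spend jumps well above its recent baseline. The numbers and the doubling threshold are purely illustrative.

```python
from statistics import mean

# table -> last seven days of spend (hypothetical numbers).
daily_cost = {
    "events_raw":    [300, 310, 295, 305, 320, 315, 890],
    "orders_latest": [50, 48, 52, 49, 51, 50, 53],
}

for table, costs in daily_cost.items():
    baseline = mean(costs[:-1])
    # Crude threshold: alert when yesterday's spend is more than double the
    # trailing average. Tune this to your own tolerance for noise.
    if costs[-1] > 2 * baseline:
        print(f"ALERT: {table} spent {costs[-1]} yesterday vs ~{baseline:.0f} baseline")
```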