Ken Muse

Understanding Modern Data Warehouse Storage


This is a post in the series Intro to MDW.

As we learned in the previous post about ingestion, the process of building a Modern Data Warehouse involves landing the data into an appropriate storage solution. Understanding the storage options is the next part of our MDW journey.

In the past, data storage might have meant using a special database designed to store the incoming data. For larger data sets or semi-structured data, disk arrays or Hadoop clusters were utilized. With Azure, we have a number of similar options we can explore. Generally speaking, the best way to store the data depends on both how it will be used and the format of the arriving data. A recurring theme in the MDW pattern is using the right tool for the right job. This is because each technology has a set of trade-offs that must be considered.

For example, an OLTP database attempts to balance read and write operations. As a result, it is optimized for small, fast inserts and updates. The read operations are optimized to support set-based aggregations and joins across normalized data. These systems can handle large numbers of simple transactions very effectively. However, when they receive a large number of write activities, the read operations can suffer. Similarly, heavy read operations can limit the speed at which writes can occur. As the number of write requests increases, the growing compute utilization can lower the overall system performance. Because the database is optimized for a specific type of workload, using it for other purposes can lead to suboptimal performance. Even with Azure Cosmos DB, this issue is present. If a container is optimized for writes, it is often sub-optimal for most reads. If the container is instead optimized for read performance, the resulting environment may impact the write performance.
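To make the Cosmos DB trade-off concrete, here is a minimal sketch using the azure-cosmos Python SDK. The account, database, container names, and partition keys are all illustrative assumptions, not a prescription: the point is simply that the partition key you choose skews the container toward either ingestion or querying.

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account details -- replace with your own endpoint and key.
client = CosmosClient(
    "https://example-account.documents.azure.com:443/",
    credential="<account-key>",
)
db = client.create_database_if_not_exists("telemetry")

# Write-optimized container: partitioned by a high-cardinality key (device id)
# so inserts spread evenly across physical partitions.
writes = db.create_container_if_not_exists(
    id="raw-events",
    partition_key=PartitionKey(path="/deviceId"),
)

# Read-optimized container: partitioned by the dimension most queries filter on,
# so reads stay within a single partition instead of fanning out.
reads = db.create_container_if_not_exists(
    id="events-by-customer",
    partition_key=PartitionKey(path="/customerId"),
)
```

A single container can't be shaped both ways at once, which is exactly the tension described above.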

How do we proceed without a database if we have a large number of write operations which must occur with low latency AND be readable afterward? One approach is to commit the data directly to Azure Storage. While blob storage cannot handle ACID transactions involving multiple files, it can provide atomic guarantees on single-file commits. In addition, a single account can handle up to 10 Gbps of data ingress, 50 Gbps of egress, and 20,000 requests per second. Microsoft documents these Azure Storage scalability targets for anyone interested. These are soft limits, so it’s possible to increase them through the Azure support team, and additional accounts can be used to further increase the available performance. To summarize, a single Storage account handles massive volumes of concurrent atomic reads and writes very well. At the same time, it cannot handle set-based operations or querying like a database. This often makes it an ideal choice for landing data and then reading the results without contention, assuming that you are continuously creating or appending new data.
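As a rough sketch of that landing pattern, the azure-storage-blob Python SDK can append incoming records to an append blob, where each append is an atomic, single-blob commit. The connection string, container, and blob paths below are assumptions for illustration, and the "landing" container is assumed to already exist.

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string and container name -- adjust for your account.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("landing")

# Each incoming batch lands in an append blob keyed by source and date.
blob = container.get_blob_client("device-42/2023-01-15.jsonl")
if not blob.exists():
    blob.create_append_blob()

# append_block is atomic for a single blob; readers never observe a
# partially written record, even under heavy concurrent ingestion.
blob.append_block(b'{"deviceId": "42", "temp": 21.4}\n')
```

Notice there is no query layer here; downstream consumers read the blobs as files, which is why this works best when data is only created or appended.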

There are other options, of course. Depending on the scale and type of data, a traditional replicated database can be a potential storage solution. The one limitation of that approach is that the data is frequently overwritten. Not only does that mean that data is lost, it can create read/write contention, so a primary use case is often as an ingestion point from applications. Another approach is to use Azure Cosmos DB. This globally distributed database supports a separation of reads and writes through its different consistency level models. In cases where we need to ingest the data one way and read it another (such as the example above), we can rely on the change feed to create a reactive model that automatically moves the data between containers, optimizing for both the read and write cases. In that case, we’re trading off some latency to optimize for both writes and queries. If you’re dealing with IoT devices on Azure and need to be able to perform near-real-time queries as data is ingested, then you may need to look at Azure Data Explorer, which is the basis of Azure Time Series Insights.
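A simplified sketch of that change-feed pattern, reusing the hypothetical containers from the earlier example, might look like the following. In practice you would track continuation tokens or use the change feed processor (for example, via the Azure Functions Cosmos DB trigger) rather than a one-off poll, so treat this as an outline of the flow, not a production implementation.

```python
from azure.cosmos import CosmosClient

client = CosmosClient(
    "https://example-account.documents.azure.com:443/",
    credential="<account-key>",
)
db = client.get_database_client("telemetry")
raw = db.get_container_client("raw-events")
by_customer = db.get_container_client("events-by-customer")

# Read documents from the change feed of the write-optimized container and
# upsert them into the read-optimized container. By default the feed starts
# from "now", so only changes made after this call are returned.
for item in raw.query_items_change_feed():
    by_customer.upsert_item(item)
```

The copy introduces a small delay between ingestion and queryability, which is the latency trade-off mentioned above.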

As you can see, storage has some hidden complexities. There are a lot of details to consider as you plan, but hopefully this post gave you a few insights. Ultimately, most MDW implementations will rely on landing the data into Azure Data Lake. From there, additional processing can take place. By trading off some latency at this stage, we gain the ability to use more powerful tools in other parts of the pipeline.
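For completeness, a minimal sketch of landing data into Azure Data Lake Storage Gen2 with the azure-storage-filedatalake SDK is shown below. The account, file system, and the source/date folder convention are assumptions chosen for illustration; the folder layout simply makes it easy for downstream jobs to pick up each day's data independently.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Hypothetical account and file system names -- adjust to your environment.
service = DataLakeServiceClient(
    account_url="https://exampleaccount.dfs.core.windows.net",
    credential="<account-key>",
)
fs = service.get_file_system_client("landing")

# Partition the landing zone by source and date so later pipeline stages
# can process a single day's folder at a time.
file = fs.get_file_client("sales/2023/01/15/orders-001.json")
file.upload_data(b'{"orderId": 1, "total": 99.95}\n', overwrite=True)
```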