Optimizing reads for a write-optimized data store (Chapter 3 Part 2 — Designing Data-Intensive Applications)

Mahesh S Venkatachalam
6 min read · Jul 27, 2021

Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on data structures used for storing and querying databases.

  • This is Part II of Chapter 3, a continuation of the last article, which can be found here. In Part I, we started with a simple data storage and retrieval model based on an append-only key-value design, covered its advantages and limitations, and learned about two concepts used in such systems: compaction and merging.
  • By the end of this post, I hope you have developed an intuition for optimizing query read times in such storage systems using hash indexes.

The previous post covered techniques and processes (compaction and merging) that reduce fragmentation and, to some extent, speed up reads. This post shares a few additional ideas.

Optimizing data reads

Before delving into the optimized solution for query read times, let us revisit the problem.

Recall the design we are trying to optimize reads for: an append-only log of key-value pairs, where every write simply appends a new record to the end of the file.

The problem with reads in this design is that, to find the most recent value for a particular key, we must scan the entire file from beginning to end and keep the last occurrence of that key. This is O(n) in the size of the data file.
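To make the problem concrete, here is a minimal Python sketch of such an append-only store (the function names, file format, and comma delimiter are illustrative choices, not prescribed by the book): every write appends a line, and every read scans the whole file.

```python
# Naive append-only key-value store: writes append, reads scan the entire file.
def db_set(path, key, value):
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")      # append-only: never overwrite in place

def db_get(path, key):
    result = None
    with open(path, encoding="utf-8") as f:
        for line in f:                   # O(n): visit every record...
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v               # ...and keep only the last occurrence
    return result
```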

One way to optimize reads goes back to a common technique databases use for making reads faster: indexes. Reads become faster because, instead of scanning the file, the index lets us seek directly to the right record. Consider maintaining a hash map as an index: for each key, the map stores the byte offset of the most recent value for that key in the data file.
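Here is a hedged sketch of that idea (class and method names are my own, and the record format follows the toy store above): the in-memory map points each key at the byte offset of its latest record, so a read is a single seek instead of a full scan.

```python
class HashIndexedLog:
    """Append-only log with an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}                      # key -> offset of the latest record

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.tell()                # where this record starts
            f.write(record)
        self.index[key] = offset             # point the key at its newest value

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                   # jump straight to the record
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.partition(",")[2]
```

A write still only appends, so write performance is unchanged; the cost is that every write also updates the in-memory map.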

Another design choice that helps search time is giving each segment its own in-memory hash index. To find the value for a key, we first check the most recent segment’s hash map; only if the key is not found there do we check the second-most-recent segment’s hash map, and so on (see the sketch after Figure 1).

Figure 1. Introducing the idea of an in-memory hash map as an index [1].
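A small sketch of that per-segment lookup order, assuming each segment is represented by its hash index plus its data file (this representation is my simplification): check the newest segment first, then fall back to older ones.

```python
def lookup(key, segments):
    """segments: list of (hash_index, data_file_path) pairs, newest first."""
    for index, path in segments:             # newest segment first
        offset = index.get(key)
        if offset is not None:               # found in this segment's index
            with open(path, "rb") as f:
                f.seek(offset)
                line = f.readline().decode("utf-8").rstrip("\n")
            return line.partition(",")[2]
    return None                              # key not present in any segment
```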

Now, while this certainly helps, it has a few caveats. First, we need to ensure the hash map fits in memory. Second, hash maps are great for point queries, but how do we support range queries? Third, we need to handle loss of the hash map in the event of a system crash. We look into each of these areas below.

How to control the size of the hash map?

  • If the workload accesses existing keys far more often than it creates new keys, the number of distinct keys, and therefore the size of the hash map, stays relatively contained.
  • Another idea is to maintain a sparse index. Instead of keeping an entry for every record on disk, we keep index entries for only a subset of the keys in the file.

Figure 2. Sparse index [1]

Think about how you would search for a key such as handiwork using the index in the figure above.

The answer lies in a trick: keep the data on disk sorted by key (an SSTable, or Sorted String Table). Notice how the data is stored in sorted order in the file segments in the figure above. Based on this, here are the steps you would follow to find handiwork in Figure 2.

  • Look for the entries in the index closest to the search key (handiwork): the closest key less than it and the closest key greater than it. In the figure, these are handbag and handsome. Because of the sorted order, you know that if the key handiwork exists, it must appear between the two.
  • Jump to the offset for handbag and scan forward until you either find the key or pass a key greater than the search key. In Figure 2, the scan goes handbag → handcuffs → handful → handicap → handiwork (found it!). A Python sketch of this lookup follows the list.
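The sketch below works on an in-memory list of (key, value) pairs standing in for the on-disk segment; the positions in the sparse index and the example values are made up for illustration.

```python
import bisect

def sparse_search(segment, sparse_index, search_key):
    """segment: list of (key, value) pairs sorted by key (stands in for the
    on-disk SSTable). sparse_index: sorted list of (key, position) entries."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, search_key) - 1   # closest indexed key <= search_key
    if i < 0:
        return None                                  # search_key sorts before the segment
    _, pos = sparse_index[i]
    for key, value in segment[pos:]:                 # linear scan from that position
        if key == search_key:
            return value                             # found it
        if key > search_key:
            break                                    # sorted order: it cannot appear later
    return None

# Example mirroring Figure 2: index entries only for handbag and handsome.
segment = [("handbag", 8786), ("handcuffs", 2729), ("handful", 44662),
           ("handicap", 70836), ("handiwork", 45521), ("handsome", 86478)]
sparse_index = [("handbag", 0), ("handsome", 5)]
print(sparse_search(segment, sparse_index, "handiwork"))   # -> 45521
```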

Now, what if you search for a key that has no entry in the index and is not present in the files on disk either?

Using the two-step approach above, you could end up traversing several or all segment files on disk before telling the user that the key does not exist. Can we make searches for missing keys cheaper?

And the answer is yes: think Bloom filters [2]. They can occasionally return false positives, but they never return false negatives, so they can cheaply confirm that a key is absent before any disk lookup is done.
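Here is a minimal Bloom filter sketch (the bit-array size, number of hash functions, and double-hashing scheme are illustrative choices, not from the book): a False answer means the key is definitely absent, so the disk lookup can be skipped.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive k bit positions from two halves of a SHA-256 digest (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
bf.add("handiwork")
print(bf.might_contain("handiwork"))   # True
print(bf.might_contain("zebra"))       # almost certainly False: skip the disk search
```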

How to handle loss of the hash map in the event of a crash?

Consider the scenario where there is data in the in-memory map that has not yet been persisted to disk. We can avoid losing it by maintaining a write-ahead log on disk: every change is appended to the log before it is applied in memory, and after a crash the log is replayed to restore the map. This way its contents are preserved.
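A hedged sketch of the write-ahead-log idea (the file layout and names are my own): log the change durably first, only then apply it in memory, and replay the log on restart.

```python
import os

class WriteAheadMap:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        if os.path.exists(wal_path):
            self._replay()                   # rebuild in-memory state after a crash

    def _replay(self):
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                key, _, value = line.rstrip("\n").partition(",")
                self.data[key] = value

    def put(self, key, value):
        with open(self.wal_path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")      # 1. durably log the change first
            f.flush()
            os.fsync(f.fileno())             # force it to disk before acknowledging
        self.data[key] = value               # 2. only then apply it in memory
```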

How to facilitate range queries if you maintain an in-memory data structure like a hash map?

Two potential approaches:

  • If you maintain SSTables on disk, first find the segments containing the two end values provided by the user, then return all data between them. Example: return all values for keys between handful and hangout. Find the segment containing the first key, handful, and call it segment 1. Find the segment containing the other key, hangout (possibly using a Bloom filter to skip segments that cannot contain it), and call it segment 2. Start at handful in segment 1 and return values until you reach hangout in segment 2. A sketch of this approach appears after this list.
  • Use a tree data structure instead of a hash map (with additional links between siblings if you want to make range scans easier). Balanced trees such as red-black or AVL trees keep lookups fast and return data in sorted order, even if keys were inserted in a random, unsorted manner.
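The sketch below shows the first approach on a single sorted segment, using bisect as a stand-in for the on-disk SSTable search (the segment contents are illustrative):

```python
import bisect

def range_query(segment, low, high):
    """segment: list of (key, value) pairs sorted by key.
    Returns all pairs with low <= key <= high."""
    keys = [k for k, _ in segment]
    start = bisect.bisect_left(keys, low)    # first key >= low
    end = bisect.bisect_right(keys, high)    # one past the last key <= high
    return segment[start:end]

segment = [("handbag", 1), ("handful", 2), ("handicap", 3),
           ("handiwork", 4), ("handsome", 5), ("hangout", 6), ("happy", 7)]
print(range_query(segment, "handful", "hangout"))
# [('handful', 2), ('handicap', 3), ('handiwork', 4), ('handsome', 5), ('hangout', 6)]
```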

This concludes our discussion of optimizing read access in append-only data storage. Before you leave, here are some points to check your understanding of the major concepts discussed in the two posts.

1. Which strategies play a role in improving read latency, and which affect write latency?

2. When should you use Bloom filters?

Any questions about these points or anything discussed above? Post in the comments. If you like the post, share it and subscribe to the blog to get notified when I publish a new post.

Mahesh SV

References:

  1. Kleppmann, Martin. Designing Data-Intensive Applications: https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
  2. What are Bloom filters? https://blog.medium.com/what-are-bloom-filters-1ec2a50c68ff
