Introducing LSM trees (Chapter 3 — Designing Data-Intensive Applications)

Mahesh S Venkatachalam
4 min read · Jul 27, 2021

Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on data structures used for storing and querying databases.

  • This is Part III of my notes on chapter 3, continuing the last article, which can be found here. In Part I, we started with a simple storage and retrieval model based on a basic key-value design, looked at its advantages and limitations, and learned about compaction and merging in such systems. Part II built on this design and looked into strategies for optimizing reads in the append-only storage model; the foundation of that discussion was maintaining data in sorted order by key.
  • By the end of the post, I hope you have developed an intuition for:
  • how to ensure that data is sorted by key when incoming writes can occur in any order.
  • LSM trees, an alternative to the traditional B-tree indexes used in many database engines, and their specific use cases.

Constructing and maintaining SSTables

Starting with the first agenda of this post: how do we maintain a structure sorted by key when incoming writes arrive in arbitrary order?

Idea:

  1. This is possible using self-balancing tree structures such as red-black trees or AVL trees.
  2. When a write comes in, add it to an in-memory sorted data structure (the memtable), implemented as a red-black or AVL tree.
  3. When this in-memory structure reaches a threshold size, write it out to disk as an SSTable file. This file becomes the most recent segment of the database.
  4. While the SSTable is being written out to disk, accept new incoming writes into a fresh memtable instance.
  5. When a read comes in — search the memtable first ==> most recent segment ==> next-older segment, and so on.
  6. Periodically, segment files undergo compaction and merging. A minimal sketch of this write path follows the list.
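
To make this concrete, here is a minimal sketch of the write path in Python. It is illustrative only: a production memtable would be a red-black tree, AVL tree, or skip list, whereas this sketch uses a sorted list, and the Memtable class, MEMTABLE_LIMIT threshold, and tab-separated file format are all invented for the example.

```python
import bisect

# Hypothetical, simplified memtable. Real engines use a red-black/AVL
# tree or a skip list; a sorted list keeps this sketch short.
MEMTABLE_LIMIT = 1024  # flush threshold (made up for illustration)

class Memtable:
    def __init__(self):
        self.keys = []     # keys kept in sorted order
        self.values = {}   # key -> latest value

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)  # insert at sorted position
        self.values[key] = value

    def is_full(self):
        return len(self.keys) >= MEMTABLE_LIMIT

    def flush_to_sstable(self, path):
        # Writing keys in sorted order makes the file an SSTable segment.
        with open(path, "w") as f:
            for key in self.keys:
                f.write(f"{key}\t{self.values[key]}\n")
```

On each write, the key goes into the current memtable; once is_full() returns True, the memtable is flushed to a new segment file and a fresh Memtable takes over while the flush proceeds.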

One problem with the approach above: if the database crashes, then the most recent writes to the current memtable instance will be lost.

Solution: maintain a log on disk to which every memtable write is also appended.

This log does not need to be sorted, since its only purpose is to preserve the memtable's contents and facilitate a restore in the event of a crash. A sketch of such a log follows.
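
Here is a minimal sketch of such a log, assuming the same hypothetical tab-separated record format as above; the WriteAheadLog name and its methods are made up for illustration.

```python
import os

# Hypothetical write-ahead log. Records are appended in arrival order
# (no sorting needed) so the memtable can be rebuilt after a crash.
class WriteAheadLog:
    def __init__(self, path):
        self.path = path
        self.f = open(path, "a")

    def append(self, key, value):
        self.f.write(f"{key}\t{value}\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # push the record to durable storage

    def replay(self, memtable):
        # After a restart, re-apply every logged write to an empty memtable.
        with open(self.path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                memtable.put(key, value)
```

Once the memtable has been safely flushed to an SSTable, the corresponding log file can be discarded.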

LSM (Log-Structured Merge) Trees

All through this chapter, we have been talking about an append-only storage model — an approach optimized for write-heavy workloads. Is that an LSM tree? The answer is no.

LSM trees sit between a pure append-only model and a traditional B-tree model.

Why not a pure append-only structure: that would imply unsorted data, and read latency would suffer in such a model.

Why not a B-tree-like index: that would imply update-in-place ==> random reads and writes.

The core idea of LSM trees is to make writes faster by following a design that favors sequential disk IO over random accesses. We do that by keeping data sorted by key both in memory and on disk (as discussed above) and by maintaining multiple index structures (the memtable plus the segment files on disk). A sketch of the resulting read path follows.
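
Assuming the hypothetical Memtable and segment-file format sketched earlier, a read might thread through the structures like this:

```python
# Hypothetical read path over the sketches above: memtable first, then
# segments from newest to oldest; the first match wins.
def get(key, memtable, segment_paths):
    if key in memtable.values:
        return memtable.values[key]
    for path in segment_paths:  # assumed ordered newest -> oldest
        with open(path) as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t", 1)
                if k == key:
                    return v
                if k > key:   # keys are sorted, so we can stop early
                    break
    return None  # key not found in any structure
```

A real engine avoids the linear scan by keeping a sparse in-memory index per SSTable (binary-searching within blocks) and often a Bloom filter to skip segments that cannot contain the key.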

Where do LSM trees score over B-trees?

In addition to their value proposition of supporting sequential IO, here are a few other areas where LSM trees shine.

  1. SSTables are immutable ==> simpler locking requirements. Only the memtable sees contention and requires strict locking primitives.
  2. Better compression ==> lower storage overhead. B-tree storage engines leave some disk space unused due to fragmentation (when a page is split, for example). LSM trees are not page-oriented and periodically re-write SSTables (compaction/merging) to remove fragmentation; a sketch of such a merge follows the list.
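
To illustrate the compaction/merge step, here is a sketch that merges sorted segment files into one, keeping only the newest value for each key. The compact function and the file format are assumptions carried over from the earlier sketches, not any engine's actual implementation.

```python
import heapq

# Hypothetical compaction: merge sorted segment files into one, keeping
# only the newest value for each key. segment_paths is assumed to be
# ordered newest -> oldest.
def compact(segment_paths, out_path):
    def entries(path, age):
        with open(path) as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t", 1)
                yield (k, age, v)   # smaller age = newer segment

    streams = (entries(p, age) for age, p in enumerate(segment_paths))
    with open(out_path, "w") as f:
        last_key = None
        for k, age, v in heapq.merge(*streams):
            if k != last_key:   # first occurrence is from the newest segment
                f.write(f"{k}\t{v}\n")
                last_key = k
```

Because every input file is already sorted, the merge is a single sequential pass, which is exactly the kind of IO pattern LSM trees are designed around.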

Downsides of LSM-trees

Things are not all rosy even for LSM trees. Here are some areas where LSM trees do not emerge as the stronger option:

  1. Reads are slower on LSM trees because several data structures (the memtable, plus multiple segment files at different levels of compaction) may have to be consulted. LSM trees are considered the better option for write-heavy workloads, while B-trees are better for read-heavy workloads.
  2. The compaction process can sometimes interfere with the performance of ongoing reads and writes. The impact on response time is especially noticeable at high percentiles.
  3. An LSM tree can have multiple copies of the same key in different segments. For databases that require strong transactional semantics, this poses a challenge.

This concludes our discussion of append-only and LSM storage models. Here’s a good link with additional references on LSM trees: http://www.benstopford.com/2015/02/14/log-structured-merge-trees.

In the next post, we will take a step back and compare LSM trees and B-trees from a higher-level perspective.

If you liked the post or have any feedback/question, do drop a note. Thanks!
