Hi there

I am starting a blog series sharing my notes from the highly popular book Designing Data-Intensive Applications by Martin Kleppmann. The motivation for picking this book came from a book club channel that started in May 2021 as part of a non-profit organization that helps graduates land interviews. It is a community of professionals and students helping each other sharpen their computer science fundamentals in algorithms and data structures, and I am a member of the organization.

In this blog series, I plan to share my notes and learnings from the…


Spark SQL is an Apache Spark module for structured data processing. Unlike the basic Spark RDD API, its interfaces give Spark additional information about the structure of the data and the computation being performed. Internally, Spark SQL uses this information in its Catalyst optimizer to improve data processing performance.

The main abstraction in the Spark SQL API is the DataFrame. In Spark, a DataFrame is a distributed collection of data organized in rows that share the same schema. Conceptually it is equivalent to a table in a relational database. DataFrames in Spark have the same…


Here are my notes (Part-II) from the fifth chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on replication in distributed systems. In this post, we continue our focus on single-leader replication and look at the challenges inherent in it.

As a quick recap, Part-I of this post covered the single-leader replication architecture, which is used to achieve read scalability. All writes go to a single leader, but reads can be served by any of the replicas.

We also talked about synchronous vs. asynchronous replication, both of which can be used within a single-leader architecture, and the trade-offs of each.
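The recap above can be sketched in a few lines of Python. This is only a toy, in-memory illustration under stated assumptions: the class names `Replica` and `Leader`, the `pending` queue, and the `flush` method are hypothetical and do not come from the book or from any real database. Synchronous followers are updated before a write returns, while asynchronous followers only catch up when `flush` is called, so they can briefly serve stale reads.

```python
class Replica:
    """A follower that stores a copy of the data and serves reads."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader(Replica):
    """The single node that accepts writes (hypothetical sketch)."""

    def __init__(self, sync_followers, async_followers):
        super().__init__()
        self.sync_followers = sync_followers
        self.async_followers = async_followers
        self.pending = []  # changes not yet shipped to async followers

    def write(self, key, value):
        self.apply(key, value)
        # Synchronous followers are updated before the write returns.
        for follower in self.sync_followers:
            follower.apply(key, value)
        # Asynchronous followers receive the change later.
        self.pending.append((key, value))

    def flush(self):
        """Simulate the replication stream catching up."""
        for key, value in self.pending:
            for follower in self.async_followers:
                follower.apply(key, value)
        self.pending.clear()


sync_follower = Replica()
async_follower = Replica()
leader = Leader([sync_follower], [async_follower])

leader.write("user:1", "Alice")
# The async follower has not seen the write yet, so a read there is stale.
before_flush = (sync_follower.read("user:1"), async_follower.read("user:1"))
leader.flush()
after_flush = async_follower.read("user:1")
```

The window between `write` and `flush` stands in for replication lag, which is where the challenges of asynchronous followers come from.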

Picking up from there in this post — we understand that to achieve read scalability which is the main premise for single-leader…


Here are my notes from the fifth chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on replication in distributed systems.

It is a long chapter, so I will split it into multiple blog posts to keep each one at a reasonable size and easy to follow. Replication, as the name suggests, refers to keeping a copy of the same data on multiple machines. In this post, we’ll talk about:

  • Use cases where this concept is useful.
  • While replicating static data would be trivial, the challenge comes in handling changes to the replicated data. There exist different replication architectures (single-leader, multi-leader, leaderless). …


Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on data structures used for storing and querying databases.

  • This is Part-IV and the last post of chapter 3. In the previous three posts, we looked at an alternative storage mechanism to B-trees, known as LSM trees, which are optimized for write-heavy workloads.
  • In this post, we:
    • take a step back to draw another classification scheme for storing and retrieving data in databases: OLAP vs. OLTP
    • revisit LSM trees vs. B-trees with respect to this higher-level classification.

OLAP/OLTP

1. While we focused on comparing databases as LSM vs. B-tree models…


Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on data structures used for storing and querying databases.

  • This is Part-III of chapter 3, a continuation of the last article, which can be found here. In Part I, we started with a simple key-value storage and retrieval model, looked at its advantages and limitations, and learned about two concepts, compaction and merging, in such systems. Part II built upon this design and looked into strategies for optimizing reads in the append-only storage design. …


Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on data structures used for storing and querying databases.

  • This is Part-II of chapter 3, a continuation of the last article, which can be found here. In Part I, we started with a simple key-value storage and retrieval model, looked at its advantages and limitations, and learned about two concepts, compaction and merging, in such systems.
  • By the end of the post, I hope you have developed an intuition for:
  • optimizing query read time in data storage systems using hash indexes.
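To make the idea of a hash index over append-only storage concrete, here is a minimal sketch. It is an illustration under stated assumptions rather than the book's implementation: the class name `HashIndexedLog`, the `key,value` line format, and the use of an in-memory `io.BytesIO` buffer in place of a file on disk are all hypothetical choices, and keys are assumed to contain no commas or newlines.

```python
import io


class HashIndexedLog:
    """Append-only log with an in-memory hash index (hypothetical sketch)."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for an append-only file on disk
        self.index = {}          # key -> byte offset of the latest record

    def set(self, key, value):
        # Always append; older records for the same key are never overwritten.
        self.log.seek(0, io.SEEK_END)
        offset = self.log.tell()
        self.log.write(f"{key},{value}\n".encode("utf-8"))
        # Point the index at the newest record for this key.
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        # Jump straight to the record instead of scanning the whole log.
        self.log.seek(offset)
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value


db = HashIndexedLog()
db.set("a", "1")
db.set("b", "2")
db.set("a", "3")  # a newer record for "a" is appended and the index is updated
latest_a = db.get("a")
```

A read costs one dictionary lookup plus one seek, regardless of how long the log has grown. The stale record for `a` stays in the log until compaction removes it.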

The previous post did…


Here are my notes from the third chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on the data structures used for storing and querying databases.

  • Back in chapter 2, we also talked about different data models and query languages for databases. How is this chapter different? Well, chapter 2 is from the user’s point of view: the tools and constructs (tables for relational models, nodes and edges for graphs) that you need as an end user to query or write to a database. This chapter is about how the database engine actually accomplishes it.
  • By the end of the post, I hope you have developed an intuition for…

This post covers my notes from the section on query languages in the second chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on the following theme:

In Part I, we talked about different types of data models and the specific use cases in which each of them excels. The author compares relational, document, and graph data models in detail. This post addresses the following question: if data models are a way of representing data, what are the paradigms for querying them (declarative, imperative, and MapReduce)?
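As a rough illustration of the first two paradigms, the sketch below answers the same question twice: once with an imperative loop and once with a declarative SQL query run through Python's built-in `sqlite3` module. The `animals` data, the table layout, and the function name `imperative_fish` are made up for this example, and MapReduce is left out to keep it short.

```python
import sqlite3

# Hypothetical sample data: (name, family) pairs.
animals = [("lion", "cat"), ("shark", "fish"), ("tuna", "fish")]


def imperative_fish(rows):
    """Imperative style: spell out how to walk the data, step by step."""
    result = []
    for name, family in rows:
        if family == "fish":
            result.append(name)
    return result


# Declarative style: describe what is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)", animals)
declarative_fish = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM animals WHERE family = 'fish' ORDER BY name"
    )
]
conn.close()

imperative_result = imperative_fish(animals)
```

Both approaches return the same names. The SQL version says nothing about iteration order or access paths, which leaves the database free to choose an execution strategy.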

The main differences between them are:

  • In the declarative style, you are only specifying what all information…


Here are my notes from the second chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on the following main themes:

  • Different types of data models and the specific use cases in which each of them excels. The author compares relational, document, and graph data models in detail.
  • If data models are a way of representing data, what are the paradigms for querying them (declarative vs. imperative)? I will cover this in another post as Part II of this chapter.

Data models

Think of them as a way of representing information to facilitate communication among different groups of people and also as tools that offer abstractions (hiding…

Mahesh S Venkatachalam

Data Enthusiast, Write about Data Engineering, Architecting
