Chapter 2: Data Models (Designing Data-Intensive Applications)

Jul 27, 2021

Here are my notes from the second chapter of the book — Designing Data-Intensive Applications by Martin Kleppmann.

The chapter focuses on the following main themes:

Different types of data models and the specific use cases in which each of them excels. The author compares relational, document, and graph data models in detail.
If data models are a way of representing data, what are the paradigms for querying them (declarative vs. imperative)? I will cover this in another post as Part II of this chapter.

Data models

Think of them as a way of representing information to facilitate communication among different groups of people and also as tools that offer abstractions (hiding complexity). See an illustration of the layered approach as an example.

A shared data model helps communication because people working at a given layer use the same vocabulary. When adjacent layers use different models, something has to translate between them; object-relational mapping frameworks (Hibernate, for example) do this between application objects and relational tables. The term impedance mismatch describes this disconnect between the objects in one layer and the model of the layer next to it.

What do we mean by abstractions? Think about how often you had to consider the underlying storage representation (bytes and so on) while creating a database. You do not, unless you are writing code at the database engine level. In other words, the data model you are working with (the relational model, say) abstracts away the complexity of the layers underneath it.

Source: DDIA Book, Page 27

Different data models

The author focuses on relational and document models in the first part of the chapter followed by a graph model towards the end.

Relational and document models

Let’s look at the differences between the relational and document models using an example. Here’s a sample scenario:

Source: (Left) Sample information to be represented. (Right) Represented using a relational schema (DDIA, Page 31)

Figure 3. Information from Figure 1 represented using a document schema (Source: DDIA, Page 31)
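To make the contrast concrete, here is a minimal Python sketch of one profile stored both ways. The names (`users`, `positions`, `profile_document`, `load_profile_relational`) and the sample values are made up for illustration; they are not taken from the book's figures.

```python
# Relational shape: one row per entity, linked by user_id (a foreign key).
users = [{"user_id": 1, "first_name": "Ada", "last_name": "Example"}]
positions = [
    {"id": 10, "user_id": 1, "job_title": "Engineer", "organization": "Acme"},
    {"id": 11, "user_id": 1, "job_title": "Manager", "organization": "Globex"},
]


def load_profile_relational(user_id):
    """Reassemble a profile by 'joining' the two tables on user_id."""
    user = next(u for u in users if u["user_id"] == user_id)
    jobs = [
        {"job_title": p["job_title"], "organization": p["organization"]}
        for p in positions
        if p["user_id"] == user_id
    ]
    return {**user, "positions": jobs}


# Document shape: the one-to-many data is nested inside a single record.
profile_document = {
    "user_id": 1,
    "first_name": "Ada",
    "last_name": "Example",
    "positions": [
        {"job_title": "Engineer", "organization": "Acme"},
        {"job_title": "Manager", "organization": "Globex"},
    ],
}
```

The document form can be read with a single lookup, while the relational form needs a join (done by hand here) to rebuild the same structure.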

Let’s talk about the pros and cons of one representation over the other. (Blue highlight: relational is better; green highlight: document wins.)

While the table above enumerates the differences between the two models, let’s go over a few scenarios to pause and reflect on our understanding of them. Think about which model you would pick in each of the following scenarios. There may not be a single correct answer, so focus on providing the rationale for whichever side you take.

Application code simplicity. What aspects would you consider here in choosing one model vs. the other?
The structure of your data is not clearly known at the time of saving it in the database.
Which model is better in terms of storage locality?
We talked about how document models are not recommended for storing many-to-many relationships. Even so, can you think of how you would store and query such information if a document model were used?
There is a concept in relational databases called normalization. It is about storing information in a way that reduces redundancy and ensures data integrity and consistency (across multiple tables and using primary and foreign keys to bind them). Can you articulate what each of these ideas means in this context of storing information?
Would you consider document databases normalized?
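As a starting point for the many-to-many question, one possible approach (a sketch, not the book's prescription) is to store references by ID inside each document and resolve them in application code. The collections and field names below are hypothetical.

```python
# Hypothetical collections: each person document refers to organizations
# by ID instead of embedding a copy of the organization's details.
organizations = {
    "org1": {"name": "Acme"},
    "org2": {"name": "Globex"},
}
people = {
    "p1": {"name": "Ada", "organization_ids": ["org1", "org2"]},
    "p2": {"name": "Lin", "organization_ids": ["org1"]},
}


def organizations_of(person_id):
    """Emulate a join: follow each stored ID with a separate lookup."""
    ids = people[person_id]["organization_ids"]
    return [organizations[oid]["name"] for oid in ids]


def people_in(org_id):
    """The reverse direction needs a scan over all person documents."""
    return sorted(
        p["name"] for p in people.values() if org_id in p["organization_ids"]
    )
```

Storing the ID rather than the name is a normalization choice: the name lives in one place, at the cost of extra lookups that a relational database would perform as a join.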

If you have any questions about the list above, please reach out by posting in the comments section.

Before we switch to graph data models, one last point about relational and document models.

Convergence of relational and document stores

It is interesting to see how, over time, the two families of data models have converged in certain respects. For example, RethinkDB (a document store) supports relational-style joins in its query language. Similarly, some relational databases such as PostgreSQL (since version 9.3) support JSON documents to a certain degree.

Graph Data Models

Graph models are a natural fit when many-to-many relationships are common in the data. Two types of graph models are covered in the chapter: property graphs and the RDF (triple-store) model.

Property graphs extend standard graphs by allowing vertices and edges to carry properties in the form of key-value pairs. This is in addition to their standard features, such as the start and end vertex of an edge. How does that help? You can represent different types of vertices in the same graph and record each vertex's type as a key-value property.
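A minimal Python sketch of this idea follows. The class and field names (`Vertex`, `Edge`, `tail`, `head`, `label`) and the sample data are illustrative assumptions, not the API of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vertex_id: str
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Edge:
    tail: str   # ID of the vertex where the edge starts
    head: str   # ID of the vertex where the edge ends
    label: str  # kind of relationship (assumed field for this sketch)
    properties: dict = field(default_factory=dict)  # key-value pairs


# Vertices of different types live in the same graph; the "type"
# property tells them apart.
vertices = {
    "lucy": Vertex("lucy", {"type": "person", "name": "Lucy"}),
    "idaho": Vertex("idaho", {"type": "location", "name": "Idaho"}),
    "usa": Vertex("usa", {"type": "location", "name": "United States"}),
}
edges = [
    Edge("lucy", "idaho", "born_in"),
    Edge("idaho", "usa", "within"),
]


def outgoing(vertex_id, label):
    """Return the vertices reached from vertex_id via edges with this label."""
    return [
        vertices[e.head]
        for e in edges
        if e.tail == vertex_id and e.label == label
    ]
```

For example, `outgoing("lucy", "born_in")` follows the single `born_in` edge and returns the Idaho vertex with its properties.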

RDF (Resource Description Framework): This is another graph-based representation. The difference between labeled property graphs and RDF is best summarized in one of the StackOverflow answers here. I am sharing a screenshot to highlight the specific response I am referring to.

Figure 4. Difference between the two graph representations: RDF and property graphs [source]
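As a rough illustration of the RDF side, the sketch below stores every fact as a (subject, predicate, object) triple and answers simple pattern queries. It is a simplification: real RDF identifies resources with URIs, and the `match` helper and sample data are hypothetical.

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("lucy", "type", "Person"),
    ("lucy", "name", "Lucy"),
    ("lucy", "bornIn", "idaho"),
    ("idaho", "type", "Location"),
    ("idaho", "name", "Idaho"),
]


def match(subject=None, predicate=None, obj=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]
```

Here a property (`name`) and a relationship (`bornIn`) are both just triples, whereas a property graph attaches properties to the vertex or edge itself.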

This concludes the main takeaways I had on data models from this chapter. I will cover querying paradigms (declarative, imperative, MapReduce) in the next post.