Dremel: Interactive Analysis of Web-Scale Datasets
Published:
Introduction
Parquet is one of the important and impactful format in recent data engineering history. So this blog tries to understand how does parquet works at a very basic level. Parquet format was largely influenced by Dremel Paper as mentioned in the motivation statement. This blog post is designed to walk you through the key points of the paper using language that’s more approachable. It can be particularly useful if you’ve already read the paper and are looking for clarification on certain parts, or if you simply prefer the conversational tone of a blog over the formal language of an academic paper. However, I want to emphasize that the original paper is quite straightforward, and I recommend reviewing it either before or after reading this post.
Quick Summary
For those pressed for time, here’s a brief summary:
Dremel stands out as Google’s internal tool for swift data analysis and exploration. Specifically crafted for interactive examination of read-only nested data, its blueprint draws inspiration from parallel database management systems and principles from web search.
The paper highlights three main contributions:
- Introducing a column-striped storage format tailored for nested data.
- Presenting a SQL-like language customized for dealing with nested data structures.
- Utilizing execution trees derived from web search systems to enhance database query performance.
In essence, the paper demonstrates the feasibility of constructing a system that facilitates the analysis of trillions of read-only records spanning petabytes of storage, all achieved at interactive speeds, typically clocking in below 10 seconds. If you’re eager to delve deeper into the intricacies of how this is achieved, read on.
Why was Dremel needed?
Let’s kick things off by delving into the motivation behind the development of Dremel. The introductory insight of the Dremel paper revolves around the fact that data prevalent in typical “web computing” environments often defies a relational nature. In this context, “non-relational” signifies that the data exchanged within these web computing systems doesn’t neatly fit into a relational model, such as a flat collection of tuples akin to what you’d store in a conventional Postgres database. Examples of such non-relational data encompass:
- Messages exchanged by distributed systems (example: Kafka),
- Structured documents like websites,
- Data structures used in everyday programming languages.
Crucially, each of these data categories typically involves nested structures. The paper further underscores that attempting to store such data using a relational model, involving the processes of de-structuring or “unnesting,” re-structuring, and splitting it into well-normalized database tables, becomes impractical at web scale.
Now, consider the prospect of…
- Storing this data in its original nested structure,
- Accessing the data without the need for re-structuring,
- Crafting queries on nested data without the complexities of joins.
Enter Dremel – a solution engineered precisely for achieving all of this seamlessly and rapidly, even when dealing with petabytes of data for interactive analysis.
How Dremel does it?
Arguably the most important pillar of the Dremel system is that its data model centers around nested data. What is nested data? In essence, and likely in practice, this just means data that can be described with the Protocol Buffers specification. Protocol Buffers is Google’s data serialization format, and comes with an interface description language (IDF) that allows defining nested data structures. Here is an example of a nested data structure, a.k.a. record, defined in the Protocol Buffer IDF:
message Document {
required int64 DocId;
optional group Links {
repeated int64 Backward;
repeated int64 Forward; }
repeated group Name {
repeated group Language {
required string Code;
optional string Country; }
optional string Url; }}
If you’re unfamiliar with Protocol Buffers records (or Kafka messages), it’s crucial to note the following:
Nested Data Structure: A Protocol Buffers record describes a nested data structure, wherein fields or sub-records (groups) can be classified as:
- Required: Must be present.
- Optional: Can be omitted.
- Repeated: Occur zero or more times.
Connection Between Optional and Repeated Fields: Both optional and repeated fields can be omitted. In this context, “omitted” means that a field need not have a value when serializing an object from a programming language into the final data format, which can then be stored or transmitted.
Field’s Path: A vital aspect of nomenclature associated with nested data records is a field’s path. In the provided record, each named entity represents a field. The path to a field is the dot-separated list of fields one would traverse to reach that specific field. Examples of field paths include:
- Document.Links.Forward
- Document.Name
- Document.DocId
In both the paper and this article, it’s common to omit the top-most field name. For instance, Name.Language.Country is synonymous with Document.Name.Language.Country. This understanding of field paths becomes particularly relevant in the later discussion on the lossless storage of such records.
Columnar Storage
While row storage is suitable for transactional workloads that involve frequent inserts, updates, and single-record retrievals, columnar storage excels in analytical environments where fast query performance and efficient data compression are crucial, making it the preferred choice for OLAP systems. Advantages of Columnar Storage for OLAP:
Compression: Columns often contain repeated or similar values, enabling better compression. This results in reduced storage requirements and improved I/O efficiency, particularly beneficial for large analytical datasets.
Analytical Queries: Analytical queries typically involve aggregations and computations on specific columns. Columnar storage allows for the retrieval of only the relevant columns, leading to faster query performance compared to row-based storage.
Parallel Processing: Analytical queries can be parallelized more effectively with columnar storage. Since each column is stored separately, different segments of the query can be processed in parallel, leveraging modern multi-core architectures.
Cache Efficiency: Analytical queries often involve reading a subset of columns repeatedly. Columnar storage enhances cache efficiency, as only the relevant columns are loaded into memory, reducing the need to fetch unnecessary data.
Lossless Columnar Representation
As the previous paragraph explained, Dremel uses column-oriented storage, which means leaf fields are stored contiguously in memory. Interestingly, this means that the values are “lifted” out of their logical structure. Take the following record:
Name
Language:
Code: 'en-us'
Language:
Code: 'en'
Name
Language:
Code: 'en-gb'
In a columnar storage model, values of the Name.Language.Code field will be stored in memory like this:
en-us
en
en-gb
The initial record featured values with distinct logical locations within its structured framework. However, the current storage format involves a completely flat arrangement of these values. The challenge arises during re-assembly – how do we discern that the initial two values correspond to the first Name group, while the third value pertains to the second Name group?
To address this, the Dremel paper introduces two supplemental pieces of information stored alongside each field value, aiming to achieve a lossless columnar representation. The first of these is termed the repetition level, and the second is known as the definition level.
Repetition Level
The repetition level serves as an indicator of which repeated field in a field’s path the value has encountered. In our ongoing record example, both Name and Name.Language are recurring groups. Therefore, for a field path like Name.Language.Code, the repetition value provides information on whether a specific value belongs to a repetition of the inner group Name.Language or the outer group Name. Let’s revisit our earlier record for clarity:
Name:
Language:
Code: 'en-us'
Language:
Code: 'en'
Name:
Language:
Code: 'en-gb'
The value “en” experienced the most recent repetition within the Language group. Conversely, for the “en-gb” value, the entire Name group was the most recently repeated. Consequently, the repetition level for the latter value is denoted as 1, as in the field path Name.Language.Code, it is the first field, Name, that underwent the latest repetition. In the case of the former example, “en” the repetition value is 2, signifying that the second field, Name.Language, was the most recently repeated.
For the very first value in the record, “en-us” the repetition level is 0. This is because the Document itself was the most recently repeated (even though it didn’t technically “repeat” since it’s the initial record in this example, but for explanatory purposes, it serves as the base case).
Now, armed with the ability to assign a repetition level to each stored value, we can revisit and modify our in-memory layout as follows:
'en-us' | 0
'en' | 2
'en-gb' | 1
Definition Level
Another number Dremel associates with each stored value is its definition level. This level indicates how may fields in a path that could be omitted, are actually present. In Name.Language.Code, both Name and Name.Language are repeated groups, which means they could be omitted. If Name is however present in a field’s path, the definition level is bumped to 1. If Name.Language is present too, the definition level is bumped to 2. This number becomes relevant for encoding of NULL values. In the following record:
Name:
Language:
Code: 'en-us'
Country: 'USA'
Language:
Code: 'fr'
Name:
Language:
Code: 'en-gb'
Country: 'USA'
The second Name.Language group has omitted the Name.Language.Country field. This information has to be stored in form of a NULL value in the Name.Language.Country column. What the definition level tells us, then, is how much of the record surrounding a field is omitted. In the above record, the Name and Name.Language group are present, but not the final Name.Language.Country field. The definition level of that NULL value would be 2, because it’s two ancestor fields are present in the record. In this record:
Name:
Url: 'http://www.example.com'
the Language group is omitted as a whole, so the definition level of the NULL value we store for the invisible Name.Language.Country field is 1. Zooming out further:
Document:
DocId: 1
Document:
DocId: 2
this record doesn’t even have a Name group, so we’d store a NULL value for the Name.Language.Country column with definition value 0, because none of the optional or repeated fields in the path Name.Language.Country, which are Name, Name.Language and Name.Language.Country, are present.
My understanding of the purpose of this value is not to aid in lossless storage. This, as far as I can tell, is accompalished by the repetition level alone. I believe the purpose of this number is to be able to store NULL values without storing an actual value. That is, any definition level smaller than the length of the path indicates that the value is NULL. The exact value then tells us how much of the path is omitted. Many real-world datasets are very sparse (have lots of NULL values), so efficient encoding of NULL values would have high priority in a design for a system like this.
Assembling Records
Dremel emits records as its query output. To do so, it must assemble records from the data columns it reads from memory. What is particularly important and interesting about this task is that Dremel must be able to read partial records. That is, records that follow the structure of the record schema (the Protocol Buffer specification) while omitting certain fields or entire groups. The paper describes that at Google many datasets are sparse, often with thousands of fields in the record schema, of which usually only a small subset is queried.
Most of the examples we’ve looked at so far were partial records, for example:
Name:
Language:
Code: 'en'
the only field present is Name.Language.Code. Fields like Name.Language.Country and entire groups such as Links are omitted. Nevertheless, the record follows the original schema, with Language nested under Name and Code under Language.
To do the actual record assembly, Dremel creates a Finite State Machine (FSM). Here is an example figure from the paper:
This FSM constructs the record by logically jumping from column to column and scanning values from memory until exhausted. Edge values indicate repetition levels.
Takeaway
In conclusion, we’ve explored Dremel, a platform tailored for “Interactive Analysis of Web-Scale Datasets.” What key insights can we draw from it? Here are some takeaways:
- Scan-based queries on read-only nested data can achieve interactive speeds, typically below 10 seconds.
- Columnar storage, previously successful in relational databases, can be effectively applied to nested data in Dremel without loosing any information
The paper is commendably written and offers valuable insights on various aspects. This article aimed to supplement the paper by providing additional details where necessary, beyond the scope of an academic document.