Blog posts

2024

Parquet Metadata: Deep Dive into Types

14 minute read

Published:

Introduction

Building on the previous blog post, which introduced the core concepts of Parquet metadata, this entry goes deeper into the metadata itself and explores the nuances of logical types. Join me as we unravel the inner workings of Parquet metadata and its impact on data management and analytics.

Parquet Metadata: An Intro to data serialisation

9 minute read

Published:

Introduction

Working with Parquet every day, I find it fascinating to trace how data travels from memory to byte arrays stored in HDFS block storage or in object stores such as ADLS or S3, accessed through a block-storage-style wrapper interface such as ABFS or S3A. To understand how Parquet serialises data, it’s essential to explore how data is stored on disk and how that affects performance and cost. Let’s walk through Parquet’s serialisation in detail, tracing its path from the simpler formats of JSON and XML to its definition with Thrift.

Dremel: Interactive Analysis of Web-Scale Datasets

12 minute read

Published:

Introduction

Parquet is one of the most important and impactful formats in recent data engineering history, so this blog post tries to understand how Parquet works at a very basic level. The Parquet format was largely influenced by the Dremel paper, as mentioned in its motivation statement. This blog post is designed to walk you through the key points of the paper using language that’s more approachable. It can be particularly useful if you’ve already read the paper and are looking for clarification on certain parts, or if you simply prefer the conversational tone of a blog over the formal language of an academic paper. However, I want to emphasize that the original paper is quite straightforward, and I recommend reviewing it either before or after reading this post.