Building a Data Pipeline - Part 0

Covers general elements of data pipelines, what they do, and why they do it

Intro

When I started this blog, I wanted it to show building multiple different types of projects from a practical perspective. As an astute reader, I’m sure you’ve noticed that Iridium has been the only project we’ve been working on. Time to change that!

Welcome to the first post of Project Grimwhisker!

What?

It’s a data pipeline. I’m calling it Grimwhisker because I happen to own that domain name and don’t want to try to find and buy yet another one.

The rest of this article will explain what a data pipeline is and isn’t, and describe the project in more detail.

Data Pipelines

First, let’s talk about what a Data Pipeline is.

Buzzword

In one respect, “data pipeline” has become as diluted in meaning as “devops”. It doesn’t mean anything, or it means something different to every person, depending on how you look at it. Managers, PMs, TPMs, and all the rest like to throw these terms around without a common definition. So do engineers, for that matter.

Definition

For the purposes of this tutorial, we define a data pipeline as a system that does at least the following:

  1. Ingests data from somewhere

  2. Does some sort of processing on that data

  3. Provides the results

It may also:

  1. Store the data somewhere

  2. Provide SQL-like query functionality

If we put those in categories, we get:

  1. Ingest

  2. Processing

  3. Storage

  4. Query

Those are the four components we are going to talk about and build in this project. For the rest of this post, I’m going to give an overview of what each layer does, common technologies used, and other bits of useful info. Then we’ll build each component, in the order listed.
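
To make that shape concrete, here’s a tiny Go sketch of the four stages as interfaces. Every name in it is mine, invented for illustration; we’ll replace these toys with real implementations as the project progresses.

    // pipeline.go: a toy sketch of the four stages as Go interfaces.
    // All names here are illustrative, not from any real library.
    package pipeline

    import "context"

    // Event is a single unit of ingested data.
    type Event struct {
        Body []byte
    }

    // Ingester accepts raw bytes from clients and turns them into Events.
    type Ingester interface {
        Ingest(ctx context.Context, raw []byte) (Event, error)
    }

    // Processor validates or transforms an Event.
    type Processor interface {
        Process(ctx context.Context, e Event) (Event, error)
    }

    // Store persists Events for the long term.
    type Store interface {
        Write(ctx context.Context, e Event) error
    }

    // Querier lets consumers ask questions of the stored data.
    type Querier interface {
        Query(ctx context.Context, q string) ([]Event, error)
    }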

Overview

Behold my beautiful text art!

┌────────┐         ┌────────┐          ┌────────┐
│ User 1 │         │ User 2 │          │ User 3 │
└────────┘         └────────┘          └────────┘
     │                  │                   │
     │                  │                   │
     └──────────────────┼───────────────────┘
                        │
                        │
                        ▼
        ┌───────────────────────────────┐
        │                               │
        │         Ingest Layer          │
        │                               │
        └───────────────────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │                               │
        │       Processing Layer        │
        │                               │
        └───────────────────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │                               │
        │         Storage Layer         │
        │                               │
        └───────────────────────────────┘
                        │
                        ▼
        ┌───────────────────────────────┐
        │                               │
        │          Query Layer          │
        │                               │
        └───────────────────────────────┘
                        ▲
                        │
     ┌──────────────────┼───────────────────┐
     │                  │                   │
     │                  │                   │
     │                  │                   │
┌────────┐         ┌────────┐          ┌────────┐
│ User 4 │         │ User 5 │          │ User 6 │
└────────┘         └────────┘          └────────┘

This is a very high level overview of what we’ll be building. Two things to note:

  1. Users 1, 2, and 3 are sending data

  2. Users 4, 5, and 6 are querying data

Use-cases

Why would anyone want to build such a monstrosity, you might ask? There are lots of reasons. Some good, some not so good. I’ll list a few here:

  1. You have developed a mobile app and want to collect analytics data from it. In which country are most of my users? What part of my app do they use the most? Is my app crashing a lot?

  2. Ad targeting. The more data companies can collect about you, the better chance they have to show you an ad you’ll click on.

  3. You’ve tagged some form of wildlife and want to collect data about where it travels.

  4. The government wants to know who you talk to and where you go.

Note
I am not arguing the ethics of these; just listing use-cases. The ethics are an entirely different discussion. =)

So let’s talk about the components!

Ingest

This is the outermost ring of your pipeline. It has one job, and one job only: accept data as fast as it can, and get it written to durable storage.

Important
Durable storage means storage that will survive a reasonable amount of failure of components. The minimum is getting the data from RAM to an HDD or SSD. Some groups are ok with replicating data in RAM multiple times. We’ll talk more about this later.

This layer has to be dead simple. Why? Because if it goes down, you lose data. If it can’t keep up, you lose data. The clients might have retry logic built-in, but they might not, and you can’t rely on that.

Critical Requirements for Ingest

  1. Stateless. It must be able to scale up and down in response to the amount of traffic.

  2. Parallelizable. Any application that cannot take advantage of multiple cores is not a good fit for ingest.

  3. Simple. You do not want to be debugging some complicated system at 3AM that is dropping data.

Data Formats

This is where it can get messy. If you have the advantage of greenfield for both backend and frontend, you can come up with a schema and encoding. One team might decide on JSON over regular HTTP, because it is easy and highly compatible. Another team may want to be on the cutting edge and accept protobuf-encoded data via gRPC. You can even do both!

For our ingest layer, we’ll be using JSON over HTTP to start.
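
To give you a feel for it, here’s a minimal sketch of a JSON-over-HTTP ingest handler in Go. The event fields are placeholders I made up; we’ll design the real schema in the next post. Note how little it does: decode, hand off, return.

    // A minimal JSON-over-HTTP ingest sketch. The event shape is a
    // placeholder; the real schema comes later.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // event is a hypothetical payload, e.g. {"app_id":"abc","name":"tap","ts":1690000000}
    type event struct {
        AppID string `json:"app_id"`
        Name  string `json:"name"`
        TS    int64  `json:"ts"`
    }

    func ingestHandler(w http.ResponseWriter, r *http.Request) {
        if r.Method != http.MethodPost {
            http.Error(w, "POST only", http.StatusMethodNotAllowed)
            return
        }
        var e event
        if err := json.NewDecoder(r.Body).Decode(&e); err != nil {
            http.Error(w, "bad JSON", http.StatusBadRequest)
            return
        }
        // Hand off to durable storage as fast as possible (next section).
        // An enqueue(e) call would go here; the handler itself keeps no
        // state, so we can run as many copies of it as traffic demands.
        w.WriteHeader(http.StatusAccepted)
    }

    func main() {
        http.HandleFunc("/ingest", ingestHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }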

Durable Storage

Let’s say we write a Go application to handle ingest. When it gets a request, that data will live in the RAM of the server that received it. But what then? If we try to write it to local disk, we’ll probably overwhelm it. And then we’d have to figure out how to get that data from that server to some other storage, and what if the server crashes before we do that, and so on.

For now, I’ll just say that we will send it on to another system designed to write data to disk.
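
We haven’t picked that system yet, but to make the idea concrete: if it were Kafka, the hand-off from the handler above might look something like this sketch, using the third-party segmentio/kafka-go client. Treat every name here as a placeholder, not a project decision.

    // Hypothetical hand-off to Kafka as the durable write-ahead system,
    // via the third-party segmentio/kafka-go client. Names are placeholders.
    package ingest

    import (
        "context"

        "github.com/segmentio/kafka-go"
    )

    var writer = &kafka.Writer{
        Addr:         kafka.TCP("localhost:9092"), // placeholder broker address
        Topic:        "raw-events",                // placeholder topic name
        Balancer:     &kafka.LeastBytes{},
        RequiredAcks: kafka.RequireAll, // don't ack until replicas have the data
    }

    // enqueue hands the raw payload to Kafka. Once this returns nil, the
    // data has left this server's RAM and survives a crash of the ingest box.
    func enqueue(ctx context.Context, payload []byte) error {
        return writer.WriteMessages(ctx, kafka.Message{Value: payload})
    }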

Processing Layer

This layer is where you can do whatever arbitrary processing you want. Examples include the following, with a quick code sketch after the list:

  1. Validate that the data sent matches the schema

  2. Validate that a field is the type it is supposed to be

  3. Do a lookup on the incoming IP address and inject city, state, and country
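
Here’s a rough Go sketch of a processing step doing all three. The event fields and the GeoIP lookup are stand-ins I invented for illustration; in real life you’d use an actual GeoIP database.

    // A hypothetical processing step: validate, then enrich with location.
    // The GeoIP lookup is a stand-in, not a real API.
    package processing

    import (
        "errors"
        "net"
    )

    type event struct {
        AppID   string `json:"app_id"`
        Name    string `json:"name"`
        TS      int64  `json:"ts"`
        IP      string `json:"ip"`
        City    string `json:"city,omitempty"`
        Region  string `json:"region,omitempty"`
        Country string `json:"country,omitempty"`
    }

    func process(e event) (event, error) {
        // 1. Schema validation: required fields must be present.
        if e.AppID == "" || e.Name == "" {
            return e, errors.New("missing required field")
        }
        // 2. Type validation: ts must look like a unix timestamp.
        if e.TS <= 0 {
            return e, errors.New("bad timestamp")
        }
        // 3. Enrichment: inject city, state, and country from the incoming IP.
        if ip := net.ParseIP(e.IP); ip != nil {
            e.City, e.Region, e.Country = lookupGeo(ip)
        }
        return e, nil
    }

    // lookupGeo stands in for a lookup against a real GeoIP database.
    func lookupGeo(ip net.IP) (city, region, country string) {
        return "", "", ""
    }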

Raw Data

Processing the data often means changing it. You may add a field, alter a value, or remove a field. Once you start doing that, you lose the pristine copy of the data. For example, if you run some process that transforms incoming data for a month, and then find a bug in that algorithm, you have lost the option of simply re-running the fixed algorithm on the pristine data. Instead, you have to figure out a way to re-transform the already-transformed data.

This can get even more complicated when multiple groups want access to your data, and want to run their own transformations. What if they erase the data you needed? If you have a pristine copy, you can just let them each operate on their copy.

The downside to this, of course, is cost. In our design, we’ll keep a pristine copy (because I’m paranoid), but some teams may choose not to. It is just a question of cost vs. benefit.

Storage Layer

This is durable storage that can hold months’ or years’ worth of data, often compressed. S3 is often used for this purpose. It differs from the durable storage in the ingest layer in that this is permanent storage.

Note
Permanent does not mean forever in this case. It means until your data retention policy says to delete it.
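
As a sketch of what this layer’s writes could look like, assuming S3 and the official AWS SDK for Go v2: gzip a batch of events and put it under a date-based key. The bucket name and key layout are inventions for illustration.

    // Hypothetical batch write to S3 using the AWS SDK for Go v2.
    // Bucket name and key layout are placeholders.
    package storage

    import (
        "bytes"
        "compress/gzip"
        "context"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    // writeBatch gzips a batch of newline-delimited events and stores it.
    func writeBatch(ctx context.Context, batch []byte, key string) error {
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            return err
        }
        var buf bytes.Buffer
        gz := gzip.NewWriter(&buf)
        if _, err := gz.Write(batch); err != nil {
            return err
        }
        if err := gz.Close(); err != nil {
            return err
        }
        _, err = s3.NewFromConfig(cfg).PutObject(ctx, &s3.PutObjectInput{
            Bucket: aws.String("grimwhisker-archive"),   // placeholder bucket
            Key:    aws.String("events/" + key + ".gz"), // e.g. events/2024/01/02/batch-001.gz
            Body:   bytes.NewReader(buf.Bytes()),
        })
        return err
    }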

Query Layer

This is the interface you present to the people who are accessing the data: business development people, marketing people, etc. Common examples are https://www.tableau.com, https://metabase.com, and a ton more. Anything that makes pretty graphs and charts.

It also includes things like SparkSQL. If the consumers of your data are more developer-centric (say, a machine learning team), they may want to just run SQL queries. https://prestodb.io is often seen at this layer. https://impala.apache.org is another, as is https://drill.apache.org.
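
From the consumer’s side, that can be as plain as database/sql. Here’s a sketch, assuming a Presto cluster and the community presto-go-client driver; the DSN, table, and columns are placeholders, not project decisions.

    // Hypothetical consumer-side query against Presto via database/sql,
    // using the community presto-go-client driver. DSN and table are placeholders.
    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/prestodb/presto-go-client/presto" // registers the "presto" driver
    )

    func main() {
        db, err := sql.Open("presto", "http://user@localhost:8080?catalog=hive&schema=default")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        rows, err := db.Query(
            "SELECT country, count(*) AS n FROM events GROUP BY country ORDER BY n DESC LIMIT 10")
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()

        for rows.Next() {
            var country string
            var n int64
            if err := rows.Scan(&country, &n); err != nil {
                log.Fatal(err)
            }
            fmt.Println(country, n)
        }
        if err := rows.Err(); err != nil {
            log.Fatal(err)
        }
    }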

End

OK, that’s it for the intro tutorial! In the next one, we’ll dive into the ingest layer, talk about schemas, and even more exciting things!

