Building a Data Pipeline - Languages and Stack
Choosing a language and stack
Intro
Hello! Welcome to the exciting topic of ingesting massive amounts of pastries data! We talked about ingest a bit in the previous tutorial; in this one, we’ll examine various architectures in great detail.
Let’s get the controversial part out of the way. Language. =)
Language and Stack
This is a topic that provokes many a developer into fits of rage, much tabs vs spaces or vim vs emacs. The truth is that all the popular languages are popular for a reason: they solve a problem. Some solve many problems. But they do it will enough people are using it.
This doesn’t mean that one language is equally good at everything. It means pick the right tool for the job. In this case, data ingest.
Common Choices
Let’s start with the high-level languages, because that’s how a lot of projects start. Ever hear "Oh, we’ll just prototype in Python!" and then it is six months later and you’re trying to figure out how to scale gunicorn?
Ruby, Python, et al.
Ruby has Rails. Python has Django. Both have the ability to accept JSON data over http. So what’s the problem with them?
Scaling.
Higher-level languages often have what is called a GIL, or Global Interpreter Lock. That means your Python/Ruby VM can only have one active thread of execution, even if you create multiple OS threads. This means we have to do process scaling, where we create multiple language VMs. These then have to be managed by an application server (unicorn, gunicorn, etc).
The problem can get worse, depending on how your particular application server handles requests. Many use a process per request model. This means you can service one request per VM process. So if we have unicorn create 8 processes running our application, we can only service 8 simultaneous requests.
If your typical request can be serviced quickly, this may be fine. And most of these applications do allow for some level of dynamic scaling in the number of processes.
Processes and Heap Fragmentation
An OS process is resource intensive to creat and maintain. Memory must be allocated for it, for one. And this is where another problem comes in: heap fragmentation. If you watch a typical Ruby or Python web app, you’ll often see the process RAM usage creep up, and eventually the application server will kill it and start a new one. Why? Think of computer memory as a series of cubbies, like this:
An empty heap
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ │ │ │ │ │ │ │ │ │
│ │ │ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
Almost all high-level languages use the heap for class instantation. Let’s say we have a class, Bear, that takes up 3 cubbies, and another object, Tooth, that takes up 5, and then another Bear. After the three allocations, our memory looks like this:
A partially full heap
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│BBBBB│ │BBBBB│ │BBBBB│ │TTTTT│ │TTTTT│
│BBBBB│ │BBBBB│ │BBBBB│ │TTTTT│ │TTTTT│
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│TTTTT│ │TTTTT│ │TTTTT│ │BBBBB│ │BBBBB│
│TTTTT│ │TTTTT│ │TTTTT│ │BBBBB│ │BBBBB│
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│BBBBB│ │ │ │ │ │ │ │ │
│BBBBB│ │ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
If we wanted to allocate a tooth, we don’t have enough slots now. We have 4, we need 5. Now let’s say that the first Bear, the first 3 slots, get garbage collected and are empty again, so our heap looks like this:
A fragmented heap
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│ │ │ │ │ │ │TTTTT│ │TTTTT│
│ │ │ │ │ │ │TTTTT│ │TTTTT│
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│TTTTT│ │TTTTT│ │TTTTT│ │BBBBB│ │BBBBB│
│TTTTT│ │TTTTT│ │TTTTT│ │BBBBB│ │BBBBB│
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐
│BBBBB│ │ │ │ │ │ │ │ │
│BBBBB│ │ │ │ │ │ │ │ │
└─────┘ └─────┘ └─────┘ └─────┘ └─────┘
Now we have 7 slots free, but they are not contigous! At this point, many language VMs will just allocate more memory. As time goes on, we end up with a heap that has gaps everywher that are hard to fill. Thus, a fragmented heap!
What would be nice would be to move that last Bear object into the first three slots, then allocate the second tooth there. This is sometimes called heap compaction. But it can be slow, depending on the workload, may not do much good. Hence why it is often easier just to start a new process.
Performance
There is no gettig around the fact that these higher level languages are, ultimately, interpreted at runtime by a VM, not executed as native code. This has a performance cost.
All that being said, these languages are excellent for prototyping, if you want to try out something quickly. Just don’t expect them to handle high amounts of traffic.
Can they scale? Yes, but it gets very complicated, and we want simple.
Java
Java’s performance is fine for data ingestion. It has the advantage of having libraries to connect to anything ever.
JVM-based Languages
An advantage of the JVM is that there is a large ecosystem of languages that run on top of it. Scala is becoming a popular choice. Clojure is another. There’s a language for everyone, and it is easy to use Java libraries from the other languages. There are even options like JRuby and Jython that let you run Ruby or Python atop the JVM. An interesting side-effect of this is you can get access to true parallelism in languages that are clones of Python/Ruby.
Disadvantages
From an operations perspective, the JVM is very complex to deploy, manage, and scale. It requires a lot of application-specific optimization, and there are approximately 9 billion differe flags you can provide it. If your team already has skills in this area, or a team that will be managing it, Java or a JVM-based language is worth considering.
Stop the World Garbage Collection
Because Java is a garbage-collected language, the JVM has to track every object created and when it should be destroyed. In large applications, this can get quite complex, leading to long GC pauses. And you don’t know exactly how much garbage there is to collect until you pause execution and do a sweep.
Over the years, a lot of work has gone into the JVM to make this better, and there have been great improvements. But there is no getting around the core issue. For ingest, this probably wouldn’t cause problems; it is more likely to cause issues in clustered applications, like Kafka or ElasticSearch.
Java-based solutions are complex, and we want simple. So let’s move on.
.NET Languages
All of the above regarding the JVM is applicable to CLR-based languages: C#, F#, etc. For many years, you had to run these on IIS servers, which were Windows only. Now there is .NET Core, which offers has a compilation story much like Go’s. It is easy to compile and run on many platforms. Their built-in webserver, Kestrel, is also competitive with the fastest C and Rust HTTP servers.
If you have a lot of Windows/MS devs, definitely consider .NET Core. My expertise is not in the .NET ecosystem, however, so we won’t be using one of these.
Go and Elixir/Erlang
These are my two top recommendations when you need highly-scable architectures that can handle a higher amount of concurrency and leverage multiple cores.
c Go Go was designed to be a performant yet simple language. It excels at writing network services. The built in http server and router fulfill almost every use case. Learning Go in a weekend is feasible for an experienced programmer. Rapid acquisition of the language is one of its design goals.
Concurrency
Go has goroutines
, which are run-time threads. That is, when a Go program starts, it creates a number of OS threads equal to the number of cores on the system (although this is configurable); programmers can then create lightweight goroutines which are multiplexed across those OS threads. This model is usual. In the case of an HTTP request, each one is handled in an isolated goroutine
. It is possible to have hundreds of thousands of goroutines
.
Deployment
Go is the easiest language I’ve worked with when it comes to compiling and cross-compiling. It compiles very quickly, and compiling for different architectures is simply a matter of setting environment variables. Stick that single binary in a Docker container, and off you go.
Disadvantages
Go is, by design, a very simple language. It has no generics, for example. The package management situation is horrible, as in there isn’t a coherent one and you have to try to figure out how to manage multiple Go environments. It is rather like Node in that way, or Ruby.
It also has a garbage collector, but it is a far simpler one, and is focused on low latency. It isn’t problem-free, but the non-deterministic pauses are much less of a hassle in Go.
Elixir/Erlang
Erlang is a language that was written to run on telephone switches handling millions of calls and could tolerate no downtime. Its VM (equivalent to the JVM and CLR) is called the BEAM, or sometimes the EVM.
Elixir
Elixir is a language written using Erlang macros. Erlang syntax is…different, and Elixir was created by a person from the Ruby community. So you can think of it as Ruby running on the BEAM.
Processes
The BEAM has several fundamental differences from the .NET CLR and JVM. It provides lightweight processes using the same M→N
schedule as Go, but the BEAM enforces the Actor model at a fundamental level. As an example, if you were building a web scraper in Go, you might create 10 goroutines that each lock a shared data structure when they want to access it. This is shared mutable state.
BEAM requires processes to communicate with each other via message passing. Instead of locking and directly updating the data structure, each scraper would send a message saying it wanted to update it, and with what data.
This has some interesting effects; one of the nicer ones is that garbage collection is done at the BEAM Process level. There is no Stop the World Garbage Collection. The second nice thing is that the BEAM uses a pre-emptive scheduler. In a language like Python, a lightweight thread has to yield to give another lightweight thread time to run. If it never yields, it can block forever. Go has an interesting hybrid model, where the compiler injects yield statements at function entry and exit points. But it is still cooperative.
With the BEAM, a process has a certain amount of time allowed per processor cycle, and then it is moved off the core and another one takes its place. This prevents problems like the ones described above, but it also introduces others. Many programs are not written with the idea that execution might pause at any point.
Entire books have been written about this, so I won’t go into things like supervisor trees and such here. Suffice to say, these qualities allow BEAM to be used for soft-realtime systems.
Disadvantages
Erlang and Elixir are not well-known. They are also functional languages, which can have a difficult learning curve as people get used to it. For stability and consistency, however, the BEAM cannot be beat.
All the Others
Yes, you could do this in OCaml, Haskell, Crystal, etc, but I’m running out of space. =)
Which One?
For the edge ingest layer, the choices for me are Go or Elixir. For this tutorial, we’ll use Go, though if I have the energy, I might do an Elixir version in parallel so you can see the differences.
End
That’s it for this article! Next one, we’ll implement our edge layer.
If you need some assistance with any of the topics in the tutorials, or just devops and application development in general, we offer consulting services. Check it out over here or click Services along the top.