Pipeline Data Engineering Academy

Become a Data Engineer on a Shoestring (aka The Best Free Courses and Learning Resources)

I was tinkering with how best to help others find the resources that give the most bang for the buck when upskilling in data engineering… and this is what I came up with. So this is how to spend the remainder of your learning budget before the year ends.

Alan Kay (Squeak, Smalltalk, OOP, GUI) suggested that programming is pop culture, because it spreads much, much faster than mentorship, education or formal study does. I think the best examples of this are the talks of James Mickens, which help shine a light on why people think that blockchain or machine learning is the solution.

Treat this as a restaurant menu. No one should consume it end-to-end. Scan it and pick something that looks good. Try it. Rinse and repeat! Wishing you a great 2021!


A) I want to pass the interview for ...
Learn Spark or Kafka or whatever people do these days at a Coursera Specialization or a Udacity Nanodegree — I did them, I did not like them. Expect to burn about a hundred bucks.

B) I want a band-aid, aka branded lock-in training!
Cloud providers and open-source-as-a-marketing-tool companies like Cloudera or Databricks are there to help you! Either for free or for a thousand dollars.

C) Gamify my learning!
For browser-based, self-paced, bite-size Sudoku, try learning Python and SQL in a command line at DataQuest or at DataCamp; the latter is also available on mobile.

D) I have to be able to bluff my way into mastering data engineering tools by tomorrow!
Udemy is the 'you get what you pay for' eBay Kleinanzeigen for your needs. 10 bucks will cover most courses when they are on sale == that's basically always.

Learn SQL

It is back.

The single most important technology that anybody who has 'data' in her title must master.

Select Star SQL (free, 3 hours)
An interactive book for learning SQL with SQLite running in the browser - no setup required, the best intro for a beginner.

The SQL Murder Mystery (free)
A good way to practice SQL skills after the tutorial, works in the browser.

SQL Police Department ($20)
If you've got a taste for story-based learning, continue here.
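
If you would rather poke at SQL locally than in the browser, Python ships with SQLite. The snippet below is only a minimal sketch of that kind of practice session, not an exercise from any of the resources above; the `sightings` table, its columns and the `top_suspects` helper are made up for illustration.

```python
import sqlite3


def top_suspects(rows, min_sightings=2):
    """Load (name, location) rows into an in-memory SQLite table and return
    (name, count) pairs for names seen at least `min_sightings` times."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing on disk
    conn.execute("CREATE TABLE sightings (name TEXT, location TEXT)")
    conn.executemany("INSERT INTO sightings VALUES (?, ?)", rows)
    query = """
        SELECT name, COUNT(*) AS n
        FROM sightings
        GROUP BY name
        HAVING COUNT(*) >= ?
        ORDER BY n DESC, name
    """
    result = conn.execute(query, (min_sightings,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    # Hypothetical sample data, in the spirit of a story-based SQL exercise.
    sample = [
        ("Ann", "gym"), ("Bob", "bar"), ("Ann", "bar"),
        ("Bob", "gym"), ("Bob", "cafe"), ("Cy", "gym"),
    ]
    print(top_suspects(sample))
```

GROUP BY with HAVING is the sort of thing the tutorials above get you to quickly, and an in-memory database means there is nothing to install or clean up.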

Learn Programming

The uncrowned king of coding intros is definitely Harvard's CS50 on edX (free, or €167 for a Verified Certificate), or you can learn with Dr. Chuck's Python for Everybody on Coursera (you can do it in a focused month for €41).

If you have previous programming exposure, you could spend 4 days honing your Python skills with Practical Python Programming by David Beazley. I totally recommend his presentations/workshops Generator Tricks For Systems Programmers and A Curious Course on Coroutines and Concurrency. I was about to embed his Discovering Python keynote, but I'd better just point out that it's one of my favourite keynotes ever.
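
To give a flavour of why generators matter for data work, here is a minimal sketch of a generator pipeline. It is not taken from Beazley's material; the function names and the three-field log format (`METHOD PATH BYTES`) are assumptions made up for this example.

```python
def read_lines(lines):
    """Yield each line without its trailing newline."""
    for line in lines:
        yield line.rstrip("\n")


def grep(pattern, lines):
    """Yield only the lines that contain `pattern`."""
    for line in lines:
        if pattern in line:
            yield line


def field(lines, index, sep=" "):
    """Yield the `index`-th separated field of each line, skipping short lines."""
    for line in lines:
        parts = line.split(sep)
        if index < len(parts):
            yield parts[index]


def total_bytes(log_lines):
    """Sum the byte counts of GET requests; non-numeric sizes are ignored."""
    sizes = field(grep("GET", read_lines(log_lines)), 2)
    return sum(int(s) for s in sizes if s.isdigit())


if __name__ == "__main__":
    # Hypothetical log lines; any iterable of strings works, including an open file.
    log = ["GET /a 100\n", "POST /b 50\n", "GET /c -\n", "GET /d 25\n"]
    print(total_bytes(log))
```

Each stage pulls one line at a time from the stage before it, so the same code works on a four-line list or a multi-gigabyte file without loading everything into memory.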

Watch screencasts

I didn’t understand the format until Torsten showed me Destroy All Software. It's not for everyone, it's not for everything, but now I understand its place in the universe. And yes, you can even learn computer science with it. Computer science is actually fun, too bad you don't get to do it much in the everyday crunch — although it pays well if you know its most crucial concepts.

Destroy All Software ($29/month)
Classic is mostly Ruby, computation is Python. Can't miss the Unix and Git ones.

execute program (16 free lessons then $19/month)
Hit the SQL and the Regex.

Computer Science Essentials (free samples, $79)
You need Database Normalization, Make, Reading Shell Scripts, Shell Script Basics. Has a book version, see below at The Imposter's Handbook.

The Missing Semester of Your CS Education (free)
Check Shell Tools and Scripting, Data Wrangling, Command-line Environment, Version Control (Git), Debugging and Profiling.

Top 3 books I recommend

Jeroen Janssens: Data Science at the Command Line, 2e (O’Reilly)
Not just data science at the command line. The new edition, which covers make, will take you very, very far.

Bill Karwin: SQL Antipatterns - Avoiding the Pitfalls of Database Programming (Pragmatic Bookshelf)
When I got to the third chapter describing situations that I encountered personally in my career I was 100% sold. Also for all who think database programming is not a thing. Hold my beer for a sec.

Greg Wilson: Data Crunching, 2005 (Pragmatic Bookshelf)
My first book on data. No frills, talks about file formats, has code examples, hits regex. Every day you will do something that is covered in this book.

More books for the data engineer

The Imposter's Handbook (each book $30)
You need the first book: Database Normalization, Make, Reading Shell Scripts, Shell Script Basics. The second one is a good-to-have. See Computer Science Essentials above for the video version.

A Curious Moon ($30)
Beautiful ebook with SQL exercises, also intro to Postgres with real NASA data.

Markus Winand: SQL Performance Explained (book for money, free web version)
A multiformat, practical handbook on advanced SQL and running databases in general.

A free Allen B. Downey: Think Python 2nd Edition and a $6.99 Ian Miell: Learn Bash the Hard Way ebook wouldn't hurt to have either.

Books for the connoisseur

Julia Evans: programming zines
Important topics are covered, and they show how you can talk about concepts without fuss. I like them all. Check the black and white free ones.

Philipp K. Janert: Data Analysis with Open Source Tools (O’Reilly)
High-end analysis practice, more useful than most ML/AI. Very well written. It raised the bar for me in technical writing.

Angus Croll: If Hemingway Wrote JavaScript
Only if you're into languages and the thinking structures they impose, but then you will start gifting it.

Books for your Zoom background

Martin Kleppmann: Designing Data-Intensive Applications (O’Reilly)
I wish this were our problem, not pop culture. Great and detailed, but of limited relevance from a practical point of view. Very academic, you can watch ‘em videos.

Ralph Kimball and Margy Ross: The Data Warehouse Toolkit 3rd Edition
I mean, it's too long, it's too slow, and it belongs to a previous era when computing was immensely less performant. Still, if you want to build mental models of how data describes different industries, it's a good old printed SAP.

Python Cookbook 3rd Edition
One has to have a reference book on Python. You know, no Internet, no Stack Overflow.

Thanks to Torsten Becker, Péter Fábián, Balázs Dénes and Steffen Kiedel for discussions over lunches in Berlin.

Support gastronomy during lockdown, order food from your valued vendors.

One more thing:
Check out the second edition of this series with even more recommended learning experiences for aspiring data engineers!