
Top 5 Specialisations for Data Engineers

When mankind embarks on a journey of exploring a new subject matter, the first part of the ride is usually bumpy and full of ideas and initiatives that turn out to be erroneous in hindsight. It is very true that failing is essential for learning - just ask the people who call themselves "serial entrepreneurs" on LinkedIn. When figuring out how to approach a problem, we tend to systematically reinvent and restructure hypotheses, tools, methodologies and processes over and over again in order to maximise the likelihood of a desired outcome, and later on we try our best to optimise all this for efficiency. This is an established form of discovery that aims at generating value, all in the name of capitali... prosperity.

Just take the evolution of the automobile or the continuous development of modern medicine as an example. I would argue that we as a society are just getting started with finding out how software development and maintenance should be sustainably institutionalised, and the growing pains of today can be vividly felt when talking to people who have spent some time working in tech. Setting up organisational structures, distributing tasks, managing workloads, streamlining communication and exploring new methods of collaboration are all popular subjects for bestsellers for a reason. As a consequence of quickly evolving consumer expectations of the products and services built upon software, we keep iterating and we remain on the lookout for somebody with answers - or at least with a totemic acronym we can put on pink post-it notes.

But within this circus, there are smaller stages with plays of comparable conflicts: the stage of data being one of them. The quickly changing nature of the tools and processes applied to generate value out of the growing amount of available data requires constant reiteration of the skillset of professionals working with it, and a continuous realignment of organisational roles and responsibilities. You can multiply this uncertainty-soaked complexity with demographic, cultural, geographic and organisational factors that all have an undeniable influence on the daily business of a team working with technology.

This environment and its constant change are what make it so difficult to create blueprints for long-term career pathways within data. Everyone who is planning on staying en vogue should therefore take care to remain ahead of the curve... somehow.

How to Level Up

The established, basic trifecta of data roles in 2021 consists of the data analyst, the data scientist and the data engineer - yet a lot is happening when you look at further career development. The most straightforward way of increasing the value one generates is to take on more decision-making responsibility (financial responsibility, personnel responsibility) - think Senior Data Engineer, VP Data Engineering, Head of Data etc. But there are more and more specialised roles popping up that mix the fundamental competencies of a data engineer with different areas of expertise to create a powerful combination. These are some of the roles that I see being highly sought after on the job market. These skill-combos carry loads of value for businesses, who are ultimately looking to cast the right actors to star in the show.

Interestingly, we can observe the participants of the first cohort of Pipeline Academy already getting a feel for their own personal paths and starting to work towards those goals. We are putting a lot of emphasis on the fundamentals of data engineering at the bootcamp, yet ultimately it's the combination with the unique professional backgrounds and personal ambitions of our students that makes them really stand out on the job market.

Let's take a look at five selected roles that are worth a deeper dive.

1. Analytics Engineer or BI Engineer

The role of the Analytics Engineer or Business Intelligence Engineer started popping up in 2018 in more and more blogs and articles mainly thanks to this thing called dbt.

“Today, if you’re a “modern data team” your first data hire will be someone who ends up owning the entire data stack.”

This title basically describes a pissed-off data analyst who is by default able to generate insights out of data, and who with some added engineering skills becomes an end-to-end powerhouse for setting up solid BI pipelines without the help of an additional engineer or software developer. This idea is part of a broader trend of enabling data consumers to access what they want without intermediaries. Analytics engineers are highly valued, especially at a time when hiring data engineers is fairly difficult due to a lack of available supply. Some are even going as far as saying that "80% of analytics is effectively data engineering"... Definitely one of the most promising roles you can enter in 2021. There are a handful of forward-thinking scale-ups that are already hiring for this role in Berlin.

Competencies: analytical domain knowledge, SQL, Python and ETL/ELT knowledge, dbt, data visualisation and storytelling skills.
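To make the "SQL plus ETL/ELT" part of that list a bit more tangible, here is a minimal sketch of the load-then-transform idea, using only Python's built-in sqlite3 module. It is not dbt and does not use dbt's API; it only mimics the general pattern of expressing a reporting model as a SELECT statement on top of raw data. The table names, columns and sample rows (raw_orders, daily_revenue, amount_eur) are made up for illustration.

```python
import sqlite3


def build_daily_revenue(conn):
    """Load raw order rows, then derive a small reporting table using SQL only."""
    # "Load" step: raw data lands untouched in a staging table (hypothetical schema).
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, ordered_on TEXT, amount_eur REAL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "2021-03-01", 20.0, "paid"),
            (2, "2021-03-01", 35.5, "paid"),
            (3, "2021-03-01", 12.0, "cancelled"),
            (4, "2021-03-02", 50.0, "paid"),
        ],
    )
    # "Transform" step: the model is just a SELECT, materialised as a table.
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT ordered_on, SUM(amount_eur) AS revenue_eur
        FROM raw_orders
        WHERE status = 'paid'
        GROUP BY ordered_on
        """
    )
    query = "SELECT ordered_on, revenue_eur FROM daily_revenue ORDER BY ordered_on"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for day, revenue in build_daily_revenue(connection):
        print(day, revenue)
    connection.close()
```

The point of the sketch is the division of labour: the raw table stays as it arrived, and the business logic (only paid orders count as revenue) lives in one readable SQL statement that an analyst can own.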

2. Machine Learning Engineer or AI Engineer

ML engineers are data scientists who can productionize their models in order to solve business challenges - this might sound like an oversimplified definition, yet that's pretty much it. Does this really get you into one of the best paid positions within data? Well, kind of. Currently I see two directions to arrive at this role: either you come from a data science angle and learn how to set up architectures and integrate sophisticated models into the equation in a robust manner (notice the DS —> DE pivot), or you come from the computer science/software engineer/data engineer realm and you level up your understanding of the mathematical/computational models you are supposed to build or paste into your pipelines. The key expectations include:

  • "Translating the work of data scientists from environments such as Python/R notebooks into analytics applications.

  • Creation of web services/APIs for serving ML/AI model results and enabling access to customers or internal teams.

  • Automating model training and evaluation processes.

  • Automating feature engineering, ensuring data for model training is cleaned and readily available and facilitation of the flow of data between ML/AI models and an organization’s data systems."

And the list goes on. You see, putting something 'into production' is what does the trick: this is where the business value is generated. Some critics even go as far as saying:

"Classic algorithms + domain knowledge + niche datasets are going to solve most real problems, not deep neural nets. Most of us aren’t working on self-driving cars."

... and there is certainly a lot of truth to that.

Competencies: experience with the fundamentals of machine learning models, computer science or software engineering background combined with the latest data engineering expertise, SQL, Python and the command line, concepts like Continuous Delivery and Continuous Integration, understanding of Docker and maybe Kubernetes.
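As a rough illustration of the "automating model training and evaluation" expectation quoted above, the following sketch trains a deliberately tiny model, scores it on held-out data and only marks it as promoted if the error stays under a threshold. Everything here is an assumption made for the example: the function names (fit_line, train_and_gate), the choice of a one-variable least-squares line, the mean absolute error metric and the threshold values. A real setup would swap in an actual ML library and a proper deployment step.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def mean_absolute_error(model, xs, ys):
    """Average absolute difference between predictions and observed values."""
    slope, intercept = model
    return mean(abs((slope * x + intercept) - y) for x, y in zip(xs, ys))


def train_and_gate(train, holdout, max_error):
    """Train, evaluate on held-out data, and flag whether the model may ship."""
    model = fit_line(*train)
    error = mean_absolute_error(model, *holdout)
    return {"model": model, "holdout_mae": error, "promoted": error <= max_error}


if __name__ == "__main__":
    training_data = ([1, 2, 3, 4], [2, 4, 6, 8])
    holdout_data = ([5, 6], [10.5, 11.5])
    print(train_and_gate(training_data, holdout_data, max_error=1.0))
```

Run on a schedule or in a CI job, a gate like this is one simple way to keep a retrained model from reaching production when it performs worse than agreed.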

3. DataOps and Machine Learning Ops

Data architectures need maintenance, sometimes significantly more than traditional or legacy software stacks, to perform and keep the output quality at a high level at all stages of the lifecycle. The reasons for this are the constantly changing tools and business requirements that demand attention, and of course the pipelines that need fixing as a result. Shouldn't this be done by the data engineers, you ask? Well, not really, but you'll definitely hear about small-scale data teams that don't cultivate ops in general. Mature (or bold) data organisations who have identified proper machine learning use cases with a positive ROI are the ones who first have to start thinking about maintaining the systems their ML engineers have put in place as well. Running the operations side of data is often considered the least glorious part of the job, yet it's a very sought-after and rewarding role.

Competencies: SQL, Python, command line, understanding algorithmic complexity, operations experience, knowledge of concepts like CI/CD, logging, monitoring and profiling, understanding of Docker and Kubernetes.
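The "logging and monitoring" part of the job can be pictured with a small sketch like the one below: a check that looks at one batch of pipeline output and reports volume, completeness and freshness problems. The row layout (dictionaries with amount and loaded_at keys), the function name check_batch and all thresholds are assumptions made for this example, not a standard.

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger("pipeline_checks")


def check_batch(rows, now, min_rows=1, max_null_rate=0.1, max_age=timedelta(hours=24)):
    """Return a list of human-readable problems found in one pipeline batch."""
    problems = []
    # Volume: did the pipeline deliver roughly what we expected?
    if len(rows) < min_rows:
        problems.append(f"expected at least {min_rows} rows, got {len(rows)}")
    if rows:
        # Completeness: share of rows with a missing amount.
        null_rate = sum(row["amount"] is None for row in rows) / len(rows)
        if null_rate > max_null_rate:
            problems.append(f"null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")
        # Freshness: age of the most recently loaded row.
        age = now - max(row["loaded_at"] for row in rows)
        if age > max_age:
            problems.append(f"newest row is {age} old, limit is {max_age}")
    for problem in problems:
        logger.warning(problem)
    return problems


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    current_time = datetime.now(timezone.utc)
    batch = [
        {"amount": 10.0, "loaded_at": current_time - timedelta(days=2)},
        {"amount": None, "loaded_at": current_time - timedelta(days=2)},
    ]
    print(check_batch(batch, current_time))
```

Returning the problems as a list, in addition to logging them, keeps the check easy to wire into whatever alerting or scheduling tool a team already uses.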

4. Data Product Manager or Data Product Owner

In order to successfully lead a cross-functional team one requires not only a feel for prioritising and managing multiple tasks in parallel, but also excellent communication and social skills. But as we dive deeper and deeper into the dark seas of technical complexity when designing and deploying software and data products, navigating becomes exponentially tougher. If companies would like to meet consumer needs and deliver the right features in an efficient and agile manner, the people in charge of the product roadmap need to be aware of the high-level technical requirements and consequences of their decisions (think data infrastructure costs, maintenance effort, data governance and quality across the org, stakeholder expectation management etc.). It is becoming clear that while the field of building websites or mobile apps is experiencing more and more commoditisation and is therefore turning simpler and more accessible, dealing with the whole lifecycle of data products is much more complicated and requires an increased awareness of the latest tooling landscape, the methodologies and the lingo in data.

Data engineers who would like to be in charge of a product are well suited to fulfil this role. At the same time, I am witnessing how experienced product people are leveling up their game to become data/AI POs by learning the essentials of data engineering at the bootcamp. Data leaders of the future, now is the time to do a deep dive into data infra; and in case ML comes into play in the consumer-facing product, there are plenty of accessible online courses available for getting the knowledge you need. As Adnan Boz (founder of the AI Product Management Institute) puts it:

“Since the development lifecycle of AI projects is based on “searching” rather than “planning”, companies need professionals who are trained to look at products as optimization problems rather than programming problems.”

Competencies: organisational skills, communication skills, understanding of the architecture and deployment process of data products and their consequences for the business and the user, stakeholder management.

5. Data Governance Manager or Data Czar

When you talk to data professionals who are tired of hearing platitudes about a company being "data-driven", a shared pattern seems to surface: data governance is not a nice-to-have anymore, but essential for utilising your data and your data staff. Now, what they are talking about is not the superficial "oh, sure, we more or less understand GDPR and we do have a data warehouse" type of governance, but rules that are transparently applied across the whole organisation. Giving this role to one dedicated person or team who oversees all potential conflicts and helps resolve them within the organisation is becoming more and more popular, especially as the legal responsibility of dealing with data is growing (and so are potential fines for companies). Data czars manage data access for users, deal with data lineage, data quality and data dictionaries, and support feature development with guidelines about collecting, storing, leveraging and sharing data in and outside of the company. However, they are not to be confused with the data protection officer (the Germans have a word for it: Datenschutzbeauftragte).

This role will become a mainstay: take the increasing complexity of data management in general, mix it with the continuously evolving legal environment and add the growing demand of generating more and more value for shareholders through using data. It's a very fine line between becoming an overprotective naysayer versus being a thoughtful enabler of data teams.

Competencies: understanding data privacy and security best practices and legal requirements, stakeholder management skills, database management, encryption and decryption techniques, data quality measurement, documentation and advocacy.


...and something I did not mention above, although it is a real point of differentiation from the employer perspective: industry-specific expertise aka domain knowledge is a major asset. Understanding the mechanisms of a vertical or the language used in a certain sector will always bring you bonus points when applying for a job. If you leverage that the right way, it can secure you a head start in the battle for a position.

The only constant in life is change

The lines between these roles are blurry at best, and they will undergo significant changes in the coming years. Engineering skills within data remain predominant, however: they are kind of a superpower that can be combined with a surprising number of unrelated competencies.

I guess the main takeaway should be:

“Most employers expect you to have overlapping skillsets. I feel like in the end it’s not about who gets wiped out, it’s about who is versatile enough to constantly adapt to the ever changing industry.”

If you are working on the frontlines of technology, you have to face the fact that this simply is the nature of this very quickly evolving environment, and you have to accept that being successful requires you to be curious and determined. Indeed, adaptation is the magic word, and since we’re far from being done exploring technology as a method, we can expect more change to come.

At the same time, this challenge is what makes data and tech such an exciting and rewarding place to be in.

PS: In the next episode of Career Specialisations for Data Engineers: Sustainable Data Architect, Data Quality Engineer etc.

PPS: If you think I've missed something, or you have suggestions for the above - feel free to shoot me an email.