Roadmap · Updated May 2026

The Data Engineer trek

Build the infrastructure that powers data teams. SQL, Python, Spark, Kafka, Airflow, cloud data warehouses, and the architecture of modern data lakehouses.

Stages: 13
Estimated time: 7 months
Level: Intermediate → Advanced
Maintained by: 3 practitioners
Stage 01

SQL & data modeling foundations

Data engineering starts with SQL mastery and understanding how to model data for analytical workloads.

SQL · Data Modeling · Beginner
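A taste of what this stage covers: modeling data dimensionally and querying it analytically. The sketch below builds a miniature star schema in SQLite; the table and column names are invented for illustration, not part of the roadmap.

```python
import sqlite3

# A miniature star schema: one fact table plus a dimension, queried
# the way an analytical workload would (join, group, aggregate).
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE fact_orders  (order_id INTEGER PRIMARY KEY,
                               customer_id INTEGER REFERENCES dim_customer,
                               amount REAL);
    INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US');
    INSERT INTO fact_orders  VALUES (10, 1, 40.0), (11, 1, 60.0), (12, 2, 25.0);
""")

def revenue_by_region(conn):
    """Aggregate the fact table along a dimension attribute."""
    return conn.execute("""
        SELECT d.region, SUM(f.amount) AS revenue
        FROM fact_orders f
        JOIN dim_customer d USING (customer_id)
        GROUP BY d.region
        ORDER BY d.region
    """).fetchall()

print(revenue_by_region(conn))  # [('EU', 100.0), ('US', 25.0)]
```

The same fact/dimension split scales from SQLite to any warehouse; the modeling discipline is what this stage is really about.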
Stage 02

Python for data engineering

Python for building pipelines, not analysis. File I/O, API calls, data transformation, and packaging reusable code.

Python · Data Pipelines · Beginner
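The pipeline-not-analysis mindset in one small example: extract records, transform each one, and emit them in a load-ready format. The CSV content and field names below are made up for illustration.

```python
import csv
import io
import json

# Extract-transform-load in miniature: read CSV text, normalise each
# record, and emit newline-delimited JSON. Data is illustrative.
RAW = """\
id,email,signup_date
1, Alice@Example.com ,2026-01-03
2,bob@example.com,2026-01-05
"""

def transform(row):
    """Normalise one record: cast the id, strip and lowercase fields."""
    return {"id": int(row["id"]),
            "email": row["email"].strip().lower(),
            "signup_date": row["signup_date"].strip()}

def run_pipeline(raw_csv):
    """Extract rows, transform each, return load-ready JSON lines."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    return [json.dumps(transform(row)) for row in reader]

for line in run_pipeline(RAW):
    print(line)
```

Packaging small, testable functions like `transform` is what makes pipeline code reusable later in the roadmap.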
Stage 03

Cloud data warehouses

BigQuery, Snowflake, and Redshift — the analytical databases that power modern data teams.

BigQuery · Snowflake · Cloud
Stage 04

dbt — analytics engineering

dbt transforms raw data into analytics-ready models, tested and documented, as code.

dbt · Analytics Engineering · SQL
Stage 05

Apache Spark & distributed computing

When data doesn't fit in memory: Spark's DataFrame API, optimization, and the mental model for distributed computation.

Spark · PySpark · Distributed
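The mental model for distributed computation can be sketched without a cluster: split data into partitions, transform each partition independently, then combine partial results. This is plain Python mimicking Spark's map/partial-aggregate/reduce flow, not the Spark API itself.

```python
from functools import reduce
from operator import add

# Spark's core idea without Spark: partition the data, do per-partition
# work independently (as executors would), then combine partials.
data = list(range(1, 101))
partitions = [data[i::4] for i in range(4)]       # "distribute" into 4 partitions

def map_partition(part):
    """Work done on one executor: filter and pre-aggregate locally."""
    return sum(x * x for x in part if x % 2 == 0)

partials = [map_partition(p) for p in partitions] # parallel on a real cluster
total = reduce(add, partials)                     # the combine/shuffle step
print(total)  # 171700
```

Pre-aggregating inside each partition before combining is the same instinct behind avoiding wide shuffles in real Spark jobs.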
Stage 06

Stream processing with Kafka

Real-time data ingestion: Kafka fundamentals, Kafka Streams, Flink, and building event-driven pipelines.

Kafka · Streaming · Real-time
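The produce/poll loop at the heart of an event-driven pipeline, sketched with an in-memory queue purely for illustration; a real pipeline would use a Kafka client against a broker, and the event shapes here are invented.

```python
import queue

# A topic-like in-memory queue standing in for a Kafka topic.
topic = queue.Queue()

def produce(events):
    """Producer side: append events to the topic."""
    for event in events:
        topic.put(event)

def consume(process):
    """Consumer side: drain the topic, processing each event in order."""
    processed = []
    while True:
        try:
            event = topic.get_nowait()
        except queue.Empty:
            break          # a real consumer keeps polling for new events
        processed.append(process(event))
    return processed

produce([{"user": 1, "action": "click"}, {"user": 2, "action": "view"}])
print(consume(lambda e: e["action"]))  # ['click', 'view']
```

What Kafka adds on top of this loop is durability, partitioned ordering, and consumer-group offsets, which is what this stage digs into.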
Stage 07

Workflow orchestration with Airflow

Airflow, Prefect, and Dagster — scheduling, monitoring, and making data pipelines reliable in production.

Airflow · Orchestration · Pipelines
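What an orchestrator does at its core: run tasks in dependency order. The task names and graph below are invented; Airflow, Prefect, and Dagster layer scheduling, retries, and monitoring on top of exactly this idea.

```python
from graphlib import TopologicalSorter

# A DAG as a dict: task name -> list of upstream tasks it depends on.
tasks = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["quality_check"],
}

def run_dag(dag, runner):
    """Execute each task only after all of its upstream tasks."""
    order = TopologicalSorter(dag).static_order()
    return [runner(name) for name in order]

log = run_dag(tasks, lambda name: f"ran {name}")
print(log)  # ['ran extract', 'ran transform', 'ran quality_check', 'ran load']
```

Once you see a pipeline as a DAG plus a runner, Airflow's operators and schedules read as production-grade versions of these two pieces.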
Stage 08

Data quality & testing

Pipelines without data quality checks are time bombs. Great Expectations, dbt tests, and systematic data quality frameworks.

Data Quality · Great Expectations · Testing
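A systematic framework in miniature: each check returns a named pass/fail result, mimicking what Great Expectations and dbt tests formalise. The column names, rows, and check names are illustrative.

```python
# Two classic data quality checks: not-null and uniqueness.
rows = [
    {"order_id": 1, "amount": 40.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 25.0},
]

def check_not_null(rows, col):
    """Pass only if every row has a value in `col`."""
    return (f"{col}_not_null", all(r[col] is not None for r in rows))

def check_unique(rows, col):
    """Pass only if no value in `col` repeats."""
    values = [r[col] for r in rows]
    return (f"{col}_unique", len(values) == len(set(values)))

def run_checks(rows):
    """Run all checks and return the names of the ones that failed."""
    results = [check_not_null(rows, "amount"), check_unique(rows, "order_id")]
    return [name for name, ok in results if not ok]

print(run_checks(rows))  # ['amount_not_null', 'order_id_unique']
```

Failing loudly before loading, rather than after a dashboard breaks, is the whole point of this stage.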
Stage 09

Data lakehouse architecture

Delta Lake, Apache Iceberg, and the lakehouse pattern that gives data lakes ACID transactions and schema enforcement.

Delta Lake · Iceberg · Lakehouse
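Schema enforcement, one of the guarantees the lakehouse pattern adds, sketched in plain Python: writes that do not match the declared schema are rejected instead of silently corrupting the table. Delta Lake and Iceberg enforce this at the storage layer; the schema and rows here are made up.

```python
# A declared table schema: column name -> required Python type.
SCHEMA = {"event_id": int, "user_id": int, "payload": str}

def validate(row, schema=SCHEMA):
    """Reject rows with missing/extra columns or wrong types."""
    if set(row) != set(schema):
        raise ValueError(f"column mismatch: {sorted(row)}")
    for col, typ in schema.items():
        if not isinstance(row[col], typ):
            raise ValueError(f"{col} expected {typ.__name__}")
    return row

table = []
table.append(validate({"event_id": 1, "user_id": 7, "payload": "ok"}))
try:
    validate({"event_id": "oops", "user_id": 7, "payload": "bad"})
except ValueError as err:
    print("rejected:", err)  # the bad write never reaches the table
```

Raw data lakes accept any bytes; enforcing a schema on write is what turns a lake into something analytics can trust.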
Stage 10

Data governance & security

Data cataloging, lineage, access control, PII handling, and the compliance requirements data engineers must operationalize.

Governance · Lineage · Compliance
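One concrete slice of PII handling that engineers operationalize: pseudonymising identifying columns before data leaves a restricted zone. The salt, column names, and record below are illustrative; real deployments manage salts as rotated secrets.

```python
import hashlib

# Columns treated as PII in this sketch, and a salt (hypothetical).
SALT = b"rotate-me"
PII_COLUMNS = {"email", "full_name"}

def pseudonymise(row):
    """Replace PII values with a salted, truncated SHA-256 digest."""
    masked = {}
    for col, value in row.items():
        if col in PII_COLUMNS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            masked[col] = digest[:12]   # stable token, not the raw value
        else:
            masked[col] = value
    return masked

row = {"user_id": 42, "email": "alice@example.com", "plan": "pro"}
print(pseudonymise(row))
```

Because the digest is deterministic, joins on the masked column still work, which is why pseudonymisation (rather than deletion) is a common compliance tool.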
Stage 11

Infrastructure for data

Terraform for data infrastructure, cost optimization for data warehouses, and running data platforms on Kubernetes.

Terraform · Cost · Kubernetes
Stage 12

Real-time analytics

Combining batch and streaming for hybrid architectures. ksqlDB, Flink SQL, and building dashboards on live data.
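The core operation behind live dashboards: aggregating an unbounded stream into fixed-size (tumbling) time windows, akin to what a ksqlDB or Flink SQL windowed query expresses. The events and the 60-second window size below are illustrative.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size (illustrative)

def tumbling_counts(events):
    """Count events per key per 60-second window.

    `events` is a stream of (epoch_seconds, key) pairs; each event
    lands in exactly one non-overlapping window.
    """
    windows = defaultdict(int)
    for ts, key in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        windows[(window_start, key)] += 1
    return dict(windows)

events = [(5, "click"), (42, "click"), (61, "view"), (65, "click")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (60, 'view'): 1, (60, 'click'): 1}
```

Stream engines add the hard parts (late data, watermarks, state that survives restarts), but the windowing arithmetic is exactly this.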

Real-time · Flink · ksqlDB
Stage 13

Capstone — build a data platform

Design, build, and document a production data platform that a data team can rely on.

Capstone · Advanced · Portfolio

Trek complete. What's next?

You've walked the full roadmap. Now ship the capstone, write about it, and share the path with the next engineer who needs it.

Read the blog · Explore more roadmaps