LLM internals — how GPT-style models actually work
Decoder-only architecture (GPT family) vs encoder-decoder (T5, BART) vs encoder-only (BERT). Tokenisation deep-dive: BPE algorithm, WordPiece, SentencePiece — implement BPE by hand. Context window mechanics. KV cache: what it stores (the key and value tensors for every previous token, per layer and head), why it trades memory for latency, and how PagedAttention improves on it. Autoregressive generation: temperature, top-p, top-k, repetition penalty — implement each sampling strategy. Read the GPT-2 and GPT-3 papers.
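"Implement BPE by hand" comes down to one loop: count adjacent symbol pairs, merge the most frequent pair, repeat. A minimal sketch on a toy whitespace-split corpus — real tokenisers (GPT-2's byte-level BPE, SentencePiece) work on bytes, handle word boundaries with markers, and add special tokens, none of which is shown here:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a toy corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

merges = train_bpe("low low low lower lowest", 3)
```

The merge list is the learned vocabulary growth: here the model discovers "lo", then "low", then "lowe" as reusable subwords, which is exactly why frequent words end up as single tokens while rare words split into pieces.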
Project
Run LLM inference with Hugging Face transformers, but write the sampling loop from scratch — implement every decoding strategy above yourself.
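The core of that sampling loop is a single function that filters raw logits before drawing a token. A sketch with NumPy, assuming `logits` is the model's last-position output; the repetition penalty follows the CTRL-paper convention (divide positive logits, multiply negative ones), which is also what transformers implements:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                      repetition_penalty=1.0, prev_tokens=(), rng=None):
    """Sample a token id from raw logits with the classic decoding knobs."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # CTRL-style repetition penalty: shrink logits of already-seen tokens.
    for t in set(prev_tokens):
        logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                     else logits[t] * repetition_penalty)

    logits = logits / max(temperature, 1e-8)  # temperature -> 0 approaches greedy

    if top_k > 0:
        # Mask everything below the k-th highest logit.
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = np.zeros_like(probs)
        keep[order[:cutoff]] = probs[order[:cutoff]]
        probs = keep / keep.sum()  # renormalise the surviving mass

    return int(rng.choice(len(probs), p=probs))
```

Wrapped in a loop that appends each sampled id and re-runs the model, this is the whole project; transformers exposes the same ideas as `LogitsProcessor` classes, which is worth reading after writing your own.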
Prompt engineering — systematic, not intuitive
Zero-shot, few-shot, chain-of-thought (CoT), tree-of-thought, self-consistency, ReAct prompting. Output structuring: JSON mode, XML schemas, function calling for structured extraction. System prompts, role prompting, constitutional AI prompting. The key insight: prompt engineering without evaluation is guessing. Build an automated evaluation harness that scores prompt variants — LLM-as-judge, BLEU, ROUGE, custom rubrics.
Project
Prompt evaluation framework — 5 prompt variants, automated LLM-as-judge scoring, statistical significance testing.
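The significance-testing step can be sketched as a paired permutation test over per-example judge scores, assuming two score lists aligned by test case (one score per prompt variant per example):

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on the mean score difference.

    Each resample randomly flips the sign of per-example differences,
    simulating the null hypothesis that the two prompts are interchangeable.
    Returns an approximate p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        resampled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(resampled) / len(resampled)) >= observed:
            hits += 1
    return hits / n_resamples
```

A permutation test makes no normality assumption, which matters because LLM-as-judge scores are bounded and often skewed; a small p-value means the gap between two prompt variants is unlikely to be noise on this test set.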
Embeddings & vector search — the geometry of meaning
Dense embeddings: what they encode (semantic relationships as geometric proximity), cosine similarity mechanics. Embedding model selection: MTEB benchmark, task-specific vs general models, dimensionality trade-offs. Sparse embeddings: TF-IDF, BM25 — when keyword matching beats semantics. Vector databases: Pinecone, Weaviate, pgvector. HNSW indexing internals: the hierarchical graph structure, how it achieves O(log n) approximate nearest neighbour search, recall vs speed trade-off.
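The cosine-similarity mechanics fit in a few lines: normalise the vectors and the metric collapses to a dot product, so exact search over a whole corpus is one matrix multiply. A sketch, assuming `corpus` is an (n, d) matrix of precomputed embeddings from any model:

```python
import numpy as np

def cosine_top_k(query, corpus, k=3):
    """Exact nearest-neighbour search by cosine similarity.

    query: (d,) vector; corpus: (n, d) matrix of document embeddings.
    Returns (indices, scores) of the k most similar rows.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                         # cosine reduces to a dot product
    top = np.argsort(scores)[::-1][:k]    # highest similarity first
    return top, scores[top]
```

This brute-force scan is O(n·d) per query — fine for thousands of documents, and exactly the baseline HNSW's approximate O(log n) search trades recall to beat at scale.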
Project
Semantic search engine over a 10,000+ document corpus — compare flat exact cosine search against HNSW approximate search on latency and recall.
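The recall side of that comparison needs a metric; a minimal sketch, treating the flat exact search as ground truth:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that the approximate index also returned.

    approx_ids: ranked ids from the ANN index (e.g. HNSW);
    exact_ids: ranked ids from flat exact search over the same corpus.
    """
    exact = set(exact_ids[:k])
    return len(exact & set(approx_ids[:k])) / len(exact)
```

Averaged over a query set and plotted against query latency at different HNSW `ef` settings, this gives the recall-vs-speed curve the project should report.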
RAG systems — beyond the naive baseline
Start with naive RAG (chunk → embed → retrieve → generate) and measure it. Then fix its failure modes one by one. Chunking: fixed-size vs semantic chunking vs late chunking — understand what information is lost at chunk boundaries. Hybrid retrieval: dense + sparse with Reciprocal Rank Fusion — measure improvement on your dataset. Re-ranking: cross-encoders (BGE-reranker) and late-interaction models (ColBERT), and why both outperform plain bi-encoders for precision. Query expansion, HyDE (hypothetical document embeddings), contextual compression. RAGAS evaluation framework: faithfulness, answer relevancy, context precision, context recall.
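Reciprocal Rank Fusion is small enough to write directly: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k=60 the constant from the original RRF paper. A sketch, assuming each ranking is a list of document ids ordered best-first:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. one dense, one BM25) into a single ranking.

    Scores depend only on rank positions, so the dense and sparse scores
    never need to be calibrated against each other.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-only fusion is the point: dense cosine scores and BM25 scores live on incomparable scales, and RRF sidesteps the normalisation problem entirely, which is why it is the default hybrid-retrieval baseline to measure against.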
Project
Production RAG system with RAGAS eval scores above 0.8 — document baseline vs hybrid vs reranked improvements with metrics.
A RAG system without an eval framework is a guess. Every architectural decision must be measured. If you can't show the improvement numerically, it didn't happen.
Agentic systems — planning, memory, tool use
ReAct loop: Reason + Act, the observation feedback cycle. OpenAI function calling and Anthropic tool use — implement both. LangGraph: stateful agents as directed graphs with cycles, conditional edges, checkpointing, human-in-the-loop interrupts. Memory architecture: in-context (limited), external vector store (semantic retrieval), episodic (conversation history), procedural (tool selection memory). Multi-agent patterns: supervisor-worker, peer-to-peer (AutoGen), sequential chains. Agent evaluation: where it fails (tool selection errors, infinite loops, off-task drift) and how to diagnose each.
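The Reason → Act → Observe cycle can be sketched as a plain loop before reaching for a framework. The stub model, tool name, and string protocol below are hypothetical, chosen so the loop runs offline; a real agent swaps `fake_llm` for an API call and parses the tool choice from the model's structured output:

```python
def run_react(llm, tools, question, max_steps=5):
    """Minimal ReAct loop: the model emits either a tool call or a final
    answer; tool observations are appended and fed back in next turn."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # model sees the whole history
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)
            transcript += f"Observation: {observation}\n"
    return None  # hit the step cap — the infinite-loop failure mode, worth logging

# Stub model and tool to exercise the loop (illustrative only).
def fake_llm(transcript):
    if "Observation:" not in transcript:
        return "Action: calculator 6*7"
    return "Final Answer: 42"

tools = {"calculator": lambda expr: eval(expr, {"__builtins__": {}})}
answer = run_react(fake_llm, tools, "What is 6*7?")
```

Even this toy exposes the evaluation targets named above: a wrong `name` is a tool-selection error, a model that never emits "Final Answer:" is an infinite loop, and `max_steps` is the guard rail against it.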
Project
Multi-tool research agent with web search, code execution, and external memory — LangGraph-based with automated evaluation harness.
FastAPI + backend engineering for AI systems
Build production AI APIs: FastAPI async endpoints with streaming responses via Server-Sent Events (SSE), Pydantic v2 request/response validation, middleware (auth, rate limiting, logging), error handling and retry logic, background tasks. Containerise with Docker. Your cloud background is an enormous advantage here — you already understand deployment. The focus is on AI-specific patterns: streaming token generation to the client, async LLM calls without blocking, cost tracking middleware.
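The streaming pattern reduces to an async generator that frames each token as an SSE event. A sketch with plain asyncio so it runs without a server — the token source here is a stand-in for an async LLM client; in FastAPI you would return `StreamingResponse(sse_events(...), media_type="text/event-stream")` from an async endpoint:

```python
import asyncio

async def fake_token_stream():
    """Stand-in for an async LLM client yielding tokens (hypothetical)."""
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)   # yield control, as a real network call would
        yield token

async def sse_events(token_stream):
    """Frame each token as a Server-Sent Event: 'data: <payload>\\n\\n'."""
    async for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"     # sentinel so the client knows to close

async def collect():
    # In production the ASGI server consumes this generator; here we just
    # drain it to show the wire format the browser's EventSource would see.
    return [event async for event in sse_events(fake_token_stream())]

events = asyncio.run(collect())
```

Because the generator is async, the event loop stays free between tokens — that is the "async LLM calls without blocking" point: one worker can interleave many in-flight streams instead of pinning a thread per request.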
Project
Deployed streaming RAG API on your cloud platform — with auth, rate limiting, cost tracking, and load testing results.