Lead GenAI Engineer — 15-Month Roadmap
Complete Engineering Roadmap · 2026–2027

From Software Engineer
to Lead GenAI Engineer

A first-principles, project-driven curriculum for engineers who want to build the most consequential AI systems of this decade — not just use them.

15
Months
5
Phases
65
Weeks
12+
Projects
Phase 1
ML & DL Foundations
Months 1–3
Phase 2
GenAI Core
Months 4–7
Phase 3
Advanced Systems
Months 8–11
Phase 4
Production MLOps
Months 11–13
Phase 5
Lead Mastery
Months 13–15
01
Phase One

ML & Deep Learning
Foundations

You cannot lead what you don't understand from the inside. This phase builds the mathematical and architectural intuition that every advanced GenAI concept rests on. Skip it and you'll hit a ceiling — every time.

Months 1–3 · Weeks 1–12 · ~3h daily commitment
Week 1–2
prerequisite
Mathematics that actually matters for ML
Linear algebra: matrix operations, eigenvalues, SVD — understand these as transformations of space, not just arithmetic. Probability: Bayes' theorem, distributions, MLE — this is the language all ML speaks. Calculus: gradients, chain rule, partial derivatives — gradient descent is just "follow the slope downhill," and that metaphor only clicks once you feel what a gradient is geometrically.
Don't memorise formulas. Build intuition. A gradient is the direction of steepest increase in high-dimensional space; gradient descent just steps the opposite way, over and over.
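To make that geometric picture concrete, here is a minimal NumPy sketch of gradient descent on a simple quadratic — the matrix, vector, and learning rate are arbitrary illustrations, and the closed-form solution is included only to check the answer:

```python
import numpy as np

# Minimal gradient descent on f(x) = x^T A x / 2 - b^T x.
# The gradient A @ x - b is a direction in space; each step walks
# a small distance against it ("downhill").
A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive-definite -> single minimum
b = np.array([1.0, -2.0])

def grad(x):
    return A @ x - b

x = np.zeros(2)
lr = 0.1
for step in range(200):
    x = x - lr * grad(x)

print("argmin found by descent:", x)
print("closed-form solution:   ", np.linalg.solve(A, b))
```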
Week 3–4
project
Classical ML from scratch — no sklearn
Linear regression, logistic regression, decision trees, SVMs, ensemble methods (bagging, boosting). Implement at least two in pure NumPy. You must understand loss functions, regularisation (L1/L2), and overfitting before touching neural networks. The bias-variance tradeoff is not an abstraction — it's a design decision you'll make constantly.
Project
Housing price predictor in NumPy — linear regression, gradient descent, no libraries. Plot the loss curve.
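A minimal sketch of the shape this project takes — synthetic data stands in for the housing set, batch gradient descent on mean squared error, loss recorded each epoch (plot the `losses` list however you like):

```python
import numpy as np

# Linear regression fit by batch gradient descent, no sklearn.
rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))                        # features (e.g. area, rooms, age)
true_w, true_b = np.array([3.0, -2.0, 0.5]), 10.0
y = X @ true_w + true_b + rng.normal(scale=0.5, size=n)

w, b = np.zeros(d), 0.0
lr, losses = 0.05, []
for epoch in range(300):
    pred = X @ w + b
    err = pred - y
    losses.append((err ** 2).mean())               # MSE loss for the loss curve
    grad_w = 2 * X.T @ err / n                     # d(loss)/dw
    grad_b = 2 * err.mean()                        # d(loss)/db
    w -= lr * grad_w
    b -= lr * grad_b

print("learned w:", w.round(2), "b:", round(b, 2))
print("loss every 50 epochs:", [round(l, 3) for l in losses[::50]])
```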
Week 5–6
project · core
Neural networks & backpropagation
Build a multilayer perceptron in pure NumPy: forward pass, compute cross-entropy loss, backpropagate gradients using the chain rule, update weights with SGD. Then rebuild it in PyTorch and verify your NumPy gradients match PyTorch's autograd. Understanding what autograd is actually computing is what separates engineers from ML engineers.
Project
MNIST digit classifier — built twice: NumPy first (verify each gradient by hand), then PyTorch. Results must match.
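A hedged sketch of the gradient-matching check on a tiny two-layer network — small enough to read in one sitting, but the same chain-rule bookkeeping your full MNIST MLP needs:

```python
import numpy as np
import torch

# Compute gradients of a tiny 2-layer MLP by hand in NumPy,
# then confirm torch.autograd agrees on the same weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3)).astype(np.float32)      # batch of 4, 3 features
y = np.array([0, 1, 1, 0])                          # class labels
W1 = rng.normal(scale=0.1, size=(3, 5)).astype(np.float32)
W2 = rng.normal(scale=0.1, size=(5, 2)).astype(np.float32)

# ---- NumPy forward + backward (chain rule written out) ----
h_pre = x @ W1
h = np.maximum(h_pre, 0)                            # ReLU
logits = h @ W2
logits -= logits.max(axis=1, keepdims=True)         # stable softmax
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(4), y]).mean()       # cross-entropy

dlogits = probs.copy()
dlogits[np.arange(4), y] -= 1                       # softmax + CE gradient
dlogits /= 4
dW2 = h.T @ dlogits
dh = dlogits @ W2.T
dh_pre = dh * (h_pre > 0)                           # ReLU gradient
dW1 = x.T @ dh_pre

# ---- PyTorch autograd on the same weights ----
tW1 = torch.tensor(W1, requires_grad=True)
tW2 = torch.tensor(W2, requires_grad=True)
tx, ty = torch.tensor(x), torch.tensor(y)
th = torch.relu(tx @ tW1)
tloss = torch.nn.functional.cross_entropy(th @ tW2, ty)
tloss.backward()

print("loss match:", np.allclose(loss, tloss.item(), atol=1e-6))
print("dW1 match:", np.allclose(dW1, tW1.grad.numpy(), atol=1e-5))
print("dW2 match:", np.allclose(dW2, tW2.grad.numpy(), atol=1e-5))
```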
Week 7–8
project
CNNs, RNNs, and the vanishing gradient problem
Convolutional layers: why local connectivity and weight sharing work for images, receptive fields, pooling, architecture intuition (LeNet, VGG, ResNet residual connections). Then RNNs: sequential computation, hidden state, BPTT. LSTMs: gates, cell state, why they solved vanishing gradients. Understand this history — it leads directly and inevitably to the transformer.
Projects
CIFAR-10 CNN classifier + IMDb sentiment analysis with LSTM. Compare LSTM vs vanilla RNN on long sequences.
Week 9–10
foundation · project
Transformer architecture from scratch
Implement every component by hand: byte-pair encoding tokenisation, learned token embeddings, sinusoidal and learned positional encodings, scaled dot-product attention (QKV), multi-head self-attention, layer normalisation, position-wise FFN, residual connections, causal masking for decoders. Read the original "Attention Is All You Need" paper — not just summaries. Understand why this architecture parallelises where RNNs couldn't.
Project
Character-level language model trained on the Shakespeare corpus — transformer-based, trained from scratch. Reference: Karpathy's nanoGPT walkthrough.
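The single most important component to get right is the attention itself. A minimal PyTorch sketch of scaled dot-product attention with a causal mask (one head shown; multi-head attention runs this in parallel over split channels):

```python
import math
import torch

def causal_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_head)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (B, T, T)
    T = scores.size(-1)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # no peeking at future tokens
    weights = torch.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ v                                     # (B, T, d_head)

x = torch.randn(2, 5, 16)      # pretend these are already the projected Q = K = V
out = causal_attention(x, x, x)
print(out.shape)               # torch.Size([2, 5, 16])
```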
Week 11–12
project
PyTorch mastery & production training pipelines
DataLoaders and custom Dataset classes, learning rate schedulers (cosine annealing, warmup), gradient clipping, mixed-precision training (FP16 with torch.cuda.amp), gradient accumulation for large effective batch sizes, model checkpointing and resuming, Weights & Biases experiment tracking. Train something non-trivial on a GPU. Profile GPU memory usage and know how to reduce it.
Project
Full training pipeline with W&B experiment dashboard, LR scheduling, mixed precision, and checkpoint resuming.
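A hedged sketch of how mixed precision, gradient accumulation, and gradient clipping fit together in one loop — `model`, `loader`, and `loss_fn` are placeholders for whatever you are actually training:

```python
import torch

def train_one_epoch(model, loader, loss_fn, optimizer,
                    accum_steps=4, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()
    model.train()
    optimizer.zero_grad(set_to_none=True)

    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.cuda.amp.autocast():               # FP16 forward pass
            loss = loss_fn(model(x), y) / accum_steps  # scale loss for accumulation
        scaler.scale(loss).backward()                  # FP16-safe backward

        if (step + 1) % accum_steps == 0:              # effective batch = accum_steps x batch
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
```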
Phase 1 Exit Condition
Build, explain, and train a transformer from scratch
You must be able to implement a transformer without reference code and explain what every component does mechanically — not just what it's called. If you can't answer "what is LayerNorm doing and why does it go before the attention in pre-norm transformers?" you are not ready to advance. Repeat weeks 9–10 before moving on. This is not optional.
Skills acquired in Phase 1
NumPy from scratch
PyTorch autograd
Backpropagation
CNN architectures
LSTM / GRU
Transformer internals
Multi-head attention
Training pipelines
Mixed precision
W&B tracking
Linear algebra
Probability & stats
02
Phase Two

GenAI Core —
LLMs, RAG & Agents

The heart of the job description. You will build real GenAI systems — not toy examples — and measure them with real metrics. By the end, you'll have a deployed production API and a multi-tool agent running on cloud infrastructure.

Months 4–7 · Weeks 13–28 · Cloud background accelerates this
Week 13–14
paper
LLM internals — how GPT-style models actually work
Decoder-only architecture (GPT family) vs encoder-decoder (T5, BART) vs encoder-only (BERT). Tokenisation deep-dive: BPE algorithm, WordPiece, SentencePiece — implement BPE by hand. Context window mechanics. KV cache: what it stores, why it trades memory for latency, how PagedAttention improves on it. Autoregressive generation: temperature, top-p, top-k, repetition penalty — implement each sampling strategy. Read the GPT-2 and GPT-3 papers.
Project
Run LLM inference with a Hugging Face transformers model — implement your own sampling loop from scratch with all decoding strategies.
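A minimal sketch of the three sampling strategies applied to one step's logits — in the real project this sits inside your autoregressive loop over the model's outputs:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    logits = logits / max(temperature, 1e-8)               # temperature scaling

    if top_k is not None:                                   # keep only the k best logits
        kth = torch.topk(logits, top_k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

    if top_p is not None:                                   # nucleus (top-p) sampling
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs_sorted = torch.softmax(sorted_logits, dim=-1)
        # drop tokens whose cumulative probability *before* them already exceeds top_p
        remove = probs_sorted.cumsum(dim=-1) - probs_sorted > top_p
        sorted_logits = sorted_logits.masked_fill(remove, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(1, 50_000)           # fake vocabulary-sized logits
print(sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95))
```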
Week 15–16
project
Prompt engineering — systematic, not intuitive
Zero-shot, few-shot, chain-of-thought (CoT), tree-of-thought, self-consistency, ReAct prompting. Output structuring: JSON mode, XML schemas, function calling for structured extraction. System prompts, role prompting, constitutional AI prompting. The key insight: prompt engineering without evaluation is guessing. Build an automated evaluation harness that scores prompt variants — LLM-as-judge, BLEU, ROUGE, custom rubrics.
Project
Prompt evaluation framework — 5 prompt variants, automated LLM-as-judge scoring, statistical significance testing.
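A hedged sketch of the harness's core loop — `call_llm` is a hypothetical wrapper around whichever provider SDK you use, and the judge prompt, rubric, and 1–5 scale are illustrative, not a standard:

```python
import json
import statistics

JUDGE_PROMPT = """You are grading an answer against a rubric.
Question: {question}
Answer: {answer}
Rubric: factual accuracy, completeness, conciseness.
Reply with JSON: {{"score": <1-5 integer>, "reason": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    # Hypothetical: plug in your provider SDK here.
    raise NotImplementedError

def judge(question: str, answer: str, n_samples: int = 3) -> float:
    # Sample the judge several times and average to reduce scoring variance.
    scores = []
    for _ in range(n_samples):
        raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
        scores.append(json.loads(raw)["score"])
    return statistics.mean(scores)

def compare_variants(question, answers_by_variant):
    # One averaged judge score per prompt variant's answer.
    return {name: judge(question, ans) for name, ans in answers_by_variant.items()}
```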
Week 17–18
project · core
Embeddings & vector search — the geometry of meaning
Dense embeddings: what they encode (semantic relationships as geometric proximity), cosine similarity mechanics. Embedding model selection: MTEB benchmark, task-specific vs general models, dimensionality trade-offs. Sparse embeddings: TF-IDF, BM25 — when keyword matching beats semantics. Vector databases: Pinecone, Weaviate, pgvector. HNSW indexing internals: the hierarchical graph structure, how it achieves O(log n) approximate nearest neighbour search, recall vs speed trade-off.
Project
Semantic search engine over 10,000+ document corpus — compare HNSW approximate search against flat exact cosine search on latency and recall.
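The flat exact-search baseline is a few lines of NumPy — random vectors stand in for real embeddings here, and this is exactly the brute-force search you benchmark HNSW against:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 384)).astype(np.float32)    # document embeddings
docs /= np.linalg.norm(docs, axis=1, keepdims=True)         # L2-normalise once

def search(query_vec, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    sims = docs @ q                        # cosine similarity == dot product on unit vectors
    top = np.argpartition(-sims, k)[:k]    # k best, unordered
    return top[np.argsort(-sims[top])]     # then sort those k by similarity

query = rng.normal(size=384).astype(np.float32)
print(search(query, k=5))
```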
Week 19–21
project · core
RAG systems — beyond the naive baseline
Start with naive RAG (chunk → embed → retrieve → generate) and measure it. Then fix its failure modes one by one. Chunking: fixed-size vs semantic chunking vs late chunking — understand what information is lost at chunk boundaries. Hybrid retrieval: dense + sparse with Reciprocal Rank Fusion — measure improvement on your dataset. Re-ranking: cross-encoder rerankers (BGE-reranker) and late-interaction models (ColBERT), and why both outperform plain bi-encoders on precision. Query expansion, HyDE (hypothetical document embeddings), contextual compression. RAGAS evaluation framework: faithfulness, answer relevancy, context precision, context recall.
Project
Production RAG system with RAGAS eval scores above 0.8 — document baseline vs hybrid vs reranked improvements with metrics.
A RAG system without an eval framework is a guess. Every architectural decision must be measured. If you can't show the improvement numerically, it didn't happen.
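Reciprocal Rank Fusion itself is tiny — the whole trick is summing 1/(k + rank) across rankings. A minimal sketch (k = 60 is the commonly used default):

```python
def rrf(rankings, k=60):
    # rankings: list of ranked lists of doc ids, best first
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d7", "d2"]        # from vector search
sparse = ["d1", "d9", "d3", "d4"]        # from BM25
print(rrf([dense, sparse]))              # d1 and d3 float to the top
```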
Week 22–24
project · core
Agentic systems — planning, memory, tool use
ReAct loop: Reason + Act, the observation feedback cycle. OpenAI function calling and Anthropic tool use — implement both. LangGraph: stateful agents as directed graphs with cycles, conditional edges, checkpointing, human-in-the-loop interrupts. Memory architecture: in-context (limited), external vector store (semantic retrieval), episodic (conversation history), procedural (tool selection memory). Multi-agent patterns: supervisor-worker, peer-to-peer (AutoGen), sequential chains. Agent evaluation: where it fails (tool selection errors, infinite loops, off-task drift) and how to diagnose each.
Project
Multi-tool research agent with web search, code execution, and external memory — LangGraph-based with automated evaluation harness.
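A framework-free sketch of the ReAct loop that LangGraph formalises — `call_llm` and the two tools are hypothetical stand-ins, and the step cap is the simplest defence against infinite loops:

```python
import json

TOOLS = {
    "web_search": lambda q: f"(top results for {q!r})",
    "python":     lambda code: str(eval(code)),   # toy executor -- sandbox this for real
}

def call_llm(messages) -> dict:
    # Hypothetical: should return {"thought": ..., "tool": ..., "input": ...}
    # while acting, or {"answer": ...} when finished.
    raise NotImplementedError

def react_agent(task: str, max_steps: int = 6):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):                             # hard cap avoids infinite loops
        decision = call_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        observation = TOOLS[decision["tool"]](decision["input"])   # Act
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})  # Observe
    return "Stopped: step budget exhausted."
```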
Week 25–28
project
FastAPI + backend engineering for AI systems
Build production AI APIs: FastAPI async endpoints with streaming responses via Server-Sent Events (SSE), Pydantic v2 request/response validation, middleware (auth, rate limiting, logging), error handling and retry logic, background tasks. Containerise with Docker. Your cloud background is an enormous advantage here — you already understand deployment. The focus is on AI-specific patterns: streaming token generation to the client, async LLM calls without blocking, cost tracking middleware.
Project
Deployed streaming RAG API on your cloud platform — with auth, rate limiting, cost tracking, and load testing results.
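A minimal sketch of the SSE streaming pattern — the token generator is a stand-in for your actual async RAG call, but the response plumbing is standard FastAPI:

```python
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

async def generate_tokens(question: str):
    # Stand-in for streaming tokens from your LLM / RAG pipeline.
    for token in ["This ", "is ", "a ", "streamed ", "answer."]:
        await asyncio.sleep(0.05)                 # simulate model latency
        yield f"data: {token}\n\n"                # SSE frame format
    yield "data: [DONE]\n\n"

@app.post("/chat")
async def chat(query: Query):
    return StreamingResponse(generate_tokens(query.question),
                             media_type="text/event-stream")
```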
Phase 2 Exit Condition
A deployed multi-tool agent with measured RAG performance
Deploy a multi-tool agentic system backed by hybrid RAG retrieval, served via a streaming FastAPI endpoint on real cloud infrastructure. Present RAGAS scores for your RAG system. Explain every architectural decision — why hybrid retrieval, why that reranker, why LangGraph instead of LangChain chains. If you can't defend the choices with numbers, go back and measure.
Skills acquired in Phase 2
LLM internals
KV cache
Prompt engineering
LLM-as-judge eval
Dense embeddings
HNSW indexing
Hybrid RAG
Re-ranking
RAGAS evaluation
LangGraph agents
Tool use & function calling
FastAPI / streaming SSE
Agent memory types
Vector databases
03
Phase Three

Advanced Systems —
Fine-tuning, Multimodal & Knowledge Graphs

This is where "beyond basic RAG" begins. Fine-tuning, multimodal architectures, knowledge graph integration, and production guardrails — the systems that define the Lead level and separate you from every candidate who only knows how to chain prompts.

Months 8–11 · Weeks 29–44 · Where most engineers plateau
Week 29–32
project · papers
LLM fine-tuning — LoRA, QLoRA, SFT, DPO
First, the decision framework: when does fine-tuning beat prompting? (Domain-specific vocabulary, consistent format requirements, proprietary knowledge too large for context, latency-critical applications). LoRA: low-rank decomposition of weight updates — mathematically, you're constraining ΔW to be a product of two small matrices. QLoRA: 4-bit NF4 quantisation + LoRA — fine-tunes a 65B-class model on a single 48 GB GPU. Supervised fine-tuning (SFT) data preparation: instruction-response pair curation, data quality over quantity. HuggingFace TRL + Unsloth for efficient training. RLHF conceptually: reward model training, PPO optimisation loop. DPO (Direct Preference Optimisation): why it replaces PPO in most practical settings — simpler, more stable, no separate reward model. Push your fine-tuned model to HuggingFace Hub.
Project
QLoRA fine-tuned domain model (legal, medical, or customer support) — benchmark vs base model on domain-specific evals. Published to HuggingFace Hub.
Fine-tuning is not always the answer. The first question is always: can prompting + RAG get me there? Fine-tune only when the answer is definitively no.
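To make the LoRA maths concrete, a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer — in practice PEFT/TRL does this for you, and the rank, alpha, and init values here are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a learned low-rank update ΔW = B @ A, scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze the pretrained layer
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init -> ΔW starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")    # ~65K instead of ~16.8M for the full weight
```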
Week 33–35
project · papers
Multimodal models — vision, audio, and documents
CLIP: contrastive pretraining on image-text pairs — how it creates a shared embedding space for both modalities, zero-shot image classification, image-text similarity. LLaVA architecture: visual encoder (CLIP) + projection layer + LLM decoder — the projection layer is what maps visual tokens into the LLM's token space. GPT-4V/Claude's vision capabilities conceptually. Whisper for ASR: encoder-decoder transformer trained on 680K hours — build audio transcription pipeline. Document intelligence: OCR (Tesseract, AWS Textract) + layout understanding + LLM reasoning over structured documents.
Project
Multimodal document intelligence pipeline — process PDFs/images with OCR + layout detection + LLM reasoning. Answer questions about charts, tables, and mixed-content documents.
Week 36–38
project · differentiator
Knowledge graphs & KG-augmented retrieval
Neo4j graph database: nodes, relationships, properties, Cypher query language. Entity extraction: spaCy NLP pipeline, GLiNER (generalised NER), LLM-based extraction. Relationship extraction: RE models, LLM-based relation triples. Entity linking: resolving extracted mentions to canonical entities. KG construction pipeline: extract → normalise → link → store. KG-augmented RAG: when graph traversal beats vector search — multi-hop reasoning ("find all papers by authors who collaborated with researchers who cited paper X"), relationship queries, structured fact retrieval. Microsoft GraphRAG: community detection, hierarchical summarisation, global query capability.
Project
KG-RAG system over a domain corpus (academic papers, legal documents, or company knowledge base) — demonstrate cases where graph traversal outperforms pure vector search.
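A hedged sketch of the kind of multi-hop query that motivates the graph route — the Paper/Author schema, relationship names, and connection details are illustrative:

```python
from neo4j import GraphDatabase

# "Find papers by authors who collaborated with researchers who cited paper X"
# expressed as graph traversal -- something vector search cannot express.
MULTI_HOP = """
MATCH (x:Paper {title: $title})<-[:CITES]-(:Paper)<-[:AUTHORED]-(citer:Author),
      (citer)-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(collab:Author),
      (collab)-[:AUTHORED]->(result:Paper)
RETURN DISTINCT result.title AS title
LIMIT 25
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for row in session.run(MULTI_HOP, title="Attention Is All You Need"):
        print(row["title"])
driver.close()
```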
Week 39–42
project
Guardrails, safety & hallucination control
Hallucination taxonomy: intrinsic (contradicts source) vs extrinsic (unverifiable fabrication). Detection methods: SelfCheckGPT (sample multiple times, check consistency), G-Eval (LLM-as-judge with fine-grained criteria), grounding scores (does the answer appear in the retrieved context?). Guardrails frameworks: NVIDIA NeMo Guardrails (dialogue flow control), Guardrails.ai (output validation with validators). Structured output forcing: JSON schema validation, Pydantic model enforcement, retry loops on validation failure. Constitutional AI: principle-based critique and revision. Input/output filtering: topic classifiers, PII detection, toxicity filtering. Basic red-teaming: prompt injection, jailbreaking patterns, indirect injection via retrieved content.
Project
Add a full guardrails + evaluation layer to your Phase 2 RAG system — input filtering, output validation, hallucination scoring dashboard.
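A hedged sketch of the structured-output-with-retry pattern — the schema is illustrative and `call_llm` is a hypothetical provider wrapper:

```python
from pydantic import BaseModel, ValidationError

class ProductAnswer(BaseModel):
    product_name: str
    claim: str
    grounded_in_context: bool      # the field your grounding check fills in

def call_llm(prompt: str) -> str:
    # Hypothetical: plug in your provider SDK here.
    raise NotImplementedError

def structured_answer(prompt: str, max_retries: int = 3) -> ProductAnswer:
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            return ProductAnswer.model_validate_json(raw)       # Pydantic v2 validation
        except ValidationError as err:
            # Feed the validation error back and retry.
            prompt += f"\nYour last output failed validation:\n{err}\nReturn only valid JSON."
    raise RuntimeError("model never produced schema-valid output")
```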
Week 43–44
project
Big data pipelines for AI — Spark, Hadoop, MongoDB
Apache Spark with PySpark: RDDs vs DataFrames, transformations (map, filter, groupBy) vs actions (collect, count, write), lazy evaluation, shuffle operations and why they're expensive, partitioning strategy. Spark for ML data preprocessing at scale: feature engineering pipelines, data cleaning, tokenisation over 100M+ records. MongoDB: document model, aggregation pipeline, Atlas Vector Search for hybrid document + vector queries. Building end-to-end data pipelines that feed ML training and inference systems. Your cloud background means you can deploy these on EMR, Dataproc, or Azure HDInsight natively.
Project
PySpark preprocessing pipeline for an ML dataset (1M+ records) + MongoDB Atlas vector store integration for the resulting embeddings.
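A minimal PySpark sketch of the preprocessing stage — paths and column names are illustrative; the point is cheap filters before the expensive shuffle and explicit control of partitioning:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("genai-preprocessing").getOrCreate()

docs = spark.read.parquet("s3://my-bucket/raw_docs/")               # hypothetical path

clean = (
    docs
    .filter(F.col("text").isNotNull() & (F.length("text") > 200))   # drop junk before shuffling
    .withColumn("text", F.lower(F.col("text")))
    .withColumn("n_tokens", F.size(F.split(F.col("text"), r"\s+"))) # rough whitespace token count
    .dropDuplicates(["doc_id"])                                     # shuffle -- the expensive step
    .repartition(200, "source")                                     # control partitioning before write
)

clean.write.mode("overwrite").parquet("s3://my-bucket/clean_docs/")
spark.stop()
```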
Phase 3 Exit Condition
Fine-tuned model + multimodal + KG pipeline + guardrails — integrated and measured
You must be able to demonstrate a system that integrates all four Phase 3 components: a domain fine-tuned model feeding into a multimodal pipeline with KG-augmented retrieval and production guardrails. Show benchmark numbers comparing your integrated system to the Phase 2 baseline. Document the architecture decisions. Publish the results as a technical blog post or GitHub README with benchmarks.
Skills acquired in Phase 3
LoRA / QLoRA
RLHF / DPO
SFT data pipelines
HuggingFace TRL
CLIP / LLaVA
Multimodal pipelines
Whisper ASR
Neo4j / Cypher
Entity extraction
GraphRAG
Guardrails.ai
Hallucination detection
PySpark
MongoDB Atlas
Red-teaming basics
04
Phase Four

Production ML &
MLOps Engineering

This is where your existing cloud background becomes a genuine competitive advantage. Most ML engineers hit a wall in production. You already live here — apply it to AI systems specifically. Cost, latency, monitoring, and reliability at scale.

Months 11–13 · Weeks 45–56 · Accelerated by cloud background
Week 45–48
project · core
Model serving at scale — inference optimisation
Quantisation methods: GGUF (CPU-optimised, llama.cpp), GPTQ (GPU, post-training), AWQ (activation-aware, better quality at 4-bit), FP8 (emerging standard). Speculative decoding: small draft model proposes tokens, large model verifies — how it achieves latency reduction without quality loss. vLLM: PagedAttention — manages the KV cache as non-contiguous paged memory (like OS virtual memory), eliminating fragmentation waste. TGI (Text Generation Inference) by HuggingFace. Continuous batching vs static batching: why continuous batching achieves 10–23x throughput improvement. Latency vs throughput optimisation trade-offs. Load test everything — measure p50, p95, p99 latencies under load.
Project
Benchmark vLLM vs naive HuggingFace serving on same model — document throughput, latency at load, GPU memory utilisation, and cost per 1K tokens.
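The vLLM side of the benchmark is short — the model name, memory setting, and sampling parameters below are illustrative:

```python
from vllm import LLM, SamplingParams

# Offline batch inference with vLLM -- the side you compare against a
# naive per-request transformers generate() loop.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", gpu_memory_utilization=0.90)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
prompts = [f"Summarise document {i} in one sentence." for i in range(64)]

# vLLM batches these continuously and pages the KV cache (PagedAttention),
# which is where the throughput win over naive serving comes from.
outputs = llm.generate(prompts, params)
for out in outputs[:3]:
    print(out.outputs[0].text[:80])
```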
Week 49–51
project
MLOps — CI/CD, monitoring, model versioning
ML-specific CI/CD: automated evaluation gates before every deployment (if RAGAS score drops by >5%, block the merge), model versioning with MLflow (experiment tracking, model registry, stage promotion: Staging → Production), DVC for dataset versioning. Production monitoring: latency p95/p99, token usage and cost per request, output quality score sampling, hallucination rate via automated detection. Data drift detection: statistical tests (KS test, Population Stability Index) on embedding distributions. Concept drift: when model performance degrades on production data even when data distribution appears stable. A/B testing model versions: traffic splitting, statistical significance. Feedback loops: human rating collection → retraining pipeline.
Project
Full MLOps pipeline with MLflow registry, automated eval gates in CI/CD, and Grafana monitoring dashboard with drift detection alerts.
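A minimal sketch of per-dimension KS-test drift detection on embedding batches — the thresholds are illustrative, not a standard:

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(reference: np.ndarray, production: np.ndarray,
                p_threshold: float = 0.01, max_drifted_frac: float = 0.1) -> bool:
    # Compare each embedding dimension's production distribution against the
    # reference window; alert if too many dimensions look drifted.
    n_dims = reference.shape[1]
    drifted = 0
    for dim in range(n_dims):
        _, p_value = ks_2samp(reference[:, dim], production[:, dim])
        if p_value < p_threshold:
            drifted += 1
    return drifted / n_dims > max_drifted_frac

rng = np.random.default_rng(0)
ref = rng.normal(size=(2_000, 64))
prod = rng.normal(loc=0.3, size=(2_000, 64))      # shifted distribution
print("drift detected:", drift_alert(ref, prod))  # True
```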
Week 52–54
project
Cloud-native AI deployment — leverage your existing expertise
Apply your cloud knowledge directly to AI workloads. AWS: SageMaker endpoints (real-time, async, batch), Bedrock for managed LLMs, ECS/EKS for custom model serving, Cost Explorer for GPU cost optimisation. Azure: Azure ML managed endpoints, Azure OpenAI Service, AKS with GPU node pools. GCP: Vertex AI Model Garden, Cloud Run for serverless inference, GKE Autopilot. GPU instance selection: A10G vs A100 vs H100 — when each makes sense economically. Autoscaling for variable AI workloads: CPU-based vs custom metrics (queue depth, token throughput). Spot/preemptible instances for training; on-demand for inference.
Project
Production AI deployment with autoscaling, multi-region failover, cost dashboard, and documented cost-per-query at different load levels.
Week 55–56
project
Enterprise AI integration patterns
Multi-tenant AI API design: tenant isolation (separate vector namespaces, API key scoping, rate limiting per tenant), audit logging (who queried what, when, with what result — required for regulated industries). Data residency: keeping data in specific regions for GDPR/data sovereignty compliance. Enterprise data integration: SharePoint via MS Graph API, Salesforce SOQL + Bulk API for CRM data, JDBC for legacy databases. PII detection and redaction before sending data to LLMs. Token budget governance: per-user, per-department cost caps with graceful degradation. SLA management: what happens when the LLM provider is down — fallback models, cached responses, circuit breakers.
Project
Enterprise-grade AI integration with multi-tenancy, audit logging, PII redaction, token cost governance, and circuit breaker fallback.
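A hedged sketch of the circuit-breaker pattern for LLM provider failures — thresholds and the fallback behaviour are illustrative:

```python
import time

class LLMCircuitBreaker:
    """After repeated primary failures, stop calling the primary model for a
    cooldown window and serve the fallback instead."""
    def __init__(self, failure_threshold=5, cooldown_seconds=60):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None           # None => circuit closed (normal operation)

    def call(self, primary_fn, fallback_fn, prompt):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_seconds:
                return fallback_fn(prompt)             # circuit open: skip the primary
            self.opened_at, self.failures = None, 0    # cooldown over: try primary again
        try:
            result = primary_fn(prompt)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()           # trip the breaker
            return fallback_fn(prompt)
```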
Phase 4 Exit Condition
A production AI system you can monitor, version, roll back, and cost-justify
You must be able to show: documented cost per query at different load levels, automated quality gates in CI/CD that block regressions, real-time monitoring dashboards, and a rollback procedure. If the system goes down at 2am, can you diagnose it from metrics alone? If a new model version degrades quality, does your pipeline catch it before production? That's the production bar.
Skills acquired in Phase 4
vLLM / PagedAttention
Quantisation (AWQ/GPTQ)
Speculative decoding
Continuous batching
MLflow registry
DVC versioning
Drift detection
A/B testing models
SageMaker / Vertex AI
GPU cost optimisation
Multi-tenant design
Audit logging
Circuit breakers
PII redaction
05
Phase Five

Lead-Level Mastery —
Systems Thinking, R&D & Mentorship

Technical skill alone does not make a Lead. This phase builds the judgment, communication, and research translation ability that distinguishes someone who builds AI systems from someone who leads teams that build AI systems.

Months 13–15 · Weeks 57–65 · The hardest phase to fake
Week 57–58
lead skill
Architecture decision-making under constraints
The Lead role requires designing AI systems, not just implementing them. Practice whiteboard architecture sessions on real constraints: "design a customer support AI for 10M users on $50K/month cloud budget." Make explicit trade-offs — not "RAG is better than fine-tuning" but "given these latency requirements and data update frequency, RAG is better because..." Practice the questions that expose shallow thinking: What breaks at 10x load? What's the rollback plan if this model degrades? How does this scale to 50 languages? How do we handle GDPR deletion requests on vector stores? A Lead who can't say "this approach fails at X scale because Y" is not ready to own an architecture.
The fastest way to prepare for lead-level interviews: practice answering "why not the other approach?" for every design decision you make. The alternative matters as much as the choice.
Week 59–60
weekly habit · papers
Research translation — reading papers and knowing what matters
One paper per week from arXiv (cs.LG, cs.CL), HuggingFace Daily Papers, or Papers with Code. The goal is not to read papers — it's to assess them: What is the core contribution? Does it beat existing baselines and by how much? What are the failure modes the authors don't discuss? Would this change how I build anything in production? Start an internal technical notes document — one paragraph per paper: contribution, key result, production relevance. This habit, started now and continued forever, is what "staying current" in the job description actually means.
Recommended starting papers
RAGAS (evaluation), Self-RAG, RAPTOR, ColBERT, Flash Attention 2, Mixtral (MoE), LLaVA-1.5, GraphRAG, Constitutional AI, DPO.
Week 61–62
lead skill
Mentorship & technical communication
The ability to explain transformer attention to a non-ML engineer, or to translate a business requirement ("we need the chatbot to stop hallucinating product specs") into a precise technical intervention (grounding via RAG + output validation against product database), is 50% of a Lead role. Practice three levels: peer-level explanation (full technical depth), junior explanation (mechanism without mathematical notation), executive explanation (business impact without mechanism). Write internal documentation for everything you've built. Give a talk — at a meetup, internally, or record for YouTube. The act of teaching reveals every gap in your own understanding.
If you can't explain a concept to someone two levels below you without losing the mechanism, you don't fully understand it yet.
Week 63–65
capstone · portfolio
Capstone — end-to-end enterprise AI system
Build the most complex system you've built. It must use everything: multimodal input processing, knowledge graph construction from raw documents, KG-augmented hybrid RAG with reranking, fine-tuned domain model for generation, agentic orchestration via LangGraph, production FastAPI serving with vLLM, MLflow versioning with automated eval gates, Grafana monitoring, and a guardrails layer. Deploy publicly. Document every architecture decision with explicit trade-offs. Present it at a meetup or publish a detailed technical blog post. This is your portfolio centrepiece. It must be live and demoable in interviews.
Capstone deliverables
Public GitHub repo + architecture blog post with benchmarks + live demo + 10-minute recorded walkthrough. Present at one meetup or conference.
Final Exit Condition — The Lead Bar
Can you design, build, deploy, monitor, and teach?
Three questions that define the Lead bar. One: can you design a production AI system on a whiteboard, with explicit trade-offs, failure modes, and cost estimates — without referencing documentation? Two: can you read a new paper published this week and tell a junior engineer within 30 minutes whether it's worth trying and why? Three: can you explain the KV cache to a product manager and write a production implementation in the same afternoon? If all three: you're ready.
Skills acquired in Phase 5
Architecture design
Trade-off reasoning
Paper reading
Research translation
Technical writing
Mentorship
Stakeholder comms
System design interviews
Full-stack AI projects
Public speaking

Essential Resources

Phase 1 — Core
Andrej Karpathy — Neural Networks: Zero to Hero
Free YouTube series. Build micrograd → makemore → nanoGPT. The single best starting resource for Phase 1 that exists.
Phase 1 — Math
Mathematics for Machine Learning (Deisenroth)
Free PDF. Linear algebra, probability, and optimisation with ML framing. Chapter 5 (vector calculus) is what backprop rests on.
Phase 1 — DL
Deep Learning (Goodfellow, Bengio, Courville)
Free online. Chapters 6–8 (feedforward, regularisation, optimisation) and Chapter 10 (sequence models) are essential.
Phase 2 — LLMs
Hugging Face NLP Course
Free. Transformers, tokenisers, fine-tuning, and the full HuggingFace ecosystem. Chapters 1–4 are required for Phase 2.
Phase 2 — Agents
LangGraph Documentation + Tutorials
Official docs are excellent. Work through every tutorial in order. The persistence and human-in-the-loop sections are most important.
Phase 2 — RAG
RAGAS Library + Documentation
Learn to measure before you optimise. The RAGAS paper (Es et al., 2023) explains the metrics. Use it as your eval standard from day one.
Phase 3 — Fine-tuning
Unsloth + HuggingFace TRL
Unsloth makes QLoRA training 2x faster with 60% less memory. TRL provides SFT, DPO, and PPO trainers. Both have excellent Colab notebooks.
Phase 3 — Papers
Papers With Code + arXiv cs.CL
Track state of the art on key tasks. arXiv cs.CL for language models, cs.LG for ML methods. Follow Andrej Karpathy and Yann LeCun for curation.
Phase 4 — Serving
vLLM Documentation + Blog
The vLLM blog posts explain PagedAttention better than any other source. The docs are production-quality. Read both before deploying.
Phase 4 — MLOps
Chip Huyen — Designing Machine Learning Systems
The definitive book on production ML. Chapters 7–9 (data distribution shifts, continual learning, monitoring) are directly relevant to Phase 4.
Phase 5 — Research
HuggingFace Daily Papers
Curated daily. Better signal-to-noise than raw arXiv. Build the habit of reading 3–5 abstracts daily and one full paper weekly.
Phase 5 — Systems
System Design Interview, Vols. 1 & 2 (Alex Xu)
Not ML-specific but essential for the architecture thinking that Lead roles require. The news feed and YouTube chapters in Volume 1 teach the constraint-first approach; Volume 2 extends it to harder large-scale systems.

The North Star

At the end of 15 months, you should be able to design a production AI system on a whiteboard with explicit trade-offs, build it in code, deploy it to cloud infrastructure, monitor it in production, and teach every component to someone who's never seen it. That's the Lead bar. Not the tools you know — the judgment you've built.

Tool users get replaced. System thinkers build the tools. Lead engineers teach the system thinkers.