LLM internals — how GPT-style models actually work
Decoder-only architecture (GPT family) vs encoder-decoder (T5, BART) vs encoder-only (BERT). Tokenisation deep-dive: BPE algorithm, WordPiece, SentencePiece — implement BPE by hand. Context window mechanics. KV cache: what it stores (the key and value tensors for every previous token, per layer and head), why it trades memory for latency, and how PagedAttention improves on it. Autoregressive generation: temperature, top-p, top-k, repetition penalty — implement each sampling strategy. Read the GPT-2 and GPT-3 papers.
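"Implement BPE by hand" comes down to one loop: count adjacent symbol pairs, merge the most frequent pair, repeat. A minimal sketch on a toy whitespace-split corpus — real tokenisers (GPT-2's byte-level BPE, SentencePiece) work on bytes, handle word boundaries with markers, and add special tokens, none of which is shown here:

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` BPE merge rules from a toy corpus."""
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges

merges = train_bpe("low low low lower lowest", 3)
```

The merge list is the learned vocabulary growth: here the model discovers "lo", then "low", then "lowe" as reusable subwords, which is exactly why frequent words end up as single tokens while rare words split into pieces.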
Project
Run LLM inference with Hugging Face transformers, but write the sampling loop from scratch — implement every decoding strategy above yourself.
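The core of that sampling loop is a single function that filters raw logits before drawing a token. A sketch with NumPy, assuming `logits` is the model's last-position output; the repetition penalty follows the CTRL-paper convention (divide positive logits, multiply negative ones), which is also what transformers implements:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                      repetition_penalty=1.0, prev_tokens=(), rng=None):
    """Sample a token id from raw logits with the classic decoding knobs."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()

    # CTRL-style repetition penalty: shrink logits of already-seen tokens.
    for t in set(prev_tokens):
        logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                     else logits[t] * repetition_penalty)

    logits = logits / max(temperature, 1e-8)  # temperature -> 0 approaches greedy

    if top_k > 0:
        # Mask everything below the k-th highest logit.
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    if top_p < 1.0:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(probs)[::-1]
        cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
        keep = np.zeros_like(probs)
        keep[order[:cutoff]] = probs[order[:cutoff]]
        probs = keep / keep.sum()  # renormalise the surviving mass

    return int(rng.choice(len(probs), p=probs))
```

Wrapped in a loop that appends each sampled id and re-runs the model, this is the whole project; transformers exposes the same ideas as `LogitsProcessor` classes, which is worth reading after writing your own.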
Prompt engineering — systematic, not intuitive
Zero-shot, few-shot, chain-of-thought (CoT), tree-of-thought, self-consistency, ReAct prompting. Output structuring: JSON mode, XML schemas, function calling for structured extraction. System prompts, role prompting, constitutional AI prompting. The key insight: prompt engineering without evaluation is guessing. Build an automated evaluation harness that scores prompt variants — LLM-as-judge, BLEU, ROUGE, custom rubrics.
Project
Prompt evaluation framework — 5 prompt variants, automated LLM-as-judge scoring, statistical significance testing.
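The significance-testing step can be sketched as a paired permutation test over per-example judge scores, assuming two score lists aligned by test case (one score per prompt variant per example):

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on the mean score difference.

    Each resample randomly flips the sign of per-example differences,
    simulating the null hypothesis that the two prompts are interchangeable.
    Returns an approximate p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        resampled = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(resampled) / len(resampled)) >= observed:
            hits += 1
    return hits / n_resamples
```

A permutation test makes no normality assumption, which matters because LLM-as-judge scores are bounded and often skewed; a small p-value means the gap between two prompt variants is unlikely to be noise on this test set.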
Embeddings & vector search — the geometry of meaning
Dense embeddings: what they encode (semantic relationships as geometric proximity), cosine similarity mechanics. Embedding model selection: MTEB benchmark, task-specific vs general models, dimensionality trade-offs. Sparse embeddings: TF-IDF, BM25 — when keyword matching beats semantics. Vector databases: Pinecone, Weaviate, pgvector. HNSW indexing internals: the hierarchical graph structure, how it achieves O(log n) approximate nearest neighbour search, recall vs speed trade-off.
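The cosine-similarity mechanics fit in a few lines: normalise the vectors and the metric collapses to a dot product, so exact search over a whole corpus is one matrix multiply. A sketch, assuming `corpus` is an (n, d) matrix of precomputed embeddings from any model:

```python
import numpy as np

def cosine_top_k(query, corpus, k=3):
    """Exact nearest-neighbour search by cosine similarity.

    query: (d,) vector; corpus: (n, d) matrix of document embeddings.
    Returns (indices, scores) of the k most similar rows.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                         # cosine reduces to a dot product
    top = np.argsort(scores)[::-1][:k]    # highest similarity first
    return top, scores[top]
```

This brute-force scan is O(n·d) per query — fine for thousands of documents, and exactly the baseline HNSW's approximate O(log n) search trades recall to beat at scale.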
Project
Semantic search engine over a 10,000+ document corpus — compare flat exact cosine search against HNSW approximate search on latency and recall.
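The recall side of that comparison needs a metric; a minimal sketch, treating the flat exact search as ground truth:

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k that the approximate index also returned.

    approx_ids: ranked ids from the ANN index (e.g. HNSW);
    exact_ids: ranked ids from flat exact search over the same corpus.
    """
    exact = set(exact_ids[:k])
    return len(exact & set(approx_ids[:k])) / len(exact)
```

Averaged over a query set and plotted against query latency at different HNSW `ef` settings, this gives the recall-vs-speed curve the project should report.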
RAG systems — beyond the naive baseline
Start with naive RAG (chunk → embed → retrieve → generate) and measure it. Then fix its failure modes one by one. Chunking: fixed-size vs semantic chunking vs late chunking — understand what information is lost at chunk boundaries. Hybrid retrieval: dense + sparse with Reciprocal Rank Fusion — measure improvement on your dataset. Re-ranking: cross-encoders (BGE-reranker) and late-interaction models (ColBERT), and why both outperform plain bi-encoders for precision. Query expansion, HyDE (hypothetical document embeddings), contextual compression. RAGAS evaluation framework: faithfulness, answer relevancy, context precision, context recall.
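Reciprocal Rank Fusion is small enough to write directly: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k=60 the constant from the original RRF paper. A sketch, assuming each ranking is a list of document ids ordered best-first:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists (e.g. one dense, one BM25) into a single ranking.

    Scores depend only on rank positions, so the dense and sparse scores
    never need to be calibrated against each other.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-only fusion is the point: dense cosine scores and BM25 scores live on incomparable scales, and RRF sidesteps the normalisation problem entirely, which is why it is the default hybrid-retrieval baseline to measure against.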
Project
Production RAG system with RAGAS eval scores above 0.8 — document baseline vs hybrid vs reranked improvements with metrics.
A RAG system without an eval framework is a guess. Every architectural decision must be measured. If you can't show the improvement numerically, it didn't happen.
Agentic systems — planning, memory, tool use
ReAct loop: Reason + Act, the observation feedback cycle. OpenAI function calling and Anthropic tool use — implement both. LangGraph: stateful agents as directed graphs with cycles, conditional edges, checkpointing, human-in-the-loop interrupts. Memory architecture: in-context (limited), external vector store (semantic retrieval), episodic (conversation history), procedural (tool selection memory). Multi-agent patterns: supervisor-worker, peer-to-peer (AutoGen), sequential chains. Agent evaluation: where it fails (tool selection errors, infinite loops, off-task drift) and how to diagnose each.
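The Reason → Act → Observe cycle can be sketched as a plain loop before reaching for a framework. The stub model, tool name, and string protocol below are hypothetical, chosen so the loop runs offline; a real agent swaps `fake_llm` for an API call and parses the tool choice from the model's structured output:

```python
def run_react(llm, tools, question, max_steps=5):
    """Minimal ReAct loop: the model emits either a tool call or a final
    answer; tool observations are appended and fed back in next turn."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # model sees the whole history
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)
            transcript += f"Observation: {observation}\n"
    return None  # hit the step cap — the infinite-loop failure mode, worth logging

# Stub model and tool to exercise the loop (illustrative only).
def fake_llm(transcript):
    if "Observation:" not in transcript:
        return "Action: calculator 6*7"
    return "Final Answer: 42"

tools = {"calculator": lambda expr: eval(expr, {"__builtins__": {}})}
answer = run_react(fake_llm, tools, "What is 6*7?")
```

Even this toy exposes the evaluation targets named above: a wrong `name` is a tool-selection error, a model that never emits "Final Answer:" is an infinite loop, and `max_steps` is the guard rail against it.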
Project
Multi-tool research agent with web search, code execution, and external memory — LangGraph-based with automated evaluation harness.
FastAPI + backend engineering for AI systems
Build production AI APIs: FastAPI async endpoints with streaming responses via Server-Sent Events (SSE), Pydantic v2 request/response validation, middleware (auth, rate limiting, logging), error handling and retry logic, background tasks. Containerise with Docker. Your cloud background is an enormous advantage here — you already understand deployment. The focus is on AI-specific patterns: streaming token generation to the client, async LLM calls without blocking, cost tracking middleware.
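The streaming pattern reduces to an async generator that frames each token as an SSE event. A sketch with plain asyncio so it runs without a server — the token source here is a stand-in for an async LLM client; in FastAPI you would return `StreamingResponse(sse_events(...), media_type="text/event-stream")` from an async endpoint:

```python
import asyncio

async def fake_token_stream():
    """Stand-in for an async LLM client yielding tokens (hypothetical)."""
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)   # yield control, as a real network call would
        yield token

async def sse_events(token_stream):
    """Frame each token as a Server-Sent Event: 'data: <payload>\\n\\n'."""
    async for token in token_stream:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"     # sentinel so the client knows to close

async def collect():
    # In production the ASGI server consumes this generator; here we just
    # drain it to show the wire format the browser's EventSource would see.
    return [event async for event in sse_events(fake_token_stream())]

events = asyncio.run(collect())
```

Because the generator is async, the event loop stays free between tokens — that is the "async LLM calls without blocking" point: one worker can interleave many in-flight streams instead of pinning a thread per request.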
Project
Deployed streaming RAG API on your cloud platform — with auth, rate limiting, cost tracking, and load testing results.