category: tutorial

How To Break Into AI Hardware

// A field guide for software brains who want to go closer to the metal.

Apr 21, 2026

__global__ void matmul(...) { /* vibes */ }

Every app you've ever shipped eventually hits a wall of silicon. This is a guide for the day you decide to stop shouting at the wall and start building it.

I'm a GenAI guy. I ship apps. I write prompts. I duct-tape APIs together until they behave. But every time a model gets bigger, a latency number gets worse, or an inference bill gets scary, I end up in the same place: somebody, somewhere, is writing a CUDA kernel that decides whether my product works or not. In 2026, with hyperscaler capex pushing past half a trillion dollars and custom chips multiplying like rabbits, that somebody is increasingly in demand. This post is my attempt to write down, in one place, how a software brain actually pivots into the AI hardware stack.

Why bother in 2026?

The "NVIDIA-only datacenter" is over. Microsoft Maia 200, Meta MTIA 300, AWS Trainium 3, and Google's 7th-gen TPU Ironwood are all in production. Custom ASIC shipments are projected to grow roughly 45% this year — about 3× the growth rate of GPUs. Inference is now two-thirds of all AI compute, and that's where every accelerator startup is placing its bet.

Translation: every one of those chips needs kernel engineers, compiler engineers, perf modelers, and systems people. You don't need a PhD in physics. You need taste, a GPU, and the patience to read benchmarks until your eyes bleed.

The map: what "AI hardware" actually means

People throw the phrase around like it's one field. It isn't. Here's the rough territory:

  • GPUs — NVIDIA (Hopper, Blackwell), AMD (MI300/MI350). Still the gravity well.
  • Hyperscaler ASICs — Google TPU, Meta MTIA, Microsoft Maia, AWS Trainium/Inferentia, Tesla Dojo.
  • Accelerator startups — Cerebras (wafer-scale), Groq (LPU, 80 TB/s on-die), SambaNova (reconfigurable dataflow), Tenstorrent (Jim Keller, RISC-V, open source), Etched (transformer-only ASIC, still not shipping — cautionary tale).
  • Interconnects — NVLink 5, the open UALink 2.0 (scale-up pods of up to 1,024 accelerators), CXL 4.0 memory pooling, Huawei's UB-Mesh.
  • The exotic corner — photonic (Lightmatter, Neurophos), in-memory compute, neuromorphic. Great for PhDs, thin for early-career hires.

If you're optimising for "where will I get paid in 2028," the boring answer is: GPUs, hyperscaler ASICs, and the software stacks around them. The exotic stuff is fun. The interconnect and kernel layer is where the jobs are.

The skills stack (in the order I'd actually learn them)

1. Computer architecture foundations

Before you write a single Verilog line, you need pipelining, hazards, caches, and memory hierarchies wired into your brain. Start with Harris & Harris (Digital Design and Computer Architecture). Graduate to Hennessy & Patterson (Computer Architecture: A Quantitative Approach). For the lectures, Onur Mutlu's course on YouTube is the deep bench — nothing else comes close for free.
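One way to make the memory-hierarchy material stick is to run the arithmetic yourself. A minimal sketch of the classic AMAT (average memory access time) formula from Hennessy & Patterson, with made-up latencies and miss rates (illustrative, not from any specific chip):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """AMAT = hit time + miss rate * miss penalty (all in cycles)."""
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: the L2's AMAT becomes the L1's miss penalty.
l2 = amat(hit_time=12, miss_rate=0.20, miss_penalty=200)  # L2 -> DRAM
l1 = amat(hit_time=4,  miss_rate=0.05, miss_penalty=l2)   # L1 -> L2

print(f"L2 AMAT: {l2:.1f} cycles")  # 12 + 0.20*200 = 52.0
print(f"L1 AMAT: {l1:.1f} cycles")  # 4  + 0.05*52  = 6.6
```

The takeaway: a 5% L1 miss rate is enough to make the hierarchy below it dominate your effective latency. Every cache-blocking trick in the rest of this post exists to drive that miss rate down.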

2. HDL and a toy CPU

Pick Verilog / SystemVerilog because that's what the industry ships. Then, if you want productivity, learn Chisel (Berkeley, underlies Rocket / BOOM / Chipyard) or SpinalHDL. Simulate everything in Verilator. Start with nand2tetris, then build a single-cycle RISC-V core via riscv-sodor, then pipeline it.

3. GPU architecture internals

SMs, warps, tensor cores, the HBM → L2 → shared memory → registers hierarchy. Hopper's TMA and async copies. Blackwell's FP4. This is not optional. Canonical book: Kirk & Hwu, Programming Massively Parallel Processors. Canonical blog post: Horace He's "Making Deep Learning Go Brrrr From First Principles". If you only read one link from this whole post, read that one.

4. Kernels: CUDA and Triton

Write a naive CUDA SGEMM. Then optimise it. The canonical public worklog is Simon Boehm's matmul series — he goes from a toy kernel to 95% of cuBLAS throughput in about a dozen carefully explained iterations. The repo is your lab. Once you've done that, learn Triton — the iteration loop is roughly 10× faster than raw CUDA, and every model lab uses it now.
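The whole optimisation ladder in Boehm's series hangs off one idea: block the loops so tiles of A and B stay resident in fast memory. A pure-Python sketch of that loop structure (illustrative only; the real speedup needs shared memory and registers on an actual GPU):

```python
def matmul_naive(A, B, n):
    # Textbook triple loop: streams B column-by-column, thrashes cache.
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, T=4):
    # Same arithmetic, reordered: work on T x T tiles so a tile of A and
    # a tile of B are reused many times before being evicted.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):  # accumulate one K-tile at a time
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + T, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

n = 8
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
B = [[float((i + j) % 5) for j in range(n)] for i in range(n)]
assert matmul_tiled(A, B, n) == matmul_naive(A, B, n)
print("tiled == naive: OK")
```

In CUDA the tile lives in shared memory and the accumulator lives in registers, but the loop nest is recognisably the same shape.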

5. Model-level hardware awareness

Why does FlashAttention matter? Because it tiles from HBM to SRAM and recomputes instead of materialising an attention matrix you can't afford. Why does GQA exist? KV cache memory. Why FP8/FP4? Memory bandwidth. You cannot write a useful kernel without knowing why the model wants it. Read Tri Dao's FlashAttention-3 post. Read anything Tim Dettmers has written about k-bit quantization.
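The GQA claim is easy to verify with back-of-envelope arithmetic: KV cache size scales linearly with the number of KV heads. A sketch with illustrative config numbers (assumed for the exercise, not any real model's):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for K and V; 2 bytes/elem assumes fp16 cache.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(layers=32, head_dim=128, seq_len=8192, batch=16)
mha = kv_cache_bytes(kv_heads=32, **cfg)  # full multi-head attention
gqa = kv_cache_bytes(kv_heads=8,  **cfg)  # grouped-query: 8 KV heads

print(f"MHA KV cache: {mha / 2**30:.1f} GiB")  # 64.0 GiB
print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")  # 16.0 GiB, 4x smaller
```

64 GiB of cache for one batch does not fit next to the weights on an 80 GB card; 16 GiB might. That single division is why GQA (and MLA after it) exists.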

6. Compilers (the fastest-growing slice)

MLIR, XLA, TVM, Triton's IR. Every new accelerator ships a compiler and every compiler needs humans. Write a loop-fusion pass. Implement tiling. Read the Triton source. If you already like writing software, this is where you get paid like a senior SWE for doing hardware-adjacent work.
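If "write a loop-fusion pass" sounds abstract, here is the idea at toy scale. The "IR" below is just a list of (op, constant) pairs, nothing like real MLIR; the point is that fusion turns N passes over memory into one:

```python
OPS = {"add":  lambda x, c: x + c,
       "mul":  lambda x, c: x * c,
       "relu": lambda x, _: max(x, 0.0)}

def run_unfused(ir, data):
    # One full pass over the array per op: each op reads and writes
    # the whole tensor. Memory traffic scales with len(ir).
    for op, c in ir:
        data = [OPS[op](x, c) for x in data]
    return data

def fuse(ir):
    # The "fusion pass": compose every elementwise op into a single
    # function, so one loop applies them all while the value is hot.
    def fused(x):
        for op, c in ir:
            x = OPS[op](x, c)
        return x
    return fused

ir = [("mul", 2.0), ("add", -3.0), ("relu", None)]
data = [1.0, 2.0, 3.0]
fused = fuse(ir)
assert [fused(x) for x in data] == run_unfused(ir, data)
print([fused(x) for x in data])  # [0.0, 1.0, 3.0]
```

Real compilers do this over loop nests and tensors rather than Python lambdas, and have to reason about when fusion is legal and profitable, but the memory-traffic argument is exactly this one.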

7. Real silicon, for cheap

Tiny Tapeout puts real packaged chips in your hands for around $300 per slot. The 2026 shuttles (TTGF26a, TTSKY26a) are ongoing. You design in Verilog, they tape out with OpenROAD on the SkyWater PDK, and four months later you get a physical die you can point at in interviews. An unreasonably strong signal for the money.

// The 3-question gut check:
> Can you explain why a matmul is compute-bound at large M,N,K but memory-bound at small ones?
> Can you draw the GPU memory hierarchy from HBM down to a register?
> Can you describe what FlashAttention actually does in one sentence?

If yes to all three, you're already more hireable than you think.
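Question 1 falls out of arithmetic intensity versus the hardware's ridge point. A sketch using H100-ish round numbers (assumed ballpark figures, not spec-sheet exact):

```python
PEAK_FLOPS = 1000e12  # ~1 PFLOP/s fp16 tensor core (round number)
PEAK_BW    = 3e12     # ~3 TB/s HBM (round number)

def matmul_intensity(M, N, K, bytes_per_elem=2):
    flops = 2 * M * N * K                                   # one FMA = 2 flops
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)  # read A, B; write C
    return flops / bytes_moved

# Below the ridge point, the kernel can't feed the ALUs: memory-bound.
ridge = PEAK_FLOPS / PEAK_BW  # ~333 flops/byte

for size in (64, 512, 4096):
    ai = matmul_intensity(size, size, size)
    bound = "compute" if ai > ridge else "memory"
    print(f"M=N=K={size:5d}: {ai:7.1f} flops/byte -> {bound}-bound")
```

For square matrices the intensity is roughly n/3 flops per byte, so it grows with problem size while the hardware ratio stays fixed. That is the entire answer to question 1, and most of Horace He's post.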

Courses that are actually worth your time

The project ladder (this is what actually gets you hired)

Courses are context. Projects are proof. Ship these in order, ideally in public with writeups:

  1. FizzBuzz in Verilog, simulated in Verilator, with a testbench. Your first tangible hardware commit.
  2. Single-cycle RISC-V core via riscv-sodor.
  3. 5-stage pipelined RISC-V with hazard detection and forwarding.
  4. Systolic array matmul in SystemVerilog, verified against a NumPy reference.
  5. CUDA SGEMM from naive to 80%+ of cuBLAS at a chosen tile size. Write it up publicly. This single repo has hired more people than any resume line I know of.
  6. Triton kernel for a non-trivial attention variant — sliding window, GQA, ALiBi, MLA. Beat PyTorch SDPA on a specific shape.
  7. Submit a design to Tiny Tapeout. Real silicon in your portfolio.
  8. Rank on the GPU MODE KernelBot leaderboard. NVIDIA has publicly written about topping it. That's the neighbourhood you want to be seen in.
  9. Upstream a kernel to tinygrad, vLLM, SGLang, or flash-attention. Open-source contribution is the highest-trust signal in this field.
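For item 4 you need a golden model before you write any SystemVerilog. A cycle-level, pure-Python sketch of an output-stationary systolic array (stdlib only so it runs anywhere; in practice you'd verify against NumPy):

```python
def systolic_matmul(A, B, n):
    # n x n grid of PEs. A values flow right, B values flow down, and
    # each PE accumulates a*b in place (output-stationary). Row i of A
    # and column j of B are skewed by i and j cycles so that matching
    # k-indices meet at PE (i, j).
    acc   = [[0.0] * n for _ in range(n)]
    a_reg = [[0.0] * n for _ in range(n)]  # value each PE forwards right
    b_reg = [[0.0] * n for _ in range(n)]  # value each PE forwards down
    for t in range(3 * n - 2):             # enough cycles to drain the array
        new_a = [[0.0] * n for _ in range(n)]
        new_b = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Read the left/top neighbour's register, or the skewed
                # input stream at the array edge (0.0 pads the skew).
                a = a_reg[i][j-1] if j > 0 else (A[i][t-i] if 0 <= t - i < n else 0.0)
                b = b_reg[i-1][j] if i > 0 else (B[t-j][j] if 0 <= t - j < n else 0.0)
                acc[i][j] += a * b
                new_a[i][j], new_b[i][j] = a, b
        a_reg, b_reg = new_a, new_b        # register update: end of cycle
    return acc

def matmul_ref(A, B, n):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 4
A = [[float(i + 2 * j) for j in range(n)] for i in range(n)]
B = [[float((i * j) % 3 + 1) for j in range(n)] for i in range(n)]
assert systolic_matmul(A, B, n) == matmul_ref(A, B, n)
print("systolic == reference: OK")
```

Once this behavioural model matches the reference cycle-for-cycle in your head, translating each PE into a SystemVerilog always_ff block is mostly mechanical, and your testbench already exists.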

Communities you should be in yesterday

What the jobs actually look like

Titles, roughly ordered from "most software-flavoured" to "hardcore silicon":

  • ML Systems / Inference Engineer — serving, batching, scheduling. Pure SWE leverage.
  • GPU Kernel / Performance Engineer — CUDA, Triton, HIP. The sweet spot for software brains pivoting in.
  • ML Compiler Engineer — MLIR, XLA, TVM. Software, but proximity to the hardware roadmap.
  • DL Architect / Perf Modeling — roofline analysis, simulation, what-to-build-next.
  • RTL / Verification Engineer — SystemVerilog, UVM, formal. Hardest pivot from a SWE start.
  • Physical Design / DFT — realistically needs targeted schooling.

Where they live: NVIDIA, AMD, Intel (Gaudi), Apple Silicon, Google TPU, Meta MTIA, Microsoft Maia, AWS Trainium, Tesla Dojo. Then the startups: Cerebras, Groq, SambaNova, Tenstorrent, Etched, MatX, d-Matrix, Lightmatter, Rain, Fractile. Then the frontier labs with in-house kernel teams: Anthropic, OpenAI, xAI, DeepSeek. Then inference platforms: Baseten, Together, Fireworks, Modal.

Comp, roughly, in 2026 US markets (levels.fyi): NVIDIA SWE median ~$329k, IC5 ~$238k base + $167k stock, IC7 breaks $1M. ML Engineer bands track $205k–$331k. AMD lags meaningfully. Hyperscalers (Google, Meta, Microsoft) track NVIDIA closely for ML infra. Startups: competitive cash, wild equity variance.

Honest takes

Don't go full RTL if you're starting from SWE

Going from webapps to physical design is a 2–4 year investment, often with a master's attached, and it narrows your set of potential employers. The kernel and compiler layer is the actual sweet spot: you keep your SWE leverage, you get paid like a senior SWE++, and MLIR/Triton skills are portable across NVIDIA, AMD, TPU, MTIA, Maia, and Trainium. Fast kernels are provable with a benchmark. RTL craft only shows up in tapeouts.

The geopolitics matter more than you'd think

NVIDIA's moat is CUDA, not silicon — 18 years and millions of developers. Huawei's CANN Next is a deliberate near-drop-in replacement, Ascend 950PR is shipping, and the target is 750k+ Ascend chips in 2026. A full Chinese AI stack — Ascend + Cambricon + CANN + DeepSeek — already exists with zero US components. For US-based engineers: export controls are a tailwind for domestic hardware roles. For globally mobile engineers: CANN/Ascend skills are a surprisingly smart hedge.

Is this a good 5-year bet?

Yes — but place the bet precisely. Hyperscaler capex is projected to pass $500B in 2026, and every gigawatt of new IT capacity needs kernel, compiler, interconnect, and perf-modelling people. Hardcore RTL at a pre-tapeout startup is high-variance (Etched is 20+ months post-announcement and still hasn't shipped a customer chip — normal for silicon, but a reminder of the risk). The sustained, low-variance upside is in the software-adjacent hardware layer.

Pro Tip

Don't try to learn everything. Pick one vertical slice — e.g. CUDA matmul → Triton attention → MLIR pass → KernelBot submission — and go deep. Breadth follows depth. A single great CUDA writeup beats ten half-finished Verilog repos.

The three links that matter most

If you stop reading my post right now and read only these three, you'll still be ahead of 95% of people saying they want to get into AI hardware:

  1. Making Deep Learning Go Brrrr From First Principles by Horace He. The mental model: compute-bound vs memory-bound vs overhead-bound. Every hardware conversation reduces to this.
  2. How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance by Simon Boehm. The template for how you document your own learning journey in public, so people can hire you off the post.
  3. SemiAnalysis by Dylan Patel. The news source. Pair with the Dwarkesh × Dylan × Asianometry episode for a one-stop industry primer.

Where I'd actually start this week

  1. Read Horace He's Brrrr post. Really read it.
  2. Open a Colab or grab a cloud GPU. Write a naive matmul. Benchmark against cuBLAS. Be humbled.
  3. Work the first four chapters of Simon Boehm's series. Reproduce his numbers.
  4. Join the GPU MODE Discord. Lurk for a week. Then enter a KernelBot problem.
  5. Start the MIT 6.5940 YouTube series in parallel. One lecture per day.

That's the month-one plan. Month two, you branch into Triton or into Verilog — whichever scratched more of the itch. By month six you have something shippable in public, and that changes everything about what recruiters see when they type your name.

Closing

Software is infinite, but it runs on atoms. The people who understand both sides of that statement — who can write a webapp in the morning and a fused kernel in the afternoon — are the ones who compound the hardest over the next decade. You don't need to be one of them. But if the pull towards the metal is real, the ladder is right there, and nobody's guarding the bottom rung.

Ship Fast. Think In Cycles.

Mann Jadwani

GenAI Gremlin. I build things that shouldn't work, but somehow do. Currently breaking prod at 3am.