How To Break Into AI Hardware
// A field guide for software brains who want to go closer to the metal.
__global__ void matmul(...) { /* vibes */ }
Every app you've ever shipped eventually hits a wall of silicon. This is a guide for the day you decide to stop shouting at the wall and start building it.
I'm a GenAI guy. I ship apps. I write prompts. I duct-tape APIs together until they behave. But every time a model gets bigger, a latency number gets worse, or an inference bill gets scary, I end up in the same place: somebody, somewhere, is writing a CUDA kernel that decides whether my product works or not. In 2026, with hyperscaler capex pushing past half a trillion dollars and custom chips multiplying like rabbits, that somebody is increasingly in demand. This post is my attempt to write down, in one place, how a software brain actually pivots into the AI hardware stack.
Why bother in 2026?
The "NVIDIA-only datacenter" is over. Microsoft Maia 200, Meta MTIA 300, AWS Trainium 3, and Google's 7th-gen TPU Ironwood are all in production. Custom ASIC shipments are projected to grow at a ~45% CAGR this year — roughly 3× the growth of GPUs. Inference is now two-thirds of all AI compute, and that's where every accelerator startup is placing its bet.
Translation: every one of those chips needs kernel engineers, compiler engineers, perf modelers, and systems people. You don't need a PhD in physics. You need taste, a GPU, and the patience to read benchmarks until your eyes bleed.
The map: what "AI hardware" actually means
People throw the phrase around like it's one field. It isn't. Here's the rough territory:
- GPUs — NVIDIA (Hopper, Blackwell), AMD (MI300/MI350). Still the gravity well.
- Hyperscaler ASICs — Google TPU, Meta MTIA, Microsoft Maia, AWS Trainium/Inferentia, Tesla Dojo.
- Accelerator startups — Cerebras (wafer-scale), Groq (LPU, 80 TB/s on-die), SambaNova (reconfigurable dataflow), Tenstorrent (Jim Keller, RISC-V, open source), Etched (transformer-only ASIC, still not shipping — cautionary tale).
- Interconnects — NVLink 5, the open UALink 2.0 (scale-up to 1,024 accelerators), CXL 4.0 memory pooling, Huawei's UB-Mesh.
- The exotic corner — photonic (Lightmatter, Neurophos), in-memory compute, neuromorphic. Great for PhDs, thin for early-career hires.
If you're optimising for "where will I get paid in 2028," the boring answer is: GPUs, hyperscaler ASICs, and the software stacks around them. The exotic stuff is fun. The interconnect and kernel layer is where the jobs are.
The skills stack (in the order I'd actually learn them)
1. Computer architecture foundations
Before you write a single Verilog line, you need pipelining, hazards, caches, and memory hierarchies wired into your brain. Start with Harris & Harris (Digital Design and Computer Architecture). Graduate to Hennessy & Patterson (Computer Architecture: A Quantitative Approach). For lectures, Onur Mutlu's courses on YouTube are the deepest free resource; nothing else comes close.
2. HDL and a toy CPU
Pick Verilog / SystemVerilog because that's what the industry ships. Then, if you want productivity, learn Chisel (Berkeley, underlies Rocket / BOOM / Chipyard) or SpinalHDL. Simulate everything in Verilator. Start with nand2tetris, then build a single-cycle RISC-V core via riscv-sodor, then pipeline it.
3. GPU architecture internals
SMs, warps, tensor cores, the HBM → L2 → shared memory → registers hierarchy. Hopper's TMA and async copies. Blackwell's FP4. This is not optional. Canonical book: Kirk & Hwu, Programming Massively Parallel Processors. Canonical blog post: Horace He's "Making Deep Learning Go Brrrr From First Principles". If you only read one link from this whole post, read that one.
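To make the hierarchy concrete, here's a toy kernel (my own sketch, not from any of the references above) that walks a tile through every level the paragraph names: global memory into shared memory, shared memory into a register, register back out to HBM. The tile size is an arbitrary choice, not a tuned one.

```cuda
// hierarchy_demo.cu -- a toy kernel that touches every level named above:
// HBM (global memory) -> shared memory -> registers, one tile at a time.
// Illustrative only; block/tile sizes are arbitrary, not tuned.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE = 32;

__global__ void transpose_tiled(const float* __restrict__ in,
                                float* __restrict__ out, int n) {
    // Shared memory: on-chip, visible to the whole thread block (the "SRAM" tier).
    __shared__ float tile[TILE][TILE + 1];          // +1 pads away bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;        // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;        // row in the input
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // HBM -> shared (coalesced)
    __syncthreads();

    // Registers: each thread holds its value privately before the write-back.
    int tx = blockIdx.y * TILE + threadIdx.x;       // transposed coordinates
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) {
        float v = tile[threadIdx.x][threadIdx.y];   // shared -> register
        out[ty * n + tx] = v;                       // register -> HBM (coalesced)
    }
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * n * sizeof(float));
    cudaMallocManaged(&out, n * n * sizeof(float));
    for (int i = 0; i < n * n; ++i) in[i] = float(i);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose_tiled<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("out[1] = %.1f (expect %.1f)\n", out[1], float(n));  // out[0][1] == in[1][0]
    cudaFree(in); cudaFree(out);
}
```

The +1 padding on the shared tile is the classic bank-conflict dodge; noticing details like that is most of what "GPU architecture internals" means in practice.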
4. Kernels: CUDA and Triton
Write a naive CUDA SGEMM. Then optimise it. The canonical public worklog is Simon Boehm's matmul series, where he goes from a toy kernel to 95% of cuBLAS throughput over about a dozen carefully explained iterations. The repo is your lab. Once you've done that, learn Triton: the iteration loop is 10× faster than raw CUDA and every model lab uses it now.
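For reference, rung zero looks roughly like this. This is my own minimal sketch of a naive SGEMM, not code from Boehm's repo: one thread per output element, every operand fetched straight from global memory, no tiling, no shared memory. Everything the series teaches is about fixing what's wrong with it.

```cuda
// naive_sgemm.cu -- the deliberately slow baseline you then spend a dozen
// iterations improving.
#include <cstdio>
#include <cuda_runtime.h>

// C = alpha * A @ B + beta * C, with A (M x K), B (K x N), C (M x N), row-major.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)          // K global loads of A and of B per output
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

int main() {
    const int M = 512, N = 512, K = 512;
    float *A, *B, *C;
    cudaMallocManaged(&A, M * K * sizeof(float));
    cudaMallocManaged(&B, K * N * sizeof(float));
    cudaMallocManaged(&C, M * N * sizeof(float));
    for (int i = 0; i < M * K; ++i) A[i] = 1.0f;
    for (int i = 0; i < K * N; ++i) B[i] = 2.0f;
    for (int i = 0; i < M * N; ++i) C[i] = 0.0f;

    dim3 block(16, 16), grid((N + 15) / 16, (M + 15) / 16);
    sgemm_naive<<<grid, block>>>(M, N, K, 1.0f, A, B, 0.0f, C);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0f * K);  // 1 * 2 summed K times
    cudaFree(A); cudaFree(B); cudaFree(C);
}
```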
5. Model-level hardware awareness
Why does FlashAttention matter? Because it tiles from HBM to SRAM and recomputes instead of materialising an attention matrix you can't afford. Why does GQA exist? KV cache memory. Why FP8/FP4? Memory bandwidth. You cannot write a useful kernel without knowing why the model wants it. Read Tri Dao's FlashAttention-3 post. Read anything Tim Dettmers has written about k-bit quantization.
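If you want to feel the GQA point in numbers, here's a back-of-envelope KV-cache calculation. The model shape below is illustrative (loosely 70B-class), not pulled from any particular spec sheet; plug in your own numbers.

```cuda
// kv_cache_math.cu -- back-of-envelope for why GQA and low-precision exist.
// The config is a hypothetical decoder, chosen only to make the arithmetic vivid.
#include <cstdio>

int main() {
    const long long layers   = 80;
    const long long head_dim = 128;
    const long long seq_len  = 8192;   // tokens of context kept in cache
    const long long batch    = 8;      // concurrent sequences
    const long long bytes    = 2;      // fp16/bf16 per element

    auto kv_bytes = [&](long long kv_heads) {
        // 2 tensors (K and V) per layer, each [seq_len, kv_heads, head_dim], per sequence.
        return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes;
    };

    const double GiB = 1024.0 * 1024.0 * 1024.0;
    printf("MHA  (64 KV heads): %.1f GiB\n", kv_bytes(64) / GiB);
    printf("GQA  ( 8 KV heads): %.1f GiB\n", kv_bytes(8)  / GiB);
    // Every one of those bytes streams through HBM on each decode step,
    // which is why bandwidth (and FP8/FP4) dominates inference conversations.
}
```

Cutting KV heads from 64 to 8 is the difference between a cache that fits on the card and one that doesn't; that's the whole argument for GQA in one division.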
6. Compilers (the fastest-growing slice)
MLIR, XLA, TVM, Triton's IR. Every new accelerator ships a compiler and every compiler needs humans. Write a loop-fusion pass. Implement tiling. Read the Triton source. If you already like writing software, this is where you get paid like a senior SWE for doing hardware-adjacent work.
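"Write a loop-fusion pass" sounds abstract until you see the transformation it performs. Here it is written out by hand in plain CUDA (my sketch; a real pass does this on an IR, not on source text), showing what the compiler buys you: one round-trip to HBM instead of two.

```cuda
// fusion_by_hand.cu -- the rewrite a fusion pass automates.
#include <cuda_runtime.h>

// Unfused: y = relu(x * s) as two kernels. The intermediate `tmp`
// goes all the way out to HBM and comes all the way back.
__global__ void scale(const float* x, float* tmp, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] * s;
}
__global__ void relu(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i], 0.0f);
}

// Fused: same math, one pass over memory, no intermediate buffer at all.
__global__ void scale_relu(const float* x, float* y, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(x[i] * s, 0.0f);
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = (i % 2 ? 1.0f : -1.0f);
    scale_relu<<<(n + 255) / 256, 256>>>(x, y, 3.0f, n);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
}
```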
7. Real silicon, for cheap
Tiny Tapeout puts real packaged chips in your hands for around $300 per slot. The 2026 shuttles (TTGF26a, TTSKY26a) are ongoing. You design in Verilog, they tape out with OpenROAD on the SkyWater PDK, and four months later you get a physical die you can point at in interviews. Unreasonably high signal for the effort involved.
The self-check before you start applying anywhere:
> Can you explain why a matmul is compute-bound at large M, N, K but memory-bound at small ones?
> Can you draw the GPU memory hierarchy from HBM down to a register?
> Can you describe what FlashAttention actually does in one sentence?
If yes to all three, you're already more hireable than you think.
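For the first question, the arithmetic fits in a dozen lines. This is plain host code; the peak numbers are approximate H100-SXM spec-sheet figures (roughly 989 TFLOP/s dense BF16, 3.35 TB/s HBM), so substitute whatever card you actually own.

```cuda
// roofline_check.cu -- a worked answer to the compute- vs memory-bound question.
#include <cstdio>

int main() {
    const double peak_flops = 989e12;                 // FLOP/s (illustrative)
    const double peak_bw    = 3.35e12;                // bytes/s (illustrative)
    const double balance    = peak_flops / peak_bw;   // FLOPs needed per byte (~295)

    auto check = [&](long long M, long long N, long long K) {
        double flops = 2.0 * M * N * K;                  // one multiply-add per MAC
        double bytes = 2.0 * (M * K + K * N + M * N);    // bf16, each operand touching HBM once
        double intensity = flops / bytes;
        printf("M=N=K=%lld: intensity %.0f FLOP/byte -> %s\n", M, intensity,
               intensity > balance ? "compute-bound" : "memory-bound");
    };

    check(8192, 8192, 8192);   // big GEMM: lots of reuse per byte moved
    check(128, 128, 128);      // small GEMM: the data barely gets reused
}
```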
Courses that are actually worth your time
- MIT 6.5940 — TinyML & Efficient Deep Learning Computing (Song Han). Pruning, quantization, distillation, LLM inference, on-device training. Full lecture set on YouTube. Homework mirror at yifanlu0227/MIT-6.5940.
- Stanford CS149 (parallel computing) and CS217 (hardware accelerators).
- CMU 15-418 / 15-442 — parallel and ML systems.
- Berkeley CS152 / CS252A (Sp26) — Chipyard + Sodor labs, taught by people who actually tape out chips.
The project ladder (this is what actually gets you hired)
Courses are context. Projects are proof. Ship these in order, ideally in public with writeups:
- FizzBuzz in Verilog, simulated in Verilator, with a testbench. Your first tangible commit.
- Single-cycle RISC-V core via riscv-sodor.
- 5-stage pipelined RISC-V with hazard detection and forwarding.
- Systolic array matmul in SystemVerilog, verified against a NumPy reference.
- CUDA SGEMM from naive to 80%+ of cuBLAS at a chosen tile size. Write it up publicly. A repo like this has gotten more people hired than any resume line I know of.
- Triton kernel for a non-trivial attention variant — sliding window, GQA, ALiBi, MLA. Beat PyTorch SDPA on a specific shape.
- Submit a design to Tiny Tapeout. Real silicon in your portfolio.
- Rank on the GPU MODE KernelBot leaderboard. NVIDIA has publicly written about topping it. That's the neighbourhood you want to be seen in.
- Upstream a kernel to tinygrad, vLLM, SGLang, or flash-attention. Open-source contribution is the highest-trust signal in this field.
Communities you should be in yesterday
- GPU MODE Discord — 26k+ members, weekly lectures on YouTube, an actual kernel leaderboard. NVIDIA engineers show up. So should you.
- Zero to ASIC — Matt Venn's course and Discord. The on-ramp for the Tiny Tapeout pipeline.
- r/FPGA, r/chipdesign.
- Sasha Rush's GPU Puzzles and Triton-Puzzles. Interactive. Annoyingly addictive.
What the jobs actually look like
Titles, roughly ordered from "most software-flavoured" to "hardcore silicon":
- ML Systems / Inference Engineer — serving, batching, scheduling. Pure SWE leverage.
- GPU Kernel / Performance Engineer — CUDA, Triton, HIP. The sweet spot for software brains pivoting in.
- ML Compiler Engineer — MLIR, XLA, TVM. Software, but proximity to the hardware roadmap.
- DL Architect / Perf Modeling — roofline analysis, simulation, what-to-build-next.
- RTL / Verification Engineer — SystemVerilog, UVM, formal. Hardest pivot from a SWE start.
- Physical Design / DFT — realistically needs targeted schooling.
Where they live: NVIDIA, AMD, Intel (Gaudi), Apple Silicon, Google TPU, Meta MTIA, Microsoft Maia, AWS Trainium, Tesla Dojo. Then the startups: Cerebras, Groq, SambaNova, Tenstorrent, Etched, MatX, d-Matrix, Lightmatter, Rain, Fractile. Then the frontier labs with in-house kernel teams: Anthropic, OpenAI, xAI, DeepSeek. Then inference platforms: Baseten, Together, Fireworks, Modal.
Comp, roughly, in 2026 US markets (levels.fyi): NVIDIA SWE median ~$329k, IC5 ~$238k base + $167k stock, IC7 breaks $1M. ML Engineer bands track $205k–$331k. AMD lags meaningfully. Hyperscalers (Google, Meta, Microsoft) track NVIDIA closely for ML infra. Startups: competitive cash, wild equity variance.
Honest takes
Don't go full RTL if you're starting from SWE
Going from webapp to physical design is a 2–4 year investment, often with a master's attached, and markets you to a narrower employer set. The kernel and compiler layer is the actual sweet spot: you keep your SWE leverage, you get paid like a senior SWE++, and MLIR/Triton skills are portable across NVIDIA, AMD, TPU, MTIA, Maia, and Trainium. Fast kernels are provable with a benchmark. RTL craft only shows up in tapeouts.
The geopolitics matter more than you'd think
NVIDIA's moat is CUDA, not silicon — 18 years and millions of developers. Huawei's CANN Next is a deliberate near-drop-in replacement, Ascend 950PR is shipping, and the target is 750k+ Ascend chips in 2026. A full Chinese AI stack — Ascend + Cambricon + CANN + DeepSeek — already exists with zero US components. For US-based engineers: export controls are a tailwind for domestic hardware roles. For globally mobile engineers: CANN/Ascend skills are a surprisingly smart hedge.
Is this a good 5-year bet?
Yes, but place the bet precisely. Hyperscaler capex is projected past $500B in 2026. Every GW of new IT capacity needs kernel, compiler, interconnect, and perf modelling people. Hardcore RTL at a pre-tapeout startup is high-variance (Etched is 20+ months post-announcement and still hasn't shipped a customer chip; that's typical for new silicon, not an outlier). The sustained, low-variance upside is in the software-adjacent hardware layer.
Pro Tip
Don't try to learn everything. Pick one vertical slice — e.g. CUDA matmul → Triton attention → MLIR pass → KernelBot submission — and go deep. Breadth follows depth. A single great CUDA writeup beats ten half-finished Verilog repos.
The three links that matter most
If you stop reading my post right now and read only these three, you'll still be ahead of 95% of people saying they want to get into AI hardware:
- Making Deep Learning Go Brrrr From First Principles by Horace He. The mental model: compute-bound vs memory-bound vs overhead-bound. Every hardware conversation reduces to this.
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance by Simon Boehm. The template for how you document your own learning journey in public, so people can hire you off the post.
- SemiAnalysis by Dylan Patel. The news source. Pair with the Dwarkesh × Dylan × Asianometry episode for a one-stop industry primer.
Where I'd actually start this week
- Read Horace He's Brrrr post. Really read it.
- Open a Colab or grab a cloud GPU. Write a naive matmul. Benchmark against cuBLAS. Be humbled.
- Work the first four chapters of Simon Boehm's series. Reproduce his numbers.
- Join the GPU MODE Discord. Lurk for a week. Then enter a KernelBot problem.
- Start the MIT 6.5940 YouTube series in parallel. One lecture per day.
That's the month-one plan. Month two, you branch into Triton or into Verilog — whichever scratched more of the itch. By month six you have something shippable in public, and that changes everything about what recruiters see when they type your name.
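For the "benchmark against cuBLAS" step, a minimal timing harness looks something like this (my sketch, using CUDA events; the shape and iteration count are arbitrary). Drop your own kernel launch into the marked spot to get your "percent of cuBLAS" number under identical conditions.

```cuda
// bench_cublas.cu -- time cublasSgemm with CUDA events.
// Build: nvcc bench_cublas.cu -lcublas
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 4096;                       // square problem; pick your own shape
    const double flops = 2.0 * N * N * N;     // FLOPs per SGEMM

    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));
    cudaMemset(A, 0, N * N * sizeof(float));
    cudaMemset(B, 0, N * N * sizeof(float));
    cudaMemset(C, 0, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up, then timed runs. Replace the cublasSgemm calls with your own
    // kernel launch to benchmark it the same way.
    const int iters = 20;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, A, N, B, N, &beta, C, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg %.3f ms, %.1f TFLOP/s\n", ms / iters,
           flops * iters / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
}
```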
Closing
Software is infinite, but it runs on atoms. The people who understand both sides of that statement — who can write a webapp in the morning and a fused kernel in the afternoon — are the ones who compound the hardest over the next decade. You don't need to be one of them. But if the pull towards the metal is real, the ladder is right there, and nobody's guarding the bottom rung.
Ship Fast. Think In Cycles.