AI Compilers & Systems

I'm Patrick Toulme

Compiler and Performance engineer on the MTIA team at Meta, optimizing and enabling workloads on MTIA custom silicon. Previously at AWS Neuron working on the Trainium compiler and NKI kernels. I write about AI Compilers, AI systems, JAX/PyTorch on my blog JustAByte.

Location New York, NY

Education Georgia Tech MS, UVA BA

Experience

Where I've Worked

2025 — Present

Compiler Engineer

Meta — MTIA Compiler Performance Team

Primarily working on GenAI inference compilation right now in vLLM. Working on high performance inference compilation from FX IR to MTIA ISA. I also actively work on the bringup of new MTIA silicon.

PyTorch FX IR MLIR Custom Silicon

2023 — 2025

Machine Learning Engineer II

Amazon AWS Neuron — Annapurna Labs

I led the bringup of the JAX backend for Trainium. I also made GSPMD work on Trainium and work performantly. Worked on native codegeneration and the NKI compiler. I also worked on the bringup of Trainium2. I also wrote the first collective matmul for Trainium — open sourced here.

JAX XLA PJRT GSPMD NKI

2022 — 2023

Software Engineer

Amazon AWS AI Bedrock — Titan Model Training

Worked on the early Bedrock training service. Trained very large LLMs on GPUs.

PyTorch NeMo Distributed Training A100

Writing

Just a Byte

I write a blog about AI compilers, custom silicon, and AI systems. Exploring JAX and PyTorch targeting custom hardware, with a focus on high-performance LLMs.

CuTile on Blackwell: NVIDIA's Compiler Moat Is Already Built

Tracing a Mixture of Experts kernel through CuTile's compilation stages. 86 lines of Python expand into 180KB of shared memory, tcgen05 instructions, and orchestration patterns that form NVIDIA's deepening compiler moat.

January 2026 · NVIDIA, CuTile, Blackwell, Compilers

When XLA Isn't Enough: From Pallas to VLIW with Splash Attention on TPU

When does XLA hit its limits? How do you write the TPU Pallas kernel that the compiler cannot automatically find? Why can't XLA generate Splash Attention?

January 2026 · Pallas, TPU, Splash Attention

From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack

Eight lines of JAX code become 250 VLIW bundles across 5 fused kernels. A deep dive into what happens between jax.jit(f)(x) and electrons moving through a TPU.

December 2025 · AI Compilers, TPU, JAX

Subscribe to Just a Byte

Personal Projects

Building a TPU from Scratch

I designed and built a custom tensor processing unit — from RTL hardware to an MLIR compiler and PJRT runtime — that runs JAX and executes Llama as a single fused megakernel.

Custom TPU

JAX → PJRT Runtime → MLIR Compiler → RTL

A fully custom tensor processing unit built from the ground up: Verilog RTL hardware, an MLIR-based compiler (JAX → HLO → MLIR → ASM → VLIW → Binary), and a PJRT runtime that plugs directly into JAX. No CUDA, no hand-written kernels — pure compiler codegen. The compiler fuses entire models into single megakernel binaries, and it now runs Llama end-to-end.

Blog: Built a TPU from Scratch Blog: Custom TPU Runs Llama X Post X Post

Verilog MLIR PJRT JAX VLIW Llama

Education

Academic Background

Georgia Institute of Technology

M.S. Computer Science — Artificial Intelligence

Graduated December 2025

University of Virginia

B.A. Computer Science

Graduated May 2022

Thomas Jefferson High School

Science and Technology

Graduated May 2018

Publication

Research

🏆 Best Paper Award

Marcus: A Chatbot for Depression Screening Based on the PHQ-9 Assessment

A comprehensive study contrasting the effectiveness of screenings by "Marcus," a BERT-based chatbot, against traditional PHQ-9 assessments. Developed a prototype application integrating BERT for linguistic analysis with DialogFlow and Kommunicate APIs.

ACHI 2023 — The Sixteenth International Conference on Advances in Computer-Human Interactions

Expertise

Technical Skills

Languages & Compilers

Python, C++
PyTorch, JAX
XLA, GSPMD, PJRT
MLIR, LLVM
FX Graph IR

Hardware & Kernels

MTIA (Meta)
GPU, TPU
Neuron Core (AWS)
Triton, Pallas
NKI

Machine Learning

Transformer Models
Distributed Training
LLM Pretraining
Mixture of Experts
RLHF