
Hello, world.

May 9, 2026 · 5 min read · #intro #plan

I am starting this notebook the way every program does: with a small, declarative “hello.” A low-stakes thing that says I am here, I intend to learn, and I am going to leave a trail.

For the last decade I’ve shipped software for a living. Now I want to climb underneath it. The plan is old-fashioned: start at the math, work up through the silicon, and only then, slowly, through the systems that train and serve modern models.

A small map of where I’m going

There are roughly four directions I want to push at, and they will take turns being interesting:

Math. Linear algebra first, then multivariable calculus, then probability. Not a survey; I want to be able to derive, not just recognize.
Software. Python by day, C and Rust on the weekends, with detours into CUDA when I am brave.
Hardware. What an SM (streaming multiprocessor) is, what an HBM stack costs you, and why bandwidth, not FLOPS, is the budget that matters.
Models. Eventually. First as a reader, then a re-implementer, then, hopefully, as someone who can have an opinion.

A taste of what posts will look like

To shake the site out, here is one math idea, one diagram, and one snippet of code: the three things I expect to lean on most.

Math

The gradient of a scalar field $f : \mathbb{R}^n \to \mathbb{R}$ is the vector of its partial derivatives. It points in the direction of steepest ascent, and its negation is the workhorse of every optimizer I will write this year:

$$ \nabla f(\mathbf{x}) \;=\; \left[ \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right]^{\top}. $$

Gradient descent is then nothing more than the rule $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t)$, and most of training is a discussion about how to be clever with $\eta$.
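To see why $\eta$ gets so much attention, take the simplest case I can think of (an illustrative example of my own choosing, not a general result): $f(x) = x^2$, so $\nabla f(x) = 2x$. The rule becomes

$$ x_{t+1} \;=\; x_t - \eta \cdot 2 x_t \;=\; (1 - 2\eta)\, x_t \quad\Longrightarrow\quad x_t \;=\; (1 - 2\eta)^{t}\, x_0. $$

The iterates shrink toward the minimum at $0$ exactly when $|1 - 2\eta| < 1$, that is, when $0 < \eta < 1$. At $\eta = \tfrac{1}{2}$ the update lands on the minimum in one step; between $\tfrac{1}{2}$ and $1$ it overshoots and oscillates while still converging; past $1$ it diverges.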

Diagrams

Here is the mental model I keep returning to: the loop a learning system runs forever.

[Figure: the training loop, with the forward pass running left to right and the gradient flowing back.]

Everything else (Adam, batch norm, attention, scaling laws) is commentary on this picture.
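The loop can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions of my own: a one-feature linear model, mean squared error as the loss, and noise-free synthetic data. The name `train_linear` and its parameters are made up for this illustration.

```python
import numpy as np

def train_linear(X, y, lr=0.1, steps=500):
    """Fit y ≈ X @ w + b by repeating forward pass, backward pass, update."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    losses = []
    for _ in range(steps):
        # Forward pass: predictions and mean squared error.
        pred = X @ w + b
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the loss with respect to w and b.
        grad_w = 2.0 * X.T @ err / n
        grad_b = 2.0 * float(np.mean(err))
        # Update: one plain gradient descent step.
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b, losses

# Synthetic, noise-free data from a known line: y = 2x + 1 (an assumed toy problem).
X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
w, b, losses = train_linear(X, y)
print(w, b, losses[0], losses[-1])
```

The gradients are written out by hand here on purpose; swapping the update line for a smarter rule is where optimizers like Adam would come in.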

For diagrams that need annotated cells with pointers, like “what’s on the stack right now,” I use a tiny declarative DSL that renders to a hand-drawn-style SVG, with rough.js under the hood (the same engine Excalidraw uses):

[Figure: an iterate buffer, where each cell is a step along the descent path.]

Just write the schema in JSON inside the figure; the page renders the picture on load. For one-off illustrations I’ll also reach for Excalidraw: draw it, Export → SVG, paste the SVG into a <figure class="diagram">. The two together cover most of what Crafting Interpreters and Game Programming Patterns do by hand.

Code

A tiny gradient descent in Python that I will reuse and abuse for months:

import numpy as np

def gradient_descent(grad, x0, lr=1e-2, steps=1_000):
    """Take `steps` gradient steps along -∇f with learning rate `lr`, starting from x0."""
    # np.array copies, so the in-place update below never mutates the caller's x0.
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2, whose gradient is [2(x - 3), 2(y + 1)].
def grad_f(v):
    return np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])

print(gradient_descent(grad_f, [0.0, 0.0]))   # → ~[3, -1]
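Hand-written gradients like the one above are easy to get subtly wrong, so a cheap sanity check is worth having. Here is a minimal sketch of a central-difference check, assuming a 1-D input vector; `numerical_grad`, the step `h`, and the test point are my own illustrative choices.

```python
import numpy as np

def numerical_grad(f, x, h=1e-6):
    """Approximate ∇f(x) with central differences, one coordinate at a time."""
    x = np.array(x, dtype=float)  # copy so the caller's array is untouched
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return g

# The same quadratic as the example: f(x, y) = (x - 3)^2 + (y + 1)^2.
def f(v):
    return (v[0] - 3) ** 2 + (v[1] + 1) ** 2

def grad_f(v):
    return np.array([2 * (v[0] - 3), 2 * (v[1] + 1)])

point = np.array([0.5, -2.0])
# Compare the analytic gradient against the numerical estimate.
print(np.allclose(numerical_grad(f, point), grad_f(point), atol=1e-5))
```

The check costs two function evaluations per coordinate, so it is a debugging tool for small inputs rather than something to run inside a training loop.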

Note: This is a notebook, not a textbook. Posts will be short, opinionated, and occasionally wrong. I will edit shamelessly and date the corrections.

What’s next

Three posts queued in my head:

Why a matrix is a function, and what changes when you start believing it.
A from-scratch autograd in 90 lines, with no magic.
What an H100 actually does in a microsecond: a budget post.

If any of that sounds like fun, subscribe to the feed or just check back. The site is small and the writing will be slow, but consistent.

Onward.