Known physics, unknown constants
Scientists usually know the shape of an equation — Coulomb's law, a pendulum's swing, how a population grows — but not its exact constants. Classical machine learning forces a bad trade-off.
Write fixed code
Fast and exact — but frozen. You cannot gradient-optimize the constants you don't know. Every unknown becomes a manual sweep.
exact · not learnableTrain a neural net
Trainable — but an opaque approximation of structure you already knew. Data-hungry, and it extrapolates badly outside the training set.
learnable · opaqueCompile a program
Keep the structure exactly as written and frozen; make only its numeric constants learnable. Hard physics, soft parameters — the best of both.
exact and learnableThe first insight
A program's structure (its control flow, recursion, composition) can stay frozen while its numeric constants become trainable parameters fit by gradient descent. This erases the boundary between fixed code and trainable network. In the original Neural Compiler, a compiled 1-D heat-equation model recovers the diffusivity to 0.0% error — where a physics-informed neural net (PINN) lands at 92.7%.
A program is just an expression
The system speaks a small subset of Scheme, a minimalist Lisp. A program is an S-expression — a parenthesized tree — and it evaluates to a number. Try it: edit the program or the inputs and run it. This little interpreter runs entirely in your browser.
The pieces you'll need
Everything in this tutorial is built from these forms. Click a chip to load it into the editor.
| (+ - * /) | arithmetic |
| (if t a b) | conditional |
| (let ((x 1)) …) | local binding |
| (lambda (x) …) | function / closure |
| (define (f x) …) | named definition |
| sin cos exp log √ | math primitives |
Here's Coulomb's law as a program, with an unknown constant k:
(/ (* k (* q1 q2)) (* r r)). Hold that thought — in a moment we'll learn
k from data by descending through this exact expression.
Parse → ANF → Compute Graph
To make a program differentiable, the compiler lowers it through three stages. Type a program below and step through each one — the highlighting links every stage to the next.
Parse
A hand-written tokenizer + recursive-descent parser turns text into an
immutable AST. Surface forms like cond, let* and when
desugar to a small core.
A-Normal Form
Every compound subexpression is hoisted into a named let-binding,
so each binding maps 1:1 to a graph node. Think "SSA for functional languages" — showing your
work line by line.
Compute Graph
Each binding becomes a node in a dataflow DAG. Walk it in PyTorch and, because every primitive is differentiable, autograd flows from a loss back to the constants.
Gradient descent through a program
Here is the idea made tangible. We have data and a program with an unknown constant. Press play and watch real reverse-mode autodiff — running in your browser on the graph we just built — pull the constant to its true value. This is exactly what the paper does, scaled up.
Why this beats a neural network here
The program contributes zero approximation error on its valid domain — all the error lives in the one constant we're learning. That's why a model with 1–4 parameters routinely beats an 8,500-parameter MLP, and why it extrapolates. On the LLM-generated benchmark, the compiled model's loss is up to 4,100× lower than a pure MLP — which fails outright on Coulomb's 1/r² singularity.
Don't compile the program.
Compile the interpreter.
The original Neural Compiler compiled each program on its own. But the compiler is expressive enough to compile something far more powerful: a Scheme interpreter written in Scheme. Compile that once, and every program becomes data you hand to it.
The central insight
"The model is not differentiable; the interpreter is." A scientific model expressed as a Scheme program is not itself a differentiable computation graph — it is input data to the compiled interpreter, which is a differentiable computation graph. Because the interpreter is compiled once and verified once, any program it evaluates automatically inherits differentiability. Gradients flow through the interpreter's dispatch logic, environment management, and heap operations to reach the learnable parameters θ inside P(θ).
This is a real, self-hosting interpreter
The evaluator below is the actual core of bootstrap/compiler.scm —
215 lines of Scheme, 25 primitives. The compiler turns this into the differentiable
graph. Notice it's just a cond over expression shapes: a number evaluates to itself,
a symbol looks up the environment, a list dispatches on its head.
Self-hosting is the proof of expressiveness: the compiler can compile its own evaluator, which can then evaluate Scheme — including itself.
(define (scheme-eval expr env)
(cond
((number? expr) expr) ; numbers self-evaluate
((symbol? expr) (env-lookup expr env)) ; variables → environment
((pair? expr)
(let ((head (car expr)))
(cond
((eq? head 'quote) (car (cdr expr)))
((eq? head 'if)
(let ((t (scheme-eval (car (cdr expr)) env)))
(if t (scheme-eval (car (cdr (cdr expr))) env)
(scheme-eval (car (cdr (cdr (cdr expr)))) env))))
((eq? head 'lambda)
(list 'closure (car (cdr expr)) ; capture params,
(car (cdr (cdr expr))) env)) ; body, & environment
(#t (eval-apply head (cdr expr) env)))))
(#t 0)))
Compile once. Ship it. Run any program.
from neural_compiler import compile_interpreter, evaluate_program, save_compiled, load_compiled
interp = compile_interpreter() # compile the evaluator ONCE
save_compiled(interp, "interpreter.ncg") # ~285 KB portable JSON — ship it anywhere
interp = load_compiled("interpreter.ncg") # consumer: no Scheme toolchain, no recompilation
y = evaluate_program(interp, "(* a (exp (* (- 0 b) x)))", {"x": 1.5, "a": 2.5, "b": 0.8})
z = evaluate_program(interp, "(+ (* a x) (* b (* x x)))", {"x": 2.0, "a": 3.0, "b": 1.5})
# Two different programs — same compiled graph. Gradients flow to a, b in both.
Three tricks make it differentiable
For autograd to flow through an interpreter, every value, every memory operation, and every branch has to stay on the differentiable path. Three mechanisms make that true.
Tagged values
Every runtime value — a number, a boolean, a pair, even a closure — is the same shape: a fixed 14-dimensional tensor. The first 10 dimensions are a one-hot type tag that routes dispatch; the last 4 are a payload. Crucially, gradients flow through the payload while the tag merely decides what to do. Click a type to inspect it.
A write-once heap
Pairs, lists and closures live on a dictionary-backed heap keyed by integer
addresses. Because the language is pure, every slot is written exactly once and
never mutated. PyTorch's autograd version-counter is never tripped, so the gradient chain survives
arbitrarily many cons / car / cdr operations.
Step through building a list on both a write-once heap and a naive mutating buffer — watch the autograd chain stay intact on one and shatter on the other.
Soft control flow
For non-recursive conditionals whose branches are both total, if is a
differentiable multiplexer: sel·then + (1−sel)·else (recursive conditionals
use lazy evaluation to bound depth). Branch decisions are made by program structure, not by
the constants we're learning — so the shape of the autograd tape stays fixed even as parameter values
change. Slide the test value to see the gate blend continuously.
The honest caveat
If a learnable parameter sits inside a branch
condition — say (< x α) — the comparison is a hard 0/1 step and α receives
zero gradient. The compiler doesn't introduce this; it inherits the source program's
own non-differentiability. This is a real boundary, not a bug, and the paper demonstrates it directly.
Compile the interpreter once, verify once
Because correctness is established for the interpreter, it holds for every program the interpreter runs. Three theorems, stated over a 13-primitive core language.
Compilation correctness
Every supported program, run by the compiled graph, produces the same value as the source semantics — to floating-point precision. Proved by structural induction. Empirically, compiled-interpreter and direct runs are bit-for-bit identical.
Gradient correctness a.e.
For learnable constants θ, the gradient through the compiled graph equals the true gradient for almost every θ — a set of full Lebesgue measure. The exceptions are a measure-zero union of branch boundaries, where the program inherits the source's own non-differentiability.
Composition preservation
Compose two almost-everywhere-gradient-correct programs and the result is too. The boundary set stays measure zero. This is what makes runtime composition of models trustworthy.
"Almost everywhere" is load-bearing
Gradients are exact on the open trace-constant region Θtc — where the discrete execution trace (which branches are taken, which tags dispatch) doesn't change with θ. Drag the point around the parameter space: inside a region the gradient is well-defined; crossing a boundary, a branch flips and the gradient is undefined exactly there.
The reassuring part: this region has full measure. A random initialization lands inside it with probability 1, and gradient steps keep you there until you deliberately cross a decision boundary.
Correctness ≠ success. Compilation makes the gradients correct; recovering a constant still depends on the loss landscape and the optimizer. Gradient descent is not magic — but now it's available, everywhere the math allows.
What the experiments show
Ten experiments (A–J) stress the system from gradient fidelity to throughput at scale. Every chart below is drawn from the paper's committed result data. Hover for exact values.
① Gradient equivalence & convergence Exp A & C
DMCI, direct compilation, and a hand-coded PyTorch interpreter trace the same convergence curve — they overlap to within 7×10⁻⁷ final-loss difference. A pure MLP, given the same task, diverges by orders of magnitude.
② LLM-generated models: physics vs MLP Exp B
An LLM wrote 15 scientific models from natural-language descriptions. All 15 compiled unmodified. DMCI matched direct compilation on every one (75/75 zero loss difference). A pure MLP fails badly where structure matters — note the log scale.
③ The latency cost — and how batching erases it Exp H
A single DMCI evaluation is ~14× slower than direct compilation. But that's a latency cost, not a throughput cost: over 99.8% of sequential time is Python bookkeeping paid once per batch. Batching amortizes it into an 852× speedup at batch size 1024.
④ Compile once, and the honest limits Exp E, I, J
Compiling the interpreter is O(1) in program size, while per-program approaches scale linearly — a flat-versus-linear crossover. DMCI is also the most robust optimizer as models grow — but it is deliberately not a discrete program-synthesis engine.
The case is reach and assurance — not speed
Great fit
- Fitting continuous constants inside a known program structure
- Programs generated by an LLM — differentiable with zero extra work
- Composing or hot-swapping models at runtime (20–35 ms, gradients flow through the composite)
- Recursive / iterative scientific models JAX's
vmapcan't vectorize - One source → four backends (PyTorch, JAX, NumPy, CuPy), verified once
Reach for something else
- Latency-critical single evaluations (≈14× overhead per call)
- Discrete program synthesis — only ~11% operator recovery vs 100% exhaustive
- Optimizing a constant that lives inside a branch condition (zero gradient)
- Problems with no known structure at all — train a network
"The ability to generate programs via LLMs, swap programs at runtime, or compose them programmatically — with gradient flow through the composite — is the architectural advantage that justifies the latency cost."
Read it, run it, cite it
Two papers and one open-source repository. Everything you need to reproduce the results or build on the system.
Install & first run
git clone https://github.com/sheneman/nncompile.git
cd nncompile && pip install -e .
python -c "from neural_compiler.compiler import run_scheme; \
print(run_scheme('(+ (* 3 4) 5)'))" # 17.0
Fit a program's constants
graph = compile_program("(* a x)", inputs={"a": None, "x": None})
a = torch.nn.Parameter(torch.tensor(0.0))
opt = torch.optim.Adam([a], lr=0.1)
for _ in range(200):
pred = unwrap_number(evaluate(graph, {"a": make_float(a), "x": make_float(x)}))
loss = (pred - y) ** 2
opt.zero_grad(); loss.backward(); opt.step()
print(a.item()) # ≈ 3.0 — recovered by descending through the program
The code →
The full compiler, runtime, four backends, experiments, and 800+ tests. MIT licensed.
Paper 2 — DMCI →
Compile Once, Differentiate Everywhere: A Differentiable Meta-Circular Interpreter. The method and experiments on this page. (arXiv ID pending — see the repo.)
Paper 1 — Neural Compiler →
The Neural Compiler: Program-to-Network Translation for Hybrid Scientific Machine Learning. The predecessor DMCI builds on.
Cite this work
@article{sheneman2026dmci,
title = {Compile Once, Differentiate Everywhere:
A Differentiable Meta-Circular Interpreter},
author = {Sheneman, Lucas},
year = {2026},
note = {code at https://github.com/sheneman/nncompile}
}
@article{sheneman2026neuralcompiler,
title = {The Neural Compiler: Program-to-Network Translation
for Hybrid Scientific Machine Learning},
author = {Sheneman, Lucas},
journal = {arXiv preprint arXiv:2605.22498},
year = {2026}
}