Open-source edge AI

Run PyTorch on Nordic's AXON NPU

The first open-source ExecuTorch backend for Nordic's AXON NPU — running on Zephyr RTOS.

Write the model in PyTorch, deploy it to Nordic silicon in minutes. The gap between data scientist and embedded developer just got smaller; you can even be both.

56× speedup · 27.6 ms / token
PyTorch model (standard training) → INT8 quantization (torch.ao.quantization) → TOSA lowering (Tensor Operator Set) → AXON compiler (Nordic toolchain) → nRF54LM20B (on-device inference)

Standard PyTorch workflow — no custom framework. The same tools ML engineers already use.
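As a taste of that workflow, here is a minimal quantization sketch using stock `torch.ao.quantization`. The repo's actual recipe (per the pipeline above) is static INT8 quantization ahead of TOSA lowering; dynamic quantization is shown here only because it is the shortest runnable illustration, and the toy model is a placeholder:

```python
import torch
import torch.nn as nn

# Placeholder model -- any nn.Module trained the usual way works here.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

# Quantize the Linear layers to INT8 with torch.ao.quantization.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is a drop-in replacement for the float one.
out = quantized(torch.randn(1, 16))
```

From there, the quantized graph is what gets lowered to TOSA and compiled for the NPU.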

The 56× optimization

A 192k-parameter Mamba1 language model optimized from 1,550 ms to 27.6 ms per token on the nRF54LM20B — a 56× speedup through progressive optimizations. A case study of what the backend enables on production workloads.

| Stage | Latency | Speedup | What changed |
|---|---|---|---|
| fp32 portable baseline | 1,550 ms | 1× | ExecuTorch portable runtime |
| AXON NPU delegation | 1,000 ms | 1.5× | Linear layers offloaded to the NPU |
| Fused custom operators | 55 ms | 28× | 22 op-fusion patterns (PTSiLU, PTSoftplus, RMSNorm) |
| q11.12 fixed-point | 35 ms | 44× | Power-of-two bit-shifts replace FPU ops |
| Persistent scan state | 27.6 ms | 56× | Zero-copy q12 state between tokens |

Three showcase models

Each one is a complete PyTorch model — training notebook, quantization recipe, and deployment to the nRF54LM20 DK in a single flow. The kind of thing a data scientist can run end-to-end without needing a separate embedded team.

Anomaly Detection: sensor fault detection for industrial IoT predictive maintenance

Image Classifier: shape recognition for visual inspection

Keyword Spotting: audio keyword detection for an always-on wake word

How it works

Delegate pattern. ExecuTorch partitions a model graph and hands supported subgraphs to backend delegates. Our AXON delegate claims the ops the NPU can accelerate, and ExecuTorch falls back to the portable CPU runtime for everything else. No model surgery required.
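The claim-and-fallback idea can be sketched in a few lines of plain Python. This mirrors the pattern only, not the real ExecuTorch partitioner API; the op names and the supported set are illustrative:

```python
# Toy sketch of delegate partitioning -- not the ExecuTorch API.
# The delegate "claims" ops it can accelerate; everything else
# falls back to the portable CPU runtime.
AXON_SUPPORTED = {"linear", "conv2d", "relu", "add", "mul"}

def partition(graph_ops):
    """Split a flat op list into NPU-delegated and CPU-fallback groups."""
    delegated, fallback = [], []
    for op in graph_ops:
        (delegated if op in AXON_SUPPORTED else fallback).append(op)
    return delegated, fallback

npu_ops, cpu_ops = partition(["linear", "silu", "conv2d", "softplus"])
```

The real partitioner works on subgraphs of the exported program rather than a flat op list, but the division of labor is the same.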

TOSA composition. PyTorch ops are lowered to the Tensor Operator Set Architecture — the same stable IR used by Arm's Ethos-U toolchain. The backend composes with ExecuTorch's shared TOSABackend infrastructure, reusing roughly 80% of the Ethos-U code path. New PyTorch ops that decompose to TOSA primitives AXON already supports get coverage automatically; expanding beyond that is a matter of hardware op support plus a short converter, not a new frontend.

Delegated op set. Fully Connected, Conv1D / Conv2D, Depthwise Conv, Average / Max Pool, Add, Multiply, and the ReLU family run directly on the NPU. Op extensions cover Sigmoid, Tanh, and Softmax. Anything outside this set falls back to the portable CPU runtime.
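For intuition, here is a small PyTorch model (hypothetical, not from the repo) built only from ops in the delegated set above, so in principle the whole graph can run on the NPU with no CPU fallback:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Uses only delegated ops: Conv2D, Max Pool, ReLU, Fully Connected."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)          # 28x28 -> 14x14
        self.fc = nn.Linear(8 * 14 * 14, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))
        return self.fc(x.flatten(1))

logits = TinyCNN()(torch.randn(1, 1, 28, 28))
```

Swap the ReLU for, say, a GELU and that node would fall back to the CPU runtime instead.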

Op-fusion passes. The biggest wins come from pattern-matching fused operators (PTSiLU, PTSoftplus, RMSNorm, and 19 others) that collapse multiple TOSA ops into single NPU kernels. Combined with q11.12 fixed-point arithmetic — where scaling becomes a bit-shift — the compiler can emit hand-tuned kernels from a standard PyTorch model.
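The q11.12 trick is worth a worked example: values are stored as integers scaled by 2^12, so renormalizing after a multiply is a single right-shift rather than a floating-point operation. A minimal sketch in plain Python:

```python
FRAC_BITS = 12  # q11.12: 11 integer bits, 12 fractional bits, plus sign

def to_q12(x: float) -> int:
    """Encode a float as a q11.12 fixed-point integer."""
    return round(x * (1 << FRAC_BITS))

def from_q12(q: int) -> float:
    """Decode a q11.12 integer back to a float."""
    return q / (1 << FRAC_BITS)

def q12_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the raw product carries 24 fractional
    bits, so one right-shift renormalizes it -- no FPU involved."""
    return (a * b) >> FRAC_BITS

a, b = to_q12(1.5), to_q12(-2.25)
prod = q12_mul(a, b)        # decodes to -3.375
```

The persistent-scan-state optimization in the table keeps the recurrent state in this q12 encoding between tokens, so no conversion happens at the token boundary.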

Zephyr RTOS. Inference runs inside a Zephyr application — ExecuTorch and Nordic's sdk-edge-ai are both Zephyr modules pulled in via west. The result coexists with the rest of the firmware (BLE stack, sensors, peripherals) using Zephyr's cooperative and preemptive scheduling, no bare-metal gymnastics required.
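A west manifest fragment for pulling in the two modules might look like the following. This is an illustrative sketch only; the project names, URLs, revisions, and paths are placeholders, not the real manifest:

```yaml
# Illustrative west.yml fragment -- all values are placeholders.
manifest:
  projects:
    - name: sdk-edge-ai          # Nordic's edge-AI module (placeholder entry)
      url: https://github.com/nrfconnect/sdk-edge-ai
      revision: main
      path: modules/lib/sdk-edge-ai
    - name: executorch           # ExecuTorch as a Zephyr module (placeholder entry)
      url: https://github.com/pytorch/executorch
      revision: main
      path: modules/lib/executorch
```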

PyTorch model (.pt, quantized) → ExecuTorch runtime (graph partitioner) → AXON delegate (TOSA → NPU kernels) → Zephyr RTOS (NPU device driver) → AXON NPU (nRF54LM20B silicon)

Get started

One Docker command. Everything pre-installed: NCS toolchain, ExecuTorch, PyTorch, Jupyter.

```bash
docker build -t axon-ai:latest docker/
docker run --rm -it \
  -v $(pwd):/workspace/axon-ai \
  -v ~/sdk-edge-ai:/opt/sdk-edge-ai:ro \
  -p 8888:8888 axon-ai:latest
```