No GPU. No VRAM. No tensor cores. No driver stack.
Just cache lines and clock cycles.
Modern AI is built on matrix multiplication — dense linear algebra that maps perfectly to GPU hardware. Transformers multiply enormous weight matrices thousands of times per token. Without a GPU, they crawl.
The Evolved Neural Circuit has no weight matrices. No matrix multiplies. No dense linear algebra at all. Each computational unit runs a tiny program — 16 bytes of branching, pointer chasing, and conditional logic. This is exactly what CPUs are built for, and exactly what GPUs are bad at.
The result: an AI architecture that is faster on CPU than it would be on GPU. Not "tolerable on CPU." Not "also runs on CPU." Genuinely faster, because the work is irregular, branching, and cache-local — the CPU's home turf.
GPUs are SIMD machines — they execute the same instruction across thousands of threads simultaneously. This is perfect for matrix multiplication, where every element does the same operation. But when threads need to branch differently, the GPU stalls. It's called thread divergence, and it destroys GPU utilization.
Every unit in the Evolved Neural Circuit runs a different tape. Each tape branches differently, reads from different pointers, hits different lookup tables. On a GPU, this means every thread in a warp would diverge on every instruction. The GPU would serialize what was supposed to be parallel work.

Each of the 2,048 units per layer executes its own tape with unique control flow. A CPU predicts branches independently on every core. GPUs force lockstep execution across warps — one divergent branch stalls the entire group.
Units read from arbitrary locations in arbitrary layers via pointer connections. This random-access pattern thrives on CPU cache prefetching and out-of-order execution. GPU memory access needs coalesced, predictable patterns to hit bandwidth targets.
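A minimal sketch of what a pointer connection might look like — the `Conn` struct and the modular addressing are assumptions, not the real encoding. The point is that a connection is an address, not a weight: two adjacent units can read from entirely different layers, so the access pattern is irregular by construction.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical pointer connection: instead of a weighted edge, a unit
// stores a raw address — which layer to read, and at what offset.
struct Conn {
    uint16_t layer;   // index of the source layer
    uint16_t offset;  // position within that layer
};

inline int32_t read_conn(const std::vector<std::vector<int32_t>>& layers,
                         Conn c) {
    const auto& src = layers[c.layer % layers.size()];  // wrap out-of-range
    return src[c.offset % src.size()];                  // pointers safely
}
```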
Each layer is exactly L1 cache size on Apple Silicon. The entire working set is always in the fastest memory on the chip. No cache misses. No memory stalls. The GPU equivalent — shared memory — requires explicit management and is smaller.
Zero GEMMs. Zero tensor core operations. The thing GPUs are specifically designed to accelerate doesn't exist in this architecture. Putting this workload on a GPU would be like hiring a crane to move a coffee cup.
8 independent boxes run on 8 CPU cores via GCD dispatch. Each box processes its own context slice. No synchronization between boxes during forward pass. Clean scaling with core count — the way CPUs are meant to parallelize.
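The fan-out pattern can be sketched portably with `std::thread` — the source uses GCD's `dispatch_apply` on Apple platforms, and `Box` with its summing forward pass is a hypothetical stand-in for tape execution over a context slice:

```cpp
#include <thread>
#include <vector>

// Each "box" owns a context slice; no shared writes during the forward
// pass, so boxes run with no locks and no synchronization until join.
struct Box {
    std::vector<int> slice;   // this box's slice of the context
    long result = 0;
    void forward() {          // stand-in for tape execution over the slice
        for (int v : slice) result += v;
    }
};

// Portable sketch of the dispatch pattern: one thread per box, and the
// only synchronization point is the final join.
void run_boxes(std::vector<Box>& boxes) {
    std::vector<std::thread> pool;
    pool.reserve(boxes.size());
    for (Box& b : boxes)
        pool.emplace_back([&b] { b.forward(); });  // one core per box
    for (auto& t : pool) t.join();
}
```

Because no box touches another box's state, this scales linearly with core count until memory bandwidth intervenes — the clean scaling the paragraph above describes.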
Tapes contain loops, conditional writes, and state-dependent branches. This is control-flow-heavy code — exactly what CPU branch predictors and speculative execution were engineered for over 30 years.
|  | Transformer LLM | Evolved Neural Circuit |
|---|---|---|
| Primary operation | Dense matrix multiply (GEMM) | Branching tape execution |
| Optimal hardware | GPU / TPU | CPU |
| Memory footprint | 4–70 GB (VRAM) | As small as 4 MB per region |
| Memory access pattern | Sequential, coalesced | Random, pointer-driven |
| Thread behavior | Uniform (SIMD-friendly) | Divergent (CPU-friendly) |
| Inference speed | 30–100 tok/s (consumer GPU) | CPU-native, no GPU needed |
| Power draw | 150–400W (GPU) | Milliwatts (single core) |
| Required driver stack | CUDA / ROCm / Metal | None — pure C++ |
| Runs on phones | Barely — quantized, slow | Native speed |
| Runs on embedded | No | Yes — any ARM/x86 |
A transformer needs a datacenter to think. The Evolved Neural Circuit needs a cache line.
This isn't a compromise — running on CPU because GPU isn't available. The architecture was evolved to exploit CPU strengths: branch prediction, out-of-order execution, deep cache hierarchies, and per-core independence. Moving it to a GPU would make it slower, not faster.
The brain has no maximum size. It grows by adding functional regions — the way the human cerebral cortex is organized into 180 distinct areas per hemisphere. The hippocampus decides when to create new regions and routes attention to the right ones. Each region is as small as 4 MB, fits in L2 cache, and runs on its own CPU core. When the brain outgrows available RAM, specialized knowledge swaps to disk and back on demand. The brain scales with the hardware it's given.
AI that runs at full speed on a phone. On a Raspberry Pi. On a medical device, a hearing aid, a car's ECU, a satellite. No cloud. No API call. No GPU allocation queue. No power budget measured in hundreds of watts. The entire history of deep learning has been a story of scaling GPU compute. The Evolved Neural Circuit asks a different question: what if the architecture was designed for the processor that's already in every device on Earth?