When the model becomes the chip

A Canadian startup is etching neural networks directly into silicon. The implications go further than the benchmarks suggest.

In February, a 24-person Canadian startup called Taalas did something the semiconductor industry had not seen before. They took Meta's Llama 3.1 8B language model and physically etched it into silicon. Not loaded it. Not cached it in high-bandwidth memory. Etched it. The model's weights are encoded permanently into the chip's metal layers using a structure called a mask ROM recall fabric. Write-once silicon. The result is a chip that does not run Llama 3.1. It is Llama 3.1.

The performance numbers are difficult to dismiss. Taalas claims roughly 17,000 tokens per second per user on their HC1 chip. For context, an Nvidia H200 achieves around 230 tokens per second on the same model, and Cerebras, widely considered the fastest inference provider in the industry, manages around 1,936. That puts the HC1 roughly 74 times faster than the H200 and nearly nine times faster than Cerebras. It draws 200 to 250 watts per card, compared to 350 to 400 watts for an H100. A standard air-cooled rack with ten HC1 cards runs at 2.5 kilowatts and delivers throughput that would otherwise require an entire GPU cluster with liquid cooling infrastructure.
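The arithmetic behind those claims is worth making explicit. A back-of-envelope sketch, using only the vendor figures quoted above (treat them as claims, not measurements):

```python
# Back-of-envelope arithmetic using the throughput and power figures
# quoted in the text. All numbers are vendor claims from the article.

hc1_tok_s = 17_000      # tokens/s per user, Taalas HC1 (claimed)
h200_tok_s = 230        # tokens/s, Nvidia H200, same model
cerebras_tok_s = 1_936  # tokens/s, Cerebras

speedup_vs_h200 = hc1_tok_s / h200_tok_s          # ~74x
speedup_vs_cerebras = hc1_tok_s / cerebras_tok_s  # ~8.8x

hc1_watts = 250                        # upper end of the quoted 200-250 W
tok_per_joule = hc1_tok_s / hc1_watts  # ~68 tokens per joule per card

cards_per_rack = 10
rack_kw = cards_per_rack * hc1_watts / 1000  # 2.5 kW, matching the article
rack_tok_s = cards_per_rack * hc1_tok_s      # 170,000 tokens/s per rack

print(f"{speedup_vs_h200:.0f}x vs H200, {speedup_vs_cerebras:.1f}x vs Cerebras")
print(f"{tok_per_joule:.0f} tok/J per card; {rack_tok_s:,} tok/s at {rack_kw} kW per rack")
```

The per-rack figure is what makes the "standard server room" claim later in the piece plausible: 2.5 kW is ordinary office-grade power.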

How you build a model into a chip

The approach is conceptually simple, even if the engineering is not. Taalas uses a compiler-like system that takes a model's computational graph and translates it directly into a physical chip layout. The model weights are stored using a one-transistor-per-weight density, with 8 billion parameters (in quantised formats) fitting on a single 815 mm² die fabricated on TSMC's 6nm process. The chip contains roughly 53 billion transistors. No HBM stacks. No 3D stacking. No exotic cooling.
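The quoted die numbers are internally consistent, which a quick sanity check shows (illustrative arithmetic only, assuming one transistor per stored weight as described):

```python
# Sanity-check the die figures quoted above. Purely illustrative:
# the one-transistor-per-weight assumption is taken from the text.

params = 8e9        # Llama 3.1 8B parameter count
transistors = 53e9  # claimed transistor count on the HC1 die
die_mm2 = 815       # claimed die area, TSMC 6nm

weight_fraction = params / transistors  # ~15% of transistors store weights
density = transistors / die_mm2 / 1e6   # ~65M transistors per mm^2

print(f"~{weight_fraction:.0%} of transistors store weights; "
      f"~{density:.0f}M transistors/mm^2")
```

The remaining ~85% of the transistor budget is presumably compute, routing, and the SRAM recall fabric described below, though the article does not break this down.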

The key engineering insight is that out of a chip's 100-plus fabrication layers, only the top two metal layers need to be customised to store the weights. The rest are standard. This means TSMC can produce the chip in roughly two months, compared to six months for a typical custom ASIC. Taalas has built what amounts to a weights-to-silicon pipeline: hand them a trained model, and they can have chips coming off the line in eight weeks.

There is an obvious limitation. The HC1 can only run Llama 3.1 8B. You cannot load a different model onto it. You cannot retrain it. It does one thing. But Taalas has included a programmable SRAM recall fabric that supports Low-Rank Adaptation (LoRA) and fine-tuning, which means you can specialise the base model for particular tasks or domains without new silicon. The base model is frozen in the metal. The adaptations live in reprogrammable memory.
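The frozen-base-plus-adapter split is standard LoRA mathematics. A minimal sketch of the mechanism, with hypothetical dimensions (Taalas has not published their adapter format; this shows only the general technique):

```python
# Minimal LoRA sketch: frozen base weights plus a reprogrammable
# low-rank adapter. Dimensions and initialisation are illustrative,
# not Taalas's actual scheme, which is not public.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical layer width and LoRA rank (r << d)

W = rng.normal(0, 0.02, (d, d))  # base weights: frozen, "etched in metal"
A = rng.normal(0, 0.02, (r, d))  # low-rank factors: small enough for SRAM
B = np.zeros((d, r))             # zero-initialised, so the adapter starts as a no-op

def forward(x):
    # Effective weight is W + B @ A, but the low-rank path is computed
    # separately and added: the frozen W is never modified.
    return W @ x + B @ (A @ x)

x = rng.normal(0, 1, d)
# With B = 0, the adapter changes nothing: output equals the base model's.
assert np.allclose(forward(x), W @ x)
```

The storage asymmetry is the point: the adapter holds 2·d·r values against d² for the base layer, which is why it fits in a small reprogrammable SRAM fabric while the base weights do not.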

The economics of doing one thing

The semiconductor industry has a long history of this trade-off between generality and efficiency. ASICs (application-specific integrated circuits) have always been faster and more power-efficient than general-purpose processors for any given task. The reason GPUs dominate AI is not that they are optimal. It is that the field has been moving too fast to commit to fixed silicon. When the best model changes every six months, you need hardware that can run whatever comes next.

But inference is different from training. When you deploy a model at scale, you are running the same model, with the same architecture, billions of times. The research phase is over. You are in production. And in production, the calculus shifts. A chip that costs 20x less to manufacture and uses 10x less power starts to look less like a niche curiosity and more like the obvious architecture for high-volume inference.

Taalas has raised $219 million. That is not curiosity money. The founding team came from AMD and Tenstorrent, which is to say they understand the chip business and chose to build something that breaks most of its conventions.

The timeline for specialised silicon

Taalas is the most visible example, but the broader trend is unmistakable. Custom ASIC shipments from cloud providers are projected to grow 44.6% in 2026, compared to 16.1% growth for GPU shipments. ASIC share in the AI inference market is expected to grow from 15% in 2024 to 40% in 2026. The industry has a name for this moment: the "inference flip," where inference workloads now account for two-thirds of all AI compute, surpassing training for the first time.

Google's TPU v7 (Ironwood) achieves 9.6 terabits per second per chip and supports pods of 9,216 accelerators, recently used to train Gemini 3 entirely on Google's own silicon. Amazon's Trainium2 uses custom NeuronLink interconnects at 1 TB/s chip-to-chip. Microsoft has built Maia 100, its first in-house AI chip. The hyperscalers are not waiting for Nvidia to solve the inference efficiency problem. They are building their own answers.

The Taalas approach is the extreme end of this spectrum. Google and Amazon are building chips optimised for AI workloads in general. Taalas is building chips optimised for a single model. The question is whether the economics of high-volume inference make that extreme position rational. If you are running Llama at scale (and a lot of companies are running Llama at scale), a 10x power reduction and 10x speed improvement may be worth the inflexibility.

Taalas' roadmap gives some indication of the pace. Their second model, a mid-sized reasoning LLM on the same HC1 platform, is expected in labs this spring. A frontier model exceeding 20 billion parameters on their second-generation HC2 platform is planned for winter 2026-27. HC2 adopts standard 4-bit floating-point and supports multi-chip designs. If that timeline holds, by early 2027 we will have hardwired inference chips running frontier-class models.

What the landscape looks like then

If specialised inference silicon works at scale, the implications ripple outward in ways that are not obvious from the benchmarks.

The first is geographic. GPU-based inference requires enormous power density, liquid cooling, and proximity to high-bandwidth networks. A rack of air-cooled HC1 cards running at 2.5 kilowatts could operate in a standard server room. That changes where inference can happen. Edge deployment becomes feasible for workloads that currently require cloud infrastructure. Countries and organisations that cannot build hyperscale datacenters get access to high-throughput AI. The centralisation that currently characterises AI deployment starts to loosen.

The second is economic. If the cost of running inference drops by an order of magnitude, the business models built on selling inference compute change shape. The margin structure of API providers shifts. The threshold for building custom AI applications drops. Use cases that are currently uneconomic (real-time AI in low-margin industries, always-on assistants for individual users, AI-powered monitoring of physical systems) become viable. When compute gets cheap enough, people find things to do with it that nobody predicted. This has happened with every prior generation of silicon.

The third is environmental. If global AI inference is heading toward 90 TWh per year, and specialised chips can deliver the same throughput at one-tenth the power, the energy arithmetic changes materially. Not enough to make AI's footprint vanish. But enough to change the trajectory from "requires new power plants" to "requires better engineering." Those are different conversations.

What becomes the next frontier

If inference becomes cheap, fast, and widely distributed, the bottleneck moves. It always does. The question is where.

One candidate is model quality at small scale. If you can run an 8-billion parameter model at 17,000 tokens per second, the constraint is no longer speed. It is whether an 8-billion parameter model is good enough for your task. The race shifts from "how do we run large models efficiently" to "how do we make small models smarter." Distillation techniques, architecture search, and training data curation become the valuable engineering problems. The chip solved the compute problem. Now you need to solve the intelligence-per-parameter problem.

A second candidate is orchestration. When individual inference calls are near-instantaneous and cheap, you can compose them. Multi-agent systems, chain-of-thought workflows, AI systems that call other AI systems hundreds of times per task. The latency budget that currently constrains agentic AI disappears. The complexity of what you can build with AI as a component goes up dramatically. The new bottleneck becomes coordination: how do you design systems that use thousands of cheap inference calls to do something coherent.
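A rough worked example of why the latency budget changes, using the per-user throughput figures from earlier in the piece (a hypothetical workload; real agentic systems would batch and parallelise):

```python
# Hypothetical agentic workload: 200 strictly sequential model calls,
# ~500 generated tokens each. Throughput figures from the article;
# this ignores prompt processing and network overhead.
calls, tokens_per_call = 200, 500

for name, tok_s in [("HC1", 17_000), ("H200", 230)]:
    seconds = calls * tokens_per_call / tok_s
    print(f"{name}: {seconds:.0f} s end-to-end")
```

Under these assumptions the same workflow takes around six seconds on the HC1 and over seven minutes on the GPU. One is an interactive experience; the other is a batch job. That gap is the design space the orchestration argument points at.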

A third, and perhaps the most consequential, is the relationship between inference and the physical world. Today, AI models sit in datacenters and process text, images, and code. If inference moves to the edge (embedded in vehicles, industrial systems, medical devices, infrastructure monitoring), AI starts interacting with physical systems at machine speed. The frontier shifts from intelligence to embodiment. From "can the model answer the question" to "can the model act on the answer before the situation changes."

There is a version of this that is utopian. Cheap, distributed, energy-efficient AI running specialised models at the edge, embedded in everything, making every system more adaptive. There is also a version that is concerning. A world saturated with AI inference, running models whose behaviour is frozen in silicon and whose adaptations are controlled by whoever owns the LoRA weights. The centralisation does not disappear. It just moves from the datacenter to the model layer.

The pattern beneath the pattern

When I worked on the Muon g-2 experiment at Fermilab, we spent enormous effort on systematic uncertainties. Not the statistical noise. The structural biases in the measurement apparatus itself. The things that were true about every data point, invisibly, because of how the detector was built.

Hardwired inference silicon has its own version of this. When you etch a model into a chip, you are committing to a particular set of weights, a particular quantisation scheme, a particular view of what that model should do. The HC1 uses aggressive 3-bit and 6-bit quantisation, which introduces quality degradations relative to full-precision inference. Those degradations are not random. They are structural. Every response from that chip carries the systematic uncertainty of those design choices, invisibly, because of how the silicon was built.
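The "structural, not random" distinction can be made concrete. A toy sketch with a uniform quantiser (a crude stand-in: Taalas's actual 3-bit and 6-bit schemes are not public):

```python
# Toy illustration of quantisation error being systematic rather than
# random. The uniform symmetric quantiser here is a stand-in, not
# Taalas's actual scheme.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10_000)  # stand-in weight distribution

def quantize(x, bits):
    # Snap each value to the nearest of 2**bits evenly spaced levels.
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    step = 2 * scale / (levels - 1)
    return np.round(x / step) * step

for bits in (3, 6):
    err = w - quantize(w, bits)
    print(f"{bits}-bit: RMS error {np.sqrt(np.mean(err ** 2)):.4f}")

# The error is deterministic: the same weights always round the same
# way, so every inference carries the identical bias.
assert np.array_equal(quantize(w, 3), quantize(w, 3))
```

Noise averages out over many samples; a deterministic rounding of the same weights does not. That is exactly the detector-systematics analogy: the error is baked into the apparatus, identical in every measurement.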

That is not a reason to dismiss the approach. It is a reason to understand it clearly. The shift from general-purpose to specialised inference hardware is real, it is accelerating, and it will change the economics and geography of AI deployment. But every design choice that makes it faster also makes it less flexible. Every watt saved is a degree of freedom lost. The companies that navigate this trade-off well will define the next phase of AI infrastructure. The ones that mistake speed for completeness will learn the difference the hard way.

For researchers trying to understand AI's trajectory, the lesson from the interconnect piece holds: follow the hardware. But now the hardware is not just the chips and the plumbing. It is the model itself, frozen in metal, running at 17,000 tokens per second, doing one thing extraordinarily well while being structurally incapable of doing anything else.

The future of AI inference will not be one architecture. It will be a spectrum: general-purpose GPUs for training and experimentation, optimised ASICs for high-volume production inference, and at the far end, silicon that has been permanently fused with a specific model. The interesting question is not which wins. It is where each sits in the stack, and who controls the boundary between them.
