When the model becomes the chip

A Canadian startup is etching neural networks directly into silicon. The implications go further than the benchmarks suggest.

In February, a 24-person Canadian startup called Taalas did something the semiconductor industry had not seen before. They took Meta's Llama 3.1 8B language model and physically etched it into silicon. Not loaded it. Not cached it in high-bandwidth memory. Etched it. The model's weights are encoded permanently into the chip's metal layers using a structure called a mask ROM recall fabric. Write-once silicon. The result is a chip that does not run Llama 3.1. It is Llama 3.1.

The performance numbers are difficult to dismiss. Taalas claims roughly 17,000 tokens per second per user on their HC1 chip. For context, an Nvidia H200 achieves around 230 tokens per second on the same model, and Cerebras, widely considered the fastest inference provider in the industry, manages around 1,936. That puts the HC1 roughly 74 times faster than the H200 and nearly nine times faster than Cerebras. It draws 200 to 250 watts per card, compared to 350 to 400 watts for an H100. A standard air-cooled rack with ten HC1 cards runs at 2.5 kilowatts and delivers throughput that would otherwise require an entire GPU cluster with liquid cooling infrastructure.
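The arithmetic behind those claims is worth making explicit. A back-of-envelope sketch, using only the vendor figures quoted above (treat them as claims, not measurements):

```python
# Back-of-envelope arithmetic using the throughput and power figures
# quoted in the text. All numbers are vendor claims from the article.

hc1_tok_s = 17_000      # tokens/s per user, Taalas HC1 (claimed)
h200_tok_s = 230        # tokens/s, Nvidia H200, same model
cerebras_tok_s = 1_936  # tokens/s, Cerebras

speedup_vs_h200 = hc1_tok_s / h200_tok_s          # ~74x
speedup_vs_cerebras = hc1_tok_s / cerebras_tok_s  # ~8.8x

hc1_watts = 250                        # upper end of the quoted 200-250 W
tok_per_joule = hc1_tok_s / hc1_watts  # ~68 tokens per joule per card

cards_per_rack = 10
rack_kw = cards_per_rack * hc1_watts / 1000  # 2.5 kW, matching the article
rack_tok_s = cards_per_rack * hc1_tok_s      # 170,000 tokens/s per rack

print(f"{speedup_vs_h200:.0f}x vs H200, {speedup_vs_cerebras:.1f}x vs Cerebras")
print(f"{tok_per_joule:.0f} tok/J per card; {rack_tok_s:,} tok/s at {rack_kw} kW per rack")
```

The per-rack figure is what makes the "standard server room" claim later in the piece plausible: 2.5 kW is ordinary office-grade power.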

How you build a model into a chip

The approach is conceptually simple, even if the engineering is not. Taalas uses a compiler-like system that takes a model's computational graph and translates it directly into a physical chip layout. The model weights are stored using a one-transistor-per-weight density, with 8 billion parameters (in quantised formats) fitting on a single 815 mm² die fabricated on TSMC's 6nm process. The chip contains roughly 53 billion transistors. No HBM stacks. No 3D stacking. No exotic cooling.
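The quoted die numbers are internally consistent, which a quick sanity check shows (illustrative arithmetic only, assuming one transistor per stored weight as described):

```python
# Sanity-check the die figures quoted above. Purely illustrative:
# the one-transistor-per-weight assumption is taken from the text.

params = 8e9        # Llama 3.1 8B parameter count
transistors = 53e9  # claimed transistor count on the HC1 die
die_mm2 = 815       # claimed die area, TSMC 6nm

weight_fraction = params / transistors  # ~15% of transistors store weights
density = transistors / die_mm2 / 1e6   # ~65M transistors per mm^2

print(f"~{weight_fraction:.0%} of transistors store weights; "
      f"~{density:.0f}M transistors/mm^2")
```

The remaining ~85% of the transistor budget is presumably compute, routing, and the SRAM recall fabric described below, though the article does not break this down.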

The key engineering insight is that out of a chip's 100-plus fabrication layers, only the top two metal layers need to be customised to store the weights. The rest are standard. This means TSMC can produce the chip in roughly two months, compared to six months for a typical custom ASIC. Taalas has built what amounts to a weights-to-silicon pipeline: hand them a trained model, and they can have chips coming off the line in eight weeks.

There is an obvious limitation. The HC1 can only run Llama 3.1 8B. You cannot load a different model onto it. You cannot retrain it. It does one thing. But Taalas has included a programmable SRAM recall fabric that supports Low-Rank Adaptation (LoRA) and fine-tuning, which means you can specialise the base model for particular tasks or domains without new silicon. The base model is frozen in the metal. The adaptations live in reprogrammable memory.
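The frozen-base-plus-adapter split is standard LoRA mathematics. A minimal sketch of the mechanism, with hypothetical dimensions (Taalas has not published their adapter format; this shows only the general technique):

```python
# Minimal LoRA sketch: frozen base weights plus a reprogrammable
# low-rank adapter. Dimensions and initialisation are illustrative,
# not Taalas's actual scheme, which is not public.
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # hypothetical layer width and LoRA rank (r << d)

W = rng.normal(0, 0.02, (d, d))  # base weights: frozen, "etched in metal"
A = rng.normal(0, 0.02, (r, d))  # low-rank factors: small enough for SRAM
B = np.zeros((d, r))             # zero-initialised, so the adapter starts as a no-op

def forward(x):
    # Effective weight is W + B @ A, but the low-rank path is computed
    # separately and added: the frozen W is never modified.
    return W @ x + B @ (A @ x)

x = rng.normal(0, 1, d)
# With B = 0, the adapter changes nothing: output equals the base model's.
assert np.allclose(forward(x), W @ x)
```

The storage asymmetry is the point: the adapter holds 2·d·r values against d² for the base layer, which is why it fits in a small reprogrammable SRAM fabric while the base weights do not.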

The economics of doing one thing

The semiconductor industry has a long history of this trade-off between generality and efficiency. ASICs (application-specific integrated circuits) have always been faster and more power-efficient than general-purpose processors for any given task. The reason GPUs dominate AI is not that they are optimal. It is that the field has been moving too fast to commit to fixed silicon. When the best model changes every six months, you need hardware that can run whatever comes next.

But inference is different from training. When you deploy a model at scale, you are running the same model, with the same architecture, billions of times. The research phase is over. You are in production. And in production, the calculus shifts. A chip that costs 20x less to manufacture and uses 10x less power starts to look less like a niche curiosity and more like the obvious architecture for high-volume inference.

Taalas has raised $219 million. That is not curiosity money. The founding team came from AMD and Tenstorrent, which is to say they understand the chip business and chose to build something that breaks most of its conventions.

The timeline for specialised silicon

Taalas is the most visible example, but the broader trend is unmistakable. Custom ASIC shipments from cloud providers are projected to grow 44.6% in 2026, compared to 16.1% growth for GPU shipments. ASIC share in the AI inference market is expected to grow from 15% in 2024 to 40% in 2026. The industry has a name for this moment: the "inference flip," where inference workloads now account for two-thirds of all AI compute, surpassing training for the first time.

Google's TPU v7 (Ironwood) achieves 9.6 terabits per second per chip and supports pods of 9,216 accelerators, recently used to train Gemini 3 entirely on Google's own silicon. Amazon's Trainium2 uses custom NeuronLink interconnects at 1 TB/s chip-to-chip. Microsoft has built Maia 100, its first in-house AI chip. The hyperscalers are not waiting for Nvidia to solve the inference efficiency problem. They are building their own answers.

The Taalas approach is the extreme end of this spectrum. Google and Amazon are building chips optimised for AI workloads in general. Taalas is building chips optimised for a single model. The question is whether the economics of high-volume inference make that extreme position rational. If you are running Llama at scale (and a lot of companies are running Llama at scale), a 10x power reduction and 10x speed improvement may be worth the inflexibility.

Taalas' roadmap gives some indication of the pace. Their second model, a mid-sized reasoning LLM on the same HC1 platform, is expected in labs this spring. A frontier model exceeding 20 billion parameters on their second-generation HC2 platform is planned for winter 2026-27. HC2 adopts standard 4-bit floating-point and supports multi-chip designs. If that timeline holds, by early 2027 we will have hardwired inference chips running frontier-class models.

What the landscape looks like then

If specialised inference silicon works at scale, the implications ripple outward in ways that are not obvious from the benchmarks.

The first is geographic. GPU-based inference requires enormous power density, liquid cooling, and proximity to high-bandwidth networks. A rack of air-cooled HC1 cards running at 2.5 kilowatts could operate in a standard server room. That changes where inference can happen. Edge deployment becomes feasible for workloads that currently require cloud infrastructure. Countries and organisations that cannot build hyperscale datacenters get access to high-throughput AI. The centralisation that currently characterises AI deployment starts to loosen.

The second is economic. If the cost of running inference drops by an order of magnitude, the business models built on selling inference compute change shape. The margin structure of API providers shifts. The threshold for building custom AI applications drops. Use cases that are currently uneconomic (real-time AI in low-margin industries, always-on assistants for individual users, AI-powered monitoring of physical systems) become viable. When compute gets cheap enough, people find things to do with it that nobody predicted. This has happened with every prior generation of silicon.

The third is environmental. If global AI inference is heading toward 90 TWh per year, and specialised chips can deliver the same throughput at one-tenth the power, the energy arithmetic changes materially. Not enough to make AI's footprint vanish. But enough to change the trajectory from "requires new power plants" to "requires better engineering." Those are different conversations.

What becomes the next frontier

If inference becomes cheap, fast, and widely distributed, the bottleneck moves. It always does. The question is where.

One candidate is model quality at small scale. If you can run an 8-billion parameter model at 17,000 tokens per second, the constraint is no longer speed. It is whether an 8-billion parameter model is good enough for your task. The race shifts from "how do we run large models efficiently" to "how do we make small models smarter." Distillation techniques, architecture search, and training data curation become the valuable engineering problems. The chip solved the compute problem. Now you need to solve the intelligence-per-parameter problem.

A second candidate is orchestration. When individual inference calls are near-instantaneous and cheap, you can compose them. Multi-agent systems, chain-of-thought workflows, AI systems that call other AI systems hundreds of times per task. The latency budget that currently constrains agentic AI disappears. The complexity of what you can build with AI as a component goes up dramatically. The new bottleneck becomes coordination: how do you design systems that use thousands of cheap inference calls to do something coherent.
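A rough worked example of why the latency budget changes, using the per-user throughput figures from earlier in the piece (a hypothetical workload; real agentic systems would batch and parallelise):

```python
# Hypothetical agentic workload: 200 strictly sequential model calls,
# ~500 generated tokens each. Throughput figures from the article;
# this ignores prompt processing and network overhead.
calls, tokens_per_call = 200, 500

for name, tok_s in [("HC1", 17_000), ("H200", 230)]:
    seconds = calls * tokens_per_call / tok_s
    print(f"{name}: {seconds:.0f} s end-to-end")
```

Under these assumptions the same workflow takes around six seconds on the HC1 and over seven minutes on the GPU. One is an interactive experience; the other is a batch job. That gap is the design space the orchestration argument points at.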

A third, and perhaps the most consequential, is the relationship between inference and the physical world. Today, AI models sit in datacenters and process text, images, and code. If inference moves to the edge (embedded in vehicles, industrial systems, medical devices, infrastructure monitoring), AI starts interacting with physical systems at machine speed. The frontier shifts from intelligence to embodiment. From "can the model answer the question" to "can the model act on the answer before the situation changes."

There is a version of this that is utopian. Cheap, distributed, energy-efficient AI running specialised models at the edge, embedded in everything, making every system more adaptive. There is also a version that is concerning. A world saturated with AI inference, running models whose behaviour is frozen in silicon and whose adaptations are controlled by whoever owns the LoRA weights. The centralisation does not disappear. It just moves from the datacenter to the model layer.

The pattern beneath the pattern

When I worked on the Muon g-2 experiment at Fermilab, we spent enormous effort on systematic uncertainties. Not the statistical noise. The structural biases in the measurement apparatus itself. The things that were true about every data point, invisibly, because of how the detector was built.

Hardwired inference silicon has its own version of this. When you etch a model into a chip, you are committing to a particular set of weights, a particular quantisation scheme, a particular view of what that model should do. The HC1 uses aggressive 3-bit and 6-bit quantisation, which introduces quality degradations relative to full-precision inference. Those degradations are not random. They are structural. Every response from that chip carries the systematic uncertainty of those design choices, invisibly, because of how the silicon was built.
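The "structural, not random" distinction can be made concrete. A toy sketch with a uniform quantiser (a crude stand-in: Taalas's actual 3-bit and 6-bit schemes are not public):

```python
# Toy illustration of quantisation error being systematic rather than
# random. The uniform symmetric quantiser here is a stand-in, not
# Taalas's actual scheme.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 1, 10_000)  # stand-in weight distribution

def quantize(x, bits):
    # Snap each value to the nearest of 2**bits evenly spaced levels.
    levels = 2 ** bits
    scale = np.max(np.abs(x))
    step = 2 * scale / (levels - 1)
    return np.round(x / step) * step

for bits in (3, 6):
    err = w - quantize(w, bits)
    print(f"{bits}-bit: RMS error {np.sqrt(np.mean(err ** 2)):.4f}")

# The error is deterministic: the same weights always round the same
# way, so every inference carries the identical bias.
assert np.array_equal(quantize(w, 3), quantize(w, 3))
```

Noise averages out over many samples; a deterministic rounding of the same weights does not. That is exactly the detector-systematics analogy: the error is baked into the apparatus, identical in every measurement.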

That is not a reason to dismiss the approach. It is a reason to understand it clearly. The shift from general-purpose to specialised inference hardware is real, it is accelerating, and it will change the economics and geography of AI deployment. But every design choice that makes it faster also makes it less flexible. Every watt saved is a degree of freedom lost. The companies that navigate this trade-off well will define the next phase of AI infrastructure. The ones that mistake speed for completeness will learn the difference the hard way.

For researchers trying to understand AI's trajectory, the lesson from the interconnect piece holds: follow the hardware. But now the hardware is not just the chips and the plumbing. It is the model itself, frozen in metal, running at 17,000 tokens per second, doing one thing extraordinarily well while being structurally incapable of doing anything else.

The future of AI inference will not be one architecture. It will be a spectrum: general-purpose GPUs for training and experimentation, optimised ASICs for high-volume production inference, and at the far end, silicon that has been permanently fused with a specific model. The interesting question is not which wins. It is where each sits in the stack, and who controls the boundary between them.
