Energy-efficient AI model could be a game changer, 50 times better efficiency with no performance hit

Shawn Knight

Cutting corners: Researchers from the University of California, Santa Cruz, have devised a way to run a billion-parameter-scale large language model using just 13 watts of power – about as much as a modern LED light bulb. For comparison, a data center-grade GPU used for LLM tasks requires around 700 watts.

AI up to this point has largely been a race to be first, with little consideration for metrics like efficiency. Looking to change that, the researchers trimmed out an intensive technique called matrix multiplication. This technique assigns words to numbers, stores them in matrices, and multiplies them together to create language. As you can imagine, it is rather hardware intensive.
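For a rough sense of what that involves, here is a minimal sketch (illustrative only, not the researchers' code) of a single dense layer, where one token's activations are multiplied against a full-precision weight matrix and every output element requires a long chain of multiply-accumulate operations:

```python
# Illustrative sketch of a conventional dense layer (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 1024, 4096                                    # hypothetical layer sizes
x = rng.standard_normal(d_in).astype(np.float32)            # one token's activations
W = rng.standard_normal((d_in, d_out)).astype(np.float32)   # full-precision weights

# Each of the d_out outputs needs d_in multiply-accumulates,
# so this single layer costs d_in * d_out multiplications per token.
y = x @ W
print(y.shape, f"{d_in * d_out:,} multiplications for one token")
```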

The team's revised approach instead forces all of the numbers in their neural network matrices to be ternary, meaning they can only have one of three values: negative one, zero, or positive one. This key change was inspired by a paper from Microsoft, and means that all computation involves summing rather than multiplying – an approach that is far less hardware intensive.
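Here is a minimal sketch of the idea, assuming a naive threshold quantizer (the team's actual quantization scheme may differ): once every weight is negative one, zero, or positive one, each output can be formed by selecting, adding, and subtracting inputs, with no multiplications at all.

```python
# Minimal sketch, assuming a naive threshold quantizer (the paper's actual
# scheme may differ): with ternary weights, the matrix product collapses
# into additions and subtractions of the inputs.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.standard_normal(d_in)
W = rng.standard_normal((d_in, d_out))

# Quantize each weight to -1, 0, or +1 (threshold of 0.5 chosen arbitrarily).
W_t = np.where(W > 0.5, 1, np.where(W < -0.5, -1, 0)).astype(np.int8)

# "Matmul" without multiplying: for each output column, add the inputs
# where the weight is +1 and subtract them where it is -1.
y = np.array([x[W_t[:, j] == 1].sum() - x[W_t[:, j] == -1].sum()
              for j in range(d_out)])

# Matches the conventional matrix product with the same ternary weights.
assert np.allclose(y, x @ W_t)
print(y)
```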

On the hardware side, the team also created custom hardware using a highly customizable circuit called a field-programmable gate array (FPGA), which allowed them to maximize all of the energy-saving features baked into the neural network.

Running on the custom hardware, the team's neural network is more than 50 times more efficient than a typical setup. Better yet, it delivers the same sort of performance as a top-tier model like Meta's Llama.

It's worth noting that custom hardware isn't necessary with the new approach – it's just icing on the cake. The neural network was designed to run on standard GPUs that are common in the AI industry, and testing revealed roughly 10 times less memory consumption compared to a multiplication-based neural network. Requiring less memory could open the door to full-fledged neural networks on mobile devices like smartphones.
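As a back-of-the-envelope illustration (the figures below are assumptions, not numbers from the paper): half-precision weights take 16 bits each, while ternary weights can be packed into roughly 2 bits, which by itself puts the weight storage in the same ballpark as the memory savings mentioned above.

```python
# Back-of-the-envelope weight-memory comparison with made-up numbers.
params = 1.3e9                        # hypothetical ~billion-parameter model

fp16_gb    = params * 16 / 8 / 1e9    # 16 bits per fp16 weight
ternary_gb = params * 2  / 8 / 1e9    # ~2 bits per packed ternary weight

print(f"fp16 weights:    {fp16_gb:.2f} GB")
print(f"ternary weights: {ternary_gb:.2f} GB  (~{fp16_gb / ternary_gb:.0f}x smaller)")
```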

With this sort of efficiency gain in play, and given a full data center's worth of power, AI could soon take another huge leap forward.

Image credit: Emrah Tolu, Pixabay

 
One group claimed gains from building hardware focused on matrix multiplication.

Another group now claims gains from eliminating it.
 
It’s been tested on significantly lower workloads than what the Nvidia GPUs are being used for, about 1,000x lower. Still impressive, and I'm interested to see how it scales up.
 
Watts during a workload is not a measure of efficiency. Watt-seconds (joules) to complete the task is. I see this much too often from media outlets, and I assume it isn’t what the paper actually said, because researchers would know this… please report on energy consumed, not power. The latter figure is only interesting when buying a PSU or configuring your electrical network.
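To put made-up numbers on it: a 13 W device that takes much longer per task can still burn more joules than a 700 W one, so the wattage alone tells you very little.

```python
# Toy comparison with invented figures: energy per task = power x time.
def joules(watts: float, seconds: float) -> float:
    return watts * seconds

gpu_energy  = joules(700, 1.0)    # hypothetical: 700 W GPU finishes a task in 1 s
fpga_energy = joules(13, 60.0)    # hypothetical: 13 W device needs 60 s for the same task

print(f"GPU:  {gpu_energy:.0f} J per task")
print(f"FPGA: {fpga_energy:.0f} J per task")   # lower watts, yet more energy in this example
```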
 
Exactly this. The term "efficiency" is very poorly used in this context. The actual paper talks about using 13W at "beyond human readable throughput," which is far slower than the higher-power solutions.
 
Game changer? What game, exactly? There isn't even a finished game yet; this stuff is still in full development. And you know what this means: it'll enable someone to build more complex systems that might do more per watt, but use more energy in the end.
 
This technology is still new, so breakthroughs are certainly possible. If we see a bunch of new papers exploring this approach then maybe it has promise. Right now any radically different solution should be treated with scepticism as there is lots of money and hype.
 