The era of 1-bit LLMs: lower compute and costs

Read Time 5 mins | Written by: Cole

LLMs like Llama use 16 or 8 bits to represent each of their billions of parameters. A new study from AI researchers shows that an average of just 1.58 bits per parameter is enough to scale LLMs, with huge business benefits.

1-bit LLMs require much less memory, compute, and energy to run. They haven’t reached production scale yet, but the BitNet b1.58 model is worth knowing about now.

This could also pave the way for a whole new class of hardware optimized for 1-bit LLMs.

How do current LLMs compare to the new BitNet b1.58 LLM?

Most current widely used large language models – GPT-3, PaLM, Llama 2, etc. – store their parameters in 16-bit floating point (FP16 or BF16), or in 8-bit formats after quantization. This allows a high degree of precision in representing the weights of the neural network.

In contrast, the BitNet b1.58 model introduced in this paper uses a ternary representation: each parameter takes one of three values, {-1, 0, 1}, which works out to log2(3) ≈ 1.58 bits of information per parameter. That’s far more compact than the 16 or 8 bits used in conventional LLMs.
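
For a rough sense of the difference, here’s some back-of-the-envelope Python for weight storage alone. These are our own illustrative numbers, not the paper’s measured savings (which also account for activations and runtime overhead):

```python
# Back-of-the-envelope weight-storage math for a 3B-parameter model.
# Illustrative only -- real-world savings also depend on activations,
# the KV cache, and how the runtime packs ternary values.
params = 3_000_000_000

fp16_gb    = params * 2 / 1e9            # 16 bits = 2 bytes per weight
int8_gb    = params * 1 / 1e9            # 8 bits = 1 byte per weight
ternary_gb = params * 1.585 / 8 / 1e9    # log2(3) ~= 1.585 bits per weight (theoretical floor)
packed_gb  = params / 5 / 1e9            # 5 ternary digits fit in one byte (3^5 = 243 < 256)

print(f"FP16 weights:    {fp16_gb:.1f} GB")     # ~6.0 GB
print(f"INT8 weights:    {int8_gb:.1f} GB")     # ~3.0 GB
print(f"Ternary weights: {ternary_gb:.2f} GB (ideal), ~{packed_gb:.2f} GB byte-packed")
```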

The key difference is that BitNet b1.58 can match the performance of FP16 models of the same size trained on the same data, while requiring significantly less memory, computation, and energy. Ternary weights are cheaper to store, and because every weight is -1, 0, or 1, most of the multiplications in the model’s matrix math reduce to additions and subtractions.
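
To make that concrete, here’s a toy NumPy sketch (ours, not the paper’s optimized kernels) showing why ternary weights are so cheap to compute with:

```python
import numpy as np

# Toy example: with weights in {-1, 0, 1}, a matrix-vector product needs no
# floating-point multiplications -- each weight just adds, subtracts, or skips x.
W = np.array([[ 1,  0, -1],
              [-1,  1,  0]])          # ternary weights
x = np.array([0.5, -2.0, 3.0])        # full-precision activations

reference = W @ x                     # ordinary matmul, for comparison

# Multiplication-free version: add where the weight is +1, subtract where it is -1
mul_free = np.array([x[row == 1].sum() - x[row == -1].sum() for row in W])

print(reference, mul_free)            # both print [-2.5 -2.5]
```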

However, existing LLMs have not yet adopted this 1-bit architecture, and still rely on higher-precision representations.

There are a few reasons for this:

  1. Established infrastructure: Many current ML frameworks and hardware are optimized for FP16 or FP8 computations, so there will be switching costs to adopting 1-bit architectures.
  2. Proven performance: While the BitNet paper shows promising results, more research may be needed to demonstrate that 1-bit LLMs can consistently match the performance of higher-precision models across a wide range of tasks and domains.
  3. Training challenges: Training 1-bit models requires specialized techniques like quantization-aware training, which are less mature and less well understood than conventional full-precision training (see the sketch below).
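
To illustrate the kind of technique involved, here’s a minimal PyTorch-style sketch of quantization-aware training using a straight-through estimator. The names (ternary_quantize, TernaryLinear) are ours, and this is a generic illustration of the idea, not BitNet’s actual training code:

```python
import torch
import torch.nn.functional as F

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Round weights to {-1, 0, 1} after scaling by their mean absolute value."""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1)

class TernaryLinear(torch.nn.Linear):
    """Linear layer trained with quantization-aware training.

    The forward pass sees ternary weights, but gradients flow to the latent
    full-precision weights via the straight-through estimator (STE).
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w_q = ternary_quantize(self.weight)
        # STE: forward uses w_q, backward treats quantization as the identity.
        w = self.weight + (w_q - self.weight).detach()
        return F.linear(x, w, self.bias)

# Usage: a drop-in replacement for nn.Linear while experimenting
layer = TernaryLinear(64, 32)
out = layer(torch.randn(8, 64))
```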

Business benefits of 1-bit LLMs

Basically, you’ll need less of everything to run 1-bit LLMs: less memory, fewer GPUs, less electricity, fewer data centers, and so on.

That means 1-bit LLMs can save you a lot of money and open up new potential for generative AI applications.

  • Lower infrastructure costs: The reduced memory footprint and faster inference times of 1-bit LLMs would allow businesses to run these models on less expensive hardware or cloud instances, reducing the overall infrastructure costs associated with AI deployments.
  • Improved scalability: With lower resource requirements, businesses could scale their AI applications more easily, serving more users or processing more data without hitting hardware limitations as quickly.
  • Edge and mobile deployment: The efficiency of 1-bit LLMs makes it more feasible to run them on edge devices or mobile phones, enabling new types of applications that require local, real-time language processing.
  • Faster development cycles: The ability to train and iterate on LLMs more quickly and cheaply could accelerate the development of new AI features and products, allowing businesses to bring innovations to market faster.
  • Greener AI: The energy savings of 1-bit LLMs could help businesses reduce the carbon footprint of their AI workloads, which is becoming an increasingly important consideration.
  • Wider accessibility: Lowering the barriers to training and deploying LLMs in terms of cost and resources could make this technology accessible to a broader range of businesses, including smaller organizations with limited AI budgets.
  • New architectures and use cases: The unique properties of 1-bit LLMs, such as their suitability for hardware acceleration, could inspire new AI architectures and enable novel applications that were previously infeasible with conventional LLMs.

Overall, the introduction of 1-bit LLMs like BitNet b1.58 can significantly lower the barriers to entry for businesses looking to deploy large-scale language models, while enabling new applications and business models that were previously not possible due to resource constraints.

BitNet b1.58 LLM technical paper details

This paper introduces BitNet b1.58, a variant of the BitNet architecture where every parameter is ternary, taking on values of {-1, 0, 1}. 

The key points are:

  • BitNet b1.58 matches the performance of full-precision (FP16) Transformer LLMs with the same model size and training tokens, while being significantly more efficient in terms of latency, memory consumption, throughput, and energy consumption.
  • It uses a quantization function to constrain weights to -1, 0, or +1, and adopts components from the LLaMA architecture like RMSNorm, SwiGLU, and rotary embeddings (a sketch of the quantization step follows this list).
  • Experiments show BitNet b1.58 matches LLaMA LLM at 3B model size in terms of perplexity and zero-shot accuracy on end tasks, while being 2.7x faster and using 3.5x less memory. A 3.9B BitNet b1.58 outperforms the 3B LLaMA LLM.
  • BitNet b1.58's latency and memory advantages increase with model scale. A 70B BitNet b1.58 is 4.1x faster and uses 7.2x less memory than a 70B LLaMA LLM. It also enables 8.9x higher throughput.
  • The authors argue BitNet b1.58 defines a new scaling law and computation paradigm for LLMs. They discuss implications for Mixture-of-Expert models, long sequence handling, edge/mobile deployment, and call for new hardware optimized for 1-bit LLMs.
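
For reference, here’s a minimal sketch of that quantization step, following the mean-absolute-value (absmean) recipe the paper describes. The code and naming are ours, not the authors’ implementation:

```python
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1}.

    gamma = mean(|W|);  W_q = clip(round(W / (gamma + eps)), -1, +1)
    The scale gamma is kept alongside the ternary weights so outputs can be
    rescaled at inference time.
    """
    gamma = w.abs().mean()
    w_q = (w / (gamma + eps)).round().clamp(-1, 1)
    return w_q, gamma

w = torch.randn(4, 4)
w_q, gamma = absmean_ternary(w)
print(w_q)   # every entry is -1, 0, or +1
```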

Read the full BitNet b1.58 paper here.

Want to hire AI experts to build your LLMs?

To build cost-effective LLMs with the latest tech stack, you need AI experts. Hiring internally could take 6-18 months, but you need to start building AI solutions now, not next year. That’s why Codingscape exists.

We can assemble a senior AI software engineering team for you in 4-6 weeks. It’ll be faster to get started, more cost-efficient than internal hiring, and we’ll deliver high-quality results quickly. We’ve been busy building LLM capabilities for our partners and helping them accomplish their AI roadmaps in 2024.

Zappos, Twilio, and Veho are just a few companies that trust us to build their software and systems with a remote-first approach.

You can schedule a time to talk with us here. No hassle, no expectations, just answers.

Cole

Cole is Codingscape's Content Marketing Strategist & Copywriter.