Why Using Local LLMs Might Be Better Than ChatGPT, Claude, or Gemini: A Simple Cost Breakdown
Large language models (LLMs) are now a key part of many applications and industries, from chatbots to content creation.
With big names like ChatGPT, Claude, and Gemini leading the way, a lot of people are starting to look at the perks of running LLMs on their own systems.
This article takes a closer look at why using local LLMs might be a better choice than popular cloud services, breaking down the costs, privacy benefits, and performance differences.
What Is a Local LLM?
Local LLMs are large language models that you run on your own computer or server, instead of using a cloud-based service.
These models, which can be open-source or licensed for on-premises use, are trained to understand and generate human-like text.
One big advantage of running LLMs locally is that it boosts your data privacy and security. Since everything stays on your own hardware, your data isn’t sent over the internet, which lowers the chances of breaches or unauthorized access.
What is a Token?
In the context of LLMs, a token is a basic unit of text that the model processes, which can represent whole words, parts of words, or individual characters.
Tokens are categorized into input tokens (derived from user prompts) and output tokens (generated by the model in response).
Different models use different tokenization methods, which affects how a given piece of text is split into tokens. Many cloud-based LLM services charge based on the number of tokens processed, which is why understanding token counts is essential for managing costs.
For example, if a model handles 1,000 input tokens and 1,500 output tokens, the total usage of 2,500 tokens would be used to calculate the cost under token-based pricing.
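As a rough illustration, you can count tokens locally for OpenAI-style models with the open-source tiktoken library (other vendors use their own tokenizers, so counts will differ):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by GPT-3.5/GPT-4-era models;
# Claude and Gemini use their own tokenizers with different counts.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Explain the difference between input and output tokens."
input_tokens = encoding.encode(prompt)

print(f"Prompt: {prompt!r}")
print(f"Token count: {len(input_tokens)}")
```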
How Do ChatGPT/Claude/Gemini Work?
ChatGPT, Claude, and Gemini are advanced large language models that use transformer-based machine learning to generate human-like text from input prompts.
Here’s a brief overview of how each model works and their pricing structures:
- ChatGPT: Made by OpenAI, ChatGPT uses a type of AI called a transformer to understand and generate text. It’s trained on a wide range of internet content, so it can handle tasks like answering questions and chatting.
- Claude: Created by Anthropic, Claude also uses transformer tech but focuses on safety and ethical responses. It’s designed to be more aligned and to avoid harmful outputs.
- Gemini: Developed by Google DeepMind, Gemini models use a similar transformer approach and are trained on huge amounts of data to produce high-quality text and understand language well.
Pricing and Token Usage
Pricing for these models typically depends on the number of tokens processed, including both input and output tokens. Here’s a quick glance at the pricing and sample calculations:
- ChatGPT (3.5/4/4o): Pricing varies by model version, with costs calculated per million tokens; newer, more capable versions such as GPT-4 cost more per token than GPT-3.5, and input tokens are billed at a lower rate than output tokens.
- Claude (3/3.5): Similar to ChatGPT, Claude’s pricing is based on token usage, with rates applied to both input and output tokens.
- Gemini: Pricing for Gemini models is also based on the number of tokens processed, with specific rates for different versions of the model.
For example, if you make 3,000 requests, each with 1,000 input tokens and 1,500 output tokens, the total usage is 7,500,000 tokens (3,000 × 2,500). The cost is then calculated from the per-million-token rates for the model in question.
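Here is a minimal sketch of that arithmetic; the $5 and $15 per-million-token rates below are placeholders, not any provider's actual price list:

```python
def token_cost(requests, input_tokens, output_tokens,
               input_rate_per_m, output_rate_per_m):
    """Estimate cloud LLM spend for a batch of similar requests.

    Rates are in USD per million tokens, billed separately for
    input (prompt) and output (completion) tokens.
    """
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    cost = (total_input / 1e6) * input_rate_per_m \
         + (total_output / 1e6) * output_rate_per_m
    return total_input + total_output, cost


# Example from above: 3,000 requests x (1,000 in + 1,500 out) tokens.
# The $5 / $15 per-million rates are illustrative placeholders only.
tokens, cost = token_cost(3_000, 1_000, 1_500,
                          input_rate_per_m=5.0, output_rate_per_m=15.0)
print(f"Total tokens: {tokens:,}")        # 7,500,000
print(f"Estimated cost: ${cost:,.2f}")
```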
A Detailed Overview of LLM Costs
When figuring out the cost of using large language models, you need to think about things like hardware needs, different model types, and ongoing expenses. Let’s dive into what it costs to run LLMs whether you’re doing it locally or using cloud services.
Memory Requirements for Popular Models
The figures below assume unquantized weights at full 32-bit precision (roughly 4 bytes per parameter); running at half precision (FP16) cuts them roughly in half.
- Llama 3:
- 8B Model: Requires approximately 32GB of GPU VRAM.
- 70B Model: Requires around 280GB of GPU VRAM, necessitating multiple high-end GPUs or a specialized server.
- Mistral 7B: Requires around 28GB of GPU VRAM.
- Gemma:
- 2B Model: Requires about 12GB of GPU VRAM.
- 9B Model: Requires approximately 36GB of GPU VRAM.
- 27B Model: Requires about 108GB of GPU VRAM, often necessitating a multi-GPU setup or high-performance cloud instance.
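These figures follow the common rule of thumb of about 4 bytes per parameter at 32-bit precision and 2 bytes at FP16, covering the weights only. A rough estimator:

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0}

def weight_vram_gb(params_billions, precision="fp32"):
    """VRAM needed for the model weights alone, in GB.

    Real usage is somewhat higher once activations and the KV cache
    are included, and grows with context length and batch size.
    """
    return params_billions * BYTES_PER_PARAM[precision]

for name, size_b in [("Llama 3 8B", 8), ("Llama 3 70B", 70),
                     ("Mistral 7B", 7), ("Gemma 27B", 27)]:
    print(f"{name}: ~{weight_vram_gb(size_b):.0f} GB at fp32, "
          f"~{weight_vram_gb(size_b, 'fp16'):.0f} GB at fp16")
```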
Quantized LLMs
Quantization involves reducing the precision of the model weights to save memory and improve performance. While quantized models consume less memory, they may exhibit slightly reduced accuracy.
- Q4_K_M Quantization: A popular balance between memory savings and output quality, storing weights at roughly 4–5 bits each. For instance, a 70B model quantized to Q4_K_M needs roughly 40–45GB of VRAM, versus around 140GB at FP16 and the 280GB quoted above for full 32-bit precision.
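If you want to try a quantized model locally, one common route is the open-source llama-cpp-python package with a GGUF file; the file path and settings below are placeholders for whatever model you actually download:

```python
# pip install llama-cpp-python  (build with GPU support to offload layers to VRAM)
from llama_cpp import Llama

# The path is a placeholder; point it at any Q4_K_M (or other) GGUF file.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window in tokens
    n_gpu_layers=-1,   # offload all layers to the GPU if they fit
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the benefits of local LLMs."}],
    max_tokens=200,
)
print(response["choices"][0]["message"]["content"])
```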
Costs of Hardware and Operation
The costs associated with owning and operating hardware to run LLMs locally include the initial hardware investment, ongoing electricity costs, and maintenance expenses.
Hardware Costs
- Nvidia RTX 3090:
- 1x Setup: Approximately $1,500 (initial cost).
- Electricity + Maintenance: Around $100 per month.
- Performance: Approximately 35 TFLOPS (FP32).
- Tokens per Second: Up to roughly 10,000 tokens/sec, depending heavily on model size and batch size.
- Nvidia RTX 4090:
- 1x Setup: Approximately $2,000 (initial cost).
- Electricity + Maintenance: Around $100 per month.
- Performance: Approximately 83 TFLOPS (FP32).
- Tokens per Second: Roughly double the RTX 3090 under the same conditions, potentially 20,000 tokens/sec.
Multi-GPU Setups
- 2x RTX 4090:
- Initial Cost: $4,000.
- Electricity + Maintenance: Around $150 per month.
- 4x RTX 4090:
- Initial Cost: $8,000.
- Electricity + Maintenance: Around $200 per month.
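To see where a local setup starts to beat pay-per-token pricing, here is a simple break-even sketch. The hardware price, running cost, amortization period, and the blended $10-per-million-token cloud rate are all assumptions for illustration:

```python
def monthly_cost_local(hardware_price, months_amortized, running_cost_per_month):
    """Hardware amortized over a fixed period plus electricity/maintenance."""
    return hardware_price / months_amortized + running_cost_per_month

def monthly_cost_cloud(tokens_per_month, blended_rate_per_m_tokens):
    """Cloud spend assuming one blended USD rate per million tokens."""
    return tokens_per_month / 1e6 * blended_rate_per_m_tokens

# Assumptions: 1x RTX 4090 setup from above, amortized over 24 months,
# versus a placeholder blended cloud rate of $10 per million tokens.
local = monthly_cost_local(2_000, 24, 100)
for millions in (5, 20, 50, 100):
    cloud = monthly_cost_cloud(millions * 1e6, 10)
    cheaper = "local" if local < cloud else "cloud"
    print(f"{millions:>3}M tokens/month: local ${local:,.0f} vs cloud ${cloud:,.0f} -> {cheaper}")
```

Under these assumptions the local setup pays for itself once monthly usage passes roughly 20 million tokens; with different rates or hardware the crossover point shifts accordingly.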
Performance and Efficiency
The performance of local LLMs is significantly influenced by the GPU setup. For instance:
- Single GPU: Best suited for smaller models or lower usage scenarios.
- Dual GPU Setup: Provides better performance for mid-sized models and higher throughput.
- Quadruple GPU Setup: Ideal for handling large models and high-volume requests, with increased efficiency in token processing.
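One rough way to size a setup is to translate a target request volume into the sustained token throughput it implies, then compare that against the per-GPU throughput estimates above:

```python
def required_tokens_per_sec(requests_per_day, avg_tokens_per_request):
    """Average token throughput needed to serve a daily request volume."""
    return requests_per_day * avg_tokens_per_request / 86_400  # seconds per day

# Example: 100,000 requests/day at 2,500 tokens each (as in the pricing example).
need = required_tokens_per_sec(100_000, 2_500)
print(f"Sustained throughput needed: ~{need:,.0f} tokens/sec")
# Compare this against the per-GPU estimates above to choose a single-,
# dual-, or quad-GPU setup; peak traffic needs extra headroom.
```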
Conclusion
Deciding between local LLMs and cloud-based models really comes down to your needs and priorities.
Local LLMs give you more control, better privacy, and can be cheaper in the long run if you use them a lot. But they need a big upfront investment in hardware and ongoing maintenance.
Cloud services like ChatGPT, Claude, and Gemini are convenient, easy to scale, and don’t require a big initial investment. However, they might cost more over time and could raise some data privacy issues.
To figure out what’s best for you, think about how you’ll use the model, your budget, and how important data security is.
For long-term use or if you need extra privacy, local LLMs might be the way to go. For short-term needs or if you need something that scales easily, cloud services could be a better fit.
Want to see how SCAND can help with custom LLM and AI development? Drop us a line and let’s chat about what we can do for you.