
# How to Deploy an Open Source LLM for Your Business: From Selection to Production
You’ve decided an open source AI model makes sense for your business. Maybe the API costs are adding up, or you need to keep customer data on your own infrastructure, or you want a model fine-tuned on your specific domain. Whatever the reason, you’re now facing the practical question: how do you actually get this running?
We’ve deployed open source LLMs for businesses ranging from 5-person law firms to 200-person manufacturing companies across Southern California. The process is more straightforward than most technical articles make it seem, but there are real decisions to make and genuine pitfalls to avoid.
This guide walks you through the complete process — from choosing the right model and hardware to running a production deployment that your team can rely on daily.
## Step 1: Choose Your Model
We covered model comparisons in detail in our open source AI models guide, but here’s the short version for deployment purposes:
If this is your first deployment, use Llama 4 8B. It’s small enough to run on modest hardware, capable enough for most straightforward business tasks, and has the best ecosystem support for troubleshooting.
If you need production quality, use Llama 4 70B or DeepSeek V3.2. These compete with closed-source models on most business tasks but require more serious hardware.
If hardware budget is tight, use Mistral Small 3.1 (24B). It delivers 70B-class performance on a single consumer GPU.
If you’ll fine-tune later, choose a model from a family with multiple sizes. Start with the small version for testing, then scale up. Llama and Qwen both offer this flexibility.
### Quantization: Trading Precision for Speed
You’ll encounter terms like Q4_K_M, Q5_K_S, GPTQ, and AWQ. These are quantization methods — ways to compress a model so it uses less memory and runs faster, with a small quality trade-off.
Here’s what you need to know:
- Full precision (FP16/BF16): Maximum quality, maximum memory usage. A 70B model needs ~140GB of GPU memory.
- 8-bit quantization (Q8): Negligible quality loss, roughly half the memory. That 70B model needs ~70GB.
- 4-bit quantization (Q4): Slight quality reduction (2-5% on benchmarks), one-quarter the memory. The 70B model fits in ~35GB — a single high-end GPU.
- GPTQ/AWQ: GPU-optimized 4-bit formats. Faster inference than basic Q4 methods.
For business applications, 4-bit quantization (specifically Q4_K_M for Ollama or AWQ/GPTQ for vLLM) is the practical sweet spot. The quality difference is minimal for tasks like customer support, document processing, and content generation.
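The memory figures above follow from simple arithmetic: parameter count times bytes per weight. Here's a rough sketch in Python. The function name is ours, and it counts the weights only. Real 4-bit formats store some extra scaling data, and the server needs additional memory for context, so treat the result as a floor rather than a guarantee.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return params_billions * bytes_per_weight


# Reproduce the 70B figures from the list above.
for label, bits in [("FP16/BF16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"70B at {label}: ~{estimate_weight_memory_gb(70, bits):.0f} GB")
```

Swap in 8 or 24 for the parameter count to check whether a smaller model fits the GPU you already own.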
## Step 2: Choose Your Hardware
This is where the rubber meets the road. Your hardware determines which models you can run and how fast they’ll respond.
### Option A: Your Existing Computer (Testing and Light Use)
If you have a recent computer with a decent GPU, you can run smaller models right now:
| Hardware | Models That Fit | Response Speed |
|---|---|---|
| 8GB GPU (RTX 3060) | 7-8B models (Q4) | 15-25 tokens/sec |
| 12GB GPU (RTX 3060 12GB) | 13-14B models (Q4) | 12-20 tokens/sec |
| 16GB GPU (RTX 4060 Ti) | 24B models (Q4) | 10-18 tokens/sec |
| 24GB GPU (RTX 4090) | 70B models (Q4, partly offloaded to CPU/RAM) | 8-15 tokens/sec |
| Apple M2/M3 Ultra (192GB) | 70B models (Q4) | 10-20 tokens/sec |
For reference, a comfortable reading speed is about 4-5 tokens per second. Even modest hardware generates text faster than most people can read it.
### Option B: Cloud GPU Instances (Production)
If you need reliable uptime, multiple concurrent users, or models too large for your hardware:
| Provider | GPU | Monthly Cost | Best For |
|---|---|---|---|
| Lambda Labs | A100 80GB | ~$900 | 70B models, multiple users |
| RunPod | A100 80GB | ~$1,000 | 70B models, spot pricing available |
| AWS (g5.xlarge) | A10G 24GB | ~$500 | 8-24B models, AWS ecosystem |
| AWS (p4d.24xlarge) | 8x A100 40GB | ~$8,000 | 405B models, heavy workloads |
| GCP (a2-highgpu-1g) | A100 40GB | ~$750 | 70B models (Q4), GCP ecosystem |
| Vast.ai | Various | ~$200-600 | Budget option, variable availability |
Our recommendation for most small businesses: Start with RunPod or Lambda Labs. They’re GPU-focused providers with simpler pricing than AWS/GCP, and their spot instances (interruptible but much cheaper) are perfect for testing.
### Option C: On-Premise Server (High Volume, Long Term)
If you’ll run AI workloads consistently for 12+ months, buying hardware saves money:
| Setup | Cost | Monthly Operating | Break-Even vs. Cloud |
|---|---|---|---|
| RTX 4090 workstation | ~$3,000 | ~$30 (electricity) | 4 months |
| 2x RTX 4090 workstation | ~$5,500 | ~$50 (electricity) | 5 months |
| Mac Studio M3 Ultra 192GB | ~$7,000 | ~$15 (electricity) | 7 months |
| NVIDIA A6000 48GB server | ~$8,000 | ~$60 (electricity) | 8 months |
After the break-even point, you’re running at electricity cost only. For a business committed to AI, on-premise hardware pays for itself within a year.
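If you want to run the break-even math with your own numbers, a minimal sketch looks like this. The function name and the $800/month cloud bill are illustrative assumptions, not figures taken from the tables above. Plug in the price of the cloud instance you would actually rent.

```python
import math


def break_even_months(hardware_cost: float, cloud_monthly: float, operating_monthly: float) -> int:
    """Whole months until purchased hardware costs less than renting cloud GPUs."""
    monthly_savings = cloud_monthly - operating_monthly
    if monthly_savings <= 0:
        raise ValueError("No break-even: on-premise costs at least as much per month as cloud")
    # Round up, because you only break even once the full month has been paid for.
    return math.ceil(hardware_cost / monthly_savings)


# RTX 4090 workstation: $3,000 up front, ~$30/month electricity,
# compared against an ASSUMED $800/month cloud bill.
print(break_even_months(3000, 800, 30))
```

The cheaper the cloud instance you compare against, the longer the payback period, so it's worth running this against the exact instance type you would otherwise use.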
## Step 3: Local Development with Ollama
Ollama is the fastest way to get a model running. It handles downloading, quantization, and serving in a single tool. Think of it as “Docker for AI models.”
### Install Ollama
Windows: Download from ollama.com and run the installer.
Mac:
```bash
brew install ollama
```
Linux:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
### Download and Run Your First Model
```bash
# Download and start Llama 4 8B
ollama run llama4:8b
# You're now in an interactive chat. Type your questions.
# Press Ctrl+D to exit.
```
That's it. One command. The model downloads automatically (~4.7GB for the 8B Q4 version) and you're chatting with a local AI.
### Run Ollama as an API Server
For integration with your applications, Ollama exposes a native REST API under /api and an OpenAI-compatible API under /v1:
```bash
# Start the server (skip this if the installer already started Ollama in the background)
ollama serve
# Test the native chat endpoint with curl
curl http://localhost:11434/api/chat -d '{
  "model": "llama4:8b",
  "messages": [{"role": "user", "content": "Summarize this invoice..."}],
  "stream": false
}'
```
Because Ollama also speaks the OpenAI API, most tools that work with GPT (LangChain, LlamaIndex, n8n, your custom code) work with Ollama by changing the base URL to http://localhost:11434/v1.
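To make the "change the base URL" idea concrete, here's a small sketch that uses only the Python standard library. The helper names are ours, and it assumes the server exposes the OpenAI-style `/v1/chat/completions` route, as Ollama does under `/v1`. Treat it as a starting point rather than a finished client.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for any compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs a running server):
# print(chat("http://localhost:11434/v1", "llama4:8b", "Summarize this invoice..."))
```

Only the first argument changes when you move between compatible servers, which is the whole appeal of the shared API shape.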
### Create a Custom Model for Your Business
Ollama lets you create customized models with a Modelfile:
```
FROM llama4:8b
SYSTEM """
You are a customer support assistant for [Company Name], an IT services
company in Southern California. You help customers with technical issues,
service inquiries, and account questions.
Always be professional and helpful. If you don't know something specific
about our services, say so rather than guessing. Offer to connect the
customer with a human specialist for complex issues.
"""
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
```
Build and run it:
```bash
ollama create company-support -f Modelfile
ollama run company-support
```
## Step 4: Production Deployment with vLLM
Ollama is great for development and small-scale use, but for production workloads serving multiple users simultaneously, vLLM is the standard choice. It uses PagedAttention for efficient memory management and handles concurrent requests with high throughput.
### Install vLLM
```bash
pip install vllm
```
### Start a Production Server
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-70B-Instruct \
  --quantization awq \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --port 8000
```
Key parameters explained:
- `--quantization awq`: Use AWQ 4-bit quantization to reduce memory (the checkpoint you load must already be AWQ-quantized)
- `--tensor-parallel-size 2`: Split the model across 2 GPUs
- `--max-model-len 8192`: Maximum context length (increase if you process long documents)
- `--port 8000`: API port
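Raising the maximum context length is not free: every token of context reserves key/value cache memory on the GPU alongside the weights. A rough sketch of that arithmetic follows. The function name is ours, and the layer and head counts in the example are illustrative values for a 70B-class model with grouped-query attention, not published specs for any model named in this guide. Read the real values from your model's `config.json`.

```python
def kv_cache_gb(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes for FP16/BF16 cache entries
    concurrent_sequences: int = 1,
) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes) for full-length sequences."""
    # Factor of 2: each layer stores both a key and a value vector per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_sequences / 1e9


# ASSUMED architecture: 80 layers, 8 KV heads, head dimension 128.
print(f"{kv_cache_gb(8192, 80, 8, 128):.2f} GB per 8,192-token sequence")
```

Multiply by the number of users you expect to have long conversations at the same time, and you get a feel for how much GPU memory to leave free beyond the weights.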
### vLLM Serves an OpenAI-Compatible API
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 200
  }'
```
Existing code that calls OpenAI needs only a new base URL (http://localhost:8000/v1) and the name of the model you are serving. No other modifications are needed.
### Production Configuration Tips
Set up a reverse proxy. Put nginx in front of vLLM to handle SSL, rate limiting, and basic authentication:
```nginx
# Rate-limit zone (example rate); this directive must sit in the http context
limit_req_zone $binary_remote_addr zone=ai_api:10m rate=10r/s;

server {
    listen 443 ssl;
    server_name ai.yourcompany.com;
    ssl_certificate /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    location /v1/ {
        # Basic authentication (example path; create the file with htpasswd)
        auth_basic "AI API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;

        # Basic rate limiting
        limit_req zone=ai_api burst=20 nodelay;
    }
}
```
Monitor GPU usage. Set up basic monitoring with `nvidia-smi` logged to a file, or use a proper monitoring stack (Grafana + Prometheus with the DCGM exporter):
```bash
# Simple monitoring: append GPU stats to a CSV file every 30 seconds
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv -l 30 >> /var/log/gpu_stats.csv
```
Set up health checks. Your load balancer or process manager should ping the health endpoint:
```bash
curl http://localhost:8000/health
```
Use a process manager. Run vLLM under systemd (Linux) or as a Docker container so it restarts automatically after crashes:
```yaml
# docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ipc: host  # shared memory for the tensor-parallel worker processes
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model meta-llama/Llama-4-70B-Instruct
      --quantization awq
      --tensor-parallel-size 2
    restart: always
```
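Returning to the health check tip: if your load balancer can't poll the endpoint itself, a few lines of Python can do it from a cron job or a monitoring script. This is a minimal sketch using only the standard library. The function name is ours, and it treats anything other than an HTTP 200 within the timeout as unhealthy.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, DNS failures, and HTTP error codes all land here.
        return False


# Usage (needs a running vLLM server):
# if not is_healthy("http://localhost:8000/health"):
#     print("vLLM is not responding; alert someone or trigger a restart")
```

Pair a check like this with the restart policy above so that a hung process gets noticed, not just a crashed one.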
## Step 5: Fine-Tuning for Your Domain
A general-purpose model knows a lot about everything but might not know the specifics of your industry, your products, or your communication style. Fine-tuning bridges that gap.
### When Fine-Tuning Makes Sense
- Your model consistently gets domain-specific details wrong
- You need a specific output format that prompt engineering can't reliably produce
- You want to match your company's writing style or terminology
- A smaller fine-tuned model could replace a larger general model (saving compute costs)
### When Fine-Tuning Is Overkill
- Prompt engineering with examples (few-shot) solves the problem
- Retrieval-Augmented Generation (RAG) can provide the needed context
- You have fewer than 200 training examples
### Quick Fine-Tuning with Unsloth
Unsloth is the fastest way to fine-tune an open source model. It optimizes the training process to run 2x faster with 60% less memory than standard methods.
```python
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

# Load the base model in 4-bit to reduce GPU memory use
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-4-8B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Add LoRA adapters (only trains ~2% of parameters)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Prepare your training data
# Format: list of {"instruction": "...", "input": "...", "output": "..."} dicts
# You need 500-2,000 examples for good results
# `your_dataset` below is a placeholder for the dataset you build from them

# Train. Options such as the epoch count go in the config object, not the trainer.
# Argument names vary between TRL releases, so check your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=your_dataset,
    args=SFTConfig(
        output_dir="outputs",
        max_seq_length=4096,  # some newer TRL releases call this max_length
        num_train_epochs=3,
    ),
)
trainer.train()

# Save the fine-tuned LoRA adapters
model.save_pretrained("company-model-v1")
```
Fine-tuning an 8B model on 1,000 examples takes about 1-2 hours on a single A100 GPU (or 3-4 hours on an RTX 4090).
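Most of the real work in fine-tuning is getting your examples into a consistent shape. Here's one way to turn the instruction/input/output dicts into training text. The function name and the section markers are our own assumptions (an Alpaca-style layout). Many instruct models expect their own chat template instead, so check the model card before settling on a format.

```python
REQUIRED_KEYS = ("instruction", "input", "output")


def format_example(example: dict) -> str:
    """Turn one {"instruction", "input", "output"} dict into a single training string."""
    missing = [key for key in REQUIRED_KEYS if key not in example]
    if missing:
        raise ValueError(f"Example is missing keys: {missing}")

    parts = [f"### Instruction:\n{example['instruction'].strip()}"]
    # The input section is optional; skip it when the example has no extra context.
    if example["input"].strip():
        parts.append(f"### Input:\n{example['input'].strip()}")
    parts.append(f"### Response:\n{example['output'].strip()}")
    return "\n\n".join(parts)


sample = {
    "instruction": "Summarize the memo in two sentences.",
    "input": "Memo text goes here.",
    "output": "Summary goes here.",
}
print(format_example(sample))
```

Running every example through one function like this also catches malformed records before you spend GPU hours on them.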
## Step 6: Security Considerations
Running your own AI model means you’re responsible for security. Here are the non-negotiable items:
Network isolation. Your model’s API should not be accessible from the public internet without authentication. Use a VPN, private network, or at minimum, API key authentication behind a reverse proxy.
Input sanitization. Users (or attackers) can craft prompts that attempt to extract training data, bypass system prompts, or cause the model to generate harmful content. Implement input filtering that rejects obviously malicious prompts and limit maximum input length.
Output filtering. Add a post-processing layer that catches and redacts sensitive information (SSNs, credit card numbers, internal system details) before returning responses to users.
Access logging. Log every request with the user identity, input, output, and timestamp. This is essential for compliance, debugging, and detecting misuse.
Model access control. Restrict who can load, modify, or replace model weights. A compromised model file could produce subtly wrong outputs that are hard to detect.
Data isolation. If you fine-tuned on sensitive data, treat the model weights as sensitive data themselves — they can potentially leak training examples.
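To give a feel for the input and output filtering described above, here's a deliberately simple sketch. The function names, the regular expressions, and the 8,000-character limit are illustrative assumptions. Pattern matching like this catches well-formatted SSNs and card numbers, not every way sensitive data can appear, so treat it as a first layer rather than complete protection.

```python
import re

# US Social Security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def check_input(prompt: str, max_chars: int = 8000) -> str:
    """Reject prompts longer than the configured limit before they reach the model."""
    if len(prompt) > max_chars:
        raise ValueError(f"Prompt is {len(prompt)} characters; the limit is {max_chars}")
    return prompt


def redact(text: str) -> str:
    """Replace SSN-like and card-like numbers in model output with placeholders."""
    text = SSN_PATTERN.sub("[REDACTED SSN]", text)
    text = CARD_PATTERN.sub("[REDACTED CARD]", text)
    return text


print(redact("SSN 123-45-6789, card 4111 1111 1111 1111."))
```

Call `check_input` before sending a prompt to the model and `redact` on the response before it goes back to the user.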
## Real Example: Deploying Llama 4 70B for an Internal Knowledge Base
One of our clients, a law firm with 30 attorneys, needed to search and summarize internal documents — case files, memos, research papers. They were spending $2,800/month on a closed-source AI API and were concerned about sending client documents to external servers.
What we deployed:
- Hardware: Leased server with 2x A100 80GB GPUs ($1,400/month from Lambda Labs, later replaced with purchased hardware at $18,000 one-time)
- Model: Llama 4 70B (full precision, no quantization needed with 160GB total GPU memory)
- Serving: vLLM with OpenAI-compatible API
- RAG pipeline: LlamaIndex connected to their document management system
- Fine-tuning: 800 examples of legal document summaries in the firm’s preferred format
Results after 3 months:
- Monthly cost went from $2,800 to $1,400 (cloud) to ~$60 (on-premise, electricity only)
- All document processing stayed on their own network — no compliance concerns
- Search quality improved 30% after fine-tuning (measured by attorney satisfaction surveys)
- Average response time: 4 seconds for a 500-word summary
Measured against the original $2,800/month API bill, the $18,000 hardware purchase pays for itself in about seven months.
## Common Deployment Mistakes
Skipping the testing phase. Don’t deploy directly to production. Run the model for at least 2 weeks internally, collecting real queries and evaluating output quality before exposing it to customers.
Over-sizing the model. A fine-tuned 8B model often outperforms a general 70B model on your specific tasks. Test smaller models first — you might not need the GPU budget you think you do.
Ignoring latency. Users expect responses in 2-5 seconds. If your model takes 15 seconds to generate an answer, the user experience suffers regardless of quality. Measure time-to-first-token and total response time, not just throughput.
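Measuring time-to-first-token doesn't need special tooling. If your client streams the response, you can time the stream itself. The sketch below works on any iterable of text chunks; the function name is ours, and how you obtain the chunks depends on your client library. Note that a chunk may contain more than one token, so the chunk count is only a rough proxy for tokens generated.

```python
import time
from typing import Iterable, Optional


def measure_stream(chunks: Iterable[str]) -> dict:
    """Time a streamed response: delay before the first chunk, and total duration."""
    start = time.perf_counter()
    first_chunk_at: Optional[float] = None
    chunk_count = 0
    for _ in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        chunk_count += 1
    end = time.perf_counter()
    return {
        "time_to_first_token": None if first_chunk_at is None else first_chunk_at - start,
        "total_time": end - start,
        "chunks": chunk_count,
    }
```

Pass the streaming iterator straight into `measure_stream` without consuming it first, and record the results for real user queries rather than a single test prompt.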
No fallback plan. Hardware fails. GPUs overheat. Drivers crash after updates. Have a fallback — even if it’s a temporary switch to an API provider — so your business operations continue while you fix the issue.
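A fallback can be as small as a wrapper around the call that talks to your model. In this sketch, `primary` and `fallback` are placeholders for whatever functions call your self-hosted server and your backup API provider; the names are ours. It also reports which path answered, so you can see in your logs how often the fallback is used.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger("llm_fallback")


def generate_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Try the primary model first; if it raises, answer from the fallback instead."""
    try:
        return primary(prompt), "primary"
    except Exception as error:  # broad on purpose: any failure should trigger the fallback
        logger.warning("Primary model failed (%s); using fallback", error)
        return fallback(prompt), "fallback"
```

Keep in mind that a fallback to an external API sends data off your network, so decide in advance which request types are allowed to take that path.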
## Frequently Asked Questions
### How much technical expertise do I need to deploy an open source LLM?
For Ollama (local development), you need basic comfort with a command line — roughly the level of installing software via terminal commands. For vLLM production deployments, you need someone familiar with Linux server administration, Docker, and basic networking concepts. If you have an IT person on staff who manages your servers, they can handle this. If not, an IT partner can set up the initial deployment in 1-2 days, after which maintenance is minimal (mostly updating the model and monitoring resource usage).
### Can I run a useful AI model without a GPU?
Technically yes, but practically it’s painful. CPU inference on a 7B model produces about 1-3 tokens per second — usable but slow. Apple Silicon Macs (M1/M2/M3) are the exception: they use unified memory that both CPU and GPU share, making them surprisingly capable for AI inference without a discrete GPU. An M3 MacBook Pro with 36GB RAM can run a 24B model at 8-12 tokens per second, which is perfectly usable.
### What's the difference between Ollama and vLLM? Do I need both?
Ollama is designed for simplicity and local use. It’s a single binary that handles everything — downloading models, quantization, and serving. vLLM is designed for production throughput. It handles many concurrent requests efficiently through advanced memory management. Use Ollama for development, testing, and single-user setups. Use vLLM when you need to serve multiple users simultaneously or need maximum throughput. Some businesses use Ollama permanently for internal tools with 1-5 users, and that’s perfectly fine.
### How do I handle model updates when a new version is released?
Treat model updates like software updates: test before deploying. When Llama 4.1 comes out, download it to a staging environment, run your evaluation dataset against it, and compare results to your current model. If it’s better (or equivalent), swap it into production during a maintenance window. Keep the previous model available for rollback. If you’ve fine-tuned your current model, you’ll need to re-fine-tune on the new base — keep your training data and scripts versioned so this is repeatable.
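The "run your evaluation dataset and compare" step can be sketched in a few lines. Everything here is illustrative: the function names are ours, each model is represented by a plain function from prompt to text, and the scoring function is whatever fits your task (exact match, a rubric, a human rating you've recorded).

```python
from typing import Callable, Dict, Sequence

Model = Callable[[str], str]
Scorer = Callable[[str, str], float]


def average_score(model: Model, eval_set: Sequence[Dict[str, str]], score: Scorer) -> float:
    """Mean score of a model over a list of {"prompt", "expected"} items."""
    total = sum(score(model(item["prompt"]), item["expected"]) for item in eval_set)
    return total / len(eval_set)


def should_promote(
    current: Model,
    candidate: Model,
    eval_set: Sequence[Dict[str, str]],
    score: Scorer,
    min_gain: float = 0.0,
) -> bool:
    """Promote the candidate if it scores at least as well as the current model."""
    candidate_score = average_score(candidate, eval_set, score)
    current_score = average_score(current, eval_set, score)
    return candidate_score >= current_score + min_gain
```

Set `min_gain` above zero if you only want to go through a maintenance window when the new model is clearly better, not merely equivalent.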
### What about regulatory compliance (HIPAA, SOC 2, GDPR)?
Self-hosting actually simplifies compliance for data residency requirements — your data stays in your infrastructure. For HIPAA, you need Business Associate Agreements with cloud GPU providers if you’re processing PHI in the cloud. Lambda Labs and certain AWS configurations support this. For SOC 2, document your model deployment as part of your system description, including access controls, logging, and data handling procedures. For GDPR, self-hosting means you control data processing entirely and can guarantee data doesn’t leave the EU if you deploy on EU-based servers. We recommend working with your compliance team or legal counsel to map your specific requirements to the deployment architecture.
## Your Deployment Roadmap
Here’s the path we recommend for most businesses:
Each step has a clear go/no-go decision point. You’re never committed until you choose to be.
Need help deploying open source AI models for your business? WinTechnology Inc. helps Southern California businesses adopt AI and digital strategy with hands-on expertise. Contact us for a free consultation.