
Run your own AI assistant on low-power hardware. Complete guide to Ollama and Open WebUI setup with Docker on Intel N100.
What if you could run your own ChatGPT-like assistant entirely on your home server? No API costs, no data leaving your network, and complete control over your AI experience. With Ollama and Open WebUI, this is not only possible—it's surprisingly accessible, even on low-power hardware like the Intel N100.
This guide walks you through setting up a complete self-hosted AI stack, from choosing the right models for your hardware to optimizing performance for CPU-only inference.

Before diving into the technical setup, let's understand why running AI locally makes sense for home server enthusiasts.

Every query you send to ChatGPT, Claude, or Gemini travels through third-party servers. Your conversations about personal finances, health questions, work projects, and private ideas may be retained and, depending on the provider's policies, used to improve their models. With a self-hosted AI, nothing leaves your network. Cost is the other big factor:

| Service | Monthly Cost | Annual Cost |
|---|---|---|
| ChatGPT Plus | $20 | $240 |
| Claude Pro | $20 | $240 |
| Self-Hosted (electricity only) | ~$2-5 | ~$24-60 |
For households with multiple users, self-hosting becomes even more economical: a single Ollama instance can serve every member of the household at no extra cost.
Self-hosted AI also keeps working during internet outages, while traveling without connectivity, and on deliberately isolated networks.
Running AI locally is computationally intensive, but modern small language models (SLMs) have made it practical on modest hardware.
For a functional self-hosted AI on budget hardware:
| Component | Minimum | Recommended |
|---|---|---|
| CPU | Intel N100/N95 | Intel N305, Ryzen 5600U |
| RAM | 16GB | 32GB |
| Storage | 50GB free | 100GB+ SSD |
| Network | Gigabit Ethernet | Gigabit Ethernet |
The Intel N100 is the sweet spot for budget self-hosting. Its 6W TDP keeps electricity costs minimal while providing enough processing power for small language models.
RAM is the primary bottleneck for running LLMs. Here's how model size relates to memory requirements:
| Model Parameters | Quantization | RAM Required | Example Models |
|---|---|---|---|
| 0.5B | Q4 | 1-2GB | Qwen 2.5 0.5B |
| 1B-4B | Q4 | 2-4GB | Llama 3.2 1B/3B, Phi-3 Mini |
| 7B-8B | Q4 | 6-8GB | Mistral 7B, Llama 3.1 8B |
| 13B | Q4 | 10-12GB | Llama 2 13B |
| 70B | Q4 | 40-48GB | Llama 3.1 70B |
Pro tip: With 16GB RAM, you can comfortably run 7B models. With 32GB, you unlock 13B models and can run multiple smaller models simultaneously.
The Intel N100 uses single-channel memory, which significantly limits LLM performance: CPU inference is largely memory-bandwidth-bound, so bandwidth translates almost directly into tokens per second.
If AI performance is your priority, consider dual-channel systems like the AMD Ryzen 5600U or 5800U, which offer nearly 2x faster inference for similar power consumption.
Setting realistic expectations is crucial. Self-hosted AI on consumer hardware won't match cloud services, but it can be surprisingly useful.
LLM performance is measured in tokens per second (tok/s). A token is roughly 4 characters or 0.75 words.
| Speed | Experience | Use Case |
|---|---|---|
| 1 tok/s | Painfully slow | Background processing only |
| 5 tok/s | Usable | Simple questions, short responses |
| 10 tok/s | Comfortable | General chat, coding assistance |
| 20+ tok/s | Real-time | Streaming responses, interactive use |
On an Intel N100, that works out to roughly 8-12 tok/s for 1B models, 3-5 tok/s for 3B models, and around 1 tok/s for 7B-8B models (see the benchmark tables later in this guide).
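Once the stack from the setup section below is running, you can measure this on your own hardware: ollama run prints timing statistics when given the --verbose flag (the model tag here is just an example):
# Generation speed appears as "eval rate" in the timing summary
docker exec -it ollama ollama run llama3.2:1b "Explain what a token is in one sentence." --verbose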
Quantization reduces model precision to decrease memory usage and increase speed. The format is typically expressed as Q4, Q5, Q8:
| Quantization | Size Reduction | Quality Impact | Use Case |
|---|---|---|---|
| Q2 | ~85% smaller | Noticeable degradation | Extremely constrained hardware |
| Q4_K_M | ~70% smaller | Minimal impact | Best balance for most users |
| Q5_K_M | ~65% smaller | Very slight impact | Quality-focused |
| Q8 | ~50% smaller | Nearly lossless | Maximum quality |
| F16 | Baseline | Full precision | Research, fine-tuning |
Recommendation: Start with Q4_K_M quantization for the best balance of speed and quality.
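Ollama's registry tags usually encode the quantization, so once Ollama is running (setup below) you can request a specific variant explicitly. The exact tag here is an example; check the model's Tags page on ollama.com first:
# Default tag (typically a Q4 variant)
docker exec -it ollama ollama pull llama3.2:3b
# Explicit quantization tag (example; verify it exists for your model)
docker exec -it ollama ollama pull llama3.2:3b-instruct-q4_K_M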
Model selection is crucial for a good experience on low-power hardware. Here are the best options for Intel N100 and similar systems:
These models run smoothly on N100 hardware:
| Model | Parameters | Best For | Speed (N100) |
|---|---|---|---|
| Qwen 2.5 0.5B | 0.5B | Quick answers, simple tasks | 10-15 tok/s |
| Llama 3.2 1B | 1B | General chat, summarization | 8-12 tok/s |
| DeepSeek Coder 1.3B | 1.3B | Coding help | 5-8 tok/s |
| Qwen 2.5 1.5B | 1.5B | Balanced performance | 5-8 tok/s |
Usable on N100, better on dual-channel systems:
| Model | Parameters | Best For | Speed (N100) |
|---|---|---|---|
| Llama 3.2 3B | 3B | General assistant | 2-4 tok/s |
| Phi-3 Mini | 3.8B | Coding, reasoning | 2-3 tok/s |
| Qwen 2.5 3B | 3B | Multi-language, coding | 2-4 tok/s |
| Gemma 2 2B | 2B | Efficient general use | 3-5 tok/s |
Requires patience on N100, or better hardware:
| Model | Parameters | Best For | Speed (N100) |
|---|---|---|---|
| Llama 3.1 8B | 8B | Complex reasoning | 0.5-1.5 tok/s |
| Mistral 7B | 7B | Strong all-rounder | 0.5-1.5 tok/s |
| DeepSeek Coder 6.7B | 6.7B | Code generation | 0.5-1.5 tok/s |
If you have a specific use case in mind, these are good starting points:
| Use Case | Recommended Model | Notes |
|---|---|---|
| Coding | DeepSeek Coder, CodeLlama | IDE integration ready |
| Creative Writing | Llama 3.2, Mistral | Uncensored versions available |
| Summarization | Qwen 2.5, Phi-3 | Excellent at condensing text |
| Vision (image analysis) | LLaVA, Llama 3.2 Vision | Requires more RAM |
| Embeddings | nomic-embed-text | For RAG applications |
Now let's set up your self-hosted AI stack with Docker Compose.
Docker & Docker Compose Installation:
# Debian/Ubuntu (package names vary by release; if docker-compose-plugin isn't in your
# distro's repos, follow Docker's official install guide at docs.docker.com/engine/install)
sudo apt update
sudo apt install docker.io docker-compose-plugin
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in for group changes

# Verify installation
docker --version
docker compose version
System Preparation:
# Create directory structure
mkdir -p ~/ai-stack/{ollama,open-webui}
cd ~/ai-stack
# Check available RAM
free -h
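Optionally confirm that your user can actually talk to the Docker daemon before continuing:
# Should print a "Hello from Docker!" message and exit
docker run --rm hello-world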
Ollama is the LLM runtime that downloads, manages, and serves AI models.
Option 1: Docker Compose (Recommended)
Create docker-compose.yml:
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
    # For low-power systems, limit CPU usage
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 12G
Start Ollama:
docker compose up -d ollama
# Check logs
docker compose logs -f ollama
Pulling Your First Model:
# Pull a lightweight model for testing
docker exec -it ollama ollama pull qwen2.5:1.5b
# List available models
docker exec -it ollama ollama list
# Test the model
docker exec -it ollama ollama run qwen2.5:1.5b "Hello! What can you help me with?"
Recommended Models to Pull:
# Fast, everyday assistant
docker exec -it ollama ollama pull llama3.2:1b
# More capable, slower
docker exec -it ollama ollama pull llama3.2:3b
# Coding assistant
docker exec -it ollama ollama pull deepseek-coder:1.3b
# For embeddings (RAG)
docker exec -it ollama ollama pull nomic-embed-text
Open WebUI provides a beautiful ChatGPT-like interface for interacting with your local models.
Add to docker-compose.yml:
services:
  ollama:
    # ... (previous ollama config)

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - ./open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=your-secure-secret-key-change-this
      - DEFAULT_USER_ROLE=user
      - ENABLE_SIGNUP=true
    depends_on:
      - ollama
Deploy the complete stack:
docker compose up -d
# Watch the logs
docker compose logs -f
Access Open WebUI:
Open http://your-server-ip:3000 in your browser.
Here's the full production-ready configuration:
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
    healthcheck:
      # The ollama image doesn't reliably ship curl, so use the CLI (it calls the same API)
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 12G

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - openwebui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:-changeme}
      - DEFAULT_USER_ROLE=user
      - ENABLE_SIGNUP=true
      - ENABLE_RAG_WEB_SEARCH=false
      - ENABLE_IMAGE_GENERATION=false
    depends_on:
      ollama:
        condition: service_healthy

volumes:
  ollama_data:
  openwebui_data:
Create a .env file:
WEBUI_SECRET_KEY=your-long-random-secret-key-here
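Any long random string works; one convenient way to generate and store it (assuming openssl is available) is:
# Generate a 64-character hex secret and write it to .env
echo "WEBUI_SECRET_KEY=$(openssl rand -hex 32)" > .env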
Getting the best experience on an N100 requires careful tuning.
RAM Available → Model Choice
├── 8GB → Qwen 0.5B, Llama 3.2 1B only
├── 16GB → Llama 3.2 3B, Phi-3 Mini (comfortable)
├── 32GB → Llama 3.1 8B, Mistral 7B (usable)
└── 64GB → Any model, multiple models loaded
Add these environment variables for low-power optimization:
environment:
  # Reduce context length to save RAM (depending on your Ollama version this may
  # need to be set per model via "PARAMETER num_ctx" in a Modelfile instead)
  - OLLAMA_NUM_CTX=2048
  # Single model at a time (saves RAM)
  - OLLAMA_MAX_LOADED_MODELS=1
  # Unload models faster (saves RAM)
  - OLLAMA_KEEP_ALIVE=2m
  # Limit concurrent requests
  - OLLAMA_NUM_PARALLEL=1
  # Use all available threads (see the note on num_thread below)
  - OLLAMA_NUM_THREAD=4
Context length (num_ctx) determines how much text the model can "remember" in a conversation:
| Context Length | RAM Impact | Speed Impact | Use Case |
|---|---|---|---|
| 512 | Minimal | Fastest | Quick Q&A |
| 2048 | Moderate | Good | Standard chat |
| 4096 | Significant | Slower | Document analysis |
| 8192 | High | Much slower | Long conversations |
For N100 systems, stick with 2048 context for the best balance.
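If you want to experiment before settling on a global value, the native Ollama API also accepts a per-request context size in its options field; a quick sketch using a model pulled earlier:
# Ask for a smaller context window for this request only
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:1b",
  "prompt": "In one line, what does context length mean?",
  "stream": false,
  "options": { "num_ctx": 1024 }
}'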
# Check your CPU cores
nproc
# For the Intel N100 (4 cores), use all cores:
OLLAMA_NUM_THREAD=4
# (If your Ollama version ignores this variable, set "PARAMETER num_thread 4" in a Modelfile instead)
Monitor memory usage and swap:
# Watch memory in real-time
watch -n 1 free -h
# Check if swapping (bad for performance)
vmstat 1
If you see heavy swapping, switch to a smaller model, reduce the context length, or add more RAM.
Here are actual benchmarks from the community on Intel N100 hardware with 16GB RAM:
| Model | Prompt Eval | Generation | Notes |
|---|---|---|---|
| Qwen 2.5 0.5B (Q4) | 50 tok/s | 12 tok/s | Very responsive |
| Llama 3.2 1B (Q4) | 35 tok/s | 8 tok/s | Good daily driver |
| Qwen 2.5 1.5B (Q4) | 25 tok/s | 5 tok/s | Best quality/speed |
| Llama 3.2 3B (Q4) | 15 tok/s | 3 tok/s | Usable, patient users |
| Phi-3 Mini (Q4) | 12 tok/s | 2.5 tok/s | Good for coding |
| Llama 3.1 8B (Q4) | 5 tok/s | 1 tok/s | Background tasks only |
For a simple question ("What is the capital of France?"):
| Model | Time to First Token | Complete Response |
|---|---|---|
| Qwen 0.5B | 0.5s | 2s |
| Llama 3.2 1B | 1s | 4s |
| Llama 3.2 3B | 2s | 10s |
| Llama 3.1 8B | 5s | 30s |
| Metric | Self-Hosted (N100) | ChatGPT |
|---|---|---|
| Response Speed | 2-10 tok/s | 50-100 tok/s |
| Privacy | Full | None |
| Monthly Cost | ~$3 electricity | $20 subscription |
| Offline Use | Yes | No |
| Custom Models | Yes | No |
Once running, here's what you can actually do with self-hosted AI:
Upload PDFs, research papers, or long articles and get concise summaries of documents you don't have time to read in full.
Models like DeepSeek Coder and Phi-3 make capable local coding assistants for everyday tasks such as explaining unfamiliar code and generating boilerplate.
Connect Ollama to Home Assistant for voice control and natural-language automations (covered later in this guide).
With Open WebUI's RAG features, you can ask questions about your own documents and get answers grounded in their contents (see the RAG section below).
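As a concrete sketch of the summarization workflow outside the web UI (assuming jq is installed and notes.txt is the file you want condensed):
# Build the JSON safely with jq, send it to Ollama, print only the answer
jq -n --arg prompt "Summarize the following in five bullet points: $(cat notes.txt)" \
  '{model: "llama3.2:1b", prompt: $prompt, stream: false}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response'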
Symptom: Container crashes or "failed to allocate memory"
Solutions:
# Use smaller model
docker exec -it ollama ollama pull qwen2.5:0.5b
# Reduce context length
# Add to docker-compose.yml:
environment:
  - OLLAMA_NUM_CTX=1024
# Check actual memory usage
docker stats ollama
Symptom: Very slow responses (under 1 tok/s)
Solutions: make sure the model fits in RAM without swapping (free -h), check for thermal throttling (sensors), and fall back to a smaller model or quantization.
Symptom: Open WebUI can't connect to Ollama
Solutions:
# Verify Ollama is responding
curl http://localhost:11434/api/tags
# Check container networking
docker network ls
docker network inspect ai-stack_default
# Ensure both containers on same network
docker compose down && docker compose up -d
Symptom: Model pull hangs or fails
Solutions:
# Check available disk space
df -h
# Retry the pull (download progress is shown by default)
docker exec -it ollama ollama pull llama3.2:1b
# If the registry is unreachable, you can download a GGUF from Hugging Face and import it
# with a Modelfile ("FROM /path/to/model.gguf") via `ollama create <name> -f Modelfile`
Symptom: Ollama uses CPU even without requests
Solutions:
# Add keep-alive timeout
environment:
  - OLLAMA_KEEP_ALIVE=30s # Unload models after 30 seconds
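You can also evict a model on demand instead of waiting for the timeout; the API treats a keep_alive of 0 as "unload immediately" (the model name is just an example):
# Tell Ollama to unload the model right away
curl -s http://localhost:11434/api/generate -d '{"model": "llama3.2:1b", "keep_alive": 0}'
# Confirm nothing is still loaded
curl -s http://localhost:11434/api/ps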
Once you have the basics working, these advanced configurations unlock more capabilities.
RAG allows your AI to answer questions about your own documents:
Configure Open WebUI for RAG:
# Add to the open-webui environment
environment:
  - ENABLE_RAG_WEB_SEARCH=false
  - RAG_EMBEDDING_ENGINE=ollama # assumption: tells Open WebUI to embed via Ollama; check your version's docs
  - RAG_EMBEDDING_MODEL=nomic-embed-text
  - RAG_RERANKING_MODEL=
  - CHUNK_SIZE=1000
  - CHUNK_OVERLAP=100
Pull the embedding model:
docker exec -it ollama ollama pull nomic-embed-text
Using RAG in Open WebUI: upload documents in a chat (or add them to the documents/knowledge section of the workspace), then reference them in your prompt; Open WebUI chunks and embeds them with the model above automatically.
Run different models for different purposes by creating model aliases with custom parameters:
# Create a fast model for simple queries (write the Modelfile inside the container first,
# since stdin-based Modelfiles and `docker exec -t` don't mix well)
docker exec -i ollama sh -c 'cat > /tmp/fast.Modelfile' << 'EOF'
FROM qwen2.5:0.5b
PARAMETER num_ctx 1024
PARAMETER temperature 0.7
SYSTEM You are a fast, concise assistant. Keep responses brief.
EOF
docker exec ollama ollama create fast-assistant -f /tmp/fast.Modelfile

# Create a thorough model for complex tasks
docker exec -i ollama sh -c 'cat > /tmp/thorough.Modelfile' << 'EOF'
FROM llama3.2:3b
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
SYSTEM You are a thorough assistant. Provide detailed, well-reasoned responses.
EOF
docker exec ollama ollama create thorough-assistant -f /tmp/thorough.Modelfile
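A quick sanity check that the alias works as intended:
docker exec -it ollama ollama run fast-assistant "In one sentence, what is quantization?"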
Ollama provides an OpenAI-compatible API for integration with other tools:
Basic API Usage:
# Chat completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2:1b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
# Generate embeddings
curl http://localhost:11434/api/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "nomic-embed-text",
"prompt": "This is a test sentence for embedding."
}'
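The OpenAI-compatible layer also exposes a model listing endpoint, which is useful for confirming that a client will see your local models:
# Should return your pulled models in OpenAI's list format
curl -s http://localhost:11434/v1/models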
Python Integration:
import requests

def chat(prompt, model="llama3.2:1b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    )
    return response.json()["response"]

# Example usage
answer = chat("What is the capital of France?")
print(answer)
Access your AI securely when you're away from home. Two common approaches:
Option 1: Tailscale (Recommended)
# Install Tailscale on your server
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up
# Access from any device on your Tailnet
# http://your-server-tailscale-ip:3000
Option 2: Reverse Proxy with HTTPS
Using Caddy for automatic HTTPS:
# Add to docker-compose.yml
# Add to the services: section of docker-compose.yml
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data # also declare caddy_data under the top-level volumes: key
Create Caddyfile:
ai.yourdomain.com {
    reverse_proxy open-webui:8080
}
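After editing the Caddyfile, the configuration can be reloaded in place (assuming the container is named caddy, as above):
docker exec caddy caddy reload --config /etc/caddy/Caddyfile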
Protect your configurations and chat history:
Backup Script:
#!/bin/bash
# backup-ai-stack.sh
BACKUP_DIR="/backup/ai-stack-$(date +%Y%m%d)"
mkdir -p $BACKUP_DIR
# Stop containers for consistent backup
docker compose stop
# Backup volumes (note: compose usually prefixes volume names with the project name,
# e.g. ai-stack_ollama_data; check with `docker volume ls` and adjust)
docker run --rm -v ollama_data:/data -v "$BACKUP_DIR":/backup alpine \
  tar czf /backup/ollama-data.tar.gz /data
docker run --rm -v openwebui_data:/data -v "$BACKUP_DIR":/backup alpine \
  tar czf /backup/openwebui-data.tar.gz /data
# Backup configuration
cp docker-compose.yml $BACKUP_DIR/
cp .env $BACKUP_DIR/
# Restart containers
docker compose start
echo "Backup completed: $BACKUP_DIR"
Restore Script:
#!/bin/bash
# restore-ai-stack.sh
BACKUP_DIR=$1
# Stop and remove containers
docker compose down -v
# Restore volumes
docker volume create ollama_data
docker volume create openwebui_data
docker run --rm -v ollama_data:/data -v $BACKUP_DIR:/backup alpine \
tar xzf /backup/ollama-data.tar.gz -C /
docker run --rm -v openwebui_data:/data -v $BACKUP_DIR:/backup alpine \
tar xzf /backup/openwebui-data.tar.gz -C /
# Restore configuration
cp $BACKUP_DIR/docker-compose.yml ./
cp $BACKUP_DIR/.env ./
# Start containers
docker compose up -d
Connect your AI to Home Assistant for voice control and automation:
Install the Ollama integration: in Home Assistant, go to Settings → Devices & Services → Add Integration, search for Ollama, and point it at http://your-server:11434.
Then create AI-powered automations:
# configuration.yaml
conversation:
  intents:
    HassLightSet:
      - "Turn {area} lights {state}"
      - "Set {area} brightness to {brightness}"

# Use Ollama for natural language understanding
# Example: "Make the living room cozy" → dims lights, adjusts color temperature
Integrate with n8n for complex AI workflows:
{
  "nodes": [
    {
      "name": "Ollama",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://ollama:11434/api/generate",
        "method": "POST",
        "body": {
          "model": "llama3.2:1b",
          "prompt": "={{ $json.input }}",
          "stream": false
        }
      }
    }
  ]
}
Use your local AI for coding assistance:
Install Continue extension:
~/.continue/config.json:
{
  "models": [
    {
      "title": "Local Ollama",
      "provider": "ollama",
      "model": "deepseek-coder:1.3b",
      "apiBase": "http://localhost:11434"
    }
  ]
}
Add AI to your note-taking workflow:
Point a community LLM plugin (for example, an Ollama text-generation plugin) at http://localhost:11434 and pick one of your local models. Typical use cases are summarizing long notes and drafting text from outlines.
Bind to localhost only (if not exposing remotely):
services:
  ollama:
    ports:
      - "127.0.0.1:11434:11434" # Only accessible from localhost
Use a firewall:
# Allow only local network access
sudo ufw allow from 192.168.1.0/24 to any port 3000
sudo ufw allow from 192.168.1.0/24 to any port 11434
Open WebUI provides built-in authentication:
environment:
  - WEBUI_AUTH=true
  - ENABLE_SIGNUP=false # Disable public registration
  - DEFAULT_USER_ROLE=user
Create additional users: the first account registered through the sign-up screen automatically becomes the admin, and further accounts can then be added (or sign-up temporarily re-enabled) from the Admin Panel → Users section inside Open WebUI.
Be aware of model capabilities and limitations: small local models hallucinate more readily than large cloud models, so double-check anything factual, legal, medical, or security-sensitive before acting on it.
# Pull latest images
docker compose pull
# Recreate containers with new images
docker compose up -d --force-recreate
# Clean up old images
docker image prune -f
# List current models
docker exec -it ollama ollama list
# Update a specific model
docker exec -it ollama ollama pull llama3.2:1b
# Remove old model versions
docker exec -it ollama ollama rm llama3.2:1b-old
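If you keep several models installed, a small loop re-pulls all of them at once (a sketch that relies on ollama list printing the model name in its first column):
# Re-pull every installed model to pick up updated weights and templates
docker exec ollama ollama list | awk 'NR>1 {print $1}' \
  | xargs -r -n1 docker exec ollama ollama pull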
Check resource usage:
# Container stats
docker stats ollama open-webui
# Ollama-specific metrics
curl http://localhost:11434/api/ps
Set up alerts:
#!/bin/bash
# Simple health check script
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
    echo "Ollama is down!" | mail -s "AI Stack Alert" admin@example.com
fi
When you're ready to upgrade for better performance:
NVIDIA GPU Setup:
services:
  ollama:
    image: ollama/ollama:latest
    # Requires the NVIDIA Container Toolkit on the host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
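Once the NVIDIA Container Toolkit is installed on the host and the container recreated, you can confirm the GPU is visible inside the container:
# Should list your GPU; if this errors, revisit the toolkit installation
docker exec ollama nvidia-smi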
Budget GPU options change quickly, so check current community recommendations; the key spec is VRAM, and 12GB comfortably fits Q4 models in the 7B-13B range.
Also keep an eye on Ollama and Open WebUI release notes, since both projects and the small-model landscape move fast.
Self-hosting AI with Ollama and Open WebUI is practical and rewarding: for a few dollars of electricity a month you get a private, always-available assistant, and even a 6W Intel N100 handles the 1B-3B models that cover most everyday tasks.
Last updated: December 2025
