Building a Private AI Assistant: Self-Hosting Ollama & Open WebUI

Run your own AI assistant on low-power hardware. Complete guide to Ollama and Open WebUI setup with Docker on Intel N100.

Published Dec 22, 2025 · Updated Dec 28, 2025
Tags: llm, open-webui, self-hosted

What if you could run your own ChatGPT-like assistant entirely on your home server? No API costs, no data leaving your network, and complete control over your AI experience. With Ollama and Open WebUI, this is not only possible—it's surprisingly accessible, even on low-power hardware like the Intel N100.

This guide walks you through setting up a complete self-hosted AI stack, from choosing the right models for your hardware to optimizing performance for CPU-only inference.

Why Self-Host Your AI Assistant?

Before diving into the technical setup, let's understand why running AI locally makes sense for home server enthusiasts.

Privacy & Data Control

Every query you send to ChatGPT, Claude, or Gemini passes through third-party servers. Your conversations about personal finances, health, work projects, and private ideas may be logged and, depending on the provider's policies, used to improve future models. With a self-hosted AI:

  • Your data stays home: Queries never leave your local network
  • No corporate surveillance: Your prompts aren't logged or analyzed
  • Sensitive use cases: Analyze private documents, tax returns, medical records
  • Compliance friendly: Useful for professionals with confidentiality requirements

Cost Savings

Service | Monthly Cost | Annual Cost
ChatGPT Plus | $20 | $240
Claude Pro | $20 | $240
Self-Hosted (electricity only) | ~$2-5 | ~$24-60

For households with multiple users, self-hosting becomes even more economical. A single Ollama instance can serve unlimited family members.
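
To sanity-check the electricity figure, here is a minimal Python sketch of the arithmetic; the wattage and price per kWh are assumptions, so plug in your own numbers:

# Rough monthly electricity cost for an always-on N100 box.
avg_watts = 10          # assumed average draw for an N100 mini PC with occasional AI load
price_per_kwh = 0.30    # assumed electricity price in $/kWh; substitute your local rate

kwh_per_month = avg_watts / 1000 * 24 * 30
cost_per_month = kwh_per_month * price_per_kwh
print(f"{kwh_per_month:.1f} kWh/month -> ${cost_per_month:.2f}/month")
# ~7.2 kWh/month -> ~$2.16/month, in line with the $2-5 range above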

Offline Capability

Self-hosted AI keeps working during internet outages, while traveling without connectivity, or inside isolated network environments. Perfect for:

  • Rural properties with unreliable internet
  • Home automation that shouldn't depend on cloud services
  • Research environments with air-gapped security requirements

Customization & Control

  • Model selection: Choose models optimized for coding, writing, or reasoning
  • Fine-tuning: Train on your own documents and writing style
  • Integration: Connect to Home Assistant, note-taking apps, and automation workflows
  • No censorship: Use uncensored models for creative writing or research

Hardware Requirements for Local AI

Running AI locally is computationally intensive, but modern small language models (SLMs) have made it practical on modest hardware.

Minimum Specs (CPU-Only)

For a functional self-hosted AI on budget hardware:

Component | Minimum | Recommended
CPU | Intel N100/N95 | Intel N305, Ryzen 5600U
RAM | 16GB | 32GB
Storage | 50GB free | 100GB+ SSD
Network | Gigabit Ethernet | Gigabit Ethernet

The Intel N100 is the sweet spot for budget self-hosting. Its 6W TDP keeps electricity costs minimal while providing enough processing power for small language models.

RAM: The Critical Factor

RAM is the primary bottleneck for running LLMs. Here's how model size relates to memory requirements:

Model Parameters | Quantization | RAM Required | Example Models
0.5B | Q4 | 1-2GB | Qwen 2.5 0.5B
1.5B-3B | Q4 | 2-4GB | Llama 3.2 3B, Qwen 2.5 1.5B
7B-8B | Q4 | 6-8GB | Llama 3.1 8B, Mistral 7B
13B | Q4 | 10-12GB | Llama 2 13B
70B | Q4 | 40-48GB | Llama 3.1 70B

Pro tip: With 16GB RAM, you can comfortably run 7B models. With 32GB, you unlock 13B models and can run multiple smaller models simultaneously.
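
A useful rule of thumb behind this table: at Q4, each parameter costs roughly half a byte, plus overhead for the KV cache and runtime. A rough Python sketch of that estimate (the overhead factor is an assumption, not an exact formula):

# Back-of-the-envelope RAM estimate for a quantized model.
# bits_per_weight: ~4 for Q4, ~5 for Q5, ~8 for Q8, 16 for F16.
def estimate_ram_gb(params_billions, bits_per_weight=4, overhead=1.4):
    # overhead (~1.2-1.5x) covers KV cache, activations, and runtime buffers (assumption)
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

for params in (0.5, 3, 7, 13, 70):
    print(f"{params}B @ Q4 ~ {estimate_ram_gb(params):.1f} GB RAM")
# 7B @ Q4 ~ 4.9 GB, which is why a 16GB machine handles 7B models comfortably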

Single-Channel vs Dual-Channel RAM

The Intel N100 uses single-channel RAM, which significantly impacts LLM performance. Memory bandwidth directly affects tokens-per-second:

  • Single-channel (N100): ~25GB/s bandwidth → 1-5 tokens/second
  • Dual-channel (Ryzen 5600U): ~50GB/s bandwidth → 2-10 tokens/second

If AI performance is your priority, consider dual-channel systems like the AMD Ryzen 5600U or 5800U, which offer nearly 2x faster inference for similar power consumption.
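
The bandwidth figures translate almost directly into a ceiling on generation speed: each generated token streams the entire quantized model through memory once, so tokens per second is at most bandwidth divided by model size. A quick Python sketch of that upper bound (real-world numbers land well below it):

# Memory-bandwidth ceiling on generation speed:
# every generated token reads the full set of quantized weights once.
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

for label, bw in (("N100 single-channel", 25), ("Ryzen dual-channel", 50)):
    # ~4.5 GB is a typical 7B model at Q4 quantization
    print(f"{label}: <= {max_tokens_per_second(bw, 4.5):.1f} tok/s for a 7B Q4 model")
# N100: <= ~5.6 tok/s in theory; observed 0.5-2 tok/s once compute overhead is included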

Understanding Local LLM Performance

Setting realistic expectations is crucial. Self-hosted AI on consumer hardware won't match cloud services, but it can be surprisingly useful.

Tokens Per Second Explained

LLM performance is measured in tokens per second (tok/s). A token is roughly 4 characters or 0.75 words.

Speed | Experience | Use Case
1 tok/s | Painfully slow | Background processing only
5 tok/s | Usable | Simple questions, short responses
10 tok/s | Comfortable | General chat, coding assistance
20+ tok/s | Real-time | Streaming responses, interactive use

On an Intel N100:

  • 0.5B-1.5B models: 5-15 tokens/second
  • 3B models: 2-5 tokens/second
  • 7B models: 0.5-2 tokens/second

Quantization: Trading Quality for Speed

Quantization reduces model precision to decrease memory usage and increase speed. The format is typically expressed as Q4, Q5, Q8:

Quantization | Size vs F16 | Quality Impact | Use Case
Q2 | ~85% smaller | Noticeable degradation | Extremely constrained hardware
Q4_K_M | ~70% smaller | Minimal impact | Best balance for most users
Q5_K_M | ~65% smaller | Very slight impact | Quality-focused
Q8 | ~50% smaller | Nearly lossless | Maximum quality
F16 | Baseline | Full precision | Research, fine-tuning

Recommendation: Start with Q4_K_M quantization for the best balance of speed and quality.
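
Ollama's library tags usually encode the quantization, so you can pull a specific variant rather than the default. A small Python sketch using the /api/pull endpoint; the tag shown is only an example, so check the model's tag list in the Ollama library:

import requests

# Pull a specific quantization tag via Ollama's API (equivalent to `ollama pull`).
# The tag below is an example; confirm available tags in the Ollama model library.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"name": "llama3.2:3b-instruct-q4_K_M", "stream": False},
    timeout=3600,  # model downloads can take a while
)
print(resp.json())  # {"status": "success"} once the download completes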

Choosing the Right Model

Model selection is crucial for a good experience on low-power hardware. Here are the best options for Intel N100 and similar systems:

Tier 1: Fast & Practical (0.5B-1.5B)

These models run smoothly on N100 hardware:

Model | Parameters | Best For | Speed (N100)
Qwen 2.5 0.5B | 0.5B | Quick answers, simple tasks | 10-15 tok/s
Llama 3.2 1B | 1B | General chat, summarization | 8-12 tok/s
DeepSeek Coder 1.3B | 1.3B | Coding help, quick completions | 5-8 tok/s
Qwen 2.5 1.5B | 1.5B | Balanced performance | 5-8 tok/s

Tier 2: More Capable (3B-7B)

Usable on N100, better on dual-channel systems:

Model | Parameters | Best For | Speed (N100)
Llama 3.2 3B | 3B | General assistant | 2-4 tok/s
Phi-3 Mini | 3.8B | Coding, reasoning | 2-3 tok/s
Qwen 2.5 3B | 3B | Multi-language, coding | 2-4 tok/s
Gemma 2 2B | 2B | Efficient general use | 3-5 tok/s

Tier 3: Maximum Capability (7B+)

Requires patience on N100, or better hardware:

Model | Parameters | Best For | Speed (N100)
Llama 3.1 8B | 8B | Complex reasoning | 0.5-1.5 tok/s
Mistral 7B | 7B | Strong all-rounder | 0.5-1.5 tok/s
DeepSeek Coder 6.7B | 6.7B | Code generation | 0.5-1.5 tok/s

Specialized Models

Use Case | Recommended Model | Notes
Coding | DeepSeek Coder, CodeLlama | IDE integration ready
Creative Writing | Llama 3.2, Mistral | Uncensored versions available
Summarization | Qwen 2.5, Phi-3 | Excellent at condensing text
Vision (image analysis) | LLaVA, Llama 3.2 Vision | Requires more RAM
Embeddings | nomic-embed-text | For RAG applications

Installation Guide

Now let's set up your self-hosted AI stack with Docker Compose.

Prerequisites

Docker & Docker Compose Installation:

# Debian/Ubuntu
sudo apt update
sudo apt install docker.io docker-compose-plugin
# Note: the Compose plugin package name varies by repository:
# docker-compose-plugin comes from Docker's official apt repo,
# while recent Ubuntu releases ship it as docker-compose-v2.
sudo systemctl enable --now docker
sudo usermod -aG docker $USER
# Log out and back in for group changes

# Verify installation
docker --version
docker compose version

System Preparation:

# Create directory structure
mkdir -p ~/ai-stack/{ollama,open-webui}
cd ~/ai-stack

# Check available RAM
free -h

Deploying Ollama

Ollama is the LLM runtime that downloads, manages, and serves AI models.

Option 1: Docker Compose (Recommended)

Create docker-compose.yml:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ./ollama:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
    # For low-power systems, limit CPU usage
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 12G

Start Ollama:

docker compose up -d ollama

# Check logs
docker compose logs -f ollama

Pulling Your First Model:

# Pull a lightweight model for testing
docker exec -it ollama ollama pull qwen2.5:1.5b

# List available models
docker exec -it ollama ollama list

# Test the model
docker exec -it ollama ollama run qwen2.5:1.5b "Hello! What can you help me with?"

Recommended Models to Pull:

# Fast, everyday assistant
docker exec -it ollama ollama pull llama3.2:1b

# More capable, slower
docker exec -it ollama ollama pull llama3.2:3b

# Coding assistant
docker exec -it ollama ollama pull deepseek-coder:1.3b

# For embeddings (RAG)
docker exec -it ollama ollama pull nomic-embed-text
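
To confirm what ended up installed and how much disk each model uses, you can query the /api/tags endpoint instead of the CLI; a short Python sketch:

import requests

# List installed models and their approximate size on disk.
tags = requests.get("http://localhost:11434/api/tags", timeout=10).json()
for model in tags.get("models", []):
    size_gb = model["size"] / 1024**3
    print(f"{model['name']:30s} {size_gb:5.1f} GB")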

Setting Up Open WebUI

Open WebUI provides a beautiful ChatGPT-like interface for interacting with your local models.

Add to docker-compose.yml:

services:
  ollama:
    # ... (previous ollama config)

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - ./open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=your-secure-secret-key-change-this
      - DEFAULT_USER_ROLE=user
      - ENABLE_SIGNUP=true
    depends_on:
      - ollama

Deploy the complete stack:

docker compose up -d

# Watch the logs
docker compose logs -f

Access Open WebUI:

  1. Open http://your-server-ip:3000 in your browser
  2. Create an admin account (first signup becomes admin)
  3. Select a model from the dropdown
  4. Start chatting!

Complete Docker Compose Configuration

Here's the full production-ready configuration:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=1
      - OLLAMA_MAX_LOADED_MODELS=1
    healthcheck:
      test: ["CMD", "ollama", "list"]  # the ollama image ships without curl, so probe with the CLI
      interval: 30s
      timeout: 10s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 12G

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - openwebui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=${WEBUI_SECRET_KEY:-changeme}
      - DEFAULT_USER_ROLE=user
      - ENABLE_SIGNUP=true
      - ENABLE_RAG_WEB_SEARCH=false
      - ENABLE_IMAGE_GENERATION=false
    depends_on:
      ollama:
        condition: service_healthy

volumes:
  ollama_data:
  openwebui_data:

Create a .env file:

WEBUI_SECRET_KEY=your-long-random-secret-key-here

Optimizing Performance on Low-Power Hardware

Getting the best experience on an N100 requires careful tuning.

Model Selection Strategy

RAM Available → Model Choice
├── 8GB  → Qwen 0.5B, Llama 3.2 1B only
├── 16GB → Llama 3.2 3B, Phi-3 Mini (comfortable)
├── 32GB → Llama 3.1 8B, Mistral 7B (usable)
└── 64GB → Any model, multiple models loaded

Ollama Environment Tuning

Add these environment variables for low-power optimization:

environment:
  # Reduce context length to save RAM
  - OLLAMA_NUM_CTX=2048
  
  # Single model at a time (saves RAM)
  - OLLAMA_MAX_LOADED_MODELS=1
  
  # Unload models faster (saves RAM)
  - OLLAMA_KEEP_ALIVE=2m
  
  # Limit concurrent requests
  - OLLAMA_NUM_PARALLEL=1
  
  # Use all available threads
  - OLLAMA_NUM_THREAD=4

Context Length vs Performance

Context length (num_ctx) determines how much text the model can "remember" in a conversation:

Context Length | RAM Impact | Speed Impact | Use Case
512 | Minimal | Fastest | Quick Q&A
2048 | Moderate | Good | Standard chat
4096 | Significant | Slower | Document analysis
8192 | High | Much slower | Long conversations

For N100 systems, stick with 2048 context for the best balance.
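
Context length can also be raised per request through the API's options field, which lets you keep the server default small and only pay the cost when analyzing a document. A minimal Python sketch (the model name and prompt are placeholders):

import requests

# Request a larger context window only for this call via options.num_ctx.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Summarize the following document: ...",
        "stream": False,
        "options": {"num_ctx": 4096},  # overrides the default for this request only
    },
    timeout=600,
)
print(resp.json()["response"])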

CPU Thread Optimization

# Check your CPU cores
nproc

# For Intel N100 (4 cores), use all cores:
OLLAMA_NUM_THREAD=4

Memory Management

Monitor memory usage and swap:

# Watch memory in real-time
watch -n 1 free -h

# Check if swapping (bad for performance)
vmstat 1

If you see heavy swapping, either:

  1. Use a smaller model
  2. Reduce context length
  3. Add more RAM

Real-World Performance Benchmarks

Here are actual benchmarks from the community on Intel N100 hardware with 16GB RAM:

Speed Benchmarks by Model

Model | Prompt Eval | Generation | Notes
Qwen 2.5 0.5B (Q4) | 50 tok/s | 12 tok/s | Very responsive
Llama 3.2 1B (Q4) | 35 tok/s | 8 tok/s | Good daily driver
Qwen 2.5 1.5B (Q4) | 25 tok/s | 5 tok/s | Best quality/speed
Llama 3.2 3B (Q4) | 15 tok/s | 3 tok/s | Usable, patient users
Phi-3 Mini (Q4) | 12 tok/s | 2.5 tok/s | Good for coding
Llama 3.1 8B (Q4) | 5 tok/s | 1 tok/s | Background tasks only
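
You can reproduce these measurements on your own hardware: the non-streaming /api/generate response reports token counts and durations in nanoseconds, so tokens per second falls out directly. A small Python benchmarking sketch (model and prompt are arbitrary examples):

import requests

# Measure prompt-eval and generation speed for one model on your own hardware.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b", "prompt": "Explain RAID in two sentences.", "stream": False},
    timeout=600,
).json()

gen_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"generation: {gen_tps:.1f} tok/s")
if "prompt_eval_count" in resp:  # may be absent when the prompt is served from cache
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    print(f"prompt eval: {prompt_tps:.1f} tok/s")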

Response Time Examples

For a simple question ("What is the capital of France?"):

Model | Time to First Token | Complete Response
Qwen 0.5B | 0.5s | 2s
Llama 3.2 1B | 1s | 4s
Llama 3.2 3B | 2s | 10s
Llama 3.1 8B | 5s | 30s

Comparison with Cloud Services

Metric | Self-Hosted (N100) | ChatGPT Plus
Response Speed | 2-10 tok/s | 50-100 tok/s
Privacy | Full | Limited
Monthly Cost | ~$3 electricity | $20 subscription
Offline Use | Yes | No
Custom Models | Yes | No

Use Cases for Your Private AI

Once running, here's what you can actually do with self-hosted AI:

Document Summarization

Upload PDFs, research papers, or long articles and get concise summaries. Particularly useful for:

  • Legal documents
  • Technical specifications
  • Meeting notes
  • News articles

Coding Assistance

Models like DeepSeek Coder and Phi-3 excel at:

  • Code explanation
  • Bug identification
  • Generating boilerplate
  • Documentation writing

Home Automation Integration

Connect Ollama to Home Assistant for:

  • Natural language device control
  • Intelligent automation suggestions
  • Status summarization

Personal Knowledge Base (RAG)

With Open WebUI's RAG features:

  • Index your personal documents
  • Query your notes and files
  • Build a searchable knowledge base

Writing Assistant

  • Draft emails
  • Blog post outlines
  • Creative writing prompts
  • Grammar checking

Troubleshooting Common Issues

Out of Memory Errors

Symptom: Container crashes or "failed to allocate memory"

Solutions:

# Use smaller model
docker exec -it ollama ollama pull qwen2.5:0.5b

# Reduce context length
# Add to docker-compose.yml:
environment:
  - OLLAMA_NUM_CTX=1024

# Check actual memory usage
docker stats ollama

Slow Inference

Symptom: Very slow responses (under 1 tok/s)

Solutions:

  • Switch to smaller model (3B → 1B)
  • Ensure no swap usage (free -h)
  • Check CPU isn't thermal throttling (sensors)
  • Use more aggressive quantization (Q4_K_M → Q4_K_S)

Container Networking Issues

Symptom: Open WebUI can't connect to Ollama

Solutions:

# Verify Ollama is responding
curl http://localhost:11434/api/tags

# Check container networking
docker network ls
docker network inspect ai-stack_default

# Ensure both containers on same network
docker compose down && docker compose up -d

Model Download Failures

Symptom: Model pull hangs or fails

Solutions:

# Check available disk space
df -h

# Retry the pull and watch the server logs for errors
docker exec -it ollama ollama pull llama3.2:1b
docker compose logs -f ollama

# Manual import (if the registry is unreachable)
# Download a GGUF from Hugging Face, reference it from a Modelfile
# (FROM /path/to/model.gguf), then import it with `ollama create`.

High CPU Usage When Idle

Symptom: Ollama uses CPU even without requests

Solutions:

# Add keep-alive timeout
environment:
  - OLLAMA_KEEP_ALIVE=30s  # Unload models after 30 seconds
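
To verify that models really unload after the keep-alive window, you can poll the /api/ps endpoint, which lists the models currently held in memory. A quick Python sketch (exact response fields can vary between Ollama versions):

import requests
import time

# Check which models are currently loaded; an empty list means everything
# has been unloaded and idle CPU/RAM usage should drop back down.
for _ in range(3):
    loaded = requests.get("http://localhost:11434/api/ps", timeout=10).json().get("models", [])
    print(f"{len(loaded)} model(s) loaded:", [m.get("name") for m in loaded])
    time.sleep(60)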

Advanced Configuration

Once you have the basics working, these advanced configurations unlock more capabilities.

Enabling RAG (Retrieval Augmented Generation)

RAG allows your AI to answer questions about your own documents:

Configure Open WebUI for RAG:

# Add to open-webui environment
environment:
  - ENABLE_RAG_WEB_SEARCH=false
  - RAG_EMBEDDING_ENGINE=ollama  # use Ollama for embeddings instead of the built-in SentenceTransformers
  - RAG_EMBEDDING_MODEL=nomic-embed-text
  - RAG_RERANKING_MODEL=
  - CHUNK_SIZE=1000
  - CHUNK_OVERLAP=100

Pull the embedding model:

docker exec -it ollama ollama pull nomic-embed-text

Using RAG in Open WebUI:

  1. Click the + button next to the chat input
  2. Upload documents (PDF, TXT, MD, DOCX)
  3. Documents are automatically chunked and embedded
  4. Ask questions—the AI will search your documents for context
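
Under the hood, this RAG flow embeds your document chunks, embeds the question, and hands the most similar chunks to the model as context. A simplified Python sketch of the retrieval step against Ollama's embeddings endpoint (chunking and prompt assembly are reduced to a toy example):

import requests

OLLAMA = "http://localhost:11434"

def embed(text):
    # Legacy embeddings endpoint; newer Ollama versions also expose /api/embed.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text}, timeout=120)
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

chunks = [
    "The NAS backup job runs every night at 02:00.",
    "The router's admin password is stored in the password manager.",
    "Ollama listens on port 11434 by default.",
]
question = "Which port does Ollama use?"

chunk_vectors = [embed(c) for c in chunks]
question_vector = embed(question)
best = max(zip(chunks, chunk_vectors), key=lambda cv: cosine(question_vector, cv[1]))
print("Most relevant chunk:", best[0])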

Multiple Model Configurations

Run different models for different purposes:

Create model aliases with custom parameters:

# Create a fast model for simple queries: write the Modelfile, copy it into
# the container, then build the alias with `ollama create`
cat > Modelfile.fast << 'EOF'
FROM qwen2.5:0.5b
PARAMETER num_ctx 1024
PARAMETER temperature 0.7
SYSTEM You are a fast, concise assistant. Keep responses brief.
EOF
docker cp Modelfile.fast ollama:/tmp/Modelfile.fast
docker exec ollama ollama create fast-assistant -f /tmp/Modelfile.fast

# Create a thorough model for complex tasks
cat > Modelfile.thorough << 'EOF'
FROM llama3.2:3b
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
SYSTEM You are a thorough assistant. Provide detailed, well-reasoned responses.
EOF
docker cp Modelfile.thorough ollama:/tmp/Modelfile.thorough
docker exec ollama ollama create thorough-assistant -f /tmp/Modelfile.thorough

API Integration

Ollama provides an OpenAI-compatible API for integration with other tools:

Basic API Usage:

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Generate embeddings
curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "This is a test sentence for embedding."
  }'

Python Integration:

import requests

def chat(prompt, model="llama3.2:1b"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False}
    )
    return response.json()["response"]

# Example usage
answer = chat("What is the capital of France?")
print(answer)
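
For interactive use you will usually want streaming instead, so tokens appear as they are generated. With streaming enabled, /api/generate returns one JSON object per line; a short Python sketch:

import json
import requests

# Stream tokens as they are generated instead of waiting for the full response.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:1b", "prompt": "Write a haiku about home servers.", "stream": True},
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break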

Remote Access Setup

Access your AI from outside your home network with one of the options below:

Option 1: Tailscale (Recommended)

# Install Tailscale on your server
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

# Access from any device on your Tailnet
# http://your-server-tailscale-ip:3000

Option 2: Reverse Proxy with HTTPS

Using Caddy for automatic HTTPS:

# Add to docker-compose.yml under the existing services: block,
# and register caddy_data in the top-level volumes: section
  caddy:
    image: caddy:latest
    container_name: caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data

Create Caddyfile:

ai.yourdomain.com {
    reverse_proxy open-webui:8080
}

Backup and Migration

Protect your configurations and chat history:

Backup Script:

#!/bin/bash
# backup-ai-stack.sh
# Note: docker compose prefixes named volumes with the project name (the
# directory name by default), e.g. ai-stack_ollama_data. Adjust PREFIX if yours differs.

PREFIX="ai-stack"
BACKUP_DIR="/backup/ai-stack-$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Stop containers for a consistent backup
docker compose stop

# Backup volumes
docker run --rm -v ${PREFIX}_ollama_data:/data -v "$BACKUP_DIR":/backup alpine \
    tar czf /backup/ollama-data.tar.gz /data

docker run --rm -v ${PREFIX}_openwebui_data:/data -v "$BACKUP_DIR":/backup alpine \
    tar czf /backup/openwebui-data.tar.gz /data

# Backup configuration
cp docker-compose.yml "$BACKUP_DIR"/
cp .env "$BACKUP_DIR"/

# Restart containers
docker compose start

echo "Backup completed: $BACKUP_DIR"

Restore Script:

#!/bin/bash
# restore-ai-stack.sh
# Usage: ./restore-ai-stack.sh /backup/ai-stack-YYYYMMDD

PREFIX="ai-stack"
BACKUP_DIR=$1

# Stop and remove containers and volumes
docker compose down -v

# Recreate and restore volumes
docker volume create ${PREFIX}_ollama_data
docker volume create ${PREFIX}_openwebui_data

docker run --rm -v ${PREFIX}_ollama_data:/data -v "$BACKUP_DIR":/backup alpine \
    tar xzf /backup/ollama-data.tar.gz -C /

docker run --rm -v ${PREFIX}_openwebui_data:/data -v "$BACKUP_DIR":/backup alpine \
    tar xzf /backup/openwebui-data.tar.gz -C /

# Restore configuration
cp "$BACKUP_DIR"/docker-compose.yml ./
cp "$BACKUP_DIR"/.env ./

# Start containers
docker compose up -d

Integrations and Automation

Home Assistant Integration

Connect your AI to Home Assistant for voice control and automation:

Install the Ollama integration:

  1. Go to Settings → Devices & Services → Add Integration
  2. Search for "Ollama" or use the conversation agent
  3. Configure the Ollama URL: http://your-server:11434

Create AI-powered automations:

# configuration.yaml
conversation:
  intents:
    HassLightSet:
      - "Turn {area} lights {state}"
      - "Set {area} brightness to {brightness}"

# Use Ollama for natural language understanding
# Example: "Make the living room cozy" → dims lights, adjusts color temperature

n8n Workflow Automation

Integrate with n8n for complex AI workflows:

{
  "nodes": [
    {
      "name": "Ollama",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "url": "http://ollama:11434/api/generate",
        "method": "POST",
        "body": {
          "model": "llama3.2:1b",
          "prompt": "={{ $json.input }}",
          "stream": false
        }
      }
    }
  ]
}

VS Code Integration

Use your local AI for coding assistance:

Install Continue extension:

  1. Install "Continue" extension in VS Code
  2. Configure ~/.continue/config.json:
{
  "models": [
    {
      "title": "Local Ollama",
      "provider": "ollama",
      "model": "deepseek-coder:1.3b",
      "apiBase": "http://localhost:11434"
    }
  ]
}

Obsidian Integration

Add AI to your note-taking workflow:

  1. Install the "Text Generator" plugin
  2. Configure provider as "Ollama"
  3. Set endpoint: http://localhost:11434
  4. Select your preferred model

Use cases:

  • Summarize long notes
  • Generate ideas from existing content
  • Expand bullet points into paragraphs
  • Create flashcards from notes

Security Considerations

Network Security

Bind to localhost only (if not exposing remotely):

services:
  ollama:
    ports:
      - "127.0.0.1:11434:11434"  # Only accessible from localhost

Use a firewall:

# Allow only local network access
sudo ufw allow from 192.168.1.0/24 to any port 3000
sudo ufw allow from 192.168.1.0/24 to any port 11434

Authentication

Open WebUI provides built-in authentication:

environment:
  - WEBUI_AUTH=true
  - ENABLE_SIGNUP=false  # Disable public registration
  - DEFAULT_USER_ROLE=user

Managing users:

After the first signup becomes the admin account, disable public registration (ENABLE_SIGNUP=false) and add any additional users from the Admin Panel → Users section inside Open WebUI.

Model Security

Be aware of model capabilities and limitations:

  • Uncensored models: Some models remove safety filters—use responsibly
  • Prompt injection: Local models can still be manipulated via prompts
  • Data leakage: Models may memorize and repeat training data
  • Resource exhaustion: Large prompts can consume significant resources
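
On the resource-exhaustion point in the list above, it helps to cap generation length and set client-side timeouts whenever scripts or other users can hit the API. A minimal Python sketch using the num_predict option; the specific limits are arbitrary examples:

import requests

# Cap the response length and fail fast instead of letting one request
# monopolize the CPU for minutes. The specific limits are example values.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Summarize this log file: ...",
        "stream": False,
        "options": {
            "num_predict": 256,  # stop after at most 256 generated tokens
            "num_ctx": 2048,     # keep the context window modest
        },
    },
    timeout=120,  # give up if the server is overloaded
)
resp.raise_for_status()
print(resp.json()["response"])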

Upgrading and Maintenance

Updating Containers

# Pull latest images
docker compose pull

# Recreate containers with new images
docker compose up -d --force-recreate

# Clean up old images
docker image prune -f

Updating Models

# List current models
docker exec -it ollama ollama list

# Update a specific model
docker exec -it ollama ollama pull llama3.2:1b

# Remove old model versions
docker exec -it ollama ollama rm llama3.2:1b-old

Monitoring

Check resource usage:

# Container stats
docker stats ollama open-webui

# Ollama-specific metrics
curl http://localhost:11434/api/ps

Set up alerts:

#!/bin/bash
# Simple health check script (run it from cron, for example)
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
    echo "Ollama is down!" | mail -s "AI Stack Alert" admin@example.com
fi

Future Considerations

GPU Acceleration

When you're ready to upgrade for better performance:

NVIDIA GPU Setup:

services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Budget GPU Options:

  • NVIDIA Tesla P40 (~$150 used): 24GB VRAM, comfortable with 13B-34B models (70B still needs multiple cards or very aggressive quantization)
  • NVIDIA RTX 3060 12GB (~$250): Good balance of price/performance
  • Intel Arc A380 (~$100): Experimental Ollama support

Upcoming Features

Keep an eye on:

  • Ollama: continued performance improvements, broader hardware support, and more model formats
  • Open WebUI: enhanced RAG, collaborative features, and plugins
  • Models: ever smaller, faster models with better quality across the Phi, Llama, and Qwen families

Key Takeaways

Self-hosting AI with Ollama and Open WebUI is practical and rewarding:

  • Start small: Begin with 1.5B parameter models on N100 hardware for usable 5-8 tok/s performance
  • RAM is king: 16GB minimum, 32GB recommended for flexibility
  • Privacy matters: Your conversations stay local with zero cloud dependency
  • Cost-effective: Pay only for electricity (~$3/month) vs $20/month subscriptions
  • Customize freely: Choose models optimized for your specific use cases

Additional Resources

Official Documentation

  • Ollama Documentation
  • Open WebUI Docs
  • Ollama GitHub
  • Open WebUI GitHub

Community Resources

  • r/LocalLLaMA - Local AI community
  • r/selfhosted - Self-hosting enthusiasts
  • r/homelab - Home server community

Related Guides

  • Raspberry Pi 5 vs Intel N100 - Hardware comparison
  • Linux Power Optimization Guide - Reduce power consumption
  • Tailscale vs Cloudflare Tunnel - Remote access options

Model Resources

  • Ollama Model Library - Official models
  • Hugging Face - Model discovery
  • LM Studio - Alternative runtime with GUI

Last updated: December 2025
