Ollama Manager Guide

Introduction

Ollama Manager provides a streamlined way to prototype and develop applications using Ollama's AI models. Instead of manually managing the Ollama server process, installing it as a service, or running it in a separate terminal, Ollama Manager handles the entire lifecycle programmatically.

Key Benefits for Prototyping:

  • Start/stop Ollama server automatically within your Python code
  • Configure resources dynamically based on your needs
  • Handle multiple server instances for testing
  • Automatic cleanup of resources
  • Platform-independent operation

Quick Start

from clientai import ClientAI
from clientai.ollama import OllamaManager

# Basic usage - server starts automatically and stops when done
with OllamaManager() as manager:
    # Create a client that connects to the managed server
    client = ClientAI('ollama', host="http://localhost:11434")

    # Use the client normally
    response = client.generate_text(
        "Explain quantum computing",
        model="llama2"
    )
    print(response)
# Server automatically stops when exiting the context

Installation

# Install with Ollama support
pip install "clientai[ollama]"

# Install with all providers
pip install "clientai[all]"

Core Concepts

Server Lifecycle Management

  1. Context Manager (Recommended)

    with OllamaManager() as manager:
        # Server starts automatically
        client = ClientAI('ollama')
        # Use client...
    # Server stops automatically
    

  2. Manual Management

    manager = OllamaManager()
    try:
        manager.start()
        client = ClientAI('ollama')
        # Use client...
    finally:
        manager.stop()
    

Configuration Management

from clientai.ollama import OllamaServerConfig

# Create custom configuration
config = OllamaServerConfig(
    host="127.0.0.1",
    port=11434,
    gpu_layers=35,
    memory_limit="8GiB"
)

# Use configuration with manager
with OllamaManager(config) as manager:
    client = ClientAI('ollama')
    # Use client...

Resource Configuration Guide

GPU Configuration

  1. Basic GPU Setup

    config = OllamaServerConfig(
        gpu_layers=35,  # Number of layers to run on GPU
        gpu_memory_fraction=0.8  # 80% of GPU memory
    )
    

  2. Multi-GPU Setup

    config = OllamaServerConfig(
        gpu_devices=[0, 1],  # Use first two GPUs
        gpu_memory_fraction=0.7
    )
    

Memory Management

The memory_limit parameter must follow Go's memory limit syntax (the GOMEMLIMIT format):

# Correct memory limit formats
config = OllamaServerConfig(memory_limit="8GiB")     # 8 gibibytes
config = OllamaServerConfig(memory_limit="8192MiB")  # Same as 8GiB

# Invalid formats that will cause errors
# memory_limit="8GB"    # Wrong unit (decimal GB is not accepted)
# memory_limit="8 GiB"  # No spaces allowed
# memory_limit="8g"     # Must spell out the full unit

Valid Memory Units:

  • B - Bytes
  • KiB - Kibibytes (1024 bytes)
  • MiB - Mebibytes (1024² bytes)
  • GiB - Gibibytes (1024³ bytes)
  • TiB - Tebibytes (1024⁴ bytes)
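
If you build memory limits dynamically, it can help to validate the string before passing it on. A minimal sketch (the helper below is illustrative, not part of clientai):

import re

# Hypothetical helper: check a memory_limit string against the accepted
# Go-style binary units before handing it to OllamaServerConfig.
_VALID_MEMORY_LIMIT = re.compile(r"^\d+(B|KiB|MiB|GiB|TiB)$")

def is_valid_memory_limit(value: str) -> bool:
    return bool(_VALID_MEMORY_LIMIT.fullmatch(value))

assert is_valid_memory_limit("8GiB")
assert not is_valid_memory_limit("8 GiB")  # spaces not allowed
assert not is_valid_memory_limit("8g")     # unit must be spelled out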

Common Configurations:

  1. High-Performance Setup

    config = OllamaServerConfig(
        memory_limit="24GiB",
        cpu_threads=16,
        gpu_layers=40
    )
    

  2. Balanced Setup

    config = OllamaServerConfig(
        memory_limit="16GiB",
        cpu_threads=8,
        gpu_layers=32
    )
    

  3. Resource-Constrained Setup

    config = OllamaServerConfig(
        memory_limit="8GiB",
        cpu_threads=4,
        gpu_layers=24
    )
    

  4. Dynamic Memory Allocation

    import psutil
    
    # Calculate available memory and use 70%
    available_gib = (psutil.virtual_memory().available / (1024**3))
    memory_limit = f"{int(available_gib * 0.7)}GiB"
    
    config = OllamaServerConfig(
        memory_limit=memory_limit,
        cpu_threads=8
    )
    

Model-Specific Memory Guidelines

Different models require different amounts of memory:

  1. Large Language Models (>30B parameters)

    config = OllamaServerConfig(
        memory_limit="24GiB",
        gpu_memory_fraction=0.9
    )
    

  2. Medium Models (7B-30B parameters)

    config = OllamaServerConfig(
        memory_limit="16GiB",
        gpu_memory_fraction=0.8
    )
    

  3. Small Models (<7B parameters)

    config = OllamaServerConfig(
        memory_limit="8GiB",
        gpu_memory_fraction=0.7
    )
    
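To apply these guidelines programmatically, a small selector can map a model's parameter count to a tier. A sketch (the helper name and thresholds are illustrative, not part of the library):

def config_for_model_size(billions_of_params: float) -> OllamaServerConfig:
    # Tiers mirror the guidelines above; tune them for your hardware.
    if billions_of_params > 30:
        return OllamaServerConfig(memory_limit="24GiB", gpu_memory_fraction=0.9)
    if billions_of_params >= 7:
        return OllamaServerConfig(memory_limit="16GiB", gpu_memory_fraction=0.8)
    return OllamaServerConfig(memory_limit="8GiB", gpu_memory_fraction=0.7)

config = config_for_model_size(13)  # e.g. a 13B model gets the medium tier
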

Advanced Configuration Reference

config = OllamaServerConfig(
    # Server settings
    host="127.0.0.1",      # Server bind address
    port=11434,            # Server port number
    timeout=30,            # Maximum startup wait time in seconds
    check_interval=1.0,    # Health check interval in seconds

    # GPU settings
    gpu_layers=35,          # More layers = more GPU utilization
    gpu_memory_fraction=0.8, # 0.0 to 1.0 (80% GPU memory)
    gpu_devices=[0],        # Specific GPU devices to use

    # CPU settings
    cpu_threads=8,          # Number of CPU threads
    memory_limit="16GiB",   # Maximum RAM usage (must use GiB/MiB units)

    # Compute settings
    compute_unit="auto",    # "auto", "cpu", or "gpu"

    # Additional settings
    env_vars={"CUSTOM_VAR": "value"},  # Additional environment variables
    extra_args=["--verbose"]           # Additional command line arguments
)

Each setting explained:

Server Settings:

  • host: IP address to bind the server to

    • Default: "127.0.0.1" (localhost)
    • Use "0.0.0.0" to allow external connections
  • port: Port number for the server

    • Default: 11434
    • Change if default port is in use
  • timeout: Maximum time to wait for server startup

    • Unit: seconds
    • Increase for slower systems
  • check_interval: Time between server health checks

    • Unit: seconds
    • Adjust based on system responsiveness

GPU Settings:

  • gpu_layers: Number of model layers to offload to GPU

    • Higher = more GPU utilization
    • Lower = more CPU utilization
    • Model-dependent (typically 24-40)
  • gpu_memory_fraction: Portion of GPU memory to use

    • Range: 0.0 to 1.0
    • Higher values may improve performance
    • Lower values leave room for other applications
  • gpu_devices: Specific GPU devices to use

    • Single GPU: gpu_devices=0
    • Multiple GPUs: gpu_devices=[0, 1]
    • None: gpu_devices=None

CPU Settings:

  • cpu_threads: Number of CPU threads to use

    • Default: System dependent
    • Recommended: Leave some threads for system
    • Example: os.cpu_count() - 2
  • memory_limit: Maximum RAM allocation

    • Must use GiB or MiB units
    • Examples: "8GiB", "16384MiB"
    • Should not exceed available system RAM

Compute Settings:

  • compute_unit: Preferred compute device
    • "auto": Let Ollama decide (recommended)
    • "cpu": Force CPU-only operation
    • "gpu": Force GPU operation if available

Additional Settings:

  • env_vars: Additional environment variables

    • Used for platform-specific settings
    • Example: {"CUDA_VISIBLE_DEVICES": "0,1"}
  • extra_args: Additional CLI arguments

    • Passed directly to Ollama server
    • Example: ["--verbose", "--debug"]

Common Use Cases

1. Development and Prototyping

# Quick setup for development
with OllamaManager() as manager:
    client = ClientAI('ollama')

    # Test different prompts
    prompts = [
        "Write a poem about AI",
        "Explain quantum physics",
        "Create a Python function"
    ]

    for prompt in prompts:
        response = client.generate_text(prompt, model="llama2")
        print(f"Prompt: {prompt}\nResponse: {response}\n")

2. Multiple Model Testing

# Test different models with different configurations
models = ["llama2", "codellama", "mistral"]

for model in models:
    # Adjust configuration based on model
    if model == "llama2":
        config = OllamaServerConfig(gpu_layers=35)
    else:
        config = OllamaServerConfig(gpu_layers=28)

    with OllamaManager(config) as manager:
        client = ClientAI('ollama')
        response = client.generate_text(
            "Explain how you work",
            model=model
        )
        print(f"{model}: {response}\n")

3. A/B Testing Configurations

import time

def test_configuration(config, prompt, model):
    start_time = time.time()

    with OllamaManager(config) as manager:
        client = ClientAI('ollama')
        response = client.generate_text(prompt, model=model)

    duration = time.time() - start_time
    return response, duration

# Test different configurations
configs = {
    "high_gpu": OllamaServerConfig(gpu_layers=40, gpu_memory_fraction=0.9),
    "balanced": OllamaServerConfig(gpu_layers=32, gpu_memory_fraction=0.7),
    "low_resource": OllamaServerConfig(gpu_layers=24, gpu_memory_fraction=0.5)
}

for name, config in configs.items():
    response, duration = test_configuration(
        config,
        "Write a long story about space",
        "llama2"
    )
    print(f"Configuration {name}: {duration:.2f} seconds")

4. Production Setup

import logging

logging.basicConfig(level=logging.INFO)

def create_production_manager():
    config = OllamaServerConfig(
        # Stable production settings
        gpu_layers=32,
        gpu_memory_fraction=0.7,
        memory_limit="16GiB",
        timeout=60,  # Longer timeout for stability
        check_interval=2.0
    )

    try:
        manager = OllamaManager(config)
        manager.start()
        return manager
    except Exception as e:
        logging.error(f"Failed to start Ollama: {e}")
        raise

# Use in production
manager = None
try:
    manager = create_production_manager()
    client = ClientAI('ollama')
    # Use client...
finally:
    if manager is not None:
        manager.stop()

Error Handling

Ollama Manager provides several specific exception types to help you handle different error scenarios effectively:
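
The handlers below assume these exception types are importable from clientai.ollama; adjust the import path if your version of clientai exposes them elsewhere:

from clientai.ollama import (
    ExecutableNotFoundError,
    ServerStartupError,
    ServerTimeoutError,
    ResourceError,
    ConfigurationError,
    UnsupportedPlatformError,
)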

ExecutableNotFoundError

Occurs when the Ollama executable cannot be found on the system.

Common causes:

  • Ollama not installed
  • Ollama not in system PATH
  • Incorrect installation

How to handle:

import platform

try:
    manager = OllamaManager()
    manager.start()
except ExecutableNotFoundError:
    # Guide user through installation
    if platform.system() == "Darwin":
        print("Install Ollama using: brew install ollama")
    elif platform.system() == "Linux":
        print("Install Ollama using: curl -fsSL https://ollama.com/install.sh | sh")
    else:
        print("Download Ollama from: https://ollama.com/download")

ServerStartupError

Occurs when the server fails to start properly.

Common causes:

  • Port already in use
  • Insufficient permissions
  • Corrupt installation
  • Resource conflicts

How to handle:

try:
    manager = OllamaManager()
    manager.start()
except ServerStartupError as e:
    if "address already in use" in str(e).lower():
        # Try alternative port
        config = OllamaServerConfig(port=11435)
        manager = OllamaManager(config)
        manager.start()
    elif "permission denied" in str(e).lower():
        print("Please run with appropriate permissions")
    else:
        print(f"Server startup failed: {e}")

ServerTimeoutError

Occurs when the server doesn't respond within the configured timeout period.

Common causes:

  • System under heavy load
  • Insufficient resources
  • Network issues
  • Too short a timeout period

How to handle:

config = OllamaServerConfig(
    timeout=60,  # Increase timeout
    check_interval=2.0  # Reduce check frequency
)
try:
    manager = OllamaManager(config)
    manager.start()
except ServerTimeoutError:
    # Either retry with longer timeout or fail gracefully
    config.timeout = 120
    try:
        manager = OllamaManager(config)
        manager.start()
    except ServerTimeoutError:
        print("Server unresponsive after extended timeout")

ResourceError

Occurs when there are insufficient system resources to run Ollama.

Common causes:

  • Insufficient memory
  • GPU memory allocation failures
  • Too many CPU threads requested
  • Disk space constraints

How to handle:

try:
    manager = OllamaManager()
    manager.start()
except ResourceError as e:
    if "memory" in str(e).lower():
        # Try with reduced memory settings
        config = OllamaServerConfig(
            memory_limit="4GiB",
            gpu_memory_fraction=0.5
        )
    elif "gpu" in str(e).lower():
        # Fallback to CPU
        config = OllamaServerConfig(compute_unit="cpu")

    try:
        manager = OllamaManager(config)
        manager.start()
    except ResourceError:
        print("Unable to allocate required resources")

ConfigurationError

Occurs when provided configuration values are invalid.

Common causes:

  • Invalid memory format
  • Invalid GPU configuration
  • Incompatible settings
  • Out-of-range values

How to handle:

try:
    config = OllamaServerConfig(
        gpu_memory_fraction=1.5  # Invalid value
    )
except ConfigurationError as e:
    # Use safe default values
    config = OllamaServerConfig(
        gpu_memory_fraction=0.7,
        gpu_layers=32
    )

UnsupportedPlatformError

Occurs when running on an unsupported platform or configuration.

Common causes:

  • Unsupported operating system
  • Missing system features
  • Incompatible hardware

How to handle:

try:
    manager = OllamaManager()
    manager.start()
except UnsupportedPlatformError as e:
    # Fall back to alternative configuration or inform user
    print(f"Platform not supported: {e}")
    print("Supported platforms: Windows, macOS, Linux")

A related startup fallback: when the server fails to start because the configured memory limit cannot be applied (for example, a GOMEMLIMIT error), retry with a lower limit:

def start_with_memory_fallback():
    try:
        # Try optimal memory configuration
        config = OllamaServerConfig(memory_limit="16GiB")
        manager = OllamaManager(config)
        manager.start()
        return manager
    except ServerStartupError as e:
        if "GOMEMLIMIT" in str(e):
            # Fall back to a lower memory configuration
            config = OllamaServerConfig(memory_limit="8GiB")
            manager = OllamaManager(config)
            manager.start()
            return manager
        raise  # Re-raise if the error is not memory-related

Best Practices for Error Handling

  1. Graceful Degradation

    def start_with_fallback():
        # Try optimal configuration first
        config = OllamaServerConfig(
            gpu_layers=35,
            gpu_memory_fraction=0.8
        )
    
        try:
            manager = OllamaManager(config)
            manager.start()
            return manager
        except ResourceError:
            # Fall back to minimal configuration
            config = OllamaServerConfig(
                gpu_layers=24,
                gpu_memory_fraction=0.5
            )
            manager = OllamaManager(config)
            manager.start()
            return manager
    

  2. Logging and Monitoring

    import logging
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)
    
    try:
        manager = OllamaManager()
        manager.start()
    except ServerStartupError as e:
        logger.error(f"Server startup failed: {e}")
        logger.debug(f"Full error details: {e.original_exception}")
    

  3. Recovery Strategies

    import time

    def start_with_retry(max_attempts=3, delay=5):
        for attempt in range(max_attempts):
            try:
                manager = OllamaManager()
                manager.start()
                return manager
            except (ServerStartupError, ServerTimeoutError) as e:
                if attempt == max_attempts - 1:
                    raise
                logger.warning(f"Attempt {attempt + 1} failed, retrying...")
                time.sleep(delay)
    

Remember to always clean up resources properly, even when handling errors:

manager = None
try:
    manager = OllamaManager()
    manager.start()
    # Use manager...
except Exception as e:
    logger.error(f"Error occurred: {e}")
finally:
    if manager is not None:
        manager.stop()

Performance Monitoring

import psutil
from contextlib import contextmanager
import time

@contextmanager
def monitor_performance():
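    # Note: cpu_percent() measures usage since its previous call, so
    # treat these readings as rough indicators rather than exact deltas.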
    start_time = time.time()
    start_cpu = psutil.cpu_percent()
    start_mem = psutil.virtual_memory().percent

    yield

    duration = time.time() - start_time
    cpu_diff = psutil.cpu_percent() - start_cpu
    mem_diff = psutil.virtual_memory().percent - start_mem

    print(f"Duration: {duration:.2f}s")
    print(f"CPU impact: {cpu_diff:+.1f}%")
    print(f"Memory impact: {mem_diff:+.1f}%")

# Use with Ollama Manager
with monitor_performance():
    with OllamaManager() as manager:
        client = ClientAI('ollama')
        response = client.generate_text(
            "Write a story",
            model="llama2"
        )

Best Practices

  1. Always use context managers when possible
  2. Start with conservative resource settings and adjust up
  3. Monitor system resources during development
  4. Implement proper error handling
  5. Use appropriate configurations for development vs. production
  6. Clean up resources properly in all code paths
  7. Log important events for troubleshooting
  8. Test different configurations to find optimal settings
  9. Consider platform-specific settings for cross-platform applications
  10. Implement graceful degradation when resources are constrained

Troubleshooting

  • Server won't start: Check that Ollama is installed and that the port is available (see the port-check sketch after this list)
  • Performance issues: Monitor and adjust resource configuration
  • Memory errors: Reduce memory_limit and gpu_memory_fraction
  • GPU errors: Try reducing gpu_layers or switch to CPU
  • Timeout errors: Increase timeout value in configuration
  • Platform-specific issues: Check platform support and requirements
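
For the first item, a quick way to check whether the default port is already taken (a minimal standard-library sketch):

import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

if port_in_use(11434):
    print("Port 11434 is taken; pass a different port via OllamaServerConfig.")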

Remember that Ollama Manager is a powerful tool for prototyping and development, but always monitor system resources and adjust configurations based on your specific needs and hardware capabilities.