Ollama Articles

Ollama on Raspberry Pi: Edge LLM Inference

Running large language models (LLMs) directly on a Raspberry Pi might seem like a pipe dream, but Ollama makes it a surprisingly capable reality for edge .

2 min read

Ollama Resource Limits: Cap Memory and Loaded Models

Ollama doesn't actually have built-in, configurable resource limits for memory or loaded models in the way you might expect from a traditional applicati.

4 min read

Ollama REST API: Call Local LLMs from Python

Ollama’s REST API is a surprisingly powerful tool for integrating local large language models (LLMs) into your Python applications, often bypassi.

2 min read

Ollama Nginx Proxy: Expose API with HTTPS

Ollama’s API is incredibly easy to expose securely to the outside world with Nginx, but the magic that makes it work is that Nginx is not simply forward.

2 min read

Ollama AMD GPU: ROCm Acceleration Setup

ROCm doesn't actually use your AMD GPU for inference unless you specifically tell it to, even if you have a perfectly compatible card.

4 min read

Ollama Models: Run Llama, Mistral, Gemma Locally

You can run sophisticated large language models like Llama, Mistral, and Gemma directly on your own hardware, bypassing the need for cloud APIs and thei.

2 min read

Ollama Install: Set Up Local LLMs on Any Platform

Ollama doesn't just run LLMs; it makes them feel like any other local application you'd install, just with exponentially more parameters.

Ollama Streaming: Stream Tokens from Local LLMs

Ollama Structured Output: Enforce JSON Response Format

Ollama Vision Models: LLaVA Image Analysis Guide

Ollama vs LM Studio vs Jan: Local LLM Tools Compared

Ollama Windows WSL2: GPU-Accelerated Local LLMs

Fix Ollama Context Length Exceeds Maximum for Model

Ollama Batch Inference: Handle Parallel LLM Requests

Ollama CLI Cheatsheet: Every Command You Need

Ollama Code Generation: CodeLlama and Qwen2.5

Ollama Context Window: Extend Token Limit for Models

Ollama vs Cloud APIs: Cost Comparison at Scale

Ollama Custom Prompts: Write Modelfile System Templates

Ollama Docker: Deploy Local LLMs in Containers

Ollama Embeddings: nomic-embed and mxbai for RAG

Ollama GGUF Import: Load Fine-Tuned Models Locally

Ollama Function Calling: Tool Use in Local LLMs

Ollama CUDA: Enable GPU Acceleration on NVIDIA

Ollama GPU/CPU Hybrid: Offload Layers Across Devices

Ollama HuggingFace: Convert and Import GGUF Models

Ollama Keep-Alive: Preload Models to Eliminate Delays

Ollama on Kubernetes: Production LLM Serving Setup

Ollama + LangChain: Build Local RAG Applications

Ollama Latency: Optimize Time-to-First-Token

Ollama + LlamaIndex: Build Local RAG Pipelines

Ollama Load Balancing: Distribute Requests Across Instances

Ollama Debug Logging: Verbose Mode for Troubleshooting

Ollama RAM Requirements: Memory for Every Model Size

Ollama Apple Silicon: Metal GPU Acceleration on Mac

Ollama Context Length: Configure num_ctx for Models

Ollama Model Quantization: Q4, Q5, Q8 Compared

Ollama Modelfile: Create Custom Models with Templates

Ollama Multi-Model: Serve Multiple Models Concurrently

Ollama Multimodal: Analyze Documents with Vision Models

Ollama NUMA: Optimize Inference on Multi-CPU Systems

Ollama.js: Integrate Local LLMs in Node.js Apps

Ollama OpenAI API: Drop-In Replacement for OpenAI

Ollama + Open WebUI: Chat Interface for Local LLMs

Ollama Performance: Benchmark Tokens Per Second

Ollama Small Models: Phi-3 and Qwen2.5 Compared

Ollama Air-Gap: Deploy LLMs in Private Networks

Ollama Production: Architecture for High Availability

Ollama Prometheus Metrics: Monitor LLM Serving

Ollama Proxy Auth: Secure API with Authentication

Ollama Model Management: Pull, List, Delete Models