Skip to content
ADHDecode
  1. Home
  2. Articles
  3. Ollama

Ollama Articles

51 articles

Ollama on Raspberry Pi: Edge LLM Inference

Running large language models LLMs directly on a Raspberry Pi might seem like a pipe dream, but Ollama makes it a surprisingly capable reality for edge .

2 min read

Ollama Resource Limits: Cap Memory and Loaded Models

Ollama doesn't actually have built-in, configurable resource limits for memory or loaded models in the way you might expect from a traditional applicati.

4 min read

Ollama REST API: Call Local LLMs from Python

Ollama’s REST API is actually a surprisingly powerful tool for integrating local Large Language Models LLMs into your Python applications, often bypassi.

2 min read

Ollama Nginx Proxy: Expose API with HTTPS

Ollama’s API is incredibly easy to expose securely to the outside world with Nginx, but the magic that makes it work is that Nginx is not simply forward.

2 min read

Ollama AMD GPU: ROCm Acceleration Setup

ROCm doesn't actually use your AMD GPU for inference unless you specifically tell it to, even if you have a perfectly compatible card.

4 min read

Ollama Models: Run Llama, Mistral, Gemma Locally

You can run sophisticated large language models like Llama, Mistral, and Gemma directly on your own hardware, bypassing the need for cloud APIs and thei.

2 min read

Ollama Install: Set Up Local LLMs on Any Platform

Ollama doesn't just run LLMs; it makes them feel like any other local application you'd install, just with exponentially more parameters.

3 min read

Ollama Streaming: Stream Tokens from Local LLMs

Ollama Streaming: Stream Tokens from Local LLMs — practical guide covering ollama setup, configuration, and troubleshooting with real-world examples.

3 min read

Ollama Structured Output: Enforce JSON Response Format

Ollama's structured output feature doesn't actually enforce JSON; it merely requests it, and the model might still hallucinate non-JSON data.

3 min read

Ollama Vision Models: LLaVA Image Analysis Guide

LLaVA models can analyze images by breaking them down into a grid of patches, embedding each patch, and then using a vision transformer to process these.

3 min read

Ollama vs LM Studio vs Jan: Local LLM Tools Compared

Ollama, LM Studio, and Jan aren't just GUIs for running LLMs; they're fundamentally different philosophies on how you should interact with local artific.

3 min read

Ollama Windows WSL2: GPU-Accelerated Local LLMs

The most surprising thing about running LLMs locally with Ollama on Windows WSL2 is how easily you can bypass Windows' own GPU driver stack for a signif.

2 min read

Fix Ollama Context Length Exceeds Maximum for Model

Ollama's contextlength setting is failing because the prompt you're sending is longer than the model's actual maximum context window.

4 min read

Ollama Batch Inference: Handle Parallel LLM Requests

Ollama's batch inference capability doesn't just speed up your LLM requests; it fundamentally changes how you think about parallel processing by intelli.

3 min read

Ollama CLI Cheatsheet: Every Command You Need

Ollama CLI is more than just a way to download and run LLMs; it’s a surprisingly powerful tool for managing your local AI experiments.

3 min read

Ollama Code Generation: CodeLlama and Qwen2.5

Ollama Code Generation: CodeLlama and Qwen2.5 — CodeLlama and Qwen2.5 are both powerful open-source LLMs fine-tuned for code generation, and Ollama .

3 min read

Ollama Context Window: Extend Token Limit for Models

Ollama's context window isn't a hard limit you can just "extend" with a single flag; it's a fundamental architectural constraint of the model itself, de.

3 min read

Ollama vs Cloud APIs: Cost Comparison at Scale

Running large language models locally with Ollama can be significantly cheaper than using cloud APIs like OpenAI's or Anthropic's when you're processing.

3 min read

Ollama Custom Prompts: Write Modelfile System Templates

The most surprising thing about Ollama's Modelfile system templates is that they don't actually "template" anything in the way you'd expect from a progr.

3 min read

Ollama Docker: Deploy Local LLMs in Containers

Ollama is a tool that lets you run large language models LLMs locally on your own machine, and Docker is a way to package and run applications in isolat.

3 min read

Ollama Embeddings: nomic-embed and mxbai for RAG

The most surprising thing about embedding models for RAG is how much they don't care about sentence structure, prioritizing instead the sheer semantic d.

3 min read

Ollama GGUF Import: Load Fine-Tuned Models Locally

Ollama GGUF Import: Load Fine-Tuned Models Locally — practical guide covering ollama setup, configuration, and troubleshooting with real-world examples.

2 min read

Ollama Function Calling: Tool Use in Local LLMs

Function calling in local LLMs, particularly with Ollama, isn't about the LLM executing code; it's about the LLM describing what code it wants to execut.

3 min read

Ollama CUDA: Enable GPU Acceleration on NVIDIA

Ollama, when properly configured, uses your NVIDIA GPU for massive speedups on AI model inference, but sometimes it just doesn't seem to be picking it u.

3 min read

Ollama GPU/CPU Hybrid: Offload Layers Across Devices

The surprising truth about Ollama's GPU/CPU hybrid mode is that it's not about splitting a single model's layers between devices, but rather about strat.

3 min read

Ollama HuggingFace: Convert and Import GGUF Models

Ollama's ability to import Hugging Face GGUF models is a game-changer for running large language models locally, but it's not as simple as just pointing.

2 min read

Ollama Keep-Alive: Preload Models to Eliminate Delays

Preloading models into Ollama's memory isn't about "keeping them alive" in the traditional sense; it's about shifting the compute cost from your interac.

11 min read

Ollama on Kubernetes: Production LLM Serving Setup

The most surprising thing about serving LLMs with Ollama on Kubernetes is how aggressively it fights against the very infrastructure designed to manage .

3 min read

Ollama + LangChain: Build Local RAG Applications

The most surprising thing about building local RAG applications with Ollama and LangChain is how little infrastructure you actually need to get started.

3 min read

Ollama Latency: Optimize Time-to-First-Token

The most surprising thing about Ollama latency is that the bottleneck is almost never the LLM itself; it's usually the I/O and network stack sitting bet.

2 min read

Ollama + LlamaIndex: Build Local RAG Pipelines

LlamaIndex doesn't actually index your data; it indexes representations of your data that are designed for efficient retrieval.

2 min read

Ollama Load Balancing: Distribute Requests Across Instances

Ollama Load Balancing: Distribute Requests Across Instances — practical guide covering ollama setup, configuration, and troubleshooting with real-world ...

4 min read

Ollama Debug Logging: Verbose Mode for Troubleshooting

Ollama's verbose logging mode doesn't just give you more output; it fundamentally changes how the system perceives and reports on its own internal state.

2 min read

Ollama RAM Requirements: Memory for Every Model Size

Ollama models don't just use RAM; they are RAM for all intents and purposes, meaning their entire weight needs to be loaded into memory before they can .

3 min read

Ollama Apple Silicon: Metal GPU Acceleration on Mac

Metal GPU acceleration on macOS with Ollama is the primary mechanism that allows your M-series Mac to run large language models at speeds that feel almo.

2 min read

Ollama Context Length: Configure num_ctx for Models

Ollama's numctx parameter doesn't just change how much text a model can "remember"; it fundamentally alters the transformer's attention window, impactin.

3 min read

Ollama Model Quantization: Q4, Q5, Q8 Compared

Quantization isn't about making models smaller in the sense of fewer parameters; it's about reducing the precision of the weights, which dramatically sh.

3 min read

Ollama Modelfile: Create Custom Models with Templates

You can build truly custom AI models with Ollama by using Modelfiles, and the most powerful feature is their templating system.

3 min read

Ollama Multi-Model: Serve Multiple Models Concurrently

You can actually serve multiple Ollama models on the same machine simultaneously, and it's much less of a resource hog than you'd think, because Ollama .

2 min read

Ollama Multimodal: Analyze Documents with Vision Models

Ollama Multimodal: Analyze Documents with Vision Models — practical guide covering ollama setup, configuration, and troubleshooting with real-world exam...

2 min read

Ollama NUMA: Optimize Inference on Multi-CPU Systems

Ollama NUMA: Optimize Inference on Multi-CPU Systems — practical guide covering ollama setup, configuration, and troubleshooting with real-world examples.

5 min read

Ollama.js: Integrate Local LLMs in Node.js Apps

Ollama.js: Integrate Local LLMs in Node.js Apps — Ollama.js is a Node.js library that lets you run large language models LLMs locally on your machine .

3 min read

Ollama OpenAI API: Drop-In Replacement for OpenAI

Ollama's OpenAI API compatibility means you can run large language models locally and swap them in for OpenAI's cloud-based services with minimal code c.

2 min read

Ollama + Open WebUI: Chat Interface for Local LLMs

Open WebUI can run locally and serve as a slick chat interface for your Ollama-hosted LLMs, letting you interact with models like Llama 3 or Mistral wit.

3 min read

Ollama Performance: Benchmark Tokens Per Second

Ollama doesn't actually measure "tokens per second" as its primary performance metric, which is why benchmarks can be misleading.

3 min read

Ollama Small Models: Phi-3 and Qwen2.5 Compared

The most surprising thing about small language models like Phi-3 and Qwen2. 5 is how they manage to punch so far above their weight class, often approac.

3 min read

Ollama Air-Gap: Deploy LLMs in Private Networks

The most surprising thing about deploying LLMs in air-gapped environments is how little the core LLM technology changes; it's the delivery mechanism tha.

3 min read

Ollama Production: Architecture for High Availability

Ollama doesn't actually have a concept of "production" or "high availability" as a built-in feature; it's designed as a local development tool.

3 min read

Ollama Prometheus Metrics: Monitor LLM Serving

Ollama's Prometheus metrics are surprisingly stateless, focusing on the current state and ephemeral request details rather than historical trends.

3 min read

Ollama Proxy Auth: Secure API with Authentication

Ollama, the slick local LLM runner, doesn't come with built-in authentication for its API, leaving your local models wide open if exposed.

3 min read

Ollama Model Management: Pull, List, Delete Models

Ollama's model management is surprisingly flexible, letting you treat large language models like simple packages on your local machine.

3 min read
ADHDecode

Complex topics, finally made simple

Courses

  • Networking
  • Databases
  • Linux
  • Distributed Systems
  • Containers & Kubernetes
  • System Design
  • All Courses →

Resources

  • Cheatsheets
  • Debugging
  • Articles
  • About
  • Privacy
  • Sitemap

Connect

  • Twitter (opens in new tab)
  • GitHub (opens in new tab)

Built for curious minds. Free forever.

© 2026 ADHDecode. All content is free.

  • Home
  • Learn
  • Courses
Esc
Start typing to search all courses...
See all results →
↑↓ navigate Enter open Esc close