Running Local with Ollama
a guided introduction by Prospero, your majordomo
Welcome. You must be the one who wants to run the whole show from home — no cloud accounts, no monthly invoices, no data leaving the premises. A person of discernment. I'm Prospero, and it's my job to make sure everything in this house runs on time and to specification. I'll be introducing you to the people who'll get you sorted.
But first: a word about what you're walking into.
Quilltap connects to AI models through connection profiles — named configurations that pair a provider with a model and a set of tuning parameters. For cloud providers, those profiles also reference stored API keys. For you, running Ollama, the arrangement is simpler. No keys. No credentials. Just a local service and the models you've pulled down to your own machine.
Nothing leaves your house. There are no per-message costs. Once you've downloaded a model, the whole operation works offline.
Here's the order of introductions:
1. Getting Ollama running — before you meet anyone here, you need the engine on your end
2. The Foundryman — he'll wire you up to your local models
3. The Commonplace Book's Librarian — she'll give your characters a memory
4. The Salon — where you'll actually hold conversations
5. The Lantern — where images come from (with a caveat for local-only guests)
Shall we?
Before You Arrive: Installing Ollama
Before you step through our doors, you'll need Ollama running on your machine. Think of it as the engine room in your own basement — Quilltap will reach down to it whenever it needs a model to think.
Install
Download and install Ollama from ollama.com. It's available for macOS, Linux, and Windows. The installer handles everything.
Pull Your Models
Open a terminal and download at least one chat model. You'll also want an embedding model — more on that when you meet our Librarian.
```shell
# Chat models — pick one to start with
ollama pull llama3.2            # Good all-rounder, ~2GB
ollama pull mistral             # Strong at following instructions, ~4GB
ollama pull gemma3              # Google's compact model, ~3–5GB

# Embedding model — for semantic memory search
ollama pull nomic-embed-text    # Small, fast, purpose-built for embeddings
```

Verify It's Running
Ollama starts automatically after installation and serves on http://localhost:11434. Open that address in your browser. If you see "Ollama is running", you're ready. If you prefer the terminal:

```shell
curl http://localhost:11434/api/tags
```

That returns a list of every model you've pulled. If you get a connection error, Ollama isn't running — restart it from your applications menu or run ollama serve in a terminal.
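For reference, the /api/tags response is JSON with a models array. Here's one quick way to pull out just the names, run against an abridged sample response (the real one also carries size, digest, and modification time for each model):

```shell
# Abridged sample of the /api/tags response shape
resp='{"models":[{"name":"llama3.2:latest"},{"name":"nomic-embed-text:latest"}]}'

# Print one model name per line
names="$(echo "$resp" | python3 -c 'import json,sys
for m in json.load(sys.stdin)["models"]: print(m["name"])')"
echo "$names"
```

Against a live server, you'd pipe the output of curl -s http://localhost:11434/api/tags into the same one-liner.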
Hardware Notes
- RAM: 8GB minimum. 16GB or more is better, especially for larger models.
- GPU: A dedicated GPU with 6GB+ VRAM will stream responses significantly faster than CPU-only inference, but CPU works.
- Storage: Each model is roughly 2–8GB depending on parameter count and quantization.
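A back-of-envelope way to check whether a model will fit, assuming weights-only size ≈ parameter count × bits per weight (the KV cache and runtime overhead add more on top of this):

```shell
# Weights-only size in GB: params (in billions) × bits per weight ÷ 8
model_gb() { awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 }'; }

model_gb 8 4    # an 8B model at 4-bit quantization: ~4 GB of weights
model_gb 70 4   # a 70B model at 4-bit: ~35 GB — beyond an 8–16GB machine
```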
Meet the Foundryman
The Foundry (/foundry/forge)
The Foundryman is the one who connects everything to everything. Grease on his sleeves, a wrench in his back pocket, and an encyclopedic knowledge of who needs to talk to whom. For cloud providers, he'd have you file your API keys first — credentials, labels, the whole bureaucracy. But you're running Ollama.
You don't need an API key. Walk right past the API Keys section. The Foundryman will tip his cap and point you straight to Connection Profiles.
Creating Your Connection Profile
This is your main link between Quilltap and your local models.
1. Navigate to The Foundry → The Forge (click the Foundry icon in the left sidebar footer).
2. Find the Connection Profiles section.
3. Click Add Connection Profile.
4. Fill in the form:
   - Profile Name — Something you'll recognize. "Local Llama," "Mistral at Home," "The House Model" — dealer's choice.
   - Provider — Select Ollama.
   - API Key — This field doesn't apply. Ollama doesn't need one.
   - Model — Click Fetch Models to populate the dropdown with every model you've pulled. If the list is empty, Ollama either isn't running or hasn't finished loading.
   - Base URL — Defaults to http://localhost:11434 for Ollama. Change this if you've moved Ollama to a different port or machine.
5. Optionally, adjust the advanced settings:
   - Temperature — Controls randomness. Lower values (0.0–0.3) give more focused, consistent output. Higher values (0.7–1.0+) make the model more creative and unpredictable. Start around 0.7 for creative writing; drop to 0.2–0.3 for factual or structured tasks.
   - Max Tokens — Caps how long a response can be. Leave it at the default unless you have a reason to limit it.
   - Top P — An alternative knob for output diversity. Most people adjust Temperature and leave this alone.
6. Click Save.
7. Click Test Connection. The Foundryman sends a quick request to your Ollama instance and reports back: ✓ Healthy, ⚠ Degraded, or ✗ Unhealthy. You want that checkmark.
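The Base URL field is also how you reach an Ollama running on another machine. A sketch of that arrangement, with a placeholder LAN address — substitute your own:

```shell
# On the machine hosting Ollama: listen on all interfaces, not just loopback
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Then, in Quilltap's Base URL field, point at that machine instead of localhost:
#   http://192.168.1.50:11434
```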
Make It the Default
Click Set as Default on your new profile. This pre-selects it whenever you start a new chat, so you don't have to pick it from a dropdown every time. You can always override the default on a per-chat or per-character basis later.
One Profile or Two?
If you've pulled multiple models — say, a large one for serious conversations and a small one for quick tasks — you can create a profile for each. The Foundryman is happy to manage as many as you like. This will matter in a moment when we talk about background tasks.
Meet the Librarian
The Commonplace Book (/foundry/commonplace-book)
Down the hall from the Foundry, through a door that smells faintly of old paper and binding glue, you'll find the Commonplace Book — and its keeper, the Librarian. She manages your characters' memories: the facts, impressions, and details that accumulate over the course of your conversations. When a character remembers that your protagonist has a scar above her left eye or that the treaty was signed in the autumn — that's the Librarian's doing.
Memories are found by embeddings — mathematical representations of meaning that let the system search by concept rather than exact keywords. A search for "cat" also surfaces memories mentioning "feline" or "kitten." Without embeddings, memory search falls back to simple keyword matching, which works but misses connections.
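Under the hood, that concept-matching boils down to comparing vectors by angle. Here's a minimal sketch of cosine similarity using made-up three-dimensional vectors — real embeddings run to hundreds of dimensions, but the arithmetic is the same:

```shell
# Cosine similarity between two space-separated vectors, via awk
cos() { awk -v a="$1" -v b="$2" 'BEGIN {
  n = split(a, x, " "); split(b, y, " ")
  for (i = 1; i <= n; i++) { dot += x[i]*y[i]; na += x[i]*x[i]; nb += y[i]*y[i] }
  printf "%.2f\n", dot / (sqrt(na) * sqrt(nb))
}'; }

# Toy "embeddings": similar meanings get similar vectors
cos "0.9 0.1 0.0" "0.8 0.2 0.1"   # cat vs. feline — close in meaning, high score
cos "0.9 0.1 0.0" "0.0 0.1 0.9"   # cat vs. treaty — unrelated, near zero
```

A memory search just embeds your query, computes this score against every stored memory, and returns the top scorers.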
You have two options here, and both are free.
Option A: The Built-in System (Zero Configuration)
Quilltap ships with a built-in TF-IDF embedding system that works without any external service. Check The Commonplace Book — if you see a profile named "Built-in TF-IDF" marked as default, you're already covered. It was set up on first run.
If it's not there for some reason:
1. In The Commonplace Book, click Add Profile.
2. Select BUILTIN as the provider.
3. Name it "Built-in TF-IDF."
4. Click Save, then Set as Default.
No API key. No external service. It works offline. It's good enough for most use cases, and you can upgrade anytime.
Option B: Ollama Embeddings (Better Semantic Search)
If you pulled nomic-embed-text earlier, you can use it for higher-quality semantic
search — still local, still free, just smarter about meaning.
1. In The Commonplace Book, click Add Profile.
2. Select Ollama as the provider.
3. Set the Base URL to http://localhost:11434 (or wherever your Ollama is serving).
4. Select nomic-embed-text as the model.
5. Click Save, then Set as Default.
The Librarian will use this for all memory retrieval going forward. If you ever stop running Ollama, she gracefully falls back to keyword search — nobody panics.
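If you want to sanity-check the embedding model outside Quilltap first, Ollama exposes an embeddings endpoint you can poke directly (the comment describes the general response shape, not exact output):

```shell
curl -s http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "a cat on the windowsill"}'
# The response is JSON with an "embedding" field: a long array of floats
```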
The Salon: Background Tasks and the Cheap LLM
The Salon (/foundry/salon)
The Salon is where your conversations actually happen — but behind the velvet curtains, a great deal of machinery keeps the experience running smoothly. Memory extraction, context compression, chat titling, image prompt expansion, housekeeping — all of these need a model to do their work. Quilltap calls this the Cheap LLM, because for cloud users, you'd want to assign a bargain-bin model to these tasks instead of burning through your expensive one.
You're running Ollama. Every model is free. But the Cheap LLM still matters, because without it enabled, some of these background tasks may not run at all.
Setting It Up
You can point the Cheap LLM at the same profile you already created, or — if you pulled a second, smaller model — at a dedicated lightweight profile. Either way:
1. Navigate to The Foundry → The Salon.
2. Find the Cheap LLM section.
3. Toggle Enable Cheap LLM to on.
4. Select a connection profile from the dropdown. For a single-model Ollama setup, just pick your main profile. If you created a second one for a smaller model, pick that.
What It Powers
Once enabled, the Cheap LLM handles:
- Memory extraction — pulling key facts from conversations to store as character or user memories
- Context compression — summarizing older parts of a conversation to fit within the model's context window
- Chat auto-rename — generating a title for new conversations instead of leaving them as "New Chat"
- Prompt expansion — enriching your image generation prompts with character and persona details before they're sent to the image provider
- Housekeeping — regenerating tags, placeholders, and summaries during maintenance
- Dangermouse classification — the content safety gatekeeper uses the Cheap LLM to classify messages, when enabled (more on Dangermouse another time — you'll know him when you see the trenchcoat)
- Image description — describing images that appear in your chats
The Practical Advice
If you have the hardware for it, running a smaller model (like llama3.2 or phi3)
for background tasks while keeping a larger model for conversation is a good split. The background tasks
don't need brilliance — they need speed and reliability. But if you're running a single model,
that's fine too. Just turn it on.
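If you do want that split, pulling a second, smaller model is one command each — the tags below are examples of small models Ollama offers; any compact chat model works:

```shell
ollama pull llama3.2:3b   # 3B variant — quick and light on RAM
ollama pull phi3          # Microsoft's small model, another good fit
```

Then create a second connection profile for it in The Forge and point the Cheap LLM at that profile.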
A Word from the Lantern
The Lantern (/foundry/lantern)
You'll hear him before you see him — goggles pushed up on his forehead, scarf trailing behind, practically vibrating with enthusiasm about the latest image he's projected. The Lantern handles image generation in Quilltap: portraits, atmospheric story backgrounds generated on the fly, anything visual.
Here's the honest truth for a local-only setup: The Lantern needs a cloud provider. Image generation in Quilltap currently works through Google Gemini (Imagen 4), Grok, OpenAI (DALL-E), or OpenRouter — all of which require an API key and an internet connection.
If you're strictly local, The Lantern will have to wait. He'll be disappointed — he's always disappointed when he can't show you something — but he'll survive. Everything else in Quilltap works without him.
If you're mostly local but willing to add one cloud connection just for images, The Lantern's setup is straightforward:
1. Get an API key from one of the supported image providers (Google, Grok, OpenAI, or OpenRouter).
2. Add that key in The Foundry → The Forge → API Keys.
3. Navigate to The Foundry → The Lantern.
4. Click Add Image Profile, select the provider, key, and model.
5. Configure quality, style, and safety settings to taste.
6. Save.
But that's entirely optional, and it's the one place where your all-local setup touches the outside world. Your call.
The Tour in Summary
Here's where everything lives, at a glance:
| Who You're Meeting | Where They Are | What They Need from You |
|---|---|---|
| The Foundryman | The Forge → Connection Profiles | Create a profile: Provider = Ollama, pick your model |
| The Librarian | The Commonplace Book | Built-in TF-IDF works automatically — or add an Ollama embedding profile with nomic-embed-text |
| The Salon | The Salon → Cheap LLM | Enable the Cheap LLM toggle and point it at a connection profile |
| The Lantern | The Lantern | Cloud-only for now — skip this or add one cloud API key for images |
| Prospero | Prospero | I'm already here. Check the Tasks Queue and LLM Logs if anything seems off. |
When Things Go Sideways
Even the best-run house has its plumbing emergencies. Here are the ones most likely to affect you.
"No models available" when creating a profile
Ollama isn't running, or it hasn't finished loading. Open http://localhost:11434 in
your browser — if you don't see "Ollama is running," restart it. Then come back and
click Fetch Models again.
"No connection profile configured"
You've created a profile but haven't set it as the default, and the character you're chatting with doesn't have one assigned either. Go back to the Foundryman and either set a default or assign the profile to your character.
Streaming stops mid-response
This usually means your machine is running out of VRAM or RAM. Try a smaller model. If you pulled
a 70B model (say, llama3.1:70b) on a machine with 8GB of RAM, the math doesn't work — step
down to an 8B or 3B variant.
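To see what's actually loaded and how much memory it's taking, Ollama has a ps subcommand:

```shell
ollama ps   # lists loaded models, their memory footprint, and CPU/GPU placement
```

If your conversation model shows up as mostly CPU-resident, it didn't fit in VRAM — that's your cue to step down a size.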
Memory search returns poor results
If you're on the built-in TF-IDF and results feel disconnected, upgrade to the Ollama embedding profile
with nomic-embed-text. If you've already done that and results are still off, visit
The Commonplace Book and try refreshing the vocabulary or
re-indexing memories.
Cheap LLM tasks failing
Test the profile you designated for Cheap LLM tasks by using it in a regular chat first. If it works there but not in the background, make sure Ollama hasn't been shut down or put to sleep since you last checked.
Everything was working and now nothing is
Ollama may have stopped after a reboot or a system update. Check if it's running. On macOS and Windows,
it usually starts automatically; on Linux, you may need to run ollama serve manually or
set it up as a system service.
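On Linux, the official installer normally registers a systemd service. Assuming systemd and a service named ollama, something like this checks on it and brings it back for good:

```shell
# See whether the service exists and is currently running
systemctl status ollama

# Start it now, and have it start again on every boot
sudo systemctl enable --now ollama
```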
What's Next
You're connected. Your models are local. Your memories have a home. The background machinery is turning.
From here, you might want to meet Aurora — the character system that gives depth and continuity to the people you're talking to. Or explore Calliope's themes to make the place look the way you want it to look. Or find out what exactly Dangermouse does in that trenchcoat of his.
But those are introductions for another day. For now, go start a conversation. The Salon is open.
— Prospero