Bulk product descriptions and tags, generated fully offline — with the cost story to prove it’s worth it.
Impact: 100% schema-valid (mistral)
A local small-model service that turns messy product rows into clean, schema-valid descriptions and tags. Runs on Ollama with no API bill, returns validated JSON, and ships with a real benchmark across three models and quantization levels.
The problem
Generating copy for tens of thousands of SKUs through a frontier API is expensive and slow, and you’re shipping your catalog to a third party. A local model is only useful if its output is structured, deterministic enough to trust, and fast enough to matter — and if you can prove where the break-even sits.
The approach
Wrap a local model in a service that enforces a JSON schema with Pydantic, reprompts once on invalid output, then fails gracefully. The standout is the measurement: three models benchmarked on the same hardware, a quality-vs-speed quantization study, and an honest “local beats API past N products” break-even.
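The validate, reprompt once, then fail gracefully loop can be sketched as below. This is a minimal illustration, not the project's actual code: it assumes Pydantic v2, the field names on `ProductCopy` are guessed from the "description · tags · SEO" output, and `call_model` is a hypothetical callable standing in for whatever sends a prompt to the local model and returns its raw text.

```python
from typing import Callable, Optional

from pydantic import BaseModel, Field, ValidationError


class ProductCopy(BaseModel):
    # Assumed fields; the real ProductCopy schema may differ.
    description: str = Field(min_length=1)
    tags: list[str]
    seo_title: str


def generate_copy(
    prompt: str,
    call_model: Callable[[str], str],
    max_reprompts: int = 1,
) -> Optional[ProductCopy]:
    """Return validated copy, or None if output is still invalid after the reprompt."""
    current_prompt = prompt
    for _ in range(max_reprompts + 1):
        raw = call_model(current_prompt)
        try:
            # Parses the JSON and checks it against the schema in one step.
            return ProductCopy.model_validate_json(raw)
        except ValidationError as err:
            # Feed the validation error back so the model can correct itself.
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid:\n{err}\n"
                "Reply with JSON only, matching the schema."
            )
    # Graceful failure: the caller can log the row and move on.
    return None
```

Returning `None` rather than raising keeps a bulk run going when a single row fails; a caller that prefers a hard stop could raise instead.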
Architecture
IN: Product row (title · specs · category)
↓
PROCESS: Versioned prompt (JSON schema in context)
↓
MODEL: Local SLM · Ollama (mistral 7B Q4, chosen of 3)
↓
GATE: Validate (Pydantic ProductCopy schema)
↓
OUT: Structured copy (description · tags · SEO)
Cross-cutting
LOOP: Retry on invalid (reprompt once, then fail gracefully)
STORE: Benchmark + cost study (3 models · tok/s · quality · break-even)
DEPLOY: Deployed (Ollama on Modal T4 · model baked in a Volume)
- Fully offline at inference: no API bill, and the catalog never leaves the machine.
- Chose mistral 7B (Q4) of three: 100% schema-valid, best grounding, ~52 tok/s. phi3 was faster but produced flaky JSON. Temperature is set to 0 for reproducible catalog copy.
- Every output is Pydantic-validated; invalid JSON triggers exactly one reprompt, then fails gracefully.
- The real deliverable is the benchmark + cost study: where local on owned hardware beats a frontier API, with numbers.
- Deployed on a Modal T4 with Ollama (model baked into a Volume, scale-to-zero) so the portfolio's “Run live” works in the cloud, not just on a laptop.
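The break-even idea reduces to simple arithmetic: local inference carries a fixed cost plus a small per-product cost, while an API charges per product only. A minimal sketch under that assumed cost model follows; the function name and every parameter are illustrative, and no figures from the actual cost study are reproduced here.

```python
import math
from typing import Optional


def break_even_products(
    fixed_local_cost: float,
    local_cost_per_product: float,
    api_cost_per_product: float,
) -> Optional[int]:
    """Smallest product count at which local total cost drops below API total cost.

    Assumed model: local = fixed + n * local_per_product, API = n * api_per_product.
    Returns None when the API is never more expensive per product.
    """
    saving_per_product = api_cost_per_product - local_cost_per_product
    if saving_per_product <= 0:
        return None
    # Local wins once n * saving exceeds the fixed cost, i.e. n > fixed / saving.
    return math.floor(fixed_local_cost / saving_per_product) + 1
```

Plugging in measured per-product costs (electricity and amortized hardware on the local side, token pricing on the API side) gives the N in “local beats API past N products”.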
How it was built
Phase 1: Get it running + measure
- ✓ Ollama + a 3–7B model
- ✓ FastAPI wrapper for descriptions & tags
- ✓ Benchmark tokens/sec, TTFT, latency
Phase 2: Structure + determinism
- ✓ JSON schema enforced with Pydantic
- ✓ Retry-on-invalid, then fail gracefully
- ✓ Temperature study: 0 vs 0.7
Phase 3: Model comparison
- ✓ 3 models on identical hardware/prompts
- ✓ Memory, tokens/sec, output quality
- ✓ GGUF Q4/Q5 + local-vs-API break-even
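One way the tokens/sec and latency numbers could be collected is from the timing fields Ollama returns with a non-streaming `/api/generate` response (`eval_count`, `eval_duration`, and `total_duration`, with durations in nanoseconds). The sketch below is an assumption about how such a benchmark might look, not the project's harness: the helper names and the `mistral` model tag are illustrative, and TTFT is left out because it needs a streaming request.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
NS_PER_SECOND = 1e9


def build_request(prompt: str, model: str = "mistral") -> dict:
    """Request body: JSON-only output, temperature 0 for reproducible copy."""
    return {
        "model": model,  # assumed tag; use whichever model is pulled locally
        "prompt": prompt,
        "stream": False,
        "format": "json",
        "options": {"temperature": 0},
    }


def throughput(response: dict) -> dict:
    """Derive generation speed and end-to-end latency from Ollama's timing fields."""
    return {
        "tokens_per_sec": response["eval_count"]
        / (response["eval_duration"] / NS_PER_SECOND),
        "latency_s": response["total_duration"] / NS_PER_SECOND,
    }


def run_once(prompt: str, model: str = "mistral") -> dict:
    """Send one prompt to a running Ollama server and return its metrics."""
    body = json.dumps(build_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return throughput(json.load(resp))
```

Running `run_once` over the same prompts for each model, on the same machine, gives comparable rows for a model-comparison table.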
Stack
Ollama · Llama 3.2 3B · Phi-3 · Mistral 7B · FastAPI · Pydantic · GGUF Q4/Q5