Local small models running in your browser

Local Copy Enricher

An offline benchmark of 3 small models on 15 products. Explore the comparison, the actual generated copy, the cost study, and determinism — all from committed results.

model	memory	tok/s	p50 latency	schema-valid	overall	grounding	seo
mistral (best)	5.1 GB	52.3	5.245s	100%	3.8	4.53	4
llama3.2:3b	2.8 GB	26.52	13.96s	100%	3.8	4.13	3.87
phi3	4.0 GB	75.27	3.796s	73%	3.45	3.82	3.36

The quality columns (overall, grounding, seo) are scores from an Anthropic LLM judge on a 1–5 scale. All runs were on a CPU-only Windows box (no GPU), so the throughput figures are conservative and noisy.
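For readers who want to reproduce the throughput and latency columns, here is a minimal sketch of how they could be aggregated. It assumes each run is stored as a dict holding Ollama's `eval_count`, `eval_duration`, and `total_duration` response fields (durations are in nanoseconds) plus a hypothetical `valid` flag; the record layout and the `summarize` helper are illustrative, not taken from the benchmark code, and the sample runs are made up.

```python
from statistics import mean, median

NS_PER_S = 1_000_000_000  # Ollama reports durations in nanoseconds


def summarize(runs):
    """Aggregate per-request records into table-style metrics."""
    # Generation throughput per run: tokens produced / seconds spent generating.
    tok_per_s = [r["eval_count"] / (r["eval_duration"] / NS_PER_S) for r in runs]
    # End-to-end latency per run, in seconds.
    latencies = [r["total_duration"] / NS_PER_S for r in runs]
    return {
        "tok_per_s": round(mean(tok_per_s), 2),
        "p50_s": round(median(latencies), 3),
        "valid_pct": round(100 * sum(r["valid"] for r in runs) / len(runs)),
    }


# Synthetic records, for illustration only.
runs = [
    {"eval_count": 100, "eval_duration": 2 * NS_PER_S, "total_duration": 3 * NS_PER_S, "valid": True},
    {"eval_count": 120, "eval_duration": 2 * NS_PER_S, "total_duration": 5 * NS_PER_S, "valid": True},
    {"eval_count": 90, "eval_duration": 3 * NS_PER_S, "total_duration": 4 * NS_PER_S, "valid": False},
]
print(summarize(runs))
```

Using the median rather than the mean for latency keeps one slow request on a noisy CPU box from dominating the number.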

Product descriptions and tags from a model on your own hardware. No API bill, and a benchmark that says which model to use.

Impact

100% schema-valid (mistral)

52 tok/s local · CPU

API cost · offline

A small service that turns raw product rows into clean descriptions, tags, and SEO fields with a local model through Ollama. Every response is validated JSON, and nothing leaves the machine. The real deliverable is the benchmark: three models on the same hardware, a quantization study, and a cost comparison against the API.

The problem

Running copy for tens of thousands of SKUs through a frontier API gets expensive, and it means sending your whole catalog to a third party. A local model avoids both, but only if its JSON is reliable enough to trust and it's fast enough to be worth it. The real question is where local actually beats the API.

The approach

The model runs behind a service that checks every response against a Pydantic schema, retries once on malformed JSON, then fails cleanly. Most of the work is measurement: three models (Llama 3.2 3B, Phi-3, Mistral 7B) on identical hardware and prompts, a Q4-vs-Q5 quantization comparison, and a cost model for where running locally pays off.
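The validate, retry once, then fail flow can be sketched as below. This is a minimal sketch under stated assumptions: the `ProductCopy` field names, the `EnrichmentError` type, the reprompt wording, and the `generate` callable (anything that maps a prompt string to raw model text) are all hypothetical stand-ins, not the project's actual code.

```python
import json
from typing import Callable, List

from pydantic import BaseModel, Field


class ProductCopy(BaseModel):
    # Field names are illustrative; the real schema may differ.
    description: str = Field(min_length=1)
    tags: List[str]
    seo_title: str = Field(min_length=1)
    seo_description: str = Field(min_length=1)


class EnrichmentError(RuntimeError):
    """Raised when the output is still invalid after the single retry."""


def parse_copy(raw: str) -> ProductCopy:
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return ProductCopy(**data)  # pydantic's ValidationError is also a ValueError


def enrich(prompt: str, generate: Callable[[str], str], max_retries: int = 1) -> ProductCopy:
    attempt_prompt = prompt
    last_error = None
    for _ in range(1 + max_retries):
        raw = generate(attempt_prompt)
        try:
            return parse_copy(raw)
        except ValueError as exc:
            last_error = exc
            # Reprompt with a short correction note appended to the original prompt.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({type(exc).__name__}). "
                "Return only a JSON object matching the schema."
            )
    raise EnrichmentError(f"invalid output after {1 + max_retries} attempts") from last_error
```

Raising a dedicated error after the retry lets the calling service return a clear failure for that row instead of passing unvalidated text downstream.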

Architecture

Product row (title · specs · category)

PROCESS: Versioned prompt (JSON schema in context)

MODEL: Local SLM · Ollama (mistral 7B Q4, chosen of 3)

GATE: Validate (Pydantic ProductCopy schema)

OUT: Structured copy (description · tags · SEO)

cross-cutting

LOOP: Retry on invalid (reprompt once, then fail gracefully)

STORE: Benchmark + cost study (3 models · tok/s · quality · break-even)

DEPLOY: Deployed (Ollama on Modal T4 · model baked in a Volume)

· Fully offline at inference: no API bill, and the catalog never leaves the machine.
· Chose mistral 7B (Q4) of the three: 100% schema-valid, best grounding, ~52 tok/s. phi3 was faster but its JSON was flaky (73% schema-valid). Temperature is set to 0 for reproducible catalog copy.
· Every output is Pydantic-validated; invalid JSON triggers exactly one reprompt, then fails gracefully.
· The real deliverable is the benchmark + cost study: where local on owned hardware beats a frontier API, with numbers.
· Deployed on a Modal T4 with Ollama (model baked into a Volume, scale-to-zero) so the portfolio's “Run live” works in the cloud, not just on a laptop.
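As a rough illustration of the prompt and model steps, the sketch below builds a request for Ollama's `/api/generate` endpoint with the JSON schema placed in the prompt, JSON output requested, and temperature 0. The endpoint URL is Ollama's default local address; `PROMPT_VERSION`, the prompt wording, and the `build_payload` and `generate` helpers are assumptions made for this example, not the project's code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
PROMPT_VERSION = "v1"  # hypothetical version tag for the prompt template


def build_payload(product: dict, schema: dict, model: str = "mistral") -> dict:
    """Build a non-streaming generate request with the schema in context."""
    prompt = (
        f"[prompt {PROMPT_VERSION}] Write product copy for the row below.\n"
        f"Reply with one JSON object matching this JSON schema:\n{json.dumps(schema)}\n"
        f"Product row:\n{json.dumps(product)}"
    )
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                   # return one complete response
        "format": "json",                  # ask Ollama for JSON output
        "options": {"temperature": 0},     # reproducible catalog copy
    }


def generate(payload: dict, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    return body["response"]
```

Requesting JSON output does not guarantee the reply matches the schema, which is why the validation gate still runs on every response.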

How it was built

Phase 1

Get it running + measure

✓ Ollama + a 3–7B model
✓ FastAPI wrapper for descriptions & tags
✓ Benchmark tokens/sec, TTFT, latency

Phase 2

Structure + determinism

✓ JSON schema enforced with Pydantic
✓ Retry-on-invalid, then fail gracefully
✓ Temperature study: 0 vs 0.7
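One simple way to quantify a temperature study is to rerun the same prompt several times and measure how often the output repeats. The `determinism_rate` helper below is a hypothetical metric chosen for illustration; the study's actual measure is not described here, and the sample outputs are made up.

```python
from collections import Counter


def determinism_rate(outputs):
    """Share of runs that produced the most common output (1.0 = fully deterministic)."""
    if not outputs:
        raise ValueError("need at least one output")
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)


# Synthetic outputs for the same prompt, for illustration only.
runs_at_temp_0 = ["Sturdy ceramic mug.", "Sturdy ceramic mug.", "Sturdy ceramic mug."]
runs_at_temp_07 = ["Sturdy ceramic mug.", "A solid mug.", "Sturdy ceramic mug.", "Durable mug."]

print(determinism_rate(runs_at_temp_0))
print(determinism_rate(runs_at_temp_07))
```

Exact string matching is strict; comparing parsed JSON instead would ignore whitespace and key-order differences.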

Phase 3

Model comparison

✓ 3 models on identical hardware/prompts
✓ Memory, tokens/sec, output quality
✓ GGUF Q4/Q5 + local-vs-API break-even
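The break-even idea reduces to simple arithmetic: owned hardware pays off once the per-item saving over the API covers the fixed hardware cost. The sketch below assumes a simplified model with a one-time hardware cost and constant per-item costs; the function name and every number in the example are hypothetical placeholders, not figures from the cost study.

```python
import math


def break_even_items(hardware_cost, local_cost_per_item, api_cost_per_item):
    """Items after which owned hardware is cheaper than the API; None if it never is."""
    saving = api_cost_per_item - local_cost_per_item
    if saving <= 0:
        return None  # the API is at least as cheap per item, so local never catches up
    return math.ceil(hardware_cost / saving)


# Hypothetical inputs: 600 for hardware, 0.0002 per item locally (power), 0.003 per item via API.
print(break_even_items(600, 0.0002, 0.003))  # 214286
```

A fuller model would also account for throughput limits, since a slow local box may not finish a large catalog in an acceptable time even when it is cheaper per item.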

Stack

Ollama · Llama 3.2 3B · Phi-3 · Mistral 7B · FastAPI · Pydantic · GGUF Q4/Q5

Source

GitHub ↗ · Benchmark report

miskelvilaly@gmail.com