SiliconBench local model testing

Local LLM results, explained in plain English

See which local models are best for coding, writing, speed, reliability, and everyday use, with the test evidence underneath.

These are practical local-AI results from our own Mac testing. They show what worked in the tests listed here, not a universal ranking of every model.

Last data update: 9 May 2026, 17:50

Current recommendations

Best fits from this test run

Each card gives the practical answer first, then links to the evidence that supports it. Where evidence is thin, the page says so instead of guessing.

1 Good early evidence

Best local coding helper so far

Qwen3 Coder Next

Qwen3 Coder Next is the safest coding recommendation from the evidence we have published. It has passed small coding checks, but we still need larger project tests.

View evidence

2 Early result

Best writing helper so far

Gemma 4 26B

Gemma 4 26B is currently the strongest writing and summarising candidate in these published tests.

View evidence

3 Early result

Quickest response we measured

Qwen3 Coder Next

Qwen3 Coder Next was the quickest model in the new three-task mini pass.

View evidence

4 Good early evidence

Most consistently available so far

Qwen3 Coder Next

Qwen3 Coder Next has the strongest repeated availability evidence in these published tests.

View evidence

5 Early result

Best all-rounder candidate so far

Qwen3.6 35B

Qwen3.6 35B gave the cleanest all-round result in the new quick practical pass.

View evidence

Tracked models

What is being tested

The page focuses on what each model is useful for. Technical setup details are kept out of the way.

Qwen3 Coder Next

Quick check passed

Our current local coding pick. It passed the new three-task quick practical pass and remains the fastest measured lane on this page, but still needs bigger real-project coding tests.

How we ran it: Dedicated local coding setup
What we checked: basic availability, coding help, format-following check, plain-English advice, quick response check, small coding helper

Qwen3.6 27B

Tested, with caveats

A useful everyday local model, but the new quick pass exposed a real presentation issue: it can show its thinking process instead of giving a clean plain-English answer.

How we ran it: Local everyday-work setup
What we checked: availability, format-following check, general writing, plain-English advice, short-answer edge case, small coding helper

Qwen3.6 35B

Quick check passed

A larger local model that passed three quick practical checks cleanly: plain-English advice, exact one-line output, and a small coding helper. Needs broader testing before promotion.

How we ran it: Shared local model setup
What we checked: availability, basic answer quality, format-following check, plain-English advice, small coding helper

Qwen3 235B

Tested, with caveats

A very large local model that gave useful advice and code in the quick pass, but failed a format-following check by copying the placeholder. Promising, not clean yet.

How we ran it: Shared local model setup
What we checked: availability, basic readiness, format-following check, plain-English advice, small coding helper

Gemma 4 26B

Tested, with caveats

The best non-Qwen all-rounder candidate in this evidence set so far. It did well on several writing and general tasks, but still struggled with some format-following and structured-output tests.

How we ran it: Shared local model setup
What we checked: writing, general advice, coding quick test, long-document checks, formatting limits

GLM 5.1 cloud route

Not local

This entry is not a local result. Ollama was available, but this GLM listing is cloud-only, so it stays out of local recommendations.

How we ran it: Ollama local setup
What we checked: classification only

Gemma 4 small model

Tested, with caveats

A small local model that passed quick writing and coding checks but failed an exact-format check. Useful as small-model coverage, not a recommendation yet.

How we ran it: Temporary test setup
What we checked: quick writing check, quick coding check, formatting limit

DeepSeek V4 Flash

Tested, with caveats

Now genuinely testable locally through a chat-style test setup. It did useful work on reasoning, privacy advice, fake-question handling, structured answers, and small code, but it was too bullish on a missing-data business decision and still has exact wording/language caveats. It also used about 152 GB of memory.

How we ran it: Compatibility check only
What we checked: local test setup works, reasoning tradeoff, privacy advice, hallucination resistance, small code helper, business-judgment caveat, format-following caveat, high memory use

Llama, Mistral, and Phi candidates

Waiting for tests

These families are listed as future coverage gaps. They are not local winners on this page yet.

How we ran it: Retired setup
What we checked: future coverage gap

MTPLX Qwen3.6 27B

Quick check passed

An auxiliary local general-purpose lane that passed three quick practical checks and answered short tasks quickly. It still needs larger real-work tests.

How we ran it: Auxiliary local setup
What we checked: availability, basic response quality, format-following check, plain-English advice, small coding helper

Gemma 4 31B

Tested, with caveats

Now running locally and tested beyond a simple quick check. It passed the new three-task practical pass but still carries earlier caveats on stricter file-making and formatting checks.

How we ran it: Specialist local setup
What we checked: coding-artifact limits, document-style tasks, format-following check, formatting limits, image prompt critique, long-document checks, plain-English advice, readiness check, small coding helper

Qwen2.5 32B

Quick check passed

A local Ollama model that answered a simple one-sentence explanation task. This fills one Ollama coverage gap, but it needs the same practical checks as the other models before comparison.

How we ran it: Ollama local setup
What we checked: one-sentence local AI explanation, availability

Evidence behind the verdict

Drill down by test type

Evidence is grouped by plain questions: can we use it, did it answer quickly, what jobs did it handle, and where did it fail?

Can we use it right now?

Coding model availability check

8 May 2026, 15:34

Worked in this test

Model: Qwen3 Coder Next
What we checked: Check that the coding model was available.

Result: The model was available.
Plain-English note: The coding model was available when checked.

First coding model availability check

8 May 2026, 14:32

Worked in this test

Model: Qwen3 Coder Next
What we checked: Check that the coding model was available.

Result: The coding model was available.
Plain-English note: The first saved check showed the coding model was available.

Qwen3 235B minimal readiness quick check

9 May 2026, 13:34

Worked in this test

Model: Qwen3 235B
What we checked: Reply with exactly: READY

Result: READY
Plain-English note: The large Qwen3 235B lane listed in the local setup answered a minimal readiness prompt. This confirms only that the model can be served; it still needs task-specific evidence.

Gemma 4 31B is running locally

9 May 2026, 15:12

Worked in this test

Model: Gemma 4 31B
What we checked: Start Gemma 4 31B locally and ask for a simple readiness phrase.

Result: GEMMA31_READY
Plain-English note: Gemma 4 31B is now running locally and answered the simple readiness check.

What jobs did it handle?

Simple greeting response

8 May 2026, 15:34

Worked in this test

Model: Qwen3 Coder Next
What we checked: Prompt "Hello" with max_tokens=10.

Result: Hello! How can I help you today?
Plain-English note: Short coherent greeting returned. Finish reason was length, expected because the token cap was low.

Gemma 4 26B non-Qwen gap-fill mini bench

8 May 2026, 16:53

Worked with caveats

Model: Gemma 4 26B
What we checked: Five bounded tasks: writing summary, general recommendation, strict structured answer, exact structured answer, and a quick coding check.

Result: 4/5 validators passed. The strict structured answer check failed because the answer came back inside markdown fences rather than as a raw structured answer.
Plain-English note: Gemma passed 4/5 validators: writing/summarising, all-rounder reasoning, exact structured answer, and the quick coding check. It failed the strict structured answer check because it wrapped the answer in markdown fences.
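
The difference between a strict and a lenient structured-answer check can be sketched in a few lines. This is a minimal sketch, not the validator used in these tests: it assumes the structured answer is JSON, and the function name and the three labels are illustrative.

```python
import json
import re

TICKS = "`" * 3  # a markdown code fence marker

# Matches a reply whose whole body is wrapped in one markdown code fence.
FENCE = re.compile(
    r"^\s*" + TICKS + r"[a-zA-Z0-9]*[ \t]*\n(.*?)\n?[ \t]*" + TICKS + r"\s*$",
    re.DOTALL,
)


def check_strict_answer(text: str) -> str:
    """Classify a reply as 'raw', 'fenced', or 'invalid' (hypothetical labels)."""
    try:
        json.loads(text)
        return "raw"  # parses as-is, so it passes a strict check
    except json.JSONDecodeError:
        pass
    match = FENCE.match(text)
    if match:
        try:
            json.loads(match.group(1))
            return "fenced"  # valid content, but wrapped: fails a strict check
        except json.JSONDecodeError:
            pass
    return "invalid"
```

Under this sketch, a "fenced" reply has usable content but still fails a strict check, which matches the kind of failure recorded here.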

Gemma 4 26B longer practical task pack partial run

8 May 2026, 16:54

Worked with caveats

Model: Gemma 4 26B
What we checked: Canonical 13-task longer practical task pack through the local setup.

Result: Some requested files or format-following answers were missing.
Plain-English note: 11/13 tasks completed. The run is blocked by missing page files and a missing structured answer, so it cannot be scored as a clean win yet.

Gemma 4 26B follow-up file and long-document packs

8 May 2026, 16:56

Worked with caveats

Model: Gemma 4 26B
What we checked: Follow-up file, long-document poison, and format-following packs.

Result: 7/9 follow-up tests passed. The file and long-document poison packs passed; format-following tasks TS3 and TS4 returned invalid structured answers.
Plain-English note: The file and long-document poison packs passed. The exact-format pack was blocked by two invalid structured answers, so this is useful capability evidence but not promotion-ready.

Gemma 4 E2B local file small-model quick test

8 May 2026, 16:59

Worked with caveats

Model: Gemma 4 small model
What we checked: Three quick tasks through a temporary local llama.cpp server: writing summary, strict structured answer, and a quick coding check.

Result: 2/3 quick validators passed; strict structured answer failed.
Plain-English note: The small Gemma lane passed writing and coding quick checks but failed strict structured answer parsing. This is small-model coverage, not a recommendation yet.

MTPLX Qwen3.6 27B plain-English quick response

9 May 2026, 13:34

Worked in this test

Model: MTPLX Qwen3.6 27B
What we checked: In exactly one short sentence, say what local LLM model availability proves and what it does not prove.

Result: Local LLM model availability proves the service is reachable and responsive, but it does not prove the model's output is accurate or logically sound.
Plain-English note: The auxiliary MTPLX lane returned a concise, accurate distinction. This is useful quick evidence, not a full quality benchmark.

Qwen3.6 35B LM Studio plain-English quick response

9 May 2026, 13:34

Worked in this test

Model: Qwen3.6 35B
What we checked: In exactly one short sentence, say what local LLM model availability proves and what it does not prove.

Result: Local LLM model availability proves the service is reachable and responsive, but it does not prove that the model generates accurate or high-quality outputs.
Plain-English note: Qwen3.6 35B through LM Studio produced a clean public-safe answer. This is quick capability evidence, not enough to change recommendation cards yet.

Gemma 4 31B longer practical task pack

9 May 2026, 15:21

Worked with caveats

Model: Gemma 4 31B
What we checked: A 13-part practical task set: writing, research, page copy, reasoning, code, formatting, and summarising.

Result: 11/13 first time; 2 missing pieces recovered after follow-up.
Plain-English note: Gemma 4 31B completed 11 of 13 tasks first time. Two missing pieces were produced after follow-up repair prompts, so this is useful evidence but not a clean win.
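
Counting first-time completions separately from repaired ones is what keeps a run like this from being scored as a clean win. A minimal sketch of that tally follows; the status labels and the function name are assumptions for illustration, not the scoring code behind this page.

```python
from collections import Counter


def summarise_run(statuses: list[str]) -> str:
    """Summarise task outcomes, keeping first-time passes apart from repaired ones."""
    counts = Counter(statuses)
    total = len(statuses)
    first = counts["first_time"]
    repaired = counts["repaired"]
    # Only a run where every task passed first time counts as clean.
    verdict = "clean win" if first == total else "worked with caveats"
    return f"{first}/{total} first time; {repaired} recovered after follow-up; {verdict}"


# The 13-task run reported above: 11 first-time completions, 2 after repair prompts
print(summarise_run(["first_time"] * 11 + ["repaired"] * 2))
# prints "11/13 first time; 2 recovered after follow-up; worked with caveats"
```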

Gemma 4 31B deeper follow-up checks

9 May 2026, 15:30

Worked with caveats

Model: Gemma 4 31B
What we checked: Extra checks for long-document handling, image-prompt critique, exact formatting, and file-making.

Result: 8/11 checks OK; long-document and image-prompt checks passed; format-following and file-making checks need work.
Plain-English note: Gemma 4 31B did well on long-document and image-prompt checks. It still had problems with exact structured answers and one file-making task.

Qwen3 Coder Next three quick practical checks

9 May 2026, 16:24

Worked in this test

Model: Qwen3 Coder Next
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 3/3 quick checks passed; fastest overall in this mini pass.
Plain-English note: Passed all three quick checks: plain-English advice, exact one-line answer, and a small JavaScript helper. Code came back fenced, but the function itself was correct.

Qwen3.6 27B three quick practical checks

9 May 2026, 16:24

Worked with caveats

Model: Qwen3.6 27B
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 2/3 useful; plain-English advice was blocked by visible thinking text.
Plain-English note: Returned correct format-following and coding answers, but the plain-English advice task leaked its thinking process and hit the reply limit before giving the clean answer.

Qwen3.6 35B three quick practical checks

9 May 2026, 16:24

Worked in this test

Model: Qwen3.6 35B
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 3/3 quick checks passed with clean short answers.
Plain-English note: Passed all three quick checks with clean short answers. This is the strongest new all-rounder signal, though still only a mini pass.

MTPLX Qwen3.6 27B three quick practical checks

9 May 2026, 16:24

Worked in this test

Model: MTPLX Qwen3.6 27B
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 3/3 quick checks passed; short tasks returned quickly.
Plain-English note: Passed all three quick checks and was quick on the short-answer and code tasks. Needs larger tests before ranking above the main models.

Qwen3 235B three quick practical checks

9 May 2026, 16:24

Worked with caveats

Model: Qwen3 235B
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 2/3 useful; format-following answer copied the placeholder.
Plain-English note: Gave useful plain-English advice and correct function code, but failed the format-following check by copying the placeholder instead of writing a real sentence.
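
A placeholder-copy failure of this kind is easy to catch mechanically. The sketch below is illustrative only: the actual prompt template used in this check is not published here, so the prefix and placeholder are passed in as parameters and any values supplied to them are hypothetical.

```python
def follows_one_line_format(answer: str, prefix: str, placeholder: str) -> bool:
    """True if the answer is a single line with the required prefix and real content."""
    lines = answer.strip().splitlines()
    if len(lines) != 1 or not lines[0].startswith(prefix):
        return False
    body = lines[0][len(prefix):].strip()
    # An answer that still contains the template placeholder has not been filled in.
    return bool(body) and placeholder not in body
```

With a hypothetical template such as `ANSWER: <sentence>`, an answer that echoes `ANSWER: <sentence>` fails, while `ANSWER: Local AI runs on your own machine.` passes.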

Gemma 4 31B three quick practical checks

9 May 2026, 16:24

Worked in this test

Model: Gemma 4 31B
What we checked: Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.

Result: 3/3 quick checks passed; slower but useful.
Plain-English note: Passed all three quick checks. It was slower than the smaller lanes, but produced useful advice, exact one-line output, and correct function code.

Qwen2.5 32B one-sentence local AI check

9 May 2026, 16:31

Worked in this test

Model: Qwen2.5 32B
What we checked: Answer in one sentence: what is local AI useful for?

Result: Local AI is useful for processing data directly on a device, helping with privacy and offline use.
Plain-English note: Qwen2.5 32B answered the simple local-AI explanation check. A separate Qwen3 Coder Next 80B listing did not load in this pass and is not counted as a working local result.

DeepSeek V4 Flash local workaround test

9 May 2026, 17:08

Worked with caveats

Model: DeepSeek V4 Flash
What we checked: Try three short checks after making the local model load: readiness phrase, plain-English business use, and exact one-line format.

Result: Useful business answer produced; exact wording and format checks failed.
Plain-English note: The model loaded only after a local workaround: using the newer DeepSeek-capable runner, removing a tokenizer-load blocker from a temporary config copy, and adding manual DeepSeek chat markers. It gave a useful business answer, but the readiness phrase came back slightly wrong and the exact-format task failed.

DeepSeek V4 Flash local test setup

9 May 2026, 17:22

Worked with caveats

Model: DeepSeek V4 Flash
What we checked: Call DeepSeek V4 Flash through a local chat-style test service and run three checks: readiness phrase, plain-English business use, and exact one-line format.

Result: Useful business answer; exact wording and format-following still weak.
Plain-English note: The local test service is now working. DeepSeek V4 Flash gave a useful business answer, but returned a near-miss readiness phrase and failed the exact one-line format check. This is enough to keep testing, not enough to recommend it as a clean lane.

DeepSeek V4 Flash deeper local test

9 May 2026, 17:45

Worked with caveats

Model: DeepSeek V4 Flash
What we checked: Deeper local checks through the chat-style test service: reasoning tradeoff, privacy advice, fake-question handling, exact wording, structured answer, small code, missing-data launch judgment, and contradiction detection.

Result: Useful on most practical checks; weak on missing-data business judgment and some exact wording/language behavior.
Plain-English note: DeepSeek V4 Flash is now worth deeper testing: it handled reasoning, privacy advice, fake-question handling, structured output, and small code. It still failed a missing-data business judgment by recommending scaling despite unknown customer acquisition cost (CAC), retention, and traffic quality, and it showed exact wording/language quirks on one format task.

Did it answer quickly?

Tiny request response time quick test

8 May 2026, 14:32

Worked in this test

Model: Qwen3 Coder Next
What we checked: Prompt "Say hello" with max_tokens=5.

Result: Hello! How...
Plain-English note: Approximate response time saved from legacy timestamp comparison. Treat as quick evidence, not tokens-per-second benchmarking.

Did it stay available?

Repeated service and model availability pass

8 May 2026, 15:04

Worked in this test

Model: Qwen3 Coder Next
What we checked: Repeated service and model availability checks across local setups.

Result: All checked active local setups stable; no alert conditions detected.
Plain-English note: Active local setups on ports 8081, 8085, 8091, 1234, and 11434 were recorded as reachable. Port 8086 is not included as active in this entry.

Six-setup interactive availability and inventory check

9 May 2026, 13:31

Worked in this test

Model: No model attached
What we checked: Interactive local HTTP checks of service availability, model lists, and tags across the active local stack.

Result: :8081 ok; :8085 ok; :8091 ok; :1234 model list ok; :11434 tags ok; :8321 ok; :8099 connection refused.
Plain-English note: Interactive checks succeeded for the local coding setup, the multi-model setup, the 27B general setup, LM Studio, Ollama tags, and the MTPLX auxiliary setup. The specialist setup on port 8099 remained unavailable.
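
A basic reachability check like the one recorded above can be sketched with a plain TCP connection attempt. This is a minimal sketch, not the script used for these results: it assumes the local setups listen on 127.0.0.1, it uses the ports from the result line, and it only shows that something is listening, not that a model answers well.

```python
import socket

# Ports taken from the result line above; the host is an assumption.
PORTS = [8081, 8085, 8091, 1234, 11434, 8321, 8099]


def is_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


def summarise(ports: list[int]) -> str:
    """Build a one-line summary in the same style as the result line."""
    parts = []
    for port in ports:
        state = "ok" if is_listening(port) else "unreachable"
        parts.append(f":{port} {state}")
    return "; ".join(parts)
```

Calling `summarise(PORTS)` on the test machine would give one line per run that can be compared across dates. A listening port is only the first half of the evidence; the task checks cover the rest.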

Machine and storage impact

Serving and storage hygiene heartbeat

5 May 2026, 15:48

Worked with caveats

Model: Qwen3.6 35B
What we checked: Local serving inventory plus rounded storage and duplicate-file scan.

Result: Local setup online; rounded free space recorded; duplicate candidates classified for owner-approved cleanup.
Plain-English note: Serving state was clear. Resource use was storage-focused, not live RAM or power usage, so it is only partial resource evidence.

Not ready yet

Retired local setup stays visible as offline

5 May 2026, 10:53

Worked in this test

Model: No model attached
What we checked: Check whether the old local setup is listening.

Result: Retired local setup absent/off.
Plain-English note: Port 8086 was recorded as absent/off and treated as retired, not active drift.

DeepSeek V4 Flash local setup blocked

8 May 2026, 16:53

Not currently available

Model: DeepSeek V4 Flash
What we checked: Attempt local chat completions via the LM Studio-listed DeepSeek V4 Flash model.

Result: Setup load error: model type deepseek_v4 not supported.
Plain-English note: Local weights are present, but the current LM Studio backend reports that the DeepSeek V4 model type is unsupported. No quality benchmark was run.

Non-Qwen family coverage gap recorded

8 May 2026, 17:20

Waiting for tests

Model: Llama, Mistral, and Phi candidates
What we checked: Inventory currently served and stored local model lanes for non-Qwen family coverage.

Result: Gemma tested; DeepSeek setup-blocked; GLM cloud-only; Llama/Mistral/Phi not selected yet.
Plain-English note: This pass found Gemma evidence and a DeepSeek setup blocker, but no current served Llama, Mistral, or Phi local lane. They are listed as future candidates to make the gap visible.

Where answer format went wrong

Sample appended result for update mechanism

8 May 2026, 16:20

Worked with caveats

Model: Qwen3 Coder Next
What we checked: Return exactly valid structured answer with keys verdict and confidence.

Result: { verdict: usable, confidence: medium }
Plain-English note: Synthetic sample used to prove the public data append mechanism and UI support a formatting problem category. Replace with a live Ahoy result when the next benchmark cycle runs.

Update mechanism verification sample

8 May 2026, 16:25

Worked in this test

Model: Qwen3 Coder Next
What we checked: Return compact structured answer with verdict and confidence fields.

Result: {"verdict":"usable","confidence":"medium"}
Plain-English note: Public-safe sample used to verify that future Ahoy benchmark cycles can append structured results without hand-editing the page.
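
The two sample results in this section differ in a way that matters to a parser: the first uses unquoted keys and values, so it is not valid JSON, while the second is. A minimal sketch of a field check, assuming JSON is the expected structured format; the function name is illustrative and this is not the validator behind the page.

```python
import json

REQUIRED_KEYS = {"verdict", "confidence"}


def has_required_fields(text: str) -> bool:
    """True only if text parses as a JSON object containing both required keys."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


print(has_required_fields('{ verdict: usable, confidence: medium }'))      # False
print(has_required_fields('{"verdict":"usable","confidence":"medium"}'))  # True
```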

Qwen3.6 27B low-token exact-answer edge case

9 May 2026, 13:35

Worked with caveats

Model: Qwen3.6 27B
What we checked: Reply with exactly one word: OK, with a low max token cap.

Result: visible content empty; reasoning_content present; finish_reason length.
Plain-English note: The local setup responded, but visible assistant content was empty while reasoning tokens consumed the small cap. This is an important prompt/setup edge case for the public matrix.
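
This edge case can be spotted from the response fields named in the result. A minimal sketch, assuming an OpenAI-style chat completion layout with `content`, `reasoning_content`, and `finish_reason`; the exact response shape of the local setup is an assumption, and the function name and sample are illustrative.

```python
def reasoning_ate_the_cap(response: dict) -> bool:
    """True when the reply hit the token cap with reasoning text but no visible answer."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    visible = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    return choice.get("finish_reason") == "length" and not visible and bool(reasoning)


# Hypothetical response mirroring the fields reported above
sample = {
    "choices": [{
        "finish_reason": "length",
        "message": {"content": "", "reasoning_content": "The user wants one word..."},
    }]
}
print(reasoning_ate_the_cap(sample))  # True
```

When this returns True, the empty answer says more about the token cap than about the model's ability to follow the instruction.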

Writing and summarising

Gemma 4 26B founder summary quick test

8 May 2026, 16:53

Worked in this test

Model: Gemma 4 26B
What we checked: Summarise a local-LLM benchmarking note for a non-technical founder in four bullets plus one caveat.

Result: Privacy advantage; no one-size-fits-all winner; rigorous testing required; avoid quick-test traps; caveat about hardware and maintenance overhead.
Plain-English note: Gemma produced a clear privacy/testing/caveat summary suitable for a non-technical founder. This fills the first public writing/summarising evidence gap, but it is still one bounded quick task.

How these results are judged

Methodology and limits

We separate “it is running” from “it gives good answers”. A model must do both before we recommend it.
Small checks are labelled as small checks. They are not treated as full benchmarks.
If a model gives the wrong format, misses files, or needs a repair prompt, we say so.
We compare models only when they have been tested on similar jobs.
We keep cloud results separate from local Mac results.
We publish caveats beside the recommendation so readers can see how strong the evidence is.
Ollama models are only counted when they answer locally; cloud-only listings stay out of recommendations.
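
The first and last rules above can be read as a simple gate: a model has to be local, available, and backed by at least one passed task check. The sketch below is illustrative only; the record fields and the threshold are assumptions, not the scoring logic behind this page.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    name: str
    runs_locally: bool          # answered from a local setup, not a cloud-only listing
    available: bool             # passed an availability check
    quality_checks_passed: int  # task checks passed, counted apart from availability


def can_recommend(record: ModelRecord) -> bool:
    """Require local operation, availability, and at least one passed task check."""
    return record.runs_locally and record.available and record.quality_checks_passed > 0


# The cloud-only GLM listing is excluded even though Ollama itself was available
glm = ModelRecord("GLM 5.1 cloud route", runs_locally=False, available=True, quality_checks_passed=0)
print(can_recommend(glm))  # False
```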

Update path

How new results are added

Run a real local test and save the result.
Rewrite the result in plain English before it reaches this page.
Build the site and check that no private paths, secrets, or overclaims are published.
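
The last step could be partly automated with a pattern scan over the built pages. This is a minimal sketch under stated assumptions: the patterns are illustrative examples, not the checks this site runs, and a scan like this cannot detect overclaims, which still need a human read.

```python
import re

# Illustrative patterns only; a real list would be tuned to the site's own risks.
PATTERNS = {
    "home directory path": re.compile(r"/(?:Users|home)/[A-Za-z0-9._-]+"),
    "possible API key": re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b"),
    "bearer token": re.compile(r"Bearer\s+[A-Za-z0-9._-]{16,}"),
}


def find_leaks(text: str) -> list[tuple[str, str]]:
    """Return (label, matched text) pairs for every suspicious match in the page text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits
```

A build step could call `find_leaks` on each generated page and stop publishing when the list is not empty.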