
SiliconBench local model testing

Local LLM results, explained in plain English

See which local models are best for coding, writing, speed, reliability, and everyday use, with the test evidence underneath.

These are practical local-AI results from our own Mac testing. They show what worked in the tests listed here, not a universal ranking of every model.

Last data update: 9 May 2026, 17:50

Current recommendations

Best fits from this test run

Each card gives the practical answer first, then links to the evidence that supports it. Where evidence is thin, the page says so instead of guessing.

1 Good early evidence

Best local coding helper so far

Qwen3 Coder Next

Qwen3 Coder Next is the safest coding recommendation from the evidence we have published. It has passed small coding checks, but we still need larger project tests.

View evidence

2 Early result

Best writing helper so far

Gemma 4 26B

Gemma 4 26B is currently the strongest writing and summarising candidate in these published tests.

View evidence

3 Early result

Quickest response we measured

Qwen3 Coder Next

Qwen3 Coder Next was the quickest model in the new three-task mini pass.

View evidence

4 Good early evidence

Most consistently available so far

Qwen3 Coder Next

Qwen3 Coder Next has the strongest repeated availability evidence in these published tests.

View evidence

5 Early result

Best all-rounder candidate so far

Qwen3.6 35B

Qwen3.6 35B gave the cleanest all-round result in the new quick practical pass.

View evidence

Tracked models

What is being tested

The page focuses on what each model is useful for. Technical setup details are kept out of the way.

Qwen3 Coder Next

Quick check passed

Our current local coding pick. It passed the new three-task quick practical pass and remains the fastest measured lane on this page, but it still needs bigger real-project coding tests.

How we ran it
Dedicated local coding setup
What we checked
basic availability, coding help, format-following check, plain-English advice, quick response check, small coding helper

Qwen3.6 27B

Tested, with caveats

A useful everyday local model, but the new quick pass exposed a real presentation issue: it can show its thinking process instead of giving a clean plain-English answer.

How we ran it
Local everyday-work setup
What we checked
availability, format-following check, general writing, plain-English advice, short-answer edge case, small coding helper

Qwen3.6 35B

Quick check passed

A larger local model that passed three quick practical checks cleanly: plain-English advice, exact one-line output, and a small coding helper. Needs broader testing before promotion.

How we ran it
Shared local model setup
What we checked
availability, basic answer quality, format-following check, plain-English advice, small coding helper

Qwen3 235B

Tested, with caveats

A very large local model that gave useful advice and code in the quick pass, but failed a format-following check by copying the placeholder. Promising, not clean yet.

How we ran it
Shared local model setup
What we checked
availability, basic readiness, format-following check, plain-English advice, small coding helper

Gemma 4 26B

Tested, with caveats

The best non-Qwen all-rounder candidate in this evidence set so far. It did well on several writing and general tasks, but still struggled with some format-following and structured-output tests.

How we ran it
Shared local model setup
What we checked
writing, general advice, coding quick test, long-document checks, formatting limits

GLM 5.1 cloud route

Not local

This entry is not a local result. Ollama was available, but this GLM listing is cloud-only, so it stays out of local recommendations.

How we ran it
Ollama local setup
What we checked
classification only

Gemma 4 small model

Quick check passed

A small local model that passed quick writing and coding checks but failed an exact-format check. Useful as small-model coverage, not a recommendation yet.

How we ran it
Temporary test setup
What we checked
quick writing check, quick coding check, formatting limit

DeepSeek V4 Flash

Tested, with caveats

Now genuinely testable locally through a chat-style test setup. It did useful work on reasoning, privacy advice, fake-question handling, structured answers, and small code, but it was too bullish on a missing-data business decision and still has exact wording/language caveats. It also used about 152 GB of memory.

How we ran it
Compatibility check only
What we checked
local test setup works, reasoning tradeoff, privacy advice, hallucination resistance, small code helper, business-judgment caveat, format-following caveat, high memory use

Llama, Mistral, and Phi candidates

Waiting for tests

These families are listed as future coverage gaps. They are not local winners on this page yet.

How we ran it
Retired setup
What we checked
future coverage gap

MTPLX Qwen3.6 27B

Quick check passed

An auxiliary local general-purpose lane that passed three quick practical checks and answered short tasks quickly. It still needs larger real-work tests.

How we ran it
Auxiliary local setup
What we checked
availability, basic response quality, format-following check, plain-English advice, small coding helper

Gemma 4 31B

Tested, with caveats

Now running locally and tested beyond a simple quick check. It passed the new three-task practical pass but still carries earlier caveats on stricter file-making and formatting checks.

How we ran it
Specialist local setup
What we checked
coding-artifact limits, document-style tasks, format-following check, formatting limits, image prompt critique, long-document checks, plain-English advice, readiness check, small coding helper

Qwen2.5 32B

Quick check passed

A local Ollama model that answered a simple one-sentence explanation task. This fills one Ollama coverage gap, but it needs the same practical checks as the other models before comparison.

How we ran it
Ollama local setup
What we checked
one-sentence local AI explanation, availability

Evidence behind the verdict

Drill down by test type

Evidence is grouped by plain questions: can we use it, did it answer quickly, what jobs did it handle, and where did it fail?

Can we use it right now?

Coding model availability check

8 May 2026, 15:34

Worked in this test
Model
Qwen3 Coder Next
What we checked
Check that the coding model was available.
Result
The model was available.
Plain-English note
The coding model was available when checked.

First coding model availability check

8 May 2026, 14:32

Worked in this test
Model
Qwen3 Coder Next
What we checked
Check that the coding model was available.
Result
The coding model was available.
Plain-English note
The first saved check showed the coding model was available.

Qwen3 235B minimal readiness quick check

9 May 2026, 13:34

Worked in this test
Model
Qwen3 235B
What we checked
Reply with exactly: READY
Result
READY
Plain-English note
The large Qwen3 235B lane listed in the local setup answered a minimal readiness prompt. This confirms only that the model can be served; it still needs task-specific evidence.

Gemma 31B is running locally

9 May 2026, 15:12

Worked in this test
Model
Gemma 4 31B
What we checked
Start Gemma 31B locally and ask for a simple readiness phrase.
Result
GEMMA31_READY
Plain-English note
Gemma 31B is now running locally and answered the simple readiness check.

What jobs did it handle?

Simple greeting response

8 May 2026, 15:34

Worked in this test
Model
Qwen3 Coder Next
What we checked
Hello, max_tokens=10
Result
Hello! How can I help you today?
Plain-English note
Short coherent greeting returned. Finish reason was length, expected because the token cap was low.

Gemma 4 26B non-Qwen gap-fill mini bench

8 May 2026, 16:53

Worked with caveats
Model
Gemma 4 26B
What we checked
Five bounded tasks: writing summary, general recommendation, strict structured answer, exact structured answer, and a quick coding check.
Result
4/5 validators passed; the strict structured answer check failed because the answer came back inside markdown fences rather than as raw output.
Plain-English note
Gemma passed 4/5 validators: writing/summarising, all-rounder reasoning, exact structured answer, and the quick coding check. It failed the strict structured answer check because it wrapped the answer in markdown fences.
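
The difference between a raw and a fenced structured answer can be sketched as below. This is a minimal illustration, not the validator used in these tests: it assumes the structured answer is JSON, and the function name classify_structured_answer is hypothetical.

```python
import json
import re

# Three backticks, built at runtime so this sketch contains no literal fence.
FENCE = "`" * 3

# Matches an answer that is entirely one fenced block, with an optional language tag.
FENCED = re.compile(
    r"^" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n" + FENCE + r"$",
    re.DOTALL,
)


def classify_structured_answer(text: str) -> str:
    """Return 'raw', 'fenced', or 'invalid' for a model's structured answer."""
    stripped = text.strip()
    try:
        json.loads(stripped)
        return "raw"  # parses as-is, so a strict check would pass
    except json.JSONDecodeError:
        pass
    match = FENCED.match(stripped)
    if match:
        try:
            json.loads(match.group(1))
            return "fenced"  # valid content, but wrapped in markdown fences
        except json.JSONDecodeError:
            pass
    return "invalid"
```

Under this sketch, a strict check accepts only "raw", while a more lenient check could also accept "fenced". That would explain how the same model can pass one structured answer task and fail another.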

Gemma 4 26B longer practical task pack partial run

8 May 2026, 16:54

Worked with caveats
Model
Gemma 4 26B
What we checked
Canonical 13-task longer practical task pack through the local setup.
Result
Some requested files or format-following answers were missing.
Plain-English note
11/13 tasks completed. The run is blocked by missing page files and a missing structured answer, so it cannot be scored as a clean win yet.

Gemma 4 26B follow-up file and long-document packs

8 May 2026, 16:56

Worked with caveats
Model
Gemma 4 26B
What we checked
Follow-up file, long-document poison, and format-following packs.
Result
7/9 follow-up tests passed. The file pack passed; the long-document poison pack passed; format-following tasks TS3 and TS4 returned invalid structured answers.
Plain-English note
The file and long-document poison packs passed. The exact-format pack was blocked by two invalid structured answers, so this is useful capability evidence but not promotion-ready.

Gemma 4 E2B local file small-model quick test

8 May 2026, 16:59

Worked with caveats
Model
Gemma 4 small model
What we checked
Three quick tasks through a temporary local llama.cpp server: writing summary, strict structured answer, and a quick coding check.
Result
2/3 quick validators passed; the strict structured answer check failed.
Plain-English note
The small Gemma lane passed the quick writing and coding checks but failed strict structured answer parsing. This is small-model coverage, not a recommendation yet.

MTPLX Qwen3.6 27B plain-English quick response

9 May 2026, 13:34

Worked in this test
Model
MTPLX Qwen3.6 27B
What we checked
In exactly one short sentence, say what local LLM model availability proves and what it does not prove.
Result
Local LLM model availability proves the service is reachable and responsive, but it does not prove the model's output is accurate or logically sound.
Plain-English note
The auxiliary MTPLX lane returned a concise, accurate distinction. This is useful quick evidence, not a full quality benchmark.

Qwen3.6 35B LM Studio plain-English quick response

9 May 2026, 13:34

Worked in this test
Model
Qwen3.6 35B
What we checked
In exactly one short sentence, say what local LLM model availability proves and what it does not prove.
Result
Local LLM model availability proves the service is reachable and responsive, but it does not prove that the model generates accurate or high-quality outputs.
Plain-English note
Qwen3.6 35B through LM Studio produced a clean public-safe answer. It is capability quick evidence, not enough to change recommendation cards yet.

Gemma 31B longer practical task pack

9 May 2026, 15:21

Worked with caveats
Model
Gemma 4 31B
What we checked
A 13-part practical task set: writing, research, page copy, reasoning, code, formatting, and summarising.
Result
11/13 first time; 2 missing pieces recovered after follow-up.
Plain-English note
Gemma 31B completed 11 of 13 tasks first time. Two missing pieces were produced after follow-up repair prompts, so this is useful evidence but not a clean win.

Gemma 31B deeper follow-up checks

9 May 2026, 15:30

Worked with caveats
Model
Gemma 4 31B
What we checked
Extra checks for long-document handling, image-prompt critique, exact formatting, and file-making.
Result
8/11 checks OK; long-document and image-prompt checks passed; format-following and file-making checks need work.
Plain-English note
Gemma 31B did well on long-document and image-prompt checks. It still had problems with exact structured answers and one file-making task.

Qwen3 Coder Next three quick practical checks

9 May 2026, 16:24

Worked in this test
Model
Qwen3 Coder Next
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
3/3 quick checks passed; fastest overall in this mini pass.
Plain-English note
Passed all three quick checks: plain-English advice, exact one-line answer, and a small JavaScript helper. Code came back fenced, but the function itself was correct.

Qwen3.6 27B three quick practical checks

9 May 2026, 16:24

Worked with caveats
Model
Qwen3.6 27B
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
2/3 useful; plain-English advice was blocked by visible thinking text.
Plain-English note
Returned correct format-following and coding answers, but the plain-English advice task leaked its thinking process and hit the reply limit before giving the clean answer.

Qwen3.6 35B three quick practical checks

9 May 2026, 16:24

Worked in this test
Model
Qwen3.6 35B
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
3/3 quick checks passed with clean short answers.
Plain-English note
Passed all three quick checks with clean short answers. This is the strongest new all-rounder signal, though still only a mini pass.

MTPLX Qwen3.6 27B three quick practical checks

9 May 2026, 16:24

Worked in this test
Model
MTPLX Qwen3.6 27B
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
3/3 quick checks passed; short tasks returned quickly.
Plain-English note
Passed all three quick checks and was quick on the short-answer and code tasks. Needs larger tests before ranking above the main models.

Qwen3 235B three quick practical checks

9 May 2026, 16:24

Worked with caveats
Model
Qwen3 235B
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
2/3 useful; format-following answer copied the placeholder.
Plain-English note
Gave useful plain-English advice and correct function code, but failed the format-following check by copying the placeholder instead of writing a real sentence.
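
A copied placeholder is easy to detect mechanically. The sketch below is only an illustration: the actual prompt and template used in this test are not published here, so the prefix "VERDICT: " and the placeholder "<one sentence>" in the usage note are hypothetical, as is the function name.

```python
def check_one_line_format(answer: str, prefix: str, placeholder: str) -> list[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    text = answer.strip()
    if "\n" in text:
        problems.append("more than one line")
    has_prefix = text.startswith(prefix)
    if not has_prefix:
        problems.append("missing required prefix")
    # Whatever follows the prefix is the part the model was meant to write.
    body = text[len(prefix):].strip() if has_prefix else text
    if placeholder in text:
        problems.append("placeholder copied instead of replaced")
    elif not body:
        problems.append("empty answer")
    return problems
```

With those hypothetical values, check_one_line_format("VERDICT: <one sentence>", "VERDICT: ", "<one sentence>") reports the copied placeholder, while an answer with a real sentence after the prefix returns an empty list.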

Gemma 4 31B three quick practical checks

9 May 2026, 16:24

Worked in this test
Model
Gemma 4 31B
What we checked
Three short tasks: explain local vs cloud AI in plain English, follow an exact one-line format, and write a small JavaScript helper function.
Result
3/3 quick checks passed; slower but useful.
Plain-English note
Passed all three quick checks. It was slower than the smaller lanes, but produced useful advice, exact one-line output, and correct function code.

Qwen2.5 32B one-sentence local AI check

9 May 2026, 16:31

Worked in this test
Model
Qwen2.5 32B
What we checked
Answer in one sentence: what is local AI useful for?
Result
Local AI is useful for processing data directly on a device, helping with privacy and offline use.
Plain-English note
Qwen2.5 32B answered the simple local-AI explanation check. A separate Qwen3 Coder Next 80B listing did not load in this pass and is not counted as a working local result.

DeepSeek V4 Flash local workaround test

9 May 2026, 17:08

Worked with caveats
Model
DeepSeek V4 Flash
What we checked
Try three short checks after making the local model load: readiness phrase, plain-English business use, and exact one-line format.
Result
Useful business answer produced; exact wording and format checks failed.
Plain-English note
The model loaded only after a local workaround: using the newer DeepSeek-capable runner, removing a tokenizer-load blocker from a temporary config copy, and adding manual DeepSeek chat markers. It gave a useful business answer, but the readiness phrase came back slightly wrong and the exact-format task failed.

DeepSeek V4 Flash local test setup

9 May 2026, 17:22

Worked with caveats
Model
DeepSeek V4 Flash
What we checked
Call DeepSeek V4 Flash through a local chat-style test service and run three checks: readiness phrase, plain-English business use, and exact one-line format.
Result
Useful business answer; exact wording and format-following still weak.
Plain-English note
The local test service is now working. DeepSeek V4 Flash gave a useful business answer, but returned a near-miss readiness phrase and failed the exact one-line format check. This is enough to keep testing, not enough to recommend it as a clean lane.

DeepSeek V4 Flash deeper local test

9 May 2026, 17:45

Worked with caveats
Model
DeepSeek V4 Flash
What we checked
Deeper local checks through the chat-style test service: reasoning tradeoff, privacy advice, fake-question handling, exact wording, structured answer, small code, missing-data launch judgment, and contradiction detection.
Result
Useful on most practical checks; weak on missing-data business judgment and some exact wording/language behavior.
Plain-English note
DeepSeek V4 Flash is now worth deeper testing: it handled reasoning, privacy advice, fake-question handling, structured output, and small code. It still failed a missing-data business judgment by recommending scaling despite unknown CAC/retention/traffic quality, and it showed exact wording/language quirks on one format task.

Did it answer quickly?

Tiny request response time quick test

8 May 2026, 14:32

Worked in this test
Model
Qwen3 Coder Next
What we checked
Say hello, max_tokens=5
Result
Hello! How...
Plain-English note
Approximate response time saved from legacy timestamp comparison. Treat as quick evidence, not tokens-per-second benchmarking.
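
For comparison, a direct wall-clock measurement could look like the sketch below. This is not how the figure above was obtained (that came from comparing saved timestamps); it swaps in Python's time.perf_counter, and fake_send is a hypothetical stand-in for a real model request.

```python
import time
from typing import Callable


def time_request(send: Callable[[], str]) -> tuple[str, float]:
    """Call send() once and return (reply, elapsed_seconds) by wall clock."""
    start = time.perf_counter()
    reply = send()
    elapsed = time.perf_counter() - start
    return reply, elapsed


def fake_send() -> str:
    """Hypothetical stand-in for a real model call; replace with the actual request."""
    time.sleep(0.05)
    return "Hello! How..."


reply, seconds = time_request(fake_send)
print(f"{reply!r} in about {seconds:.2f}s")
```

A single timing like this measures the whole round trip, including any load or queue delay, so it says how long one reply took rather than how many tokens per second the model can generate.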

Did it stay available?

Repeated service and model availability pass

8 May 2026, 15:04

Worked in this test
Model
Qwen3 Coder Next
What we checked
Repeated service availability and model availability checks across local setups.
Result
All checked active local setups stable; no alert conditions detected.
Plain-English note
Active local setups on 8081, 8085, 8091, 1234, and 11434 were recorded as reachable. 8086 is not included as active in this entry.

Six-setup interactive availability and inventory check

9 May 2026, 13:31

Worked in this test
Model
No model attached
What we checked
Interactive local HTTP checks of service availability, model lists, and tags across the active local stack.
Result
:8081 ok; :8085 ok; :8091 ok; :1234 model list ok; :11434 tags ok; :8321 ok; :8099 connection refused.
Plain-English note
Interactive checks succeeded for the local coding setup, the multi-model setup, the 27B general setup, LM Studio, Ollama tags, and the MTPLX auxiliary setup. The specialist setup on 8099 remained unavailable.
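
A result line in this style can be produced with a simple reachability probe. The sketch below is a weaker check than the HTTP checks described in this entry: it only tests whether a TCP connection to each port succeeds, and the function names are hypothetical.

```python
import socket


def port_status(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> str:
    """Return 'ok' if a TCP connection succeeds, else a short failure reason."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        return "connection refused"
    except OSError as exc:  # timeouts and other socket errors
        return f"unreachable ({exc.__class__.__name__})"


def summarise(ports: list[int]) -> str:
    """Format results like ':8081 ok; :8099 connection refused'."""
    return "; ".join(f":{port} {port_status(port)}" for port in ports)
```

Calling summarise([8081, 8099]) would print one status per port. A port that accepts connections proves only that something is listening; it does not prove that the expected model is loaded or that it answers well.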

Machine and storage impact

Serving and storage hygiene heartbeat

5 May 2026, 15:48

Worked with caveats
Model
Qwen3.6 35B
What we checked
Local serving inventory plus rounded storage and duplicate-file scan.
Result
Local setup online; rounded free space recorded; duplicate candidates classified for owner-approved cleanup.
Plain-English note
Serving state was clear. Resource use was storage-focused, not live RAM or power usage, so it is only partial resource evidence.

Not ready yet

Retired local setup stays visible as offline

5 May 2026, 10:53

Worked in this test
Model
No model attached
What we checked
Check whether the old local setup is listening.
Result
Retired local setup absent/off.
Plain-English note
8086 was recorded as absent/off and treated as retired, not active drift.

DeepSeek V4 Flash local setup blocked

8 May 2026, 16:53

Not currently available
Model
DeepSeek V4 Flash
What we checked
Attempt local chat completions via the LM Studio-listed DeepSeek V4 Flash model.
Result
Setup load error: model type deepseek_v4 not supported.
Plain-English note
Local weights are present, but the current LM Studio backend reports that the DeepSeek V4 model type is unsupported. No quality benchmark was run.

Non-Qwen family coverage gap recorded

8 May 2026, 17:20

Waiting for tests
Model
Llama, Mistral, and Phi candidates
What we checked
Inventory currently served and stored local model lanes for non-Qwen family coverage.
Result
Gemma tested; DeepSeek setup-blocked; GLM cloud-only; Llama/Mistral/Phi not selected yet.
Plain-English note
This pass found Gemma evidence and a DeepSeek setup blocker, but no current served Llama, Mistral, or Phi local lane. They are listed as future candidates to make the gap visible.

Where answer format went wrong

Sample appended result for update mechanism

8 May 2026, 16:20

Worked with caveats
Model
Qwen3 Coder Next
What we checked
Return exactly valid structured answer with keys verdict and confidence.
Result
{ verdict: usable, confidence: medium }
Plain-English note
Synthetic sample used to prove the public data append mechanism and UI support a formatting problem category. Replace with a live Ahoy result when the next benchmark cycle runs.

Update mechanism verification sample

8 May 2026, 16:25

Worked in this test
Model
Qwen3 Coder Next
What we checked
Return compact structured answer with verdict and confidence fields.
Result
{"verdict":"usable","confidence":"medium"}
Plain-English note
Public-safe sample used to verify that future Ahoy benchmark cycles can append structured results without hand-editing the page.
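
The two samples in this section differ in a way a parser can detect. The sketch below assumes the structured answer is meant to be JSON with the keys verdict and confidence, as the prompts above describe; the function name validate_verdict is hypothetical and this is not the page's own validator.

```python
import json

REQUIRED_KEYS = ("verdict", "confidence")


def validate_verdict(text: str) -> list[str]:
    """Return a list of problems; an empty list means the answer is usable."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    # Report every missing key rather than stopping at the first one.
    return [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
```

Under this sketch, the first sample's result (unquoted keys and values) is reported as not valid JSON, while the second sample's compact result passes with no problems.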

Qwen3.6 27B low-token exact-answer edge case

9 May 2026, 13:35

Worked with caveats
Model
Qwen3.6 27B
What we checked
Reply with exactly one word: OK, with a low max token cap.
Result
visible content empty; reasoning_content present; finish_reason length.
Plain-English note
The local setup responded, but visible assistant content was empty while reasoning tokens consumed the small cap. This is an important prompt/setup edge case for the public matrix.
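
This edge case can be told apart from an ordinary empty reply by looking at three fields together. The sketch below assumes an OpenAI-style chat completion layout (choices, message, content); only the field names reasoning_content and finish_reason come from the result above, and the function name diagnose_reply is hypothetical.

```python
def diagnose_reply(response: dict) -> str:
    """Classify a chat-style response by what the reader would actually see."""
    choice = response["choices"][0]
    message = choice.get("message", {})
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()
    finish = choice.get("finish_reason")
    if content:
        # Visible text exists; a 'length' finish means it was cut short.
        return "truncated answer" if finish == "length" else "answer returned"
    if reasoning and finish == "length":
        # The token cap was spent on hidden reasoning before any visible text.
        return "reasoning used up the token cap"
    return "empty reply"
```

A reply classified as "reasoning used up the token cap" points to a prompt or setup problem (the cap is too small for a reasoning model), which is different from the model being unable to answer.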

Writing and summarising

Gemma 4 26B founder summary quick test

8 May 2026, 16:53

Worked in this test
Model
Gemma 4 26B
What we checked
Summarise a local-LLM benchmarking note for a non-technical founder in four bullets plus one caveat.
Result
Privacy advantage; no one-size-fits-all winner; rigorous testing required; avoid quick-test traps; caveat about hardware and maintenance overhead.
Plain-English note
Gemma produced a clear privacy/testing/caveat summary suitable for a non-technical founder. This fills the first public writing/summarising evidence gap, but it is still one bounded quick task.

How these results are judged

Methodology and limits

  • We separate “it is running” from “it gives good answers”. A model must do both before we recommend it.
  • Small checks are labelled as small checks. They are not treated as full benchmarks.
  • If a model gives the wrong format, misses files, or needs a repair prompt, we say so.
  • We compare models only when they have been tested on similar jobs.
  • We keep cloud results separate from local Mac results.
  • We publish caveats beside the recommendation so readers can see how strong the evidence is.
  • Ollama models are only counted when they answer locally; cloud-only listings stay out of recommendations.
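
The rules above can be sketched as a small gate. This is an illustration under assumptions, not the site's actual logic: the ModelEvidence fields and the status strings are hypothetical, loosely mirroring the labels used on this page.

```python
from dataclasses import dataclass


@dataclass
class ModelEvidence:
    name: str
    runs_locally: bool   # answered from the local machine, not a cloud route
    available: bool      # passed an availability check
    quality_passed: int  # quality checks passed
    quality_total: int   # quality checks run


def status_for(model: ModelEvidence) -> str:
    """Apply the rules in order: local first, then running, then answer quality."""
    if not model.runs_locally:
        return "not local"
    if not model.available:
        return "not currently available"
    if model.quality_total == 0:
        return "waiting for tests"  # running alone is not enough
    if model.quality_passed < model.quality_total:
        return "tested, with caveats"
    return "all checks passed"
```

For example, a local, available model with 2 of 3 quick checks passed comes out as "tested, with caveats", and a cloud-only listing comes out as "not local" regardless of its scores. Even "all checks passed" reflects only the checks that were run, so a small pass stays a small pass.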

Update path

How new results are added

  1. Run a real local test and save the result.
  2. Rewrite the result in plain English before it reaches this page.
  3. Build the site and check that no private paths, secrets, or overclaims are published.