How It Works

WhichLLM helps you find the best local LLM for your hardware and use case. Here's what happens behind the scenes.

1. Collecting Benchmark Data

We pull scores from two actively maintained benchmark sources, updated twice daily:

  • Chatbot Arena - Real users vote on which model gives better answers in blind comparisons. The gold standard for "vibes." Updated daily.
  • ZeroEval - 10 automated evaluations covering reasoning (GPQA, HLE), math (AIME2025, FrontierMath), coding (SWE-Bench Verified, SciCode), knowledge (MMMLU, SimpleQA), and multimodal (MMMU, MMMUPro). Updated regularly with new models.

We only use sources that are actively maintained with current models. Stale leaderboards that haven't been updated in months are excluded; they skew rankings in favor of older models.

2. Scoring Models

Raw scores from different benchmarks aren't directly comparable: an 85 on SimpleQA means something very different from an 85 on Arena Elo. So we normalize everything to a 0–100 scale per benchmark using min-max normalization:

normalized = ((raw - min) / (max - min)) × 100
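As a sketch, this normalization step can be written as a small Python helper (the function name and the handling of a degenerate range are my assumptions, not WhichLLM's actual code):

```python
def min_max_normalize(raw, lo, hi):
    """Map a raw benchmark score onto a 0-100 scale via min-max scaling."""
    if hi == lo:
        # Degenerate case: every model scored the same on this benchmark.
        return 0.0
    return (raw - lo) / (hi - lo) * 100

# Example: a 1300 Elo on a leaderboard spanning 1000-1400 maps to 75.0.
print(min_max_normalize(1300, 1000, 1400))  # 75.0
```

Note that the min and max come from the set of models currently on each leaderboard, so normalized scores shift slightly as new models are added.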

Then we compute a unified score for each use case as a weighted average across benchmark groups:

S = (w₁ × b̄₁ + w₂ × b̄₂ + ... + wₙ × b̄ₙ) / (w₁ + w₂ + ... + wₙ)

Where wᵢ is the weight for benchmark group i and b̄ᵢ is the average normalized score across that group's tasks. If a model is missing data for a benchmark group, that group is excluded and the remaining weights are renormalized. This means models aren't penalized for benchmarks that simply haven't evaluated them yet.
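A minimal sketch of this weighted average with missing-group renormalization, assuming a simple dict-based schema (the group names and weight values below are illustrative, not the production configuration):

```python
def unified_score(group_scores, weights):
    """Weighted average over benchmark groups, skipping groups with no
    data and renormalizing by the sum of the remaining weights."""
    num = den = 0.0
    for group, w in weights.items():
        score = group_scores.get(group)
        if score is None:
            # Model not yet evaluated on this group: exclude it entirely.
            continue
        num += w * score
        den += w
    return num / den if den else None

# Hypothetical model missing math and multimodal results: only the
# present groups count, and their weights are renormalized.
weights = {"arena": 0.35, "knowledge": 0.40, "math": 0.15, "multimodal": 0.10}
scores = {"arena": 80.0, "knowledge": 70.0}
print(unified_score(scores, weights))
```

With the numbers above, the score is (0.35 × 80 + 0.40 × 70) / 0.75 ≈ 74.7, rather than a lower figure dragged down by the two missing groups.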

The weights for each use case:

  • General: Arena 35%, ZeroEval GPQA/MMMLU/HLE 20%, ZeroEval SimpleQA 20%, ZeroEval AIME/FrontierMath 15%, ZeroEval MMMU/MMMUPro 10%
  • Coding: ZeroEval SWEBench/SciCode 35%, ZeroEval GPQA/AIME 25%, Arena 20%, ZeroEval FrontierMath/HLE 20%
  • RAG: ZeroEval SimpleQA/MMMLU 30%, Arena 30%, ZeroEval GPQA/HLE 20%, ZeroEval MMMU/MMMUPro 20%
  • Roleplay: Arena 50%, ZeroEval MMMLU/HLE 25%, ZeroEval SimpleQA/GPQA 25%
  • Math / Science: ZeroEval AIME/FrontierMath 30%, ZeroEval SciCode/GPQA 25%, ZeroEval HLE 25%, Arena 20%
  • Reasoning: ZeroEval HLE 30%, ZeroEval GPQA/AIME 25%, ZeroEval FrontierMath/MMMUPro 25%, Arena 20%

Notice how roleplay leans heavily on Arena (human preference) while coding spreads weight across code-specific benchmarks. Math/Science prioritizes AIME and FrontierMath. The weights reflect what actually matters for each task.

3. Matching to Your Hardware

When you tell us your setup, we calculate how much VRAM is available for model weights:

Windows / Linux

V_max = V_GPU - 2 GB

Mac (unified memory)

V_max = (RAM_total × 0.75) - 2 GB

The 2 GB overhead accounts for KV cache and system memory. On Mac, we use 75% of total RAM since unified memory is shared between the GPU and the rest of the system.

A model fits if:

size_variant ≤ V_max

Models that don't fit are excluded entirely; a model that doesn't load isn't useful, no matter how good its scores are.
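The V_max formulas and the fit check above amount to a few lines of arithmetic. A sketch, assuming sizes in GB (function names are mine, not WhichLLM's):

```python
def usable_vram_gb(gpu_vram_gb=None, mac_ram_gb=None):
    """V_max: memory available for model weights, minus 2 GB overhead.
    On Mac, the GPU can use roughly 75% of unified memory."""
    if mac_ram_gb is not None:
        return mac_ram_gb * 0.75 - 2
    return gpu_vram_gb - 2

def fits(model_size_gb, v_max_gb):
    """A model fits if its variant size is at most V_max."""
    return model_size_gb <= v_max_gb

print(usable_vram_gb(gpu_vram_gb=24))  # 22: a 24 GB GPU leaves 22 GB
print(usable_vram_gb(mac_ram_gb=32))   # 22.0: a 32 GB Mac also leaves 22 GB
```

Note the symmetry in the example: a 24 GB discrete GPU and a 32 GB Mac end up with the same 22 GB budget for weights.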

4. Picking the Best Variant

Most models come in multiple quantization levels, compressed versions that trade a small amount of quality for significantly less memory:

  • FP16: Full precision, best quality. Largest size.
  • Q8: Nearly indistinguishable from full precision. ~50% of FP16 size.
  • Q6 / Q5: Sweet spot for most users. ~40% of FP16 size.
  • Q4: Noticeable but acceptable quality loss. ~30% of FP16 size.
  • Q3: Significant quality trade-off. ~25% of FP16 size.
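Using the rough fractions from the table, you can estimate variant sizes from the FP16 size. This is a back-of-the-envelope sketch (the fractions are approximations, and the helper is not WhichLLM's actual sizing logic):

```python
# Approximate size fractions relative to FP16, from the table above.
QUANT_FRACTION = {
    "FP16": 1.00,
    "Q8": 0.50,
    "Q6": 0.40,
    "Q5": 0.40,
    "Q4": 0.30,
    "Q3": 0.25,
}

def estimated_size_gb(fp16_size_gb, precision):
    """Estimate a quantized variant's size from the FP16 size."""
    return fp16_size_gb * QUANT_FRACTION[precision]

# A 14 GB FP16 model shrinks to roughly 4.2 GB at Q4.
print(round(estimated_size_gb(14, "Q4"), 2))
```

In practice, real GGUF files list their exact sizes, so these estimates only matter when a variant's size isn't known yet.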

Your preference controls which variant we pick for each model:

  • Max Quality - Highest precision that fits in your VRAM:
    pick variant with max(precision_rank) where size ≤ V_max
  • Balanced - Closest to 70% VRAM utilization, leaving room for longer context:
    pick variant with min(|size - 0.7 × V_max|) where size ≤ V_max
  • Max Context - Smallest variant, maximizing headroom for KV cache:
    pick variant with min(precision_rank) where size ≤ V_max
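The three selection rules could be sketched as follows (the variant schema, rank values, and mode names are assumptions for illustration, not WhichLLM's internals):

```python
# Higher rank means higher precision.
RANK = {"Q3": 0, "Q4": 1, "Q5": 2, "Q6": 3, "Q8": 4, "FP16": 5}

def pick_variant(variants, v_max_gb, mode="balanced"):
    """Choose one quantization variant per the user's preference.
    `variants` maps precision name -> size in GB."""
    # Only variants that fit in V_max are ever considered.
    fitting = {p: s for p, s in variants.items() if s <= v_max_gb}
    if not fitting:
        return None
    if mode == "max_quality":
        # Highest precision that fits.
        return max(fitting, key=lambda p: RANK[p])
    if mode == "max_context":
        # Smallest variant, leaving maximum room for KV cache.
        return min(fitting, key=lambda p: RANK[p])
    # Balanced: size closest to 70% of V_max.
    target = 0.7 * v_max_gb
    return min(fitting, key=lambda p: abs(fitting[p] - target))

variants = {"FP16": 14.0, "Q8": 7.5, "Q5": 5.5, "Q4": 4.2}
print(pick_variant(variants, v_max_gb=10, mode="max_quality"))  # Q8
print(pick_variant(variants, v_max_gb=10, mode="max_context"))  # Q4
```

With 10 GB of usable VRAM, FP16 is out; Max Quality takes Q8, Max Context takes Q4, and Balanced targets 7 GB, which Q8 (7.5 GB) happens to be closest to.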

5. The Recommendation

You get two sets of results:

Top Picks (by quality)

Models ranked purely by their unified score for your use case. The best model is #1 regardless of size โ€” we just pick the variant that fits your hardware. You get the top 5 models, each with the variant that makes sense for your setup. Each recommendation includes the model's context length when available.

No magic multipliers, no hidden boosts for bigger models. Quality first, hardware fit second.

Trending Now (from HuggingFace)

The currently trending models on HuggingFace that are compatible with your hardware. This list updates automatically: we pull from the HuggingFace API and match trending models to available GGUF variants in our database so you can actually run them locally.

Trending models might not have the highest benchmark scores, but they represent what the community is excited about right now. We show the benchmark score alongside download counts when available, so you can see how trending models stack up against the top-scoring ones.