How It Works
WhichLLM helps you find the best local LLM for your hardware and use case. Here's what happens behind the scenes.
1. Collecting Benchmark Data
We pull scores from two actively maintained benchmark sources, updated twice daily:
- Chatbot Arena - Real users vote on which model gives better answers in blind comparisons. The gold standard for "vibes." Updated daily.
- ZeroEval - 10 automated evaluations covering reasoning (GPQA, HLE), math (AIME2025, FrontierMath), coding (SWE-Bench Verified, SciCode), knowledge (MMMLU, SimpleQA), and multimodal (MMMU, MMMUPro). Updated regularly with new models.
We only use sources that are actively maintained with current models. Stale leaderboards that haven't been updated in months are excluded: they skew rankings in favor of older models.
2. Scoring Models
Raw scores from different benchmarks aren't directly comparable: an 85 on SimpleQA means something very different from an 85 on Arena Elo. So we normalize everything to a 0-100 scale per benchmark using min-max normalization:
normalized = ((raw - min) / (max - min)) × 100
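As a minimal sketch, the formula above in Python (the example score and min/max bounds are illustrative, not real leaderboard values):

```python
def normalize(raw: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw benchmark score onto a 0-100 scale,
    where lo/hi are the lowest and highest scores seen on that benchmark."""
    return (raw - lo) / (hi - lo) * 100

# Hypothetical example: a raw score of 85 on a benchmark
# where observed scores range from 40 to 95:
print(round(normalize(85, 40, 95), 1))  # 81.8
```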
Then we compute a unified score for each use case as a weighted average across benchmark groups:
S = (w₁ × b̄₁ + w₂ × b̄₂ + ... + wₙ × b̄ₙ) / (w₁ + w₂ + ... + wₙ)
Where wᵢ is the weight for benchmark group i and b̄ᵢ is the average normalized score across that group's tasks. If a model is missing data for a benchmark group, that group is excluded and the remaining weights are renormalized. This means models aren't penalized for benchmarks that simply haven't evaluated them yet.
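A sketch of that weighted average with missing-group renormalization. The group names and scores here are hypothetical (the weights loosely echo the Coding row of the table); the key point is that a `None` score drops the group from both the numerator and the weight sum:

```python
def unified_score(weights: dict, group_scores: dict) -> float:
    """Weighted average of normalized benchmark-group scores.
    Groups with no data (score is None) are excluded and the
    remaining weights are renormalized by dividing by their sum."""
    present = {g: s for g, s in group_scores.items() if s is not None}
    total_w = sum(weights[g] for g in present)
    return sum(weights[g] * s for g, s in present.items()) / total_w

# Hypothetical model with no FrontierMath/HLE results yet:
weights = {"swe_scicode": 0.35, "gpqa_aime": 0.25, "arena": 0.20, "frontier_hle": 0.20}
scores = {"swe_scicode": 80.0, "gpqa_aime": 70.0, "arena": 90.0, "frontier_hle": None}
print(unified_score(weights, scores))  # 79.375
```

Without renormalization the missing group would act as an implicit zero and drag the score down; dividing by the sum of the weights that are actually present avoids that.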
The weights for each use case:
| Use Case | Weights |
|---|---|
| General | Arena 35%, ZeroEval GPQA/MMMLU/HLE 20%, ZeroEval SimpleQA 20%, ZeroEval AIME/FrontierMath 15%, ZeroEval MMMU/MMMUPro 10% |
| Coding | ZeroEval SWEBench/SciCode 35%, ZeroEval GPQA/AIME 25%, Arena 20%, ZeroEval FrontierMath/HLE 20% |
| RAG | ZeroEval SimpleQA/MMMLU 30%, Arena 30%, ZeroEval GPQA/HLE 20%, ZeroEval MMMU/MMMUPro 20% |
| Roleplay | Arena 50%, ZeroEval MMMLU/HLE 25%, ZeroEval SimpleQA/GPQA 25% |
| Math / Science | ZeroEval AIME/FrontierMath 30%, ZeroEval SciCode/GPQA 25%, ZeroEval HLE 25%, Arena 20% |
| Reasoning | ZeroEval HLE 30%, ZeroEval GPQA/AIME 25%, ZeroEval FrontierMath/MMMUPro 25%, Arena 20% |
Notice how roleplay leans heavily on Arena (human preference) while coding spreads weight across code-specific benchmarks. Math/Science prioritizes AIME and FrontierMath. The weights reflect what actually matters for each task.
3. Matching to Your Hardware
When you tell us your setup, we calculate how much VRAM is available for model weights:
Windows / Linux
V_max = V_GPU - 2 GB
Mac (unified memory)
V_max = (RAM_total × 0.75) - 2 GB
The 2 GB overhead accounts for KV cache and system memory. On Mac, we use 75% of total RAM since unified memory is shared between the GPU and the rest of the system.
A model fits if:
size_variant ≤ V_max
Models that don't fit are excluded entirely: a model that doesn't load isn't useful, no matter how good its scores are.
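The two V_max formulas and the fit check above, as a small sketch (function names are ours, and the 32 GB Mac in the example is hypothetical):

```python
def usable_vram_gb(gpu_vram_gb: float = None, mac_ram_gb: float = None) -> float:
    """V_max: discrete GPU VRAM minus 2 GB overhead, or on Mac,
    75% of unified memory minus the same 2 GB overhead."""
    if mac_ram_gb is not None:
        return mac_ram_gb * 0.75 - 2
    return gpu_vram_gb - 2

def fits(variant_size_gb: float, v_max_gb: float) -> bool:
    """A model variant fits if its weights fit in the usable VRAM."""
    return variant_size_gb <= v_max_gb

v_max = usable_vram_gb(mac_ram_gb=32)  # 32 * 0.75 - 2 = 22.0 GB
print(fits(20.0, v_max))  # True
print(fits(24.0, v_max))  # False
```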
4. Picking the Best Variant
Most models come in multiple quantization levels, compressed versions that trade a small amount of quality for significantly less memory:
| Precision | Quality | Size |
|---|---|---|
| FP16 | Full precision, best quality | Largest |
| Q8 | Nearly indistinguishable from full | ~50% of FP16 |
| Q6 / Q5 | Sweet spot for most users | ~40% of FP16 |
| Q4 | Noticeable but acceptable loss | ~30% of FP16 |
| Q3 | Significant quality trade-off | ~25% of FP16 |
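To make the table's size ratios concrete, here is the arithmetic for a hypothetical model whose FP16 weights take 14 GB (the ratios are the approximate ones from the table, not exact for any specific model):

```python
# Approximate size of each quantization for a hypothetical 14 GB FP16 model.
fp16_gb = 14.0
ratios = [("Q8", 0.50), ("Q6/Q5", 0.40), ("Q4", 0.30), ("Q3", 0.25)]
for name, ratio in ratios:
    print(f"{name}: ~{fp16_gb * ratio:.1f} GB")
# Q8: ~7.0 GB, Q6/Q5: ~5.6 GB, Q4: ~4.2 GB, Q3: ~3.5 GB
```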
Your preference controls which variant we pick for each model:
- Max Quality - Highest precision that fits in your VRAM:
pick variant with max(precision_rank) where size ≤ V_max
- Balanced - Closest to 70% VRAM utilization, leaving room for longer context:
pick variant with min(|size - 0.7 × V_max|)
- Max Context - Smallest variant, maximizing headroom for KV cache:
pick variant with min(precision_rank) where size ≤ V_max
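The three selection rules, sketched over a hypothetical variant list (names, ranks, and sizes are illustrative; we also assume Balanced only considers variants that fit, which the rule above leaves implicit):

```python
# (name, precision_rank, size_gb) - higher rank means higher precision.
VARIANTS = [("FP16", 5, 14.0), ("Q8", 4, 7.5), ("Q6", 3, 5.8),
            ("Q4", 2, 4.1), ("Q3", 1, 3.4)]

def pick(variants, v_max: float, mode: str):
    """Pick a quantization variant according to the user's preference."""
    fitting = [v for v in variants if v[2] <= v_max]  # size <= V_max
    if not fitting:
        return None  # nothing fits; model would be excluded upstream
    if mode == "max_quality":   # highest precision that fits
        return max(fitting, key=lambda v: v[1])
    if mode == "balanced":      # closest to 70% VRAM utilization
        return min(fitting, key=lambda v: abs(v[2] - 0.7 * v_max))
    if mode == "max_context":   # smallest variant, most KV-cache headroom
        return min(fitting, key=lambda v: v[1])
    raise ValueError(f"unknown mode: {mode}")

# With 8 GB of usable VRAM, 0.7 * 8 = 5.6 GB is the Balanced target:
print(pick(VARIANTS, 8.0, "max_quality")[0])  # Q8
print(pick(VARIANTS, 8.0, "balanced")[0])     # Q6 (5.8 GB is closest to 5.6)
print(pick(VARIANTS, 8.0, "max_context")[0])  # Q3
```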
5. The Recommendation
You get two sets of results:
Top Picks (by quality)
Models ranked purely by their unified score for your use case. The best model is #1 regardless of size โ we just pick the variant that fits your hardware. You get the top 5 models, each with the variant that makes sense for your setup. Each recommendation includes the model's context length when available.
No magic multipliers, no hidden boosts for bigger models. Quality first, hardware fit second.
Trending Now (from HuggingFace)
The currently trending models on HuggingFace that are compatible with your hardware. This list updates automatically โ we pull from the HuggingFace API and match trending models to available GGUF variants in our database so you can actually run them locally.
Trending models might not have the highest benchmark scores, but they represent what the community is excited about right now. We show the benchmark score alongside download counts when available, so you can see how trending models stack up against the top-scoring ones.