Benchmark request: DeepSeek-R1-Distill-Llama-70B #3167

Closed
EricBizet opened this issue Feb 7, 2025 · 9 comments

@EricBizet

EricBizet commented Feb 7, 2025

According to the benchmark at https://artificialanalysis.ai/leaderboards/models, the DeepSeek-R1-Distill-Llama-70B model seems to perform better than Sonnet 3.5 and only slightly worse than o1-mini. Given its excellent speed-to-quality ratio, beating top models on that leaderboard, shouldn't we consider it a serious contender as well?

I would like to request a benchmark for DeepSeek-R1-Distill-Llama-70B 🙇. There are many providers out there, and some, like Together.ai or Targon, provide free access to that model, which makes it very attractive for developers.

@gcp

gcp commented Feb 7, 2025

This and the Qwen R1 distill are very interesting, as they can be run locally at a "reasonable" cost (in sensible quantizations like Q3 to Q5). Is running the benchmark just a matter of running some scripts with an API key (or pointing them at an OpenAI-compatible API)? In general, knowing the best local setup for Aider would be highly interesting.
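
For what it's worth, pointing Aider at a local OpenAI-compatible server generally only needs a base URL, a key, and a model name. A minimal sketch, assuming llama.cpp's llama-server is listening on localhost:8080; the port and model name here are placeholders, not values from this thread:

# llama-server exposes an OpenAI-compatible /v1 endpoint; it ignores the
# API key unless it was started with --api-key
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=dummy
aider --model openai/deepseek-r1-distill-llama-70b   # model name is a placeholder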

@gcp

gcp commented Feb 8, 2025

I've added the definitions for a full openrouter/deepseek/deepseek-r1-distill-llama-70b:free provider (there is an existing one using that for reasoning but using deepseek-chat for edits), but I think those "free" providers simply don't work. One is currently listed on OpenRouter as down, and the other is supposedly up, but always gives size 0 answers. Trying OpenRouters' chat on this model gives me a "rate limit exceeded" on the very first query.
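
For reference, a rough sketch of the general shape such a model-settings.yml entry takes (field names follow existing entries in aider/resources/model-settings.yml; the values are illustrative assumptions, not a copy of what was actually added):

# illustrative entry; the exact values are assumptions
- name: openrouter/deepseek/deepseek-r1-distill-llama-70b:free
  edit_format: diff
  weak_model_name: openrouter/deepseek/deepseek-r1-distill-llama-70b:free
  use_repo_map: true
  use_temperature: false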

Checking the logs, it looks like "Targon" managed a few queries before going down, but "Together" just returns zero-length answers.

I'm not sure you'd ever want to use these providers, given the above!

@gcp

gcp commented Feb 9, 2025

Running the model locally runs into ggml-org/llama.cpp#10976 (comment) (and for a model this size, the "plenty of VRAM" condition certainly does not apply!).

@gcp

gcp commented Feb 9, 2025

Hmm, I tried this on another machine that's not affected by this issue, but aider and llama.cpp don't seem to play nice:

slot launch_slot_: id  0 | task 563 | processing task
slot update_slots: id  0 | task 563 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2596
slot update_slots: id  0 | task 563 | need to evaluate at least 1 token to generate logits, n_past = 2596, n_prompt_tokens = 2596
slot update_slots: id  0 | task 563 | kv cache rm [2595, end)
slot update_slots: id  0 | task 563 | prompt processing progress, n_past = 2596, n_tokens = 1, progress = 0.000385
slot update_slots: id  0 | task 563 | prompt done, n_past = 2596, n_tokens = 1
srv  cancel_tasks: cancel task, id_task = 563
srv  log_server_r: request: POST /chat/completions 172.17.0.2 200
slot      release: id  0 | task 563 | stop processing: n_past = 2998, truncated = 0
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 968 | processing task
slot update_slots: id  0 | task 968 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2596
slot update_slots: id  0 | task 968 | need to evaluate at least 1 token to generate logits, n_past = 2596, n_prompt_tokens = 2596
slot update_slots: id  0 | task 968 | kv cache rm [2595, end)
slot update_slots: id  0 | task 968 | prompt processing progress, n_past = 2596, n_tokens = 1, progress = 0.000385
slot update_slots: id  0 | task 968 | prompt done, n_past = 2596, n_tokens = 1

This "need to evaluate at least 1 token to generate logits" looks quite suspicious.

Edit: this is because of the default timeout of 10 minutes, which isn't enough for thinking models running locally.
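
If the installed aider version supports the --timeout option (seconds allowed per API call), raising it works around this; a minimal sketch, with an arbitrary value:

# assumes this aider version supports --timeout; 3600 is just an example value
# chosen so slow local "thinking" generations aren't cancelled mid-stream
aider --model openai/deepseek-r1-distill-llama-70b --timeout 3600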

@EricBizet
Author

That's unfortunate. I have personally been using the Together AI free endpoint without any issues, but I was not expecting the OpenRouter proxy to fail. Many thanks for checking this out.
Please keep us posted if you find anything 🙇

@EvitanRelta

Seems like someone has created a PR with the R1 70B results: #3194

Disappointing results, though. Seems like R1 70B is quite underwhelming on our benchmark.

@gcp

gcp commented Feb 12, 2025

If they used the model configuration as in the current repo, then the edits were actually done by deepseek-chat. See: https://github.com/Aider-AI/aider/blob/main/aider/resources/model-settings.yml#L601
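
That entry pairs the distill with deepseek-chat as the editor model, roughly along these lines (an illustration of the pattern, not a copy of the actual entry at that line):

# illustrative only, not the literal entry linked above
- name: openrouter/deepseek/deepseek-r1-distill-llama-70b
  use_repo_map: true
  # the editor-model settings below are what hand the actual file edits to deepseek-chat
  editor_model_name: openrouter/deepseek/deepseek-chat
  editor_edit_format: editor-diff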

That said, that would be an upper bound on the performance. And from my run so far, it looks just as bad:

- dirname: 2025-02-09-22-36-49--deepseek-r1-distill-llama-70b
  test_cases: 18
  model: openai/deepseek/deepseek-r1-distill-llama-70b:free
  edit_format: diff
  commit_hash: d5dd252
  pass_rate_1: 5.6
  pass_rate_2: 5.6
  pass_num_1: 1
  pass_num_2: 1
  percent_cases_well_formed: 88.9
  error_outputs: 7
  num_malformed_responses: 6
  num_with_malformed_responses: 2
  user_asks: 5
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 225
  command: aider --model openai/deepseek/deepseek-r1-distill-llama-70b:free
  date: 2025-02-09
  versions: 0.74.2.dev
  seconds_per_case: 7248.5
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

I'm going to stop this run (it's very slow, as you can see), as there is no real reason to consider this configuration. And I hope this hammers home that the R1 distills are NOT R1, not even close.

@gcp

gcp commented Feb 17, 2025

The Qwen distill is a bit better: #3278

Pretty close to the other models that are reasonable to run locally.

@EricBizet
Author

Closing, as the performance is poorer than expected. Many thanks!
