Benchmark request: DeepSeek-R1-Distill-Llama-70B #3167

Closed
EricBizet opened this issue Feb 7, 2025 · 9 comments

@EricBizet

EricBizet commented Feb 7, 2025

According to the benchmark at https://artificialanalysis.ai/leaderboards/models, the DeepSeek-R1-Distill-Llama-70B model seems to perform better than Sonnet 3.5 and only slightly worse than o1-mini. Given its excellent speed-to-quality ratio, beating top models on that leaderboard, shouldn't we consider it a serious contender as well?

I would like to request a benchmark for DeepSeek-R1-Distill-Llama-70B 🙇. There are many providers out there, and some, like Together.ai or Targon, provide free access to that model, which makes it very attractive for developers.

@gcp

gcp commented Feb 7, 2025

This and the Qwen R1 distill are very interesting, as they can be run locally at a "reasonable" cost (in sensible quantizations like Q3 to Q5). Is running the benchmark just a matter of running some scripts with an API key (or pointing them at an OpenAI-compatible API)? In general, knowing the best local setup for Aider would be highly interesting.
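
For what it's worth, pointing Aider at a local OpenAI-compatible server generally only needs a base URL, a key, and a model name. A minimal sketch, assuming llama.cpp's llama-server is listening on localhost:8080; the port and model name here are placeholders, not values from this thread:

# llama-server exposes an OpenAI-compatible /v1 endpoint; it ignores the
# API key unless it was started with --api-key
export OPENAI_API_BASE=http://localhost:8080/v1
export OPENAI_API_KEY=dummy
aider --model openai/deepseek-r1-distill-llama-70b   # model name is a placeholder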

@gcp

gcp commented Feb 8, 2025

I've added the definitions for a full openrouter/deepseek/deepseek-r1-distill-llama-70b:free provider (there is an existing one using that for reasoning but using deepseek-chat for edits), but I think those "free" providers simply don't work. One is currently listed on OpenRouter as down, and the other is supposedly up, but always gives size 0 answers. Trying OpenRouters' chat on this model gives me a "rate limit exceeded" on the very first query.
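
For reference, a rough sketch of the general shape such a model-settings.yml entry takes (field names follow existing entries in aider/resources/model-settings.yml; the values are illustrative assumptions, not a copy of what was actually added):

# illustrative entry; the exact values are assumptions
- name: openrouter/deepseek/deepseek-r1-distill-llama-70b:free
  edit_format: diff
  weak_model_name: openrouter/deepseek/deepseek-r1-distill-llama-70b:free
  use_repo_map: true
  use_temperature: false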

Checking the logs, it looks like "Targon" managed a few queries before going down, but "Together" just returns zero-length answers.

I'm not sure you'd ever want to use these providers, given the above!

@gcp

gcp commented Feb 9, 2025

Running the model locally runs into ggml-org/llama.cpp#10976 (comment) (and for a model this size, the "plenty of VRAM" condition certainly does not apply!).

@gcp

gcp commented Feb 9, 2025

Hmm, I tried this on another machine that's not affected by this issue, but aider and llama.cpp don't seem to play nice:

slot launch_slot_: id  0 | task 563 | processing task
slot update_slots: id  0 | task 563 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2596
slot update_slots: id  0 | task 563 | need to evaluate at least 1 token to generate logits, n_past = 2596, n_prompt_tokens = 2596
slot update_slots: id  0 | task 563 | kv cache rm [2595, end)
slot update_slots: id  0 | task 563 | prompt processing progress, n_past = 2596, n_tokens = 1, progress = 0.000385
slot update_slots: id  0 | task 563 | prompt done, n_past = 2596, n_tokens = 1
srv  cancel_tasks: cancel task, id_task = 563
srv  log_server_r: request: POST /chat/completions 172.17.0.2 200
slot      release: id  0 | task 563 | stop processing: n_past = 2998, truncated = 0
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 968 | processing task
slot update_slots: id  0 | task 968 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 2596
slot update_slots: id  0 | task 968 | need to evaluate at least 1 token to generate logits, n_past = 2596, n_prompt_tokens = 2596
slot update_slots: id  0 | task 968 | kv cache rm [2595, end)
slot update_slots: id  0 | task 968 | prompt processing progress, n_past = 2596, n_tokens = 1, progress = 0.000385
slot update_slots: id  0 | task 968 | prompt done, n_past = 2596, n_tokens = 1

This "need to evaluate at least 1 token to generate logits" looks quite suspicious.

Edit: this is because of the default timeout of 10 minutes, which isn't enough for thinking models running locally.
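
If the installed aider version supports the --timeout option (seconds allowed per API call), raising it works around this; a minimal sketch, with an arbitrary value:

# assumes this aider version supports --timeout; 3600 is just an example value
# chosen so slow local "thinking" generations aren't cancelled mid-stream
aider --model openai/deepseek-r1-distill-llama-70b --timeout 3600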

@EricBizet
Author

That's unfortunate. I have personally been using the Together AI free endpoint without any issues, but I was not expecting the OpenRouter proxy to fail. Many thanks for checking this out.
Please keep us posted if you find anything 🙇

@EvitanRelta

Seems like someone has created a PR with the R1 70B results: #3194

Disappointing results, though. Seems like R1 70B is quite underwhelming on our benchmark.

@gcp

gcp commented Feb 12, 2025

If they used the model configuration as in the current repo, then the edits were actually done by deepseek-chat. See: https://github.com/Aider-AI/aider/blob/main/aider/resources/model-settings.yml#L601
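
That entry pairs the distill with deepseek-chat as the editor model, roughly along these lines (an illustration of the pattern, not a copy of the actual entry at that line):

# illustrative only, not the literal entry linked above
- name: openrouter/deepseek/deepseek-r1-distill-llama-70b
  use_repo_map: true
  # the editor-model settings below are what hand the actual file edits to deepseek-chat
  editor_model_name: openrouter/deepseek/deepseek-chat
  editor_edit_format: editor-diff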

That said, that would be an upper bound on the performance. And from my run so far, it looks just as bad:

- dirname: 2025-02-09-22-36-49--deepseek-r1-distill-llama-70b
  test_cases: 18
  model: openai/deepseek/deepseek-r1-distill-llama-70b:free
  edit_format: diff
  commit_hash: d5dd252
  pass_rate_1: 5.6
  pass_rate_2: 5.6
  pass_num_1: 1
  pass_num_2: 1
  percent_cases_well_formed: 88.9
  error_outputs: 7
  num_malformed_responses: 6
  num_with_malformed_responses: 2
  user_asks: 5
  lazy_comments: 0
  syntax_errors: 0
  indentation_errors: 0
  exhausted_context_windows: 0
  test_timeouts: 0
  total_tests: 225
  command: aider --model openai/deepseek/deepseek-r1-distill-llama-70b:free
  date: 2025-02-09
  versions: 0.74.2.dev
  seconds_per_case: 7248.5
  total_cost: 0.0000

costs: $0.0000/test-case, $0.00 total, $0.00 projected

I'm going to stop this run (it's very slow, as you can see), as there is no real reason to consider this configuration. And I hope this hammers home that the R1 distills are NOT R1, not even close.

@gcp

gcp commented Feb 17, 2025

The Qwen distill is a bit better: #3278

Pretty close to the other models that are reasonable to run locally.

@EricBizet
Author

Closing, as the performance is poorer than expected. Many thanks!
