Benchmark request: DeepSeek-R1-Distill-Llama-70B #3167
Comments
This and the Qwen-R1-Distill are very interesting, as they (in reasonable quantizations like Q3 to Q5) can be run locally at a "reasonable" cost. Is this just a matter of running some scripts with an API key (or a link to an OpenAI-compatible API)? In general, knowing the best local setup for Aider would be highly interesting.
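For what it's worth, a minimal sketch of pointing aider at a locally served model through an OpenAI-compatible endpoint (the URL, port, and model name below are placeholders, not a tested setup):

```bash
# Assumes an OpenAI-compatible server (e.g. llama.cpp's llama-server) is already
# listening locally; the base URL, port, and model name are illustrative only.
export OPENAI_API_BASE=http://127.0.0.1:8080/v1
export OPENAI_API_KEY=sk-local-dummy   # most local servers ignore the key

# The openai/ prefix makes aider treat this as a generic OpenAI-compatible endpoint.
aider --model openai/deepseek-r1-distill-llama-70b
```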
I've added the definitions for a full benchmark run. Checking the logs, it looks like "Targon" managed a few queries before going down, but "Together" just returns zero-size answers. Not sure that you'd ever want to use them, given the above!
Running the model locally hits ggml-org/llama.cpp#10976 (comment) (and for this model, "plenty of VRAM" does not apply!).
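As a rough sketch (the file name, layer count, and context size are assumptions, not tested values), serving a quantized distill with llama.cpp's OpenAI-compatible server might look like this, offloading only as many layers as the GPU can actually hold:

```bash
# Hypothetical launch: partially offload a Q4_K_M 70B quant to the GPU and keep
# the remaining layers on the CPU; lower -ngl if VRAM runs out.
./llama-server \
  -m DeepSeek-R1-Distill-Llama-70B-Q4_K_M.gguf \
  -ngl 40 \
  -c 8192 \
  --port 8080
```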
Hmm, I tried this on another machine that's not affected by this issue, but aider and llama.cpp don't seem to play nice:
This "need to evaluate at least 1 token to generate logits" looks quite suspicious. Edit: This is because of the default timeout of 10 minutes, which isn't enough for the thinking models running locally. |
That's unfortunate. I have personally been using the Together AI free endpoint without any issue, but I was not expecting the OpenRouter proxy to fail. Many thanks for checking this out.
Seems like someone has created a PR with the R1 70B results: #3194. Disappointing results though; R1 70B seems quite underwhelming on our benchmark.
If they used the model configuration as in the current repo, then the edits were not actually done by the distill itself. That said, that would be an upper bound on the performance. And from my run so far, it looks just as bad:
I'm going to stop this (it's very slow, as you can see), as there is no real reason to consider this configuration. And I hope this hammers home that the R1 distills are NOT R1, not even close.
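To address the concern above about which model actually produces the edits, one could force the distill to emit the edits itself instead of delegating to a separate editor model. A rough sketch (the OpenRouter slug is assumed, and `whole` is just the simplest edit format, not necessarily the one used for the published scores):

```bash
# Hypothetical: run the distill as the only model and have it return whole files,
# so the score reflects the distill's own editing ability.
aider --model openrouter/deepseek/deepseek-r1-distill-llama-70b --edit-format whole
```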
The Qwen distill is a bit better: #3278. Pretty close to the other models that are reasonable to run locally.
Closing, as performance is poorer than expected. Many thanks.
According to the https://artificialanalysis.ai/leaderboards/models benchmark, the DeepSeek-R1-Distill-Llama-70B model performs better than Sonnet 3.5 and only a bit worse than o1-mini. Given its amazing speed-to-quality ratio, beating top models on that benchmark, shouldn't we consider this model a serious contender as well? I would like to request a benchmark for DeepSeek-R1-Distill-Llama-70B 🙇. There are many providers out there, and some, like Together.ai or Targon, provide free access to that model, making it very attractive for developers.