MQA Implementation for 2B models #114
Conversation
Thanks very much! MQA is one of the most important pieces of low-hanging fruit to implement right now. Looks pretty good overall; have a look at the comment about avoiding branching.
Tagging @pculliton to check the model exporting + vocab size change and @jan-wassenberg on any perf suggestions.
Nice, thank you :) Some small suggestions:
This LGTM. If the performance looks good/better (I'm curious how much), generation looks correct, and @jan-wassenberg LGTMs, we can probably move forward with merging to dev.
I tested the weights converted from gemma_pytorch (2b-it and 7b-it) and the generation looks fine.
Very nice use of lambdas! Thanks for making the change.
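For readers following along, here is a minimal sketch of the lambda-based approach discussed above, which sidesteps branching between the MQA and MHA code paths. The constant names (`kHeads`, `kKVHeads`) and values are illustrative assumptions; the PR's actual code may differ.

```cpp
#include <cstddef>

// Hypothetical config constants; real names/values in gemma.cpp may differ.
constexpr size_t kHeads = 8;    // number of query heads
constexpr size_t kKVHeads = 1;  // 1 => MQA (2B), kHeads => MHA (7B)

void AttendAllHeads() {
  // A lambda maps each query head to its K/V head, so the same loop body
  // serves both MQA and MHA without an if/else on the model type.
  auto kv_head = [](size_t q_head) -> size_t {
    return q_head / (kHeads / kKVHeads);  // 0 for MQA, identity for MHA
  };
  for (size_t h = 0; h < kHeads; ++h) {
    const size_t kv = kv_head(h);
    // ... dot-product attention of query head h against KV-cache slot kv ...
    (void)kv;
  }
}
```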
This PR implements "Multi-Query Attention" for the 2B models and changes the vocabulary size to match gemma_pytorch (as mentioned in #103). It works fine with weights converted from gemma_pytorch, but it makes the original gemma.cpp weights unusable.
It needs more testing, and I'll use it to test the fine-tuned weights.
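As a rough sketch of what MQA changes relative to the 7B model's standard multi-head attention: every query head keeps its own projection, but the 2B model stores only a single shared K/V head, which shrinks the per-token KV cache accordingly. The constant names below are illustrative assumptions (not the PR's code); the head counts and per-head dimension follow the published Gemma architecture.

```cpp
#include <cstddef>

// Illustrative per-model attention configuration (hypothetical names;
// head counts and dimensions follow the published Gemma architecture).
struct Gemma2BConfig {
  static constexpr size_t kHeads = 8;     // query heads
  static constexpr size_t kKVHeads = 1;   // MQA: one shared K/V head
  static constexpr size_t kQKVDim = 256;  // per-head dimension
};

struct Gemma7BConfig {
  static constexpr size_t kHeads = 16;    // query heads
  static constexpr size_t kKVHeads = 16;  // MHA: one K/V head per query head
  static constexpr size_t kQKVDim = 256;
};

// Per-token KV-cache footprint in elements: keys + values per K/V head.
template <class Config>
constexpr size_t KVCachePerToken() {
  return 2 * Config::kKVHeads * Config::kQKVDim;
}

static_assert(KVCachePerToken<Gemma2BConfig>() == 512, "MQA: 1 KV head");
static_assert(KVCachePerToken<Gemma7BConfig>() == 8192, "MHA: 16 KV heads");
```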