Reminder
I have read the above rules and searched the existing issues.
System Info
1. Problem description (error log context attached):
The following error is raised when fine-tuning (SFT) qwen2-7b with LLaMA-Factory:
File "/home/goo/project/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
launch()
File "/home/goo/project/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
run_exp()
File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 93, in run_exp
_training_function(config={"args": args, "callbacks": callbacks})
File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 67, in _training_function
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
return inner_training_loop(
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 3698, in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 3759, in compute_loss
outputs = model(**inputs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
loss = self.module(*inputs, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
return func(*args, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 877, in forward
loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 47, in ForCausalLMLoss
loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: malloc:torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:879 NPU error, error code is 507899
[ERROR] 2025-02-20-16:53:06 (PID:46075, Device:1, RankID:1) ERR00100 PTA call acl api failed
[Error]: An internal error occurs in the Driver module.
Rectify the fault based on the error information in the ascend log.
E40024: 2025-02-20-14:05:46.947.014 Failed call Python Func/Meathod [get_binfile_sha256_hash_from_c], Reason[SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
]
Possible Cause: The Python Func/Meathod does not exist.
TraceBack (most recent call last):
Failed to allocate memory.
rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
[drv api] halMemGetInfo failed: device_id=5, type=2, drvRetCode=17![FUNC:MemGetInfoEx][FILE:npu_driver.cc][LINE:2017]
rtMemGetInfoEx execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
get memory information failed, runtime result = 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
2. Software versions:
-- CANN version (e.g., CANN 3.0.x, 5.x.x): 8.0.1RC3
-- TensorFlow/PyTorch/MindSpore version: pytorch==2.1.0, pytorch_npu==2.1.0.post3
-- Python version (e.g., Python 3.7.5): 3.10.18
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): eulerosv2r8.aarch64
3. Test steps:
1. Test environment: 4 x Ascend 910A NPUs
2. Training configuration files used (a hypothetical sketch follows below; the actual attachments are not reproduced here):
LLaMA-Factory config file:
DeepSpeed config file:
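For reference only: the reporter's actual example.yaml and DeepSpeed file are not included in the issue text. A minimal LLaMA-Factory SFT config of the kind implied above (qwen2-7b, DeepSpeed, 4 x 910A) might look like the sketch below; the model path, dataset, template, finetuning type, and all hyperparameter values are assumptions, not the reporter's settings.

```yaml
### model (all values hypothetical; not the reporter's actual example.yaml)
model_name_or_path: Qwen/Qwen2-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora   # assumed; the report does not say whether LoRA or full fine-tuning was used
lora_target: all

### dataset
dataset: alpaca_en_demo
template: qwen
cutoff_len: 1024
preprocessing_num_workers: 16

### output
output_dir: saves/qwen2-7b/sft
logging_steps: 10
save_steps: 500
overwrite_output_dir: true

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true

### deepspeed (ZeRO-3 example config shipped with LLaMA-Factory; path assumed)
deepspeed: examples/deepspeed/ds_z3_config.json
```

Since the log above shows a device memory allocation failure (rtMalloc "out of memory" followed by error 507899), cutoff_len, per_device_train_batch_size, and the DeepSpeed ZeRO stage/offload settings are the values such a config would typically trade off against NPU memory.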
Reproduction
llamafactory-cli train example.yaml
output.log
Others
No response