Error E40024 when running LLaMA-Factory SFT of qwen2-7b on Ascend 910A: 2025-02-20-14:05:46.947.014 Failed call Python Func/Meathod [get_binfile_sha256_hash_from_c], #7014

Open
Goo-goo-goo opened this issue Feb 20, 2025 · 0 comments
Labels
bug (Something isn't working) · npu (This problem is related to NPU devices) · pending (This problem is yet to be addressed)

Comments


Reminder

  • I have read the above rules and searched the existing issues.

System Info

I. Problem description (error log context attached):

The following error occurs when running SFT of qwen2-7b with LLaMA-Factory:

  File "/home/goo/project/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/home/goo/project/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 93, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/tuner.py", line 67, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/goo/project/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 3698, in training_step
    loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/trainer.py", line 3759, in compute_loss
    outputs = model(**inputs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1914, in forward
    loss = self.module(*inputs, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 877, in forward
    loss = self.loss_function(logits=logits, labels=labels, vocab_size=self.config.vocab_size, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 47, in ForCausalLMLoss
    loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/transformers/loss/loss_utils.py", line 26, in fixed_cross_entropy
    loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
  File "/root/miniconda3/envs/project/lib/python3.10/site-packages/torch/nn/functional.py", line 3053, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: malloc:torch_npu/csrc/core/npu/NPUCachingAllocator.cpp:879 NPU error, error code is 507899
[ERROR] 2025-02-20-16:53:06 (PID:46075, Device:1, RankID:1) ERR00100 PTA call acl api failed
[Error]: An internal error occurs in the Driver module. 
        Rectify the fault based on the error information in the ascend log.
E40024: 2025-02-20-14:05:46.947.014 Failed call Python Func/Meathod [get_binfile_sha256_hash_from_c], Reason[SystemError: PY_SSIZE_T_CLEAN macro must be defined for '#' formats
] 
        Possible Cause: The Python Func/Meathod does not exist.
        TraceBack (most recent call last):
        Failed to allocate memory.
        rtMalloc execute failed, reason=[driver error:out of memory][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        alloc device memory failed, runtime result = 207001[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
        [drv api] halMemGetInfo failed: device_id=5, type=2, drvRetCode=17![FUNC:MemGetInfoEx][FILE:npu_driver.cc][LINE:2017]
        rtMemGetInfoEx execute failed, reason=[driver error:internal error][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
        get memory information failed, runtime result = 507899[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
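
The tail of this log is the actual failure: rtMalloc reports "out of memory" and halMemGetInfo then fails, so the device runs out of HBM during the forward pass. A minimal sketch for inspecting per-device memory from Python before launching training, assuming torch_npu 2.1.0 mirrors the torch.cuda-style memory helpers under torch.npu (verify the names against the installed adapter):

import torch
import torch_npu  # noqa: F401  -- registers the "npu" device with PyTorch

# Assumption: torch.npu exposes device_count / set_device / memory_allocated /
# memory_reserved like torch.cuda does; check the torch_npu docs if it does not.
for i in range(torch.npu.device_count()):
    torch.npu.set_device(i)
    allocated_gib = torch.npu.memory_allocated(i) / 1024 ** 3
    reserved_gib = torch.npu.memory_reserved(i) / 1024 ** 3
    print(f"npu:{i} allocated={allocated_gib:.2f} GiB reserved={reserved_gib:.2f} GiB")

At the driver level, npu-smi info can also be used to confirm how much HBM is free on each 910A before training starts.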

II. Software versions:

-- CANN version (e.g., CANN 3.0.x, 5.x.x): 8.0.1RC3
-- TensorFlow/PyTorch/MindSpore version: pytorch==2.1.0 pytorch_npu==2.1.0.post3 (a quick version-check sketch follows this list)
-- Python version (e.g., Python 3.7.5): 3.10.18
-- MindStudio version (e.g., MindStudio 2.0.0 (beta3)):
-- OS version (e.g., Ubuntu 18.04): eulerosv2r8.aarch64
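
To confirm that the runtime matches the versions listed above, a hedged sketch (it assumes torch_npu exposes __version__ like most packages):

import sys
import torch
import torch_npu

print("python   :", sys.version.split()[0])   # reported as 3.10.18
print("torch    :", torch.__version__)        # reported as 2.1.0
print("torch_npu:", torch_npu.__version__)    # reported as 2.1.0.post3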

III. Test steps:

1. Test environment: 4 x 910A NPUs
2. Training configuration files used:

LLaMA-Factory config file (a batch-size calculation follows it):

cutoff_len: 2048
dataset: identity,tool_identify,glaive_toolcall_zh_demo
dataset_dir: /home/goo/project/dataset
ddp_timeout: 180000000
deepspeed: /home/goo/project/train_config/ds_z3_offload_config_copy.json
do_train: true
eval_steps: 100
eval_strategy: steps
finetuning_type: full
flash_attn: auto
fp16: true
gradient_accumulation_steps: 2
include_num_input_tokens_seen: true
learning_rate: 1.0e-5
logging_steps: 1
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 100000
model_name_or_path: /home/goo/models/Qwen/Qwen2-7B-Instruct
num_train_epochs: 2.0
optim: adamw_torch
output_dir: saves/Qwen2-7B-Instruct/full/train_identify
overwrite_output_dir: true
packing: false
per_device_eval_batch_size: 8
per_device_train_batch_size: 8
plot_loss: true
preprocessing_num_workers: 8
report_to: none
save_steps: 100
stage: sft
template: qwen
trust_remote_code: true
val_size: 0.03
warmup_steps: 100
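
For reference, the effective global batch size implied by this YAML, assuming the 4 x 910A setup from the test steps (a back-of-the-envelope sketch; the NPU count is not part of the config file itself):

# Values taken from the YAML above; num_npus from "4 x 910A" in the test steps.
per_device_train_batch_size = 8
gradient_accumulation_steps = 2
num_npus = 4

global_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_npus
print(global_batch_size)  # 64 sequences of up to cutoff_len=2048 tokens per optimizer step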

DeepSpeed config file (a parse check follows it):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
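
A minimal sketch to sanity-check that this ZeRO-3 offload config loads as JSON; the "auto" fields are filled in from the Hugging Face TrainingArguments by transformers' DeepSpeed integration when training starts (the path is the one referenced by the deepspeed key in the YAML above):

import json

with open("/home/goo/project/train_config/ds_z3_offload_config_copy.json") as f:
    ds_cfg = json.load(f)

# ZeRO stage 3 with optimizer and parameter offload to CPU, per the file above.
assert ds_cfg["zero_optimization"]["stage"] == 3
print(ds_cfg["zero_optimization"]["offload_optimizer"]["device"])  # cpu
print(ds_cfg["zero_optimization"]["offload_param"]["device"])      # cpu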

Reproduction

  1. Command used: llamafactory-cli train example.yaml (a scripted version of this command follows the attached log below)
  2. Error log:

output.log
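
The same reproduction, wrapped in Python for scripting (a hedged sketch; example.yaml is the placeholder name from the command above and stands for the training YAML shown earlier):

import subprocess

# Runs the exact command from the reproduction step; expects llamafactory-cli
# to be on PATH inside the same conda environment used for training.
subprocess.run(["llamafactory-cli", "train", "example.yaml"], check=True)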

Others

No response

Goo-goo-goo added the bug and pending labels on Feb 20, 2025
github-actions bot added the npu label on Feb 20, 2025