The entire attention's _elapsed_time is repeatedly assigned to attention_column and attention_row #26

Open
dageita opened this issue Dec 27, 2024 · 0 comments

dageita commented Dec 27, 2024
While running AICB with customized parameters, I used the following command:

GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6001
NNODES=1
RANK=0
WORLD_SIZE=$((GPUS_PER_NODE*NNODES))
micro_batch=28
global_batch=1792
epoch_num=1
hidden_size=1024
ffn_hidden_size=4096
num_attention_heads=16
seq_len=1024
num_layers=24
vocab_size=50257
max_position_embeddings=8192
tensor_model_parallel_size=2
pipeline_model_parallel_size=2
dtype=float16


torchrun \
  --nnodes $NNODES \
  --node_rank $RANK \
  --nproc_per_node $GPUS_PER_NODE \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  ./aicb.py --frame=Megatron --dtype=$dtype \
  --world_size=$WORLD_SIZE --tensor_model_parallel_size=$tensor_model_parallel_size \
  --pipeline_model_parallel=$pipeline_model_parallel_size \
  --micro_batch=$micro_batch --global_batch=$global_batch --epoch_num=$epoch_num \
  --num_layers=$num_layers --hidden_size=$hidden_size --ffn_hidden_size=$ffn_hidden_size --num_attention_heads=$num_attention_heads \
  --seq_len=$seq_len \
  --max_position_embeddings=$max_position_embeddings --vocab_size=$vocab_size \
  --aiob_enable

However, I found that the benchmark's predicted time is about twice the actual model training time.
I think the problem is the following: according to the function def extract_averages(file_path, args):, compute_cache stores the time of the entire attention or mlp block. However, in the generated workload, each attention or mlp block is decomposed into two computation parts, for example:

CommType.computation,None,None,((1024, 28, 1024), (1024, 1536)),forward.MegatronColumnLinear.attention_column,None,None,,4157760,None,None,1
CommType.computation,None,None,((1024, 28, 512), (512, 1024)),forward.MegatronRowLinear.attention_row,None,None,,688725,None,None,1
CommType.all_reduce,CommGroup.tp_group,2,58720256,forward.MegatronRowLinear,None,None,,None,None,None,1
CommType.computation,None,None,((1024, 28, 1024), (1024, 2048)),forward.MegatronColumnLinear.mlp_column,None,None,,1549275,None,None,1
CommType.computation,None,None,((1024, 28, 2048), (2048, 1024)),forward.MegatronRowLinear.mlp_row,None,None,,1250304,None,None,1

The entire attention time is assigned to both the attention_column and attention_row items (and likewise for mlp), because the matching only compares the first token of each compute_cache key against the stage name, resulting in this issue:

def Comp_with_aiob(workload, compute_cache):
    for item in workload.workload:
        if item.comm_type == CommType.computation:
            for key in compute_cache:
                # Only the first "_"-separated token of the key is kept,
                # e.g. a key starting with "attention_" becomes "attention".
                key_temp = key.split("_")[0]
                # "attention" is a substring of both "...attention_column"
                # and "...attention_row", so both items receive the full
                # attention time from compute_cache.
                if key_temp in item.stage:
                    item._elapsed_time = compute_cache[key]
                    break
    return workload
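
To make the double-counting concrete, here is a minimal, self-contained sketch of the matching logic above, using the stage names from the generated workload. The compute_cache key name attention_avg_time and the timing value are hypothetical placeholders, not AICB's actual contents:

# Minimal sketch of the prefix matching in Comp_with_aiob.
# NOTE: "attention_avg_time" and the 4.2 value are made-up placeholders;
# only the stage strings come from the workload shown above.
compute_cache = {"attention_avg_time": 4.2}

stages = [
    "forward.MegatronColumnLinear.attention_column",
    "forward.MegatronRowLinear.attention_row",
]

for stage in stages:
    for key in compute_cache:
        key_temp = key.split("_")[0]  # -> "attention"
        if key_temp in stage:         # true for BOTH stages
            print(f"{stage} gets {compute_cache[key]}")
            break

# Output:
# forward.MegatronColumnLinear.attention_column gets 4.2
# forward.MegatronRowLinear.attention_row gets 4.2

Since both halves of every attention (and mlp) layer receive the whole block's time, the predicted total comes out roughly twice the measured one, which matches the observation above.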