
train is ok but inference maybe has some bug? #28

Open
jiachen0212 opened this issue Mar 29, 2024 · 9 comments

Comments


jiachen0212 commented Mar 29, 2024

Hello, I have a question. I used my own data to create a ScanNet-format dataset, and the trained model got the following results, which seem normal (the train and val data are the same, so these values look good); the per-category metrics also look normal.

[screenshot: validation metrics]

However, when using the trained model for inference, I found that the detected box categories were all 0 and the orientation of the boxes was wrong... The config I used is as follows:

[screenshots: config excerpts]

test_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5]),
    dict(type='GlobalAlignment', rotation_axis=2),
    dict(
        type='MultiScaleFlipAug3D',
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(type='NormalizePointsColor', color_mean=None),
            dict(
                type='DefaultFormatBundle3D',
                class_names=class_names,
                with_label=False),
            dict(type='Collect3D', keys=['points'])
        ])
]

Do you know where the problem might occur?


filaPro commented Mar 29, 2024

Can you share the full config .py file? Also, to be sure: train and validation are ok, but test is not ok?


jiachen0212 commented Mar 29, 2024

> Can you share the full config .py file? Also, to be sure: train and validation are ok, but test is not ok?

Wow, thank you for replying so quickly. My config is as follows:

voxel_size = .01
n_points = 100000

model = dict(
    type='MinkSingleStage3DDetector',
    voxel_size=voxel_size,
    backbone=dict(type='MinkResNet', in_channels=3, max_channels=128, depth=34, norm='batch'),
    neck=dict(
        type='TR3DNeck',
        in_channels=(64, 128, 128, 128),
        out_channels=128),
    head=dict(
        type='TR3DHead',
        in_channels=128,
        n_reg_outs=6,
        n_classes=18,   # change this according to your classes
        voxel_size=voxel_size,
        assigner=dict(
            type='TR3DAssigner',
            top_pts_threshold=6,
            label2level=[0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0]),    
        bbox_loss=dict(type='AxisAlignedIoULoss', mode='diou', reduction='none')),
    train_cfg=dict(),
    test_cfg=dict(nms_pre=1000, iou_thr=.5, score_thr=.01))

optimizer = dict(type='AdamW', lr=.001, weight_decay=.0001)
optimizer_config = dict(grad_clip=dict(max_norm=10, norm_type=2))
lr_config = dict(policy='step', warmup=None, step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=17)
custom_hooks = [dict(type='EmptyCacheHook', after_iter=True)]

checkpoint_config = dict(interval=1, max_keep_ckpts=1)
log_config = dict(
    interval=50,
    hooks=[
        dict(type='TextLoggerHook'),
        # dict(type='TensorboardLoggerHook')
])
dist_params = dict(backend='nccl')
log_level = 'INFO'
work_dir = None
load_from = None
resume_from = None
workflow = [('train', 1)]

dataset_type = 'ScanNetDataset'
data_root = './data/scannet/'
class_names = ('cabinet', 'bed', 'chair', 'sofa', 'table', 'door', 'window',
               'bookshelf', 'picture', 'counter', 'desk', 'curtain',
               'refrigerator', 'showercurtrain', 'toilet', 'sink', 'bathtub',
               'garbagebin')

train_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5]),
    dict(type='LoadAnnotations3D'),
    dict(type='GlobalAlignment', rotation_axis=2),
    # we do not sample 100k points for scannet, as very few scenes have
    # significantly more than 100k points, so we sample 33 to 100% of them
    dict(type='PointSample', num_points=.33),
    dict(
        type='RandomFlip3D',
        sync_2d=False,
        flip_ratio_bev_horizontal=.5,
        flip_ratio_bev_vertical=.5),
    dict(
        type='GlobalRotScaleTrans',
        rot_range=[-.02, .02],
        scale_ratio_range=[.9, 1.1],
        translation_std=[.1, .1, .1],
        shift_height=False),
    dict(type='NormalizePointsColor', color_mean=None),
    dict(type='DefaultFormatBundle3D', class_names=class_names),
    dict(type='Collect3D', keys=['points', 'gt_bboxes_3d', 'gt_labels_3d'])
]
test_pipeline = [
    dict(
        type='LoadPointsFromFile',
        coord_type='DEPTH',
        shift_height=False,
        use_color=True,
        load_dim=6,
        use_dim=[0, 1, 2, 3, 4, 5]),
    dict(type='GlobalAlignment', rotation_axis=2),
    dict(
        type='MultiScaleFlipAug3D',
        img_scale=(1333, 800),
        pts_scale_ratio=1,
        flip=False,
        transforms=[
            dict(type='NormalizePointsColor', color_mean=None),
            dict(
                type='DefaultFormatBundle3D',
                class_names=class_names,
                with_label=False),   
            dict(type='Collect3D', keys=['points'])
        ])
]
data = dict(
    samples_per_gpu=16,
    workers_per_gpu=4,
    train=dict(
        type='RepeatDataset',
        times=15,
        dataset=dict(
            type=dataset_type,
            data_root=data_root,
            ann_file=data_root + 'scannet_infos_train.pkl',
            pipeline=train_pipeline,
            filter_empty_gt=False,
            classes=class_names,
            box_type_3d='Depth')),
    val=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file=data_root + 'scannet_infos_train.pkl',
        pipeline=test_pipeline,
        classes=class_names,
        test_mode=True,
        box_type_3d='Depth'),
    test=dict(
        type=dataset_type,
        data_root=data_root,
        ann_file=data_root + 'scannet_infos_train.pkl',   
        pipeline=test_pipeline,
        classes=class_names,
        test_mode=True,
        box_type_3d='Depth'))

yes, "train and validation is ok, but test is not ok" . i run tools/train.py tools/test.py can get right thing~ but,
python demo/pcd_demo.py xxx.bin configs/tr3d/tr3d_scannet-3d-18class.py work_dirs/tr3d_scannet-3d-18class/epoch_17.pth get wrong result. The angle and category of the box are wrong~~


filaPro commented Mar 29, 2024

Ah, this pcd_demo.py thing doesn't work, I think. You need to debug it a little to be sure that everything is exactly like in the test.py script. One of the main things is dict(type='GlobalAlignment', rotation_axis=2); I think you are missing it in pcd_demo.py, so the walls are rotated away from the x and y axes.
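For reference, here is a minimal sketch of aligning the raw points yourself before calling the demo, assuming you can fetch the scene's axis_align_matrix from your generated scannet_infos_*.pkl (the GlobalAlignment transform reads that matrix from ann_info, which pcd_demo.py never loads); the helper name align_points is purely illustrative:

import numpy as np

def align_points(points, axis_align_matrix):
    # points: (N, 6) float32 array (xyz + rgb), as stored in the .bin file
    # axis_align_matrix: (4, 4) matrix taken from the scene's info entry
    xyz_h = np.hstack([points[:, :3], np.ones((points.shape[0], 1))])
    points[:, :3] = (xyz_h @ axis_align_matrix.T)[:, :3]  # rotate + translate
    return points

# usage sketch: align, save a new .bin, then run pcd_demo.py on that file
points = np.fromfile('xxx.bin', dtype=np.float32).reshape(-1, 6)
# axis_align_matrix = ...  # load from scannet_infos_train.pkl for this scene
# align_points(points, axis_align_matrix).astype(np.float32).tofile('xxx_aligned.bin')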

@jiachen0212

Thank you very much for your reply~
I did some debugging following your tips and found that GlobalAlignment(rotation_axis=2) is indeed used in pcd_demo. But the direction of the boxes is still not right. As for the box categories, I found the answer: they are stored under the 'labels_3d' key of the result. So I am still confused: how can I get box detection visualizations with the correct direction...~

[screenshot: debugging pcd_demo]

Detection results of boxes whose directions are not aligned:

[screenshot: detection boxes with misaligned directions]


filaPro commented Apr 1, 2024

Btw I'm a little confused about rotation. As you use ScanNetDataset and n_reg_outs=6 in your config, we don't even predict rotation in this case. So all rotations are zeros, and both boxes and walls are parallel to the x and y axes.
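To illustrate the point (a rough sketch of the idea, not the repo's exact code): with n_reg_outs=6 the head regresses only six distances from each point to the box faces, so decoding cannot produce anything but axis-aligned boxes:

import torch

def pred_to_bbox(points, bbox_pred):
    # points: (N, 3) locations; bbox_pred: (N, 6) distances to the six
    # box faces (x_min, x_max, y_min, y_max, z_min, z_max) -- no angle
    centers = points + torch.stack([
        (bbox_pred[:, 1] - bbox_pred[:, 0]) / 2,
        (bbox_pred[:, 3] - bbox_pred[:, 2]) / 2,
        (bbox_pred[:, 5] - bbox_pred[:, 4]) / 2], dim=-1)
    sizes = bbox_pred[:, 0::2] + bbox_pred[:, 1::2]   # (dx, dy, dz)
    yaw = bbox_pred.new_zeros(bbox_pred.shape[0], 1)  # always zero here
    return torch.cat([centers, sizes, yaw], dim=-1)   # (N, 7)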

@jiachen0212

> Btw I'm a little confused about rotation. As you use ScanNetDataset and n_reg_outs=6 in your config, we don't even predict rotation in this case. So all rotations are zeros, and both boxes and walls are parallel to the x and y axes.

Hmm, I think I understand. Thank you very much for your replies. Maybe it's a problem with my annotated data: I use my own annotated data and then convert it into the ScanNet data format. I'll debug it again, thank you~
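If the custom annotations contain rotated boxes, one hedged option when converting to the ScanNet format (an illustrative helper, not repo code) is to bake the yaw away by taking the axis-aligned hull of the rotated footprint; note this inflates tight rotated boxes, so a SUN RGB-D-style config that does regress yaw may fit such data better:

import numpy as np

def rotated_to_axis_aligned(center, size, yaw):
    # fold (x, y, z, dx, dy, dz, yaw) into the axis-aligned
    # (x, y, z, dx, dy, dz) form the ScanNet config expects
    dx, dy, dz = size
    corners = np.array([[dx, dy], [dx, -dy], [-dx, dy], [-dx, -dy]]) / 2
    rot = np.array([[np.cos(yaw), -np.sin(yaw)],
                    [np.sin(yaw),  np.cos(yaw)]])
    rotated = corners @ rot.T                       # rotate the footprint
    new_dx, new_dy = rotated.max(0) - rotated.min(0)
    return np.array([*center, new_dx, new_dy, dz])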

@jiachen0212

I made some changes to mmdet3d/core/visualizer/open3d_vis.py, and the visualization looks better~

    in_box_color = np.array(points_in_box_color)
    for i in range(len(bbox3d)):
        center = bbox3d[i, 0:3]
        dim = bbox3d[i, 3:6]
        yaw = np.zeros(3)
        # yaw[rot_axis] = bbox3d[i, 6]  # coupling bug...
        yaw[rot_axis] = math.pi / 8  # manual modification
        rot_mat = geometry.get_rotation_matrix_from_xyz(yaw)
        print(rot_mat)
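A more principled fix than the hardcoded math.pi / 8: since this config never predicts yaw, the misalignment presumably comes from drawing unaligned points against implicitly aligned boxes; applying the scene's axis_align_matrix to the points before visualization, as in the sketch above, should make walls and boxes parallel without manual angle tweaks.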

[screenshot: visualization after the manual yaw change]

@MRCHENWJ

Hello, I created my own dataset following the format of the S3DIS dataset and tried to train the network on it. However, I ran into a CUDA out-of-memory error even after switching to a GPU with 32 GB of VRAM. How can I solve this?

2024-06-11 11:19:47,852 - mmdet - INFO - Checkpoints will be saved to /root/autodl-tmp/tr3d-main/work_dirs/tr3d_s3dis-3d-5class by HardDiskBackend.
Traceback (most recent call last):
  File "tools/train.py", line 263, in <module>
    main()
  File "tools/train.py", line 252, in main
    train_model(
  File "/root/autodl-tmp/tr3d-main/mmdet3d/apis/train.py", line 344, in train_model
    train_detector(
  File "/root/autodl-tmp/tr3d-main/mmdet3d/apis/train.py", line 319, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/root/miniconda3/lib/python3.8/site-packages/mmcv/parallel/data_parallel.py", line 77, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/miniconda3/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/root/autodl-tmp/tr3d-main/mmdet3d/models/detectors/base.py", line 60, in forward
    return self.forward_train(**kwargs)
  File "/root/autodl-tmp/tr3d-main/mmdet3d/models/detectors/mink_single_stage.py", line 86, in forward_train
    x = self.extract_feats(points)
  File "/root/autodl-tmp/tr3d-main/mmdet3d/models/detectors/mink_single_stage.py", line 70, in extract_feats
    x = self.neck(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/autodl-tmp/tr3d-main/mmdet3d/models/necks/tr3d_neck.py", line 53, in forward
    x = inputs[i] + x
  File "/root/miniconda3/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiTensor.py", line 556, in __add__
    return self._binary_functor(other, lambda x, y: x + y)
  File "/root/miniconda3/lib/python3.8/site-packages/MinkowskiEngine/MinkowskiTensor.py", line 531, in _binary_functor
    out_F = torch.zeros(
RuntimeError: CUDA out of memory. Tried to allocate 2.26 GiB (GPU 0; 15.74 GiB total capacity; 13.21 GiB already allocated; 471.56 MiB free; 13.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF


filaPro commented Jun 11, 2024

Hard to say what is wrong with your dataset, as you don't give many details. I recommend tuning voxel_size, n_points, and samples_per_gpu in the config file.
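As a hedged starting point (untested values, to be edited into the existing config; note the traceback reports a 15.74 GiB device, so the run may not actually be landing on the 32 GB card):

voxel_size = .02   # coarser voxels -> far fewer sparse locations in the net
n_points = 50000   # sample fewer points per scene

# inside the existing data = dict(...) block:
#     samples_per_gpu=4,   # cut the batch size (the ScanNet config above uses 16)

# the error message's own hint can also help with fragmentation:
# set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the environment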
