Errors for UNet3D application on distconv LBANN #2156

JBae2 · 2022-11-17T00:13:07Z

Hello, I am trying to run the supported UNet3D application code from the LBANN GitHub repository, but it fails.

In the distconv environment and its related source code, it looks like input with the "labels" data_field is not supported yet. The source code also mentions that Distconv currently only supports CosmoFlow data.

Is it possible to run the UNet3D application on LBANN, or am I missing something? If you know anything about this, please advise.

This is the main function of my source code, which I modified from the unet3d example. The omitted functions are the same as in the original. Thank you.

if __name__ == '__main__':
    desc = ('Construct and run the 3D U-Net on a 3D segmentation dataset. '
            'Running the experiment is only supported on LC systems.')
    parser = argparse.ArgumentParser(description=desc)
    lbann.contrib.args.add_scheduler_arguments(parser)

    # (parser.add_argument section omitted)

    lbann.contrib.args.add_optimizer_arguments(
        parser,
        default_optimizer="adam",
        default_learning_rate=0.001,
    )

    args = parser.parse_args()
    args.procs_per_node = 4

    parallel_strategy = get_parallel_strategy_args(
        sample_groups=args.mini_batch_size,
        depth_groups=args.depth_groups)

    # Construct layer graph
    volume = lbann.Input(data_field='samples')
    segmentation = lbann.Input(data_field='labels')

    output = UNet3D()(volume)

    ce = lbann.CrossEntropy([output, segmentation])
    layers = list(lbann.traverse_layer_graph([volume, segmentation]))

    obj = lbann.ObjectiveFunction([ce])

    for l in layers:
        l.parallel_strategy = parallel_strategy

    # Setup model
    metrics = [lbann.Metric(ce, name='CE', unit='')]
    callbacks = [lbann.CallbackPrint(),
        lbann.CallbackTimer(),
        lbann.CallbackGPUMemoryUsage(),
        lbann.CallbackProfiler(skip_init=True),
    ]
    # # TODO: Use polynomial learning rate decay (https://github.com/LLNL/lbann/issues/1581)
    # callbacks.append(
    #     lbann.CallbackPolyLearningRate(
    #         power=1.0,
    #         num_epochs=100,
    #         end_lr=1e-5))
    model = lbann.Model(epochs=args.num_epochs,
        layers=layers,
        objective_function=obj,
        callbacks=callbacks,
    )

    # Setup optimizer
    optimizer = lbann.contrib.args.create_optimizer(args)

    # Setup data reader
    data_reader = create_unet3d_data_reader(
        train_dir=args.train_dir,
        test_dir=args.test_dir)

    # Setup trainer
    trainer = lbann.Trainer(mini_batch_size=args.mini_batch_size)

    # Runtime parameters/arguments
    environment = lbann.contrib.args.get_distconv_environment(
        num_io_partitions=args.depth_groups)
    if args.dynamically_reclaim_error_signals:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 0
    else:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 1
    lbann_args = ['--use_data_store']

    # Run experiment
    kwargs = lbann.contrib.args.get_scheduler_kwargs(args)
    lbann.contrib.launcher.run(
        trainer, model, data_reader, optimizer,
        job_name=args.job_name,
        environment=environment,
        lbann_args=lbann_args,
        batch_job=args.batch_job,
        **kwargs)
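
The error-signal toggle near the end of the script can be checked on its own, without LBANN installed. The following is only a minimal sketch: the helper name `with_error_signal_flag` is hypothetical, and a plain dict stands in for whatever `get_distconv_environment` returns (an assumption, not confirmed by the LBANN source).

```python
def with_error_signal_flag(environment, dynamically_reclaim):
    """Return a copy of `environment` with LBANN_KEEP_ERROR_SIGNALS set.

    Hypothetical helper: mirrors the if/else in the script above, where
    dynamically reclaiming error signals means not keeping them (0).
    """
    env = dict(environment)  # copy so the caller's dict is not mutated
    env['LBANN_KEEP_ERROR_SIGNALS'] = 0 if dynamically_reclaim else 1
    return env


# Assumption: a plain dict stands in for the distconv environment.
base_env = {'EXAMPLE_VAR': 1}
reclaim_env = with_error_signal_flag(base_env, dynamically_reclaim=True)
keep_env = with_error_signal_flag(base_env, dynamically_reclaim=False)
print(reclaim_env['LBANN_KEEP_ERROR_SIGNALS'],
      keep_env['LBANN_KEEP_ERROR_SIGNALS'])
```

Running this with both flag values is a quick way to confirm that the environment passed to the launcher carries the intended setting before submitting a job.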

bvanessen · 2022-11-28T17:52:37Z

@JBae2 There is a bug in the current UNet3D model: the Python representation of the model has drifted from some of the internal changes that have occurred in LBANN. This issue is currently being worked on in PR #2151, which is not yet complete.

benson31 · 2023-02-01T15:59:33Z

@bvanessen Can this be closed as #2151 is now merged?

benson31 assigned bvanessen and benson31 Nov 29, 2022

benson31 added the "bug" and "in progress" labels Nov 29, 2022
