Errors for UNet3D application on distconv LBANN #2156

Open
JBae2 opened this issue Nov 17, 2022 · 2 comments

JBae2 commented Nov 17, 2022

Hello, I am trying to run the supported UNet3D application code from the LBANN GitHub repository, but it fails.

From the distconv environment and its related source code, it looks like input with the "labels" data_field is not supported yet. The source code also mentions that Distconv currently supports only CosmoFlow data.
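
The suspected limitation can be sketched as a small pre-flight check in plain Python, with no LBANN import. This is only an illustration under stated assumptions: `InputSpec`, `DISTCONV_FIELDS`, and `unsupported_fields` are hypothetical names that do not exist in LBANN, and the supported set reflects the reading of the source described above rather than any documented guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputSpec:
    """Hypothetical stand-in for an lbann.Input layer; not part of LBANN."""
    data_field: str


# Assumption taken from the report above: under distconv only the
# 'samples' field is handled. Illustrative, not from LBANN documentation.
DISTCONV_FIELDS = frozenset({"samples"})


def unsupported_fields(inputs, supported=DISTCONV_FIELDS):
    """Return the data_field names outside the supported set, in input order."""
    return [spec.data_field for spec in inputs if spec.data_field not in supported]


# Mirrors the two inputs used by the UNet3D script in this report.
inputs = [InputSpec("samples"), InputSpec("labels")]
print(unsupported_fields(inputs))
```

If the assumption holds, the check reports the `labels` input as the one that distconv would reject.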

Is it possible to run the unet3d application on LBANN, or am I missing something? If you know anything about this, please advise.

This is the main function of my source code, which I modified from the unet3d example. The omitted functions are the same as in the original. Thank you.

if __name__ == '__main__':
    # Adjacent string literals are concatenated, so keep a trailing space.
    desc = ('Construct and run the 3D U-Net on a 3D segmentation dataset. '
            'Running the experiment is only supported on LC systems.')
    parser = argparse.ArgumentParser(description=desc)
    lbann.contrib.args.add_scheduler_arguments(parser)

    # (parser.add_argument section omitted)

    lbann.contrib.args.add_optimizer_arguments(
        parser,
        default_optimizer="adam",
        default_learning_rate=0.001,
    )

    args = parser.parse_args()
    args.procs_per_node=4

    parallel_strategy = get_parallel_strategy_args(
        sample_groups=args.mini_batch_size,
        depth_groups=args.depth_groups)

    # Construct layer graph
    volume = lbann.Input(data_field='samples')
    segmentation = lbann.Input(data_field='labels')

    output = UNet3D()(volume)

    ce = lbann.CrossEntropy([output, segmentation])
    layers = list(lbann.traverse_layer_graph([volume, segmentation]))

    obj = lbann.ObjectiveFunction([ce])

    for l in layers:
        l.parallel_strategy = parallel_strategy

    # Setup model
    metrics = [lbann.Metric(ce, name='CE', unit='')]
    callbacks = [lbann.CallbackPrint(),
        lbann.CallbackTimer(),
        lbann.CallbackGPUMemoryUsage(),
        lbann.CallbackProfiler(skip_init=True),
    ]
    # # TODO: Use polynomial learning rate decay (https://github.com/LLNL/lbann/issues/1581)
    # callbacks.append(
    #     lbann.CallbackPolyLearningRate(
    #         power=1.0,
    #         num_epochs=100,
    #         end_lr=1e-5))
    model = lbann.Model(epochs=args.num_epochs,
        layers=layers,
        objective_function=obj,
        callbacks=callbacks,
    )

    # Setup optimizer
    optimizer = lbann.contrib.args.create_optimizer(args)

    # Setup data reader
    data_reader = create_unet3d_data_reader(
        train_dir=args.train_dir,
        test_dir=args.test_dir)

    # Setup trainer
    trainer = lbann.Trainer(mini_batch_size=args.mini_batch_size)

    # Runtime parameters/arguments
    environment = lbann.contrib.args.get_distconv_environment(
        num_io_partitions=args.depth_groups)
    if args.dynamically_reclaim_error_signals:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 0
    else:
        environment['LBANN_KEEP_ERROR_SIGNALS'] = 1
    lbann_args = ['--use_data_store']

    # Run experiment
    kwargs = lbann.contrib.args.get_scheduler_kwargs(args)
    lbann.contrib.launcher.run(
        trainer, model, data_reader, optimizer,
        job_name=args.job_name,
        environment=environment,
        lbann_args=lbann_args,
        batch_job=args.batch_job,
        **kwargs)
@bvanessen
Collaborator

@JBae2 There is a bug in the current UNet3D model: the Python representation of the model has drifted from some of the internal changes that have occurred in LBANN. This issue is currently being worked on in PR #2151, but the fix is not yet complete.

Collaborator

benson31 commented Feb 1, 2023

@bvanessen Can this be closed as #2151 is now merged?
