
[Feature] Pytorch #233

Open

eddiebergman opened this issue Jan 24, 2024 · 2 comments
Labels: feature (A new feature)

eddiebergman (Contributor) commented Jan 24, 2024
This issue will serve as a log of the PyTorch integration progress in AMLTK. Please feel free to chime in with any information, suggestions, or solutions to problems.

eddiebergman (Contributor, Author) commented Jan 24, 2024

The first step of the PyTorch integration is to make it work with a simple MLP with one hidden layer. This is trivial if you write a `MyNet` class that implements it, but that's not what the amltk pipelines are for. We'd rather define it like so:

```python
pipeline = Sequential(
    nn.Flatten(start_dim=1),
    Component(nn.Linear, config={"in_features": 784, "out_features": 20}, name="fc1"),
    nn.ReLU,
    Component(nn.Linear, config={"in_features": 20, "out_features": 10}, name="fc2"),
    Component(nn.LogSoftmax, config={"dim": 1}),
    name="my-mlp-pipeline",
)
```

The first challenge is to somehow define the search space in the pipeline, so that the hidden size 20 can range over something like (10, 30). The main issue is:

  • The input and output features are tied together, i.e. they are the same parameter defined in two places. How can we tie these together? Can we use the request functionality to make this work? Typically we've just defined the search space on the component it parametrizes (see the sketch below).
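
One hedged possibility, sketched below (this is not an existing AMLTK feature, just an illustration): declare only `out_features` on each component, e.g. `Component(nn.Linear, space={"out_features": (10, 30)}, name="fc1")`, and let the builder thread each layer's output dimension into the next layer's `in_features`:

```python
# Hypothetical builder-side fix: "in_features" is never configured twice;
# it is threaded from the previous layer's "out_features" while building.
from collections import OrderedDict

from torch import nn


def build_mlp(specs, in_features: int) -> nn.Sequential:
    layers = OrderedDict()
    for name, cls, config in specs:
        if cls is nn.Linear:
            config = {"in_features": in_features, **config}
            in_features = config["out_features"]
        layers[name] = cls(**config)
    return nn.Sequential(layers)


# The `20` here is the value a search space would replace, e.g. with (10, 30).
model = build_mlp(
    [
        ("flatten", nn.Flatten, {"start_dim": 1}),
        ("fc1", nn.Linear, {"out_features": 20}),
        ("relu1", nn.ReLU, {}),
        ("fc2", nn.Linear, {"out_features": 10}),
        ("sftmax", nn.LogSoftmax, {"dim": 1}),
    ],
    in_features=784,
)
```

This sidesteps the tying problem by never configuring `in_features` at all; whether the request functionality can express the same thing is the open question above.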
Problem script: the following can be used to try to solve the problem.
```python
# Check the `main()` function to get started and follow it through.
# Note that performance is irrelevant for now.
# Most of my pytorch example code is taken from here:
# https://github.com/pytorch/examples/blob/main/mnist/main.py
from __future__ import annotations

from collections import OrderedDict
from typing import TYPE_CHECKING

import torch
import torch.nn.functional as F  # noqa: N812
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR
from torchvision import datasets, transforms

from amltk import Component, Metric, Sequential

# Change this to optuna if you prefer
# -- from amltk.optimization.optimizers.optuna import OptunaParser
from amltk.optimization.optimizers.smac import SMACOptimizer

if TYPE_CHECKING:
    from amltk import Node, Trial

# This is a nice import :)
from rich import print


# NOTE: This is the reference model, slowly try to build up to this
# but make it parametrizable.
# The goal would be that users don't define this class (maybe?)
# but they can define it using the pipeline structure.
# We can handle that later, for now, the pipeline definition below should be enough.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


# Just taken from the pytorch example
def test(
    model: nn.Module,
    device: torch.device,
    test_loader: torch.utils.data.DataLoader,
) -> tuple[float, float]:
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction="sum").item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    accuracy = 100.0 * correct / len(test_loader.dataset)
    return float(test_loss), float(accuracy)


# NOTE: The idea for this would be to integrate a general enough builder
# into AMLTK that can take a pipeline and build a nn.Module out of it.
def some_custom_building_function(pipeline: Node) -> nn.Module:
    # TODO: This somehow has to go from a configured pipeline to a nn.Module
    # Take a look at the amltk.pipeline.builders.sklearn to see how this is done
    # for sklearn.
    #
    # https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html
    print(pipeline)
    print("Should use the configs and names from above, not be hardcoded here")

    # TODO: The main difficulty here will be to figure out how to build
    # this correctly given the pipeline configs and the `item` in the pipeline.
    # This means you shouldn't manually place in things like `nn.Flatten`;
    # they're already defined in the `main()` function below.
    # Specifically, matching input and output dimensions properly, without
    # knowledge ahead of time of what the pipeline should be.
    model = nn.Sequential(
        OrderedDict(
            [
                ("flatten", nn.Flatten()),
                ("fc1", nn.Linear(in_features=784, out_features=20)),
                ("relu1", nn.ReLU()),
                ("fc2", nn.Linear(in_features=20, out_features=10)),
                ("sftmax", nn.LogSoftmax(dim=1)),
            ],
        ),
    )

    print(model)
    return model


def eval_configuration(
    trial: Trial,
    pipeline: Node,
    device: str = "cpu",  # Change if you have a GPU
    epochs: int = 1,  # Fixed for now
    lr: float = 0.1,  # Fixed for now
    gamma: float = 0.7,  # Fixed for now
    batch_size: int = 64,  # Fixed for now
    log_interval: int = 10,  # Fixed for now
) -> Trial.Report:
    trial.store({"config.json": pipeline.config})
    # TODO: I don't know if this is good enough for seeding, or whether it
    # works across processes for torch. At least with sklearn you can pass
    # around a RandomState, but torch has no such thing.
    torch.manual_seed(trial.seed)
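    # NOTE: one process-local alternative (a sketch, untested here): seed a
    # torch.Generator and pass it to the DataLoaders below via `generator=`,
    # instead of mutating global RNG state:
    #
    #   g = torch.Generator().manual_seed(trial.seed)
    #   torch.utils.data.DataLoader(..., shuffle=True, generator=g)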

    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST(
            "../data",
            train=True,
            download=True,
            transform=transforms.Compose(
                [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))],
            ),
        ),
        batch_size=batch_size,
        shuffle=True,
    )
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST(
            "../data",
            train=False,
            download=True,
            transform=transforms.Compose(
                [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))],
            ),
        ),
        batch_size=batch_size,
        shuffle=True,
    )

    _device = torch.device(device)
    model = (
        pipeline
        .configure(trial.config)
        .build(builder=some_custom_building_function)  # TODO: This part is where difficulty lies
        .to(_device)
    )
    print(model)

    with trial.begin():
        # I feel like the optimizer and lr_scheduler should somehow also
        # be part of the pipeline that's gotten when calling build
        optimizer = optim.Adadelta(model.parameters(), lr=lr)
        lr_scheduler = StepLR(optimizer, step_size=1, gamma=gamma)
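        # One hedged idea for that (untested, not AMLTK API): make the
        # optimizer a Component too, e.g.
        #   Component(optim.Adadelta, space={"lr": (0.01, 1.0)})
        # so that `build()` could also hand back an optimizer factory.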

        # Just a de facto torch training loop
        for epoch in range(epochs):
            for batch_idx, (data, target) in enumerate(train_loader):
                optimizer.zero_grad()
                data, target = data.to(_device), target.to(_device)

                output = model(data)
                loss = F.nll_loss(output, target)

                loss.backward()
                optimizer.step()

                if batch_idx % log_interval == 0:
                    # Might want to store these things in the summary, see below
                    print(
                        f"Train Epoch: {epoch} "
                        f"[{batch_idx * len(data)}/{len(train_loader.dataset)} "
                        f"({100.0 * batch_idx / len(train_loader):.0f}%)]\t"
                        f"Loss: {loss.item():.6f}",
                    )

            # Step the LR schedule once per epoch, as in the referenced example
            lr_scheduler.step()

    if trial.exception:
        return trial.fail()

    final_train_loss, final_train_acc = test(model, _device, train_loader)
    final_test_loss, final_test_acc = test(model, _device, test_loader)
    trial.summary["final_test_loss"] = final_test_loss
    trial.summary["final_test_accuracy"] = final_test_acc
    trial.summary["final_train_loss"] = final_train_loss
    trial.summary["final_train_accuracy"] = final_train_acc

    # TODO: We might also want to be able do this inside the training loop,
    # during the batch_idx % log_interval == 0 block.
    # However we would then have to store it as
    #
    #   trial.summary["epoch_{epoch}:batch_{batch_idx}:loss"] = batch_loss
    #   trial.summary["epoch_{epoch}:batch_{batch_idx}:acc"] = batch_accuracy
    #
    # This is not ideal because getting a curve out of this wouldn't work well.
    # It could be possible to do
    #
    # At start,
    #
    #   trial.summary["blahhhh"] = {"loss": [], "acc": []}
    #
    # and then during the loop
    #
    #   trial.summary["blahhhh"]["loss"].append(batch_loss)
    #   trial.summary["blahhhh"]["acc"].append(batch_acc)

    # We need a custom PathLoader to know how to store
    # a .pt file?
    # trial.store({"model.pt": model.state_dict()})
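    # Until such a loader exists, a plain-torch fallback is possible
    # (a sketch; the filename here is made up for illustration):
    #
    #   torch.save(model.state_dict(), f"{trial.name}-model.pt")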

    # Ideally we should have a validation set for doing proper HPO
    # setup but we'll just use the test accuracy
    return trial.success(accuracy=final_test_acc)


def main() -> None:
    # Training settings
    torch.device("cpu")

    # Download the dataset
    datasets.MNIST("../data", train=True, download=True)
    datasets.MNIST("../data", train=False, download=True)

    # TODO: The goal here will be to somehow setup a search space where
    # we can search over this `20` number, let's say from `10` to `30`.
    # If you find this impossible to do, please write up how you'd like to express it instead
    # and we will go from there.
    pipeline = Sequential(
        nn.Flatten(start_dim=1),  # <- Will be a `Fixed` because it's an instantiated object
        Component(nn.Linear, config={"in_features": 784, "out_features": 20}, name="fc1"),
        nn.ReLU,  # <- Will be a `Component` because it's a class
        Component(nn.Linear, config={"in_features": 20, "out_features": 10}, name="fc2"),
        Component(nn.LogSoftmax, config={"dim": 1}),
        name="my-mlp-pipeline",
    )
    # NOTE: I don't particularly like that you have to wrap F.relu in a `Fixed`.
    # * Fixed - Something that is Fixed and doesn't need to be initialized
    # * Component - Something that needs to be initialized with a config
    #
    # The problem is that right now, if we detect a function, we assume it constructs
    # something to use in a pipeline, not that we should use the function directly.

    # FYI: The `Metric` class is so you don't have to worry about giving
    # the correct thing to the optimizer; the Metric class takes care of
    # normalizing and returning a number the optimizer should optimize.
    # Some optimizers always minimize, some can allow you to choose, some
    # work better with normalized values, etc.
    metric = Metric("accuracy", minimize=False, bounds=(0, 1))
    optimizer = SMACOptimizer.create(
        space=pipeline,
        metrics=metric,
        seed=1,
        bucket="pytorch-experiments",
    )

    # We won't use the Scheduler here as it's not needed to make
    # this example work. We'll just use one trial for now.
    trial = optimizer.ask()
    report = eval_configuration(trial, pipeline)
    print(report)


if __name__ == "__main__":
    main()
```
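
For reference, a hedged sketch of what a generalized `some_custom_building_function` could look like. The traversal assumes a configured `Sequential` exposes its children via `.nodes`, each with `.name`, `.item`, and `.config`; these attribute names are my reading of the pipeline definition above, not confirmed AMLTK API:

```python
from collections import OrderedDict

from torch import nn


def build_torch_sequential(pipeline) -> nn.Sequential:
    """Sketch: build an nn.Sequential from a configured AMLTK pipeline.

    Assumes `pipeline.nodes` yields children whose `.item` is either an
    nn.Module instance (a `Fixed`) or an nn.Module subclass (a `Component`).
    """
    modules = OrderedDict()
    for node in pipeline.nodes:
        item, config = node.item, node.config or {}
        if isinstance(item, nn.Module):
            # Already instantiated, e.g. nn.Flatten(start_dim=1)
            modules[node.name] = item
        elif isinstance(item, type) and issubclass(item, nn.Module):
            # A class plus its config, e.g. nn.Linear with in/out features
            modules[node.name] = item(**config)
        else:
            raise ValueError(f"Cannot build a module from {item!r}")
    return nn.Sequential(modules)
```

Matching input and output dimensions would still need the threading trick sketched earlier, or a request-style mechanism applied at configure time.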

eddiebergman (Contributor, Author) commented Apr 16, 2024

The basic requirements of the previous features are mostly implemented, aside from Join and Split, which I will work on soon.

In the meantime, the next steps will be to take the ResNet model family from PyTorch and do the following:

  • Be able to fully define and parametrize a full ResNet3 such that the model fits in a single script, ideally a single object.
  • Perform Bayesian Optimization on this pipeline, training from scratch for a single epoch on a single node.
  • Perform Bayesian Optimization on a frozen ResNet18 (linked above) and only tune the hyperparameters of the last 1-2 layers, e.g. Conv + Linear or Linear + Linear (see the sketch after this list).
  • Test if this works with the dask-jobqueue scheduler when linking to GPU nodes on a SLURM cluster.
  • (?) Test if this works in a multi-node/parallelized model variant.
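
A minimal sketch of the frozen-ResNet18 step, assuming torchvision's pretrained-weights API (`weights="IMAGENET1K_V1"`, torchvision >= 0.13) and a hypothetical 10-class head:

```python
import torch.nn as nn
from torchvision.models import resnet18

# Load a pretrained backbone and freeze all of its parameters.
model = resnet18(weights="IMAGENET1K_V1")
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; only this layer remains trainable,
# so its hyperparameters are the ones left to search over.
model.fc = nn.Linear(model.fc.in_features, 10)

trainable = [p for p in model.parameters() if p.requires_grad]
print(f"Trainable parameter tensors: {len(trainable)}")  # model.fc's weight and bias
```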
