Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post-SMTB cleanup #31

Open
wants to merge 24 commits into
base: main
Choose a base branch
from
Open

Post-SMTB cleanup #31

wants to merge 24 commits into from

Conversation

ilsenatorov
Copy link
Collaborator

Most important changes:

  • Added unit testing with pytest
  • Added github action to run the tests
  • Cleaned up redundant code
  • Rearranged some files here and there
  • Improved config/training logic for finetuning

Copy link
Member

@Old-Shatterhand Old-Shatterhand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applies to all files: Comment the code. It will be a project with >5 people working on and maintaining it, we need to set some documentation standards to not get completely lost before christmas.

smtb/model.py Outdated Show resolved Hide resolved
.github/workflows/run_test.yaml Outdated Show resolved Hide resolved
smtb/tests/test_data.py Outdated Show resolved Hide resolved
smtb/tests/test_finetune.py Outdated Show resolved Hide resolved
@ilsenatorov
Copy link
Collaborator Author

Applies to all files: Comment the code. It will be a project with >5 people working on and maintaining it, we need to set some documentation standards to not get completely lost before christmas.

I would suggest comments/docstring on code that is not self-explanatory. For example:

def setup(self, stage: str | None = None):
        """Create train, val, test datasets."""
        self.train = DownstreamDataset(self.data_dir / "train", self.layer_num)
        self.valid = DownstreamDataset(self.data_dir / "valid", self.layer_num)
        self.test = DownstreamDataset(self.data_dir / "test", self.layer_num)

    def _get_dataloader(self, dataset: DownstreamDataset, shuffle: bool = False) -> torch.utils.data.DataLoader:
        """Create a DataLoader for a given dataset."""
        return DataLoader(
            dataset, batch_size=self.batch_size, num_workers=self.num_workers, shuffle=shuffle, collate_fn=collate_fn
        )

    def train_dataloader(self) -> DataLoader:
        return self._get_dataloader(self.train, shuffle=True)

The first 2 functions are not super obvious, so some comments are necessary.

The last one is pretty self-explanatory and requires no documentation (IMO).

But since we are in a phase of active development, I wouldn't enforce huge docstrings on every function. Type hints and good variable names should be enough for 95% of the code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we going to add those classifications as models?

smtb/pooling.py Outdated
queries = self.linear_query(x)
attention_scores = torch.matmul(queries, keys.transpose(-2, -1))
attention_weights = self.softmax(attention_scores)
pooled_output = torch.matmul(attention_weights, x).sum(dim=1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add normalization here?

@sdkaraban sdkaraban self-requested a review August 24, 2024 11:12
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we just testing for 3.11? Maybe add other versions like 3.10 or eeven 3.9.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to this comment. I'd test all Python versions that are actively developed, such as 3.12. But if we fix versions in the requirements, I don't see why we should test multiple versions. The installed packages will have the same versions across different tested Python versions.

requirements.txt Outdated
Comment on lines 1 to 13
torch
pytorch-lightning
lightning
torchmetrics
rich
fair-esm
jsonargparse
wandb
tokenizers
transformers
beartype
datasets
transformers[torch]
pytest
pytest-cov
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we start specifying versions to avoid breaking changes and use dependabot to update andreview version changes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, sure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be easily accomplished by installing requirements and then running pip freeze

setup.py Outdated
url="https://github.com/kalininalab/SMTB2024",
packages=["smtb"],
install_requires=requirements,
python_requires=">=3.11",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again is 3.11 maybe too high, maybe 3.10?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think, 3.11 is too high. I have another project with similar dependencies that is run on 3.11 and it works fine, so, I wouldn't expect any problems. But we can downgrade to 3.10 if you want.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe >=3.10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants