KGDataset.from_dataframe + custom data notebook #25

Merged
merged 17 commits into from
Sep 28, 2023


AlCatt91
Collaborator

Added an ease-of-use method to build a KGDataset directly from labelled triples (organized in a pandas DataFrame). It performs the label -> ID conversion behind the scenes and, if needed, re-orders the entities so that those of the same type are clustered together. It is meant to be even higher-level than the from_triples method.
Using this, I also wrote a notebook, as we were planning to do, showing how to build a custom dataset, with the OpenBioLink dataset as the example (there are the usual concerns about the licensing of data sources, but I hope that using it only for this small demo notebook shouldn't be a problem?). Any feedback is welcome!
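
For readers skimming the thread, here is a minimal sketch of the kind of label -> ID conversion described above. It is not the code from this PR: the helper name `labelled_triples_to_ids`, its parameters, and the column names are hypothetical, and the actual `from_dataframe` signature may differ. It only assumes pandas.

```python
import pandas as pd


def labelled_triples_to_ids(df, head_col="head", rel_col="relation",
                            tail_col="tail", type_of=None):
    """Hypothetical helper (not the library API): map labels to integer IDs.

    If `type_of` (dict: entity label -> type label) is given, entities are
    re-ordered so that entities of the same type receive contiguous IDs.
    """
    # Unique entity labels, in order of first appearance (heads, then tails).
    entities = pd.unique(pd.concat([df[head_col], df[tail_col]], ignore_index=True))
    if type_of is not None:
        # Stable sort by type: clusters same-type entities together.
        entities = sorted(entities, key=lambda label: type_of[label])
    ent_to_id = {label: i for i, label in enumerate(entities)}
    rel_to_id = {label: i for i, label in enumerate(pd.unique(df[rel_col]))}
    # Integer (head, relation, tail) triples, one row per input row.
    triples = pd.DataFrame({
        "h": df[head_col].map(ent_to_id),
        "r": df[rel_col].map(rel_to_id),
        "t": df[tail_col].map(ent_to_id),
    })
    return triples, ent_to_id, rel_to_id


# Toy data, made up for illustration only.
df = pd.DataFrame({
    "head": ["aspirin", "BRCA1", "ibuprofen"],
    "relation": ["targets", "interacts_with", "targets"],
    "tail": ["COX1", "TP53", "COX2"],
})
type_of = {"aspirin": "drug", "ibuprofen": "drug", "BRCA1": "gene",
           "COX1": "gene", "TP53": "gene", "COX2": "gene"}

triples, ent_to_id, rel_to_id = labelled_triples_to_ids(df, type_of=type_of)
print(triples)
```

With `type_of` supplied, the two drugs end up with IDs 0 and 1 and the four genes with IDs 2 to 5, which is the "cluster entities of the same type" behaviour the description refers to.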

@AlCatt91 requested a review from danjust August 11, 2023 10:56
@AlCatt91 force-pushed the custom_data_notebook branch from a87ef12 to e1b2801 September 26, 2023 13:31
Collaborator Author

AlCatt91 commented Sep 26, 2023

I added to this PR with a few updates:

  • Added GC disclaimer for use of datasets (see README and NOTICE);
  • Added list of available datasets in README;
  • Updated CONTRIBUTING with details on how to work on a fork (preferred way to submit PRs);
  • Some updates to dataset.py, making use of the from_dataframe builder to cut down on code;
  • Added dataloader for OpenBioLink.

The changes to the notebook come only from black formatting and can be disregarded.

Contributor

@danjust left a comment

LGTM - see comment regarding OpenBioLink notebook


This repository provides dataloaders for third party datasets. The use of these datasets is at own risk and Graphcore offers no warranties of any kind. It is the user's responsibility to comply with all license requirements for datasets downloaded with dataloaders in this repository.

The tutorial notebooks make use of the following datasets:
Contributor

If we keep the OpenBioLink nb we should include that here

Collaborator Author

Thanks! I changed the dataset in the new notebook to biokg, please have a final look when you can :)

Contributor

@danjust left a comment

Looks great - just spotted one typo

"We download the directed, high-quality version of [OpenBioLink2020](https://github.com/openbiolink/openbiolink#benchmark-dataset) directly from the link provided by the authors. This shouldn't take more than a minute.\n",
"\n",
"Notice that OpenBioLink2020 integrates data from other sources, whose licensing terms are detailed in [this table](https://openbiolink.readthedocs.io/en/latest/sources.html) and should be minded when utilizing or redistributing the dataset files."
"We download the OGBL-BioKG Knoweldge Graph using the `ogb` package (see [here](https://ogb.stanford.edu/docs/linkprop/#data-loader) for details on how to use it). This shouldn't take more than a couple of minutes."
Contributor

typo in Knowledge Graph

Collaborator Author

Thanks, fixed!

@AlCatt91 merged commit 443474c into main Sep 28, 2023
1 check passed
@AlCatt91 deleted the custom_data_notebook branch September 28, 2023 15:49
3 participants