Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track original triple IDs in KGDataset.from_triples #37

Merged
merged 3 commits into from
Jan 8, 2024

Conversation

AlCatt91
Copy link
Collaborator

@AlCatt91 AlCatt91 commented Jan 3, 2024

When creating datasets with random train/valid/test splitting with KGDataset.from_triples and KGDataset.from_dataframe, for downstream tasks (e.g. comparing predictions with edge graph statistics) it can be useful to keep track of the shuffling and splitting of the original triple IDs in the (n_triple, 3) array/df.

@AlCatt91 AlCatt91 requested a review from danjust January 3, 2024 20:06
Copy link
Contributor

@danjust danjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is useful to have. Just for consistency I'd prefer having original_triple_ids in every case

n_entity=data[:, [0, 2]].max() + 1,
n_relation_type=data[:, 1].max() + 1,
entity_dict=entity_dict,
relation_dict=relation_dict,
type_offsets=type_offsets,
triples=triples,
)
ds.original_triple_ids = triple_ids # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be more consistent for downstream analytics if original_triple_ids always was a member of the dataset. I'd add a dict with values arange(...) if the dataset is not created through from_triples

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, good point, thanks! I should have added this to every constructor

Copy link
Contributor

@danjust danjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Alberto! LGTM

@AlCatt91 AlCatt91 merged commit 4e794d1 into main Jan 8, 2024
1 check passed
@AlCatt91 AlCatt91 deleted the track_triple_id_split branch January 12, 2024 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants