Track original triple IDs in KGDataset.from_triples #37
I agree that this is useful to have. Just for consistency, I'd prefer having original_triple_ids in every case.
besskge/dataset.py
Outdated
n_entity=data[:, [0, 2]].max() + 1,
n_relation_type=data[:, 1].max() + 1,
entity_dict=entity_dict,
relation_dict=relation_dict,
type_offsets=type_offsets,
triples=triples,
)
ds.original_triple_ids = triple_ids # type: ignore
It might be more consistent for downstream analytics if original_triple_ids were always a member of the dataset. I'd add a dict with values arange(...) if the dataset is not created through from_triples.
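One possible reading of this suggestion, as a rough sketch only: the helper name default_original_triple_ids is hypothetical and not part of besskge, and the choice to number IDs consecutively across splits in dict order is an assumption, since the comment does not say what range arange should cover.

```python
import numpy as np


def default_original_triple_ids(triples):
    """Build a fallback ID dict for a dataset that was not shuffled.

    `triples` maps a split name to an (n, 3) array. ASSUMPTION: IDs are
    assigned consecutively across splits, following the dict order.
    """
    ids = {}
    start = 0
    for part, arr in triples.items():
        n = arr.shape[0]
        # Each split gets the next contiguous block of IDs.
        ids[part] = np.arange(start, start + n)
        start += n
    return ids
```

With such a fallback, downstream code could index original_triple_ids by split name regardless of which constructor built the dataset.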
Agree, good point, thanks! I should have added this to every constructor
Thanks Alberto! LGTM
When creating datasets with random train/valid/test splitting via KGDataset.from_triples and KGDataset.from_dataframe, it can be useful for downstream tasks (e.g. comparing predictions with edge graph statistics) to keep track of how the original triple IDs in the (n_triple, 3) array/df are shuffled and split.
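To illustrate the idea, here is a minimal NumPy sketch, not the besskge implementation: the function split_with_original_ids, its split and seed parameters, and the split names are all assumptions made for the example. It shuffles the row indices of an (n_triple, 3) array, cuts them into three parts, and returns the original row IDs alongside the triples.

```python
import numpy as np


def split_with_original_ids(data, split=(0.7, 0.15, 0.15), seed=0):
    """Shuffle and split an (n_triple, 3) array, keeping original row IDs.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    n_triple = data.shape[0]
    rng = np.random.default_rng(seed)
    # perm[i] is the original row index of the i-th shuffled triple.
    perm = rng.permutation(n_triple)
    n_train = int(n_triple * split[0])
    n_valid = int(n_triple * split[1])
    # Cut the permutation into train / valid / test index blocks;
    # the test split takes whatever remains.
    id_parts = np.split(perm, [n_train, n_train + n_valid])
    names = ("train", "valid", "test")
    original_triple_ids = dict(zip(names, id_parts))
    triples = {name: data[ids] for name, ids in original_triple_ids.items()}
    return triples, original_triple_ids
```

Under these assumptions, data[original_triple_ids[part]] reproduces triples[part], so per-triple predictions can be mapped back to rows of the original array or DataFrame.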