fix: Remove claims that are only connected to deleted tweets
saattrupdan committed Mar 21, 2022
1 parent 156d9ad commit bc3a801
Showing 2 changed files with 18 additions and 2 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,14 @@ and this project adheres to
[Semantic Versioning](http://semver.org/spec/v2.0.0.html).


## [Unreleased]
### Fixed
- Now removes claims that are only connected to deleted tweets when calling
`to_dgl`. This previously caused a bug that was due to a mismatch between
nodes in the dataset (which includes deleted ones) and nodes in the DGL graph
(which does not contain the deleted ones).


## [v1.6.1] - 2022-03-17
### Fixed
- Now correctly catches JSONDecodeError during rehydration.
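
The changelog entry above describes dropping claims whose only connected tweets were deleted. A minimal sketch of that filtering on toy data follows. The column names `text` and `label`, the index values, and the assumption that a deleted tweet shows up as a row with missing values are all illustrative, not taken from the real dataset.

```python
import pandas as pd

# Toy tweet nodes (hypothetical data): tweet 1 is "deleted", i.e. its row has
# missing values, so dropna() removes it.
tweet_df = pd.DataFrame({"text": ["a", None, "c"]}, index=[0, 1, 2])

# Toy claim nodes (hypothetical labels).
claim_df = pd.DataFrame(
    {"label": ["misinformation", "factual", "misinformation"]},
    index=[0, 1, 2],
)

# Toy (tweet, discusses, claim) edges: claims 1 and 2 are reached only
# from the deleted tweet 1.
discusses_df = pd.DataFrame({"src": [0, 1, 2, 1], "tgt": [0, 1, 0, 2]})

# Keep only the non-deleted tweets.
kept_tweets = tweet_df.dropna()

# Keep only the edges whose source tweet still exists.
kept_edges = discusses_df[discusses_df.src.isin(kept_tweets.index)]

# Keep only the claims that are still the target of at least one edge.
kept_claims = claim_df[claim_df.index.isin(kept_edges.tgt)]

print(kept_claims)
```

Here only claim 0 survives, because it is the only claim discussed by a non-deleted tweet.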
12 changes: 10 additions & 2 deletions mumin/dgl.py
@@ -44,6 +44,14 @@ def build_dgl_dataset(nodes: Dict[str, pd.DataFrame],
'`dgl` extension, like so: `pip install '
'mumin[dgl]`')

# Remove the claims that are only connected to deleted tweets
tweet_df = nodes['tweet'].dropna()
claim_df = nodes['claim']
discusses_df = relations[('tweet', 'discusses', 'claim')]
discusses_df = discusses_df[discusses_df.src.isin(tweet_df.index.tolist())]
claim_df = claim_df[claim_df.index.isin(discusses_df.tgt.tolist())]
nodes['claim'] = claim_df

# Set up the graph as a DGL graph
graph_data = dict()
for canonical_etype, rel_arr in relations.items():
@@ -66,8 +74,8 @@ def build_dgl_dataset(nodes: Dict[str, pd.DataFrame],
# Get a dataframe containing the edges between allowed source and
# target nodes (i.e., non-deleted)
rel_arr = (relations[canonical_etype][['src', 'tgt']]
- .query('src in @allowed_src.values() and '
- 'tgt in @allowed_tgt.values()')
+ .query('src in @allowed_src.keys() and '
+ 'tgt in @allowed_tgt.keys()')
.drop_duplicates())

# Convert the node indices in the edge dataframe to the new indices
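
The switch from `.values()` to `.keys()` in the hunk above can be illustrated with a small sketch. It assumes `allowed_src` and `allowed_tgt` are dictionaries mapping original node indices to new contiguous indices for the non-deleted nodes; the diff does not show how they are built, so that is a guess. Plain lists stand in for the edge dataframe, and all index values are made up.

```python
# Hypothetical mappings: original node index -> new contiguous index,
# containing only the non-deleted nodes (an assumption, see above).
allowed_src = {10: 0, 30: 1}
allowed_tgt = {5: 0, 7: 1}

# Toy edges, expressed in ORIGINAL node indices.
edges = [(10, 5), (20, 5), (30, 7), (0, 1)]

# Filtering on keys compares original indices with original indices,
# so exactly the edges between non-deleted nodes are kept.
by_keys = [
    (s, t) for s, t in edges
    if s in allowed_src.keys() and t in allowed_tgt.keys()
]

# Filtering on values compares original indices with NEW indices, which
# keeps an edge only if its endpoints happen to collide with new indices.
by_values = [
    (s, t) for s, t in edges
    if s in allowed_src.values() and t in allowed_tgt.values()
]

# After filtering on keys, every endpoint can be remapped safely.
remapped = [(allowed_src[s], allowed_tgt[t]) for s, t in by_keys]

print(by_keys)    # [(10, 5), (30, 7)]
print(by_values)  # [(0, 1)]
print(remapped)   # [(0, 0), (1, 1)]
```

Under this assumption, filtering on values would keep the wrong edges and make the later remapping step fail on lookups, which is consistent with the node mismatch described in the changelog.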