Merge pull request #14 from vtraag/feature/cit_cluster
Added cluster.py and brief explanation.
Giovanni1085 authored Apr 17, 2020
2 parents 57a9542 + d7bb0ec commit e48793c
Showing 2 changed files with 67 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -57,7 +57,7 @@ Finally, there are three notebooks to help replicate at least part of the analys
* [Notebook_CORD-19_2_text_analysis](Notebook_CORD-19_2_text_analysis.ipynb) contains the topic modelling analysis, including its use to qualify citation network clusters.
* [Notebook_CORD-19_3_network_analysis](Notebook_CORD-19_3_network_analysis.ipynb) contains an alternative way to perform a citation network analysis, focused on the bibliographic coupling network of CORD-19 papers. Results of this analysis are comparable to what is reported in the paper.

-The two citation network clustering solutions discussed in the paper, using both CORD-19 and external references, is also provided as a [separate file](datasets_input/paper_CORD19_supporting_materials/clustering_04042020.csv).
+The two citation network clustering solutions discussed in the paper, using both CORD-19 and external references, is also provided as a [separate file](datasets_input/paper_CORD19_supporting_materials/clustering_04042020.csv). These results are generated using [cluster.py](cluster.py). This may require installation of the development version of [`python-igraph`](https://github.com/igraph/python-igraph), until the upcoming release (0.8.1) is out. We therefore also include the actual clustering results themselves.

Some steps in the analyses are not included here since they require proprietary data. They can be replicated by getting access to the data (see above) and following the steps detailed in the paper.

66 changes: 66 additions & 0 deletions cluster.py
@@ -0,0 +1,66 @@
import pandas as pd
import igraph as ig
import numpy as np
#%%
# Read files
nodes_df = pd.read_csv('../data/citation_nodes-0.txt', sep='\t',
                       dtype={'abstract': 'str'}, low_memory=False)
edges_df = pd.read_csv('../data/citation_edges-0.txt', sep='\t')

#%%
# Create graph
G = ig.Graph.DictList(
    vertices=nodes_df.to_dict('records'),
    edges=edges_df.to_dict('records'),
    directed=True,
    vertex_name_attr='id',
    edge_foreign_keys=('citing_pub_id', 'cited_pub_id'))
del G.es['citing_pub_id']
del G.es['cited_pub_id']

#%%
# Get weakly connected component
H = G.components(mode='weak').giant()

degree = np.array(H.degree(mode='out'))
H.es['weight'] = [1.0/degree[e.source] for e in H.es]
H.to_undirected(combine_edges='sum')

#%%
# Cluster publications
import random
random.seed(0)
ig.set_random_number_generator(random)

res_params = [2e-5, 1e-5]
cluster_solutions = [None]*len(res_params)
graph = H
for idx, res in enumerate(res_params):
    cluster_solutions[idx] = graph.community_leiden(resolution_parameter=res, n_iterations=10,
                                                    weights='weight',
                                                    node_weights='weight')
    graph = cluster_solutions[idx].cluster_graph(combine_vertices={'weight': 'sum'},
                                                 combine_edges={'weight': 'sum'})

#%%
# Make dataframe with clustering solution
pubs_df = nodes_df.set_index('id')
membership = np.arange(H.vcount())
for idx, clusters in enumerate(cluster_solutions):
    tmp_membership = np.array(clusters.membership)
    membership = np.array([tmp_membership[c] for c in membership])
    pubs_df.loc[H.vs['id'],'clusters_{}'.format(idx)] = membership

pubs_df = pubs_df.loc[H.vs['id'],:]
pubs_df = pubs_df[pubs_df['weight'] == 1]

for idx in range(len(cluster_solutions)):
    col = 'clusters_{}'.format(idx)
    pubs_df[col] = pubs_df[col].astype('int')

#%%
# Write results
pubs_df.to_csv('cluster_solutions_pubs.txt', index=True, sep='\t')

cols = ['clusters_{}'.format(i) for i in range(len(cluster_solutions))]
pubs_df[cols].to_csv('cluster_solutions.txt', index=True, sep='\t')
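
Two steps in cluster.py are easy to misread: the edge weighting (each citation weighted by one over the out-degree of the citing publication, then summed when the graph is made undirected) and the way the per-level memberships are composed back onto the original publications. The following is a minimal standalone sketch of both ideas using plain Python instead of `igraph`. The function names and the toy data are hypothetical and are not part of the commit; it is meant as an illustration under those assumptions, not as a replacement for the script.

```python
from collections import defaultdict


def fractional_undirected_weights(edges):
    """Weight each citation by 1/out-degree of the citing node, then merge
    all edges between the same pair of nodes (in either direction) by
    summing their weights."""
    out_degree = defaultdict(int)
    for citing, _cited in edges:
        out_degree[citing] += 1

    weights = defaultdict(float)
    for citing, cited in edges:
        pair = tuple(sorted((citing, cited)))  # direction is dropped here
        weights[pair] += 1.0 / out_degree[citing]
    return dict(weights)


def compose_memberships(levels):
    """Map every original node to its cluster at each level of a hierarchy.

    levels[0] assigns original nodes to clusters; levels[k] assigns the
    clusters of level k-1 to coarser clusters.
    """
    membership = list(range(len(levels[0])))
    per_level = []
    for level in levels:
        # Follow each node's current cluster id one level up.
        membership = [level[c] for c in membership]
        per_level.append(membership)
    return per_level


# Hypothetical toy citation list: (citing, cited)
edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "b")]
weights = fractional_undirected_weights(edges)

# Hypothetical two-level hierarchy: 4 nodes -> 3 clusters -> 2 clusters
levels = [[0, 0, 1, 2], [0, 1, 1]]
solutions = compose_memberships(levels)
```

In the toy example, publication `a` cites two others, so each of its citations carries weight 0.5, and the reciprocal pair `a`/`b` ends up with a combined weight of 1.5. The second function mirrors the loop that fills the `clusters_{idx}` columns: the second-level solution is defined on the clusters of the first, so it has to be looked up through the first-level membership.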
