Merge pull request #14 from vtraag/feature/cit_cluster

Added cluster.py and brief explanation.
CWTSLeiden · Apr 17, 2020 · e48793c
2 parents 57a9542 + d7bb0ec

Showing 2 changed files with 67 additions and 1 deletion.
diff --git a/README.md b/README.md
@@ -57,7 +57,7 @@ Finally, there are three notebooks to help replicate at least part of the analys
 * [Notebook_CORD-19_2_text_analysis](Notebook_CORD-19_2_text_analysis.ipynb) contains the topic modelling analysis, including its use to qualify citation network clusters.
 * [Notebook_CORD-19_3_network_analysis](Notebook_CORD-19_3_network_analysis.ipynb) contains an alternative way to perform a citation network analysis, focused on the bibliographic coupling network of CORD-19 papers. Results of this analysis are comparable to what is reported in the paper.
 
-The two citation network clustering solutions discussed in the paper, using both CORD-19 and external references, is also provided as a [separate file](datasets_input/paper_CORD19_supporting_materials/clustering_04042020.csv).
+The two citation network clustering solutions discussed in the paper, using both CORD-19 and external references, are also provided as a [separate file](datasets_input/paper_CORD19_supporting_materials/clustering_04042020.csv). These results were generated using [cluster.py](cluster.py). Running the script may require installing the development version of [`python-igraph`](https://github.com/igraph/python-igraph) until the upcoming release (0.8.1) is out. We therefore also include the clustering results themselves.
 
 Some steps in the analyses are not included here since they require proprietary data. They can be replicated by getting access to the data (see above) and following the steps detailed in the paper. 
 

diff --git a/cluster.py b/cluster.py
@@ -0,0 +1,66 @@
+import pandas as pd
+import igraph as ig
+import numpy as np
+#%%
+# Read files
+nodes_df = pd.read_csv('../data/citation_nodes-0.txt', sep='\t', 
+                       dtype={'abstract': 'str'}, low_memory=False)
+edges_df = pd.read_csv('../data/citation_edges-0.txt', sep='\t')
+
+#%%
+# Create graph
+G = ig.Graph.DictList(
+          vertices=nodes_df.to_dict('records'),
+          edges=edges_df.to_dict('records'),
+          directed=True,
+          vertex_name_attr='id',
+          edge_foreign_keys=('citing_pub_id', 'cited_pub_id'))
+del G.es['citing_pub_id']
+del G.es['cited_pub_id']
+
+#%%
+# Get largest weakly connected component
+H = G.components(mode='weak').giant()
+
+degree = np.array(H.degree(mode='out'))
+H.es['weight'] = [1.0/degree[e.source] for e in H.es]
+H.to_undirected(combine_edges='sum')
+
+#%%
+# Cluster publications
+import random
+random.seed(0)
+ig.set_random_number_generator(random)
+
+res_params = [2e-5, 1e-5]
+cluster_solutions = [None]*len(res_params)
+graph = H
+for idx, res in enumerate(res_params):
+  cluster_solutions[idx] = graph.community_leiden(resolution_parameter=res, n_iterations=10,
+                                weights='weight', 
+                                node_weights='weight')
+  graph = cluster_solutions[idx].cluster_graph(combine_vertices={'weight': 'sum'},
+                                                 combine_edges={'weight': 'sum'})
+
+#%%
+# Make dataframe with clustering solution
+pubs_df = nodes_df.set_index('id')
+membership = np.arange(H.vcount())
+for idx, clusters in enumerate(cluster_solutions):
+  tmp_membership = np.array(clusters.membership)
+  membership = tmp_membership[membership]
+  pubs_df.loc[H.vs['id'],'clusters_{}'.format(idx)] = membership
+
+pubs_df = pubs_df.loc[H.vs['id'],:]
+pubs_df = pubs_df[pubs_df['weight'] == 1]
+
+for idx in range(len(cluster_solutions)):
+  col = 'clusters_{}'.format(idx)
+  pubs_df[col] = pubs_df[col].astype('int')
+
+#%%
+# Write results
+pubs_df.to_csv('cluster_solutions_pubs.txt', index=True, sep='\t')
+
+cols = ['clusters_{}'.format(i) for i in range(len(cluster_solutions))]
+pubs_df[cols].to_csv('cluster_solutions.txt', index=True, sep='\t')
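
Two steps in `cluster.py` may be easier to follow on a toy example: the edge weighting (each citation counts for 1 / out-degree of the citing publication, and weights are summed when the graph is made undirected) and the way the second, coarser clustering is mapped back onto individual publications. The sketch below reproduces both in plain Python without `igraph`. The function names `normalised_undirected_weights` and `compose_memberships` and the toy data are hypothetical and are not part of the commit; this illustrates the idea rather than the script's actual implementation.

```python
from collections import Counter, defaultdict


def normalised_undirected_weights(edges):
    """Weight each directed edge by 1 / out-degree of its source, then
    sum the weights of edges that join the same unordered pair of nodes.

    Hypothetical helper mirroring the weighting step followed by
    to_undirected(combine_edges='sum') in cluster.py.
    """
    out_degree = Counter(src for src, _ in edges)
    weights = defaultdict(float)
    for src, dst in edges:
        key = (min(src, dst), max(src, dst))  # unordered pair
        weights[key] += 1.0 / out_degree[src]
    return dict(weights)


def compose_memberships(levels):
    """Flatten a hierarchy of clusterings onto the original nodes.

    levels[0][i] is the cluster of original node i; levels[k][c] is the
    level-k cluster of cluster c from level k-1. Hypothetical helper
    mirroring the membership loop in cluster.py.
    """
    current = list(range(len(levels[0])))
    flattened = []
    for level in levels:
        current = [level[c] for c in current]
        flattened.append(current)
    return flattened


# Toy citation edges as (citing, cited) pairs.
toy_edges = [("a", "b"), ("a", "c"), ("b", "a"), ("c", "b")]
toy_weights = normalised_undirected_weights(toy_edges)

# Toy hierarchy: five publications in three clusters, then those three
# clusters merged into two.
toy_levels = [[0, 0, 1, 2, 2], [0, 1, 1]]
toy_flat = compose_memberships(toy_levels)
```

Here `toy_flat[0]` corresponds to the `clusters_0` column and `toy_flat[1]` to `clusters_1`: every publication receives the label of the coarse cluster that contains its fine cluster.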