Topic Modelling

Building on the concept of Latent Dirichlet Allocation (LDA) by Blei et al. [1], we infer the topics (mixtures of words indicative of each group) from the publications collected in publications.csv.

Additional Python packages

Install the dependencies as laid out in the project's README.md.

Workflow:

  1. First, preprocess the scraped publications into a bag-of-words (BOW) representation.

  2. Next, use gensim to create the LDA models for the corpus derived from all the publications. The topic models are then used in two ways:

    • To infer the topic each publication belongs to. This is a mixture of probabilities over the topics; the most salient topic (the one with the highest probability) is taken as the publication's topic.
    • To infer the topic each author belongs to, using all the publications contributed by that author.
  3. Using the "LDA-space", we compare the topic network with the ground-truth communities (as seen in infnet-analysis). Due to the high dimensionality of the LDA space, a dimensionality-reduction technique is used for visualisation.

Preprocessing publications

Following the standard NLP pre-processing pipeline (tokenize -> stopping -> stemming), in preprocess_pubs.ipynb we concatenate each publication's title and abstract as its text data. The text is then tokenized (split on whitespace) before stopwords are removed (using a list of common English stopwords). Finally, each term is stemmed using PyStemmer.
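The pipeline above can be sketched as below. The stopword list here is a toy subset (the real one is a fuller list of common English stopwords), and the code falls back to unstemmed tokens if PyStemmer is not installed.

```python
# Minimal tokenize -> stopping -> stemming sketch.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is"}

def preprocess(text):
    tokens = text.lower().split()                       # tokenize on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopping
    try:
        import Stemmer                                  # PyStemmer
        stemmer = Stemmer.Stemmer("english")
        tokens = stemmer.stemWords(tokens)              # stemming
    except ImportError:
        pass                                            # leave tokens unstemmed
    return tokens

toks = preprocess("A study of community detection in information networks")
```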

The output from this section is pub_toks.pkl, a dataframe of the tokens for each publication.

Topic modelling

In topicmodelling.ipynb, two versions of the topic model are created:

  1. a topic model on the entire collection of publications, and
  2. a topic model using only the last 6 years of publications.

We can also explore the topics (mixtures of terms) generated by each topic model. To visualise the topic models, the visualisation code block must be re-run every time the notebook is started. Initially, the randomisation made it impossible to reproduce the same visualisation; this is now possible.

Topic network

A topic network is generated by placing each publication in the LDA space. The same is done for individual authors, using all the publications they contributed to. For the former, refer to topicDist_pub.ipynb; for the latter, refer to topicDist_poinf.ipynb.
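One plausible way to place an author in the LDA space, sketched below, is to average the topic mixtures of all their publications. Note this averaging is an assumption for illustration, not necessarily the exact aggregation used in topicDist_poinf.ipynb.

```python
# Toy per-publication topic mixtures (rows = publications, cols = topics).
import numpy as np

pub_mixtures = np.array([
    [0.9, 0.1],
    [0.7, 0.3],
    [0.2, 0.8],
])
author_pubs = [0, 1]  # indices of this author's publications

# author's position in LDA space = mean of their publications' mixtures
author_vec = pub_mixtures[author_pubs].mean(axis=0)
```

Because each row sums to 1, the averaged vector is still a valid topic distribution, so authors and publications live in the same space and can share one network.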


Todo

Things to try out

- [x] Apply on the entire publication instead of just the abstract
- [x] Use lemmatisation instead of stemming? Ref
- [x] Calculate the semantic coherence of topics generated by the LDA models Ref Repo
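For the coherence item, a minimal sketch of the UMass coherence measure is below. This is a plain-Python illustration of the metric, not the referenced repo's implementation (gensim also offers a ready-made `CoherenceModel`); the toy corpus and word pairs are assumptions.

```python
# UMass coherence: higher (closer to 0) means the topic's top words
# co-occur more often across documents.
import math

def umass_coherence(top_words, docs):
    """top_words: a topic's top terms in descending probability order.
    docs: corpus as lists of tokens."""
    doc_sets = [set(d) for d in docs]

    def df(*words):  # number of documents containing all given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((df(top_words[i], top_words[j]) + 1)
                              / df(top_words[j]))
    return score

docs = [["graph", "node", "edge"], ["graph", "node"], ["topic", "word"]]
good = umass_coherence(["graph", "node"], docs)   # words that co-occur
bad = umass_coherence(["graph", "word"], docs)    # words that never co-occur
```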

Different visualisation libraries

  1. bokeh
  2. beakerx

References

[1]: D. M. Blei and J. D. Lafferty, “A correlated topic model of Science,” The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, Jun. 2007.