Code for the paper "Using existing manual labels to enhance LDA-based topic modelling of patient concerns". Unfortunately, due to patient privacy, we cannot disclose the data, so the experiments cannot be reproduced exactly; we hope the method is nonetheless instructive and that some of the functions can be reused.
- Python 3.5.2
- sklearn 0.21.3
- wordcloud 1.5.0
- spacy 2.1.4
The versions listed above are the ones used in the experiments.
The `_PredictScorer` class in the scorer module of `sklearn.metrics` (`sklearn.metrics.scorer` in sklearn 0.21) was modified so that sklearn's model-selection utilities (e.g. `GridSearchCV`) can score an LDA model. Originally, line 94 reads:

```python
y_pred = estimator.predict(X)
```

but `LatentDirichletAllocation` has no `predict` method, only `transform`, so the line was changed to fall back to `transform`:

```python
try:
    y_pred = estimator.predict(X)
except AttributeError:
    # LDA has no predict(); use the document-topic matrix from transform() instead
    y_pred = estimator.transform(X)
```

With this change, the scorer passes LDA's document-topic matrix to the score function instead of raising an error.
To prepare the text for analysis, we carried out several standard preprocessing steps using the spaCy and sklearn Python packages: we converted all of the text to lowercase, removed the standard English stopwords defined by sklearn plus an additional set of corpus-specific stopwords (see supplementary information), and performed lemmatization. The code is in the `pre-processing` Jupyter notebook.
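A minimal sketch of these steps, assuming the small English spaCy model (`en_core_web_sm`) and a placeholder set of corpus-specific stopwords (the real list is given in the supplementary information):

```python
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Placeholder corpus-specific stopwords; the real list is in the supplementary information.
CORPUS_STOP_WORDS = {"patient", "hospital"}
STOP_WORDS = set(ENGLISH_STOP_WORDS) | CORPUS_STOP_WORDS

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # tagging is enough for lemmas

def preprocess(text):
    """Lowercase, lemmatize and remove stopwords from a single document."""
    doc = nlp(text.lower())
    tokens = [tok.lemma_ for tok in doc
              if tok.is_alpha
              and tok.text not in STOP_WORDS
              and tok.lemma_ not in STOP_WORDS]
    return " ".join(tokens)
```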
- Train a supervised model to generate dummy documents. We randomly split the documents into a training set (80%) and a test set (20%), then created an idealized single-topic dummy document for each class by concatenating the 100 words with the largest weights for that class (see the first sketch after this list).
- Train and "validate" the LDA model. We run a grid search on the LDA model to find the one that gives the smallest custom loss (see the second sketch below). The code is in the `clean_LDA_code` Jupyter notebook.
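The dummy-document step might look roughly like the sketch below. It assumes a bag-of-words representation and a linear classifier (`LogisticRegression`) as the supervised model, which may differ from the model used in the paper; `texts` and `labels` stand for the (non-disclosable) documents and their manual labels.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def build_dummy_documents(texts, labels, n_top_words=100, seed=0):
    """Train a supervised model and build one single-topic dummy document per class."""
    # 80/20 split; the held-out 20% is kept for later evaluation.
    texts_train, texts_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed)

    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(texts_train)
    vocab = np.array(vectorizer.get_feature_names())

    clf = LogisticRegression(multi_class="multinomial", solver="lbfgs", max_iter=1000)
    clf.fit(X_train, y_train)

    dummy_docs = {}
    # For a multi-class problem, coef_ holds one row of feature weights per class.
    for cls, weights in zip(clf.classes_, clf.coef_):
        # Concatenate the n_top_words words with the largest weight for this class.
        top_idx = np.argsort(weights)[::-1][:n_top_words]
        dummy_docs[cls] = " ".join(vocab[top_idx])
    return dummy_docs
```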
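The grid search could then be wired up as below. This assumes the `_PredictScorer` change described above has been applied, so that the scorer receives LDA's document-topic matrix; `custom_loss` is only a placeholder for the loss actually used in the paper, and the parameter grid is illustrative.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def custom_loss(y_true, doc_topic):
    # Placeholder loss (the paper's loss differs): one minus the average probability
    # a document assigns to the topic matching its label, assuming integer labels
    # and that topic index k is meant to correspond to class k.
    y_true = np.asarray(y_true)
    return 1.0 - doc_topic[np.arange(len(y_true)), y_true].mean()

scorer = make_scorer(custom_loss, greater_is_better=False)  # GridSearchCV then minimizes the loss

X = CountVectorizer().fit_transform(texts)  # `texts` = preprocessed documents
param_grid = {"n_components": [10, 20, 30], "learning_decay": [0.5, 0.7, 0.9]}

search = GridSearchCV(
    LatentDirichletAllocation(learning_method="online", random_state=0),
    param_grid, scoring=scorer, cv=5)
search.fit(X, labels)  # `labels` are forwarded to the scorer as y_true
best_lda = search.best_estimator_
```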