SkimLit

An NLP model to classify abstract sentences into the role they play.
This project is the implementation of the research paper: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.

Dataset: PubMed 200k RCT dataset

PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.

PubMed 20k is a subset of PubMed 200k. I.e., any abstract present in PubMed 20k is also present in PubMed 200k.
PubMed_200k_RCT is the same as PubMed_200k_RCT_numbers_replaced_with_at_sign, except that in the latter, all numbers had been replaced by @. (same for PubMed_20k_RCT vs PubMed_20k_RCT_numbers_replaced_with_at_sign).

Models Used:

A total of 6 models were trained on the whole dataset.

A Multinomial Naive Bayes Classifier with 74.9% accuracy.
A Convolution Custom Token Embedding model gave 83.6% accuracy.
A Pretrained Token Embedding (Universal Sentence Encoder) model gave 77.6% accuracy.
A Convolution Custom Character Embedding model gave 72.7% accuracy.
A hybrid model consisting of both Token and Character Embedding gave 77.6% accuracy.
A tribrid model consisting of Token, Character, and Position Embedding gave 86.9% accuracy.

Skimlit Tribrid Model:

This particular model consists of three different types of embeddings, namely Token, Character, and Position Embeddins. For the position embedding, the abstract was disentangled into line numbers (Sequence of the sentence in the abstract) and total lines (Total sentences in the abstract).

Model Architecture Used-->

Result:

Skimlit_Tribrid_saved_model

How the prediction looks like-->

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.devcontainer		.devcontainer
Procfile.txt		Procfile.txt
README.md		README.md
Skimlit.ipynb		Skimlit.ipynb
app.py		app.py
requirements.txt		requirements.txt
setup.sh		setup.sh
test.ipynb		test.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SkimLit

Dataset: PubMed 200k RCT dataset

Models Used:

Skimlit Tribrid Model:

Result:

About

Releases

Packages

Languages

garvit088/SkimLit

Folders and files

Latest commit

History

Repository files navigation

SkimLit

Dataset: PubMed 200k RCT dataset

Models Used:

Skimlit Tribrid Model:

Result:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages