MICCAI-review-thesis

This repository contains the data and code for processing MICCAI papers from 2012 and 2021: creating an initial database of basic metadata, expanding it by categorising articles as classification or other, mining the articles' references and processing the data, and generating graphs and qualitative outputs for further processing based on data gathered through manual annotation.

This repository supports my master's thesis, "Exploring the landscape of MICCAI papers", written at the IT University of Copenhagen (2022).

Data

The data is partly mined from the Springer database hosting the proceedings, partly taken from the proceedings articles themselves, and partly gathered through my manual annotation of the articles.

Naming conventions

The folders in this repo are numbered in accordance with the workflow used in my thesis. Notebooks within folders follow the same structure; for example, 04-analysis is the final folder and contains 4 notebooks to be worked through sequentially: 04.01_name, 04.02_name, 04.03_name and 04.04_name.

01 Combining proceedings txt

This folder contains one Jupyter notebook, the 11 proceedings parts as txt files and the combined full text that is the outcome of the notebook.

  • 01.01_combining-proceedings-parts: notebook that takes the individual txt files from the proceedings parts and joins them together. The full text version is used for mining the references and creating categories. A minimal sketch of the joining step is shown below.
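
The joining itself only needs a few lines of Python. The sketch below assumes the parts sit in a folder called proceedings_parts, are named part_*.txt and are written to proceedings_full.txt; these names are illustrative, not necessarily the ones used in the notebook.

```python
# A minimal sketch of the joining step. The folder and file names are
# assumptions for illustration; the notebook may use different names.
from pathlib import Path

parts = sorted(Path("proceedings_parts").glob("part_*.txt"))  # the 11 part files
with open("proceedings_full.txt", "w", encoding="utf-8") as out:
    for part in parts:
        out.write(part.read_text(encoding="utf-8"))
        out.write("\n")  # separator between parts
```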

02 Mining initial data

This folder contains one Jupyter notebook, the individual html files as documents (49 in total) and 2 csv files with the initially mined data for each year.

  • 02.01_creating-inital-database: notebook that uses the html files from the Springer database (hosting the proceedings) to mine basic metadata (a minimal sketch of this step is shown after the column list below). Each csv file has the following columns:
    • id: running index from 0 to the total number of articles
    • title: title of the article
    • authors: names of the authors
    • page numbers: where the article is located in the entire proceedings
    • DOI: article's doi
    • year of publication: year the article is from (2012 or 2021)
    • part of publication: number of proceedings part the article appears in
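
To give an idea of the mining step, the sketch below parses a saved Springer proceedings page with BeautifulSoup and writes a small csv. The file names and CSS selectors are illustrative assumptions, not the exact ones used in the notebook.

```python
# A minimal sketch of mining basic metadata from a saved Springer
# proceedings page. The file names and the CSS selectors are illustrative
# assumptions, not the exact ones used in 02.01_creating-inital-database.
import csv
from bs4 import BeautifulSoup

with open("proceedings_2021_part_1.html", encoding="utf-8") as f:  # hypothetical file name
    soup = BeautifulSoup(f, "html.parser")

rows = []
for i, chapter in enumerate(soup.select("li.chapter-item")):  # assumed selector
    title = chapter.select_one("h3").get_text(strip=True)
    authors = chapter.select_one(".authors").get_text(strip=True)
    doi_link = chapter.select_one("a[href*='doi.org']")
    rows.append({
        "id": i,
        "title": title,
        "authors": authors,
        "DOI": doi_link["href"] if doi_link else "",
    })

with open("2021_initial_database.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "authors", "DOI"])
    writer.writeheader()
    writer.writerows(rows)
```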

03 Database creation and references

This folder contains 2 Jupyter notebooks, 4 txt files with rules and keywords, the full text versions of the proceedings articles from 01, the initial database csv files from 02, the 2 csv files that are generated by one of the notebooks and 2 graphs generated by the other notebook.

  • 03.01_defining-categories: notebook that adds categories to articles based on the titles and abstracts mined from the full text versions from 01. Categories are assigned based on keywords, rules and a threshold for the amount of information found (a minimal sketch of this approach is shown after this list). Generates two csv files with the following columns:

    • id: running index from 0 to the total number of articles
    • title: title of the article
    • authors: names of the authors
    • page numbers: where the article is located in the entire proceedings
    • DOI: article's doi
    • year of publication: year the article is from (2012 or 2021)
    • part of publication: number of proceedings part the article appears in
    • category: type of article - classification, other or unknown
  • 03.02_references-mining-and-analysis: notebook that mines the references from the full text versions from 01 and analyses the findings, generating two graphs. This work borrows from the ACL citations repo: https://github.com/coastalcph/acl-citations. Comments have been added where I have altered the original code, with thanks for the work done!
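
As an illustration of the categorisation step in 03.01, the sketch below labels an article from keyword hits above a threshold. The keyword list and the threshold value are illustrative assumptions; the actual rules and keywords live in the txt files in this folder.

```python
# A minimal sketch of keyword-based categorisation with a threshold.
# The keywords and the threshold are illustrative; the real rules and
# keywords are read from the txt files in this folder.
CLASSIFICATION_KEYWORDS = {"classification", "classifier", "classify", "diagnosis"}
THRESHOLD = 2  # assumed minimum number of keyword hits

def categorise(title: str, abstract: str) -> str:
    if not title and not abstract:
        return "unknown"  # not enough information found in the full text
    text = f"{title} {abstract}".lower()
    hits = sum(text.count(keyword) for keyword in CLASSIFICATION_KEYWORDS)
    return "classification" if hits >= THRESHOLD else "other"

print(categorise("Tumour classification with random forests",
                 "We train a classifier on MRI scans ..."))  # -> classification
```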

04 Analysis

This folder contains 4 notebooks, 5 csv files, and 8 graphs. The csv files contain the following data:

  • annotations_data

    • Timestamp : time the survey response was submitted in Google Forms
    • What is the article's index? : id of article
    • Which year is the article from? : year the article is from (2012 or 2021)
    • Is the article accurately labelled as classification? : yes or no
    • If not accurately labelled as classification, what would you label it as? : segmentation, other medical imaging task, I don't know or it was accurately labelled
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • What is the aim or task of the article? (input quote) : text based answer
    • How does the article justify this aim or task? (input quote) : text based answer
    • Which method is used for classification? : svm, graph analysis, supervised learning, unsupervised learning, transfer learning, neural network, (other) text based answer
    • Which performance measures are used? : auc, specificity, accuracy, precision, recall, f1 score, sensitivity, (other) text based answer
    • Does the article use segmentation as preprocessing? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the dataset used in the article have a title? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • What is the size of the used dataset? (input quote) : text based answer
    • What type is the dataset? : public or private
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Draft_Is the survey/method of how the dataset was obtained accessible? : yes or no
    • Draft_Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article mention the demographics of the patients/images included in the used dataset? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article mention the intent for collecting the dataset? The intended task for the dataset? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article disclose any affiliations? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article include anything about respect for persons (informed consent, voluntary participation) participating in the dataset? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article have any mention of beneficence, minimising risk/maximising benefit of work? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article have any mention of justice (equal treatment, fair selection of subjects)? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Does the article mention any respect for law/public interest (transparency in methods/results, accountability for actions)? : yes or no
    • Please input the quote from which you infer the answer to the previous question (if possible) : text based answer
    • Are there any other comments/interesting aspects? : text based answer
  • affiliations_data

    • counter 2012: number of articles from 2012 with this affiliation
    • counter 2021: number of articles from 2021 with this affiliation
    • type: type of affiliation - government (gov), university (uni), hospital (hosp) or corporation (corp)
    • country: country of origin for the institution
    • name: name of institution
    • list of 2012 ids: list of the ids of 2012 articles affiliated with this institution
    • list of 2021 ids: list of the ids of 2021 articles affiliated with this institution
  • disease_data

    • year: year the article is from (2012 or 2021)
    • disease: name of disease researched in article (if mentioned)
    • body part: name of body part researched in article
    • Zhou category: higher-level categorisation of body parts based on the review article by Zhou et al.
    • task: name of task researched in article (if mentioned)
    • id: id of article
  • justification_data

    • year: year the article is from (2012 or 2021)
    • type: justification type - sci, nov, dis, hc
    • id: id of article
  • elite_universities - original dataset from the QS World University Rankings; only a subset (the top 50) is used here

    • rank: numbered ranking, 1-50
    • name: name of university
    • country: university's country
    • score: used to calculate rank, max 100

The four notebooks have the following contents:

  • 04.01_non-textbased-answers: notebook that takes the data from the non-text-based questions in annotations_data and creates graphs (a minimal sketch of this step is shown after this list)
  • 04.02_more-data-gathering: notebook that looks at the text-based answers in annotations_data, used to collect the data contained in the 3 csv files: affiliations_data, disease_data and justification_data
  • 04.03_additional-data: notebook that uses affiliations_data, elite_universities, disease_data and justification_data and looks for insights in the data
  • 04.04_square-1: notebook that generates 4d scatter plots using annotations_data, disease_data and justification_data
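
As an illustration of how annotations_data is used in 04.01, the sketch below counts the answers to one of the yes/no questions per year and plots them. The column names follow the data description above; the file name and the plotting style are assumptions and may differ from the notebook.

```python
# A minimal sketch of the kind of count-and-plot step done in
# 04.01_non-textbased-answers. Column names follow the data description
# above; the exact file name and plot styling may differ from the notebook.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("annotations_data.csv")  # assumed file name
question = "Does the article use segmentation as preprocessing?"

counts = (df.groupby("Which year is the article from?")[question]
            .value_counts()
            .unstack(fill_value=0))
counts.plot(kind="bar")
plt.ylabel("number of annotated articles")
plt.tight_layout()
plt.show()
```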

00 Archive

Contains remnants of code cut from existing files and old versions of collected data (nothing used in the final thesis, but kept just in case).
