This repository contains student project materials, including the project report, data, code, and references to literature for this departmentally-sponsored project.
This project aims to improve the effectiveness of machine learning for text analysis, specifically through the development of a fact-checking system. Its primary objective is to make text analysis tools more usable for students, researchers, and anyone interested in validating the accuracy of factual information. The project focuses on streamlining the processing of a corpus of PubMed articles, extracting the key textual content, and enabling users to conduct fact-checks through an interface. In an era characterized by information overload, such a system is essential for ensuring the credibility of facts. By integrating scholarly articles into the fact-checking domain, the Corroboration Dashboard strives to offer a more inclusive, automated, and dependable approach to information verification, thereby contributing to the progress of research and knowledge validation. The primary research question is how effectively the system can corroborate user-provided facts with information present in scholarly articles. Overall, this project addresses the need for an automated corroboration mechanism in the domain of scholarly research.
The Corroboration Dashboard project presents a methodology for enhancing the fact-checking process by incorporating scholarly articles from PubMed. This approach acknowledges the inadequacies of traditional fact-checking methods, particularly in navigating the dynamic landscape of information sources, and proposes the integration of scholarly literature to improve the process and address the challenges posed by the abundance of information in society.
The scholarly foundation of the Corroboration Dashboard project resonates with the insights of leading scholars in related fields. Alpaydin (2021), for instance, delves into the intricacies of machine learning and underscores the ethical imperatives associated with its application, a perspective that informs the ethical considerations inherent in the Corroboration Dashboard project's design and implementation. Moreover, the project builds upon prior research, such as Bengfort et al.'s (2018) exploration of applied text analysis using Python and Thorne & Vlachos's (2018) investigation into automated fact-checking. The integration of PubMed articles into the fact-checking process signifies a change in how students use the internet and verify scholarly sources, and in how the media portrays its information. This marks a departure from the usual methodologies reliant on mainstream media sources. By incorporating content that is typically scrutinized through the peer-review process, this project introduces a level of credibility and trustworthiness that is often lacking in the realm of online information.
Overall, this project represents a significant advancement in the domain of fact-checking, offering a systematic approach to verifying information in an era of abundant data and evolving sources. Drawing on insights from scholarly discourse, this project sits at the intersection of technology, ethics, and information integrity.
Explanation of Code
The project started with examining PubMed articles in order to gain insight into the type of information within these datasets. Examining characteristics such as the size and structure of the files was crucial in determining the logistical aspects of running this project. The first essential step was downloading the data from the PubMed website. After verifying that a single file could be processed, the program begins by parsing and reading the data within the XML files.
from bs4 import BeautifulSoup

def extract_data_from_xml(xml_path):
    """Extract and process data from XML files."""
    with open(xml_path, 'r', encoding='utf-8') as file:
        soup = BeautifulSoup(file, 'xml')
    # Pull the article title and abstract, falling back to empty strings
    title_tag = soup.find('article-title')
    title = title_tag.text.strip() if title_tag else ""
    abstract_tag = soup.find('abstract')
    abstract = abstract_tag.text.strip() if abstract_tag else ""
    text = title + " " + abstract
    return text
The Python code snippet above illustrates a function named extract_data_from_xml, which is designed to extract and process data from XML files. This function utilizes the BeautifulSoup library to navigate the XML and extract information such as the article title and abstract, consolidating the extracted data into a single text string. Moving forward, the code includes functionality for processing the entire corpus of XML files. The process_chunk function facilitates the batch processing of a specified chunk of XML files, inserting the relevant data into a SQLite database named "corroboration_db.sqlite."
def process_chunk(conn, cursor, chunk):
    """Insert a chunk of XML files into the articles table."""
    with conn:  # wraps the inserts in one transaction, committed on exit
        for xml_path in chunk:
            text = extract_data_from_xml(xml_path)
            cursor.execute('INSERT INTO articles (title, text) VALUES (?, ?)',
                           (os.path.basename(xml_path), text))
The process_corpus function orchestrates the overall processing of XML files, systematically walking through the corpus directory, assembling chunks of XML paths, and persistently storing the data in the SQLite database.
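The flow described above might look something like the following sketch. The chunk size, table schema, and the simplified extractor stub are assumptions for illustration, not the project's exact code:

```python
# A minimal sketch of the process_corpus flow, assuming a (title, text)
# schema and a default chunk size of 100; both are illustrative choices.
import os
import sqlite3

def extract_data_from_xml(xml_path):
    # Simplified stand-in for the BeautifulSoup-based extractor
    with open(xml_path, 'r', encoding='utf-8') as f:
        return f.read()

def process_chunk(conn, cursor, chunk):
    with conn:  # one transaction per chunk
        for xml_path in chunk:
            text = extract_data_from_xml(xml_path)
            cursor.execute('INSERT INTO articles (title, text) VALUES (?, ?)',
                           (os.path.basename(xml_path), text))

def process_corpus(corpus_dir, db_path, chunk_size=100):
    """Walk the corpus directory, batching XML paths into chunks."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute('CREATE TABLE IF NOT EXISTS articles (title TEXT, text TEXT)')
    chunk = []
    for root, _, files in os.walk(corpus_dir):
        for name in files:
            if name.endswith('.xml'):
                chunk.append(os.path.join(root, name))
                if len(chunk) == chunk_size:
                    process_chunk(conn, cursor, chunk)
                    chunk = []
    if chunk:  # flush the final partial chunk
        process_chunk(conn, cursor, chunk)
    conn.close()
```

Chunking keeps memory usage bounded and groups inserts into larger transactions, which is noticeably faster in SQLite than committing one row at a time.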
Additionally, the code encompasses a fact-checking mechanism. The fact_check function accepts a user-provided query, retrieves the articles stored in the database, and employs Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to determine similarity scores. Articles with similarity scores exceeding a specified threshold are considered potential matches, and the results are presented to the user through the Streamlit application. The show_contents checkbox, which controls whether the full text of the articles is displayed, is a useful feature: it lets users decide whether they want to see the detailed contents along with the fact-checking results.
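The TF-IDF matching step can be sketched as follows, assuming the articles are available as (title, text) pairs. The signature, default threshold, and sample data here are illustrative, not the project's exact code:

```python
# A minimal sketch of TF-IDF vectorization plus cosine similarity for
# fact-checking; the threshold default of 0.1 is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def fact_check(query, articles, threshold=0.1):
    """Return (title, score) pairs for articles similar to the query."""
    texts = [text for _, text in articles]
    vectorizer = TfidfVectorizer(stop_words='english')
    # Fit on the articles plus the query so they share one vocabulary
    matrix = vectorizer.fit_transform(texts + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).flatten()
    return [(articles[i][0], score)
            for i, score in enumerate(scores) if score > threshold]

articles = [
    ("article1.xml", "Vitamin C supplementation and the common cold."),
    ("article2.xml", "Deep learning methods for protein folding."),
]
matches = fact_check("vitamin C helps with colds", articles)
```

Because TF-IDF down-weights terms that appear in many documents, the similarity score rewards overlap on distinctive vocabulary rather than on common words, which is what makes the adjustable threshold meaningful.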
The main application, defined in the main function, operates within the Streamlit framework. It enables users to upload a PubMed corpus in tar.gz format, process the corpus, and perform fact-checking by entering a query. The application provides informative messages and results, creating a comprehensive Corroboration Dashboard for user interaction and insights.
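One piece of the upload step, unpacking the tar.gz corpus into a working directory before processing, could look like the sketch below. The function name and the path-safety check are assumptions for illustration:

```python
# A hedged sketch of tar.gz extraction for an uploaded corpus; the real
# application's helper names and layout may differ.
import os
import tarfile

def extract_corpus(archive_path, dest_dir):
    """Extract a tar.gz corpus, skipping members that escape dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_real = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, 'r:gz') as tar:
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(dest_dir, member.name))
            if target.startswith(dest_real):  # guard against path traversal
                tar.extract(member, dest_dir)
    return dest_dir
```

The path check matters because archive members can contain "../" components; extracting user-supplied archives without it can overwrite files outside the working directory.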
Testing
To test the validity of this project, I manually tested input data for different facts. For example, I tested a scenario in which a user types gibberish, to ensure that the program recognizes input that is not within the data file. I also tested the program's output when the user types nothing into the text box. If the program outputs "No matching article found to support the fact", then it ran correctly. The output of both tests is demonstrated below:
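These manual checks could also be scripted. Below is a minimal sketch using a simplified stand-in for the real fact_check function; only the no-match message behavior from above is reproduced, the stub's matching logic is an assumption:

```python
# Hypothetical automation of the two manual tests: empty input and gibberish.
# The stub stands in for the project's TF-IDF-based fact_check.
NO_MATCH = "No matching article found to support the fact"

def fact_check(query, articles):
    # Stand-in: empty or unrecognized input yields the no-match message
    if not query.strip():
        return NO_MATCH
    if not any(query.lower() in text.lower() for _, text in articles):
        return NO_MATCH
    return "Match found"

articles = [("article1.xml", "Vitamin C supplementation and the common cold.")]
print(fact_check("", articles))            # empty input: no-match message
print(fact_check("asdfqwerty", articles))  # gibberish: no-match message
```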
Installation
To run the provided Python code for the project, the user would need to install several external libraries:

- Streamlit: for creating web applications with minimal effort.
  pip install streamlit
- SQLite3: for interacting with SQLite databases. (sqlite3 is typically included with Python installations, so a separate installation may not be necessary.)
- BeautifulSoup: for parsing HTML and XML files.
  pip install beautifulsoup4
- scikit-learn: for machine learning functionality such as TF-IDF vectorization and cosine similarity calculation.
  pip install scikit-learn
- tarfile: for handling tar.gz file extraction. (tarfile is a standard library module and usually does not require a separate installation.)
- nltk (Natural Language Toolkit): may be required for additional text processing tasks.
  pip install nltk
How to run the Program
- Download a PubMed file onto your computer (make sure you have a significant amount of disk space when downloading large data files).
- Run the command streamlit run main.py in your terminal. It should open a tab in your local browser where you will see the program. It should appear as the following:
- Browse to the corpus file and select the appropriate data. Afterwards click on Process Corpus.
- After the file has been successfully processed, enter a fact or a keyword you want to explore and 🎉 TAAADAAAA 🎉 your articles should be given to you!
The outcomes of this project showcase a significant advancement in the application of machine learning for text analysis, particularly in the domain of fact-checking systems. The theoretical foundation of the project lies in the integration of scholarly articles from the PubMed corpus, enabling a comprehensive analysis of textual content. Through the implementation of TF-IDF vectorization and cosine similarity, the fact-checking system successfully identifies relevant articles that support or corroborate a given fact. The resulting artifact, the Corroboration Dashboard, demonstrates its capabilities through a user-friendly interface, allowing users such as students, researchers, and anyone interested in fact verification to easily interact with the system.
Evaluation metrics for the artifact involve measuring the system's accuracy in identifying supporting articles and providing relevant information to users. Additionally, the scalability and efficiency of the system are crucial metrics, considering the vast amount of data present in the PubMed corpus. The Corroboration Dashboard addresses these metrics by incorporating a streamlined processing pipeline and a similarity threshold that users can adjust based on their preference for precision. In the future, I would like to enhance this project by providing a mechanism for the program to process data faster, since it is currently not efficient with large data files.
- Alpaydin, E. (2021). Machine learning. MIT Press.
- Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied text analysis with Python: Enabling language-aware data products with machine learning. O'Reilly Media, Inc.
- Ke, S., Olea, J. L. M., & Nesbit, J. (2019). A robust machine learning algorithm for text analysis. Working paper.
- Thorne, J., & Vlachos, A. (2018). Automated fact checking: Task formulations, methods and future directions. arXiv preprint arXiv:1806.07687.
- Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255-260.
- Vo, N., & Lee, K. (2019, July). Learning from fact-checkers: Analysis and generation of fact-checking language. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 335-344).
The department's project descriptions can be found at https://github.com/ReadyResearchers-2023-24/cmpsc-580-exemplar-projects
The schedule for this work can be found at https://github.com/CMPSC-580-Allegheny-College-Spring-2024/classDocs?tab=readme-ov-file#schedule