Added screenshots and more descriptions to readme
rodriguez03 committed Feb 29, 2024
1 parent 09885a7 commit 676a9ed
Showing 9 changed files with 40 additions and 11 deletions.
38 changes: 32 additions & 6 deletions README.md
@@ -11,19 +11,25 @@ This repository contains student project materials, including project report, da

## Major: Computer Science (DS)

## Project Name: That's a Fact, No Cap
🧢 🆃🅷🅰🆃🆂 🅰 🅵🅰🅲🆃 🅽🅾 🅲🅰🅿 🧢

---

## Overview

This project aims to elevate the effectiveness of machine learning for text analysis, specifically through the development of a fact-checking system. Its primary objective is to optimize the use of text analysis tools for users such as students, researchers, or anyone interested in validating the accuracy of factual information. The project focuses on streamlining the processing of a corpus of PubMed articles, extracting the crucial textual content, and letting users conduct fact-checks through an interface. In an era characterized by information overload, ensuring the credibility of facts is essential. By integrating scholarly articles into the fact-checking domain, the Corroboration Dashboard strives to offer a more inclusive, automated, and dependable approach to information verification, thereby contributing to the progress of research and knowledge validation. The primary research question is how effectively the system can corroborate user-provided facts against information present in the scholarly articles. Overall, this project addresses the need for an automated corroboration mechanism in the domain of scholarly research.

## 📝 Literature Review 📝

The Corroboration Dashboard project represents a pivotal advancement in the domain of information verification and fact-checking, responding to the exigencies identified in the existing body of literature related to text analysis, machine learning, and knowledge validation. A thorough literature review reveals that contemporary challenges associated with misinformation and the credibility of textual content have spurred diverse research endeavors across multiple domains, including journalism, information retrieval, and natural language processing.

Prior research has explored the application of machine learning algorithms to assess the credibility and authenticity of textual content. While these studies have laid a foundation for automated fact-checking, they often grapple with challenges such as scalability, precision, and the ability to handle diverse datasets. The Corroboration Dashboard project builds upon this corpus of knowledge by introducing a novel methodology grounded in the systematic processing of scholarly articles sourced from PubMed.

The literature also highlights a growing concern regarding the limitations of existing fact-checking methodologies in the face of evolving information sources. As the landscape diversifies, traditional fact-checking practices confront difficulties in maintaining relevance and accuracy. The Corroboration Dashboard project addresses these challenges by integrating scholarly articles into the fact-checking landscape, offering a distinctive contribution that leverages the rigor of academic discourse to enhance the verification process.

Several scholarly works have emphasized the pressing need for innovative fact-checking solutions in contemporary society. The Corroboration Dashboard project emerges as a promising initiative, not merely aligning with prior research insights but introducing a novel paradigm for information validation. By harnessing the depth and reliability embedded in scholarly articles, this project provides a unique and valuable contribution to the ongoing discourse on information credibility and fact-checking methodologies. In this context, the project's focus on PubMed articles as a rich source of validated information represents a distinctive approach that holds the potential to reshape how fact-checking is approached in an era marked by information abundance and complexity.

## ❓🤔 Methods 🤔❓

***Explanation of Code***

@@ -57,10 +63,17 @@ def process_chunk(conn, cursor, chunk):

The `process_corpus` function orchestrates the overall processing of XML files, systematically walking through the corpus directory, assembling chunks of XML paths, and persistently storing data into the SQLite database.
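
The snippet below is a minimal sketch of how `process_chunk` and `process_corpus` might fit together. It assumes a simple `articles(id, title, content)` table; the chunk size and the XML element names (`ArticleTitle`, `AbstractText`) are illustrative assumptions rather than the exact schema used in this repository.

```python
import os
import sqlite3
import xml.etree.ElementTree as ET

CHUNK_SIZE = 100  # number of XML files per database transaction (assumed)

def process_chunk(conn, cursor, chunk):
    """Parse a batch of XML files and persist their text to SQLite."""
    for xml_path in chunk:
        root = ET.parse(xml_path).getroot()
        # Element names are illustrative; real PubMed XML uses a richer schema.
        title = root.findtext(".//ArticleTitle", default="")
        body = " ".join(elem.text or "" for elem in root.iter("AbstractText"))
        cursor.execute(
            "INSERT INTO articles (title, content) VALUES (?, ?)",
            (title, body),
        )
    conn.commit()  # one commit per chunk keeps transactions small

def process_corpus(corpus_dir, db_path):
    """Walk the corpus directory and load every XML file in fixed-size chunks."""
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, content TEXT)"
    )
    chunk = []
    for dirpath, _, filenames in os.walk(corpus_dir):
        for name in filenames:
            if name.endswith(".xml"):
                chunk.append(os.path.join(dirpath, name))
                if len(chunk) == CHUNK_SIZE:
                    process_chunk(conn, cursor, chunk)
                    chunk = []
    if chunk:  # flush the final partial chunk
        process_chunk(conn, cursor, chunk)
    conn.close()
```

Chunked commits are a common compromise between one giant transaction and a commit per file; the exact chunk size used in the project may differ.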

Additionally, the code encompasses a fact-checking mechanism. The `fact_check` function accepts a user-provided query, retrieves the articles stored in the database, and employs Term Frequency-Inverse Document Frequency (TF-IDF) vectorization to compute similarity scores. Articles with similarity scores exceeding a specified threshold are considered potential matches, and the results are presented to the user through the Streamlit application. A `show_contents` checkbox lets users decide whether to display the full text of the articles alongside the fact-checking results.
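
As a self-contained illustration of this matching step, the sketch below reproduces the TF-IDF scoring logic on in-memory `(id, title, text)` tuples. The helper name `rank_matches` and the sample data are hypothetical; the 0.2 threshold mirrors the default in `fact_check`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_matches(query, articles, threshold=0.2):
    """Score each article against the query and keep those at or above the threshold."""
    texts = [query.lower()] + [text for _, _, text in articles]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    # Row 0 is the query; the remaining rows are the articles.
    scores = cosine_similarity(vectors[0], vectors[1:])[0]
    return [
        (articles[i][0], articles[i][1], scores[i])
        for i in range(len(scores))
        if scores[i] >= threshold
    ]

# Hypothetical articles shaped like the database rows: (id, title, text).
articles = [
    (1, "Vitamin C trial", "vitamin c supplementation shortened cold duration"),
    (2, "Sleep study", "sleep deprivation impaired working memory in adults"),
]
print(rank_matches("does vitamin c shorten colds", articles))
```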

The main application, defined in the `main` function, operates within the Streamlit framework. It enables users to upload a PubMed corpus in tar.gz format, process the corpus, and perform fact-checking by entering a query. The application provides informative messages and results, creating a comprehensive Corroboration Dashboard for user interaction and insights.
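
A compressed sketch of such an entry point is shown below. It assumes `process_corpus` and `fact_check` helpers like those described above; the module name, widget labels, and database filename are illustrative.

```python
import tarfile
import tempfile

import streamlit as st

from corroboration import fact_check, process_corpus  # assumed module name

def main():
    "Main Streamlit Application."
    st.title("Corroboration Dashboard")

    # Step 1: upload and process a PubMed corpus archive.
    uploaded = st.file_uploader("Upload a PubMed corpus (tar.gz)", type=["gz"])
    if uploaded is not None:
        with tempfile.TemporaryDirectory() as workdir:
            with tarfile.open(fileobj=uploaded, mode="r:gz") as archive:
                archive.extractall(workdir)
            process_corpus(workdir, "corroboration_db.sqlite")
        st.success("Corpus processed and stored in the database.")

    # Step 2: fact-check a user query against the stored articles.
    query = st.text_input("Enter a fact to check")
    show_contents = st.checkbox("Show full article contents")
    if query:
        st.write(fact_check(query, "corroboration_db.sqlite",
                            show_contents=show_contents))

if __name__ == "__main__":
    main()
```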

***Testing***

To test the validity of this project, I manually tested input for several different facts. For example, I tested a scenario in which a user types gibberish, to ensure that the program recognizes input that does not appear in the data file. I also tested what the program's output looks like when the user submits an empty text box. If the program outputs `No matching article found to support the fact`, it ran correctly. The output of both tests is demonstrated below:

![screen_shot_four](Screenshots/screen_shot_four.png)
![screen_shot_five](Screenshots/screen_shot_five.png)
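
The same two checks could also be scripted. Below is a hypothetical pytest version, assuming `fact_check` is importable from the testing module and that an already-processed test database exists at the assumed path.

```python
import pytest

from testing import fact_check  # module under test; import path is an assumption

DB_PATH = "corroboration_db.sqlite"  # assumed path to a processed test database
NO_MATCH = "No matching article found to support the fact"

@pytest.mark.parametrize("query", ["qwxzvk plorb fnarg", ""])
def test_unmatchable_input_reports_no_match(query):
    """Gibberish and empty input should both produce the no-match message."""
    assert NO_MATCH in fact_check(query, DB_PATH)
```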

## Using the Artifact

***Installation***
@@ -112,7 +125,7 @@ The main application, defined in the `main` function, operates within the Streamlit framework.
4. After the file has been successfully processed, enter a fact or a keyword you want to explore, and 🎉 **TAAADAAAA** 🎉 your articles should be given to you!
![screen_shot_three](Screenshots/screenshot_three.png)
## Results and Outcomes
@@ -122,6 +135,19 @@ Evaluation metrics for the artifact involve measuring the system's accuracy in i

## References

- Alpaydin, E. (2021). *Machine Learning*. MIT Press.

- Bengfort, B., Bilbro, R., & Ojeda, T. (2018). *Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning*. O'Reilly Media.

- Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. *Science*, 349(6245), 255-260.

- Ke, S., Olea, J. L. M., & Nesbit, J. (2019). A robust machine learning algorithm for text analysis. Working paper.

- Thorne, J., & Vlachos, A. (2018). Automated fact checking: Task formulations, methods and future directions. arXiv preprint arXiv:1806.07687.

- Vo, N., & Lee, K. (2019, July). Learning from fact-checkers: Analysis and generation of fact-checking language. In *Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval* (pp. 335-344).

---

## Exemplar Projects Discussions
Binary file added Screenshots/screen_shot_five.png
Binary file added Screenshots/screen_shot_four.png
Binary file added Screenshots/screenshot_three.png
Binary file removed src/.corroboration_db.sqlite.icloud
Binary file not shown.
Binary file added src/corroboration_db.sqlite
Binary file not shown.
Binary file added src/test_cases/corroboration_db 2.sqlite
Binary file not shown.
Empty file.
13 changes: 8 additions & 5 deletions src/test_cases/testing.py
@@ -80,18 +80,20 @@ def fact_check(query, db_path, similarity_threshold=0.2, show_contents=False):
    article_data = [(row[0], row[1], row[2]) for row in rows]

    # Query Sanitization
    # Consider adding more sophisticated text preprocessing based on your requirements
    query = query.lower()

    vectorizer = TfidfVectorizer(stop_words='english')
    vectors = vectorizer.fit_transform([query] + [text for _, _, text in article_data])

    # Row 0 holds the query vector; the remaining rows hold the article vectors.
    query_vector = vectors[0]
    article_vectors = vectors[1:]

    similarities = cosine_similarity(query_vector, article_vectors)

    # cosine_similarity returns a (1, n_articles) array, so iterate over its columns.
    matching_articles = [
        (article_data[i][0], article_data[i][1], similarities[0][i])
        for i in range(similarities.shape[1])
        if similarities[0][i] >= similarity_threshold
    ]

    if not matching_articles:
@@ -100,14 +102,15 @@ def fact_check(query, db_path, similarity_threshold=0.2, show_contents=False):
result = "The fact is supported by the following articles:\n\n"
for article in matching_articles:
article_id, title, similarity_score = article
result += f"Article ID: {article_id}\nTitle: {title}\nSimilarity Score: {float(similarity_score[0]):.4f}\n\n"
result += f"Article ID: {article_id}\nTitle: {title}\nSimilarity Score: {float(similarity_score):.4f}\n\n"
if show_contents:
result += "Full Text:\n" + article_data[article_id - 1][2] + "\n\n"

conn.close()
return result



def main():
    "Main Streamlit Application."
    st.title("Corroboration Dashboard")
