Commit

finished methods
colinelder committed Mar 18, 2024
1 parent 5bff3a0 commit af6fce3
Showing 2 changed files with 33 additions and 5 deletions.
23 changes: 19 additions & 4 deletions README.md
@@ -11,22 +11,37 @@ This repository contains student project materials, including project report, data

## Major: Data Science

## Project Name: Enter The Name Of Your Project

Here, think of an interesting name for the work that brings freshness and excitement to the area of this project. Consider using a name that carries some information about the project and provides some hint at what the project does without being too wordy.
## Project Name: Colin's Fact Finder

---

## Overview

TODO (250 words minimum): Discuss the overview of the project using and building on the project description provided by the department. In this section, a concise summary is discussed of the study's key elements, offering the reader a quick understanding of the research's scope and goals. The section continues to outline the main topics, research questions, hypotheses, and/or theories in a clear and meaningful language to provide a type of roadmap for the reader to navigate the forthcoming details of the project. This section also needs to motivate the project by providing context for the study, outlining the current state of knowledge in the field, and highlighting any gaps or limitations in existing research. The section serves as a foundational guide that enables the reader to grasp the context of the study, in addition to its structure, before moving into a more technically-based discussion in the following sections of the article. In short, the "Overview" section needs to answer the `what` and `why` questions, that is `what is the project?` and `why is the project important?`
Colin's Fact Finder is an exciting new development in the way people access new information and verify what they already know about scientific topics. Through an interactive Streamlit platform backed by a Python system, users can input a scientific fact and instantly have it verified or rejected. The program does this by parsing through a large corpus of scholarly articles and returning the article in which the fact was found.

The fact finder program completes this task by utilizing text-matching techniques from the `fuzzywuzzy` library and text-parsing techniques to move through the vast corpus. The main goal of this project was to provide users with a trustworthy tool for verifying their claims against well-established scholarly articles. This approach ensures that the data being checked against is valid and improves the accuracy of the program. Additionally, it encourages the reader to look further into those articles and read more scholarly publications. Ultimately, it is the vast amount of knowledge found in these publications that allows the fact finder to work properly.

The significance of this project lies in its relevance to the increasing use of machine learning and text-matching techniques. The prevalence of artificial intelligence and its capabilities has skyrocketed in recent times, and machine learning and text matching go hand in hand with that growth. This project demonstrates the usefulness and impressive capabilities of these techniques, as well as how they can be implemented properly.

## Literature Review

TODO: Conduct literature review by describing relevant work related to the project and hence providing an overview of the state of the art in the area of the project. This section serves to contextualize the study within the existing body of literature, presenting a thorough review of relevant prior research and scholarly contributions. In clear and meaningful language, this section aims to demonstrate the problems, gaps, controversies, or unanswered questions that are associated with the current understanding of the topic. In addition, this section serves to highlight the current study's unique contribution to the field. By summarizing and critiquing existing works, this section provides a foundation for readers to appreciate the novelty and significance of the study in relation to the broader academic discourse. The "Literature Review" section further contributes to the `why is the project important?` question. The number of scholarly works included in the literature review may vary depending on the project.

## Methods

The first step that I took to complete this project was conducting preliminary research. This included looking into different techniques for text analysis and the different directions in which I could move forward. I also looked into what medium I could use for the user-facing end of the program and which interface would be best to work with. Finally, I looked at relevant previous fact-checking programs.

The next step for this project was finding relevant data to reference in order to verify the user's input. Fortunately, we had data provided to us by Professor Bonham-Carter that was immensely valuable. Unfortunately, however, this large corpus exceeded GitHub's repository size limitations, so the full dataset cannot be found in the GitHub repository. A small example of the data can be found in the data folder, as well as in the data README.

From this point, I was able to start implementing my artifact. The first step that I took was verifying that my program was able to parse through each text file successfully. I started by creating the path to my data and then appended each text file to a list as the program moved through them, as in the sketch below.
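A minimal sketch of this collection step follows; the `data` folder name and the function name are illustrative assumptions, not the project's actual code.

```python
# A minimal sketch of the corpus-collection step; the "data" folder
# name and function name are assumptions for illustration.
from pathlib import Path

def collect_text_files(data_dir="data"):
    """Gather every .txt file in the corpus folder into a list."""
    files = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        files.append(path)  # append each text file as it is visited
    return files
```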

The next step that I took was implementing the user interface/dashboard for the program. After some research, I settled on Streamlit for this aspect of the artifact. I decided on this because I was able to find a lot of valuable information about it, as well as helpful tutorials and demonstrations. A minimal sketch of the dashboard idea follows.
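The title and widget labels below are illustrative assumptions, not the project's exact interface; the app is launched with `streamlit run app.py`.

```python
# A minimal Streamlit sketch of the dashboard; labels are assumptions.
import streamlit as st

st.title("Colin's Fact Finder")

# Collect the scientific fact the user wants verified.
fact = st.text_input("Enter a scientific fact to check:")

if fact:
    st.write("Searching the corpus for:", fact)
```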

Following this, I was able to start working on the fact-checking implementation and text analysis. To find the best approach, I reviewed articles that outlined various libraries and techniques that I could use for this implementation. One library in particular that seemed to be a very viable option was `spaCy`. While this seemed like a good option at the time, after a lot of valuable discussion in class I turned to a different library that other students had found more success with: `fuzzywuzzy`. This library allows the programmer to compute a "similarity score," which allows for a better range of text matching in the program. This ensures that the inputted question does not have to perfectly match the text file, but rather must meet a threshold for similarity, as sketched below.
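A sketch of the similarity-score idea, using `token_set_ratio` as one possible scorer and an assumed threshold of 80; the project's actual scorer and quota may differ.

```python
# A sketch of fuzzywuzzy similarity scoring; the scorer choice and
# the threshold of 80 are assumptions for illustration.
from fuzzywuzzy import fuzz

SIMILARITY_THRESHOLD = 80  # similarity quota out of 100

def is_match(question, sentence):
    # token_set_ratio tolerates word order and extra words, so the
    # input need not match the source text exactly.
    return fuzz.token_set_ratio(question, sentence) >= SIMILARITY_THRESHOLD

print(is_match("water boils at 100 C", "At sea level, water boils at 100 C."))
```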

The last thing that I did to verify usability was manual experimentation with the program. This involved going into the Streamlit dashboard myself and asking questions as if I were a user. However, I already knew the answers to my facts and where to find them in the text files, so I could verify that the proper output was being produced. Additionally, I added a print statement to my parsing function so that I could verify that each text file was in fact being parsed through; this print statement printed the name of each file as it was appended to the list, along the lines of the sketch below.
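The collection sketch from earlier, with the temporary verification print added; printing each file name confirms that every text file is actually visited.

```python
# The earlier sketch with the temporary verification print added.
from pathlib import Path

def collect_text_files(data_dir="data"):
    files = []
    for path in sorted(Path(data_dir).glob("*.txt")):
        print(f"Parsed: {path.name}")  # temporary sanity check
        files.append(path)
    return files
```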


TODO: Discuss the methods of the project to be able to answer the `how` question (`how was this project completed?`). The methods section of an academic research paper outlines the specific procedures, techniques, and methodologies employed to conduct the study, offering a transparent and replicable framework for the research. It details the resources behind the work, in terms of, for example, the design of the algorithm and the experiment(s), data collection methods, applied software libraries, required tools, and the types of statistical analyses and models which are applied to ensure the rigor and validity of the study. This section provides clarity for other researchers to understand and potentially replicate the study, contributing to the overall reliability and credibility of the research findings.

## Using the Artifact
15 changes: 14 additions & 1 deletion Writing/summary.md
@@ -1,5 +1,18 @@
# Current Articles Reviewed

## Natural Language Processing

https://www.google.com/books/edition/Natural_Language_Processing_with_Python/lVv6DwAAQBAJ?hl=en&gbpv=1&dq=spacy+python+natural+language+processing&pg=PR15&printsec=frontcover
- This article reviews the basics of using spaCy for natural language processing. This includes its installation, setup, and the installation of desired language packages. Additionally, it addresses how the library works and the code behind it. From there, it describes how spaCy parses through text with tokenization.
- Reminder for methods section: the user must install spaCy using 'pip install spacy' and then 'spacy download en'.
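A minimal tokenization sketch following that setup; it assumes the small English model, which newer spaCy versions install with `python -m spacy download en_core_web_sm`.

```python
# A minimal spaCy tokenization sketch; assumes the small English model
# (en_core_web_sm) has already been downloaded.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Water boils at 100 degrees Celsius at sea level.")

# spaCy splits the text into tokens during parsing.
for token in doc:
    print(token.text)
```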

https://link.springer.com/book/10.1007/978-1-4842-4354-1

- This book details everything a new user may need to know about natural language processing, from background on what it is and how it works to setup and the different options the user has for execution.

## Streamlit

https://docs.streamlit.io/get-started/installation

- This website details everything needed to set up Streamlit, which was used to create the interface for this project. This includes the installation and the commands to run the program, as well as more advanced details that describe how to design the interface properly and effectively.
