Commit

Update method
hayleepierce committed Feb 22, 2024
1 parent 2aff56b commit 382d37e
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -52,9 +52,9 @@ This project uses [Streamlit](https://streamlit.io/) to create a dashboard for t
- Content
- "p"

Each article is a dictionary with each of the above pieces of information as a key-value pair. The Title is a concatenated string made up of the "article-title", a colon, and the "subtitle". Some articles do not have an "article-title" and/or "subtitle". In the case of no "subtitle", Title consists of just "article-title". If both are missing, Title is set equal to "None". The Date is also a concatenated string with the "month", "day", and "year" with a `/` between each. Some article's publication date only consists of a "month" and "year". These article dictionaries are added to a list to form the corpus.
Each article is a dictionary with each of the above pieces of information as a key-value pair. The Title is a concatenated string made up of the "article-title", a colon, and the "subtitle". Some articles do not have an "article-title" and/or "subtitle". In the case of no "subtitle", the Title consists of just the "article-title". If both are missing, the Title is set equal to "None". The Date is also a concatenated string made up of the "month", "day", and "year" with a `/` between each. Some articles' publication dates consist only of a "month" and "year". The Author(s) is a list of concatenated strings, each made up of the "surname", a comma, and the "given-names". If an author only has a "surname", only the "surname" is added to the list. Content is a list of strings, with each string being a paragraph from the article. The `lower()` method is used to make all characters in these strings lowercase. These article dictionaries are added to a list to form the corpus.
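
The rules above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the shape of the `raw` input dictionary, the dictionary keys, and the exact separators (`": "` and `", "`) are all assumptions made for the example.

```python
def build_title(article_title, subtitle):
    """Join title and subtitle with a colon, falling back as described above."""
    if article_title and subtitle:
        return f"{article_title}: {subtitle}"
    if article_title:
        return article_title
    if subtitle:
        # Assumption: the README does not cover a subtitle without a title.
        return subtitle
    return "None"


def build_date(month, day, year):
    """Join the available date parts with '/', skipping a missing day."""
    return "/".join(str(part) for part in (month, day, year) if part)


def build_authors(authors):
    """Format each (surname, given_names) pair as 'surname, given-names'."""
    formatted = []
    for surname, given_names in authors:
        if given_names:
            formatted.append(f"{surname}, {given_names}")
        else:
            formatted.append(surname)
    return formatted


def build_article(raw):
    """Build one article dictionary from already-extracted fields (hypothetical input)."""
    return {
        "Title": build_title(raw.get("article-title"), raw.get("subtitle")),
        "Date": build_date(raw.get("month"), raw.get("day"), raw.get("year")),
        "Author(s)": build_authors(raw.get("authors", [])),
        # Each paragraph is lowercased so later searches are case-insensitive.
        "Content": [paragraph.lower() for paragraph in raw.get("p", [])],
    }
```

Calling `build_article()` once per article and appending each result to a list would produce the corpus.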

The user's input is taken in as a string and the `split()` method is used to divide the string into a list. A list of English stopwords from [NLTK](https://www.nltk.org/) is used to remove stopwords from this list. The [`combinations()`](https://docs.python.org/3/library/itertools.html#itertools.combinations) function from the `itertools` module is used to create several sublists of all the different combinations of the remaining words. The sublists are ordered from the sublist containing the combinations using the most words to the sublist containing the singular words.
The user's input is taken in as a string, the `lower()` method is used to make all characters in the string lowercase, and the `split()` method is used to divide the string into a list of words. A list of English stopwords from [NLTK](https://www.nltk.org/) is used to remove stopwords from this list. The [`combinations()`](https://docs.python.org/3/library/itertools.html#itertools.combinations) function from the `itertools` module is used to create several sublists of all the different combinations of the remaining words. The sublists are ordered from the sublist containing the combinations that use the most words to the sublist containing the single words.
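
A rough sketch of this query-processing step is shown below. The function name is hypothetical, joining each combination with a single space is an assumption, and the small `STOPWORDS` set is only a stand-in so the example runs without downloads; the project itself uses NLTK's English stopword list.

```python
from itertools import combinations

# Stand-in stopword set (assumption); the README describes using NLTK's English list.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def build_query_sublists(user_input, stopwords=STOPWORDS):
    """Return sublists of word combinations, longest combinations first."""
    words = [word for word in user_input.lower().split() if word not in stopwords]
    sublists = []
    # Start with combinations that use every word and end with single words.
    for size in range(len(words), 0, -1):
        sublists.append([" ".join(combo) for combo in combinations(words, size)])
    return sublists
```

For the input `"The History of Jazz Music"`, this yields `[["history jazz music"], ["history jazz", "history music", "jazz music"], ["history", "jazz", "music"]]`. Note that `combinations()` preserves the input word order, so `"music jazz"` is never generated.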

Iterating through the sublists, the corpus is searched using the "Content" key for each article dictionary. If a string from the sublist is found in the content of an article, the article's dictionary is added to the `found_articles` list (unless it has already been added during a previous search). The final sublist (containing the singular words) is slightly different, with all words having to be found in the article's content for it to be added to `found_articles`.
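
The search loop might look something like the sketch below. It is an illustration under assumptions rather than the project's implementation: the function name is hypothetical, a string counts as "found" when it is a substring of any single paragraph, and duplicates are tracked by object identity.

```python
def search_corpus(corpus, sublists):
    """Collect matching articles, trying the longest word combinations first."""
    found_articles = []
    seen = set()  # ids of articles already added during an earlier search

    def contains(article, text):
        # Assumption: a match means the text appears inside one paragraph.
        return any(text in paragraph for paragraph in article["Content"])

    for index, sublist in enumerate(sublists):
        is_final = index == len(sublists) - 1
        for article in corpus:
            if id(article) in seen:
                continue
            if is_final:
                # Final sublist (single words): every word must appear.
                matched = all(contains(article, word) for word in sublist)
            else:
                # Earlier sublists: one matching combination is enough.
                matched = any(contains(article, phrase) for phrase in sublist)
            if matched:
                seen.add(id(article))
                found_articles.append(article)
    return found_articles
```

Because the sublists are processed from longest combinations to single words, articles that match longer phrases end up earlier in `found_articles`.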
