Issue#85: Frequency Analysis Word Cloud #94

Open
wants to merge 94 commits into base: master

Commits (94)
1e5c149
created the spring log for the documentation part of our tasks
solisa986 Mar 30, 2021
3f5ac7b
finished the spring log for issue#51
solisa986 Mar 30, 2021
2a5a4e4
Writing word frequencies to csv
Mar 31, 2021
3233a42
Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner…
Mar 31, 2021
225ce2f
Putting different run's results into separate files
Mar 31, 2021
446215a
Update textmining.py
hadenwIV Mar 31, 2021
b638b98
Categorization of words
Mar 31, 2021
902e704
Additional elaboration on functions of tasks completed
Mar 31, 2021
a065da8
Fixed name spelling
Mar 31, 2021
c208849
Added docstrings
Mar 31, 2021
9f9733b
moving all of our code files to a folder called categorize_words
donizk Apr 1, 2021
25abe2e
created interface file, began implementation for interface
donizk Apr 1, 2021
b816c40
added notes (as comments) to myself onto the __main__.py file to keep…
donizk Apr 1, 2021
60c36e2
added some test cases
solisa986 Apr 5, 2021
9455433
classifying categories of files inputted
Apr 5, 2021
1b63ffd
Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner…
Apr 5, 2021
f994779
Sorting assignment categories
Apr 5, 2021
48f9fae
finished documenting sprint 2 log and moved the categories_words.py file
solisa986 Apr 5, 2021
013b9db
formatting
solisa986 Apr 5, 2021
8b5347d
Merge branch 'issue#51' of https://github.com/Allegheny-Ethical-CS/Ga…
hadenwIV Apr 6, 2021
b95ebee
Revert "Merge branch 'issue#51' of https://github.com/Allegheny-Ethic…
enpuyou Apr 6, 2021
2e79165
Word categorization program
Apr 7, 2021
56c8689
Word categorization using training data and Scikit
Apr 7, 2021
ccd2cba
Start of the interface pipeline
Apr 7, 2021
85c9ce6
Beginning of interface page to for category frequency analysis
Apr 7, 2021
5dec4f7
Removed category classification model training data
Apr 7, 2021
5a9168e
Merge branch 'master' into issue#51
enpuyou Apr 10, 2021
eb71d6e
Development on categorization
Apr 14, 2021
4594407
Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner…
Apr 14, 2021
99e2dd1
Removed sample_md_reflections training data
Apr 14, 2021
c4d689b
Readded existing sample_md_reflections
Apr 15, 2021
598e3c1
Restored original sample_md_reflections
Apr 15, 2021
3f47837
fixing
favourojo Apr 15, 2021
cee57d9
Merge branch 'issue#51' of github.com:Allegheny-Ethical-CS/GatorMiner…
favourojo Apr 15, 2021
16bdf5d
fixed
favourojo Apr 15, 2021
f96f006
starting on wordcloud
favourojo Apr 15, 2021
3fabfe1
word cloud
favourojo Apr 21, 2021
83f2f93
Merge branch 'issue#85' of github.com:Allegheny-Ethical-CS/GatorMiner…
favourojo Apr 21, 2021
c0ed0bd
Installed wordcloud and got repository and pipfile up to date
Apr 22, 2021
d045d03
Addition of popup wordcloud of most frequent words
Apr 23, 2021
c915399
Working in GUI display of wordcloud
Apr 23, 2021
8552ef3
working on word cloud
favourojo Apr 27, 2021
a8ca53b
Fix pipfile.lock
Apr 27, 2021
97edd86
Restore markdown feature in analyzer
Apr 27, 2021
80ef017
Update pipfile to the master branch
Apr 27, 2021
c166ece
Update Pipfile.lock
Apr 27, 2021
ec09458
Update importlib.metadata
Apr 27, 2021
3d0a99c
Reupdate Pipfile.lock to master
Apr 27, 2021
1b0ec5a
Remove repeat line
Apr 27, 2021
fcb1469
Removed blank line from Pipfile.lock
Apr 27, 2021
dae590b
Update sample reflections to main
Apr 27, 2021
17ec1dc
Remove sprint log
Apr 27, 2021
74cac02
Remove word_cloud_test file
Apr 27, 2021
48242c0
Word cloud for student frequency
Apr 27, 2021
5dee753
Moved question_df in overall_freq closer to relevant code
Apr 27, 2021
bb87175
Change name of question_df to avoid confusion with dataframe in quest…
Apr 27, 2021
bc3b80d
Remove incomplete and irrelevant category_freq code
Apr 28, 2021
4973bfd
Remove writing of questions_df to streamlit
Apr 28, 2021
bbf0b4a
Delete unused word_cloud_generator file
Apr 28, 2021
ee4c16c
Restore textmining to original
Apr 28, 2021
5f55990
Delete unused frequencies.py file
Apr 28, 2021
14d9ee3
Update top of file to match master
Apr 28, 2021
6c583b2
Fix flake8 errors
Apr 28, 2021
7bd0f8f
Adding test case for concatenate
hadenwIV Apr 28, 2021
b1790c7
Adding second test case to test analyzer
hadenwIV Apr 28, 2021
7e2280c
Merge branch 'master' into issue#85
favourojo Apr 28, 2021
4fd878b
trying to fix the linting error
solisa986 Apr 28, 2021
efbe319
reverting back to the original code because the error still persists
solisa986 Apr 28, 2021
d00843d
Fixed concatenate test
Apr 28, 2021
2e1906e
Removed second named_entity_recognization test
Apr 28, 2021
de3fa02
Fix linting with lines
Apr 28, 2021
34899ca
Fix linting errors
Apr 28, 2021
7f8ea21
Merge branch 'master' into issue#85
corlettim Apr 29, 2021
cd31b77
Added wordcloud to pipfile
May 3, 2021
96d13ee
Merge branch 'issue#85' of github.com:Allegheny-Ethical-CS/GatorMiner…
May 3, 2021
37c95b4
Reset pipfile and added wordcloud
May 3, 2021
d9f7330
Reserve merge issues with master
May 3, 2021
7042f09
Update Pipfile.lock and pipfile to master
May 3, 2021
94465e9
Install wordcloud on pipenv
May 3, 2021
fd835e3
Removed space before [[source]]
May 3, 2021
07028fe
Remove git standup
May 3, 2021
18a5106
Change to importlib_metadata
May 3, 2021
d49be76
Revert to importlib-metadata
May 3, 2021
60ed6c8
Reverted importlib-metadata version
May 3, 2021
6751179
Reverted importlib-metadata hash
May 3, 2021
4248d61
Change skip pipfile to skip lock
hewittk May 3, 2021
a3f3de3
Update dependencies comment
hewittk May 3, 2021
92fe845
Remove md_dict
May 3, 2021
48230b4
Update retreive_data to main
May 3, 2021
706e52b
Condense word frequency cloud code into one method
May 3, 2021
e0398d3
Remove extraneous print statement
May 3, 2021
839e77f
Save path with item to frequency_archives
May 3, 2021
1aade5d
Add punctuation to added docstrings
hewittk Jun 9, 2021
68cf4e5
Add docstring to frequency_word_cloud
hewittk Jun 9, 2021
4 changes: 2 additions & 2 deletions .github/workflows/main.yml
@@ -21,8 +21,8 @@ jobs:
uses: dschep/install-pipenv-action@v1
- name: Install dependencies
run: |
# install dependencies according to the lock file
pipenv install --dev --ignore-pipfile --python ${{ steps.setup-python.outputs.python-version }}
# install dependencies according to the pip file
pipenv install --dev --skip-lock --python ${{ steps.setup-python.outputs.python-version }}
pipenv run python -m spacy download en_core_web_sm
- name: Run test with pytest
run: |
1 change: 1 addition & 0 deletions Pipfile
@@ -25,6 +25,7 @@ scipy = "*"
pylint = "*"
importlib-metadata = "*"
atomicwrites = "*"
wordcloud = "*"

[pipenv]
allow_prereleases = true
449 changes: 195 additions & 254 deletions Pipfile.lock

Large diffs are not rendered by default.

18 changes: 17 additions & 1 deletion src/analyzer.py
@@ -1,14 +1,17 @@
"""Text Proprocessing"""
from collections import Counter

from . import markdown as md

from textblob import TextBlob
import pandas as pd

import re
import string
from typing import List, Tuple
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from . import markdown as md

PARSER = spacy.load("en_core_web_sm")

@@ -142,6 +145,19 @@ def noun_phrase(input_text):
return n_phrase_lst


def concatenate(responses_df):
    """Return a concatenated, lower-cased string of all words in the responses."""
    words_str = ''
    # walk every cell of the responses dataframe and lower-case its tokens
    for _, row in responses_df.iterrows():
        for col in range(len(responses_df.columns)):
            val = row[col]
            tokens = val.split()
            for i in range(len(tokens)):
                tokens[i] = tokens[i].lower()
            words_str += " ".join(tokens) + " "
    return words_str


def top_polarized_word(tokens_column):
"""Create columns for positive and negative words"""
# Start off with empty lists
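For reviewers, the expected behaviour of the new concatenate helper can be seen from a short, hypothetical example; the question names and responses below are illustrative only, and the sketch assumes pandas is installed and the module is imported as src.analyzer, as in the tests.

import pandas as pd

import src.analyzer as az

# Two hypothetical reflection questions with one response each.
responses_df = pd.DataFrame({
    "Technical skill": ["Using Pipenv and Pytest"],
    "Professional skill": ["Communicating with a Team"],
})

# concatenate() lower-cases every token and joins all cells into one
# space-separated string, ready to be passed to WordCloud.generate().
words = az.concatenate(responses_df)
print(words)  # "using pipenv and pytest communicating with a team "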
40 changes: 39 additions & 1 deletion streamlit_web.py
@@ -21,21 +21,27 @@
import src.topic_modeling as tm
import src.visualization as vis

from wordcloud import WordCloud, STOPWORDS

# resources/sample_reflections/lab1, resources/sample_reflections/lab2

# initialize main_df and preprocessed_Df
SPACY_MODEL_NAMES = ["en_core_web_sm", "en_core_web_md"]
preprocessed_df = pd.DataFrame()
main_df = pd.DataFrame()
sample = []
assignments = None
assign_text = None
stu_id = None
success_msg = None
debug_mode = False

json_lst = []

main_md_dict = None



def main():
"""main streamlit function"""
# Title
@@ -165,6 +171,7 @@ def retreive_data(data_retreive):
return True



@st.cache(allow_output_mutation=True)
def load_model(name):
"""load spacy model"""
@@ -264,11 +271,12 @@ def frequency():


def overall_freq(freq_range):
"""page fore overall word frequency"""
"""page for overall word frequency."""
plots_range = st.sidebar.slider(
"Select the number of plots per row", 1, 5, value=3
)
freq_df = pd.DataFrame(columns=["assignments", "word", "freq"])

# calculate word frequency of each assignments
for item in assignments:
# combined text of the whole assignment
@@ -288,6 +296,13 @@
)
)

responses_end = len(main_df.columns) - 3
responses_df = main_df[main_df.columns[1:responses_end]]
responses_df.replace("", "NA")

frequency_word_cloud(responses_df)

freq_df.to_csv('frequency_archives' + os.path.sep + str(item) + '.csv')

def student_freq(freq_range):
"""page for individual student's word frequency"""
@@ -331,6 +346,12 @@
)
)

responses_end = len(stu_assignment.columns) - 3
responses_df = stu_assignment[stu_assignment.columns[1:responses_end]]
responses_df.replace("", "NA")

frequency_word_cloud(responses_df)


def question_freq(freq_range):
"""page for individual question's word frequency"""
@@ -377,6 +398,23 @@
plots_per_row=plots_range,
)
)
frequency_word_cloud(question_df)


def frequency_word_cloud(responses_df):
"""Build wordcloud out of page's responses."""
# concatenate all words into normalized string and make into wordcloud
words = az.concatenate(responses_df)
cloud_stopwords = set(STOPWORDS)
wordcloud = (WordCloud(width = 800, height = 800,
background_color = 'white',
stopwords = cloud_stopwords,
min_font_size = 10).generate(words))

# plot wordcloud by temporarily saving it as a file and displaying it
wordcloud.to_file("resources/images/word_cloud.png")
st.image("resources/images/word_cloud.png")
os.remove("resources/images/word_cloud.png")


def sentiment():
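A possible follow-up, not part of this pull request: the temporary PNG written by frequency_word_cloud could be avoided by rendering the cloud in memory. A minimal sketch under that assumption, using the wordcloud and streamlit packages already listed in the Pipfile; frequency_word_cloud_in_memory is a hypothetical name.

import streamlit as st
from wordcloud import STOPWORDS, WordCloud

import src.analyzer as az


def frequency_word_cloud_in_memory(responses_df):
    """Build a word cloud from the page's responses without a temp file."""
    words = az.concatenate(responses_df)
    cloud = WordCloud(
        width=800,
        height=800,
        background_color="white",
        stopwords=set(STOPWORDS),
        min_font_size=10,
    ).generate(words)
    # WordCloud.to_array() returns the rendered image as a numpy array,
    # which st.image() accepts directly, so nothing is written to disk.
    st.image(cloud.to_array())

This would keep the Streamlit page self-contained and drop the write/remove round-trip through resources/images/.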
19 changes: 18 additions & 1 deletion tests/test_analyzer.py
@@ -1,4 +1,5 @@
"""Test module for analyzer.py"""

import pytest
import src.analyzer as az
import pandas as pd
@@ -146,7 +147,7 @@ def test_sentence_tokenize():


def test_tfidf():
"""test tfidf return result"""
"""Test tfidf return result."""
input_tokens = [
"test",
"tokenize",
@@ -161,6 +162,22 @@
assert vector is not None


def test_concatenate():
    """Test for concatenated string of all words."""
input_dict = {
"What was the most important technical skill that you practiced?":
["Using pipenv and pytest", "Naming variables in Python"],
"What was the most important professional skill that you practiced?":
["Communicating with a team remotely", "Resolving issues by talking \
to teammates"]
}
input_df = pd.DataFrame(input_dict)
output = az.concatenate(input_df)
expected = "using pipenv and pytest communicating with a team remotely \
naming variables in python resolving issues by talking to teammates "
assert output == expected


def test_top_polarized_word():
"""Tests if the positive/negative words columns are created"""
df = pd.DataFrame(columns=[cts.TOKEN, cts.POSITIVE, cts.NEGATIVE])
Binary file added text_classifier
Binary file not shown.