Issue#33 ~ Documentation Website #89

Open: wants to merge 23 commits into base: master
3 changes: 3 additions & 0 deletions Pipfile
Original file line number Diff line number Diff line change
@@ -25,6 +25,9 @@ scipy = "*"
pylint = "*"
importlib-metadata = "*"
atomicwrites = "*"
poetry = "*"
mkdocs = "*"
mkdocs-material = "*"

[pipenv]
allow_prereleases = true
489 changes: 374 additions & 115 deletions Pipfile.lock

Large diffs are not rendered by default.

501 changes: 501 additions & 0 deletions poetry.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions runtime.txt
@@ -0,0 +1 @@
3.6.3
3 changes: 3 additions & 0 deletions scripts/netlify-build.sh
@@ -0,0 +1,3 @@
pip install poetry
poetry install
mkdocs build
31 changes: 31 additions & 0 deletions website/docs/b_installation.md
@@ -0,0 +1,31 @@
# Installation

You can clone the repository by running the following command:
```
git clone git@github.com:Allegheny-Ethical-CS/GatorMiner.git
```

`cd` into the project root folder:
```
cd GatorMiner
```

This program uses [Pipenv](https://github.com/pypa/pipenv) for dependency management:
- If needed, install or upgrade `pipenv` with `pip`:
```
pip install pipenv -U
```
- To create a default virtual environment and use the program:
```
pipenv install
```

GatorMiner relies on `en_core_web_sm` and/or `en_core_web_md`, English models trained on written web text (blogs, news, comments, etc.) that include vocabulary, vectors, syntax, and entities.

To install a pre-trained model, run one or both of the following commands.
```
pipenv run python -m spacy download en_core_web_sm
```
```
pipenv run python -m spacy download en_core_web_md
```
47 changes: 47 additions & 0 deletions website/docs/c_web-interface.md
@@ -0,0 +1,47 @@
# Web Interface

GatorMiner is used mainly through its web interface, which is built with [Streamlit](https://streamlit.io/) to provide fast text analysis and visualizations.

In order to run the `Streamlit` interface, type and execute the following command in your terminal:

```
pipenv run streamlit run streamlit_web.py
```

You will then see something like this in your terminal window:
```
You can now view your Streamlit app in your browser.

Local URL: http://localhost:8501
Network URL: http://xxx.xxx.x.x:8501
```

The web interface will automatically be opened in your browser.

## Data Retrieving

There are currently two ways to import text data for analysis: through the local file system or through AWS DynamoDB.

### Local File System

You can type in the path(s) to the directories that hold reflection markdown documents. You are welcome to try the tool with the sample documents we provided in `resources`, for example:

```
resources/sample_md_reflections/lab1, resources/sample_md_reflections/lab2, resources/sample_md_reflections/lab3
```
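
As a rough illustration of how a comma-separated list of directories like the one above could be split and scanned for markdown files, consider the sketch below. The function names `parse_directories` and `find_markdown_files` are hypothetical and are not taken from GatorMiner's source; the actual import logic may differ.

```python
from pathlib import Path


def parse_directories(text):
    """Split a comma-separated string of directory paths, dropping blanks."""
    return [part.strip() for part in text.split(",") if part.strip()]


def find_markdown_files(directories):
    """Return the .md files found directly inside each existing directory."""
    files = []
    for directory in directories:
        path = Path(directory)
        # Silently skip paths that do not exist or are not directories.
        if path.is_dir():
            files.extend(sorted(path.glob("*.md")))
    return files


paths = parse_directories(
    "resources/sample_md_reflections/lab1, resources/sample_md_reflections/lab2"
)
print(paths)
print(find_markdown_files(paths))
```

Run from the project root, the second `print` would list whatever markdown files exist in those sample directories; from any other location it prints an empty list.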

### AWS

Retrieving reflection documents from AWS is a feature that integrates with [GatorGrader](https://github.com/GatorEducator/gatorgrader), which collects students' markdown reflection documents and stores them in a pre-configured DynamoDB database. To use this feature, you need the credential tokens listed below stored as environment variables:

```Bash
export GATOR_ENDPOINT=<Your Endpoint>
export GATOR_API_KEY=<Your API Key>
export AWS_ACCESS_KEY_ID=<Your Access Key ID>
export AWS_SECRET_ACCESS_KEY=<Your Secret Access Key>
```

It is likely that you already have these prepared when using GatorMiner in conjunction with GatorGrader, since these would already be exported when setting up the AWS services. You can read more about setting up an AWS service with GatorGrader [here](https://github.com/enpuyou/script-api-lambda-dynamodb).
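
Before launching the interface, it can help to confirm that all four variables are actually set. The following is a minimal sketch of such a check; the helper `missing_variables` is a hypothetical name for illustration and is not part of GatorMiner.

```python
import os

# The four variable names listed in the export commands above.
REQUIRED_VARIABLES = [
    "GATOR_ENDPOINT",
    "GATOR_API_KEY",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All AWS-related environment variables are set.")
```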

Once the documents are successfully imported, you can then navigate through the select box in the sidebar to view the text analysis.
7 changes: 7 additions & 0 deletions website/docs/d_documentRequirement.md
@@ -0,0 +1,7 @@
# Document Requirements

GatorMiner uses the markdown format for student reflection documents because its organized structure makes the documents easy to parse and analyze. A reflection document must meet a few requirements before GatorMiner can process it seamlessly. A [template](https://github.com/Allegheny-Ethical-CS/GatorMiner/blob/master/resources/reflection_template.md) is provided in the repository. Note that the headers with the assignment's and student's ID/name are required: by default, GatorMiner takes the first header as the assignment name and the second header as the student name.
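
To make the header convention concrete, here is a small sketch of reading the first two headers from a reflection document. It assumes headers are ATX-style lines beginning with `#` and ignores edge cases such as `#` lines inside fenced code; the function names are hypothetical and GatorMiner's own parser may work differently.

```python
import re

# One to six leading '#' characters, whitespace, then the header text.
HEADER_PATTERN = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def extract_headers(markdown_text):
    """Return the text of every ATX-style header, in document order."""
    headers = []
    for line in markdown_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            headers.append(match.group(1))
    return headers


def assignment_and_student(markdown_text):
    """Take the first header as the assignment and the second as the student."""
    headers = extract_headers(markdown_text)
    if len(headers) < 2:
        raise ValueError("expected at least two headers")
    return headers[0], headers[1]


sample = "# Lab 1\n\n## Student A\n\n### What did you learn?\n\nA lot.\n"
print(assignment_and_student(sample))
```

A document with fewer than two headers raises an error here, which mirrors why both headers are required.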

You can also check out the [sample json report](https://github.com/Allegheny-Ethical-CS/GatorMiner/blob/master/resources/sample_json_report/report%201.json) to see the format of json reports GatorMiner gathers from AWS.

Once the documents are correctly uploaded, select the assignments that you want to compare.
23 changes: 23 additions & 0 deletions website/docs/e_frequencyAnalysis.md
@@ -0,0 +1,23 @@
# Frequency Analysis

## What is it?

Frequency analysis quantifies word usage in a text, that is, how often each word appears. It can reveal aspects of assignments that instructors may not otherwise be able to observe, and GatorMiner's frequency analysis aims to make this information available in a user-friendly and intuitive way.

Within the GatorMiner tool, you have the ability to choose `Frequency Analysis` as an analysis option after the path to the desired reflection documents is submitted.

## How to use it?

When the tool runs a frequency analysis it provides 3 different options to choose from in the left sidebar:

- Overall
- Student
- Question

When `Overall` is selected, the application will display a vertical bar chart containing a list of the words used with the highest frequency for each given assignment.

When `Student` is selected, a dropdown menu is provided allowing you to pick which student the tool should display frequency data for.

As with `Overall`, this data is also displayed as a vertical bar chart and you can display multiple students' data on the same page in order to compare and contrast the types of words that are being used by students.

The left sidebar also has an option to change how many of the most frequent words are calculated. You can choose anywhere from the single most used word to the 20 most used words. You can also choose anywhere from 1 to 5 bar charts to be displayed.
26 changes: 26 additions & 0 deletions website/docs/f_sentimentAnalysis.md
@@ -0,0 +1,26 @@
# Sentiment Analysis

## What is it?

Sentiment analysis (or opinion mining) is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. In short, it is a technique for determining whether text is positive, negative, or neutral.

A sentiment score between -1 and 1 is calculated. A value close to -1 corresponds to a strong negative sentiment in the document; similarly, a value close to 1 corresponds to a strong positive sentiment. The closer the value is to 0, the more neutral the sentiment is.
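
The interpretation of the score can be sketched as a simple mapping from a polarity value to a coarse label. The function `describe_polarity` and its `neutral_band` threshold of 0.1 are illustrative assumptions, not GatorMiner settings; the tool itself reports the numeric score.

```python
def describe_polarity(score, neutral_band=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    neutral_band is an assumed cut-off chosen only for illustration.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity must be between -1 and 1")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


for value in (-0.8, 0.05, 0.6):
    print(value, describe_polarity(value))
```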

Within the GatorMiner tool, you have the ability to choose `Sentiment Analysis` as an analysis option after the path to the desired reflection documents is submitted.

## How to use it?

When the tool runs a sentiment analysis, it provides 3 different options to choose from in the left sidebar:

- Overall
- Student
- Question

When `Overall` is selected, a scatter plot and a bar chart appear, showing the overall sentiment polarity of the reflections for each selected assignment (for example, assignment-01).

When `Student` is selected, you can choose a specific student to inspect. The tool then shows that student's sentiment as a small bar graph and as a larger histogram. You can also change the number of plots per row.

Finally, when `Question` is selected, you can choose a question from the drop-down menu. The tool then shows the sentiment of the responses to that question.

The left sidebar also has an option to change how many of the most frequent words are calculated. You can choose anywhere from the single most used word to the 20 most used words. You can also choose anywhere from 1 to 5 bar charts to be displayed.
22 changes: 22 additions & 0 deletions website/docs/g_documentSimilarity.md
@@ -0,0 +1,22 @@
# Document Similarity

## What is it?

Document similarity compares the text of documents to determine how similar their word usage is.

Within the GatorMiner tool, you have the ability to choose `Document Similarity` as an analysis option after the path to the desired reflection documents is submitted.

## How to use it?

In the `Document Similarity` section, you can select the type of similarity analysis: `TF-IDF` or `Spacy`.

When `TF-IDF` is selected, the application will display a matrix showing the correlation between documents. It does this by weighting each word's term frequency (its count divided by the total number of terms in the document) by the word's inverse document frequency.

When `Spacy` is selected, the application will display a drop down named 'Model name' with two options:

- `en_core_web_sm`, which is used to produce a correlation matrix for **SMALLER** files (<13 MB).
- `en_core_web_md`, which is used to produce a correlation matrix for **LARGER** files (>13 MB).

**Warning: exceeding these file limits could cause the program to crash.**

**See [Spacy.io](https://spacy.io/models/en) for more details of file limits.**
11 changes: 11 additions & 0 deletions website/docs/g_summary.md
@@ -0,0 +1,11 @@
# Summary

## What is it?

Summary provides a chart containing all of the information pulled from the uploaded documents. This information includes the assignment title, the author of the reflection, and the answers to all of the questions asked in all of the documents. Therefore, if a question was asked in assignment A, but not in assignment B, the table will be blank in the assignment B column under the row that asks the question given in assignment A.
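
The blank-cell behavior can be illustrated with a small sketch that aligns answers by question across assignments. The function `build_summary` and the input layout (a list of dicts with `assignment` and `answers` keys) are assumptions made for this example, not GatorMiner's internal data model.

```python
def build_summary(documents):
    """Build a table with one row per question and one column per assignment.

    Each document is a dict with an "assignment" name and an "answers" dict
    mapping question text to the answer. A question that an assignment did
    not ask is left as an empty string in that assignment's column.
    """
    # Collect every question once, in order of first appearance.
    questions = []
    for document in documents:
        for question in document["answers"]:
            if question not in questions:
                questions.append(question)

    table = {}
    for question in questions:
        table[question] = {
            document["assignment"]: document["answers"].get(question, "")
            for document in documents
        }
    return table


docs = [
    {"assignment": "A", "answers": {"Q1": "x", "Q2": "y"}},
    {"assignment": "B", "answers": {"Q1": "z"}},
]
for question, row in build_summary(docs).items():
    print(question, row)
```

Here `Q2` was asked only in assignment A, so its cell under assignment B is empty.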

Within the GatorMiner tool, you have the ability to choose `Summary` as an analysis option after the path to the desired reflection documents is submitted.

## How to use it?

When `Summary` is selected, the program automatically creates a table with all of the information. There are no other options for this analysis selection.
20 changes: 20 additions & 0 deletions website/docs/h_topicModeling.md
@@ -0,0 +1,20 @@
# Topic Modeling

## What is it?

Topic modeling analyzes documents to find keywords in order to determine the documents' dominant topics.

Within the GatorMiner tool, you have the ability to choose `Topic Modeling` as an analysis option after the path to the desired reflection documents is submitted.

## How does it work?

When the tool runs a topic modeling analysis it provides 2 different options to choose from in the left sidebar:

- Histogram
- Scatter

When `Histogram` is selected, the application will display a histogram in which the dominant topic is on the x-axis and the count of records is on the y-axis. A legend in the top right corner will display the names of the reflection files next to the colors that correspond to them.

When `Scatter` is selected, the application will display a scatter plot. The legend on the right side will display the colors that correspond to topic numbers and the shapes that correspond with topics.

Sliders are also provided to adjust the number of topics and the number of words per topic.
37 changes: 37 additions & 0 deletions website/docs/y_howItWorks.md
@@ -0,0 +1,37 @@
# Important Tools used in the Creation of GatorMiner

## Getting Started

Pipenv is used to handle GatorMiner's dependencies. Pipenv is a Python-specific tool that plays the role that other packaging tools (bundler, composer, npm, cargo, yarn, etc.) play in their ecosystems. Pipenv creates and manages a virtual environment (virtualenv) that keeps the project's dependencies separate. When you install or uninstall packages, Pipenv adds them to or removes them from the Pipfile. It also generates the Pipfile.lock, which pins the versions of the dependencies to be used.

GatorMiner was developed with Streamlit, a Python library for easily creating visually appealing web apps.

## Frequency Analysis and Sentiment Analysis

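`spacy.load` from the spaCy library reads a pipeline's configuration and loads the trained pipeline (such as `en_core_web_sm`), which is then used to process the text of the documents. spaCy is a natural language processing library used to analyze and understand large amounts of text.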

Regular expressions are provided through Python's `re` module. They are used to match certain strings in the document data.

`Counter`, a `dict` subclass from the `collections` module, counts hashable objects. The objects are stored as the dictionary keys and their counts are stored as the values.
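
A minimal sketch of how `re` and `Counter` can combine into a word-frequency count follows. The function `most_frequent_words` and its tokenization pattern are assumptions for illustration; this sketch does no stop-word removal or lemmatization, which a real pipeline would likely add.

```python
import re
from collections import Counter


def most_frequent_words(text, top_n=10):
    """Return the top_n (word, count) pairs, ignoring case and punctuation."""
    # Lowercase the text and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top_n)


sample = "The lab was hard, but the lab was fun."
print(most_frequent_words(sample, top_n=3))
```

`most_common` returns the pairs sorted from the highest count down, which is the shape a bar chart of top words needs.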

`TfidfVectorizer` from the `sklearn.feature_extraction.text` module is used to convert raw documents to a matrix of TF-IDF features. TF-IDF evaluates how relevant a word is to a document relative to a collection of documents. It is calculated by multiplying the number of times a word appears in a document by the word's inverse document frequency.
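
The calculation can be worked through by hand with a standard-library sketch. This uses the plain textbook definitions noted in the docstring; scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    tf(word, doc) = count of word in doc / number of tokens in doc
    idf(word)     = log(number of docs / number of docs containing word)
    """
    doc_count = len(documents)

    # Count in how many documents each word appears at least once.
    doc_frequency = Counter()
    for tokens in documents:
        doc_frequency.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {
                word: (count / total) * math.log(doc_count / doc_frequency[word])
                for word, count in counts.items()
            }
        )
    return weights


docs = [["python", "is", "fun"], ["python", "is", "hard"]]
for row in tf_idf(docs):
    print(row)
```

Words that occur in every document, such as `python` here, receive a weight of zero, while words unique to one document receive the largest weights.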

`CountVectorizer` from the `sklearn.feature_extraction.text` module is used to create a vector of term counts from the words in the documents. The difference between `TfidfVectorizer` and `CountVectorizer` is that `TfidfVectorizer` returns floating-point weights while `CountVectorizer` returns integer counts.

## Document similarity

`TfidfVectorizer` from the `sklearn.feature_extraction.text` module is used to convert raw documents to a matrix of TF-IDF features. TF-IDF evaluates how relevant a word is to a document relative to a collection of documents. It is calculated by multiplying the number of times a word appears in a document by the word's inverse document frequency.

`numpy.dot` is used to find the dot product of two arrays. NumPy is a library that is helpful when analyzing large and complex arrays and matrices. It can also perform mathematical functions on these arrays.
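
One common way a dot product becomes a similarity score is cosine similarity, sketched below with the standard library in place of NumPy. Whether GatorMiner normalizes its vectors this way is an assumption; the functions `dot` and `cosine_similarity` are written only for this example.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after scaling each to unit length.

    Returns 0.0 if either vector is all zeros.
    """
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


# Vectors pointing in the same direction, then two orthogonal vectors.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Applied to rows of a TF-IDF matrix, a score near 1 means two documents weight the same words similarly, and a score of 0 means they share no weighted words.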

spaCy is used to compute document similarity. spaCy is a natural language processing library used to analyze and understand large amounts of text.

## Summary

`summarize` from the `gensim.summarization.summarizer` module is used to summarize all of the documents uploaded to GatorMiner. It uses the TextRank algorithm. The Gensim library is used for topic modeling, document indexing, and similarity retrieval with large collections of data.

## Topic Modeling

Gensim is used to create a dictionary from a list of words by assigning a key to each word. It is also used to create an LDA model. The LDA model tries to determine topics based on the text in the documents. The Gensim library is used for topic modeling, document indexing, and similarity retrieval with large collections of data.
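
The dictionary step can be pictured with a standard-library stand-in. This is not Gensim's API; `build_dictionary` and `to_bag_of_words` are hypothetical names that only illustrate the idea of mapping each word to an integer key and then describing a document by key counts, which is the kind of input an LDA model is trained on.

```python
from collections import Counter


def build_dictionary(tokenized_documents):
    """Assign an integer ID to each distinct word, in order of first appearance."""
    word_to_id = {}
    for tokens in tokenized_documents:
        for word in tokens:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)
    return word_to_id


def to_bag_of_words(tokens, word_to_id):
    """Convert tokens to a sorted list of (word_id, count) pairs."""
    # Words missing from the dictionary are skipped.
    counts = Counter(word_to_id[word] for word in tokens if word in word_to_id)
    return sorted(counts.items())


documents = [["loops", "are", "loops"], ["are", "functions"]]
dictionary = build_dictionary(documents)
print(dictionary)
print([to_bag_of_words(tokens, dictionary) for tokens in documents])
```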

The Pandas DataFrame is used to create a two-dimensional, table-like structure with labeled axes.
30 changes: 30 additions & 0 deletions website/mkdocs.yml
@@ -0,0 +1,30 @@
site_name: GatorMiner
repo_url: https://github.com/Allegheny-Ethical-CS/GatorMiner
repo_name: Allegheny-Ethical-CS/GatorMiner

theme:
  name: material

markdown_extensions:
  - toc:
      permalink: "#"
  - smarty
  - admonition
  - footnotes
  - codehilite
  - pymdownx.arithmatex
  - pymdownx.betterem:
      smart_enable: all
  - pymdownx.caret
  - pymdownx.critic
  - pymdownx.details
  - pymdownx.emoji:
      emoji_generator: !!python/name:pymdownx.emoji.to_svg
  - pymdownx.inlinehilite
  - pymdownx.magiclink
  - pymdownx.mark
  - pymdownx.smartsymbols
  - pymdownx.superfences
  - pymdownx.tasklist:
      custom_checkbox: true
  - pymdownx.tilde
3 changes: 3 additions & 0 deletions website/netlify.toml
@@ -0,0 +1,3 @@
[build]
publish = "site"
command = "bash scripts/netlify-build.sh"
16 changes: 16 additions & 0 deletions website/pyproject.toml
@@ -0,0 +1,16 @@
[tool.poetry]
name = "GatorMiner"
version = "0.1.0"
description = "A documentation website for the interactive text analysis tool, GatorMiner."
authors = ["kailaniwoodard <[email protected]>"]

[tool.poetry.dependencies]
python = "^3.6"
mkdocs = "^1.1.2"

[tool.poetry.dev-dependencies]
mkdocs-material = "^7.1.2"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"