# CSI4107 Assignment 1

## Reference

http://www.site.uottawa.ca/~diana/csi4107/A1_2021/A1_2021.htm

## Group Members

Dmitry Kutin - 300015920
Dilanga Algama - 8253677
Joshua O Erivwo - 8887065

## Task Distribution

Dmitry Kutin
- Step 1, Step 3 (Improvements), Step 5, README.

Dilanga Algama
- Step 2, Step 3 (Initial Implementation), Step 4.

Joshua O Erivwo
- Step 3, README.

## Setting up

Prerequisites:

1. `python3` installed and executable.
2. `nltk` libraries installed (all of these can be downloaded using `python3` -> `import nltk` -> `nltk.download('corpus | tokenize | stem.porter')`):
- `corpus`
- `tokenize`
- `stem.porter`

**Note: These will be downloaded and imported by `main.py` during execution**
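For reference, here is a minimal setup sketch. Note that `nltk`'s actual download identifiers differ slightly from the module names above; this assumes the standard package names (`stopwords` for the stopword corpus, `punkt` for the tokenizer models):

```python
# One-time nltk setup, roughly equivalent to the manual steps above.
import nltk

nltk.download('stopwords')   # English stopword corpus
nltk.download('punkt')       # models behind nltk.tokenize.word_tokenize

# The Porter stemmer ships with nltk itself and needs no separate download.
from nltk.stem.porter import PorterStemmer
```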

## Project Overview

The `assets/` directory contains all the information provided for this assignment:
- `tweet_list.txt` - Contains the list of documents (tweets).
- `stop_words.txt` - A collection of stopwords.
- `test_queries.txt` - A collection of 49 test queries.
- `Trec_microblog11-qrels.txt` - Provided relevance feedback file.

The `dist/` directory is where results of the execution are stored.
- `Results.txt` - Contains a collection of all 49 test queries, and their corresponding relevant documents, ordered by highest to lowest relevance.
- `trec_eval.txt` - Resulting file from Trec Eval execution. This file contains a detailed comparison of `Results.txt` against `Trec_microblog11-qrels.txt`.


## Execution

Once all of the prerequisites are met, the program can be run with:

`python3 main.py`

This will generate `Results.txt` in the `dist/` directory in the following format:

    Topic_id Q0 docno             rank score              tag
    1        Q0 30198105513140224 1    0.588467208018523  myRun
    1        Q0 30260724248870912 2    0.5870127971399565 myRun
    1        Q0 32229379287289857 3    0.5311552466369023 myRun

## Evaluation

To evaluate the effectiveness of our Microblog retrieval system:

- Run the `eval.sh` script. This creates a text file called `trec_eval.txt` listing the overall performance measures of the system across all queries as a whole.

- Run the `full-eval.sh` script to see all the trec_eval measures for each individual query. This creates a text file called `trec_eval_all_query.txt` listing every measure the trec_eval module offers, per query.

## Functionality

Our task for this assignment was to implement an Information Retrieval (IR) system for a collection of documents (Twitter messages). A quick recap of what our code does as a whole is as follows:

1. We import both data files, one with the test queries and the other with the list of tweets, and parse them into a form our functions can read (we used dictionaries to store the data). This step also runs every word through stop-word removal and Porter stemming.

2. We create an inverted index dictionary for all the words in each tweet in the list of tweets. During this process we calculate the `idf` for each word and the `tf-idf` for each word within each tweet, both of which are added to the dictionary created in this step.

3. We calculate the `idf` and `tf-idf` for the words in the test queries. We then combine these with the measures from step 2 to compute the vector lengths of the queries and tweets, and use those lengths to calculate the `CosSim` score of each tweet against the query. The tweets are ordered in a dictionary from highest to lowest similarity score and passed to step 4 (see the scoring sketch after this list).

4. We use the data calculated in step 3 to write the results to a text file (`Results.txt`) in the format specified in the assignment.
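The sketch below pulls steps 2 and 3 together in a self-contained form. Variable names and the exact tf-idf and normalization details are illustrative assumptions, not the project's literal code:

```python
import math

def cosine_scores(query_tokens, docs):
    """Rank the documents in `docs` ({doc_id: [tokens]}) against one query."""
    N = len(docs)

    # Step 2: document frequency -> idf per term.
    df = {}
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    idf = {term: math.log10(N / n) for term, n in df.items()}

    def tf_idf(tokens):
        tf = {}
        for term in tokens:
            tf[term] = tf.get(term, 0) + 1
        return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

    # Step 3: cosine similarity between query and document weight vectors.
    query = tf_idf(query_tokens)
    query_len = math.sqrt(sum(w * w for w in query.values()))
    scores = {}
    for doc_id, tokens in docs.items():
        doc = tf_idf(tokens)
        doc_len = math.sqrt(sum(w * w for w in doc.values()))
        dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
        if query_len > 0 and doc_len > 0 and dot > 0:
            scores[doc_id] = dot / (query_len * doc_len)

    # Step 4 consumes this: highest similarity first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```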

## Algorithms, Data Structures, and Optimizations

Our implementation of the information retrieval system follows the guidelines provided in the assignment. The project contains five Python files implementing the IR system.

### Project Specific Files

#### `main.py`:
This file contains the `main()` function. In `main()`, we start by importing the functions used to implement the IR system. The first step imports the tweets and the queries from the `assets/` folder; `step 1: preprocessing` happens during this import, via the `filterSentence` function called directly inside the import functions. After importing and filtering the tweets and queries, we build the `inverted index` for the tweets and compute the `length of each document`, then retrieve the lengths of the queries the same way. In the `retrieval` function, the `CosSim` scores are calculated and ranked in descending order. To make the progress of `main()` visible, a set of print statements notifies the user when preprocessing and ranking are done, and when the results file has been created.
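A condensed sketch of that flow is shown below. The function names are taken from this README, but their exact signatures, and the name of the `write.py` entry point, are assumptions:

```python
# Hypothetical wiring of the modules described in this section.
from preprocess import importTweets, importQuery, buildIndex, lengthOfDocument
from result import retrieval
from write import writeResults  # hypothetical name for write.py's entry point

def main():
    print('Preprocessing tweets and queries...')
    tweets = importTweets()                        # step 1: import + filterSentence
    queries = importQuery()
    index, word_idf = buildIndex(tweets)           # step 2: inverted index, idf, tf-idf
    doc_lengths = lengthOfDocument(index, tweets)
    print('Ranking documents...')
    ranked = retrieval(queries, index, word_idf, doc_lengths)  # step 3: CosSim, descending
    writeResults(ranked)                           # step 4: dist/Results.txt
    print('Results file created.')

if __name__ == '__main__':
    main()
```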
#### `preprocess.py`:
This file implements `step 1: preprocessing` and `step 2: indexing`. Below are the functions implemented in `preprocess.py`:
- `isNumeric(subject)`: checks whether a string contains numerical values.
- `importTweets()`: imports the tweets from the collection. We open the text file, then filter each tweet using our `filterSentence` function.
- `importQuery()`: imports the queries from the collection; same process as `importTweets()`.
- `filterSentence(sentence)`: filters sentences from tweets and queries. This function builds a list of `stopwords` and then `tokenizes` each word in the sentence, removing any numerics, punctuation, or stopwords contained in the list. Each imported tweet and query runs through NLTK's stopword list, our custom stopword list (which covers URLs and abbreviations), and the provided stopword list. After this step, each word is tokenized and stemmed with the `Porter stemmer`. The use of `tokenization`, `stopwords`, and the `Porter stemmer` is discussed in depth under the `Additional Libraries` section.
- `buildIndex(documents)`: builds the inverted index with an entry for each word in the vocabulary. The underlying data structure is the hash map; in Python, dictionaries are the equivalent, and we use them to store the processed data for both documents and queries. We initialize the `inverted index` and the `word_idf` as empty dictionaries that are returned at the end. We first store the frequency of each word inside the already-filtered documents, then calculate the `IDF` and the `TF-IDF` for every word in every document (see the sketch after this list).
- `lengthOfDocument(index, tweets)`: calculates the document length for each tweet.
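Here is a sketch of the index layout this describes; the exact dictionary shapes used in `preprocess.py` are assumptions:

```python
import math

def build_index(documents):
    """documents: {doc_id: [filtered tokens]} -> (inverted_index, word_idf)"""
    inverted_index = {}   # {term: {doc_id: raw term frequency}}
    for doc_id, tokens in documents.items():
        for token in tokens:
            postings = inverted_index.setdefault(token, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1

    # idf for every vocabulary term: log(N / document frequency).
    N = len(documents)
    word_idf = {term: math.log10(N / len(postings))
                for term, postings in inverted_index.items()}

    # Replace the raw counts with tf-idf weights, as described above.
    for term, postings in inverted_index.items():
        for doc_id, tf in postings.items():
            postings[doc_id] = tf * word_idf[term]

    return inverted_index, word_idf
```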
#### `result.py`:
This file contains the function that calculates the cosine similarity values for the set of documents against each query and ranks the similarity scores in descending order. Dictionaries are the main structures storing the `query_index`, `retrieval`, and `query_length` values. The function consists mainly of `for` loops: we first count the occurrences of each token in a query, then calculate the `TF-IDF` and the `length of the query`. With those in hand, we compute the `CosSim` values and rank the documents in the specified order.
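The final ordering amounts to a single sort over the score dictionary. A one-line sketch, where `scores` stands in for the dictionary of `CosSim` values built above:

```python
# Order document ids by similarity score, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```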
#### `write.py`:
This file implements `step 4`. The function builds a table from the results generated in `result.py` and stores it in the `dist/` folder as a text file.
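A sketch of that output step using `prettytable` is below. The row source and the formatting options are assumptions; the column layout matches the `Results.txt` excerpt shown earlier:

```python
from prettytable import PrettyTable

def write_results(ranked_runs, path='dist/Results.txt'):
    """ranked_runs: iterable of (topic_id, doc_id, rank, score) tuples."""
    table = PrettyTable()
    table.field_names = ['Topic_id', 'Q0', 'docno', 'rank', 'score', 'tag']
    table.border = False  # plain, space-separated layout (an assumption)
    for topic_id, doc_id, rank, score in ranked_runs:
        table.add_row([topic_id, 'Q0', doc_id, rank, score, 'myRun'])
    with open(path, 'w') as out:
        out.write(table.get_string())
```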

### Additional Libraries

#### Prettytable (`prettytable.py`):

A helper library to format the output for the `Results.txt` file. Used in the implementation of `write.py`.

#### NLTK:

#### PorterStemmer
The Porter stemmer is an external resource used in the implementation of `filterSentence(sentence)`. It normalizes each token that is created: stemming removes the morphological and inflectional endings from words in the text file.
#### Stopwords
Stopwords are also used in preprocessing the data. Since stopwords are common words that generally do not contribute to the meaning of a sentence, we filter them out, as done in the `filterSentence(sentence)` function.
#### Tokenizer
We tokenize our data in `filterSentence(sentence)` to provide a common representation linking queries and documents. Tokens are sequences of alphanumeric characters separated by non-alphanumeric characters; tokenization is performed as part of preprocessing (a `step 1` requirement). A combined sketch follows.
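Taken together, the three pieces amount to the normalization below; a sketch assuming the `punkt` and `stopwords` data are installed, with one sample entry standing in for the custom stopword list:

```python
import string

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words('english')) | {'rt'}  # 'rt' as a sample custom entry
STEMMER = PorterStemmer()

def normalize(sentence):
    tokens = word_tokenize(sentence.lower())
    return [STEMMER.stem(tok) for tok in tokens
            if tok not in STOPWORDS
            and tok not in string.punctuation
            and not tok.isnumeric()]

# e.g. normalize('BBC World Service staff cuts') -> ['bbc', 'world', 'servic', 'staff', 'cut']
```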

## Final Result Discussion
The following is the evaluation of our system using the trec_eval script by comparing our results (`dist/Results.txt`) with the expected results from the provided relevance feedback file.

    runid                  all  myRun
    num_q                  all  49
    num_ret                all  39091
    num_rel                all  2640
    num_rel_ret            all  2054
    map                    all  0.1634
    gm_map                 all  0.0919
    Rprec                  all  0.1856
    bpref                  all  0.1465
    recip_rank             all  0.3484
    iprec_at_recall_0.00   all  0.4229
    iprec_at_recall_0.10   all  0.3001
    iprec_at_recall_0.20   all  0.2653
    iprec_at_recall_0.30   all  0.2195
    iprec_at_recall_0.40   all  0.2025
    iprec_at_recall_0.50   all  0.1770
    iprec_at_recall_0.60   all  0.1436
    iprec_at_recall_0.70   all  0.1230
    iprec_at_recall_0.80   all  0.1027
    iprec_at_recall_0.90   all  0.0685
    iprec_at_recall_1.00   all  0.0115
    P_5                    all  0.1714
    P_10                   all  0.1796
    P_15                   all  0.1796
    P_20                   all  0.1776
    P_30                   all  0.1714
    P_100                  all  0.1406
    P_200                  all  0.1133
    P_500                  all  0.0713
    P_1000                 all  0.0419


From an overall perspective, the results seem reasonable, though not as strong as we would have hoped. The most telling measure is MAP, which summarizes the overall performance of our search: we obtained a MAP score of `16.3%` and a `P_10` of about `0.18`. The MAP improved after re-evaluating `result.py`: we optimized our retrieval and ranking after discovering some anomalies in our query calculations against the inverted index, which raised the score slightly to the number recorded above. Manual spot checks of individual searches also returned noticeably more relevant results after this fix.

## Results from Queries 3 and 20

### Query 3

    3 Q0 32333726654398464 1  0.69484735460699    myRun
    3 Q0 32910196598636545 2  0.6734426036041226  myRun
    3 Q0 35040428893937664 3  0.5424091725376433  myRun
    3 Q0 35039337598947328 4  0.5424091725376433  myRun
    3 Q0 29613127372898304 5  0.5233927588038552  myRun
    3 Q0 29615296666931200 6  0.5054085301107222  myRun
    3 Q0 32204788955357184 7  0.48949945859699995 myRun
    3 Q0 33711164877701120 8  0.47740062368197117 myRun
    3 Q0 33995136060882945 9  0.47209559331399364 myRun
    3 Q0 31167954573852672 10 0.47209559331399364 myRun

### Query 20

    20 Q0 33356942797701120 1  0.8821317020383918 myRun
    20 Q0 34082003779330048 2  0.7311611336720092 myRun
    20 Q0 34066620821282816 3  0.7311611336720092 myRun
    20 Q0 33752688764125184 4  0.7311611336720092 myRun
    20 Q0 33695252271480832 5  0.7311611336720092 myRun
    20 Q0 33580510970126337 6  0.7311611336720092 myRun
    20 Q0 32866366780342272 7  0.7311611336720092 myRun
    20 Q0 32269178773708800 8  0.7311611336720092 myRun
    20 Q0 32179898437218304 9  0.7311611336720092 myRun
    20 Q0 31752644565409792 10 0.7311611336720092 myRun


## Vocabulary

Our vocabulary size is `88422` tokens.

Below is a sample of 100 tokens from our vocabulary:

```['bbc', 'world', 'servic', 'staff', 'cut', 'fifa', 'soccer', 'haiti', 'aristid', 'return', 'mexico', 'drug', 'war', 'diplomat', 'arrest', 'murder', 'phone', 'hack', 'british', 'politician', 'toyota', 'reca', 'egyptian', 'protest', 'attack', 'museumkubica', 'crash', 'assang', 'nobel', 'peac', 'nomin', 'oprah', 'winfrey', 'half-sist', 'known', 'unknown', 'white', 'stripe', 'breakup', 'william', 'kate', 'fax', 'save-the-da', 'cuomo', 'budget', 'super', 'bowl', 'seat', 'tsa', 'airport', 'screen', 'unemploymen', 'reduc', 'energi', 'consumpt', 'detroit', 'auto', 'global', 'warm', 'weather', 'keith', 'olbermann', 'job', 'special', 'athlet', 'state', 'union', 'dog', 'whisper', 'cesar', 'millan', "'s", 'techniqu', 'msnbc', 'rachel', 'maddow', 'sargent', 'shriver', 'tribut', 'moscow', 'bomb', 'gifford', 'recoveri', 'jordan', 'curfew', 'beck', 'piven', 'obama', 'birth', 'certifica', 'campaign', 'social', 'media', 'veneta', 'organ', 'farm', 'requir', 'evacu', 'carbon', 'monoxid']```
