CS253-TermProject

This is the final version of my term project for CS253 (Fall 2017).

It includes the following features:

Given a document, ranks additional provided documents by similarity and reading level
Estimates the reading level of a document per the Coleman-Liau index
Computes similarity using TF-IDF - Term Frequency-Inverse Document Frequency (used by early web search engines)
Performs word stemming using the Porter algorithm
Is extremely efficient - without multithreading, it can compare the complete works of William Shakespeare in about 3 seconds
Performs extensive error-checking (with helpful error messages printed to the console)
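To illustrate the TF-IDF similarity idea, here is a minimal sketch. It is not taken from the project's source: the names (`TermCounts`, `documentFrequency`, `tfidfWeight`, `similarity`) are hypothetical, and it assumes one common weighting (raw term count times log10(N / df)) with similarity taken as the sum of weight products over shared terms. The actual program may weight or normalize differently.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: maps each (stemmed) term to its count in one document.
using TermCounts = std::unordered_map<std::string, int>;

// Number of documents in the corpus that contain `term` at least once.
int documentFrequency(const std::string& term,
                      const std::vector<TermCounts>& corpus) {
    int n = 0;
    for (const auto& doc : corpus) {
        if (doc.count(term) > 0) ++n;
    }
    return n;
}

// Assumed weighting: tf * log10(N / df). A term that occurs in every
// document gets weight 0, so very common words do not affect the score.
double tfidfWeight(int tf, int df, std::size_t numDocs) {
    if (tf == 0 || df == 0) return 0.0;
    return tf * std::log10(static_cast<double>(numDocs) / df);
}

// Sum, over the terms shared by `a` and `b`, of the product of their
// TF-IDF weights. Higher scores mean more similar documents.
double similarity(const TermCounts& a, const TermCounts& b,
                  const std::vector<TermCounts>& corpus) {
    double score = 0.0;
    for (const auto& entry : a) {
        auto it = b.find(entry.first);
        if (it == b.end()) continue;  // term not shared
        int df = documentFrequency(entry.first, corpus);
        score += tfidfWeight(entry.second, df, corpus.size()) *
                 tfidfWeight(it->second, df, corpus.size());
    }
    return score;
}
```

In this sketch, the corpus would hold the term counts for file1 and every document listed in file2, and the documents would be ranked by their `similarity` score against file1.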

It is written in C++. Please note that this was written early in my programming career and was designed solely by me - as such, there aren't as many comments as there should be! :)

Instructions:

To compile: make

This creates an executable called PA9.

To run: ./PA9 file1 file2 num1 num2

where:

file1: the name of the text document against which all others should be compared
file2: the name of a text document containing the names of other text documents (in the same directory), each separated by whitespace. The first filename listed in file2 is the exceptions file; its required format is described below. The rest are the documents to be compared against file1.
num1: the lowest acceptable reading level (as defined by the Coleman-Liau index)
num2: the highest acceptable reading level (as defined by the Coleman-Liau index)

Exceptions file format: Some words should be exempt from word stemming. While this program does classify and ignore names, locations, etc. when performing word stemming, the exceptions file allows for even greater flexibility. The correct format for this file is:

word replaceWith

word2 replaceWith2

and so on.

So, if you prefer that the word "smiled" be changed to "smiling" (instead of being stemmed to "smile"), then your exceptions file would look like:

smiled smiling

Please note all document files should have a text extension (.txt). :)


dboston1/Advanced-Document-Similarity
