SiriuslySirius/positional-index-query-implementation

Positional Index Query Implementation
This program constructs positional indexes and processes phrase and proximity queries over a large document corpus (over 30,000 documents) from Project Gutenberg. Query results are stored in CSV files so they can be read easily in spreadsheet software. Query results are bidirectional: the order of the two queried words does not matter, and matches in both orders are included.
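To make the idea concrete, here is a minimal sketch of a positional index with a bidirectional proximity query. It is not the code in PositionalIndex.java: the class name `PositionalIndexSketch`, the method names, the letters-only tokenizer, and the reading of "distance" as a maximum difference between token positions are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; names and tokenization rules are assumptions.
class PositionalIndexSketch {
    // term -> (docId -> ascending list of token positions in that document)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    // Lowercase the text and keep only runs of ASCII letters as tokens
    // (assumed tokenizer; symbols, whitespace, and digits act as separators).
    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a separator
            }
            index.computeIfAbsent(token, t -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(position++);
        }
    }

    // Returns one {docId, firstPosition, secondPosition} triple for every pair of
    // occurrences at most `distance` positions apart, in either order.
    List<int[]> proximityQuery(String first, String second, int distance) {
        Map<Integer, List<Integer>> firstPostings =
                index.getOrDefault(first.toLowerCase(Locale.ROOT), Collections.emptyMap());
        Map<Integer, List<Integer>> secondPostings =
                index.getOrDefault(second.toLowerCase(Locale.ROOT), Collections.emptyMap());
        List<int[]> hits = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> entry : firstPostings.entrySet()) {
            List<Integer> secondPositions = secondPostings.get(entry.getKey());
            if (secondPositions == null) {
                continue; // the second term does not occur in this document
            }
            for (int p1 : entry.getValue()) {
                for (int p2 : secondPositions) {
                    // Math.abs makes the match bidirectional: either word may come first.
                    if (p1 != p2 && Math.abs(p1 - p2) <= distance) {
                        hits.add(new int[] {entry.getKey(), p1, p2});
                    }
                }
            }
        }
        return hits;
    }
}
```

The nested loop over positions is the simplest correct form. Because each position list is sorted, a two-pointer merge would avoid comparing every pair and is likely worth it for a corpus of this size.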

The program generates two CSV files. The first contains only the DocID, the first term's index, and the second term's index, as required by the assignment. The second is a detailed version that contains everything in the first file plus the file path of the text file and the exact phrase from that file. New files are created for every unique query; repeating a query overwrites its existing files, which helps if you switch to a different corpus or add to the current one. When validating against the detailed file, keep in mind that symbols, whitespace, and numbers are not included in the results, so account for that when comparing a phrase against what your text editor's search function finds in the document.
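The two output layouts described above could be produced along these lines. This is a sketch only: the class `QueryCsvSketch`, the column order, the quoting rule, and the file-naming scheme are hypothetical and may not match what the program actually writes.

```java
// Hypothetical sketch of the basic and detailed CSV rows; not the program's actual output code.
class QueryCsvSketch {
    // Quote a field only when it contains a comma, a double quote, or a line break,
    // doubling any embedded quotes (the common RFC 4180 convention).
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Basic file: DocID, first term index, second term index.
    static String basicRow(int docId, int firstIndex, int secondIndex) {
        return docId + "," + firstIndex + "," + secondIndex;
    }

    // Detailed file: the basic columns plus the source file path and the matched phrase.
    static String detailedRow(int docId, int firstIndex, int secondIndex,
                              String filePath, String phrase) {
        return basicRow(docId, firstIndex, secondIndex)
                + "," + escape(filePath) + "," + escape(phrase);
    }

    // Assumed naming scheme: one file name per unique (first, second, distance) query,
    // so repeating a query reuses the same name and overwrites the earlier file.
    static String fileName(String first, String second, int distance, boolean detailed) {
        return first + "_" + second + "_" + distance + (detailed ? "_detailed" : "") + ".csv";
    }
}
```

Deriving the file name from the query itself is one simple way to get the overwrite-on-repeat behaviour without tracking which queries have already been run.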

How to Compile:
javac -O .\PositionalIndex.java

How to Run (and Its Parameters):
java PositionalIndex <path-to-input-files> <path-to-output-result-files> <first-word> <second-word> <int-distance-between-words>
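The five parameters map naturally onto a small argument holder. The sketch below shows one way they might be validated; the class `QueryArgsSketch`, its field names, and the rejection of negative distances are assumptions, not the program's actual behaviour.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder for the five command-line parameters.
class QueryArgsSketch {
    final Path inputDir;     // <path-to-input-files>
    final Path outputDir;    // <path-to-output-result-files>
    final String firstWord;  // <first-word>
    final String secondWord; // <second-word>
    final int distance;      // <int-distance-between-words>

    QueryArgsSketch(String[] args) {
        if (args.length != 5) {
            throw new IllegalArgumentException(
                    "Usage: java PositionalIndex <path-to-input-files> "
                    + "<path-to-output-result-files> <first-word> <second-word> "
                    + "<int-distance-between-words>");
        }
        inputDir = Paths.get(args[0]);
        outputDir = Paths.get(args[1]);
        firstWord = args[2];
        secondWord = args[3];
        // Integer.parseInt throws NumberFormatException (an IllegalArgumentException)
        // when the distance is not a valid integer.
        distance = Integer.parseInt(args[4]);
        if (distance < 0) {
            // Assumed rule: a negative distance is treated as invalid input.
            throw new IllegalArgumentException("distance must be non-negative");
        }
    }
}
```

For instance, `new QueryArgsSketch(new String[] {"corpus", "results", "quick", "fox", "2"})` would describe a query for "quick" and "fox" within a distance of 2, where `corpus` and `results` are placeholder directory names.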
