This is a semester-long project. You will work in groups of no more than three. The project requires a substantial amount of coding, so doing it all alone will be very difficult.
You can use any programming language, along with the libraries included in its default installation. However, you should NOT use any open-source or third-party code, except for the stop list and the stemmer. You cannot use tools that automatically generate code. You are not allowed to share code.
In this project, you are expected to implement an information retrieval system that contains the following components:
Document processing and indexing. You will be provided with a zip file that contains 63 HTML documents collected from Wikipedia. First, pre-process the documents by removing all HTML tags and converting everything to lower case. Implement a stop list and a stemmer to pre-process the documents (for the stop list and stemmer, you are allowed to use third-party open-source code). Second, build an inverted index (including the dictionary and posting lists) for the documents. Make sure to keep all the frequency information.
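As a starting point, here is a minimal sketch in Python that uses only standard-library modules. The tiny stop list is illustrative, and stem is a placeholder where a permitted third-party stemmer (e.g., a Porter stemmer) would plug in:

import re
from collections import defaultdict
from html.parser import HTMLParser

STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
              "from", "in", "is", "it", "of", "on", "or", "that",
              "the", "this", "to", "was", "were", "with"}

class TextExtractor(HTMLParser):
    """Strips HTML tags, keeping only the text content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def stem(token):
    # Placeholder: plug in a permitted third-party stemmer here.
    return token

def tokenize(html_text):
    extractor = TextExtractor()
    extractor.feed(html_text)
    text = " ".join(extractor.chunks).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(docs):
    """docs: {doc_id: raw_html}. Returns an inverted index
    {term: {doc_id: term_frequency}}, keeping all frequency info."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, html_text in docs.items():
        for tok in tokenize(html_text):
            index[tok][doc_id] += 1
    return index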
Vector space model. The goal is to provide TF-IDF-based ranking for the documents. Since you already collected frequency information in step 1, further compute the IDF for each term. For each document, find a way to calculate the length of the corresponding document vector. For each incoming query, pre-process the query with the stop list and stemmer, identify candidate documents that contain at least one query term, and compute the length of the query vector. Finally, compute the TF-IDF similarity score between the query and each candidate document [hint: there is no need to construct the complete document vector or loop through all dimensions in the vector space], and sort the documents by score.
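A sketch of the scoring, assuming the index built in the previous sketch; the raw-tf times IDF weighting with cosine normalization shown here is one common choice, not the only valid one. Note that both the document vector lengths and the scores are accumulated directly from the posting lists, per the hint:

import math
from collections import defaultdict

def compute_idf(index, num_docs):
    """idf(t) = log(N / df(t)), with df taken from the posting list."""
    return {term: math.log(num_docs / len(postings))
            for term, postings in index.items()}

def doc_vector_lengths(index, idf):
    """Euclidean length of each document's tf-idf vector,
    accumulated from the postings (no full vectors built)."""
    sq = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            sq[doc_id] += (tf * idf[term]) ** 2
    return {doc_id: math.sqrt(s) for doc_id, s in sq.items()}

def rank(query_terms, index, idf, vec_len):
    """Cosine similarity, restricted to candidate documents that
    contain at least one query term."""
    q_tf = defaultdict(float)
    for t in query_terms:                     # raw tf in the query
        if idf.get(t, 0) > 0:
            q_tf[t] += 1.0
    q_len = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in q_tf.items()))
    scores = defaultdict(float)
    for t, qtf in q_tf.items():
        for doc_id, dtf in index[t].items():  # walk postings only
            scores[doc_id] += (qtf * idf[t]) * (dtf * idf[t])
    return sorted(((s / (vec_len[d] * q_len), d) for d, s in scores.items()),
                  reverse=True)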
Niche crawler. You should identify a domain of interest (e.g., ku, Wikipedia, nfl, etc.). Ideally, the size of the domain should be manageable, and the link structure not too complicated to follow. Your crawler should contain at least three components: (1) a multi-threaded spider that fetches and parses webpages; (2) the URL frontier, which stores to-be-crawled URLs; and (3) the URL repository, which stores crawled URLs. Please be polite to the site. Please collect a few hundred to a few thousand pages.
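A bare-bones multi-threaded sketch using only the standard library; the seed URL, delay, page limit, and thread count are all illustrative, and a real crawler should also honor robots.txt:

import re
import threading
import time
from queue import Queue
from urllib.parse import urljoin
from urllib.request import urlopen

SEED = "https://example.edu/"     # illustrative; pick your own niche domain
MAX_PAGES = 500
DELAY = 1.0                       # politeness delay between fetches (seconds)

frontier = Queue()                # URL frontier: to-be-crawled URLs
frontier.put(SEED)
repository = set()                # URL repository: crawled URLs
lock = threading.Lock()

def spider():
    while True:
        url = frontier.get()
        with lock:
            if url in repository or len(repository) >= MAX_PAGES:
                frontier.task_done()
                continue
            repository.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            # save html to disk here, then extract out-links
            for link in re.findall(r'href="([^"]+)"', html):
                absolute = urljoin(url, link)
                if absolute.startswith(SEED):   # stay inside the niche
                    frontier.put(absolute)
        except Exception:
            pass                   # skip unreachable or malformed pages
        time.sleep(DELAY)
        frontier.task_done()

for _ in range(4):                 # four spider threads
    threading.Thread(target=spider, daemon=True).start()
frontier.join()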
Please feed the collected documents to the search engine that you implemented in step 2. Implement a Web-based interface that takes user queries and returns answers (document name, a snippet with the search term(s) highlighted, and the URL) to the user. You only need to provide a reasonable (not fancy) interface; you may use WYSIWYG editors to generate the HTML. Keep this version of your search engine, since it will be compared with two future versions.
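One way to avoid third-party web frameworks is the standard library's http.server. In the sketch below, search is a stub standing in for the step-2 ranker, and all names are illustrative:

import html
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

def search(query):
    """Illustrative stub: a real handler would call the step-2 ranker.
    Returns (doc_name, snippet, url) triples."""
    return [("doc1.html", "an example snippet mentioning " + query,
             "https://example.edu/doc1.html")]

def highlight(snippet, terms):
    for t in terms:
        snippet = re.sub("(%s)" % re.escape(t), r"<b>\1</b>",
                         snippet, flags=re.IGNORECASE)
    return snippet

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        q = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        rows = ""
        if q:
            for name, snippet, url in search(q):
                rows += "<p>%s<br>%s<br><a href='%s'>%s</a></p>" % (
                    html.escape(name),
                    highlight(html.escape(snippet), q.split()), url, url)
        page = ("<html><body><form>"
                "<input name='q' value='%s'>"
                "<input type='submit' value='Search'>"
                "</form>%s</body></html>" % (html.escape(q), rows))
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(page.encode("utf-8"))

HTTPServer(("localhost", 8000), SearchHandler).serve_forever()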
Add term proximity to your scoring mechanism. Define your own score that reflects the proximity of the search terms in each document, and define your own algorithm to integrate the term proximity score with the TF-IDF score from step 2.
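One possible design, assuming the posting lists are extended to store term positions: score each candidate by the smallest token window covering all query terms, then blend that with the TF-IDF score. The minimum-window formula and the weight alpha are assumptions, not requirements:

def min_window(position_lists):
    """Smallest span (in tokens) covering one occurrence of every term.
    position_lists: one sorted, non-empty list of positions per query
    term that occurs in the document."""
    pointers = [0] * len(position_lists)
    best = float("inf")
    while True:
        current = [pl[p] for pl, p in zip(position_lists, pointers)]
        best = min(best, max(current) - min(current) + 1)
        i = current.index(min(current))   # advance the leftmost pointer
        pointers[i] += 1
        if pointers[i] >= len(position_lists[i]):
            return best

def proximity_score(position_lists):
    """1.0 when the terms are adjacent, decaying as they spread apart."""
    if len(position_lists) < 2:
        return 0.0
    return len(position_lists) / min_window(position_lists)

def combined_score(tfidf, proximity, alpha=0.7):
    # alpha is a tuning knob; this linear blend is one choice among many.
    return alpha * tfidf + (1 - alpha) * proximity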
Add one of the following to your search engine:
6.1. Search personalization: use cookies to track users. Record each search and each click-through. For a new query, add a small component derived from the user's search history as query expansion.
6.2. Relevance feedback: for each query, allow the user to mark a set of "positive" and "negative" results, then use this feedback to update the query and return new (refined) results to the user (see the Rocchio sketch after this list).
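For option 6.2, the classic Rocchio update is one well-known approach (the bingRelevance.py reference below implements a variant). A minimal sketch, assuming the query and documents are sparse tf-idf dictionaries; the alpha/beta/gamma values are conventional defaults, not requirements:

from collections import defaultdict

def rocchio(query_vec, positive_docs, negative_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: q' = alpha*q + beta*mean(pos) - gamma*mean(neg).
    All vectors are sparse dicts {term: tf-idf weight}."""
    new_q = defaultdict(float)
    for t, w in query_vec.items():
        new_q[t] += alpha * w
    for doc in positive_docs:
        for t, w in doc.items():
            new_q[t] += beta * w / len(positive_docs)
    for doc in negative_docs:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(negative_docs)
    # negative weights are conventionally clipped to zero
    return {t: w for t, w in new_q.items() if w > 0}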
Please evaluate and compare the performance of the original search engine (step 4) and the new versions (steps 5 and 6). You are expected to give a project presentation during the last two weeks of the semester. Prepare a few PowerPoint slides introducing your search engine, especially the system architecture and design notes. Please also prepare a short demonstration of your search engine.
You need to submit a project report with all your source code (please do NOT include crawled documents). It is strongly suggested that you keep the search engine available while we grade your project report.
Git help: https://help.github.com/articles/adding-an-existing-project-to-github-using-the-command-line/
Niche crawler references (some useful materials for developing a niche crawler):
https://gist.github.com/jwickett/261551/f62b5ff83d6dbfe004f9ae5941a23a2e3110c57e
http://ipython.iteye.com/blog/685835
http://blog.csdn.net/tingyuanss/article/details/7588700
Term proximity: https://etd.ohiolink.edu/rws_etd/document/get/case1197213718/inline
Relevance feedback: https://github.com/rohanag/ADB-relevance-feedback-rocchio/blob/master/bingRelevance.py
Returning a value from a PHP select box: http://www.formget.com/php-select-option-and-php-radio-button/