Skip to content

nicolaasmatthijs/AlterEgo

Repository files navigation

// Nicolaas Matthijs - nm417 - CSTIT //

AlterEgo :: Personalized Web Search
-----------------------------------

This code is the result of a research project conducted in Easter Term 2010 
in order to obtain the MPhil Computer Speech, Text and Internet Technology 
at the University of Cambridge. It investigates the use of profile based 
personalized web search using browsing history and thorough offline and 
online evaluation in a complete and self-contained research project. It 
involves the full research cycle and includes pre-research investigation, 
suggesting and implementing a new personalization strategy, capturing data, 
finding participants, two different evaluations, parameter investigation, 
comparison to previous best systems and making the developed Firefox add-on 
available for real-life usage.

This README file first gives an overview of all of the files available in this
repository (http://github.com/nicolaasmatthijs/AlterEgo). Next, a manual is 
provided that explains the different steps required to get the project up and running.

All of the code is highly configurable and extensively documented.

///////////////////
// File Overview //
///////////////////

*****************************
** 01_ProjectSpecification **
*****************************

++ Project Proposal

Original project proposal as submitted before the start of the project

++ Workplan

Workplan outlining the proposed ideas, goals and timeline

*********************
** 02_FirefoxAddOn **
*********************

++ Add-On

Code for the AlterEgo Firefox add-on.

++ DataStorage

Scripts that are responsible for saving a user's browsing history into
a database.

++ Website

Files for the website that was used to present the project and recruit people
to participate in the experiments.

******************************
** 03_PersonalizationServer **
******************************

++ Data Extraction

All script necessairy for processing a user's browsing history and extracting
some of the structure into a large XML user document.

++ Term Extraction

Re-implemenation of the Vu term extraction algorithm using the web.py python.
webserver

++ JAR Files

List of JAR files used by the AlterEgo personalization server.

++ Personalization

Main project output. This server, written in Java, does data processing,
profile generation, searching, result re-ranking, evaluation and evaluation
processing. This code also includes the MIT WordNet API.

*********************
** 04_3rdPartyCode **
*********************

This folder contains references to the different 3rd party tools that were
used for the implementation of this project.

++ ccparser

The Clark & Currant statistical natural language parser is used to parse the
text present on web pages in order to extract noun phrases.

++ google-n-gram-corpus

The Google N-Gram corpus is used to estimate the number of occurences of an
N-Gram on the web.

++ memcached-1.4.5
++ redis-1.2.6

Redis is a caching framework built on top of MemCacheD that is used to store the 
Google N-Gram corpus in memory and perform extremely fast retrieval of entries from
the corpus.

++ opennlp-tools-1.3.0

The OpenNLP Tools are used for sentence splitting and tokenization to prepare text
for parsing.

++ wordnet-3.0

The lexical database WordNet is used to filter a list of terms on Part of Speech.

***********************
** 05_DataAndResults **
***********************

++ Usage Data

Summary of the data collected by the Firefox add-on. 

++ Relevance Judgements

All documents related to the relevance judgements evaluation. This contains the
document that was given to the participants in the offline relevance judgements
sessions and the processed results coming out of that.

++ Interleaving

Contains all of the results coming out of the online interleaved evaluation
session.

*****************************
** 06_ProjectDocumentation **
*****************************

This folder contains of all of the files that try to document the overall
project and its outcomes.

++ Presentations

Powerpoint presentations that were used for intermediary project presentations

++ Thesis

The Masters thesis that was produced as output of this project

++ Conference Paper

LaTeX files that contain the first draft of the conference paper extracted out
of the produced paper. This paper, once finished, will be submitted to the
WSDM 2011 Conference in Hong Kong.


/////////////////////////
// Installation Manual //
/////////////////////////

***************************
** Checking out the code **
***************************

1) Install GIT (http://git-scm.com/)

2) Check out the Personalized Search code from [email protected]:nicolaasmatthijs/AlterEgo.git

********************************
** Installing 3rd Party Tools **
********************************

3) Install Memcached (see 04_3rdPartyCode/memcached-1.4.5)

4) Install the Redis caching server (see 04_3rdPartyCode/redis-1.2.6)

5) Start the Redis server using redis.sh

6) Install the Clark & Curran Statistical Natural Language Parser (see 04_3rdPartyCode/ccparser)

7) Download the models file required for the Clark & Curran Statistical Natural Language Parser

8) Download the latest version of the lexical database WordNet (see 04_3rdPartyCode/wordnet-3.0)

********************
** Firefox add-on **
********************

9) Install MySQL

10) Put the files found in 02_FirefoxAddOn/Website on an Apache server

11) Update the configuration parameters in 02_FirefoxAddOn/Website/config.php

12) Put the files found in 02_FirefoxAddOn/DataStorage on an Apache server

13) Update the configuration parameters in 02_FirefoxAddOn/DataStorage/config.php

*********************
** Data processing **
*********************

14) Put the files found in 03_PersonalizationServer/Data Extraction onto an Apache server

15) Put the term extraction script found in 03_PersonalizationServer/Term Extraction on a server that support python

16) Start the term extraction script by calling Term Extraction/web.py

***********************
** Re-ranking server **
***********************

17) Install the main personalization server found at 03_PersonalizationServer/Personalization on a Tomcat 6 web server

18) Add the JAR files found in 03_PersonalizationServer/JAR Files to the TomCat shared/lib folder

19) Update the configuration file at AlterEgoServer/build/src/org/cl/nm417/config/config.properties to comply with the local installation

20) Start the Personalization server

21) The following pages represent the main functionality
	 * index.html				Main personalization User Interface
	 * indexEvaluate.html		Relevance judgements experiment User Interface
	 * generateProfiles.jsp		Generate all profile for the Relevance judgements experiment
	 * results.jsp				Relevance judgements experiment result processing
	 * profilegenerator.jsp		Script used to continously keep user profiles up to date
	 * interleavingresults.jsp	Process the results from the online interleaved evaluation