Information Retrieval System

Overview

Welcome to this student repository for the KTH DD2477 Search Engine course project!

This repository contains the source code skeleton for implementing a rudimentary search engine as part of the assignments for the KTH DD2477 Search Engine course. The objective of this project is to develop a basic search engine that will be evaluated on a corpus of linked documents from the wiki for the US town Davis. The dataset can be found at DavisWiki.

Through this project, I have the opportunity to learn and apply concepts related to information retrieval, indexing, ranking algorithms, and search engine implementation. The codebase provided in this repository serves as the foundation for implementing the search engine functionalities across three assignments, with each assignment building upon the previous one.

For detailed instructions on each assignment, please refer to the tasks folder in this repository. The code base was created by Prof. Johan Boye et al., then refactored and implemented by David Cao.

Dataset

Before getting started, ensure you have downloaded the datasets in this repository:

davisWiki Dataset
guardian Dataset

Unzip them in the src/main/datasets directory.
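
To confirm the archives ended up in the right place, a small check like the following may help. It is only a sketch and not part of the course skeleton: the class `DatasetCheck` is hypothetical, and the folder names `davisWiki` and `guardian` are assumptions about what the archives unpack to.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the course skeleton.
final class DatasetCheck {

    // Assumed folder names; adjust them to match what the archives actually unpack to.
    static final String[] EXPECTED = {"davisWiki", "guardian"};

    // Returns the expected dataset folders that are missing under the given root.
    static List<String> missingDatasets(Path datasetsRoot) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isDirectory(datasetsRoot.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Run from the repository root so the relative path resolves.
        Path root = Paths.get("src", "main", "datasets");
        List<String> missing = missingDatasets(root);
        if (missing.isEmpty()) {
            System.out.println("All datasets found under " + root);
        } else {
            System.out.println("Missing under " + root + ": " + missing);
        }
    }
}
```

Run it from the repository root; an empty "missing" list suggests both datasets were unzipped where the engine expects them.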

Getting Started

Prerequisites

Java Development Kit (JDK) installed on your system (Java 8 or later)
Git installed on your system (if cloning the repository)

Installation

Clone or download this repository to your local machine:
```
git clone https://github.com/DavidCWQ/SearchEngine.git
```
Alternatively, download the repository as a ZIP file and extract it to a directory of your choice.
Navigate to the directory containing the cloned or extracted files:
```
cd SearchEngine
```
Navigate to the directory containing the project scripts:
```
cd scripts
```
If you're on a Unix-like system (Linux, macOS), compile the lab skeleton by running:
```
sh compile_all.sh
```
If you're on a Windows platform, you can use the batch file instead:
```
.\compile_all.bat
```
You may encounter some warnings during compilation; these can typically be ignored.

Usage & Examples

Once compilation succeeds, you can run the search engine from the repository root. Here's an example command to get started:

cd ..

java -cp target/classes -Xmx1g ir.Engine -d davisWiki -l dd2477.png -p patterns.txt -r pagerank_result.txt -t davisTitles.txt -lk linksDavis.txt

If the dataset has already been indexed, you can add -ni to run the search engine with the persistent index instead of indexing again:

java -cp target/classes -Xmx1g ir.Engine -d davisWiki -l dd2477.png -p patterns.txt -r pagerank_result.txt -t davisTitles.txt -lk linksDavis.txt -ni
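
The two invocations above differ only in the trailing -ni switch. As a minimal sketch of how the same command line could be assembled programmatically (for example, from a launcher or a test), the hypothetical class `LaunchSketch` below builds the argument list from the values shown above. It is not part of the repository, and it only prints the command rather than starting the engine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher sketch; file names are copied from the example commands above.
final class LaunchSketch {

    // Builds the example command, optionally appending -ni to skip indexing.
    static List<String> command(boolean skipIndexing) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-cp", "target/classes", "-Xmx1g", "ir.Engine",
                "-d", "davisWiki",
                "-l", "dd2477.png",
                "-p", "patterns.txt",
                "-r", "pagerank_result.txt",
                "-t", "davisTitles.txt",
                "-lk", "linksDavis.txt"));
        if (skipIndexing) {
            cmd.add("-ni");
        }
        return cmd;
    }

    public static void main(String[] args) {
        // Pass -ni as the first argument to print the persistent-index variant.
        boolean skipIndexing = args.length > 0 && args[0].equals("-ni");
        System.out.println(String.join(" ", command(skipIndexing)));
    }
}
```

The returned list can be handed to `java.lang.ProcessBuilder` if you want to start the engine from Java; in that case the working directory should be the repository root so that `target/classes` resolves.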

Please remember to recompile the project after making any changes to the source code.

You can do this by running the compile_all.sh script (for Unix-like systems) or compile_all.bat batch file (for Windows) located in the scripts directory.

Command Line Options

The program supports the following command-line options:

-d [dataset_name]: Specify the dataset name (davisWiki by default).
-p [pattern_file]: Specify the regex file used in tokenization.
-l [search_logo]: Specify the search logo file (for example, dd2477.png).
-r [rank_result]: Specify the pagerank result file.
-t [rank_title]: Specify the pagerank title file.
-lk [link_file]: Specify the pagerank link file.
-ni: Disable indexing and use the existing persistent index.
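
For readers who want to see how such flags can be handled, here is a minimal sketch of a parser for the options listed above. The class `OptionSketch` is hypothetical; the actual parsing in ir.Engine may differ, and storing options in a map is a simplification chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the actual parsing in ir.Engine may differ.
final class OptionSketch {

    // Parses flag/value pairs such as "-d davisWiki" and the value-less switch "-ni".
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("-d", "davisWiki"); // documented default dataset
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            switch (flag) {
                case "-ni":
                    options.put(flag, "true"); // switch without a value
                    break;
                case "-d":
                case "-p":
                case "-l":
                case "-r":
                case "-t":
                case "-lk":
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException("Missing value for " + flag);
                    }
                    options.put(flag, args[++i]); // consume the value that follows the flag
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        // Prints the parsed options, including the default dataset if -d is absent.
        System.out.println(parse(args));
    }
}
```

Rejecting unknown flags early, as this sketch does, tends to surface typos such as -1k for -lk before any indexing work starts.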

Directory Structure

Here's an overview of the directory structure for this project.

.
├── scripts/                 # Directory for scripts related to project automation
├── src/                     # Source code files
│   ├── main/                # Main application code
│   │   ├── datasets/        # Directory for dataset files used in the project
│   │   ├── ir/              # Java source files for Information Retrieval components
│   │   ├── lib/             # Third-party libraries used in the project
│   │   └── resources/       # Additional resource files used in the application
│   └── test/                # Test code (partial)
│       └── resources/       # Resource files used in test cases
├── target/                  # Compiled output directory
├── tasks/                   # Directory for task-related documentation and assignment PDF
└── README.md                # Project README file providing an overview of the project

Feel free to explore each directory for more details on the contents.

License

This project is licensed under the Apache 2.0 License.

Contributing

Contributions to this project are welcome. Feel free to submit bug reports, feature requests, or pull requests.

Acknowledgements

Special thanks to KTH DD2477 course instructors and TAs for providing the course materials and guidance.
