This is the group project for DD2477 Search Engines and Information Retrieval Systems (60034) at KTH. We implemented a small search engine that lets users search for podcast clips of interest based on a query and a time constraint. We use the Spotify Podcast Dataset, which includes the podcast transcripts along with time markers. The main backend framework is Elasticsearch, which indexes the transcripts of the podcast dataset and returns ranked search results. The GUI is implemented with PyQt; users specify the query and the time limit in the corresponding text boxes.
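As a rough illustration of how a ranked full-text search can be expressed in Elasticsearch, the sketch below builds a `match` query body as a plain Python dictionary. The function name `build_clip_query` and the field name `transcript` are hypothetical placeholders, not names taken from this project; the actual index mapping and query design may differ.

```python
def build_clip_query(query_text, size=10):
    """Build an Elasticsearch search request body for a free-text query.

    Assumption: the transcript text is stored in a field called
    "transcript" (a placeholder, not the project's real field name).
    """
    return {
        # Maximum number of hits to return.
        "size": size,
        # A full-text "match" query; Elasticsearch sorts hits by
        # relevance score (highest first) when no sort is specified.
        "query": {"match": {"transcript": {"query": query_text}}},
    }


body = build_clip_query("climate change", size=5)
```

A body like this would be passed to the search endpoint of the index that holds the transcripts.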
- Minchong Li: Backend logic design
- Tengfei Lu: GUI design
- Zihao Xu: Data indexing and Elasticsearch query design
- Make sure you have downloaded the Spotify Podcast Dataset and, most importantly, that Elasticsearch is installed on your machine.
- Open `config.yaml` and set the `meta_path` and `trans_root` values to your own dataset paths.
- Launch Elasticsearch by running `elasticsearch.bat`.
- Run `index.bat` if you have not indexed the dataset yet.
- Run `search.bat` to start the search engine.
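Before indexing, it can help to confirm that the two dataset paths in the configuration point at something that exists. The sketch below assumes the contents of `config.yaml` have already been loaded into a dictionary (for example with a YAML parser); the helper name `check_dataset_paths` is hypothetical and not part of this project.

```python
from pathlib import Path


def check_dataset_paths(config):
    """Return a list of problems found in the dataset paths of a config mapping."""
    problems = []
    # meta_path and trans_root are the two keys named in the setup steps.
    for key in ("meta_path", "trans_root"):
        value = config.get(key)
        if not value:
            problems.append(f"{key} is not set")
        elif not Path(value).exists():
            problems.append(f"{key} does not exist: {value}")
    return problems


# Example: an empty meta_path is reported, while "." (the current directory) exists.
issues = check_dataset_paths({"meta_path": "", "trans_root": "."})
```

An empty list means both paths were found; anything else is worth fixing before running `index.bat`.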
- Select the search method and enter the time limit in the corresponding text box. The status bar at the bottom shows "set time limit: x min" once the time limit is set successfully.
- Enter the query and press the `Enter` key. The result box in the middle lists the filenames of the top matching transcripts, sorted by score from high to low. The status bar shows how long the search took.
- Click an item and leave your cursor on it. A floating tooltip shows the episode_name of that item.
- Double-click an item to display its transcript text in the text box at the bottom.
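The time-limit step above could be handled roughly as follows. This is a minimal sketch, not the project's actual GUI code: the function names are hypothetical, the value is assumed to be a positive number of minutes, and the "invalid time limit" message is an assumption made for illustration. Only the "set time limit: x min" format comes from the steps above.

```python
import math


def parse_time_limit(text):
    """Parse the time-limit text; return minutes as a float, or None if invalid."""
    try:
        minutes = float(text.strip())
    except ValueError:
        return None
    # Reject NaN, infinity, zero, and negative values.
    if not math.isfinite(minutes) or minutes <= 0:
        return None
    return minutes


def status_message(text):
    """Build the status-bar text for a given time-limit input."""
    minutes = parse_time_limit(text)
    if minutes is None:
        # Assumed wording; the real application may report errors differently.
        return "invalid time limit"
    return f"set time limit: {minutes:g} min"
```

For example, `status_message("2")` yields `set time limit: 2 min`, while non-numeric input falls through to the assumed error message.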