A program that takes a Wikipedia page as its input and returns the most important words on the page based on their term frequency-inverse document frequency (TF-IDF) score.
The program was written in Java 17 using Maven.
The program takes a Wikipedia page as its input and computes the most important words on that page. It constructs a corpus of all English-language Wikipedia pages reachable within a given number of link steps from the input page or, if the provided time limit is reached first, of the pages collected up to that point. The Wikipedia pages are parsed using the JSoup library. For each page, the relevant page content is stored and a bag-of-words model is constructed. A list of frequent words in the English language (stopwords.txt) is used to filter out common words. After the corpus is constructed, the TF-IDF score is computed for each token on the input page.
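The bag-of-words step with stop-word filtering can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the class name `BagOfWordsSketch`, the method `countTokens`, and the tokenization rule (lowercase, split on anything that is not a letter) are assumptions, and the stop-word set is passed in directly rather than read from stopwords.txt.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not taken from the repository.
final class BagOfWordsSketch {

    // Counts how often each non-stop-word token occurs in the given text.
    static Map<String, Integer> countTokens(String text, Set<String> stopwords) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a token is a maximal run of letters; everything else separates tokens.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue; // skip empty splits and common words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny inline stop-word set for illustration only.
        Set<String> stopwords = Set.of("the", "is", "a", "and");
        String text = "The elephant is a large mammal, and the elephant is grey.";
        System.out.println(countTokens(text, stopwords));
    }
}
```

The resulting map from token to count is the per-page representation that the term-frequency formula operates on.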
The (relative) term frequency of term t in document d is given by:
$$ tf(t,d) = {f_{t,d} \over \sum_{t' \in d} f_{t',d}} $$
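Here, $f_{t,d}$ is the number of times term t occurs in document d, so the denominator is the total number of term occurrences in d.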
The inverse document frequency of term t relative to the full corpus of documents D is given by:
$$ idf(t,D) = \log {N \over |\{d \in D : t \in d \}|} $$
Here, N is the total number of documents in D.
The highest-ranking tokens based on their TF-IDF score are then printed.
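The scoring and ranking described above can be sketched as follows, taking the TF-IDF score of a term to be the product tf(t,d) · idf(t,D). This is a minimal sketch under stated assumptions rather than the repository's implementation: the class and method names are hypothetical, each document is represented as a token-to-count map, the natural logarithm is used (the base is not specified above), and a term that occurs in no document is given an idf of 0 to avoid division by zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not taken from the repository.
final class TfIdfSketch {

    // tf(t, d): count of the term divided by the total token count of the document.
    static double tf(String term, Map<String, Integer> doc) {
        int total = doc.values().stream().mapToInt(Integer::intValue).sum();
        return total == 0 ? 0.0 : doc.getOrDefault(term, 0) / (double) total;
    }

    // idf(t, D): log of N over the number of documents that contain the term.
    static double idf(String term, List<Map<String, Integer>> corpus) {
        long containing = corpus.stream().filter(d -> d.containsKey(term)).count();
        // Assumption: terms found in no document score 0 instead of dividing by zero.
        return containing == 0 ? 0.0 : Math.log((double) corpus.size() / containing);
    }

    // Returns the n highest-scoring terms of doc, best first.
    static List<Map.Entry<String, Double>> topTerms(
            Map<String, Integer> doc, List<Map<String, Integer>> corpus, int n) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (String term : doc.keySet()) {
            scored.add(Map.entry(term, tf(term, doc) * idf(term, corpus)));
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return scored.subList(0, Math.min(n, scored.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> input = Map.of("elephant", 2, "trunk", 1, "animal", 1);
        Map<String, Integer> other = Map.of("animal", 3, "cat", 1);
        List<Map<String, Integer>> corpus = List.of(input, other);
        // "animal" occurs in every document, so its idf and its score are 0.
        topTerms(input, corpus, 2).forEach(e ->
                System.out.println(e.getKey() + " " + e.getValue()));
    }
}
```

A term that appears on every page in the corpus gets idf = log(1) = 0, which is why words common across the whole corpus drop to the bottom of the ranking even if they are frequent on the input page.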
Clone this repository:
$ git clone https://github.com/Vishengel/Web-IQ.git
Either:
- Run "mvn package" in the root directory to download dependencies and create WebIQ-1.0-SNAPSHOT-jar-with-dependencies.jar
Or:
- Directly use the provided .jar file, which includes all dependencies: WebIQ-1.0-SNAPSHOT-jar-with-dependencies.jar
Then:
- Run the program as follows:
> java -jar target/WebIQ-1.0-SNAPSHOT-jar-with-dependencies.jar [Wikipedia page title] [N maximum steps from starting page] [Max runtime in minutes] [N results to print]
e.g.
> java -jar target/WebIQ-1.0-SNAPSHOT-jar-with-dependencies.jar Elephant 2 10 50
Provide the page title without the preceding URL, i.e. only the part that follows /wiki/. Replace spaces with underscores, e.g. Open-source_intelligence.
Running WebIQ-1.0-SNAPSHOT-jar-with-dependencies.jar without parameters will use "Open-source_intelligence 2 5 25" as default settings.