Stanford STATS 202 Prediction 2024: URL Relevance Prediction

Search engines have become an integral part of our daily lives, offering relevant documents and websites based on user input and browsing behavior, such as cookies. To determine the relevance of a document, search engines utilize hundreds of signals, ultimately returning a ranked list of documents based on these signals.

In this project, we are provided with a training dataset comprising 80,046 observations and 10 attributes, including "query_length", "is_homepaged", and eight unnamed signals. The dataset also includes a binary output indicating whether an observation is relevant, based on search engine query and URL data. Additionally, we have a test dataset containing 30,001 observations with the same 10 attributes, but without the relevance output.

Our objective is to develop a model using the training dataset to predict the relevance of each observation in the test dataset.

Install

  1. Clone the repository and navigate to its working directory
git clone https://github.com/Beryex/URL-Relevance-Prediction.git
cd URL-Relevance-Prediction
  2. Set up the environment
pip install -r requirements.txt

Usage

We implemented several dataset preprocessing steps and data mining methods, then compared them experimentally to identify the relationships between the 10 attributes and relevance.

Results

Among the traditional methods, boosting gave the best results. The table below records the effect of each dataset preprocessing step on boosting with its optimal hyperparameters.

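| Standardize | Remove Sig5 | Remove Outlier | Kernel Methods | Apply PCA | CV Average Accuracy (%) | Test Accuracy (%) |
| --- | --- | --- | --- | --- | --- | --- |
| No | No | No | No | No | 66.83 | 67.43 |
| No | No | No | No | Yes | 58.83 | N/A |
| Yes | No | No | No | Yes | 65.97 | N/A |
| Yes | No | Yes | No | No | 66.73 | N/A |
| Yes | No | No | Yes | No | 66.73 | N/A |
| Yes | No | No | Yes | Yes | 65.87 | N/A |
| Yes | Yes | No | No | No | 66.73 | 68.25 |
| Yes | Yes | Yes | No | No | 66.64 | 67.53 |
| Yes | Yes | No | Yes | No | 66.58 | N/A |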
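As a rough illustration of how a "CV Average Accuracy" figure can be produced, the sketch below runs k-fold cross-validation and averages the held-out accuracy. It is not the repository's code: the fold count, the synthetic data, and the nearest-centroid classifier (`fit_centroids`, `predict_centroids`) are hypothetical stand-ins, and the actual experiments used boosting.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cv_average_accuracy(X, y, fit, predict, k=5, seed=0):
    """Return the mean held-out accuracy (in percent) over k folds."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        scores.append(np.mean(predict(model, X[val_idx]) == y[val_idx]))
    return 100.0 * float(np.mean(scores))

# Hypothetical stand-in classifier (nearest class centroid), NOT boosting.
def fit_centroids(X, y):
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_centroids(centroids, X):
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Synthetic data with 10 attributes and a binary label (assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)
print(cv_average_accuracy(X_demo, y_demo, fit_centroids, predict_centroids))
```

Any model can be swapped in by passing a different `fit` and `predict` pair, so each preprocessing variant in the table could be scored with the same folds.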

We observed that removing outliers (despite eliminating only 23 samples in total) resulted in performance declines across all models. We hypothesize that these outliers represent high-leverage extreme cases that are beneficial for the model's fitting. Additionally, using degree-2 kernel methods to add features also led to performance degradation, possibly due to the interference of redundant features with the model's fitting. Similarly, applying PCA did not improve model performance, which we attribute to the loss of original information when the attributes were linearly mapped into an orthogonal feature space.
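To make the feature-count argument concrete, here is a minimal sketch of the two transformations discussed above, on synthetic data. It assumes the degree-2 kernel step amounts to an explicit polynomial expansion and that PCA kept only some components; both are assumptions about the setup, and `degree2_features` and `pca_project` are illustrative names.

```python
import numpy as np
from itertools import combinations_with_replacement

def degree2_features(X):
    """Append all degree-2 monomials x_i * x_j (i <= j) to the original columns."""
    pairs = combinations_with_replacement(range(X.shape[1]), 2)
    products = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([X] + products)

def pca_project(X, n_components):
    """Project centered data onto its top principal components (via SVD).

    Returns the projected data and the fraction of variance retained.
    """
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)
    return Xc @ vt[:n_components].T, float(explained[:n_components].sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))  # synthetic stand-in for the 10 attributes

X2 = degree2_features(X)
print(X2.shape)  # (500, 65): 10 original columns + 55 degree-2 terms

Z, kept = pca_project(X, n_components=5)  # 5 components is an arbitrary choice
print(Z.shape, kept)  # kept < 1 means some variance was discarded
```

With 10 attributes the expansion adds 10 squares and 45 pairwise products, so the model sees 65 inputs instead of 10, which is one way redundant features could creep in.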

Standardizing the data proved to be effective for us. While not standardizing the data would slightly improve the average cross-validation accuracy in some cases, standardizing clearly yielded better results on the test dataset. Removing sig5 was also beneficial; although it did not significantly affect the average cross-validation accuracy, it slightly improved the test accuracy.
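The two beneficial steps can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: the signal column names `sig1` to `sig8` are guessed from the mention of sig5, and the data is synthetic. The statistics are fitted on the training split only and reused for the test split.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-column mean and standard deviation on the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def standardize(X, mean, std):
    """Apply previously fitted statistics to any split (train or test)."""
    return (X - mean) / std

def drop_column(X, names, name):
    """Remove one named column and return the reduced matrix and name list."""
    keep = [i for i, n in enumerate(names) if n != name]
    return X[:, keep], [names[i] for i in keep]

# Assumed column layout: two named attributes plus eight signals.
names = ["query_length", "is_homepaged"] + [f"sig{i}" for i in range(1, 9)]

rng = np.random.default_rng(0)
X_train = rng.normal(loc=3.0, scale=2.0, size=(1000, 10))  # synthetic data
X_test = rng.normal(loc=3.0, scale=2.0, size=(400, 10))

X_train, kept_names = drop_column(X_train, names, "sig5")
X_test, _ = drop_column(X_test, names, "sig5")

mean, std = fit_standardizer(X_train)
X_train_std = standardize(X_train, mean, std)
X_test_std = standardize(X_test, mean, std)  # reuses training statistics
print(X_train_std.shape, X_test_std.shape)  # (1000, 9) (400, 9)
```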

For deep learning, we also tested several models and recorded their results.

| Model | Initial Learning Rate | Weight Decay | Total Epoch | Validation Accuracy (%) |
| --- | --- | --- | --- | --- |
| VGG11 | 0.1 | 1e-4 | 10 | 66.66 |
| VGG19 | 0.05 | 1e-2 | 20 | 66.72 |
| ResNet18 | 0.05 | 1e-4 | 10 | 66.76 |
| ResNet50 | 0.05 | 1e-2 | 20 | 66.23 |

The neural networks tended to overfit the dataset easily (with the highest validation accuracy reaching 68.34%) and exhibited training instability, even when we reduced the learning rate. To mitigate overfitting, we increased the optimizer's weight decay, adjusted the validation set ratio, and applied dropout layers. Although neural networks have strong representational power, they are prone to overfitting or failing to capture the true relationships. The final results were not as good as those achieved with the traditional methods mentioned above.
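For readers unfamiliar with the two regularizers mentioned above, the following NumPy sketch shows what they do numerically. It assumes the common formulation where weight decay is added to the gradient as an L2 term and dropout is the "inverted" variant; whether the repository's training code matches this exactly is not stated, and both function names are illustrative.

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One SGD update where weight decay adds weight_decay * w to the gradient."""
    return w - lr * (grad + weight_decay * w)

def inverted_dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop and rescale the survivors.

    Rescaling by 1 / (1 - p_drop) keeps the expected activation unchanged,
    so nothing needs to be adjusted at evaluation time.
    """
    if not training or p_drop == 0.0:
        return activations
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

# With a zero gradient, weight decay alone shrinks the weights each step.
w = np.ones(4)
w_next = sgd_step_with_weight_decay(w, grad=np.zeros(4), lr=0.05, weight_decay=1e-2)
print(w_next)  # each weight becomes 1 - 0.05 * 0.01 = 0.9995

rng = np.random.default_rng(0)
h = np.ones(100_000)
h_drop = inverted_dropout(h, p_drop=0.5, rng=rng)
print(h_drop.mean())  # close to 1.0 on average
```

The learning rate 0.05 and weight decay 1e-2 in the demo are taken from the table; the dropout probability 0.5 is an arbitrary example value.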