hpylieva/FutureOfWikipedia
Originally, our problem statement was: “Find pages in Ukrainian Wikipedia that should be translated”. This task is very subjective and hard to evaluate: how do we define which pages are “better” candidates for translation?

After discussion, we decided to change the problem statement to: “Prediction of page translation from its historical data”.

To address the original task, we look for pages in Ukrainian Wikipedia that are most likely to be translated. This relies on two assumptions:

  • a page should be translated if similar pages have been translated before
  • a page counts as translated from Ukrainian to English if the Ukrainian page was created before the corresponding English page.
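
The second assumption amounts to a simple labeling rule. The following is a minimal sketch of that rule, not the code used in the repository: the function name `was_translated_uk_to_en` and its arguments are hypothetical, and it assumes the creation dates of both pages are already known.

```python
from datetime import date
from typing import Optional


def was_translated_uk_to_en(uk_created: date, en_created: Optional[date]) -> bool:
    """Hypothetical label: True if the Ukrainian page predates its English counterpart.

    A page with no English counterpart (en_created is None) is labeled untranslated.
    """
    if en_created is None:
        return False
    return uk_created < en_created


# Example: Ukrainian page created in 2012, English page in 2015.
label = was_translated_uk_to_en(date(2012, 5, 1), date(2015, 3, 10))
```

Under this rule, a page whose English version appeared first, or on the same day, is not counted as translated from Ukrainian.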

Project structure

data contains our final time series data with all features, plus data in aggregated format for translated and untranslated pages

data collection contains all scripts related to data collection

  • create_timeseries_views_revisions_contributors.ipynb : functions for creating time series for given titles (read from a pickle file), plus a few supporting functions

preprocessing/data_preprocessing.ipynb converts the time series data into an ordinary tabular format in which each article is described by one row of data.
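
As an illustration of this kind of conversion, here is a library-free sketch that collapses per-period records into one row per article. The record fields (`title`, `views`, `revisions`, `contributors`) are guessed from the data collection notebook's name, and the aggregate features are hypothetical; the notebook itself may compute different ones.

```python
from collections import defaultdict
from statistics import mean


def aggregate_timeseries(records):
    """Collapse time series records into one summary row per article title."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["title"]].append(rec)

    rows = []
    for title, recs in grouped.items():
        views = [r["views"] for r in recs]
        rows.append({
            "title": title,
            "n_periods": len(recs),  # number of observations for the article
            "total_views": sum(views),
            "mean_views": mean(views),
            "total_revisions": sum(r["revisions"] for r in recs),
            "max_contributors": max(r["contributors"] for r in recs),
        })
    return rows


# Toy input: two periods for article "A", one period for article "B".
records = [
    {"title": "A", "views": 10, "revisions": 1, "contributors": 1},
    {"title": "A", "views": 30, "revisions": 2, "contributors": 3},
    {"title": "B", "views": 5, "revisions": 0, "contributors": 1},
]
table = aggregate_timeseries(records)
```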

modeling/modeling.ipynb contains the partitioning of data into train and test sets, model fitting, and evaluation.
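
The partitioning and evaluation steps can be sketched without any external library. The example below is only an assumption about how such a step might look: the helper names, the `translated` label key, the 25% test fraction, and the choice of precision and recall as metrics are all hypothetical, and no particular model is implied.

```python
import random


def stratified_split(rows, label_key="translated", test_fraction=0.25, seed=0):
    """Split rows into train and test sets, keeping the label ratio in each."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)

    train, test = [], []
    for group in by_label.values():
        group = group[:]  # copy so the caller's list is not shuffled
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive ("translated") class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 8 translated and 12 untranslated articles.
rows = [{"id": i, "translated": i < 8} for i in range(20)]
train, test = stratified_split(rows)
```

Stratifying by label matters when translated pages are a minority, since a plain random split could leave the test set with very few positive examples.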

visualization contains visualizations in R of the distribution of pages by their age on the date of translation
