CDiscount-Classification

Besides file utilities, this code contains various learning methods designed for sparse datasets where keys are represented by strings. This makes it easy to keep track of what is going on in every algorithm: what do the centroids look like? What do the decision trees look like?
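As a rough illustration of what "string-keyed sparse data" means here, the hypothetical sketch below (the type and method names are not taken from this repository) builds TF-IDF vectors as plain `Dictionary<string, double>` objects; because the keys are words, a centroid or a tree split built on such vectors can be printed and inspected directly.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal sketch of a string-keyed sparse representation (hypothetical types,
// not the repository's actual classes). Each document is a dictionary mapping
// a token to its TF-IDF weight, so any model built on top of it stays readable.
static class SparseTfIdf
{
    // Computes TF-IDF vectors for a small corpus of pre-tokenized documents.
    public static List<Dictionary<string, double>> Build(List<string[]> docs)
    {
        int n = docs.Count;

        // Document frequency of each token.
        var df = new Dictionary<string, int>();
        foreach (var doc in docs)
            foreach (var token in doc.Distinct())
                df[token] = df.TryGetValue(token, out var c) ? c + 1 : 1;

        var vectors = new List<Dictionary<string, double>>();
        foreach (var doc in docs)
        {
            var tf = doc.GroupBy(t => t)
                        .ToDictionary(g => g.Key, g => (double)g.Count() / doc.Length);
            var vec = tf.ToDictionary(kv => kv.Key,
                                      kv => kv.Value * Math.Log((double)n / df[kv.Key]));
            vectors.Add(vec);
        }
        return vectors;
    }

    // Because keys are plain strings, inspecting a vector (or a centroid) is trivial.
    public static void PrintTop(Dictionary<string, double> vec, int k = 5)
    {
        foreach (var kv in vec.OrderByDescending(kv => kv.Value).Take(k))
            Console.WriteLine($"{kv.Key}: {kv.Value:F4}");
    }
}
```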

  • k-Nearest Neighbours (based on a TF-IDF representation of the documents)
  • Nearest Centroids (based on a TF-IDF representation of the documents)
  • "Logic" Decision tree, proposing formulas such as if(not word1 in doc and word2 in doc and word3 in doc) then Cat k
  • SGD for multiclass problems (with a high number of classes)
  • Bag of Words (not used in the final solution)
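The kind of rule produced by the "logic" decision tree can be pictured with the following hypothetical sketch (the LogicRule type and its members are illustrative, not this repository's API): a conjunction of word-present / word-absent tests leading to a category.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the kind of rule a "logic" decision tree produces:
// a conjunction of (word present / word absent) tests that leads to a category.
sealed class LogicRule
{
    public List<(string Word, bool MustBePresent)> Conditions { get; } =
        new List<(string Word, bool MustBePresent)>();
    public string Category { get; set; } = "";

    // A document matches when every condition agrees with the document's token set.
    public bool Matches(HashSet<string> docTokens)
    {
        foreach (var (word, mustBePresent) in Conditions)
            if (docTokens.Contains(word) != mustBePresent)
                return false;
        return true;
    }
}

static class LogicRuleDemo
{
    static void Main()
    {
        // if (not "word1" in doc and "word2" in doc and "word3" in doc) then "Cat k"
        var rule = new LogicRule { Category = "Cat k" };
        rule.Conditions.Add(("word1", false));
        rule.Conditions.Add(("word2", true));
        rule.Conditions.Add(("word3", true));

        var doc = new HashSet<string> { "word2", "word3", "word4" };
        Console.WriteLine(rule.Matches(doc) ? rule.Category : "no match"); // prints "Cat k"
    }
}
```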

Optimizations include:

  • Inverted indexes (a minimal sketch follows this list)
  • Parallelization
  • Pre-allocation of memory
  • Unsafe code
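As an illustration of the inverted-index optimization, here is a minimal, hypothetical sketch (names are not taken from the repository): by mapping each word to the list of documents containing it, a k-NN or centroid lookup only has to visit documents that share at least one word with the query.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of an inverted index (word -> ids of documents containing it),
// the kind of structure that keeps nearest-neighbour lookups cheap on sparse data.
static class InvertedIndexSketch
{
    public static Dictionary<string, List<int>> Build(List<string[]> docs)
    {
        var index = new Dictionary<string, List<int>>();
        for (int docId = 0; docId < docs.Count; docId++)
        {
            foreach (var token in new HashSet<string>(docs[docId]))
            {
                if (!index.TryGetValue(token, out var postings))
                    index[token] = postings = new List<int>();
                postings.Add(docId);
            }
        }
        return index;
    }

    // Only documents sharing at least one token with the query are ever visited.
    public static IEnumerable<int> Candidates(Dictionary<string, List<int>> index, string[] query)
    {
        var seen = new HashSet<int>();
        foreach (var token in query)
            if (index.TryGetValue(token, out var postings))
                foreach (var id in postings)
                    if (seen.Add(id))
                        yield return id;
    }
}
```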

Some parts of the code come from other sources:

  • Stemming is a C# port of Snowball
  • Text-to-TF-IDF utilities

The final solution was generated by following these steps:

  • Shuffle the training file according to 7 seeds: 0, 1, ..., 6
  • Down-sample each file so that there are no more than 1000 elements per class (a sketch of this step is shown below)
  • On each file, train: a 3-NN model (lookup parameter 0.25, no stemming), a Nearest Centroid model (using the PureInteractions mapping, stemmed), a hierarchical SGD model, an SGD model and a logic decision tree (max depth 4500, min elements per leaf = 8, no stemming)

They correspond to the default models in the main window.
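The shuffle and down-sampling step can be pictured with the hypothetical sketch below (the row type, method names and the use of System.Random are assumptions; only the seeds 0..6 and the 1000-per-class cap come from the description above).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the shuffle + down-sampling step: shuffle the rows with a
// fixed seed (0..6 in the write-up above), then keep at most 1000 rows per class.
static class DownSampler
{
    public static List<(string Label, string Line)> ShuffleAndDownSample(
        List<(string Label, string Line)> rows, int seed, int maxPerClass = 1000)
    {
        // Simple seeded shuffle (ordering by a fresh random key per row).
        var rng = new Random(seed);
        var shuffled = rows.OrderBy(_ => rng.Next()).ToList();

        // Keep at most maxPerClass rows per class, in shuffled order.
        var kept = new List<(string Label, string Line)>();
        var countPerClass = new Dictionary<string, int>();
        foreach (var row in shuffled)
        {
            countPerClass.TryGetValue(row.Label, out var count);
            if (count >= maxPerClass) continue;
            countPerClass[row.Label] = count + 1;
            kept.Add(row);
        }
        return kept;
    }
}
```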

About

Data science competition - 20th place solution (838)
