https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html

This is the best explanation of Datasets I have found. In contrast to RDDs, Datasets are strongly typed collections of objects, meaning you can (and must) define the element type yourself. The post includes side-by-side comparisons of how to code simple things with RDDs and with Datasets; a sketch of the typing difference follows below.
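A minimal sketch of that typing difference, assuming Spark 2.x; the Person case class and its fields are illustrative, not taken from the blog post:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; a Dataset needs an explicit element type.
case class Person(name: String, age: Long)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DatasetSketch").getOrCreate()
    import spark.implicits._ // encoders and rdd.toDS()

    // RDD: a distributed collection whose schema Spark's optimizer cannot see.
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 34L), Person("Bob", 29L)))

    // Dataset: strongly typed; a misspelled field is a compile-time error.
    val ds = rdd.toDS()
    val adults = ds.filter(_.age >= 30) // checked against Person at compile time
    adults.show()

    spark.stop()
  }
}
```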

RANDOM FORESTS

PARAMETERS FOR RANDOM FORESTS

input - Training dataset: an RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}. (A training sketch wiring these parameters together follows this list.)

numClasses - number of classes for classification.

categoricalFeaturesInfo - Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.

numTrees - Number of trees in the random forest.

featureSubsetStrategy - Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is chosen based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest), set to "sqrt".

impurity - Criterion used for information gain calculation. Supported values: "gini" (recommended) or "entropy". From the Wikipedia article on decision tree learning: "Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset." The entropy measure comes from information theory. An analogy: a decision tree is like a game of twenty questions, where a good question scores highly in information gain; similarly, a highly discriminating attribute makes a good branch in a decision tree.

maxDepth - Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (suggested value: 4)

maxBins - Maximum number of bins used for splitting features. (suggested value: 100)

seed - Random seed for bootstrapping and choosing feature subsets.
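
A minimal Scala sketch wiring these parameters into MLlib's RandomForest.trainClassifier; the data path, split ratio, and numTrees value are illustrative assumptions, while maxDepth and maxBins follow the suggested values above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RandomForestSketch"))

    // Hypothetical LIBSVM-format file; labels must already be 0..numClasses-1.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]() // empty: all features continuous
    val numTrees = 100                            // illustrative choice
    val featureSubsetStrategy = "auto"            // resolves to "sqrt" since numTrees > 1
    val impurity = "gini"                         // recommended above
    val maxDepth = 4                              // suggested value above
    val maxBins = 100                             // suggested value above

    val model = RandomForest.trainClassifier(training, numClasses,
      categoricalFeaturesInfo, numTrees, featureSubsetStrategy,
      impurity, maxDepth, maxBins, seed = 42)

    // Holdout error: fraction of test points the forest misclassifies.
    val testErr = test.map { p =>
      if (model.predict(p.features) == p.label) 0.0 else 1.0
    }.mean()
    println(s"Test error = $testErr")

    sc.stop()
  }
}
```

With featureSubsetStrategy = "auto" and more than one tree, each split considers a random subset of roughly sqrt(numFeatures) features, which is what decorrelates the trees in the forest.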

PARAMETERS FOR GRADIENT BOOSTED TREES

input - Training dataset: an RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.

boostingStrategy - Structure that governs the operation of the algorithm. It has the following parameters (a configuration sketch using them follows this list):

treeStrategy - Structure that governs the creation of the individual decision trees; it has most of the same parameters as Random Forests (e.g., maxDepth, maxBins, impurity). It also has a parameter we used, minInfoGain, which requires a minimum information gain for a node split to be valid.

loss - The loss function to be minimized during training.

numIterations - Number of boosting iterations, i.e., the number of trees in the final model.

learningRate - Step size used when descending the gradient; it scales the contribution of each new tree, so smaller values typically need more iterations.

validationTol - Error tolerance governing early termination of boosting, based on the model's performance on a held-out validation set.
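
A minimal Scala sketch of filling in MLlib's BoostingStrategy; the data path and the specific values are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.util.MLUtils

object GBTSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GBTSketch"))

    // Hypothetical LIBSVM-format file; labels must already be 0..numClasses-1.
    val trainingData = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // Start from MLlib's defaults for binary classification (loss = log loss),
    // then override the fields discussed above. All values are illustrative.
    val boostingStrategy = BoostingStrategy.defaultParams("Classification")
    boostingStrategy.numIterations = 50            // number of trees in the final model
    boostingStrategy.learningRate = 0.1            // shrink each new tree's contribution
    boostingStrategy.validationTol = 0.001         // only used by runWithValidation
    boostingStrategy.treeStrategy.maxDepth = 4
    boostingStrategy.treeStrategy.maxBins = 100
    boostingStrategy.treeStrategy.minInfoGain = 0.001 // reject low-gain splits
    boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

    val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
    println(model.toDebugString)

    sc.stop()
  }
}
```

Note that validationTol only takes effect when boosting against a held-out validation set, i.e., new GradientBoostedTrees(boostingStrategy).runWithValidation(train, validation); the plain train call above always runs the full numIterations rounds.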
