Pre processing
As with all families of machine learning algorithms, performance is a function of the quality of the input data. Clust4j includes the following Transformer classes:
BoxCoxTransformer
MeanCenterer
MedianCenterer
MinMaxScaler
PCA
RobustScaler
StandardScaler
WeightTransformer
YeoJohnsonTransformer
As with many other clust4j classes, their interface is designed to be familiar to sklearn users. All Transformer classes can be used as in the following pseudo-code:
RealMatrix X1 = some_data;
RealMatrix X2 = some_other_data;
// initialize and fit
Transformer t = new StandardScaler().fit(X1);
// transform train and test
RealMatrix train = t.transform(X1);
RealMatrix test = t.transform(X2);
// you can also inverse transform
RealMatrix inverse_train = t.inverseTransform(train); // should equal X1
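To make the fit/transform/inverse pattern concrete, here is a minimal plain-Java sketch of a standard scaler that works on `double[][]` arrays. It is not clust4j's StandardScaler: the class name `StandardScalerSketch`, the use of the population standard deviation, and the handling of constant columns are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative stand-in for a fit/transform/inverseTransform scaler.
// NOT clust4j's StandardScaler; names and details here are assumptions.
final class StandardScalerSketch {
    private double[] means;  // per-column means learned by fit
    private double[] stdevs; // per-column population standard deviations learned by fit

    // Learn column statistics from the training rows; returns this for chaining.
    StandardScalerSketch fit(double[][] x) {
        int n = x.length, m = x[0].length;
        means = new double[m];
        stdevs = new double[m];
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (double[] row : x) sum += row[j];
            means[j] = sum / n;
            double ss = 0.0;
            for (double[] row : x) ss += (row[j] - means[j]) * (row[j] - means[j]);
            double sd = Math.sqrt(ss / n);
            stdevs[j] = (sd == 0.0) ? 1.0 : sd; // leave constant columns unscaled
        }
        return this;
    }

    // Apply the statistics learned in fit to any matrix with the same columns.
    double[][] transform(double[][] x) {
        double[][] out = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            out[i] = new double[x[i].length];
            for (int j = 0; j < x[i].length; j++) {
                out[i][j] = (x[i][j] - means[j]) / stdevs[j];
            }
        }
        return out;
    }

    // Undo transform: multiply by the learned deviation, then add the learned mean.
    double[][] inverseTransform(double[][] z) {
        double[][] out = new double[z.length][];
        for (int i = 0; i < z.length; i++) {
            out[i] = new double[z[i].length];
            for (int j = 0; j < z[i].length; j++) {
                out[i][j] = z[i][j] * stdevs[j] + means[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] x1 = {{1, 10}, {2, 20}, {3, 30}}; // "train"
        double[][] x2 = {{2, 20}, {4, 40}};          // "test"
        StandardScalerSketch t = new StandardScalerSketch().fit(x1);
        System.out.println(Arrays.deepToString(t.transform(x2)));
        System.out.println(Arrays.deepToString(t.inverseTransform(t.transform(x1))));
    }
}
```

The point of the pattern is that the statistics come from the training data only. The test matrix is scaled with the training means and deviations, and the inverse transform recovers the original values up to floating-point rounding.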
Clust4j includes a toy DataSet of intertwining crescents for benchmarking various algorithms (see ExampleDataSets.loadToyMoons()).
[Figure: X1 vs X2]
[Figure: X1 vs X3 (notice that in this dimension, we can achieve linear separability!)]
Head:
X1 | X2 | X3 | labels |
---|---|---|---|
1.582023 | -0.445815 | 0.461456 | 1 |
0.066045 | 0.439207 | 0.480332 | 1 |
0.736631 | -0.398963 | 0.501694 | 1 |
-1.056928 | 0.242456 | 0.025548 | 0 |
Here's the setup for an example you can run:
// load the dataset
DataSet moons = ExampleDataSets.loadToyMoons();
final int[] actual_labels = moons.getLabels();
Most algorithms cannot segment the two classes without any pre-processing:
RealMatrix data = moons.getData();
KMeansParameters params = new KMeansParameters(2);
KMeans model = params.fitNewModel(data);
int[] predicted_labels = model.getLabels(); // likely only about 50% accurate (depends on the random state)
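Comparing predicted and actual labels needs some care, because cluster ids are generally arbitrary: a clustering that calls every true 0 a 1 and every true 1 a 0 is still a perfect split. The sketch below scores a two-cluster result under both possible labelings and keeps the better one. The helper `binaryAccuracy` is hypothetical (it is not part of clust4j) and assumes both arrays contain only the labels 0 and 1.

```java
// Hypothetical helper, not part of clust4j: accuracy for a two-cluster result
// that is insensitive to which cluster happened to be called 0 or 1.
final class ClusterAccuracySketch {

    // Assumes both arrays have the same length and contain only 0 and 1.
    static double binaryAccuracy(int[] actual, int[] predicted) {
        if (actual.length != predicted.length || actual.length == 0) {
            throw new IllegalArgumentException("label arrays must be non-empty and equal in length");
        }
        int agree = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) agree++;
        }
        double direct = agree / (double) actual.length;
        // With two labels, swapping the cluster ids turns every agreement into
        // a disagreement and vice versa, so the swapped score is 1 - direct.
        return Math.max(direct, 1.0 - direct);
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1};
        int[] swapped = {1, 1, 0, 0}; // perfect split, opposite ids
        System.out.println(binaryAccuracy(actual, swapped));
    }
}
```

On this scale, 0.5 is the worst possible score for two clusters, so a result near 50% means the clustering carries essentially no information about the true classes.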
However, using a WeightTransformer, we can emphasize the importance of the X3 feature over the others:
// With just a bit of preprocessing...
UnsupervisedPipeline<KMeans> pipe = new UnsupervisedPipeline<KMeans>(
params,
new WeightTransformer(new double[]{0.5, 0.0, 2.0})
);
predicted_labels = pipe.fit(data).getLabels();
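To see why the weights help a distance-based algorithm such as KMeans, the following plain-Java sketch multiplies each column by its weight and compares Euclidean distances before and after. It assumes that the weighting amounts to per-column multiplication, which is an illustration and not a statement about clust4j's internals; `applyWeights` and `euclidean` are hypothetical helpers. The three rows are taken from the head of the table above.

```java
// Hypothetical illustration of per-column weighting; not clust4j code.
final class WeightSketch {

    // Multiply every column j of x by w[j]; returns a new matrix.
    static double[][] applyWeights(double[][] x, double[] w) {
        double[][] out = new double[x.length][w.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < w.length; j++) {
                out[i][j] = x[i][j] * w[j];
            }
        }
        return out;
    }

    // Plain Euclidean distance between two rows of equal length.
    static double euclidean(double[] a, double[] b) {
        double ss = 0.0;
        for (int j = 0; j < a.length; j++) {
            ss += (a[j] - b[j]) * (a[j] - b[j]);
        }
        return Math.sqrt(ss);
    }

    public static void main(String[] args) {
        double[][] rows = {
            { 1.582023, -0.445815, 0.461456}, // label 1
            { 0.066045,  0.439207, 0.480332}, // label 1
            {-1.056928,  0.242456, 0.025548}  // label 0
        };
        double[][] weighted = applyWeights(rows, new double[]{0.5, 0.0, 2.0});

        System.out.println("raw:      same-class " + euclidean(rows[1], rows[0])
                + ", other-class " + euclidean(rows[1], rows[2]));
        System.out.println("weighted: same-class " + euclidean(weighted[1], weighted[0])
                + ", other-class " + euclidean(weighted[1], weighted[2]));
    }
}
```

For these three rows, the second point is closer to the label-0 point than to its own classmate in the raw space, and the ordering flips once X2 is zeroed out and X3 is doubled. Three rows prove nothing about the full dataset, but they show the mechanism: weighting changes which features dominate the distance.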
Though this is a trivial example, and your data will rarely contain a perfectly separating hyperplane, it emphasizes the importance of exploring your data before modeling and of applying transformations or pre-processing techniques where appropriate to get the most out of your clustering.