ValueError: This solver needs samples of at least 2 classes in the data #49

mrshanth · 2015-07-07T14:10:03Z

Hi,

I am using SparkLinearSVC. The code is as follows:

svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z,classes=np.unique(train_y))

and I get the following error:

File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
    return f(iterator)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
    mapper = lambda X_y: super(cls, self).fit(
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
    self.loss
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

However, I have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000. Classes 0 and 1 make up 92% and 8% of the data, respectively.


kszucs · 2015-07-07T19:39:45Z

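Sadly, this is indeed a bug. Sparkit trains sklearn's linear models in parallel, one per block, then averages them in a reduce step. At least one of your blocks contains only one of the labels, so the sklearn solver fails on that block. To check, try the following, which counts the blocks that have fewer than 2 distinct labels: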

train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()

As a workaround, you could shuffle the training data so that no block ends up with only one label, but this is still waiting for a clever solution.
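A minimal sketch of why shuffling helps, using plain NumPy instead of a DictRDD. The helper `count_single_class_blocks` and the synthetic labels are assumptions made for illustration, not part of Sparkit. The labels are sorted on purpose to mimic data that arrives ordered by class, with the 92%/8% ratio and the block size of 2000 from the report above:

```python
import numpy as np

def count_single_class_blocks(y, block_size):
    """Count consecutive blocks of y that contain fewer than 2 distinct labels."""
    blocks = [y[i:i + block_size] for i in range(0, len(y), block_size)]
    return sum(1 for block in blocks if np.unique(block).size < 2)

# Synthetic labels (assumption): 92% class 0 followed by 8% class 1.
y_sorted = np.array([0] * 9200 + [1] * 800)

# Shuffle with a fixed seed so the result is reproducible.
# In real use, the same permutation must be applied to the feature rows as well.
rng = np.random.RandomState(0)
permutation = rng.permutation(len(y_sorted))
y_shuffled = y_sorted[permutation]

bad_before = count_single_class_blocks(y_sorted, block_size=2000)
bad_after = count_single_class_blocks(y_shuffled, block_size=2000)
print(bad_before, bad_after)
```

With sorted labels, the first four blocks hold only class 0. After shuffling, every block of 2000 samples is overwhelmingly likely to contain both classes at this class ratio. Shuffling lowers the risk but does not guarantee it for very rare classes or very small blocks.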

mrshanth · 2015-07-08T12:31:22Z

Thanks

jaydee92 · 2017-12-14T04:32:01Z

I believe I found a workaround for this. Since this problem tends to occur with highly imbalanced datasets, I would suggest using StratifiedShuffleSplit and adjusting the train_size or test_size ratio, as shown below:

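import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# X, y: NumPy arrays of features and labels; someModel stands in for any sklearn classifier.
for trainRatio in np.arange(0.05, 1, 0.05):
    # Stratified splits keep the class ratio of y in both the train and the test part.
    split = StratifiedShuffleSplit(n_splits=2, train_size=trainRatio)
    for trainIdx, testIdx in split.split(X, y):
        Xtrain, Xtest = X[trainIdx], X[testIdx]
        ytrain, ytest = y[trainIdx], y[testIdx]
        model = someModel()
        model.fit(Xtrain, ytrain)
        pred = model.predict(Xtest)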
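The same stratification idea could also be applied to the blocks themselves, so that every block is guaranteed to see both labels. The sketch below is an assumption-laden illustration in plain NumPy: `stratified_blocks` is a hypothetical helper, not a Sparkit or sklearn function. It shuffles the indices of each class separately and deals them evenly across the blocks, which works as long as every class has at least as many samples as there are blocks:

```python
import numpy as np

def stratified_blocks(y, n_blocks, seed=0):
    """Return one index array per block, each with roughly the class ratio of y.

    Hypothetical helper: per-class indices are shuffled, then split evenly
    across the blocks.
    """
    rng = np.random.RandomState(seed)
    blocks = [[] for _ in range(n_blocks)]
    for label in np.unique(y):
        label_idx = rng.permutation(np.flatnonzero(y == label))
        # array_split tolerates class sizes that are not divisible by n_blocks.
        for b, chunk in enumerate(np.array_split(label_idx, n_blocks)):
            blocks[b].extend(chunk.tolist())
    return [np.array(block) for block in blocks]

# Synthetic labels (assumption): 92% class 0 and 8% class 1, sorted by class.
y = np.array([0] * 9200 + [1] * 800)

blocks = stratified_blocks(y, n_blocks=5)
classes_per_block = [int(np.unique(y[idx]).size) for idx in blocks]
print(classes_per_block)
```

The rows selected by each index array would then have to be placed into the same block of the distributed dataset. How to do that depends on how the DictRDD is built and is not shown here.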

YoannCheung · 2019-02-19T17:34:48Z

Sadly this is a bug indeed. Sparkit trains sklearn's linear models in parallel, then averages them in a reduce step. There is at least one block, which contains only one of the labels. To check try the following:
train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
To resolve You could randomize the train data to avoid blocks with one label, but this is still waiting for a clever solution.

Can't believe that this bug is still not fixed! Sad!

kszucs added the bug label Jul 7, 2015

shuvayan mentioned this issue May 11, 2017

Intent classification error : Asks for atleast 2 classes in the sample when there are 2 samples already RasaHQ/rasa#359

Closed

ma-zhiyuan mentioned this issue Apr 8, 2019

LR train.py train error ma-zhiyuan/PersonalRecommendation#1

Closed
