
ValueError: This solver needs samples of at least 2 classes in the data #49

Open · mrshanth opened this issue Jul 7, 2015 · 4 comments · Labels: bug

mrshanth commented Jul 7, 2015

Hi,

I am using SparkLinearSVC. The code is as follows:

import numpy as np
from splearn.svm import SparkLinearSVC

svm_model = SparkLinearSVC(class_weight='auto')
svm_fitted = svm_model.fit(train_Z, classes=np.unique(train_y))

and I get the following error:

File "/DATA/sdw1/hadoop/yarn/local/usercache/ad79139/filecache/328/spark-assembly-1.2.1.2.2.4.2-2-hadoop2.6.0.2.2.4.2-2.jar/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 2081, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 258, in func
    return f(iterator)
  File "/usr/hdp/2.2.4.2-2/spark/python/pyspark/rdd.py", line 820, in <lambda>
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/usr/lib/python2.6/site-packages/splearn/linear_model/base.py", line 81, in <lambda>
    mapper = lambda X_y: super(cls, self).fit(
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/classes.py", line 207, in fit
    self.loss
  File "/usr/lib64/python2.6/site-packages/sklearn/svm/base.py", line 809, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

However, I do have 2 classes, namely 0 and 1. The block size of the DictRDD is 2000, and classes 0 and 1 make up 92% and 8% of the data respectively.

kszucs added the bug label Jul 7, 2015
kszucs (Contributor) commented Jul 7, 2015

Sadly, this is a bug indeed. Sparkit-learn trains sklearn's linear models in parallel, one per block, then averages them in a reduce step. At least one of your blocks contains only one of the labels. To check, try the following:

train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
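
This counts the blocks whose label column contains fewer than two distinct values; any nonzero count means some block will trigger the error above. For context, the per-block failure is easy to reproduce with plain sklearn, outside Spark entirely (a minimal sketch):

import numpy as np
from sklearn.svm import LinearSVC

X_block = np.random.rand(10, 3)    # one block of features
y_block = np.zeros(10)             # a block that contains only class 0
LinearSVC().fit(X_block, y_block)  # raises the same ValueError as above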

To work around it, you could shuffle the training data so that no block ends up with a single label, but this is still waiting for a clever solution.
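
A minimal sketch of that shuffle, assuming train_Z was built from local arrays train_X and train_y (hypothetical names; adapt to however you actually construct your DictRDD):

import numpy as np

rng = np.random.RandomState(0)
perm = rng.permutation(len(train_y))             # one random row order
train_X, train_y = train_X[perm], train_y[perm]  # shuffle rows consistently

# Rebuild the DictRDD from the shuffled arrays exactly as before.
# With an 8% minority class, the chance that a random 2000-row block
# contains no positive at all is about 0.92**2000, i.e. effectively zero.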

mrshanth (Author) commented Jul 8, 2015

Thanks

jaydee92 commented
I believe I found a workaround for this. Since these problems tend to happen with highly imbalanced datasets, I would suggest using StratifiedShuffleSplit and varying the train_size or test_size ratio, as seen below:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Stratified splits preserve the class ratio, so both classes are
# guaranteed to appear in every training subset.
for trainRatio in np.arange(0.05, 1, 0.05):
    split = StratifiedShuffleSplit(n_splits=2, train_size=trainRatio)
    for trainIdx, testIdx in split.split(X, y):
        Xtrain, Xtest = X[trainIdx], X[testIdx]
        ytrain, ytest = y[trainIdx], y[testIdx]
        model = someModel()  # any sklearn-style estimator
        model.fit(Xtrain, ytrain)
        pred = model.predict(Xtest)
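
Because StratifiedShuffleSplit preserves the 92/8 class ratio in every split, each training subset is guaranteed to contain both classes. Note that it operates on local arrays, so it only sidesteps the per-block problem here if the stratified subsets are what you then feed into Spark.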

YoannCheung commented

> Sadly, this is a bug indeed. Sparkit-learn trains sklearn's linear models in parallel, one per block, then averages them in a reduce step. At least one of your blocks contains only one of the labels. To check, try the following:
>
> train_Z[:, 'y']._rdd.map(lambda x: np.unique(x).size).filter(lambda x: x < 2).count()
>
> To work around it, you could shuffle the training data so that no block ends up with a single label, but this is still waiting for a clever solution.

Can't believe that this bug is still not fixed! Sad!
