Formasaurus init fails with scikit-learn 1.2.0 #31

Open
mlec1 opened this issue Dec 16, 2022 · 1 comment

mlec1 commented Dec 16, 2022

It seems that scikit-learn v1.2.0, released in Dec 2022, breaks the formasaurus init command. See the following output:

Training form type detector on 1423 example(s)...
#9 4.760 Traceback (most recent call last):
#9 4.760   File "/usr/local/bin/formasaurus", line 33, in <module>
#9 4.761     sys.exit(load_entry_point('formasaurus==0.9.0', 'console_scripts', 'formasaurus')())
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/__main__.py", line 72, in main
#9 4.761     formasaurus.FormFieldClassifier.load()
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 101, in load
#9 4.761     ex = cls.trained_on(DEFAULT_DATA_PATH)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 119, in trained_on
#9 4.761     ex.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 131, in train
#9 4.761     self.form_classifier.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 266, in train
#9 4.761     self.model = formtype_model.train(
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/formtype_model.py", line 128, in train
#9 4.762     return model.fit(X, y)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 402, in fit
#9 4.762     Xt = self._fit(X, y, **fit_params_steps)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 360, in _fit
#9 4.762     X, fitted_transformer = fit_transform_one_cached(
#9 4.762   File "/usr/local/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
#9 4.762     return self.func(*args, **kwargs)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.762     res = transformer.fit_transform(X, y, **fit_params)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
#9 4.763     data_to_wrap = f(self, X, *args, **kwargs)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1193, in fit_transform
#9 4.763     results = self._parallel_func(X, y, fit_params, _fit_transform_one)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1215, in _parallel_func
#9 4.763     return Parallel(n_jobs=self.n_jobs)(
#9 4.763   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 1088, in __call__
#9 4.764     while self.dispatch_one_batch(iterator):
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
#9 4.764     self._dispatch(tasks)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
#9 4.764     job = self._backend.apply_async(batch, callback=cb)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
#9 4.764     result = ImmediateResult(func)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
#9 4.764     self.results = batch()
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/fixes.py", line 117, in __call__
#9 4.765     return self.function(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.765     res = transformer.fit_transform(X, y, **fit_params)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 446, in fit_transform
#9 4.766     return last_step.fit_transform(Xt, y, **fit_params_last_step)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 2121, in fit_transform
#9 4.766     X = super().fit_transform(raw_documents)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1358, in fit_transform
#9 4.768     self._validate_params()
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 570, in _validate_params
#9 4.768     validate_parameter_constraints(
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
#9 4.768     raise InvalidParameterError(
#9 4.768 sklearn.utils._param_validation.InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None. Got {'and', 'of', 'or'} instead.

This command works fine with the previous version of scikit-learn, v1.1.3.
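
Judging by the error message, the vectorizer receives its stop words as a set, which scikit-learn 1.2.0 rejects in favour of a list, the string 'english', or None. Below is a minimal sketch of a possible workaround, not the actual Formasaurus code: `normalize_stop_words` is a hypothetical helper, and the sample documents are made up for illustration. The scikit-learn part is skipped if the library is not installed.

```python
def normalize_stop_words(stop_words):
    """Return stop_words in a form that scikit-learn >= 1.2 accepts.

    Hypothetical helper (not part of Formasaurus or scikit-learn).
    None, strings such as 'english', and lists pass through unchanged;
    any other iterable (set, frozenset, tuple) becomes a sorted list.
    """
    if stop_words is None or isinstance(stop_words, (str, list)):
        return stop_words
    return sorted(stop_words)


# The set from the error message above, converted to a list.
STOP_WORDS = normalize_stop_words({"and", "of", "or"})
print(STOP_WORDS)  # ['and', 'of', 'or']

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
except ImportError:
    TfidfVectorizer = None  # scikit-learn not available; skip the demo

if TfidfVectorizer is not None:
    # With a list, parameter validation in scikit-learn 1.2.0 passes.
    vectorizer = TfidfVectorizer(stop_words=STOP_WORDS)
    vectorizer.fit_transform(["terms of service", "sign in or sign up"])
    print(sorted(vectorizer.vocabulary_))
```

Until something like this is applied where the vectorizer is built, pinning scikit-learn to v1.1.3 avoids the error, since that version still accepts the set.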

Contributor

kmike commented Jul 29, 2024

This should be fixed in https://github.com/scrapinghub/Formasaurus (released as 0.9.0). Unfortunately, we lost access to this repo, so development has moved to another location.
