Experimental plugin for scikit-learn that implements a backend for (some) scikit-learn
estimators, written in PyTorch, so that it benefits from PyTorch's ability to dispatch
data and compute to many devices, provided the appropriate PyTorch extensions are
installed.
This package requires the following experimental branch of scikit-learn:
- feature/engine-api branch on https://github.com/scikit-learn/scikit-learn
The following engine is currently included:
- sklearn.cluster.KMeans for the standard Lloyd's algorithm on dense data arrays, including kmeans++ support.
Getting started requires a working Python environment for using PyTorch. Depending on
the device you target, install PyTorch extensions accordingly, including (but not
limited to):
- using one of the native distributions for cuda (for NVIDIA GPUs), rocm (AMD GPUs) or mps (Apple GPUs) support
- using Intel distributions for xpu (for Intel GPUs), which has experimental (unofficial) support for iGPUs if compiling from source with appropriate flags
Using the plugin requires the experimental development branch feature/engine-api of
scikit-learn, which implements the compatible plugin system. The sklearn_pytorch_engine
plugin is compatible with the commit 2ccfc8c4bdf66db005d7681757b4145842944fb9 available
in the fork fcharras/scikit-learn.
Please refer to the relevant scikit-learn documentation page for a comprehensive guide
to installing from source. For instance, using pip and apt (assuming an apt-based
environment):
apt-get update --quiet
# Install prerequisites
apt-get install -y build-essential python3-dev git
pip install cython numpy scipy joblib threadpoolctl
# Build and install
pip install git+https://github.com/fcharras/scikit-learn.git@2ccfc8c4bdf66db005d7681757b4145842944fb9#egg=scikit-learn
With your PyTorch + scikit-learn environment activated, install the plugin with:
git clone https://github.com/soda-inria/sklearn-pytorch-engine
cd sklearn-pytorch-engine
pip install -e .
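After installation, a quick sanity check can confirm that the three packages are visible from the active environment. The following is a minimal sketch using only the standard library; the helper name check_environment is hypothetical, and the check only tells whether each module can be found, not whether scikit-learn was built from the required feature/engine-api commit.

```python
from importlib.util import find_spec

# Top-level import names of the packages this setup relies on.
REQUIRED_MODULES = ("torch", "sklearn", "sklearn_pytorch_engine")


def check_environment(modules=REQUIRED_MODULES):
    """Hypothetical helper: map each module name to True if it can be found.

    find_spec locates a top-level module without importing it, so this
    stays fast even for heavy packages.
    """
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Any module reported as MISSING points to an installation step above that needs to be repeated in the current environment.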
See the sklearn_pytorch_engine/kmeans/tests folder for example usage.
🚧 TODO: write some examples here instead.
To run the tests, run the following from the root of the sklearn_pytorch_engine
repository:
pytest sklearn_pytorch_engine
To run the scikit-learn tests with the sklearn_pytorch_engine engine, run the
following:
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
(change the --pyargs option accordingly to select other test suites).
The --sklearn-engine-provider sklearn_pytorch_engine option offered by the sklearn
pytest plugin will automatically activate the sklearn_pytorch_engine engine for all
tests.
Tests covering unsupported features (that trigger
sklearn.exceptions.FeatureNotCoveredByPluginError) will be automatically marked as
xfailed.
By default, the engine uses the "compute follows data" principle, meaning that it
runs the compute on the device that holds the data. For instance, kmeans.fit(X)
will run the compute on the corresponding xpu device if X is a torch.Tensor such that
X.device.type is "xpu", will run on the CPU if X.device.type is "cpu", and so on.
It is possible to alter this behavior and force the engine to offload the compute to
a specific device, using the environment variable
SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE. For instance, on a compatible computer,
SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=mps will force the compute to the
mps-compatible device, even if it requires copying the input data under the hood to
do so.
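The device selection rule described above can be summarized in a few lines of Python. This is only an illustrative sketch of the documented behavior, not the plugin's actual code: the helper names resolve_compute_device and needs_copy are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os

ENV_VAR = "SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE"


def resolve_compute_device(data_device_type, environ=None):
    """Hypothetical helper: return the device type the compute would run on.

    data_device_type stands for X.device.type of the input tensor,
    for instance "cpu", "xpu", "cuda" or "mps".
    """
    environ = os.environ if environ is None else environ
    forced = environ.get(ENV_VAR)
    # Assumption: an empty value is treated the same as an unset variable.
    if forced:
        return forced  # the environment variable overrides the default
    return data_device_type  # default: compute follows data


def needs_copy(data_device_type, environ=None):
    """True when the input would have to be copied to another device."""
    return resolve_compute_device(data_device_type, environ) != data_device_type


if __name__ == "__main__":
    print(resolve_compute_device("xpu", environ={}))
    print(resolve_compute_device("cpu", environ={ENV_VAR: "mps"}))
```

With the variable unset, the result is always the device of the data; once it is set, the result is the forced device and needs_copy reports whether a device-to-device transfer would be implied.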
Both the internal and the scikit-learn test suites can run with any value of
SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE, as long as the compatible PyTorch extension
is available and the host hardware is compatible; for instance,
export SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=xpu
pytest sklearn_pytorch_engine
SKLEARN_PYTORCH_ENGINE_TESTING_MODE=1 pytest --sklearn-engine-provider sklearn_pytorch_engine --pyargs sklearn.cluster.tests.test_k_means
will run all compute on the relevant xpu device.
At the moment, both test suites create test data that is hosted on the CPU by
default. For internal tests, this behavior can be changed with the environment variable
SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE; for instance, the command
SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE=cuda SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE=cpu pytest sklearn_pytorch_engine
will run the tests while enforcing that the test data is generated on the cuda device
but the compute is done on cpu (since SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE is set
to cpu).
All combinations of those two environment variables make for a reasonably exhaustive test matrix regarding internal data conversions.
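As an illustration, the test matrix can be enumerated with a short script that builds one command line per combination of the two variables. This is a sketch under assumptions: the helper name build_matrix_commands is hypothetical, and the device list ("cpu", "cuda") is only an example to be replaced with the devices actually available on the host.

```python
import itertools

# Assumption: example device list; adapt it to the hardware at hand.
DEVICES = ("cpu", "cuda")


def build_matrix_commands(devices=DEVICES):
    """Hypothetical helper: one internal test command per device pair.

    The first element of each pair is where the test data is created,
    the second is where the compute is forced to run.
    """
    commands = []
    for inputs_device, compute_device in itertools.product(devices, repeat=2):
        commands.append(
            f"SKLEARN_PYTORCH_ENGINE_TEST_INPUTS_DEVICE={inputs_device} "
            f"SKLEARN_PYTORCH_ENGINE_DEFAULT_DEVICE={compute_device} "
            "pytest sklearn_pytorch_engine"
        )
    return commands


if __name__ == "__main__":
    for command in build_matrix_commands():
        print(command)
```

Each printed line can be pasted into a shell; pairs with two different devices exercise the internal data conversions, while pairs with the same device exercise the case with no transfer.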
In many machine learning applications, operations on single-precision (float32) floating point data require half the memory of double-precision (float64) data, and are regarded as faster, accurate enough and better suited to GPU compute. Besides, most GPUs used in machine learning projects are significantly faster with float32 than with float64 data.
To leverage the full potential of GPU execution, it is strongly advised to use a float32 data type.
By default, unless specified otherwise, NumPy arrays are created with type float64, so pay special attention to the type whenever the loader does not explicitly document the type or expose a type option.
Converting NumPy arrays from float64 to float32 is also possible using
numpy.ndarray.astype, although it is less recommended because it incurs an avoidable
data copy. numpy.ndarray.astype can be used as follows:
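import numpy as np

X = my_data_loader()  # placeholder for your own loading function
X_float32 = X.astype(np.float32)  # allocates a new float32 array
my_gpu_compute(X_float32)  # placeholder for the GPU workload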
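To make the memory argument concrete, the sketch below compares an array created with the default float64 type with one created directly as float32, and shows the conversion fallback. The array shape is arbitrary and np.ones merely stands in for a real data loader.

```python
import numpy as np

# NumPy creates float64 arrays by default.
X64 = np.ones((1000, 10))

# Preferred: request float32 at creation time, so no float64 array is ever built.
X32 = np.ones((1000, 10), dtype=np.float32)

# Fallback: convert an existing array; astype allocates a new array.
X32_converted = X64.astype(np.float32)

# With copy=False, astype returns the input itself when the dtype already matches.
X32_same = X32.astype(np.float32, copy=False)

# Same shape, half the bytes.
print(X64.dtype, X64.nbytes)
print(X32.dtype, X32.nbytes)
```

When a loader exposes a dtype option, passing float32 there follows the preferred path; passing copy=False to astype is a cheap safeguard in code that may receive either type.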