[MRG+1] Threshold for pairs learners (#168)

* add some tests for testing that different scores work using the scoring function
* ENH: Add tests and basic threshold implementation
* Add support for LSML and more generally quadruplets
* Make CalibratedClassifierCV work (for preprocessor case) thanks to classes_
* Fix some tests and PEP8 errors
* change the sign in decision function
* Add docstring for threshold_ and classes_ in the base _PairsClassifier class
* remove quadruplets from the test with scikit learn custom scorings
* Remove argument y in quadruplets learners and lsml
* FIX fix docstrings of decision functions
* FIX the threshold by taking the opposite (to be adapted to the decision function)
* Fix tests to have no y for quadruplets' estimator fit
* Remove isin to be compatible with old numpy versions
* Fix threshold so that it has a positive value and add small test
* Fix threshold for itml
* FEAT: Add calibrate_threshold and tests
* MAINT: remove starred syntax for compatibility with older versions of python
* Remove debugging prints and make tests for ITML pass, while waiting for #175 to be solved
* FIX: from __future__ import division to pass tests for python 2.7
* Add some documentation for calibration
* DOC: fix style
* Address most comments from aurelien's reviews
* Remove classes_ attribute and test for CalibratedClassifierCV
* Rename make_args_inc_quadruplets into remove_y_quadruplets
* TST: Fix remaining threshold into min_rate
* Remove default_threshold and put calibrate_threshold instead
* Use calibrate_threshold for ITML, and remove description
* ENH: use calibrate_threshold by default and display its parameters from the fit method
* Add a small test to test automatic calibration
* Update documentation of the default threshold
* Inverse sense for threshold comparison to be more intuitive
* Address remaining review comments
* MAINT: Rename threshold_params into calibration_params
* TST: Add test for extreme cases
* MAINT: rename threshold_params into calibration_params
* MAINT: rename threshold_params into calibration_params
* FIX: Make tests work, and add the right threshold (mean between lowest accepted value and highest rejected value), and max + 1 or min - 1 for extreme points
* Go back to previous version of finding the threshold
* Extract method for validating calibration parameters
* Validate calibration params before fit
* Address #168 (comment)
scikit-learn-contrib · Apr 15, 2019 · edad55d
1 parent b28933c
commit edad55d
Showing 11 changed files with 1,066 additions and 148 deletions.
diff --git a/doc/weakly_supervised.rst b/doc/weakly_supervised.rst
@@ -148,8 +148,47 @@ tuples you're working with (pairs, triplets...). See the docstring of the
 `score` method of the estimator you use.
 
 
+Learning on pairs
+=================
+
+Some metric learning algorithms learn from pairs of samples. In this case, you
+should provide the algorithm with ``n_samples`` pairs of points, together with
+a target vector of ``n_samples`` values, each either +1 or -1. A value of +1
+indicates that the corresponding pair is similar, and -1 that it is
+dissimilar.
+
+
+.. _calibration:
+
+Thresholding
+------------
+To predict whether a new pair represents similar or dissimilar samples, we
+need to set a distance threshold, so that points closer (in the learned space)
+than this threshold are predicted as similar, and points further away are
+predicted as dissimilar. There are several ways to set this threshold.
+
+- **At fit time**: The threshold is set with `calibrate_threshold` (see
+ below) on the training set. You can specify the calibration parameters
+ directly in the `fit` method with the `calibration_params` parameter (see the
+ documentation of the `fit` method of any metric learner that learns on pairs
+ of points for more information). Because the threshold is then tuned on the
+ same data used for fitting, it can overfit slightly. To avoid this,
+ calibrate the threshold after fitting, on a validation set.
+
+- **Manual**: calling `set_threshold` will set the threshold to a
+ particular value.
+
+- **Calibration**: calling `calibrate_threshold` will calibrate the
+ threshold to achieve a particular score on a validation set, the score
+ being among the classical scores for classification (accuracy, f1 score...).
+
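To make the thresholding idea concrete, here is a minimal pure-Python sketch. It is not metric-learn's implementation: the helper names `euclidean`, `predict_pairs` and `calibrate_accuracy_threshold` are hypothetical, plain Euclidean distance stands in for a learned metric, and the candidate thresholds (midpoints between consecutive sorted distances, plus one value below the smallest and one above the largest) are an assumption chosen for simplicity.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; stands in for a learned metric here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_pairs(pairs, threshold, distance=euclidean):
    """Predict +1 (similar) for pairs at or below the threshold, else -1."""
    return [1 if distance(a, b) <= threshold else -1 for a, b in pairs]


def calibrate_accuracy_threshold(pairs, y, distance=euclidean):
    """Return the candidate threshold with the best accuracy on (pairs, y).

    Hypothetical helper: candidates are the midpoints between consecutive
    sorted distances, plus one value below and one above all distances.
    """
    dists = sorted(set(distance(a, b) for a, b in pairs))
    candidates = [dists[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(dists, dists[1:])]
    candidates.append(dists[-1] + 1.0)

    def accuracy(candidate):
        preds = predict_pairs(pairs, candidate, distance)
        return sum(p == label for p, label in zip(preds, y)) / len(y)

    return max(candidates, key=accuracy)


# Validation pairs with labels: +1 for similar pairs, -1 for dissimilar ones.
valid_pairs = [([0, 0], [0, 1]), ([0, 0], [0, 2]),
               ([0, 0], [3, 4]), ([1, 1], [7, 9])]
valid_y = [1, 1, -1, -1]

threshold = calibrate_accuracy_threshold(valid_pairs, valid_y)
new_predictions = predict_pairs([([0, 0], [1, 1]), ([0, 0], [4, 4])], threshold)
```

Calibrating on held-out pairs, as in this sketch, mirrors the advice above: a threshold tuned on the data used for fitting may look better than it generalizes.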
+
+See also: `sklearn.calibration`.
+
+
 Algorithms
-==================
+==========
 
 ITML
 ----
@@ -192,39 +231,6 @@ programming.
  .. [2] Adapted from Matlab code at http://www.cs.utexas.edu/users/pjain/
  itml/
 
-
-LSML
-----
-
-`LSML`: Metric Learning from Relative Comparisons by Minimizing Squared
-Residual
-
-.. topic:: Example Code:
-
-::
-
- from metric_learn import LSML
-
- quadruplets = [[[1.2, 7.5], [1.3, 1.5], [6.4, 2.6], [6.2, 9.7]],
- [[1.3, 4.5], [3.2, 4.6], [6.2, 5.5], [5.4, 5.4]],
- [[3.2, 7.5], [3.3, 1.5], [8.4, 2.6], [8.2, 9.7]],
- [[3.3, 4.5], [5.2, 4.6], [8.2, 5.5], [7.4, 5.4]]]
-
- # we want to make closer points where the first feature is close, and
- # further if the second feature is close
-
- lsml = LSML()
- lsml.fit(quadruplets)
-
-.. topic:: References:
-
- .. [1] Liu et al.
- "Metric Learning from Relative Comparisons by Minimizing Squared
- Residual". ICDM 2012. http://www.cs.ucla.edu/~weiwang/paper/ICDM12.pdf
-
- .. [2] Adapted from https://gist.github.com/kcarnold/5439917
-
-
 SDML
 ----
 
@@ -343,3 +349,46 @@ method. However, it is one of the earliest and a still often cited technique.
  -with-side-information.pdf>`_ Xing, Jordan, Russell, Ng.
  .. [2] Adapted from Matlab code `here <http://www.cs.cmu
  .edu/%7Eepxing/papers/Old_papers/code_Metric_online.tar.gz>`_.
+
+Learning on quadruplets
+=======================
+
+A type of information even weaker than pairs is information about relative
+comparisons between pairs. The user should provide the algorithm with
+quadruplets of points, where in each quadruplet the first two points are
+closer to each other than the last two points are. No target vector (``y``)
+is needed, since the supervision is already encoded in the order of the
+points within each quadruplet.
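As an illustration of what a quadruplet constraint expresses, the following pure-Python sketch checks, for each quadruplet, whether the first two points are closer than the last two. The helpers `euclidean` and `quadruplet_satisfied` are hypothetical names introduced for this example and are not part of metric-learn; plain Euclidean distance is used as an assumed baseline before any metric is learned.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance, used as an unlearned baseline."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quadruplet_satisfied(quadruplet, distance=euclidean):
    """True if the first two points are closer than the last two."""
    a, b, c, d = quadruplet
    return distance(a, b) < distance(c, d)


# Same toy quadruplets as in the LSML example; note that no y is involved.
quadruplets = [[[1.2, 7.5], [1.3, 1.5], [6.4, 2.6], [6.2, 9.7]],
               [[1.3, 4.5], [3.2, 4.6], [6.2, 5.5], [5.4, 5.4]],
               [[3.2, 7.5], [3.3, 1.5], [8.4, 2.6], [8.2, 9.7]],
               [[3.3, 4.5], [5.2, 4.6], [8.2, 5.5], [7.4, 5.4]]]

satisfied = [quadruplet_satisfied(q) for q in quadruplets]
n_satisfied = sum(satisfied)
```

Under the plain Euclidean distance only some of these constraints hold, which is the gap a quadruplet-based learner aims to close by learning a different metric.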
+
+Algorithms
+==========
+
+LSML
+----
+
+`LSML`: Metric Learning from Relative Comparisons by Minimizing Squared
+Residual
+
+.. topic:: Example Code:
+
+::
+
+ from metric_learn import LSML
+
+ quadruplets = [[[1.2, 7.5], [1.3, 1.5], [6.4, 2.6], [6.2, 9.7]],
+ [[1.3, 4.5], [3.2, 4.6], [6.2, 5.5], [5.4, 5.4]],
+ [[3.2, 7.5], [3.3, 1.5], [8.4, 2.6], [8.2, 9.7]],
+ [[3.3, 4.5], [5.2, 4.6], [8.2, 5.5], [7.4, 5.4]]]
+
+ # we want points to be closer when their first feature is close, and
+ # farther apart when their second feature is close
+
+ lsml = LSML()
+ lsml.fit(quadruplets)
+
+.. topic:: References:
+
+ .. [1] Liu et al.
+ "Metric Learning from Relative Comparisons by Minimizing Squared
+ Residual". ICDM 2012. http://www.cs.ucla.edu/~weiwang/paper/ICDM12.pdf
+
+ .. [2] Adapted from https://gist.github.com/kcarnold/5439917