[MRG] Bilinear similarity #329

Open · wants to merge 65 commits into base: master

Changes from 62 commits

Commits (65)
4b7cdec
Remove 3.9 from compatibility
mvargas33 Sep 15, 2021
0147c0c
FIrst draft of bilinear mixin
mvargas33 Sep 17, 2021
ec09f59
Fix score_pairs
mvargas33 Sep 17, 2021
ec49397
Two implementations for score_pairs
mvargas33 Sep 17, 2021
2f3c3e1
Generalized toy tests
mvargas33 Sep 17, 2021
c21d283
Handmade tests incorporated
mvargas33 Sep 17, 2021
dbe2a7a
Fix identation for bilinear
Sep 21, 2021
ee5c5ee
Add performance test to choose between two methods for bilinear calc
Sep 21, 2021
9a10e06
Found an efficient way to compute Bilinear Sim for n pairs
Sep 22, 2021
b1edc46
Update method's descriptions
Sep 22, 2021
ae562e6
Following the correct testing structure
Sep 22, 2021
7ebc026
Fix identation
Sep 22, 2021
1d752f7
Add more tests. Fix 4 to 2 identation
Sep 23, 2021
45c9b97
Minor flake8 fix
Sep 23, 2021
407f910
Commented each test
Sep 23, 2021
80c9085
All tests have been generalized
Sep 23, 2021
90ac550
Fix flake8 identation
Sep 23, 2021
68eeda9
Minor details
Sep 23, 2021
c47797c
Remove 3.9 from compatibility
mvargas33 Sep 15, 2021
e07b11a
First draft of refactoring BaseMetricLearner and Mahalanobis Learner
Oct 1, 2021
8210acd
Avoid warning related to score_pairs deprecation in tests of pair_cal…
Oct 6, 2021
11b5df6
Minor fix
Oct 6, 2021
06b7131
Replaced score_pairs with pair_distance in tests
Oct 6, 2021
d5cb8b4
Replace score_pairs with pair_distance inb docs.
Oct 6, 2021
2f61e7b
Fix weird commit
Oct 8, 2021
9dd38aa
Fix weird commit
Oct 8, 2021
5f68ed2
Update classifiers to use pair_similarity
Oct 8, 2021
3d6450b
Updated rst docs
Oct 8, 2021
7bce493
Fix identation
Oct 8, 2021
7e6584a
Update docs of score_pairs, get_metric
Oct 11, 2021
0b58f45
Add deprecation Test. Fix identation
Oct 11, 2021
d4d3a9c
Merge branch 'master' into score-deprecation
Oct 11, 2021
d27bdf5
Merge branch 'score-deprecation' into feat-bilinear
Oct 11, 2021
78a205c
Refactor to use pair_similarity instead of score_pairs
Oct 11, 2021
dde3576
Add more testing. Test refactor TBD
Oct 13, 2021
3020110
Tests are now parametrized
Oct 13, 2021
2746668
Add bilinear in introduction
Oct 15, 2021
920e504
Minor comment on use case
Oct 15, 2021
7a24319
More changes to sueprvised
Oct 15, 2021
2f8ee76
Changes in weakly Supervised
Oct 15, 2021
60c88a6
Merge remote-tracking branch 'upstream/master' into score-deprecation
Oct 19, 2021
8c55970
Fixed changes requested 1
Oct 19, 2021
787a8d1
Fixed changes requested 2
Oct 19, 2021
e14f956
Add equivalence test, p_dist == p_score
Oct 19, 2021
0941a32
Fix tests and identation.
Oct 19, 2021
b019d85
Fixed changes requested 3
Oct 20, 2021
74df897
Fix identation
Oct 20, 2021
c62a4e7
Last requested changes
Oct 21, 2021
526e4ba
Merge branch 'score-deprecation' into feat-bilinear
Oct 21, 2021
2199724
Replaced pair_similarity for paiir_score
Oct 21, 2021
249e0fe
Last small detail
Oct 21, 2021
80f31ba
Merge branch 'score-deprecation' into feat-bilinear
Oct 21, 2021
8df44a4
Merge remote-tracking branch 'upstream/master' into feat-bilinear
Oct 26, 2021
eef13bb
Classifiers only test classifiers methods now. + Standard doctrings now.
Oct 26, 2021
b952af0
Work in tests. More comments. Some refactors
Oct 26, 2021
7cc0d5e
Learner lists for M and B learners. Separated test by kind. Mock clas…
Oct 27, 2021
5f6bdc2
Moved mocks to test_utils.py, then refactor test_bilinear_mixin.py
Oct 27, 2021
100a05d
Merge branch 'master' into feat-bilinear
Nov 3, 2021
3bf5eae
Resolved observations in interoduction.rst
Nov 8, 2021
acfd54b
Resolved all observations for supervised.rst and weakly_s.rst
Nov 8, 2021
69bd9fe
Spellcheck
Nov 9, 2021
ade34cc
Moved common test to test_base_metric.py . Refactor preprocessor test…
Nov 9, 2021
7cfd432
Fix docs annotations
Nov 18, 2021
0ac9e7a
Second chunks of annotations.
Nov 18, 2021
98340b0
Merge branch 'master' into feat-bilinear
perimosocordiae Jun 21, 2022
56 changes: 42 additions & 14 deletions doc/introduction.rst
@@ -4,17 +4,16 @@
What is Metric Learning?
========================

Many approaches in machine learning require a measure of distance between data
points. Traditionally, practitioners would choose a standard distance metric
Many approaches in machine learning require a measure of distance (or similarity)
between data points. Traditionally, practitioners would choose a standard metric
(Euclidean, City-Block, Cosine, etc.) using a priori knowledge of the
domain. However, it is often difficult to design metrics that are well-suited
to the particular data and task of interest.

Distance metric learning (or simply, metric learning) aims at
automatically constructing task-specific distance metrics from (weakly)
supervised data, in a machine learning manner. The learned distance metric can
then be used to perform various tasks (e.g., k-NN classification, clustering,
information retrieval).
Metric learning aims at automatically constructing task-specific metrics
from (weakly) supervised data, in a machine learning manner.
The learned metric can then be used to perform various tasks (e.g.,
k-NN classification, clustering, information retrieval).

Problem Setting
===============
@@ -25,27 +24,27 @@ of supervision available about the training data:
- :doc:`Supervised learning <supervised>`: the algorithm has access to
a set of data points, each of them belonging to a class (label) as in a
standard classification problem.
Broadly speaking, the goal in this setting is to learn a distance metric
Broadly speaking, the goal in this setting is to learn a metric
that puts points with the same label close together while pushing away
points with different labels.
- :doc:`Weakly supervised learning <weakly_supervised>`: the
algorithm has access to a set of data points with supervision only
at the tuple level (typically pairs, triplets, or quadruplets of
data points). A classic example of such weaker supervision is a set of
positive and negative pairs: in this case, the goal is to learn a distance
positive and negative pairs: in this case, the goal is to learn a
metric that puts positive pairs close together and negative pairs far away.

Based on the above (weakly) supervised data, the metric learning problem is
generally formulated as an optimization problem where one seeks to find the
parameters of a distance function that optimize some objective function
parameters of a function that optimize some objective function
measuring the agreement with the training data.
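
To make the weakly supervised setting concrete, here is a minimal sketch (the
data below are made up, and ``MMC`` is just one arbitrary pair learner from the
package) of fitting a learner on labeled pairs::

    import numpy as np
    from metric_learn import MMC

    # Three positive (+1) and three negative (-1) pairs of 2D points
    pairs = np.array([[[1.2, 3.2], [1.1, 3.0]],
                      [[4.5, 0.1], [4.4, 0.3]],
                      [[0.2, 0.3], [0.1, 0.5]],
                      [[1.2, 3.2], [9.0, 7.5]],
                      [[4.5, 0.1], [8.8, 8.1]],
                      [[0.2, 0.3], [7.6, 9.2]]])
    y_pairs = np.array([1, 1, 1, -1, -1, -1])

    mmc = MMC()
    mmc.fit(pairs, y_pairs)  # learns a metric pulling positive pairs together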

.. _mahalanobis_distances:

Mahalanobis Distances
=====================

In the metric-learn package, all algorithms currently implemented learn
In the metric-learn package, most algorithms currently implemented learn
so-called Mahalanobis distances. Given a real-valued parameter matrix
:math:`L` of shape ``(num_dims, n_features)`` where ``n_features`` is the
number of features describing the data, the Mahalanobis distance associated with
@@ -79,6 +78,35 @@ necessarily the identity of indiscernibles.
parameterizations are equivalent. In practice, an algorithm may thus solve
the metric learning problem with respect to either :math:`M` or :math:`L`.
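
For instance, a small NumPy check of this equivalence (illustrative only, not
metric-learn code) could look like::

    import numpy as np

    rng = np.random.RandomState(42)
    L = rng.randn(2, 3)                      # shape (num_dims, n_features)
    M = L.T @ L                              # the corresponding PSD matrix
    x, x_prime = rng.randn(3), rng.randn(3)

    d_L = np.linalg.norm(L @ x - L @ x_prime)         # distance computed from L
    d_M = np.sqrt((x - x_prime) @ M @ (x - x_prime))  # distance computed from M
    assert np.isclose(d_L, d_M)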

.. _bilinear_similarity:

Bilinear Similarities
=====================

Some algorithms in the package learn bilinear similarity functions. These
similarity functions are not pseudo-distances: they simply output real values
such that the larger the similarity value, the more similar the two examples.
Given a real-valued parameter matrix :math:`W` of shape
``(n_features, n_features)`` where ``n_features`` is the number of features
describing the data, the bilinear similarity associated with :math:`W` is
defined as follows:

.. math:: S_W(x, x') = x^\top W x'

The matrix :math:`W` is not required to be positive semi-definite (PSD) or
even symmetric, so the distance properties (nonnegativity, identity of
indiscernibles, symmetry and triangle inequality) do not hold in general.
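
As a rough NumPy illustration (not part of the package API), the bilinear
similarity of a batch of pairs under some matrix :math:`W` can be computed as::

    import numpy as np

    rng = np.random.RandomState(0)
    n_pairs, n_features = 5, 3
    W = rng.randn(n_features, n_features)   # not necessarily symmetric or PSD
    u = rng.randn(n_pairs, n_features)      # first element of each pair
    v = rng.randn(n_pairs, n_features)      # second element of each pair

    # scores[i] = u[i]^T W v[i], computed for all pairs at once
    scores = np.einsum('ij,jk,ik->i', u, W, v)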

This allows some algorithms to optimize :math:`S_W` in an online manner using a
simple and efficient procedure. Such algorithms can thus be applied to problems
with millions of training instances, and achieve state-of-the-art performance
on an image search task using :math:`k`-NN.

The absence of a PSD constraint can enable the design of more efficient
algorithms. It is also relevant in applications where the underlying notion
of similarity does not satisfy the triangle inequality, as is known to be the
case for visual judgments.

.. _use_cases:

Use-cases
@@ -99,9 +127,9 @@ examples (for code illustrating some of these use-cases, see the
elements of a database that are semantically closest to a query element.
- Dimensionality reduction: metric learning may be seen as a way to reduce the
data dimension in a (weakly) supervised setting.
- More generally, the learned transformation :math:`L` can be used to project
the data into a new embedding space before feeding it into another machine
learning algorithm.
- More generally with Mahalanobis distances, the learned transformation :math:`L`
can be used to project the data into a new embedding space before feeding it
into another machine learning algorithm.
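
For instance, a minimal sketch of this last use case (the dataset and
downstream classifier are arbitrary choices)::

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from metric_learn import NCA

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    nca = NCA(random_state=42).fit(X_train, y_train)

    # Project into the learned embedding space, then feed another estimator
    knn = KNeighborsClassifier().fit(nca.transform(X_train), y_train)
    accuracy = knn.score(nca.transform(X_test), y_test)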

The API of metric-learn is compatible with `scikit-learn
<https://scikit-learn.org/>`_, the leading library for machine
93 changes: 58 additions & 35 deletions doc/supervised.rst
@@ -41,70 +41,93 @@ two numbers.

Fit, transform, and so on
-------------------------
The goal of supervised metric-learning algorithms is to transform
points in a new space, in which the distance between two points from the
same class will be small, and the distance between two points from different
classes will be large. To do so, we fit the metric learner (example:
`NCA`).
The goal of supervised metric learning algorithms is to learn a (distance or
similarity) metric such that two points from the same class will be similar
(e.g., have small distance) and points from different classes will be dissimilar
(e.g., have large distance).

To do so, we first need to fit the supervised metric learner on a labeled dataset,
as in the example below with ``NCA``.

>>> from metric_learn import NCA
>>> nca = NCA(random_state=42)
>>> nca.fit(X, y)
NCA(init='auto', max_iter=100, n_components=None,
preprocessor=None, random_state=42, tol=None, verbose=False)


Now that the estimator is fitted, you can use it on new data for several
purposes.

First, you can transform the data in the learned space, using `transform`:
Here we transform two points in the new embedding space.
We can now use the learned metric to **score** new pairs of points with ``pair_score``
(the larger the score, the more similar the pair). For Mahalanobis learners,
it is equal to the opposite of the distance.

>>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]])
>>> nca.transform(X_new)
array([[ 5.91884732, 10.25406973],
[ 3.1545886 , 6.80350083]])
>>> score = nca.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]])
>>> score
array([-0.49627072, -3.65287282, -6.06079877])

Also, as explained before, our metric learners have learned a distance between
points. You can use this distance in two main ways:
This is useful because ``pair_score`` matches the **score** semantic of
scikit-learn's `Classification metrics
<https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics>`_.
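
For instance, here is a short sketch (the ground-truth pair labels below are
made up for illustration) of feeding these scores to a scikit-learn ranking
metric::

    import numpy as np
    from sklearn.metrics import roc_auc_score

    pairs = [[[3.5, 3.6], [5.6, 2.4]],
             [[1.2, 4.2], [2.1, 6.4]],
             [[3.3, 7.8], [10.9, 0.1]]]
    y_pairs = np.array([1, 1, -1])          # hypothetical similar/dissimilar labels

    scores = nca.pair_score(pairs)          # larger score means more similar
    auc = roc_auc_score(y_pairs, scores)    # positive pairs should get larger scores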

- You can either return the distance between pairs of points using the
`pair_distance` function:
For metric learners that learn a distance metric, there is also the ``pair_distance``
method.

>>> nca.pair_distance([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]])
array([0.49627072, 3.65287282, 6.06079877])

- Or you can return a function that will return the distance (in the new
space) between two 1D arrays (the coordinates of the points in the original
space), similarly to distance functions in `scipy.spatial.distance`.
.. warning::

If you try to use ``pair_distance`` with a bilinear similarity learner, an error
will be thrown, as it does not learn a distance.

You can also get a function that returns the learned metric. It can
compute the metric between two 1D arrays, similarly to distance functions in
`scipy.spatial.distance`. To do that, use the ``get_metric`` method.

>>> metric_fun = nca.get_metric()
>>> metric_fun([3.5, 3.6], [5.6, 2.4])
0.4962707194621285

- Alternatively, you can use `pair_score` to return the **score** between
pairs of points (the larger the score, the more similar the pair).
For Mahalanobis learners, it is equal to the opposite of the distance.
You can also call ``get_metric`` with bilinear similarity learners, and you will get
a function that will return the similarity between 1D arrays.

>>> score = nca.pair_score([[[3.5, 3.6], [5.6, 2.4]], [[1.2, 4.2], [2.1, 6.4]], [[3.3, 7.8], [10.9, 0.1]]])
>>> score
array([-0.49627072, -3.65287282, -6.06079877])
>>> similarity_fun = algorithm.get_metric()
>>> similarity_fun([3.5, 3.6], [5.6, 2.4])
-0.04752

This is useful because `pair_score` matches the **score** semantic of
scikit-learn's `Classification metrics
<https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics>`_.
Finally, as explained in :ref:`mahalanobis_distances`, these are equivalent to the Euclidean
distance in a transformed space, and can thus be used to transform data points
into a new embedding space. You can use ``transform`` to do so.

>>> X_new = np.array([[9.4, 4.1], [2.1, 4.4]])
>>> nca.transform(X_new)
array([[ 5.91884732, 10.25406973],
[ 3.1545886 , 6.80350083]])

.. warning::

If you try to use ``transform`` with a bilinear similarity learner, an error will
be thrown, as such learners cannot be used to transform the data.

.. note::

If the metric learner that you use learns a :ref:`Mahalanobis distance
<mahalanobis_distances>` (like it is the case for all algorithms
currently in metric-learn), you can get the plain learned Mahalanobis
matrix using `get_mahalanobis_matrix`.
<mahalanobis_distances>`, you can get the learned Mahalanobis
matrix :math:`M` using `get_mahalanobis_matrix`.

>>> nca.get_mahalanobis_matrix()
array([[0.43680409, 0.89169412],
[0.89169412, 1.9542479 ]])

If the metric learner that you use learns a :ref:`bilinear similarity
<bilinear_similarity>`, you can get the learned bilinear
matrix :math:`W` using `get_bilinear_matrix`.

>>> algorithm.get_bilinear_matrix()
array([[-0.72680409, -0.153213],
[1.45542269, 7.8135546 ]])
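
If ``pair_score`` follows the bilinear form defined in :ref:`bilinear_similarity`,
its output can be checked against the returned matrix by hand (still using the
hypothetical ``algorithm`` above)::

    import numpy as np

    W = algorithm.get_bilinear_matrix()
    u, v = np.array([3.5, 3.6]), np.array([2.1, 6.4])

    manual_score = u @ W @ v   # u^T W v
    assert np.isclose(manual_score, algorithm.pair_score([[u, v]])[0])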


Scikit-learn compatibility
--------------------------
@@ -116,7 +139,7 @@ All supervised algorithms are scikit-learn estimators
scikit-learn model selection routines
(`sklearn.model_selection.cross_val_score`,
`sklearn.model_selection.GridSearchCV`, etc).
You can also use some of the scoring functions from `sklearn.metrics`.
You can also use some scoring functions from `sklearn.metrics`.
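
For example, a minimal sketch of plugging a metric learner into these routines
(dataset chosen arbitrarily)::

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from metric_learn import NCA

    X, y = load_iris(return_X_y=True)

    # Use the metric learner as one step of a scikit-learn pipeline
    pipe = Pipeline([('nca', NCA(random_state=42)),
                     ('knn', KNeighborsClassifier())])
    scores = cross_val_score(pipe, X, y, cv=3)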

Algorithms
==========
@@ -248,12 +271,12 @@ the sum of probability of being correctly classified:
Local Fisher Discriminant Analysis (:py:class:`LFDA <metric_learn.LFDA>`)

`LFDA` is a linear supervised dimensionality reduction method which effectively combines the ideas of `Linear Discriminant Analysis <https://en.wikipedia.org/wiki/Linear_discriminant_analysis>` and Locality-Preserving Projection. It is
particularly useful when dealing with multi-modality, where one ore more classes
particularly useful when dealing with multi-modality, where one or more classes
consist of separate clusters in input space. The core optimization problem of
LFDA is solved as a generalized eigenvalue problem.


The algorithm define the Fisher local within-/between-class scatter matrix
The algorithm defines the Fisher local within-/between-class scatter matrix
:math:`\mathbf{S}^{(w)}/ \mathbf{S}^{(b)}` in a pairwise fashion:

.. math::
@@ -408,7 +431,7 @@ method will look at all the samples from a different class and sample randomly
a pair among them. The method will try to build `num_constraints` positive
pairs and `num_constraints` negative pairs, but sometimes it cannot find enough
of one of those, so forcing `same_length=True` will return both times the
minimum of the two lenghts.
minimum of the two lengths.

For using quadruplets learners (see :ref:`learning_on_quadruplets`) in a
supervised way, positive and negative pairs are sampled as above and