.. _decompositions:

=================================================================
Decomposing signals in components (matrix factorization problems)
=================================================================

.. currentmodule:: sklearn.decomposition


.. _PCA:


Principal component analysis (PCA)
==================================

Exact PCA and probabilistic interpretation
------------------------------------------

PCA is used to decompose a multivariate dataset into a set of successive
orthogonal components that explain a maximum amount of the variance. In
scikit-learn, :class:`PCA` is implemented as a *transformer* object
that learns :math:`n` components in its ``fit`` method, and can be used on new
data to project it on these components.

PCA centers but does not scale the input data for each feature before
applying the SVD. The optional parameter ``whiten=True`` makes it
possible to project the data onto the singular space while scaling each
component to unit variance. This is often useful if the downstream models make
strong assumptions on the isotropy of the signal: this is for example the case
for Support Vector Machines with the RBF kernel and the K-Means clustering
algorithm.
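
For illustration, a minimal sketch of fitting :class:`PCA` on the iris data
used below (``whiten=True`` would only be added when a downstream model needs
isotropic features)::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.decomposition import PCA
    >>> X, _ = load_iris(return_X_y=True)
    >>> pca = PCA(n_components=2).fit(X)          # learn the 2 main components
    >>> X_projected = pca.transform(X)            # project new or training data
    >>> X_projected.shape
    (150, 2)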

Below is an example of the iris dataset, which is comprised of 4
features, projected on the 2 dimensions that explain most variance:

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_pca_vs_lda_001.png
   :target: ../auto_examples/decomposition/plot_pca_vs_lda.html
   :align: center
   :scale: 75%


The :class:`PCA` object also provides a
probabilistic interpretation of the PCA that can give a likelihood of
data based on the amount of variance it explains. As such it implements a
:term:`score` method that can be used in cross-validation:

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_001.png
   :target: ../auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
   :align: center
   :scale: 75%
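
As a minimal sketch (with hypothetical toy data), this :term:`score` method can
be plugged into :func:`~sklearn.model_selection.cross_val_score` to compare
candidate values of ``n_components``::

    >>> import numpy as np
    >>> from sklearn.decomposition import PCA
    >>> from sklearn.model_selection import cross_val_score
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(100, 10)                              # toy data for illustration
    >>> scores = cross_val_score(PCA(n_components=3), X)    # mean log-likelihood per fold
    >>> scores.shape
    (5,)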

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_pca_iris.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_pca_vs_lda.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_pca_vs_fa_model_selection.py`


.. _IncrementalPCA:

Incremental PCA
---------------

The :class:`PCA` object is very useful, but has certain limitations for
large datasets. The biggest limitation is that :class:`PCA` only supports
batch processing, which means all of the data to be processed must fit in main
memory. The :class:`IncrementalPCA` object uses a different form of
processing and allows for partial computations which almost
exactly match the results of :class:`PCA` while processing the data in a
minibatch fashion. :class:`IncrementalPCA` makes it possible to implement
out-of-core Principal Component Analysis either by:

* Using its ``partial_fit`` method on chunks of data fetched sequentially
  from the local hard drive or a network database (see the sketch below).

* Calling its fit method on a memory mapped file using
  ``numpy.memmap``.
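
A minimal sketch of the first option, with synthetic data standing in for
chunks read from disk::

    >>> import numpy as np
    >>> from sklearn.decomposition import IncrementalPCA
    >>> rng = np.random.RandomState(0)
    >>> ipca = IncrementalPCA(n_components=2)
    >>> for chunk in np.array_split(rng.randn(400, 20), 10):   # 10 mini-batches
    ...     ipca = ipca.partial_fit(chunk)
    >>> X_reduced = ipca.transform(rng.randn(5, 20))
    >>> X_reduced.shape
    (5, 2)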

:class:`IncrementalPCA` only stores estimates of component and noise variances,
in order to update ``explained_variance_ratio_`` incrementally. This is why
memory usage depends on the number of samples per batch, rather than the
number of samples to be processed in the dataset.

As in :class:`PCA`, :class:`IncrementalPCA` centers but does not scale the
input data for each feature before applying the SVD.

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_incremental_pca_001.png
   :target: ../auto_examples/decomposition/plot_incremental_pca.html
   :align: center
   :scale: 75%

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_incremental_pca_002.png
   :target: ../auto_examples/decomposition/plot_incremental_pca.html
   :align: center
   :scale: 75%


.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_incremental_pca.py`


.. _RandomizedPCA:

PCA using randomized SVD
------------------------

It is often interesting to project data to a lower-dimensional
space that preserves most of the variance, by dropping the singular vectors
associated with lower singular values.

For instance, if we work with 64x64 pixel gray-level pictures
for face recognition,
the dimensionality of the data is 4096 and it is slow to train an
RBF support vector machine on such wide data. Furthermore we know that
the intrinsic dimensionality of the data is much lower than 4096 since all
pictures of human faces look somewhat alike.
The samples lie on a manifold of much lower
dimension (say around 200 for instance). The PCA algorithm can be used
to linearly transform the data while both reducing the dimensionality
and preserving most of the explained variance at the same time.

The class :class:`PCA` used with the optional parameter
``svd_solver='randomized'`` is very useful in that case: since we are going
to drop most of the singular vectors it is much more efficient to limit the
computation to an approximated estimate of the singular vectors we will keep
to actually perform the transform.

For instance, the following shows 16 sample portraits (centered around
0.0) from the Olivetti dataset. On the right hand side are the first 16
singular vectors reshaped as portraits. Since we only require the top
16 singular vectors of a dataset with size :math:`n_{samples} = 400`
and :math:`n_{features} = 64 \times 64 = 4096`, the computation time is
less than 1s:

.. |orig_img| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_001.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. |pca_img| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_002.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |orig_img| |pca_img|

If we note :math:`n_{\max} = \max(n_{\mathrm{samples}}, n_{\mathrm{features}})` and
:math:`n_{\min} = \min(n_{\mathrm{samples}}, n_{\mathrm{features}})`, the time complexity
of the randomized :class:`PCA` is :math:`O(n_{\max}^2 \cdot n_{\mathrm{components}})`
instead of :math:`O(n_{\max}^2 \cdot n_{\min})` for the exact method
implemented in :class:`PCA`.

The memory footprint of randomized :class:`PCA` is also proportional to
:math:`2 \cdot n_{\max} \cdot n_{\mathrm{components}}` instead of :math:`n_{\max}
\cdot n_{\min}` for the exact method.

Note: the implementation of ``inverse_transform`` in :class:`PCA` with
``svd_solver='randomized'`` is not the exact inverse transform of
``transform`` even when ``whiten=False`` (default).
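
For illustration, a minimal sketch on synthetic data standing in for the
flattened face images::

    >>> import numpy as np
    >>> from sklearn.decomposition import PCA
    >>> rng = np.random.RandomState(0)
    >>> X = rng.rand(400, 4096)           # stand-in for 400 flattened 64x64 faces
    >>> pca = PCA(n_components=16, svd_solver='randomized', random_state=0)
    >>> components = pca.fit(X).components_
    >>> components.shape
    (16, 4096)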

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_applications_plot_face_recognition.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_faces_decomposition.py`

.. rubric:: References

* Algorithm 4.3 in
  :arxiv:`"Finding structure with randomness: Stochastic algorithms for
  constructing approximate matrix decompositions" <0909.4061>`
  Halko, et al., 2009

* :arxiv:`"An implementation of a randomized algorithm for principal component
  analysis" <1412.3510>` A. Szlam et al. 2014

.. _SparsePCA:

Sparse principal components analysis (SparsePCA and MiniBatchSparsePCA)
-----------------------------------------------------------------------

:class:`SparsePCA` is a variant of PCA, with the goal of extracting the
set of sparse components that best reconstruct the data.

Mini-batch sparse PCA (:class:`MiniBatchSparsePCA`) is a variant of
:class:`SparsePCA` that is faster but less accurate. The increased speed is
reached by iterating over small chunks of the set of features, for a given
number of iterations.


Principal component analysis (:class:`PCA`) has the disadvantage that the
components extracted by this method have exclusively dense expressions, i.e.
they have non-zero coefficients when expressed as linear combinations of the
original variables. This can make interpretation difficult. In many cases,
the real underlying components can be more naturally imagined as sparse
vectors; for example in face recognition, components might naturally map to
parts of faces.

Sparse principal components yield a more parsimonious, interpretable
representation, clearly emphasizing which of the original features contribute
to the differences between samples.

The following example illustrates 16 components extracted using sparse PCA from
the Olivetti faces dataset. It can be seen how the regularization term induces
many zeros. Furthermore, the natural structure of the data causes the non-zero
coefficients to be vertically adjacent. The model does not enforce this
mathematically: each component is a vector :math:`h \in \mathbf{R}^{4096}`, and
there is no notion of vertical adjacency except during the human-friendly
visualization as 64x64 pixel images. The fact that the components shown below
appear local is the effect of the inherent structure of the data, which makes
such local patterns minimize reconstruction error. There exist sparsity-inducing
norms that take into account adjacency and different kinds of structure; see
[Jen09]_ for a review of such methods.
For more details on how to use Sparse PCA, see the Examples section, below.


.. |spca_img| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_005.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |pca_img| |spca_img|

Note that there are many different formulations for the Sparse PCA
problem. The one implemented here is based on [Mrl09]_ . The optimization
problem solved is a PCA problem (dictionary learning) with an
:math:`\ell_1` penalty on the components:

.. math::
   (U^*, V^*) = \underset{U, V}{\operatorname{arg\,min\,}} & \frac{1}{2}
                ||X-UV||_{\text{Fro}}^2+\alpha||V||_{1,1} \\
                \text{subject to } & ||U_k||_2 \leq 1 \text{ for all }
                0 \leq k < n_{components}

:math:`||.||_{\text{Fro}}` stands for the Frobenius norm and :math:`||.||_{1,1}`
stands for the entry-wise matrix norm which is the sum of the absolute values
of all the entries in the matrix.
The sparsity-inducing :math:`||.||_{1,1}` matrix norm also prevents learning
components from noise when few training samples are available. The degree
of penalization (and thus sparsity) can be adjusted through the
hyperparameter ``alpha``. Small values lead to a gently regularized
factorization, while larger values shrink many coefficients to zero.
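
For illustration, a minimal sketch on synthetic data; increasing ``alpha``
drives more loadings to exactly zero::

    >>> import numpy as np
    >>> from sklearn.decomposition import SparsePCA
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(100, 30)
    >>> spca = SparsePCA(n_components=5, alpha=2, random_state=0).fit(X)
    >>> zero_fraction = np.mean(spca.components_ == 0)   # fraction of zero loadings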

.. note::

    While in the spirit of an online algorithm, the class
    :class:`MiniBatchSparsePCA` does not implement ``partial_fit`` because
    the algorithm is online along the features direction, not the samples
    direction.

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_faces_decomposition.py`

.. rubric:: References

.. [Mrl09] `"Online Dictionary Learning for Sparse Coding"
   <https://www.di.ens.fr/sierra/pdfs/icml09.pdf>`_
   J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009

.. [Jen09] `"Structured Sparse Principal Component Analysis"
   <https://www.di.ens.fr/~fbach/sspca_AISTATS2010.pdf>`_
   R. Jenatton, G. Obozinski, F. Bach, 2009


.. _kernel_PCA:

Kernel Principal Component Analysis (kPCA)
==========================================

Exact Kernel PCA
----------------

:class:`KernelPCA` is an extension of PCA which achieves non-linear
dimensionality reduction through the use of kernels (see :ref:`metrics`) [Scholkopf1997]_. It
has many applications including denoising, compression and structured
prediction (kernel dependency estimation). :class:`KernelPCA` supports both
``transform`` and ``inverse_transform``.
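
For illustration, a minimal sketch on two concentric circles, a classic case
where linear PCA cannot unfold the structure but an RBF kernel can::

    >>> from sklearn.datasets import make_circles
    >>> from sklearn.decomposition import KernelPCA
    >>> X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
    >>> kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10,
    ...                  fit_inverse_transform=True, alpha=0.1)
    >>> X_kpca = kpca.fit_transform(X)
    >>> X_back = kpca.inverse_transform(X_kpca)   # approximate reconstruction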

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_kernel_pca_002.png
   :target: ../auto_examples/decomposition/plot_kernel_pca.html
   :align: center
   :scale: 75%

.. note::
    :meth:`KernelPCA.inverse_transform` relies on a kernel ridge to learn the
    function mapping samples from the PCA basis into the original feature
    space [Bakir2003]_. Thus, the reconstruction obtained with
    :meth:`KernelPCA.inverse_transform` is an approximation. See the example
    linked below for more details.

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_kernel_pca.py`
* :ref:`sphx_glr_auto_examples_applications_plot_digits_denoising.py`

.. rubric:: References

.. [Scholkopf1997] Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller.
   `"Kernel principal component analysis."
   <https://people.eecs.berkeley.edu/~wainwrig/stat241b/scholkopf_kernel.pdf>`_
   International conference on artificial neural networks.
   Springer, Berlin, Heidelberg, 1997.

.. [Bakir2003] Bakır, Gökhan H., Jason Weston, and Bernhard Schölkopf.
   `"Learning to find pre-images."
   <https://papers.nips.cc/paper/2003/file/ac1ad983e08ad3304a97e147f522747e-Paper.pdf>`_
   Advances in neural information processing systems 16 (2003): 449-456.


.. _kPCA_Solvers:

Choice of solver for Kernel PCA
-------------------------------

While in :class:`PCA` the number of components is bounded by the number of
features, in :class:`KernelPCA` the number of components is bounded by the
number of samples. Many real-world datasets have a large number of samples! In
these cases finding *all* the components with a full kPCA is a waste of
computation time, as data is mostly described by the first few components
(e.g. ``n_components<=100``). In other words, the centered Gram matrix that
is eigendecomposed in the Kernel PCA fitting process has an effective rank that
is much smaller than its size. This is a situation where approximate
eigensolvers can provide speedup with very low precision loss.


.. dropdown:: Eigensolvers

  The optional parameter ``eigen_solver='randomized'`` can be used to
  *significantly* reduce the computation time when the number of requested
  ``n_components`` is small compared with the number of samples. It relies on
  randomized decomposition methods to find an approximate solution in a shorter
  time.

  The time complexity of the randomized :class:`KernelPCA` is
  :math:`O(n_{\mathrm{samples}}^2 \cdot n_{\mathrm{components}})`
  instead of :math:`O(n_{\mathrm{samples}}^3)` for the exact method
  implemented with ``eigen_solver='dense'``.

  The memory footprint of randomized :class:`KernelPCA` is also proportional to
  :math:`2 \cdot n_{\mathrm{samples}} \cdot n_{\mathrm{components}}` instead of
  :math:`n_{\mathrm{samples}}^2` for the exact method.

  Note: this technique is the same as in :ref:`RandomizedPCA`.

  In addition to the above two solvers, ``eigen_solver='arpack'`` can be used as
  an alternate way to get an approximate decomposition. In practice, this method
  only provides reasonable execution times when the number of components to find
  is extremely small. It is enabled by default when the desired number of
  components is less than 10 (strict) and the number of samples is more than 200
  (strict). See :class:`KernelPCA` for details.

  .. rubric:: References

  * *dense* solver:
    `scipy.linalg.eigh documentation
    <https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.eigh.html>`_

  * *randomized* solver:

    * Algorithm 4.3 in
      :arxiv:`"Finding structure with randomness: Stochastic
      algorithms for constructing approximate matrix decompositions" <0909.4061>`
      Halko, et al. (2009)

    * :arxiv:`"An implementation of a randomized algorithm
      for principal component analysis" <1412.3510>`
      A. Szlam et al. (2014)

  * *arpack* solver:
    `scipy.sparse.linalg.eigsh documentation
    <https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.eigsh.html>`_
    R. B. Lehoucq, D. C. Sorensen, and C. Yang, (1998)


.. _LSA:

Truncated singular value decomposition and latent semantic analysis
===================================================================

:class:`TruncatedSVD` implements a variant of singular value decomposition
(SVD) that only computes the :math:`k` largest singular values,
where :math:`k` is a user-specified parameter.

:class:`TruncatedSVD` is very similar to :class:`PCA`, but differs
in that the matrix :math:`X` does not need to be centered.
When the columnwise (per-feature) means of :math:`X`
are subtracted from the feature values,
truncated SVD on the resulting matrix is equivalent to PCA.

.. dropdown:: About truncated SVD and latent semantic analysis (LSA)

  When truncated SVD is applied to term-document matrices
  (as returned by :class:`~sklearn.feature_extraction.text.CountVectorizer` or
  :class:`~sklearn.feature_extraction.text.TfidfVectorizer`),
  this transformation is known as
  `latent semantic analysis <https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
  (LSA), because it transforms such matrices
  to a "semantic" space of low dimensionality.
  In particular, LSA is known to combat the effects of synonymy (several words
  sharing one meaning) and polysemy (one word carrying several meanings),
  which cause term-document matrices to be overly sparse
  and exhibit poor similarity under measures such as cosine similarity.

  .. note::
      LSA is also known as latent semantic indexing, LSI,
      though strictly that refers to its use in persistent indexes
      for information retrieval purposes.

  Mathematically, truncated SVD applied to training samples :math:`X`
  produces a low-rank approximation of :math:`X`:

  .. math::
     X \approx X_k = U_k \Sigma_k V_k^\top

  After this operation, :math:`U_k \Sigma_k`
  is the transformed training set with :math:`k` features
  (called ``n_components`` in the API).

  To also transform a test set :math:`X`, we multiply it with :math:`V_k`:

  .. math::
     X' = X V_k

  .. note::
      Most treatments of LSA in the natural language processing (NLP)
      and information retrieval (IR) literature
      swap the axes of the matrix :math:`X` so that it has shape
      ``(n_features, n_samples)``.
      We present LSA in a different way that matches the scikit-learn API better,
      but the singular values found are the same.

  While the :class:`TruncatedSVD` transformer
  works with any feature matrix,
  using it on tf-idf matrices is recommended over raw frequency counts
  in an LSA/document processing setting.
  In particular, sublinear scaling and inverse document frequency
  should be turned on (``sublinear_tf=True, use_idf=True``)
  to bring the feature values closer to a Gaussian distribution,
  compensating for LSA's erroneous assumptions about textual data.
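
  As a minimal, hypothetical sketch of such an LSA pipeline (the tiny corpus is
  made up for illustration)::

      >>> from sklearn.feature_extraction.text import TfidfVectorizer
      >>> from sklearn.decomposition import TruncatedSVD
      >>> from sklearn.pipeline import make_pipeline
      >>> corpus = ["the cat sat on the mat",
      ...           "the dog sat on the log",
      ...           "cats and dogs are pets"]
      >>> lsa = make_pipeline(TfidfVectorizer(sublinear_tf=True, use_idf=True),
      ...                     TruncatedSVD(n_components=2))
      >>> X_lsa = lsa.fit_transform(corpus)
      >>> X_lsa.shape
      (3, 2)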

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`

.. rubric:: References

* Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008),
  *Introduction to Information Retrieval*, Cambridge University Press,
  chapter 18: `Matrix decompositions & latent semantic indexing
  <https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_


.. _DictionaryLearning:

Dictionary Learning
===================

.. _SparseCoder:

Sparse coding with a precomputed dictionary
-------------------------------------------

The :class:`SparseCoder` object is an estimator that can be used to transform signals
into sparse linear combinations of atoms from a fixed, precomputed dictionary
such as a discrete wavelet basis. This object therefore does not
implement a ``fit`` method. The transformation amounts
to a sparse coding problem: finding a representation of the data as a linear
combination of as few dictionary atoms as possible. All variations of
dictionary learning implement the following transform methods, controllable via
the ``transform_algorithm`` initialization parameter:

* Orthogonal matching pursuit (:ref:`omp`)

* Least-angle regression (:ref:`least_angle_regression`)

* Lasso computed by least-angle regression

* Lasso using coordinate descent (:ref:`lasso`)

* Thresholding

Thresholding is very fast but it does not yield accurate reconstructions.
It has nevertheless been shown to be useful in the literature for classification
tasks. For image reconstruction tasks, orthogonal matching pursuit yields the
most accurate, unbiased reconstruction.
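
For illustration, a minimal sketch with a random dictionary standing in for,
e.g., a wavelet basis (``transform_algorithm='omp'`` selects orthogonal
matching pursuit)::

    >>> import numpy as np
    >>> from sklearn.decomposition import SparseCoder
    >>> rng = np.random.RandomState(0)
    >>> dictionary = rng.randn(15, 64)                 # 15 fixed atoms of dimension 64
    >>> dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
    >>> coder = SparseCoder(dictionary=dictionary, transform_algorithm='omp',
    ...                     transform_n_nonzero_coefs=3)
    >>> codes = coder.transform(rng.randn(5, 64))      # no fit step is needed
    >>> codes.shape
    (5, 15)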

The dictionary learning objects offer, via the ``split_sign`` parameter, the
possibility to separate the positive and negative values in the results of
sparse coding. This is useful when dictionary learning is used for extracting
features that will be used for supervised learning, because it allows the
learning algorithm to assign different weights to the negative loadings of a
particular atom and to the corresponding positive loadings.

The split code for a single sample has length ``2 * n_components``
and is constructed using the following rule: First, the regular code of length
``n_components`` is computed. Then, the first ``n_components`` entries of the
split code are
filled with the positive part of the regular code vector. The second half of
the split code is filled with the negative part of the code vector, only with
a positive sign. Therefore, the split code is non-negative.
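
The rule can be written down directly; a small NumPy sketch with a made-up code
vector::

    >>> import numpy as np
    >>> code = np.array([0.5, -1.2, 0.0, 3.0])            # regular code, n_components = 4
    >>> split_code = np.hstack([np.maximum(code, 0),       # positive part
    ...                         np.maximum(-code, 0)])     # negative part, sign flipped
    >>> split_code.shape                                   # every entry is now >= 0
    (8,)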

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_sparse_coding.py`


Generic dictionary learning
---------------------------

Dictionary learning (:class:`DictionaryLearning`) is a matrix factorization
problem that amounts to finding a (usually overcomplete) dictionary that will
perform well at sparsely encoding the fitted data.

Representing data as sparse combinations of atoms from an overcomplete
dictionary is suggested to be the way the mammalian primary visual cortex works.
Consequently, dictionary learning applied on image patches has been shown to
give good results in image processing tasks such as image completion,
inpainting and denoising, as well as for supervised recognition tasks.

Dictionary learning is an optimization problem solved by alternatively updating
the sparse code, as a solution to multiple Lasso problems, considering the
dictionary fixed, and then updating the dictionary to best fit the sparse code.

.. math::
   (U^*, V^*) = \underset{U, V}{\operatorname{arg\,min\,}} & \frac{1}{2}
                ||X-UV||_{\text{Fro}}^2+\alpha||U||_{1,1} \\
                \text{subject to } & ||V_k||_2 \leq 1 \text{ for all }
                0 \leq k < n_{\mathrm{atoms}}


.. |pca_img2| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_002.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. |dict_img2| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_007.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |pca_img2| |dict_img2|

:math:`||.||_{\text{Fro}}` stands for the Frobenius norm and :math:`||.||_{1,1}`
stands for the entry-wise matrix norm which is the sum of the absolute values
of all the entries in the matrix.
After using such a procedure to fit the dictionary, the transform is simply a
sparse coding step that shares the same implementation with all dictionary
learning objects (see :ref:`SparseCoder`).

It is also possible to constrain the dictionary and/or code to be positive to
match constraints that may be present in the data. Below are the faces with
different positivity constraints applied. Red indicates negative values, blue
indicates positive values, and white represents zeros.


.. |dict_img_pos1| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_010.png
   :target: ../auto_examples/decomposition/plot_image_denoising.html
   :scale: 60%

.. |dict_img_pos2| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_011.png
   :target: ../auto_examples/decomposition/plot_image_denoising.html
   :scale: 60%

.. |dict_img_pos3| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_012.png
   :target: ../auto_examples/decomposition/plot_image_denoising.html
   :scale: 60%

.. |dict_img_pos4| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_013.png
   :target: ../auto_examples/decomposition/plot_image_denoising.html
   :scale: 60%

.. centered:: |dict_img_pos1| |dict_img_pos2|
.. centered:: |dict_img_pos3| |dict_img_pos4|


The following image shows what a dictionary learned from 4x4 pixel image patches
extracted from part of a raccoon face image looks like.


.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_image_denoising_001.png
   :target: ../auto_examples/decomposition/plot_image_denoising.html
   :align: center
   :scale: 50%


.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_image_denoising.py`


.. rubric:: References

* `"Online dictionary learning for sparse coding"
  <https://www.di.ens.fr/sierra/pdfs/icml09.pdf>`_
  J. Mairal, F. Bach, J. Ponce, G. Sapiro, 2009

.. _MiniBatchDictionaryLearning:

Mini-batch dictionary learning
------------------------------

:class:`MiniBatchDictionaryLearning` implements a faster, but less accurate
version of the dictionary learning algorithm that is better suited for large
datasets.

By default, :class:`MiniBatchDictionaryLearning` divides the data into
mini-batches and optimizes in an online manner by cycling over the mini-batches
for the specified number of iterations. However, at the moment it does not
implement a stopping condition.

The estimator also implements ``partial_fit``, which updates the dictionary by
iterating only once over a mini-batch. This can be used for online learning
when the data is not readily available from the start, or when the data
does not fit into memory.
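
For illustration, a minimal sketch on synthetic signals (a real application
would use image patches, as in the examples linked above)::

    >>> import numpy as np
    >>> from sklearn.decomposition import MiniBatchDictionaryLearning
    >>> rng = np.random.RandomState(0)
    >>> X = rng.randn(500, 16)                        # e.g. 500 flattened 4x4 patches
    >>> dico = MiniBatchDictionaryLearning(n_components=20, batch_size=50,
    ...                                    transform_algorithm='omp',
    ...                                    transform_n_nonzero_coefs=3,
    ...                                    random_state=0)
    >>> code = dico.fit(X).transform(X)
    >>> code.shape
    (500, 20)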

.. currentmodule:: sklearn.cluster

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_dict_face_patches_001.png
   :target: ../auto_examples/cluster/plot_dict_face_patches.html
   :scale: 50%
   :align: right

.. topic:: **Clustering for dictionary learning**

  Note that when using dictionary learning to extract a representation
  (e.g. for sparse coding) clustering can be a good proxy to learn the
  dictionary. For instance the :class:`MiniBatchKMeans` estimator is
  computationally efficient and implements on-line learning with a
  ``partial_fit`` method.

  Example: :ref:`sphx_glr_auto_examples_cluster_plot_dict_face_patches.py`

.. currentmodule:: sklearn.decomposition

.. _FA:

Factor Analysis
===============

In unsupervised learning we only have a dataset :math:`X = \{x_1, x_2, \dots, x_n
\}`. How can this dataset be described mathematically? A very simple
`continuous latent variable` model for :math:`X` is

.. math:: x_i = W h_i + \mu + \epsilon

The vector :math:`h_i` is called "latent" because it is unobserved. :math:`\epsilon` is
considered a noise term distributed according to a Gaussian with mean 0 and
covariance :math:`\Psi` (i.e. :math:`\epsilon \sim \mathcal{N}(0, \Psi)`), :math:`\mu` is some
arbitrary offset vector. Such a model is called "generative" as it describes
how :math:`x_i` is generated from :math:`h_i`. If we use all the :math:`x_i`'s as columns to form
a matrix :math:`\mathbf{X}` and all the :math:`h_i`'s as columns of a matrix :math:`\mathbf{H}`
then we can write (with suitably defined :math:`\mathbf{M}` and :math:`\mathbf{E}`):

.. math::
   \mathbf{X} = W \mathbf{H} + \mathbf{M} + \mathbf{E}

In other words, we *decomposed* the matrix :math:`\mathbf{X}`.

If :math:`h_i` is given, the above equation automatically implies the following
probabilistic interpretation:

.. math:: p(x_i|h_i) = \mathcal{N}(Wh_i + \mu, \Psi)

For a complete probabilistic model we also need a prior distribution for the
latent variable :math:`h`. The most straightforward assumption (based on the nice
properties of the Gaussian distribution) is :math:`h \sim \mathcal{N}(0,
\mathbf{I})`. This yields a Gaussian as the marginal distribution of :math:`x`:

.. math:: p(x) = \mathcal{N}(\mu, WW^T + \Psi)

Now, without any further assumptions the idea of having a latent variable :math:`h`
would be superfluous -- :math:`x` can be completely modelled with a mean
and a covariance. We need to impose some more specific structure on one
of these two parameters. A simple additional assumption regards the
structure of the error covariance :math:`\Psi`:

* :math:`\Psi = \sigma^2 \mathbf{I}`: This assumption leads to
  the probabilistic model of :class:`PCA`.

* :math:`\Psi = \mathrm{diag}(\psi_1, \psi_2, \dots, \psi_n)`: This model is called
  :class:`FactorAnalysis`, a classical statistical model. The matrix W is
  sometimes called the "factor loading matrix".

Both models essentially estimate a Gaussian with a low-rank covariance matrix.
Because both models are probabilistic they can be integrated in more complex
models, e.g. Mixture of Factor Analysers. One gets very different models (e.g.
:class:`FastICA`) if non-Gaussian priors on the latent variables are assumed.

Factor analysis *can* produce similar components (the columns of its loading
matrix) to :class:`PCA`. However, one can not make any general statements
about these components (e.g. whether they are orthogonal):

.. |pca_img3| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_002.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. |fa_img3| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_008.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |pca_img3| |fa_img3|

The main advantage of Factor Analysis over :class:`PCA` is that
it can model the variance in every direction of the input space independently
(heteroscedastic noise):

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_009.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :align: center
   :scale: 75%

This allows better model selection than probabilistic PCA in the presence
of heteroscedastic noise:

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_pca_vs_fa_model_selection_002.png
   :target: ../auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
   :align: center
   :scale: 75%

Factor Analysis is often followed by a rotation of the factors (with the
parameter `rotation`), usually to improve interpretability. For example,
Varimax rotation maximizes the sum of the variances of the squared loadings,
i.e., it tends to produce sparser factors, which are influenced by only a few
features each (the "simple structure"). See e.g., the first example below.
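
For illustration, a minimal sketch fitting :class:`FactorAnalysis` with a
varimax rotation on the iris data::

    >>> from sklearn.datasets import load_iris
    >>> from sklearn.decomposition import FactorAnalysis
    >>> X, _ = load_iris(return_X_y=True)
    >>> fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
    >>> loadings = fa.components_          # factor loading matrix, shape (2, 4)
    >>> noise = fa.noise_variance_         # one variance per feature (heteroscedastic)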

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_varimax_fa.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_pca_vs_fa_model_selection.py`


.. _ICA:

Independent component analysis (ICA)
====================================

Independent component analysis separates a multivariate signal into
additive subcomponents that are maximally independent. It is
implemented in scikit-learn using the :class:`Fast ICA <FastICA>`
algorithm. Typically, ICA is not used for reducing dimensionality but
for separating superimposed signals. Since the ICA model does not include
a noise term, for the model to be correct, whitening must be applied.
This can be done internally using the ``whiten`` argument or manually using one
of the PCA variants.
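
For illustration, a minimal blind source separation sketch with two synthetic
sources::

    >>> import numpy as np
    >>> from sklearn.decomposition import FastICA
    >>> t = np.linspace(0, 8, 2000)
    >>> S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]    # two independent sources
    >>> A = np.array([[1.0, 0.5], [0.5, 2.0]])              # mixing matrix
    >>> X = S @ A.T                                         # observed mixed signals
    >>> ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
    >>> S_estimated = ica.fit_transform(X)   # recovered sources (up to order and scale)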

It is classically used to separate mixed signals (a problem known as
*blind source separation*), as in the example below:

.. figure:: ../auto_examples/decomposition/images/sphx_glr_plot_ica_blind_source_separation_001.png
   :target: ../auto_examples/decomposition/plot_ica_blind_source_separation.html
   :align: center
   :scale: 60%


ICA can also be used as yet another non-linear decomposition that finds
components with some sparsity:

.. |pca_img4| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_002.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. |ica_img4| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_004.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |pca_img4| |ica_img4|

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_ica_blind_source_separation.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_ica_vs_pca.py`
* :ref:`sphx_glr_auto_examples_decomposition_plot_faces_decomposition.py`


.. _NMF:

Non-negative matrix factorization (NMF or NNMF)
===============================================

NMF with the Frobenius norm
---------------------------

:class:`NMF` [1]_ is an alternative approach to decomposition that assumes that the
data and the components are non-negative. :class:`NMF` can be plugged in
instead of :class:`PCA` or its variants, in the cases where the data matrix
does not contain negative values. It finds a decomposition of samples
:math:`X` into two matrices :math:`W` and :math:`H` of non-negative elements,
by optimizing the distance :math:`d` between :math:`X` and the matrix product
:math:`WH`. The most widely used distance function is the squared Frobenius
norm, which is an obvious extension of the Euclidean norm to matrices:

.. math::
   d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} ||X - Y||_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - {Y}_{ij})^2

Unlike :class:`PCA`, the representation of a vector is obtained in an additive
fashion, by superimposing the components, without subtracting. Such additive
models are efficient for representing images and text.

It has been observed in [Hoyer, 2004] [2]_ that, when carefully constrained,
:class:`NMF` can produce a parts-based representation of the dataset,
resulting in interpretable models. The following example displays 16
sparse components found by :class:`NMF` from the images in the Olivetti
faces dataset, in comparison with the PCA eigenfaces.

.. |pca_img5| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_002.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. |nmf_img5| image:: ../auto_examples/decomposition/images/sphx_glr_plot_faces_decomposition_003.png
   :target: ../auto_examples/decomposition/plot_faces_decomposition.html
   :scale: 60%

.. centered:: |pca_img5| |nmf_img5|


The `init` parameter determines the initialization method applied, which
has a great impact on the performance of the method. :class:`NMF` implements the
method Nonnegative Double Singular Value Decomposition. NNDSVD [4]_ is based on
two SVD processes, one approximating the data matrix, the other approximating
positive sections of the resulting partial SVD factors utilizing an algebraic
property of unit rank matrices. The basic NNDSVD algorithm is better fit for
sparse factorization. Its variants NNDSVDa (in which all zeros are set equal to
the mean of all elements of the data), and NNDSVDar (in which the zeros are set
to random perturbations less than the mean of the data divided by 100) are
recommended in the dense case.

Note that the Multiplicative Update ('mu') solver cannot update zeros present in
the initialization, so it leads to poorer results when used jointly with the
basic NNDSVD algorithm which introduces a lot of zeros; in this case, NNDSVDa or
NNDSVDar should be preferred.

:class:`NMF` can also be initialized with correctly scaled random non-negative
matrices by setting `init="random"`. An integer seed or a
``RandomState`` can also be passed to `random_state` to control
reproducibility.

In :class:`NMF`, L1 and L2 priors can be added to the loss function in order to
regularize the model. The L2 prior uses the Frobenius norm, while the L1 prior
uses an elementwise L1 norm. As in :class:`~sklearn.linear_model.ElasticNet`,
we control the combination of L1 and L2 with the `l1_ratio` (:math:`\rho`)
parameter, and the intensity of the regularization with the `alpha_W` and
`alpha_H` (:math:`\alpha_W` and :math:`\alpha_H`) parameters. The priors are
scaled by the number of samples (:math:`n\_samples`) for `H` and the number of
features (:math:`n\_features`) for `W`, to keep their impact balanced with
respect to one another and with respect to the data fit term, as independent as
possible of the size of the training set. Then the prior terms are:

.. math::
   (\alpha_W \rho ||W||_1 + \frac{\alpha_W(1-\rho)}{2} ||W||_{\mathrm{Fro}}^2) * n\_features
   + (\alpha_H \rho ||H||_1 + \frac{\alpha_H(1-\rho)}{2} ||H||_{\mathrm{Fro}}^2) * n\_samples

and the regularized objective function is:

.. math::
   d_{\mathrm{Fro}}(X, WH)
   + (\alpha_W \rho ||W||_1 + \frac{\alpha_W(1-\rho)}{2} ||W||_{\mathrm{Fro}}^2) * n\_features
   + (\alpha_H \rho ||H||_1 + \frac{\alpha_H(1-\rho)}{2} ||H||_{\mathrm{Fro}}^2) * n\_samples
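
For illustration, a minimal sketch with mild L1/L2 regularization on both
factors (the values here are arbitrary)::

    >>> import numpy as np
    >>> from sklearn.decomposition import NMF
    >>> rng = np.random.RandomState(0)
    >>> X = np.abs(rng.randn(20, 10))                  # non-negative data
    >>> model = NMF(n_components=4, init='nndsvda', alpha_W=0.01, alpha_H=0.01,
    ...             l1_ratio=0.5, max_iter=500, random_state=0)
    >>> W = model.fit_transform(X)
    >>> H = model.components_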

NMF with a beta-divergence
--------------------------

As described previously, the most widely used distance function is the squared
Frobenius norm, which is an obvious extension of the Euclidean norm to
matrices:

.. math::
   d_{\mathrm{Fro}}(X, Y) = \frac{1}{2} ||X - Y||_{\mathrm{Fro}}^2 = \frac{1}{2} \sum_{i,j} (X_{ij} - {Y}_{ij})^2

Other distance functions can be used in NMF as, for example, the (generalized)
Kullback-Leibler (KL) divergence, also referred to as I-divergence:

.. math::
   d_{KL}(X, Y) = \sum_{i,j} (X_{ij} \log(\frac{X_{ij}}{Y_{ij}}) - X_{ij} + Y_{ij})

Or, the Itakura-Saito (IS) divergence:

.. math::
   d_{IS}(X, Y) = \sum_{i,j} (\frac{X_{ij}}{Y_{ij}} - \log(\frac{X_{ij}}{Y_{ij}}) - 1)

These three distances are special cases of the beta-divergence family, with
:math:`\beta = 2, 1, 0` respectively [6]_. The beta-divergence is
defined by:

.. math::
   d_{\beta}(X, Y) = \sum_{i,j} \frac{1}{\beta(\beta - 1)}(X_{ij}^\beta + (\beta-1)Y_{ij}^\beta - \beta X_{ij} Y_{ij}^{\beta - 1})

.. image:: ../images/beta_divergence.png
   :align: center
   :scale: 75%

Note that this definition is not valid for :math:`\beta \in \{0, 1\}`, yet it can
be continuously extended to the definitions of :math:`d_{IS}` and :math:`d_{KL}`
at :math:`\beta = 0` and :math:`\beta = 1` respectively.

.. dropdown:: NMF implemented solvers

  :class:`NMF` implements two solvers, using Coordinate Descent ('cd') [5]_, and
  Multiplicative Update ('mu') [6]_. The 'mu' solver can optimize every
  beta-divergence, including of course the Frobenius norm (:math:`\beta=2`), the
  (generalized) Kullback-Leibler divergence (:math:`\beta=1`) and the
  Itakura-Saito divergence (:math:`\beta=0`). Note that for
  :math:`\beta \in (1; 2)`, the 'mu' solver is significantly faster than for other
  values of :math:`\beta`. Note also that with a negative (or 0, i.e.
  'itakura-saito') :math:`\beta`, the input matrix cannot contain zero values.

  The 'cd' solver can only optimize the Frobenius norm. Due to the
  underlying non-convexity of NMF, the different solvers may converge to
  different minima, even when optimizing the same distance function.

NMF is best used with the ``fit_transform`` method, which returns the matrix W.
The matrix H is stored into the fitted model in the ``components_`` attribute;
the method ``transform`` will decompose a new matrix X_new based on these
stored components::

    >>> import numpy as np
    >>> X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
    >>> from sklearn.decomposition import NMF
    >>> model = NMF(n_components=2, init='random', random_state=0)
    >>> W = model.fit_transform(X)
    >>> H = model.components_
    >>> X_new = np.array([[1, 0], [1, 6.1], [1, 0], [1, 4], [3.2, 1], [0, 4]])
    >>> W_new = model.transform(X_new)


.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_decomposition_plot_faces_decomposition.py`
* :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py`

.. _MiniBatchNMF:

Mini-batch Non Negative Matrix Factorization
--------------------------------------------

:class:`MiniBatchNMF` [7]_ implements a faster, but less accurate version of the
non-negative matrix factorization (i.e. :class:`~sklearn.decomposition.NMF`),
better suited for large datasets.

By default, :class:`MiniBatchNMF` divides the data into mini-batches and
optimizes the NMF model in an online manner by cycling over the mini-batches
for the specified number of iterations. The ``batch_size`` parameter controls
the size of the batches.

In order to speed up the mini-batch algorithm it is also possible to scale
past batches, giving them less importance than newer batches. This is done by
introducing a so-called forgetting factor controlled by the ``forget_factor``
parameter.

The estimator also implements ``partial_fit``, which updates ``H`` by iterating
only once over a mini-batch. This can be used for online learning when the data
is not readily available from the start, or when the data does not fit into memory.
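
For illustration, a minimal sketch on synthetic non-negative data::

    >>> import numpy as np
    >>> from sklearn.decomposition import MiniBatchNMF
    >>> rng = np.random.RandomState(0)
    >>> X = np.abs(rng.randn(100, 20))
    >>> mbnmf = MiniBatchNMF(n_components=5, batch_size=10, forget_factor=0.7,
    ...                      random_state=0)
    >>> W = mbnmf.fit_transform(X)
    >>> H = mbnmf.components_
    >>> W.shape, H.shape
    ((100, 5), (5, 20))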

.. rubric:: References

.. [1] `"Learning the parts of objects by non-negative matrix factorization"
   <http://www.cs.columbia.edu/~blei/fogm/2020F/readings/LeeSeung1999.pdf>`_
   D. Lee, S. Seung, 1999

.. [2] `"Non-negative Matrix Factorization with Sparseness Constraints"
   <https://www.jmlr.org/papers/volume5/hoyer04a/hoyer04a.pdf>`_
   P. Hoyer, 2004

.. [4] `"SVD based initialization: A head start for nonnegative
   matrix factorization"
   <https://www.boutsidis.org/Boutsidis_PRE_08.pdf>`_
   C. Boutsidis, E. Gallopoulos, 2008

.. [5] `"Fast local algorithms for large scale nonnegative matrix and tensor
   factorizations."
   <https://www.researchgate.net/profile/Anh-Huy-Phan/publication/220241471_Fast_Local_Algorithms_for_Large_Scale_Nonnegative_Matrix_and_Tensor_Factorizations>`_
   A. Cichocki, A. Phan, 2009

.. [6] :arxiv:`"Algorithms for nonnegative matrix factorization with
   the beta-divergence" <1010.1763>`
   C. Fevotte, J. Idier, 2011

.. [7] :arxiv:`"Online algorithms for nonnegative matrix factorization with the
   Itakura-Saito divergence" <1106.4198>`
   A. Lefevre, F. Bach, C. Fevotte, 2011

.. _LatentDirichletAllocation:

Latent Dirichlet Allocation (LDA)
=================================

Latent Dirichlet Allocation is a generative probabilistic model for collections of
discrete datasets such as text corpora. It is also a topic model that is used for
discovering abstract topics from a collection of documents.

The graphical model of LDA is a three-level generative model:

.. image:: ../images/lda_model_graph.png
   :align: center

Note on notations presented in the graphical model above, which can be found in
Hoffman et al. (2013):

* The corpus is a collection of :math:`D` documents.
* A document is a sequence of :math:`N` words.
* There are :math:`K` topics in the corpus.
* The boxes represent repeated sampling.

In the graphical model, each node is a random variable and has a role in the
generative process. A shaded node indicates an observed variable and an unshaded
node indicates a hidden (latent) variable. In this case, words in the corpus are
the only data that we observe. The latent variables determine the random mixture
of topics in the corpus and the distribution of words in the documents.
The goal of LDA is to use the observed words to infer the hidden topic
structure.

.. dropdown:: Details on modeling text corpora

  When modeling text corpora, the model assumes the following generative process
  for a corpus with :math:`D` documents and :math:`K` topics, with :math:`K`
  corresponding to `n_components` in the API:

  1. For each topic :math:`k \in K`, draw :math:`\beta_k \sim
     \mathrm{Dirichlet}(\eta)`. This provides a distribution over the words,
     i.e. the probability of a word appearing in topic :math:`k`.
     :math:`\eta` corresponds to `topic_word_prior`.

  2. For each document :math:`d \in D`, draw the topic proportions
     :math:`\theta_d \sim \mathrm{Dirichlet}(\alpha)`. :math:`\alpha`
     corresponds to `doc_topic_prior`.

  3. For each word :math:`i` in document :math:`d`:

     a. Draw the topic assignment :math:`z_{di} \sim \mathrm{Multinomial}
        (\theta_d)`
     b. Draw the observed word :math:`w_{di} \sim \mathrm{Multinomial}
        (\beta_{z_{di}})`

  For parameter estimation, the posterior distribution is:

  .. math::
     p(z, \theta, \beta |w, \alpha, \eta) =
     \frac{p(z, \theta, \beta|\alpha, \eta)}{p(w|\alpha, \eta)}

  Since the posterior is intractable, the variational Bayesian method
  uses a simpler distribution :math:`q(z,\theta,\beta | \lambda, \phi, \gamma)`
  to approximate it, and those variational parameters :math:`\lambda`,
  :math:`\phi`, :math:`\gamma` are optimized to maximize the Evidence
  Lower Bound (ELBO):

  .. math::
     \log\: P(w | \alpha, \eta) \geq L(w,\phi,\gamma,\lambda) \overset{\triangle}{=}
     E_{q}[\log\:p(w,z,\theta,\beta|\alpha,\eta)] - E_{q}[\log\:q(z, \theta, \beta)]

  Maximizing ELBO is equivalent to minimizing the Kullback-Leibler (KL) divergence
  between :math:`q(z,\theta,\beta)` and the true posterior
  :math:`p(z, \theta, \beta |w, \alpha, \eta)`.


:class:`LatentDirichletAllocation` implements the online variational Bayes
algorithm and supports both online and batch update methods.
While the batch method updates variational variables after each full pass through
the data, the online method updates variational variables from mini-batch data
points.

.. note::

    Although the online method is guaranteed to converge to a local optimum point, the quality of
    the optimum point and the speed of convergence may depend on mini-batch size and
    attributes related to learning rate setting.

When :class:`LatentDirichletAllocation` is applied on a "document-term" matrix, the matrix
will be decomposed into a "topic-term" matrix and a "document-topic" matrix. While
the "topic-term" matrix is stored as `components_` in the model, the "document-topic" matrix
can be calculated from the ``transform`` method.

:class:`LatentDirichletAllocation` also implements a ``partial_fit`` method. This is used
when data can be fetched sequentially.
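
As a minimal, hypothetical sketch on a toy corpus (real applications use far
larger document collections)::

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> from sklearn.decomposition import LatentDirichletAllocation
    >>> corpus = ["apples and oranges are fruit",
    ...           "oranges are full of vitamins",
    ...           "cars and trucks need fuel",
    ...           "trucks carry heavy loads"]
    >>> counts = CountVectorizer().fit_transform(corpus)        # document-term matrix
    >>> lda = LatentDirichletAllocation(n_components=2, learning_method="online",
    ...                                 random_state=0)
    >>> doc_topics = lda.fit_transform(counts)   # document-topic matrix
    >>> topic_terms = lda.components_            # topic-term matrix
    >>> doc_topics.shape
    (4, 2)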

.. rubric:: Examples

* :ref:`sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py`

.. rubric:: References

* `"Latent Dirichlet Allocation"
  <https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf>`_
  D. Blei, A. Ng, M. Jordan, 2003

* `"Online Learning for Latent Dirichlet Allocation"
  <https://papers.nips.cc/paper/3902-online-learning-for-latent-dirichlet-allocation.pdf>`_
  M. Hoffman, D. Blei, F. Bach, 2010

* `"Stochastic Variational Inference"
  <https://www.cs.columbia.edu/~blei/papers/HoffmanBleiWangPaisley2013.pdf>`_
  M. Hoffman, D. Blei, C. Wang, J. Paisley, 2013

* `"The varimax criterion for analytic rotation in factor analysis"
  <https://link.springer.com/article/10.1007%2FBF02289233>`_
  H. F. Kaiser, 1958

See also :ref:`nca_dim_reduction` for dimensionality reduction with
Neighborhood Components Analysis.