.. raw:: html

  <style>
    /* h3 headings on this page are the questions; make them rubric-like */
    h3 {
      font-size: 1rem;
      font-weight: bold;
      padding-bottom: 0.2rem;
      margin: 2rem 0 1.15rem 0;
      border-bottom: 1px solid var(--pst-color-border);
    }

    /* Increase top margin for first question in each section */
    h2 + section > h3 {
      margin-top: 2.5rem;
    }

    /* Make the headerlinks a bit more visible */
    h3 > a.headerlink {
      font-size: 0.9rem;
    }

    /* Remove the backlink decoration on the titles */
    h2 > a.toc-backref,
    h3 > a.toc-backref {
      text-decoration: none;
    }
  </style>

.. _faq:
==========================
Frequently Asked Questions
==========================
.. currentmodule:: sklearn
Here we try to give some answers to questions that regularly pop up on the mailing list.
.. contents:: Table of Contents
:local:
:depth: 2
About the project
-----------------
What is the project name (a lot of people get it wrong)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scikit-learn, but not scikit or SciKit nor sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.
How do you pronounce the project name?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sy-kit learn. sci stands for science!
Why scikit?
^^^^^^^^^^^
There are multiple scikits, which are scientific toolboxes built around SciPy.
Apart from scikit-learn, another popular one is `scikit-image <https://scikit-image.org/>`_.
Do you support PyPy?
^^^^^^^^^^^^^^^^^^^^
scikit-learn is regularly tested and maintained to work with
`PyPy <https://pypy.org/>`_ (an alternative Python implementation with
a built-in just-in-time compiler).
Note however that this support is still considered experimental and specific
components might behave slightly differently. Please refer to the test
suite of the specific module of interest for more details.
How can I obtain permission to use the images in scikit-learn for my work?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The images contained in the `scikit-learn repository
<https://github.com/scikit-learn/scikit-learn>`_ and the images generated within
the `scikit-learn documentation <https://scikit-learn.org/stable/index.html>`_
can be used via the `BSD 3-Clause License
<https://github.com/scikit-learn/scikit-learn?tab=BSD-3-Clause-1-ov-file>`_ for
your work. Citations of scikit-learn are highly encouraged and appreciated. See
:ref:`citing scikit-learn <citing-scikit-learn>`.
Implementation decisions
------------------------
Why is there no support for deep or reinforcement learning? Will there be such support in the future?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fits within
the design constraints of scikit-learn. As a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.
You can find more information about the addition of GPU support at
`Will you add GPU support?`_.
Note that scikit-learn currently implements a simple multilayer perceptron
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <https://www.tensorflow.org/>`_,
`keras <https://keras.io/>`_,
and `pytorch <https://pytorch.org/>`_.
.. _adding_graphical_models:
Will you add graphical models or sequence prediction to scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.
There are two projects with an API similar to scikit-learn that
do structured prediction:
* `pystruct <https://pystruct.github.io/>`_ handles general structured
learning (focuses on SSVMs on arbitrary graph structures with
approximate inference; defines the notion of sample as an instance of
the graph structure).
* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
(focuses on exact inference; has HMMs, but mostly for the sake of
completeness; treats a feature vector as a sample and uses an offset encoding
for the dependencies between feature vectors).
Why did you remove HMMs from scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See :ref:`adding_graphical_models`.
Will you add GPU support?
^^^^^^^^^^^^^^^^^^^^^^^^^
Adding GPU support by default would introduce heavy hardware-specific software
dependencies and existing algorithms would need to be reimplemented. This would
make it both harder for the average user to install scikit-learn and harder for
the developers to maintain the code.
However, since 2023, a limited but growing :ref:`list of scikit-learn
estimators <array_api_supported>` can already run on GPUs if the input data is
provided as a PyTorch or CuPy array and if scikit-learn has been configured to
accept such inputs as explained in :ref:`array_api`. This Array API support
allows scikit-learn to run on GPUs without introducing heavy and
hardware-specific software dependencies to the main package.
Most estimators that rely on NumPy for their computationally intensive operations
can be considered for Array API support and therefore GPU support.
However, not all scikit-learn estimators are amenable to efficiently running
on GPUs via the Array API for fundamental algorithmic reasons. For instance,
tree-based models currently implemented with Cython in scikit-learn are
fundamentally not array-based algorithms. Other algorithms such as k-means or
k-nearest neighbors rely on array-based algorithms but are also implemented in
Cython. Cython is used to manually interleave consecutive array operations to
avoid introducing performance-killing memory access to large intermediate
arrays: this low-level algorithmic rewrite is called "kernel fusion" and cannot
be expressed via the Array API for the foreseeable future.
Adding efficient GPU support to estimators that cannot be efficiently
implemented with the Array API would require designing and adopting a more
flexible extension system for scikit-learn. This possibility is being
considered in the following GitHub issue (under discussion):
- https://github.com/scikit-learn/scikit-learn/issues/22438
Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's ``data.frames`` or :class:`pandas.DataFrame`,
we require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
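As a minimal sketch of such an explicit conversion, the snippet below one-hot
encodes a single string-valued column with
:class:`~preprocessing.OneHotEncoder`. The toy color values are made up for
illustration; other encoders may suit your data better.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (illustrative data, not from the documentation).
X = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

# Unknown categories seen at transform time are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# fit_transform returns a sparse matrix by default; densify it for display.
X_encoded = encoder.fit_transform(X).toarray()

# Categories are sorted, so the columns are: blue, green, red.
categories = encoder.categories_[0].tolist()
print(categories)
print(X_encoded)
```

Each row of ``X_encoded`` contains a single 1 in the column of its category.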
Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The homogeneous NumPy and SciPy data objects currently expected are most
efficient to process for most operations. Extensive work would also be needed
to support Pandas categorical types. Restricting input to homogeneous
types therefore reduces maintenance cost and encourages usage of efficient
data structures.
Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
Therefore :class:`~sklearn.compose.ColumnTransformer` is often used in the first
step of scikit-learn pipelines when dealing
with heterogeneous dataframes (see :ref:`pipeline` for more details).
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
for an example of working with heterogeneous (e.g. categorical and numeric) data.
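The following sketch illustrates the idea on a tiny, made-up dataframe: the
numeric column is scaled, the string column is one-hot encoded, and the result
feeds a classifier. The column names, values and choice of estimators are
assumptions made for this illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Heterogeneous toy data: one numeric and one categorical column.
df = pd.DataFrame(
    {
        "age": [22.0, 35.0, 58.0, 41.0],
        "city": ["Paris", "Lyon", "Paris", "Nice"],
    }
)
y = [0, 1, 1, 0]

# Map each homogeneous subset of columns (selected by name) to a transformer.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# The column transformer is the first step of the pipeline.
model = make_pipeline(preprocessor, LogisticRegression())
model.fit(df, y)
predictions = model.predict(df)

# One scaled numeric column plus three one-hot columns.
n_features_out = preprocessor.transform(df).shape[1]
```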
Do you plan to implement transform for target ``y`` in a pipeline?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Currently transform only works for features ``X`` in a pipeline. There's a
long-standing discussion about not being able to transform ``y`` in a pipeline.
You can follow it in GitHub issue :issue:`4143`. Meanwhile, you can check out
:class:`~compose.TransformedTargetRegressor`,
`pipegraph <https://github.com/mcasl/PipeGraph>`_,
and `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
Note that scikit-learn already handles the case where ``y``
has an invertible transformation applied before training
and inverted after prediction. scikit-learn intends to address
use cases where ``y`` should be transformed at training time
and not at test time, for resampling and similar uses, as in
`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
In general, these use cases can be solved
with a custom meta estimator rather than a :class:`~pipeline.Pipeline`.
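For the invertible case, a minimal sketch with
:class:`~compose.TransformedTargetRegressor` could look as follows. The
synthetic data is constructed so that ``log(y)`` is exactly linear in ``X``;
this is an assumption of the example, not a requirement of the class.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(100, 1))
# Strictly positive target whose logarithm is linear in X.
y = np.exp(1.0 + 2.0 * X[:, 0])

# y is log-transformed before fitting the inner regressor, and predictions
# are mapped back to the original scale with the inverse function.
reg = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
reg.fit(X, y)
y_pred = reg.predict(X)

# The fitted inner regressor is exposed as ``regressor_``.
slope = reg.regressor_.coef_
```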
Why are there so many different estimators for linear models?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of conditional mean as well as conditional
quantiles.
For linear models, there are many estimator classes which are very close to
each other. Let us have a look at
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.Ridge`, L2 penalty
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss="squared_error"`
**Maintainer perspective:**
In principle, they all do the same thing and differ only in the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, this amounts to the use of different
methods and tricks from linear algebra. A special case is
:class:`~linear_model.SGDRegressor`, which
comprises all 4 previous models and differs in its optimization procedure.
A further side effect is that the different estimators favor different data
layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
of the seemingly simple linear models is the reason for having different
estimator classes for different penalties.
**User perspective:**
First, the current design is inspired by the scientific literature where linear
regression models with different regularization/penalty were given different
names, e.g. *ridge regression*. Having different model classes with corresponding
names makes it easier for users to find those regression models.
Secondly, if all 5 of the above-mentioned linear models were unified into a single
class, there would be parameters with a lot of options like the ``solver``
parameter. On top of that, there would be a lot of exclusive interactions
between different parameters. For example, the possible options of the
parameters ``solver``, ``precompute`` and ``selection`` would depend on the
chosen values of the penalty parameters ``alpha`` and ``l1_ratio``.
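To make the effect of the penalties concrete, the sketch below fits four of
these estimators on the same synthetic regression problem and counts the
non-zero coefficients of each. The dataset and the ``alpha`` values are
arbitrary choices for illustration; the exact counts depend on them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# 20 features, of which only 5 actually influence the target.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=1.0, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),  # no penalty
    "Ridge": Ridge(alpha=1.0),  # L2 penalty
    "Lasso": Lasso(alpha=1.0),  # L1 penalty
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),  # L1 + L2 penalty
}

# Count the coefficients that are not exactly zero after fitting.
n_nonzero = {
    name: int(np.count_nonzero(model.fit(X, y).coef_))
    for name, model in models.items()
}
print(n_nonzero)
```

The unpenalized and L2-penalized models keep every coefficient non-zero,
whereas the L1 penalty typically sets some coefficients exactly to zero.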
Contributing
------------
How can I contribute to scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See :ref:`contributing`. Before wanting to add a new algorithm, which is
usually a major and lengthy undertaking, it is recommended to start with
:ref:`known issues <new_contributors>`. Please do not contact the contributors
of scikit-learn directly regarding contributing to scikit-learn.
Why is my pull request not getting any attention?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later changes come at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.
.. _new_algorithms_inclusion_criteria:
What are the inclusion criteria for new algorithms?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations, and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those
which fit well within the current API of scikit-learn, that is a ``fit``,
``predict/transform`` interface and ordinarily having input/output that is a
numpy array or sparse matrix, are accepted.
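As a rough sketch of what fitting the ``fit``/``transform`` interface means,
here is a deliberately tiny transformer that can be dropped into a pipeline.
``MeanCenterer`` is a hypothetical name invented for this example; a real
contribution would also need input validation, tests and documentation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts the per-feature training mean."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned attributes end with an underscore by convention.
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])  # noiseless linear target

# fit_transform is provided by TransformerMixin.
X_centered = MeanCenterer().fit_transform(X)

# Because it follows the API, the transformer composes with existing tools.
model = make_pipeline(MeanCenterer(), LinearRegression()).fit(X, y)
y_pred = model.predict(X)
```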
The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the
proposed algorithm should outperform the methods that are already implemented
in scikit-learn at least in some areas.
Inclusion of a new algorithm speeding up an existing model is easier if:
- it does not introduce new hyper-parameters (as it makes the library
more future-proof),
- it is easy to document clearly when the contribution improves the speed
and when it does not, for instance, "when ``n_features >>
n_samples``",
- benchmarks clearly show a speed up.
Also, note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm
in a scikit-learn compatible way, upload it to GitHub and let us know. We
will be happy to list it under :ref:`related_projects`. If you already have
a package on GitHub following the scikit-learn API, you may also be
interested to look at `scikit-learn-contrib
<https://scikit-learn-contrib.github.io>`_.
.. _selectiveness:
Why are you so selective on what algorithms you include in scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Code comes with maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales non-linearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.
Using scikit-learn
------------------
What's the best way to get help on scikit-learn usage?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* General machine learning questions: use `Cross Validated
<https://stats.stackexchange.com/>`_ with the ``[machine-learning]`` tag.
* scikit-learn usage questions: use `Stack Overflow
<https://stackoverflow.com/questions/tagged/scikit-learn>`_ with the
``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.
Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
:mod:`sklearn.datasets` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.
The problem should be reproducible by simply copy-pasting your code snippet in a Python
shell with scikit-learn installed. Do not forget to include the import statements.
More guidance to write good reproduction code snippets can be found at:
https://stackoverflow.com/help/mcve.
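For illustration, a reproduction snippet in the spirit described above might
look like the following sketch: it includes its imports, generates a small
dataset with a fixed seed, and contains nothing unrelated to the behavior being
reported. The estimator used here is only a placeholder for whatever your
question is about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)  # fixed seed so others get the same data
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)
accuracy = clf.score(X, y)
print(accuracy)
```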
If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.
For bug reports or feature requests, please make use of the
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.
.. warning::
Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.
How should I save, export or deploy estimators for production?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See :ref:`model_persistence`.
How can I create a bunch object?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Bunch objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed by key,
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.
They should not be used as an input. Therefore you almost never need to create
a :class:`~utils.Bunch` object, unless you are extending scikit-learn's API.
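The two access styles can be seen on the object returned by
:func:`~datasets.load_iris`, as in this short sketch. The directly constructed
``bunch`` is shown only to illustrate the behavior.

```python
from sklearn.datasets import load_iris
from sklearn.utils import Bunch

# Dataset loaders return a Bunch by default.
iris = load_iris()

# Key access and attribute access return the very same object.
same_array = iris["data"] is iris.data

# Constructing one directly is rarely needed, but works like a dict.
bunch = Bunch(value_key=42)
print(bunch["value_key"], bunch.value_key)
```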
How can I load my own datasets into a format usable by scikit-learn?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generally, scikit-learn works on any numeric data stored as numpy arrays
or scipy sparse matrices. Other types that are convertible to numeric
arrays such as :class:`pandas.DataFrame` are also acceptable.
For more information on loading your data files into these usable data
structures, please refer to :ref:`loading external datasets <external_datasets>`.
How do I deal with string data (or trees, graphs...)?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.
Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (also known as Levenshtein distance), for instance, DNA or RNA sequences. These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. For instance, to use
:func:`~cluster.dbscan` with Levenshtein distances::

    >>> import numpy as np
    >>> from leven import levenshtein  # doctest: +SKIP
    >>> from sklearn.cluster import dbscan
    >>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
    >>> def lev_metric(x, y):
    ...     i, j = int(x[0]), int(y[0])  # extract indices
    ...     return levenshtein(data[i], data[j])
    ...
    >>> X = np.arange(len(data)).reshape(-1, 1)
    >>> X
    array([[0],
           [1],
           [2]])
    >>> # We need to specify algorithm='brute' as the default assumes
    >>> # a continuous feature space.
    >>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')  # doctest: +SKIP
    (array([0, 1]), array([ 0,  0, -1]))

Note that the example above uses the third-party edit distance package
`leven <https://pypi.org/project/leven/>`_. Similar tricks can be used,
with some care, for tree kernels, graph kernels, etc.
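The first option, a precomputed distance matrix, can be sketched on the same
three sequences as follows. To keep the snippet free of third-party
dependencies, it uses a small pure-Python helper, ``edit_distance``, written
for this example rather than taken from ``leven``; it is far slower than a
compiled implementation.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
n = len(data)

# Fill the symmetric matrix of pairwise distances.
distances = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        distances[i, j] = distances[j, i] = edit_distance(data[i], data[j])

# metric="precomputed" tells the estimator that the input is a distance matrix.
labels = DBSCAN(eps=5, min_samples=2, metric="precomputed").fit_predict(distances)
print(labels)
```

The first two sequences end up in the same cluster and the third is labeled as
noise (``-1``), matching the custom-metric example above.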
Why do I sometimes get a crash/freeze with ``n_jobs > 1`` under OSX or Linux?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Several scikit-learn tools such as :class:`~model_selection.GridSearchCV` and
:func:`~model_selection.cross_val_score` rely internally on Python's
:mod:`multiprocessing` module to parallelize execution
onto several Python processes by passing ``n_jobs > 1`` as an argument.
The problem is that Python :mod:`multiprocessing` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate or vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, NVIDIA's CUDA (and probably many others),
manage their own internal thread pool. Upon a call to `fork`, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
main since 0.2.10) and we contributed a `patch
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
(not yet reviewed).
But in the end the real culprit is Python's :mod:`multiprocessing` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software vendors like Apple refuse to
consider the lack of fork-safety in Accelerate and vecLib as a bug.
In Python 3.4+ it is now possible to configure :mod:`multiprocessing` to
use the ``"forkserver"`` or ``"spawn"`` start methods (instead of the default
``"fork"``) to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
variable to ``"forkserver"``. However the user should be aware that using
the ``"forkserver"`` method prevents :class:`joblib.Parallel` from calling functions
defined interactively in a shell session.
If you have custom code that uses :mod:`multiprocessing` directly instead of using
it via :mod:`joblib` you can enable the ``"forkserver"`` mode globally for your
program. Insert the following instructions in your main script::

    import multiprocessing

    # other imports, custom code, load data, define model...

    if __name__ == "__main__":
        multiprocessing.set_start_method("forkserver")

        # call scikit-learn utils with n_jobs > 1 here

You can find more details on the new start methods in the `multiprocessing
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.
.. _faq_mkl_threading:
Why does my job use more cores than specified with ``n_jobs``?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This is because ``n_jobs`` only controls the number of jobs for
routines that are parallelized with :mod:`joblib`, but parallel code can come
from other sources:
- some routines may be parallelized with OpenMP (for code written in C or
Cython),
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
libraries like MKL, OpenBLAS or BLIS which can provide parallel
implementations.
For more details, please refer to our :ref:`notes on parallelism <parallelism>`.
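One possible way to cap these additional threads from Python is the
``threadpoolctl`` package, which is a dependency of scikit-learn. The sketch
below limits the thread pools it can detect to a single thread for the duration
of a block; which pools are found depends on how NumPy and scikit-learn were
built on your system.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

a = np.random.RandomState(0).normal(size=(200, 200))

# Inside the context manager, the detected BLAS and OpenMP thread pools
# are limited to one thread each.
with threadpool_limits(limits=1):
    threads_inside = [pool["num_threads"] for pool in threadpool_info()]
    product = a @ a

print(threads_inside)
```

Outside the ``with`` block, the previous thread limits are restored.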
How do I set a ``random_state`` for an entire execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Please refer to :ref:`randomness`.