# Code of Conduct
We are a community based on openness, as well as friendly and didactic discussions.
We aspire to treat everybody equally, and value their contributions.
Decisions are made based on technical merit and consensus.
Code is not the only way to help the project. Reviewing pull requests,
answering questions to help others on mailing lists or issues, organizing and
teaching tutorials, working on the website, improving the documentation, are
all priceless contributions.
We abide by the principles of openness, respect, and consideration of others of
the Python Software Foundation:
@ -0,0 +1,42 @@
Contributing to scikit-learn
The latest contributing guide is available in the repository at
`doc/developers/contributing.rst`, or online at:
There are many ways to contribute to scikit-learn, with the most common ones
being contribution of code or documentation to the project. Improving the
documentation is no less important than improving the library itself. If you
find a typo in the documentation, or have made improvements, do not hesitate to
send an email to the mailing list or preferably submit a GitHub pull request.
Documentation can be found under the
[doc/]( directory.
But there are many other ways to help. In particular, answering queries on the
[issue tracker](,
investigating bugs, and [reviewing other developers' pull
are very valuable contributions that decrease the burden on the project
Another way to contribute is to report issues you're facing, and give a "thumbs
up" on issues that others reported and that are relevant to you. It also helps
us if you spread the word: reference the project from your blog and articles,
link to it from your website, or simply star it on GitHub to say "I use it".
Quick links
* [Submitting a bug report or feature request](
* [Contributing code](
* [Coding guidelines](
* [Tips to read current code](
Code of Conduct
We abide by the principles of openness, respect, and consideration of others
of the Python Software Foundation:
@ -0,0 +1,29 @@
BSD 3-Clause License
Copyright (c) 2007-2024 The scikit-learn developers.
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
@ -0,0 +1,36 @@
include *.rst
include *.build
recursive-include sklearn *.build
recursive-include doc *
recursive-include examples *
recursive-include sklearn *.c *.cpp *.h *.pyx *.pxd *.pxi *.tp
recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt *.arff.gz *.json.gz
include COPYING
include README.rst
include pyproject.toml
include sklearn/externals/README
include sklearn/svm/src/liblinear/COPYRIGHT
include sklearn/svm/src/libsvm/LIBSVM_CHANGES
include Makefile
include .coveragerc
# exclude from sdist
recursive-exclude asv_benchmarks *
recursive-exclude benchmarks *
recursive-exclude build_tools *
recursive-exclude maint_tools *
recursive-exclude .binder *
recursive-exclude .circleci *
exclude .codecov.yml
exclude .git-blame-ignore-revs
exclude .mailmap
exclude .pre-commit-config.yaml
exclude azure-pipelines.yml
@ -0,0 +1,70 @@
# simple makefile to simplify repetitive build env management tasks under posix
# caution: testing won't work on windows, see README
PYTHON ?= python
CYTHON ?= cython
PYTEST ?= pytest
CTAGS ?= ctags
# skip doctests on 32bit python
BITS := $(shell $(PYTHON) -c 'import struct; print(8 * struct.calcsize("P"))')
all: clean inplace test
rm -f tags
clean: clean-ctags
$(PYTHON) clean
rm -rf dist
in: inplace # just a shortcut
$(PYTHON) build_ext -i
pip install --verbose --no-build-isolation --editable . --config-settings editable-verbose=true
pip uninstall -y scikit-learn
test-code: in
$(PYTEST) --showlocals -v sklearn --durations=20
$(PYTEST) --showlocals -v doc/sphinxext/
ifeq ($(BITS),64)
$(PYTEST) $(shell find doc -name '*.rst' | sort)
test-code-parallel: in
$(PYTEST) -n auto --showlocals -v sklearn --durations=20
rm -rf coverage .coverage
$(PYTEST) sklearn --showlocals -v --cov=sklearn --cov-report=html:coverage
rm -rf coverage .coverage .coverage.*
$(PYTEST) sklearn -n auto --showlocals -v --cov=sklearn --cov-report=html:coverage
test: test-code test-sphinxext test-doc
find sklearn -name "*.py" -exec perl -pi -e 's/[ \t]*$$//' {} \;
python build_src
# make tags for symbol based navigation in emacs and vim
# Install with: sudo apt-get install exuberant-ctags
$(CTAGS) --python-kinds=-i -R sklearn
doc: inplace
$(MAKE) -C doc html
doc-noplot: inplace
$(MAKE) -C doc html-noplot
@ -0,0 +1,301 @@
Metadata-Version: 2.1
Name: scikit-learn
Version: 1.5.1
Summary: A set of python modules for machine learning and data mining
Maintainer-Email: scikit-learn developers <>
License: new BSD
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: C
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Development Status :: 5 - Production/Stable
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Project-URL: Homepage,
Project-URL: Source,
Project-URL: Download,
Project-URL: Tracker,
Project-URL: Release notes,
Requires-Python: >=3.9
Requires-Dist: numpy>=1.19.5
Requires-Dist: scipy>=1.6.0
Requires-Dist: joblib>=1.2.0
Requires-Dist: threadpoolctl>=3.1.0
Requires-Dist: numpy>=1.19.5; extra == "build"
Requires-Dist: scipy>=1.6.0; extra == "build"
Requires-Dist: cython>=3.0.10; extra == "build"
Requires-Dist: meson-python>=0.16.0; extra == "build"
Requires-Dist: numpy>=1.19.5; extra == "install"
Requires-Dist: scipy>=1.6.0; extra == "install"
Requires-Dist: joblib>=1.2.0; extra == "install"
Requires-Dist: threadpoolctl>=3.1.0; extra == "install"
Requires-Dist: matplotlib>=3.3.4; extra == "benchmark"
Requires-Dist: pandas>=1.1.5; extra == "benchmark"
Requires-Dist: memory_profiler>=0.57.0; extra == "benchmark"
Requires-Dist: matplotlib>=3.3.4; extra == "docs"
Requires-Dist: scikit-image>=0.17.2; extra == "docs"
Requires-Dist: pandas>=1.1.5; extra == "docs"
Requires-Dist: seaborn>=0.9.0; extra == "docs"
Requires-Dist: memory_profiler>=0.57.0; extra == "docs"
Requires-Dist: sphinx>=7.3.7; extra == "docs"
Requires-Dist: sphinx-copybutton>=0.5.2; extra == "docs"
Requires-Dist: sphinx-gallery>=0.16.0; extra == "docs"
Requires-Dist: numpydoc>=1.2.0; extra == "docs"
Requires-Dist: Pillow>=7.1.2; extra == "docs"
Requires-Dist: pooch>=1.6.0; extra == "docs"
Requires-Dist: sphinx-prompt>=1.4.0; extra == "docs"
Requires-Dist: sphinxext-opengraph>=0.9.1; extra == "docs"
Requires-Dist: plotly>=5.14.0; extra == "docs"
Requires-Dist: polars>=0.20.23; extra == "docs"
Requires-Dist: sphinx-design>=0.5.0; extra == "docs"
Requires-Dist: sphinxcontrib-sass>=0.3.4; extra == "docs"
Requires-Dist: pydata-sphinx-theme>=0.15.3; extra == "docs"
Requires-Dist: sphinx-remove-toctrees>=1.0.0.post1; extra == "docs"
Requires-Dist: matplotlib>=3.3.4; extra == "examples"
Requires-Dist: scikit-image>=0.17.2; extra == "examples"
Requires-Dist: pandas>=1.1.5; extra == "examples"
Requires-Dist: seaborn>=0.9.0; extra == "examples"
Requires-Dist: pooch>=1.6.0; extra == "examples"
Requires-Dist: plotly>=5.14.0; extra == "examples"
Requires-Dist: matplotlib>=3.3.4; extra == "tests"
Requires-Dist: scikit-image>=0.17.2; extra == "tests"
Requires-Dist: pandas>=1.1.5; extra == "tests"
Requires-Dist: pytest>=7.1.2; extra == "tests"
Requires-Dist: pytest-cov>=2.9.0; extra == "tests"
Requires-Dist: ruff>=0.2.1; extra == "tests"
Requires-Dist: black>=24.3.0; extra == "tests"
Requires-Dist: mypy>=1.9; extra == "tests"
Requires-Dist: pyamg>=4.0.0; extra == "tests"
Requires-Dist: polars>=0.20.23; extra == "tests"
Requires-Dist: pyarrow>=12.0.0; extra == "tests"
Requires-Dist: numpydoc>=1.2.0; extra == "tests"
Requires-Dist: pooch>=1.6.0; extra == "tests"
Requires-Dist: conda-lock==2.5.6; extra == "maintenance"
Provides-Extra: build
Provides-Extra: install
Provides-Extra: benchmark
Provides-Extra: docs
Provides-Extra: examples
Provides-Extra: tests
Provides-Extra: maintenance
Description-Content-Type: text/x-rst
.. -*- mode: rst -*-
|Azure| |CirrusCI| |Codecov| |CircleCI| |Nightly wheels| |Black| |PythonVersion| |PyPi| |DOI| |Benchmark|
.. |Azure| image::
.. |CircleCI| image::
.. |CirrusCI| image::
.. |Codecov| image::
.. |Nightly wheels| image::
.. |PythonVersion| image::
.. |PyPi| image::
.. |Black| image::
.. |DOI| image::
.. |Benchmark| image::
.. |PythonMinVersion| replace:: 3.9
.. |NumPyMinVersion| replace:: 1.19.5
.. |SciPyMinVersion| replace:: 1.6.0
.. |JoblibMinVersion| replace:: 1.2.0
.. |ThreadpoolctlMinVersion| replace:: 3.1.0
.. |MatplotlibMinVersion| replace:: 3.3.4
.. |Scikit-ImageMinVersion| replace:: 0.17.2
.. |PandasMinVersion| replace:: 1.1.5
.. |SeabornMinVersion| replace:: 0.9.0
.. |PytestMinVersion| replace:: 7.1.2
.. |PlotlyMinVersion| replace:: 5.14.0
.. image::
**scikit-learn** is a Python module for machine learning built on top of
SciPy and is distributed under the 3-Clause BSD license.
The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <>`__ page
for a list of core contributors.
It is currently maintained by a team of volunteers.
scikit-learn requires:
- Python (>= |PythonMinVersion|)
- NumPy (>= |NumPyMinVersion|)
- SciPy (>= |SciPyMinVersion|)
- joblib (>= |JoblibMinVersion|)
- threadpoolctl (>= |ThreadpoolctlMinVersion|)
**Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4.**
scikit-learn 1.0 and later require Python 3.7 or newer.
scikit-learn 1.1 and later require Python 3.8 or newer.
Scikit-learn plotting capabilities (i.e., functions whose names start with ``plot_``
and classes whose names end with ``Display``) require Matplotlib (>= |MatplotlibMinVersion|).
Running the examples requires Matplotlib >= |MatplotlibMinVersion|.
A few examples require scikit-image >= |Scikit-ImageMinVersion|, a few examples
require pandas >= |PandasMinVersion|, some examples require seaborn >=
|SeabornMinVersion| and plotly >= |PlotlyMinVersion|.
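The minimum versions of the required dependencies can be checked against the current environment with a short script. The following is only a sketch using the standard library: the helper names ``version_tuple`` and ``check_requirements`` are hypothetical (they are not part of scikit-learn), and the version parsing is deliberately simplified (pre-release suffixes such as ``rc1`` are ignored).

```python
from importlib import metadata

# Minimum versions copied from the requirement list above.
MINIMUM_VERSIONS = {
    "numpy": "1.19.5",
    "scipy": "1.6.0",
    "joblib": "1.2.0",
    "threadpoolctl": "3.1.0",
}


def version_tuple(version, width=3):
    """Convert '1.19.5' to (1, 19, 5), ignoring suffixes such as 'rc1'."""
    numbers = []
    for piece in version.split(".")[:width]:
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        numbers.append(int(digits))
    # Pad with zeros so that '1.6' and '1.6.0' compare as equal.
    return tuple(numbers + [0] * (width - len(numbers)))


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a {package: status} report for the given minimum versions."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

In practice ``pip`` and ``conda`` enforce these constraints at install time, so a check like this is mainly useful when debugging an environment that was assembled by hand.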
User installation
If you already have a working installation of NumPy and SciPy,
the easiest way to install scikit-learn is using ``pip``::
pip install -U scikit-learn
or ``conda``::
conda install -c conda-forge scikit-learn
The documentation includes more detailed `installation instructions <>`_.
See the `changelog <>`__
for a history of notable changes to scikit-learn.
We welcome new contributors of all experience levels. The scikit-learn
community goals are to be helpful, welcoming, and effective. The
`Development Guide <>`_
has detailed information about contributing code, documentation, tests, and
more. We've included some basic information in this README.
Important links
- Official source code repo:
- Download releases:
- Issue tracker:
Source code
You can check the latest sources with the command::
git clone
To learn more about making a contribution to scikit-learn, please see our
`Contributing guide
After installation, you can launch the test suite from outside the source
directory (you will need to have ``pytest`` >= |PytestMinVersion| installed)::
pytest sklearn
See the web page
for more information.
Random number generation can be controlled during testing by setting
the ``SKLEARN_SEED`` environment variable.
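As an illustration of the general idea, the sketch below shows how a seed read from an environment variable makes random draws reproducible. It is not scikit-learn's actual test setup: the helper ``rng_from_env`` and its ``default_seed`` fallback are hypothetical, and the environment is passed in as a plain mapping so the example does not modify the real process environment.

```python
import os
import random


def rng_from_env(environ=os.environ, var_name="SKLEARN_SEED", default_seed=0):
    """Build a seeded random.Random from an environment-style mapping.

    Hypothetical helper: falls back to ``default_seed`` when the
    variable is not set.
    """
    raw_seed = environ.get(var_name)
    seed = int(raw_seed) if raw_seed is not None else default_seed
    return random.Random(seed)


# The same seed gives the same sequence of draws on every run.
first_run = rng_from_env({"SKLEARN_SEED": "42"}).random()
second_run = rng_from_env({"SKLEARN_SEED": "42"}).random()

# A different seed gives a different sequence.
other_seed = rng_from_env({"SKLEARN_SEED": "7"}).random()

print(first_run == second_run)
print(first_run == other_seed)
```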
Submitting a Pull Request
Before opening a Pull Request, have a look at the
full Contributing page to make sure your code complies
with our guidelines:
Project History
The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <>`__ page
for a list of core contributors.
The project is currently maintained by a team of volunteers.
**Note**: `scikit-learn` was previously referred to as `scikits.learn`.
Help and Support
- HTML documentation (stable release):
- HTML documentation (development version):
- FAQ:
- Mailing list:
- Logos & Branding:
- Blog:
- Calendar:
- Twitter:
- Stack Overflow:
- GitHub Discussions:
- Website:
- LinkedIn:
- YouTube:
- Facebook:
- Instagram:
- TikTok:
- Mastodon:
- Discord:
If you use scikit-learn in a scientific publication, we would appreciate citations:
@ -0,0 +1,206 @@
# Security Policy
## Supported Versions
| Version | Supported |
| ------------- | ------------------ |
| 1.4.2 | :white_check_mark: |
| < 1.4.2 | :x: |
## Reporting a Vulnerability
Please report security vulnerabilities by email to ``.
This email is an alias to a subset of the scikit-learn maintainers' team.
If the security vulnerability is accepted, a patch will be crafted privately
in order to prepare a dedicated bugfix release as promptly as possible (depending
on the complexity of the fix).
In addition to sending the report by email, you can also report security
vulnerabilities to [tidelift](
@ -0,0 +1,152 @@
# Makefile for Sphinx documentation
# You can set these variables from the command line.
SPHINXBUILD ?= sphinx-build
BUILDDIR = _build
EXAMPLES_PATTERN_OPTS := -D sphinx_gallery_conf.filename_pattern="$(EXAMPLES_PATTERN)"
ifeq ($(CI), true)
# On CircleCI using -j2 does not seem to speed up the html-noplot build
else ifeq ($(shell uname), Darwin)
# Avoid stalling issues on MacOS
# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
.PHONY: help clean html dirhtml ziphtml pickle json latex latexpdf changes linkcheck doctest optipng
all: html-noplot
@echo "Please use \`make <target>' where <target> is one of"
@echo " html to make standalone HTML files"
@echo " dirhtml to make HTML files named index.html in directories"
@echo " ziphtml to make a ZIP of the HTML"
@echo " pickle to make pickle files"
@echo " json to make JSON files"
@echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
@echo " latexpdf to make LaTeX files and run them through pdflatex"
@echo " changes to make an overview of all changed/added/deprecated items"
@echo " linkcheck to check all external links for integrity"
@echo " doctest to run all doctests embedded in the documentation (if enabled)"
-rm -rf $(BUILDDIR)/*
@echo "Removed $(BUILDDIR)/*"
-rm -rf auto_examples/
@echo "Removed auto_examples/"
-rm -rf generated/*
@echo "Removed generated/"
-rm -rf modules/generated/
@echo "Removed modules/generated/"
-rm -rf css/styles/
@echo "Removed css/styles/"
-rm -rf api/*.rst
@echo "Removed api/*.rst"
# Default to SPHINX_NUMJOBS=1 for full documentation build. Using
# SPHINX_NUMJOBS!=1 may actually slow down the build, or cause weird issues in
# the CI (job stalling or EOFError), see
# or
# These two lines make the build a bit more lengthy, and
# the embedding of images more robust
rm -rf $(BUILDDIR)/html/_images
#rm -rf _build/doctrees/
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable"
# Default to SPHINX_NUMJOBS=auto (except on MacOS and CI) since this makes
# html-noplot build faster
$(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) -j$(SPHINX_NUMJOBS) \
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable."
@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
@if [ ! -d "$(BUILDDIR)/html/stable/" ]; then \
make html; \
# Optimize the images to reduce the size of the ZIP
optipng $(BUILDDIR)/html/stable/_images/*.png
# Exclude the output directory to avoid infinite recursion
cd $(BUILDDIR)/html/stable; \
zip -q -x _downloads \
-r _downloads/ .
@echo "Build finished. The ZIP of the HTML is in $(BUILDDIR)/html/stable/_downloads."
@echo "Build finished; now you can process the pickle files."
@echo "Build finished; now you can process the JSON files."
@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
@echo "Run \`make' in that directory to run these through (pdf)latex" \
"(use \`make latexpdf' here to do that automatically)."
@echo "Running LaTeX files through pdflatex..."
make -C $(BUILDDIR)/latex all-pdf
@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
@echo "The overview file is in $(BUILDDIR)/changes."
$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
@echo "Link check complete; look for any errors in the above output " \
"or in $(BUILDDIR)/linkcheck/output.txt."
@echo "Testing of doctests in the sources finished, look at the " \
"results in $(BUILDDIR)/doctest/output.txt."
python -c "from sklearn.datasets._lfw import _check_fetch_lfw; _check_fetch_lfw()"
# Optimize PNG files. Needs OptiPNG. Change the -P argument to the number of
# cores you have available, so -P 64 if you have a real computer ;)
find _build auto_examples */generated -name '*.png' -print0 \
| xargs -0 -n 1 -P 4 optipng -o10
dist: html ziphtml
# Documentation for scikit-learn
This directory contains the full manual and website as displayed at
|||| See
|||| for
detailed information about the documentation.
@ -0,0 +1,599 @@
.. _about:
About us
This project was started in 2007 as a Google Summer of Code project by
David Cournapeau. Later that year, Matthieu Brucher started work on
this project as part of his thesis.
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent
Michel of INRIA took leadership of the project and made the first public
release on February 1st, 2010. Since then, several releases have appeared
following an approximately 3-month cycle, and a thriving international
community has been leading the development.
The decision-making process and governance structure of scikit-learn are laid
out in the :ref:`governance document <governance>`.
.. The "author" anchor below is there to ensure that old html links (in
the form of "about.html#author") still work
.. _authors:
The people behind scikit-learn
Scikit-learn is a community project, developed by a large group of
people, all across the world. A few teams, listed below, have central
roles; however, a more complete list of contributors can be found `on
Maintainers Team
The following people are currently maintainers, in charge of
consolidating scikit-learn's development and maintenance:
.. include:: maintainers.rst
.. note::
Please do not email the authors directly to ask for assistance or report issues.
Instead, please see `What's the best way to ask questions about scikit-learn
in the FAQ.
.. seealso::
How you can :ref:`contribute to the project <contributing>`.
Documentation Team
The following people help with documenting the project:
.. include:: documentation_team.rst
Contributor Experience Team
The following people are active contributors who also help with
:ref:`triaging issues <bug_triaging>`, PRs, and general
.. include:: contributor_experience_team.rst
Communication Team
The following people help with :ref:`communication around scikit-learn
.. include:: communication_team.rst
Emeritus Core Developers
The following people have been active contributors in the past, but are no
longer active in the project:
.. include:: maintainers_emeritus.rst
Emeritus Communication Team
The following people have been active in the communication team in the
past, but no longer have communication responsibilities:
.. include:: communication_team_emeritus.rst
Emeritus Contributor Experience Team
The following people have been active in the contributor experience team in the
.. include:: contributor_experience_team_emeritus.rst
.. _citing-scikit-learn:
Citing scikit-learn
If you use scikit-learn in a scientific publication, we would appreciate
citations to the following paper:
`Scikit-learn: Machine Learning in Python
<>`_, Pedregosa
*et al.*, JMLR 12, pp. 2825-2830, 2011.
Bibtex entry::
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
If you want to cite scikit-learn for its API or design, you may also want to consider the
following paper:
:arxiv:`API design for machine learning software: experiences from the scikit-learn
project <1309.0238>`, Buitinck *et al.*, 2013.
Bibtex entry::
author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort
and Jaques Grobler and Robert Layton and Jake VanderPlas and
Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
title = {{API} design for machine learning software: experiences from the scikit-learn
booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning},
year = {2013},
pages = {108--122},
High quality PNG and SVG logos are available in the `doc/logos/
source directory.
.. image:: images/scikit-learn-logo-notext.png
:align: center
Scikit-learn is a community-driven project; however, institutional and private
grants help to ensure its sustainability.
The project would like to thank the following funders.
.. div:: sk-text-image-grid-small
.. div:: text-box
`:probabl. <>`_ funds Adrin Jalali, Arturo Amor, François Goupil,
Guillaume Lemaitre, Jérémie du Boisberranger, Olivier Grisel, and Stefanie Senger.
.. div:: image-box
.. image:: images/probabl.png
.. |chanel| image:: images/chanel.png
.. |axa| image:: images/axa.png
.. |bnp| image:: images/bnp.png
.. |dataiku| image:: images/dataiku.png
.. |hf| image:: images/huggingface_logo-noborder.png
.. |nvidia| image:: images/nvidia.png
.. |inria| image:: images/inria-logo.jpg
.. raw:: html
table.image-subtable tr {
border-color: transparent;
table.image-subtable td {
width: 50%;
vertical-align: middle;
text-align: center;
table.image-subtable td img {
max-height: 40px !important;
max-width: 90% !important;
.. div:: sk-text-image-grid-small
.. div:: text-box
The `Members <>`_ of
the `Scikit-learn Consortium at Inria Foundation
<>`_ help maintain and
improve the project through their financial support.
.. div:: image-box
.. table::
:class: image-subtable
| |chanel| |
| |axa| | |bnp| |
| |nvidia| | |hf| |
| |dataiku| |
| |inria| |
.. div:: sk-text-image-grid-small
.. div:: text-box
`NVidia <>`_ has funded Tim Head since 2022
and is part of the scikit-learn consortium at Inria.
.. div:: image-box
.. image:: images/nvidia.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`Microsoft <>`_ has funded Andreas Müller since 2020.
.. div:: image-box
.. image:: images/microsoft.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`Quansight Labs <>`_ has funded Lucy Liu since 2022.
.. div:: image-box
.. image:: images/quansight-labs.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`Tidelift <>`_ supports the project via their service
.. div:: image-box
.. image:: images/Tidelift-logo-on-light.svg
Past Sponsors
.. div:: sk-text-image-grid-small
.. div:: text-box
`Quansight Labs <>`_ funded Meekail Zain in 2022 and 2023,
and funded Thomas J. Fan from 2021 to 2023.
.. div:: image-box
.. image:: images/quansight-labs.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`Columbia University <>`_ funded Andreas Müller
.. div:: image-box
.. image:: images/columbia.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`The University of Sydney <>`_ funded Joel Nothman
.. div:: image-box
.. image:: images/sydney-primary.jpeg
.. div:: sk-text-image-grid-small
.. div:: text-box
Andreas Müller received a grant to improve scikit-learn from the
`Alfred P. Sloan Foundation <>`_ .
This grant supported the position of Nicolas Hug and Thomas J. Fan.
.. div:: image-box
.. image:: images/sloan_banner.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`INRIA <>`_ actively supports this project. It has
provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
(2012-2013) and Olivier Grisel (2013-2017) to work on this project
full-time. It also hosts coding sprints and other events.
.. div:: image-box
.. image:: images/inria-logo.jpg
.. div:: sk-text-image-grid-small
.. div:: text-box
`Paris-Saclay Center for Data Science <>`_
funded one year for a developer to work on the project full-time (2014-2015), 50%
of the time of Guillaume Lemaitre (2016-2017) and 50% of the time of Joris van den
Bossche (2017-2018).
.. div:: image-box
.. image:: images/cds-logo.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`NYU Moore-Sloan Data Science Environment <>`_
funded Andreas Müller (2014-2016) to work on this project. The Moore-Sloan
Data Science Environment also funds several students to work on the project
.. div:: image-box
.. image:: images/nyu_short_color.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`Télécom Paristech <>`_ funded Manoj Kumar
(2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry Guillemot
(2016-2017) and Albert Thomas (2017) to work on scikit-learn.
.. div:: image-box
.. image:: images/telecom.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`The Labex DigiCosme <>`_ funded Nicolas Goix
(2015-2016), Tom Dupré la Tour (2015-2016 and 2017-2018) and Mathurin Massias
(2018-2019) to work part time on scikit-learn during their PhDs. It also
funded a scikit-learn coding sprint in 2015.
.. div:: image-box
.. image:: images/digicosme.png
.. div:: sk-text-image-grid-small
.. div:: text-box
`The Chan-Zuckerberg Initiative <>`_ funded Nicolas
Hug to work full-time on scikit-learn in 2020.
.. div:: image-box
.. image:: images/czi_logo.svg
The following students were sponsored by `Google
<>`_ to work on scikit-learn through
the `Google Summer of Code <>`_
- 2007 - David Cournapeau
- 2011 - `Vlad Niculae`_
- 2012 - `Vlad Niculae`_, Immanuel Bayer
- 2013 - Kemal Eren, Nicolas Trésegnie
- 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar
- 2015 - `Raghav RV <>`_, Wei Xue
- 2016 - `Nelson Liu <>`_, `YenChen Lin <>`_
.. _Vlad Niculae:
The `NeuroDebian <>`_ project providing `Debian
<>`_ packaging and contributions is supported by
`Dr. James V. Haxby <>`_ (`Dartmouth
College <>`_).
The following organizations funded the scikit-learn consortium at Inria in
the past:
.. |msn| image:: images/microsoft.png
.. |bcg| image:: images/bcg.png
.. |fujitsu| image:: images/fujitsu.png
.. |aphp| image:: images/logo_APHP_text.png
.. raw:: html
div.image-subgrid img {
max-height: 50px;
max-width: 90%;
.. grid:: 2 2 4 4
:class-row: image-subgrid
:gutter: 1
.. grid-item::
:class: sd-text-center
:child-align: center
.. grid-item::
:class: sd-text-center
:child-align: center
.. grid-item::
:class: sd-text-center
:child-align: center
.. grid-item::
:class: sd-text-center
:child-align: center
- The International 2019 Paris sprint was kindly hosted by `AXA <>`_.
Some participants were also able to attend thanks to the support of the `Alfred P.
Sloan Foundation <>`_, the `Python Software
Foundation <>`_ (PSF) and the `DATAIA Institute
- The 2013 International Paris Sprint was made possible thanks to the support of
`Télécom Paristech <>`_, `tinyclues
<>`_, the `French Python Association
<>`_ and the `Fonds de la Recherche Scientifique
- The 2011 International Granada sprint was made possible thanks to the support
of the `PSF <>`_ and `tinyclues
Donating to the project
If you are interested in donating to the project or to one of our code-sprints,
please donate via the `NumFOCUS Donations Page
.. raw:: html
<p class="text-center">
<a class="btn sk-btn-orange mb-1" href="">
Help us, <strong>donate!</strong>
All donations will be handled by `NumFOCUS <>`_, a non-profit
organization which is managed by a board of `SciPy community members
<>`_. NumFOCUS's mission is to foster scientific
computing software, in particular in Python. As a fiscal home of scikit-learn, it
ensures that money is available when needed to keep the project funded and available
while in compliance with tax regulations.
Donations received for the scikit-learn project will mostly go towards covering
travel expenses for code sprints, as well as towards the organization budget of the
project [#f1]_.
.. rubric:: Notes
.. [#f1] Regarding the organization budget, in particular, we might use some of
the donated funds to pay for other project expenses such as DNS,
hosting or continuous integration services.
Infrastructure support
We would also like to thank `Microsoft Azure <>`_,
`Cirrus CI <>`_ and `CircleCI <>`_ for free CPU
time on their Continuous Integration servers, and `Anaconda Inc. <>`_
for the storage they provide for our staging and nightly builds.
@ -0,0 +1,24 @@
.. _api_depr_ref:
Recently Deprecated
.. currentmodule:: sklearn
{% for ver, objs in DEPRECATED_API_REFERENCE %}
.. _api_depr_ref-{{ ver|replace(".", "-") }}:
.. rubric:: To be removed in {{ ver }}
.. autosummary::
:toctree: ../modules/generated/
:template: base.rst
{% for obj in objs %}
{{ obj }}
{%- endfor %}
@ -0,0 +1,77 @@
.. _api_ref:
API Reference
This is the class and function reference of scikit-learn. Please refer to the
:ref:`full user guide <user_guide>` for further details, as the raw specifications of
classes and functions may not be enough to give full guidelines on their uses. For
reference on concepts repeated across the API, see :ref:`glossary`.
.. toctree::
:maxdepth: 2
{% for module, _ in API_REFERENCE %}
{{ module }}
{%- endfor %}
{%- endif %}
.. list-table::
:header-rows: 1
:class: apisearch-table
* - Object
- Description
{% for module, module_info in API_REFERENCE %}
{% for section in module_info["sections"] %}
{% for obj in section["autosummary"] %}
{% set parts = obj.rsplit(".", 1) %}
{% if parts|length > 1 %}
{% set full_module = module + "." + parts[0] %}
{% else %}
{% set full_module = module %}
{% endif %}
* - :obj:`~{{ module }}.{{ obj }}`
- .. div:: sk-apisearch-desc
.. currentmodule:: {{ full_module }}
.. autoshortsummary:: {{ module }}.{{ obj }}
.. div:: caption
:mod:`{{ full_module }}`
{% endfor %}
{% endfor %}
{% endfor %}
{% for ver, objs in DEPRECATED_API_REFERENCE %}
{% for obj in objs %}
{% set parts = obj.rsplit(".", 1) %}
{% if parts|length > 1 %}
{% set full_module = "sklearn." + parts[0] %}
{% else %}
{% set full_module = "sklearn" %}
{% endif %}
* - :obj:`~sklearn.{{ obj }}`
- .. div:: sk-apisearch-desc
.. currentmodule:: {{ full_module }}
.. autoshortsummary:: sklearn.{{ obj }}
.. div:: caption
:mod:`{{ full_module }}`
:bdg-ref-danger-line:`Deprecated in version {{ ver }} <api_depr_ref-{{ ver|replace(".", "-") }}>`
{% endfor %}
{% endfor %}
@ -0,0 +1,46 @@
{% if module == "sklearn" -%}
{%- set module_hook = "sklearn" -%}
{%- elif module.startswith("sklearn.") -%}
{%- set module_hook = module[8:] -%}
{%- else -%}
{%- set module_hook = None -%}
{%- endif -%}
{% if module_hook %}
.. _{{ module_hook }}_ref:
{% endif %}
{{ module }}
{{ "=" * module|length }}
.. automodule:: {{ module }}
{% if module_info["description"] %}
{{ module_info["description"] }}
{% endif %}
{% for section in module_info["sections"] %}
{% if section["title"] and module_hook %}
.. _{{ module_hook }}_ref-{{ section["title"]|lower|replace(" ", "-") }}:
{% endif %}
{% if section["title"] %}
{{ section["title"] }}
{{ "-" * section["title"]|length }}
{% endif %}
{% if section["description"] %}
{{ section["description"] }}
{% endif %}
.. autosummary::
:toctree: ../modules/generated/
:template: base.rst
{% for obj in section["autosummary"] %}
{{ obj }}
{%- endfor %}
{% endfor %}
# A binder requirement file is required by sphinx-gallery.
# We don't really need one since our binder requirement file lives in the
# .binder directory.
# This file can be removed if 'dependencies' is made an optional key for
# binder in sphinx-gallery.
Normal file
@ -0,0 +1,574 @@
.. _common_pitfalls:
Common pitfalls and recommended practices
The purpose of this chapter is to illustrate some common pitfalls and
anti-patterns that occur when using scikit-learn. It provides
examples of what **not** to do, along with a corresponding correct example.
Inconsistent preprocessing
scikit-learn provides a library of :ref:`data-transforms`, which
may clean (see :ref:`preprocessing`), reduce
(see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`)
or generate (see :ref:`feature_extraction`) feature representations.
If these data transforms are used when training a model, they also
must be used on subsequent datasets, whether it's test data or
data in a production system. Otherwise, the feature space will change,
and the model will not be able to perform effectively.
For the following example, let's create a synthetic dataset with a
single feature::
>>> from sklearn.datasets import make_regression
>>> from sklearn.model_selection import train_test_split
>>> random_state = 42
>>> X, y = make_regression(random_state=random_state, n_features=1, noise=1)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, test_size=0.4, random_state=random_state)
The train dataset is scaled, but not the test dataset, so model
performance on the test dataset is worse than expected::
>>> from sklearn.metrics import mean_squared_error
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_transformed = scaler.fit_transform(X_train)
>>> model = LinearRegression().fit(X_train_transformed, y_train)
>>> mean_squared_error(y_test, model.predict(X_test))
Instead of passing the non-transformed `X_test` to `predict`, we should
transform the test data, the same way we transformed the training data::
>>> X_test_transformed = scaler.transform(X_test)
>>> mean_squared_error(y_test, model.predict(X_test_transformed))
Alternatively, we recommend using a :class:`Pipeline
<sklearn.pipeline.Pipeline>`, which makes it easier to chain transformations
with estimators, and reduces the possibility of forgetting a transformation::
>>> from sklearn.pipeline import make_pipeline
>>> model = make_pipeline(StandardScaler(), LinearRegression())
>>> model.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('linearregression', LinearRegression())])
>>> mean_squared_error(y_test, model.predict(X_test))
Pipelines also help avoid another common pitfall: leaking the test data
into the training data.
.. _data_leakage:
Data leakage
Data leakage occurs when information that would not be available at prediction
time is used when building the model. This results in overly optimistic
performance estimates, for example from :ref:`cross-validation
<cross_validation>`, and thus poorer performance when the model is used
on actually novel data, for example during production.
A common cause is not keeping the test and train data subsets separate.
Test data should never be used to make choices about the model.
**The general rule is to never call** `fit` **on the test data**. While this
may sound obvious, this is easy to miss in some cases, for example when
applying certain pre-processing steps.
Although both train and test data subsets should receive the same
preprocessing transformation (as described in the previous section), it is
important that these transformations are only learnt from the training data.
For example, if you have a
normalization step where you divide by the average value, the average should
be the average of the train subset, **not** the average of all the data. If the
test subset is included in the average calculation, information from the test
subset is influencing the model.
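As a minimal sketch of this rule, using only the standard library and made-up toy values (the numbers and variable names below are illustrative assumptions, not part of any dataset used in this chapter), the average is computed on the train subset alone and then reused unchanged for the test subset:

```python
from statistics import mean

# Hypothetical toy feature values; the last two play the role of the test subset.
values = [2.0, 4.0, 6.0, 8.0, 100.0, 120.0]
train, test = values[:4], values[4:]

# Correct: the statistic is learnt from the train subset only...
train_mean = mean(train)
train_scaled = [v / train_mean for v in train]
# ...and then reused unchanged on the test subset.
test_scaled = [v / train_mean for v in test]

# Leaky: the average over all the data depends on the test values.
leaky_mean = mean(values)

print(train_mean, leaky_mean)
```

Here `train_mean` is 5.0 whereas `leaky_mean` is 40.0: dividing by the latter would let the two test values change how the training data is represented.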
How to avoid data leakage
Below are some tips on avoiding data leakage:
* Always split the data into train and test subsets first, particularly
before any preprocessing steps.
* Never include test data when using the `fit` and `fit_transform`
methods. Using all the data, e.g., `fit(X)`, can result in overly optimistic scores.
Conversely, the `transform` method should be used on both train and test
subsets as the same preprocessing should be applied to all the data.
This can be achieved by using `fit_transform` on the train subset and
`transform` on the test subset.
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
leakage as it ensures that the appropriate method is performed on the
correct data subset. The pipeline is ideal for use in cross-validation
and hyper-parameter tuning functions.
An example of data leakage during preprocessing is detailed below.
Data leakage during pre-processing
.. note::
We here choose to illustrate data leakage with a feature selection step.
This risk of leakage is however relevant with almost all transformations
in scikit-learn, including (but not limited to)
:class:`~sklearn.impute.SimpleImputer`, and
A number of :ref:`feature_selection` functions are available in scikit-learn.
They can help remove irrelevant, redundant and noisy features as well as
improve your model build time and performance. As with any other type of
preprocessing, feature selection should **only** use the training data.
Including the test data in feature selection will optimistically bias your model.
To demonstrate we will create this binary classification problem with
10,000 randomly generated features::
>>> import numpy as np
>>> n_samples, n_features, n_classes = 200, 10000, 2
>>> rng = np.random.RandomState(42)
>>> X = rng.standard_normal((n_samples, n_features))
>>> y = rng.choice(n_classes, n_samples)
Using all the data to perform feature selection results in an accuracy score
much higher than chance, even though our targets are completely random.
This randomness means that our `X` and `y` are independent and we thus expect
the accuracy to be around 0.5. However, since the feature selection step
'sees' the test data, the model has an unfair advantage. In the incorrect
example below we first use all the data for feature selection and then split
the data into training and test subsets for model fitting. The result is a
much higher than expected accuracy score::
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.metrics import accuracy_score
>>> # Incorrect preprocessing: the entire data is transformed
>>> X_selected = SelectKBest(k=25).fit_transform(X, y)
>>> X_train, X_test, y_train, y_test = train_test_split(
... X_selected, y, random_state=42)
>>> gbc = GradientBoostingClassifier(random_state=1)
>>> gbc.fit(X_train, y_train)
>>> y_pred = gbc.predict(X_test)
>>> accuracy_score(y_test, y_pred)
To prevent data leakage, it is good practice to split your data into train
and test subsets **first**. Feature selection can then be performed using just
the train dataset. Notice that whenever we use `fit` or `fit_transform`, we
only use the train dataset. The score is now what we would expect for the
data, close to chance::
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, random_state=42)
>>> select = SelectKBest(k=25)
>>> X_train_selected = select.fit_transform(X_train, y_train)
>>> gbc = GradientBoostingClassifier(random_state=1)
>>> gbc.fit(X_train_selected, y_train)
>>> X_test_selected = select.transform(X_test)
>>> y_pred = gbc.predict(X_test_selected)
>>> accuracy_score(y_test, y_pred)
Here again, we recommend using a :class:`~sklearn.pipeline.Pipeline` to chain
together the feature selection and model estimators. The pipeline ensures
that only the training data is used when performing `fit` and the test data
is used only for calculating the accuracy score::
>>> from sklearn.pipeline import make_pipeline
>>> X_train, X_test, y_train, y_test = train_test_split(
... X, y, random_state=42)
>>> pipeline = make_pipeline(SelectKBest(k=25),
... GradientBoostingClassifier(random_state=1))
>>> pipeline.fit(X_train, y_train)
Pipeline(steps=[('selectkbest', SelectKBest(k=25)),
('gradientboostingclassifier',
GradientBoostingClassifier(random_state=1))])
>>> y_pred = pipeline.predict(X_test)
>>> accuracy_score(y_test, y_pred)
The pipeline can also be fed into a cross-validation
function such as :func:`~sklearn.model_selection.cross_val_score`.
Again, the pipeline ensures that the correct data subset and estimator
method is used during fitting and predicting::
>>> from sklearn.model_selection import cross_val_score
>>> scores = cross_val_score(pipeline, X, y)
>>> print(f"Mean accuracy: {scores.mean():.2f}+/-{scores.std():.2f}")
Mean accuracy: 0.46+/-0.07
.. _randomness:
Controlling randomness
Some scikit-learn objects are inherently random. These are usually estimators
(e.g. :class:`~sklearn.ensemble.RandomForestClassifier`) and cross-validation
splitters (e.g. :class:`~sklearn.model_selection.KFold`). The randomness of
these objects is controlled via their `random_state` parameter, as described
in the :term:`Glossary <random_state>`. This section expands on the glossary
entry, and describes good practices and common pitfalls w.r.t. this
subtle parameter.
.. note:: Recommendation summary
For optimal robustness of cross-validation (CV) results, pass
`RandomState` instances when creating estimators, or leave `random_state`
set to `None`. Passing integers to CV splitters is usually the safest option
and is preferable; passing `RandomState` instances to splitters may
sometimes be useful to achieve very specific use-cases.
For both estimators and splitters, passing an integer vs passing an
instance (or `None`) leads to subtle but significant differences,
especially for CV procedures. These differences are important to
understand when reporting results.
For reproducible results across executions, remove any use of
`random_state=None`.
Using `None` or `RandomState` instances, and repeated calls to `fit` and `split`
The `random_state` parameter determines whether multiple calls to :term:`fit`
(for estimators) or to :term:`split` (for CV splitters) will produce the same
results, according to these rules:
- If an integer is passed, calling `fit` or `split` multiple times always
yields the same results.
- If `None` or a `RandomState` instance is passed: `fit` and `split` will
yield different results each time they are called, and the succession of
calls explores all sources of entropy. `None` is the default value for all
`random_state` parameters.
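These rules can be illustrated with :func:`~sklearn.utils.check_random_state`, the utility that turns a `random_state` value into a `RandomState` object. The sketch below assumes that estimators and splitters resolve their `random_state` through this helper when `fit` or `split` is called; the seed `42` and the variable names are arbitrary:

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seeds a brand-new RandomState on every call, so the stream restarts.
draw_1 = check_random_state(42).randint(1000, size=3)
draw_2 = check_random_state(42).randint(1000, size=3)
same_with_int = bool((draw_1 == draw_2).all())

# An instance is returned as-is, so its state keeps advancing across calls.
rng = np.random.RandomState(42)
returned_as_is = check_random_state(rng) is rng

# None maps to numpy's global RandomState singleton.
none_is_global = check_random_state(None) is np.random.mtrand._rand

print(same_with_int, returned_as_is, none_is_global)
```

All three flags are `True`: an integer restarts the same stream each time, while an instance (or `None`) hands back an object whose state is consumed by every draw.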
We here illustrate these rules for both estimators and CV splitters.
.. note::
Since passing `random_state=None` is equivalent to passing the global
`RandomState` instance from `numpy`
(`random_state=np.random.mtrand._rand`), we will not explicitly mention
`None` here. Everything that applies to instances also applies to using
`None`.
Passing instances means that calling `fit` multiple times will not yield the
same results, even if the estimator is fitted on the same data and with the
same hyper-parameters::
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.datasets import make_classification
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X, y = make_classification(n_features=5, random_state=rng)
>>> sgd = SGDClassifier(random_state=rng)
>>> sgd.fit(X, y).coef_
array([[ 8.85418642, 4.79084103, -3.13077794, 8.11915045, -0.56479934]])
>>> sgd.fit(X, y).coef_
array([[ 6.70814003, 5.25291366, -7.55212743, 5.18197458, 1.37845099]])
We can see from the snippet above that repeatedly calling `sgd.fit` has
produced different models, even if the data was the same. This is because the
Random Number Generator (RNG) of the estimator is consumed (i.e. mutated)
when `fit` is called, and this mutated RNG will be used in the subsequent
calls to `fit`. In addition, the `rng` object is shared across all objects
that use it, and as a consequence, these objects become somewhat
inter-dependent. For example, two estimators that share the same
`RandomState` instance will influence each other, as we will see later when
we discuss cloning. This point is important to keep in mind when debugging.
If we had passed an integer to the `random_state` parameter of the
:class:`~sklearn.linear_model.SGDClassifier`, we would have obtained the
same models, and thus the same scores each time. When we pass an integer, the
same RNG is used across all calls to `fit`. What internally happens is that
even though the RNG is consumed when `fit` is called, it is always reset to
its original state at the beginning of `fit`.
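The integer case can be checked with a short sketch. The seed `0` and the dataset parameters are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_features=5, random_state=0)
sgd = SGDClassifier(random_state=0)  # integer seed instead of an instance

# Copy the coefficients so that the second fit cannot overwrite the first result.
coef_first = sgd.fit(X, y).coef_.copy()
coef_second = sgd.fit(X, y).coef_.copy()

print(np.array_equal(coef_first, coef_second))
```

Both fits start from the same RNG state, so the two coefficient arrays are expected to be identical.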
CV splitters
Randomized CV splitters have a similar behavior when a `RandomState`
instance is passed; calling `split` multiple times yields different data
splits::
>>> from sklearn.model_selection import KFold
>>> import numpy as np
>>> X = y = np.arange(10)
>>> rng = np.random.RandomState(0)
>>> cv = KFold(n_splits=2, shuffle=True, random_state=rng)
>>> for train, test in cv.split(X, y):
... print(train, test)
[0 3 5 6 7] [1 2 4 8 9]
[1 2 4 8 9] [0 3 5 6 7]
>>> for train, test in cv.split(X, y):
... print(train, test)
[0 4 6 7 8] [1 2 3 5 9]
[1 2 3 5 9] [0 4 6 7 8]
We can see that the splits differ the second time `split` is
called. This may lead to unexpected results if you compare the performance of
multiple estimators by calling `split` many times, as we will see in the next
section.
Common pitfalls and subtleties
While the rules that govern the `random_state` parameter are seemingly simple,
they do however have some subtle implications. In some cases, this can even
lead to wrong conclusions.
**Different `random_state` types lead to different cross-validation
procedures**
Depending on the type of the `random_state` parameter, estimators will behave
differently, especially in cross-validation procedures. Consider the
following snippet::
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> X, y = make_classification(random_state=0)
>>> rf_123 = RandomForestClassifier(random_state=123)
>>> cross_val_score(rf_123, X, y)
array([0.85, 0.95, 0.95, 0.9 , 0.9 ])
>>> rf_inst = RandomForestClassifier(random_state=np.random.RandomState(0))
>>> cross_val_score(rf_inst, X, y)
array([0.9 , 0.95, 0.95, 0.9 , 0.9 ])
We see that the cross-validated scores of `rf_123` and `rf_inst` are
different, as should be expected since we didn't pass the same `random_state`
parameter. However, the difference between these scores is more subtle than
it looks, and **the cross-validation procedures that were performed by**
:func:`~sklearn.model_selection.cross_val_score` **significantly differ in
each case**:
- Since `rf_123` was passed an integer, every call to `fit` uses the same RNG:
this means that all random characteristics of the random forest estimator
will be the same for each of the 5 folds of the CV procedure. In
particular, the (randomly chosen) subset of features of the estimator will
be the same across all folds.
- Since `rf_inst` was passed a `RandomState` instance, each call to `fit`
starts from a different RNG. As a result, the random subset of features
will be different for each fold.
While having a constant estimator RNG across folds isn't inherently wrong, we
usually want CV results that are robust w.r.t. the estimator's randomness. As
a result, passing an instance instead of an integer may be preferable, since
it will allow the estimator RNG to vary for each fold.
.. note::
Here, :func:`~sklearn.model_selection.cross_val_score` will use a
non-randomized CV splitter (as is the default), so both estimators will
be evaluated on the same splits. This section is not about variability in
the splits. Also, whether we pass an integer or an instance to
:func:`~sklearn.datasets.make_classification` isn't relevant for our
illustration purpose: what matters is what we pass to the
:class:`~sklearn.ensemble.RandomForestClassifier` estimator.
.. dropdown:: Cloning
Another subtle side effect of passing `RandomState` instances is how
:func:`~sklearn.base.clone` will work::
>>> from sklearn import clone
>>> from sklearn.ensemble import RandomForestClassifier
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> a = RandomForestClassifier(random_state=rng)
>>> b = clone(a)
Since a `RandomState` instance was passed to `a`, `a` and `b` are not clones
in the strict sense, but rather clones in the statistical sense: `a` and `b`
will still be different models, even when calling `fit(X, y)` on the same
data. Moreover, `a` and `b` will influence each other since they share the
same internal RNG: calling `a.fit` will consume `b`'s RNG, and calling
`b.fit` will consume `a`'s RNG, since they are the same. This is true for
any estimators that share a `random_state` parameter; it is not specific to
cloning.
If an integer were passed, `a` and `b` would be exact clones and they would not
influence each other.
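A minimal sketch of the integer case (the dataset and `n_estimators=10` are arbitrary choices made to keep the example fast):

```python
import numpy as np
from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(random_state=0)
a = RandomForestClassifier(n_estimators=10, random_state=0)
b = clone(a)  # b gets its own copy of the integer seed

# Fitting one estimator does not affect the other: both start from seed 0.
proba_a = a.fit(X, y).predict_proba(X)
proba_b = b.fit(X, y).predict_proba(X)
exact_clones = bool(np.allclose(proba_a, proba_b))

print(exact_clones)
```

With an integer seed the two forests are built from the same random draws, so their predictions are expected to match.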
.. warning::
Even though :func:`~sklearn.base.clone` is rarely used in user code, it is
called pervasively throughout the scikit-learn codebase: in particular, most
meta-estimators that accept non-fitted estimators call
:func:`~sklearn.base.clone` internally
(:class:`~sklearn.calibration.CalibratedClassifierCV`, etc.).
CV splitters
When passed a `RandomState` instance, CV splitters yield different splits
each time `split` is called. When comparing different estimators, this can
lead to overestimating the variance of the difference in performance between
the estimators::
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import KFold
>>> from sklearn.model_selection import cross_val_score
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X, y = make_classification(random_state=rng)
>>> cv = KFold(shuffle=True, random_state=rng)
>>> lda = LinearDiscriminantAnalysis()
>>> nb = GaussianNB()
>>> for est in (lda, nb):
... print(cross_val_score(est, X, y, cv=cv))
[0.8 0.75 0.75 0.7 0.85]
[0.85 0.95 0.95 0.85 0.95]
Directly comparing the performance of the
:class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` estimator
vs the :class:`~sklearn.naive_bayes.GaussianNB` estimator **on each fold** would
be a mistake: **the splits on which the estimators are evaluated are
different**. Indeed, :func:`~sklearn.model_selection.cross_val_score` will
internally call `cv.split` on the same
:class:`~sklearn.model_selection.KFold` instance, but the splits will be
different each time. This is also true for any tool that performs model
selection via cross-validation, e.g.
:class:`~sklearn.model_selection.GridSearchCV` and
:class:`~sklearn.model_selection.RandomizedSearchCV`: scores are not
comparable fold-to-fold across different calls to `fit`, since
`cv.split` would have been called multiple times. Within a single call to
`fit`, however, fold-to-fold comparison is possible since the search
estimator only calls `cv.split` once.
For comparable fold-to-fold results in all scenarios, one should pass an
integer to the CV splitter: `cv = KFold(shuffle=True, random_state=0)`.
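As a sketch of this recommendation (the seed `0` and the dataset are arbitrary), an integer-seeded splitter yields the same folds on every call to `split`, so both estimators are scored on identical splits:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(random_state=0)
cv = KFold(shuffle=True, random_state=0)  # integer seed

# Two successive calls to split produce identical test folds.
first = [test for _, test in cv.split(X, y)]
second = [test for _, test in cv.split(X, y)]
same_splits = all(np.array_equal(f, s) for f, s in zip(first, second))

# Each estimator is therefore evaluated on the same five folds.
scores = {
    type(est).__name__: cross_val_score(est, X, y, cv=cv)
    for est in (LinearDiscriminantAnalysis(), GaussianNB())
}
print(same_splits)
```

The per-fold scores stored in `scores` can then be compared position by position, which is not meaningful when a `RandomState` instance is passed to the splitter.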
.. note::
While fold-to-fold comparison is not advisable with `RandomState`
instances, one can however expect that average scores allow one to conclude
whether one estimator is better than another, as long as enough folds and
data are used.
.. note::
What matters in this example is what was passed to
:class:`~sklearn.model_selection.KFold`. Whether we pass a `RandomState`
instance or an integer to :func:`~sklearn.datasets.make_classification`
is not relevant for our illustration purpose. Also, neither
:class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` nor
:class:`~sklearn.naive_bayes.GaussianNB` are randomized estimators.
General recommendations
Getting reproducible results across multiple executions
In order to obtain reproducible (i.e. constant) results across multiple
*program executions*, we need to remove all uses of `random_state=None`, which
is the default. The recommended way is to declare a `rng` variable at the top
of the program, and pass it down to any object that accepts a `random_state`
parameter::
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> X, y = make_classification(random_state=rng)
>>> rf = RandomForestClassifier(random_state=rng)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
... random_state=rng)
>>> rf.fit(X_train, y_train).score(X_test, y_test)
We are now guaranteed that the result of this script will always be 0.84, no
matter how many times we run it. Changing the global `rng` variable to a
different value should affect the results, as expected.
It is also possible to declare the `rng` variable as an integer. This may
however lead to less robust cross-validation results, as we will see in the
next section.
.. note::
We do not recommend setting the global `numpy` seed by calling
`np.random.seed(0)`. See `here
for a discussion.
Robustness of cross-validation results
When we evaluate a randomized estimator performance by cross-validation, we
want to make sure that the estimator can yield accurate predictions for new
data, but we also want to make sure that the estimator is robust w.r.t. its
random initialization. For example, we would like the random weights
initialization of a :class:`~sklearn.linear_model.SGDClassifier` to be
consistently good across all folds: otherwise, when we train that estimator
on new data, we might get unlucky and the random initialization may lead to
bad performance. Similarly, we want a random forest to be robust w.r.t. the
set of randomly selected features that each tree will be using.
For these reasons, it is preferable to evaluate the cross-validation
performance by letting the estimator use a different RNG on each fold. This
is done by passing a `RandomState` instance (or `None`) to the estimator.
When we pass an integer, the estimator will use the same RNG on each fold:
if the estimator performs well (or badly), as evaluated by CV, it might just be
because we got lucky (or unlucky) with that specific seed. Passing instances
leads to more robust CV results, and makes the comparison between various
algorithms fairer. It also helps limit the temptation to treat the
estimator's RNG as a hyper-parameter that can be tuned.
Whether we pass `RandomState` instances or integers to CV splitters has no
impact on robustness, as long as `split` is only called once. When `split`
is called multiple times, fold-to-fold comparison isn't possible anymore. As
a result, passing an integer to CV splitters is usually safer and covers most
use-cases.
Normal file
.. raw :: html
<!-- Generated by -->
<div class="sk-authors-container">
img.avatar {border-radius: 10px;}
<a href=''><img src='' class='avatar' /></a> <br />
<p>Lauren Burke</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>François Goupil</p>
@ -0,0 +1 @@
- Reshama Shaikh
@ -0,0 +1,10 @@
Computing with scikit-learn
.. toctree::
:maxdepth: 2
@ -0,0 +1,366 @@
.. _computational_performance:
.. currentmodule:: sklearn
Computational Performance
For some applications the performance (mainly latency and throughput at
prediction time) of estimators is crucial. It may also be of interest to
consider the training throughput but this is often less important in a
production setup (where it often takes place offline).
We will review here the orders of magnitude you can expect from a number of
scikit-learn estimators in different contexts and provide some tips and
tricks for overcoming performance bottlenecks.
Prediction latency is measured as the elapsed time necessary to make a
prediction (e.g. in microseconds). Latency is often viewed as a distribution
and operations engineers often focus on the latency at a given percentile of
this distribution (e.g. the 90th percentile).
Prediction throughput is defined as the number of predictions the software can
deliver in a given amount of time (e.g. in predictions per second).
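Both quantities can be measured with the standard library alone. In the sketch below, `predict` is a hypothetical stand-in for a fitted model's prediction on a single sample, and the sample data is made up:

```python
import statistics
import time


def predict(sample):
    # Hypothetical stand-in for a fitted model's prediction on one sample.
    return sum(sample) / len(sample)


samples = [[float(i), float(i + 1), float(i + 2)] for i in range(1000)]

latencies = []
start_all = time.perf_counter()
for sample in samples:
    start = time.perf_counter()
    predict(sample)
    latencies.append(time.perf_counter() - start)
elapsed = time.perf_counter() - start_all

# With n=10, quantiles returns 9 cut points; the last one is the 90th percentile.
p90_latency = statistics.quantiles(latencies, n=10)[-1]
throughput = len(samples) / elapsed  # predictions per second

print(f"p90 latency: {p90_latency * 1e6:.2f} us, throughput: {throughput:.0f}/s")
```

The printed values depend on the machine and its load, so they should be read as orders of magnitude only.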
An important aspect of performance optimization is also that it can hurt
prediction accuracy. Indeed, simpler models (e.g. linear instead of
non-linear, or with fewer parameters) often run faster but are not always able
to take into account the same exact properties of the data as more complex ones.
Prediction Latency
One of the most straightforward concerns one may have when using/choosing a
machine learning toolkit is the latency at which predictions can be made in a
production environment.
The main factors that influence the prediction latency are
1. Number of features
2. Input data representation and sparsity
3. Model complexity
4. Feature extraction
A last major parameter is also the possibility to do predictions in bulk or
one-at-a-time mode.
Bulk versus Atomic mode
In general doing predictions in bulk (many instances at the same time) is
more efficient for a number of reasons (branching predictability, CPU cache,
optimizations in linear algebra libraries, etc.). Here we see, in a setting
with few features, that independently of estimator choice the bulk mode is
always faster, and for some estimators by 1 to 2 orders of magnitude:
.. |atomic_prediction_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_001.png
:target: ../auto_examples/applications/plot_prediction_latency.html
:scale: 80
.. centered:: |atomic_prediction_latency|
.. |bulk_prediction_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_002.png
:target: ../auto_examples/applications/plot_prediction_latency.html
:scale: 80
.. centered:: |bulk_prediction_latency|
To benchmark different estimators for your case you can simply change the
``n_features`` parameter in this example:
:ref:``. This should give
you an estimate of the order of magnitude of the prediction latency.
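For a rough feel of the bulk versus atomic gap on your own machine, the two modes can also be timed directly. This is only a sketch: the :class:`~linear_model.Ridge` model and the dataset size are arbitrary choices, and a single timing is noisy:

```python
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
model = Ridge().fit(X, y)

# Bulk mode: one call to predict for all the samples.
start = time.perf_counter()
bulk_pred = model.predict(X)
bulk_time = time.perf_counter() - start

# Atomic mode: one call to predict per sample.
start = time.perf_counter()
atomic_pred = np.array([model.predict(row.reshape(1, -1))[0] for row in X])
atomic_time = time.perf_counter() - start

print(f"bulk: {bulk_time:.6f}s, atomic: {atomic_time:.6f}s")
```

Both modes return the same predictions; only the per-call overhead differs.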
Configuring Scikit-learn for reduced validation overhead
Scikit-learn does some validation on data that increases the overhead per
call to ``predict`` and similar functions. In particular, checking that
features are finite (not NaN or infinite) involves a full pass over the
data. If you ensure that your data is acceptable, you may suppress
checking for finiteness by setting the environment variable
``SKLEARN_ASSUME_FINITE`` to a non-empty string before importing
scikit-learn, or configure it in Python with :func:`set_config`.
For more control than these global settings, a :func:`config_context`
allows you to set this configuration within a specified context::
>>> import sklearn
>>> with sklearn.config_context(assume_finite=True):
... pass # do learning/prediction here with reduced validation
Note that this will affect all uses of
:func:`~utils.assert_all_finite` within the context.
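The scope of the setting can be checked with :func:`get_config`. The following sketch reads the `assume_finite` flag before, inside and after the context (the variable names are illustrative):

```python
import sklearn

before = sklearn.get_config()["assume_finite"]
with sklearn.config_context(assume_finite=True):
    # The flag is switched on only inside the with block.
    inside = sklearn.get_config()["assume_finite"]
after = sklearn.get_config()["assume_finite"]

print(before, inside, after)
```

The flag is `True` inside the block and returns to its previous value afterwards, which makes the context manager a safer choice than a global :func:`set_config` call when only part of the code is known to handle clean data.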
Influence of the Number of Features
Obviously when the number of features increases so does the memory
consumption of each example. Indeed, for a matrix of :math:`M` instances
with :math:`N` features, the space complexity is in :math:`O(NM)`.
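For a dense `numpy` array this can be verified directly. The sizes below are arbitrary, and `float64` storage (8 bytes per entry) is assumed:

```python
import numpy as np

n_samples, n_features = 1000, 50  # M instances, N features (arbitrary values)
X = np.zeros((n_samples, n_features), dtype=np.float64)

# Each float64 entry takes 8 bytes, so the footprint is 8 * M * N bytes.
expected_bytes = n_samples * n_features * X.itemsize

# Doubling the number of features doubles the memory footprint.
X_wide = np.zeros((n_samples, 2 * n_features), dtype=np.float64)

print(X.nbytes, X_wide.nbytes)
```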
From a computing perspective it also means that the number of basic operations
(e.g., multiplications for vector-matrix products in linear models) increases
too. Here is a graph of the evolution of the prediction latency with the
number of features:
.. |influence_of_n_features_on_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_003.png
:target: ../auto_examples/applications/plot_prediction_latency.html
:scale: 80
.. centered:: |influence_of_n_features_on_latency|
Overall you can expect the prediction time to increase at least linearly with
the number of features (non-linear cases can happen depending on the global
memory footprint and estimator).
Influence of the Input Data Representation
Scipy provides sparse matrix data structures which are optimized for storing
sparse data. The main feature of sparse formats is that you don't store zeros
so if your data is sparse then you use much less memory. A non-zero value in
a sparse (`CSR or CSC <>`_)
representation will only take on average one 32-bit integer position + the
64-bit floating point value + an additional 32 bits per row or column in the matrix.
Using sparse input on a dense (or sparse) linear model can speed up prediction
by quite a bit as only the non-zero valued features impact the dot product
and thus the model predictions. Hence if you have 100 non-zeros in a
1e6-dimensional space, you only need 100 multiply-and-add operations instead of 1e6.
Calculation over a dense representation, however, may leverage highly optimized
vector operations and multithreading in BLAS, and tends to result in fewer CPU
cache misses. So the sparsity should typically be quite high (10% non-zeros
max, to be checked depending on the hardware) for the sparse input
representation to be faster than the dense input representation on a machine
with many CPUs and an optimized BLAS implementation.
Here is sample code to test the sparsity of your input::
import numpy as np
def sparsity_ratio(X):
    # Fraction of entries in the dense 2D array X that are exactly zero.
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])
print("input sparsity ratio:", sparsity_ratio(X))
As a rule of thumb you can consider that if the sparsity ratio is greater
than 90% you can probably benefit from sparse formats. Check Scipy's sparse
matrix formats `documentation <>`_
for more information on how to build (or convert your data to) sparse matrix
formats. Most of the time the ``CSR`` and ``CSC`` formats work best.
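The rule of thumb can be tried on synthetic data. The sketch below builds a random matrix with roughly 95% zeros (an arbitrary choice), converts it to ``CSR`` with `scipy.sparse.csr_matrix`, and compares the memory used by both representations:

```python
import numpy as np
from scipy import sparse


def sparsity_ratio(X):
    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])


rng = np.random.RandomState(0)
X = rng.standard_normal((200, 500))
X[rng.uniform(size=X.shape) < 0.95] = 0.0  # zero out roughly 95% of the entries

ratio = sparsity_ratio(X)
X_csr = sparse.csr_matrix(X)

# A CSR matrix stores the non-zero values, their column indices and row pointers.
sparse_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes
dense_bytes = X.nbytes

print(f"sparsity: {ratio:.3f}, dense: {dense_bytes} B, sparse: {sparse_bytes} B")
```

Whether the smaller footprint also translates into faster predictions depends on the estimator and the hardware, as discussed above.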
Influence of the Model Complexity
Generally speaking, when model complexity increases, predictive power and
latency are supposed to increase. Increasing predictive power is usually
desirable, but for many applications we would rather not increase
prediction latency too much. We will now review this idea for different
families of supervised models.
For :mod:`sklearn.linear_model` (e.g. Lasso, ElasticNet,
SGDClassifier/Regressor, Ridge & RidgeClassifier,
PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression...) the
decision function that is applied at prediction time is the same (a dot product),
so latency should be equivalent.
Here is an example using
:class:`~linear_model.SGDClassifier` with the
``elasticnet`` penalty. The regularization strength is globally controlled by
the ``alpha`` parameter. With a sufficiently high ``alpha``,
one can then increase the ``l1_ratio`` parameter of ``elasticnet`` to
enforce various levels of sparsity in the model coefficients. Higher sparsity
here is interpreted as less model complexity as we need fewer coefficients to
describe it fully. Of course sparsity influences in turn the prediction time
as the sparse dot-product takes time roughly proportional to the number of
non-zero coefficients.
.. |en_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_001.png
:target: ../auto_examples/applications/plot_model_complexity_influence.html
:scale: 80
.. centered:: |en_model_complexity|
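The effect of ``l1_ratio`` on coefficient sparsity can be explored with a short sketch. The dataset, ``alpha=0.01`` and the grid of ``l1_ratio`` values are arbitrary assumptions, and the exact counts depend on the data and on the scikit-learn version:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(
    n_samples=500, n_features=50, n_informative=5, random_state=0
)

nonzero_counts = {}
for l1_ratio in (0.0, 0.5, 1.0):
    clf = SGDClassifier(
        penalty="elasticnet", alpha=0.01, l1_ratio=l1_ratio, random_state=0
    )
    clf.fit(X, y)
    # Number of coefficients that are not exactly zero.
    nonzero_counts[l1_ratio] = int(np.count_nonzero(clf.coef_))

print(nonzero_counts)
```

A larger ``l1_ratio`` puts more weight on the L1 term, which is what drives coefficients to exactly zero.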
For the :mod:`sklearn.svm` family of algorithms with a non-linear kernel,
the latency is tied to the number of support vectors (the fewer the faster).
Latency and throughput should (asymptotically) grow linearly with the number
of support vectors in an SVC or SVR model. The kernel will also influence the
latency as it is used to compute the projection of the input vector once per
support vector. In the following graph the ``nu`` parameter of
:class:`~svm.NuSVR` was used to influence the number of
support vectors.
.. |nusvr_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_002.png
:target: ../auto_examples/applications/plot_model_complexity_influence.html
:scale: 80
.. centered:: |nusvr_model_complexity|
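The link between ``nu`` and the number of support vectors can be checked
directly on a fitted model. The following sketch uses a noisy sine curve, which
is an assumption made for illustration and not the dataset used for the graph.

```python
import numpy as np
from sklearn.svm import NuSVR

# Noisy sine curve: synthetic data assumed for illustration only.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)
y = np.sin(X).ravel() + 0.2 * rng.randn(200)

n_support = {}
for nu in (0.1, 0.5, 0.9):
    model = NuSVR(nu=nu, kernel="rbf").fit(X, y)
    # ``support_`` holds the indices of the support vectors.
    n_support[nu] = len(model.support_)

print(n_support)
```

Since prediction cost grows with the number of support vectors, a smaller
``nu`` should typically give a faster (and possibly less accurate) model.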
For :mod:`sklearn.ensemble` of trees (e.g. RandomForest, GBT,
ExtraTrees, etc.) the number of trees and their depth play the most
important role. Latency and throughput should scale linearly with the number
of trees. In this case we used directly the ``n_estimators`` parameter of
.. |gbt_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_003.png
:target: ../auto_examples/applications/plot_model_complexity_influence.html
:scale: 80
.. centered:: |gbt_model_complexity|
In any case be warned that decreasing model complexity can hurt accuracy as
mentioned above. For instance a non-linearly separable problem can be handled
with a speedy linear model but prediction power will very likely suffer in
the process.
Feature Extraction Latency
Most scikit-learn models are usually pretty fast as they are implemented
either with compiled Cython extensions or optimized computing libraries.
On the other hand, in many real world applications the feature extraction
process (i.e. turning raw data like database rows or network packets into
numpy arrays) governs the overall prediction time. For example on the Reuters
text classification task the whole preparation (reading and parsing SGML
files, tokenizing the text and hashing it into a common vector space) is
taking 100 to 500 times more time than the actual prediction code, depending on
the chosen model.
.. |prediction_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_004.png
:target: ../auto_examples/applications/plot_out_of_core_classification.html
:scale: 80
.. centered:: |prediction_time|
In many cases it is thus recommended to carefully time and profile your
feature extraction code as it may be a good place to start optimizing when
your overall latency is too slow for your application.
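A minimal way to compare the two stages is to time them separately. In the
sketch below, ``extract_features`` and ``predict`` are hypothetical stand-ins
for your own feature extraction code and for ``model.predict``; only the
``timed`` helper is the point of the example.

```python
import time


def timed(func, *args, **kwargs):
    """Call ``func`` and return ``(result, elapsed_seconds)``."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def extract_features(raw_rows):
    # Hypothetical stand-in for a real feature extraction step.
    return [[float(len(row)), float(row.count(" "))] for row in raw_rows]


def predict(features):
    # Hypothetical stand-in for ``model.predict``.
    return [1 if f[0] > 10 else 0 for f in features]


raw_rows = ["a short row", "another, slightly longer raw row"] * 1000
features, t_extract = timed(extract_features, raw_rows)
predictions, t_predict = timed(predict, features)
print(f"extraction: {t_extract:.6f}s, prediction: {t_predict:.6f}s")
```

For a finer breakdown, a profiler such as the standard library ``cProfile``
module can be applied to the extraction function alone.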
Prediction Throughput
Another important metric to care about when sizing production systems is the
throughput i.e. the number of predictions you can make in a given amount of
time. Here is a benchmark from the
:ref:`` example that measures
this quantity for a number of estimators on synthetic data:
.. |throughput_benchmark| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_004.png
:target: ../auto_examples/applications/plot_prediction_latency.html
:scale: 80
.. centered:: |throughput_benchmark|
These throughputs are achieved on a single process. An obvious way to
increase the throughput of your application is to spawn additional instances
(usually processes in Python because of the
`GIL <>`_) that share the
same model. One might also add machines to spread the load. A detailed
explanation of how to achieve this is beyond the scope of this documentation.
Tips and Tricks
Linear algebra libraries
As scikit-learn relies heavily on Numpy/Scipy and linear algebra in general it
makes sense to take explicit care of the versions of these libraries.
Basically, you ought to make sure that Numpy is built using an optimized `BLAS
<>`_ /
`LAPACK <>`_ library.
Not all models benefit from optimized BLAS and Lapack implementations. For
instance models based on (randomized) decision trees typically do not rely on
BLAS calls in their inner loops, nor do kernel SVMs (``SVC``, ``SVR``,
``NuSVC``, ``NuSVR``). On the other hand a linear model implemented with a
BLAS DGEMM call (via ````) will typically benefit hugely from a tuned
BLAS implementation and lead to orders of magnitude speedup over a
non-optimized BLAS.
You can display the BLAS / LAPACK implementation used by your NumPy / SciPy /
scikit-learn install with the following command::
python -c "import sklearn; sklearn.show_versions()"
Optimized BLAS / LAPACK implementations include:
- Atlas (needs hardware-specific tuning by rebuilding on the target machine)
- OpenBLAS
- Apple Accelerate and vecLib frameworks (OSX only)
More information can be found on the `NumPy install page <>`_
and in this
`blog post <,-with-scipy-and-ubuntu/>`_
from Daniel Nouri which has some nice step by step install instructions for
Debian / Ubuntu.
.. _working_memory:
Limiting Working Memory
Some calculations when implemented using standard numpy vectorized operations
involve using a large amount of temporary memory. This may potentially exhaust
system memory. Where computations can be performed in fixed-memory chunks, we
attempt to do so, and allow the user to hint at the maximum size of this
working memory (defaulting to 1GB) using :func:`set_config` or
:func:`config_context`. The following example limits temporary working
memory to 128 MiB::
>>> import sklearn
>>> with sklearn.config_context(working_memory=128):
...     pass  # do chunked work here
An example of a chunked operation adhering to this setting is
:func:`~metrics.pairwise_distances_chunked`, which facilitates computing
row-wise reductions of a pairwise distance matrix.
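The sketch below shows one possible use of ``pairwise_distances_chunked``
under a reduced memory budget: finding, for each row of ``X``, the index of the
closest row of ``Y``. The array sizes and the 1 MiB budget are assumptions
picked so that several chunks are produced.

```python
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
Y = rng.rand(500, 10)


def nearest_index(dist_chunk, start):
    # ``dist_chunk`` holds the distances from a slice of rows of X
    # (beginning at row ``start``) to all rows of Y.
    return dist_chunk.argmin(axis=1)


# A deliberately small budget (1 MiB) so that several chunks are produced.
with sklearn.config_context(working_memory=1):
    chunks = list(pairwise_distances_chunked(X, Y, reduce_func=nearest_index))

nearest = np.concatenate(chunks)
print(len(chunks), nearest.shape)
```

The generator is consumed inside the ``with`` block so that the chunk size is
computed while the reduced ``working_memory`` setting is active.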
Model Compression
Model compression in scikit-learn only concerns linear models for the moment.
In this context it means that we want to control the model sparsity (i.e. the
number of non-zero coordinates in the model vectors). It is generally a good
idea to combine model sparsity with sparse input data representation.
Here is sample code that illustrates the use of the ``sparsify()`` method::
from sklearn.linear_model import SGDRegressor
clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)
clf.fit(X_train, y_train).sparsify()
In this example we prefer the ``elasticnet`` penalty as it is often a good
compromise between model compactness and prediction power. One can also
further tune the ``l1_ratio`` parameter (in combination with the
regularization strength ``alpha``) to control this tradeoff.
A typical `benchmark <>`_
on synthetic data yields a >30% decrease in latency when both the model and
input are sparse (with 0.000024 and 0.027400 non-zero coefficients ratio
respectively). Your mileage may vary depending on the sparsity and size of
your data and model.
Furthermore, sparsifying can be very useful to reduce the memory usage of
predictive models deployed on production servers.
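To see what ``sparsify()`` changes in practice, the following sketch fits a
model on synthetic data, converts its coefficients, and checks that predictions
are unchanged. The dataset, the rescaling of ``y`` and ``alpha=0.1`` are
assumptions made for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

# Synthetic data and ``alpha`` are assumptions made for illustration.
X, y = make_regression(
    n_samples=300, n_features=100, n_informative=5, noise=1.0, random_state=0
)
y = (y - y.mean()) / y.std()  # keep the SGD updates well scaled

clf = SGDRegressor(penalty="elasticnet", l1_ratio=0.25, alpha=0.1, random_state=0)
clf.fit(X, y)
dense_pred = clf.predict(X)

clf.sparsify()  # converts ``coef_`` to a scipy.sparse matrix in place
sparse_pred = clf.predict(X)

print(sparse.issparse(clf.coef_), np.allclose(dense_pred, sparse_pred))
```

Whether the sparse representation is actually faster or smaller depends on how
many coefficients are exactly zero, so it is worth measuring on your own model.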
Model Reshaping
Model reshaping consists in selecting only a portion of the available features
to fit a model. In other words, if a model discards features during the
learning phase we can then strip those from the input. This has several
benefits. Firstly it reduces memory (and therefore time) overhead of the
model itself. It also makes it possible to discard explicit
feature selection components in a pipeline once we know which features to
keep from a previous run. Finally, it can help reduce processing time and I/O
usage upstream in the data access and feature extraction layers by not
collecting and building features that are discarded by the model. For instance
if the raw data come from a database, it can make it possible to write simpler
and faster queries or reduce I/O usage by making the queries return lighter
At the moment, reshaping needs to be performed manually in scikit-learn.
In the case of sparse input (particularly in ``CSR`` format), it is generally
sufficient to not generate the relevant features, leaving their columns empty.
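For dense input, manual reshaping can be sketched with plain NumPy: keep only
the columns whose coefficient is non-zero and shrink the coefficient vector
accordingly. The coefficient values below are hypothetical and stand in for
the ``coef_`` of a fitted sparse linear model.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(5, 6)

# Hypothetical coefficients of a fitted sparse linear model:
# features 1, 3 and 4 were discarded (zero weight) during training.
coef = np.array([0.5, 0.0, -1.2, 0.0, 0.0, 2.0])
intercept = 0.1

kept = np.flatnonzero(coef)   # indices of the features to keep
X_small = X[:, kept]          # upstream code only needs these columns
coef_small = coef[kept]

full_pred = X @ coef + intercept
small_pred = X_small @ coef_small + intercept
print(kept, np.allclose(full_pred, small_pred))
```

Once ``kept`` is known, the data access and feature extraction layers can be
changed to produce only those columns.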
- :ref:`scikit-learn developer performance documentation <performance-howto>`
- `Scipy sparse matrix formats documentation <>`_
@ -0,0 +1,340 @@
Parallelism, resource management, and configuration
.. _parallelism:
Some scikit-learn estimators and utilities parallelize costly operations
using multiple CPU cores.
Depending on the type of estimator and sometimes the values of the
constructor parameters, this is either done:
- with higher-level parallelism via `joblib <>`_.
- with lower-level parallelism via OpenMP, used in C or Cython code.
- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations
on arrays.
The `n_jobs` parameter of estimators always controls the amount of parallelism
managed by joblib (processes or threads depending on the joblib backend).
The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code
or by BLAS & LAPACK libraries used by NumPy and SciPy operations used in scikit-learn
is always controlled by environment variables or `threadpoolctl` as explained below.
Note that some estimators can leverage all three kinds of parallelism at different
points of their training and prediction methods.
We describe these 3 types of parallelism in more detail in the following subsections.
Higher-level parallelism with joblib
When the underlying implementation uses joblib, the number of workers
(threads or processes) that are spawned in parallel can be controlled via the
``n_jobs`` parameter.
.. note::
Where (and how) parallelization happens in the estimators using joblib by
specifying `n_jobs` is currently poorly documented.
Please help us by improving our docs and tackling `issue 14228
Joblib is able to support both multi-processing and multi-threading. Whether
joblib chooses to spawn a thread or a process depends on the **backend**
that it's using.
scikit-learn generally relies on the ``loky`` backend, which is joblib's
default backend. Loky is a multi-processing backend. When doing
multi-processing, in order to avoid duplicating the memory in each process
(which isn't reasonable with big datasets), joblib will create a `memmap
that all processes can share, when the data is bigger than 1MB.
In some specific cases (when the code that is run in parallel releases the
GIL), scikit-learn will indicate to ``joblib`` that a multi-threading
backend is preferable.
As a user, you may control the backend that joblib will use (regardless of
what scikit-learn recommends) by using a context manager::
from joblib import parallel_backend
with parallel_backend('threading', n_jobs=2):
    # Your scikit-learn code here
    ...
Please refer to the `joblib's docs
for more details.
In practice, whether parallelism is helpful at improving runtime depends on
many factors. It is usually a good idea to experiment rather than assuming
that increasing the number of workers is always a good thing. In some cases
it can be highly detrimental to performance to run multiple copies of some
estimators or functions in parallel (see oversubscription below).
Lower-level parallelism with OpenMP
OpenMP is used to parallelize code written in Cython or C, relying on
multi-threading exclusively. By default, the implementations using OpenMP
will use as many threads as possible, i.e. as many threads as logical cores.
You can control the exact number of threads that are used either:
- via the ``OMP_NUM_THREADS`` environment variable, for instance when
running a Python script:
.. prompt:: bash $
- or via `threadpoolctl` as explained by `this piece of documentation
Parallel NumPy and SciPy routines from numerical libraries
scikit-learn relies heavily on NumPy and SciPy, which internally call
multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries
such as MKL, OpenBLAS or BLIS.
You can control the exact number of threads used by BLAS for each library
using environment variables, namely:
- ``MKL_NUM_THREADS`` sets the number of threads MKL uses,
- ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses,
- ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses.
Note that BLAS & LAPACK implementations can also be impacted by
`OMP_NUM_THREADS`. To check whether this is the case in your environment,
you can inspect how the number of threads effectively used by those libraries
is affected when running the following command in a bash or zsh terminal
for different values of `OMP_NUM_THREADS`:
.. prompt:: bash $
OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy
.. note::
At the time of writing (2022), NumPy and SciPy packages which are
distributed on (i.e. the ones installed via ``pip install``)
and on the conda-forge channel (i.e. the ones installed via
``conda install --channel conda-forge``) are linked with OpenBLAS, while
NumPy and SciPy packages shipped on the ``defaults`` conda
channel from (i.e. the ones installed via ``conda install``)
are linked by default with MKL.
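The same information can be inspected, and the limits changed, from within
Python using ``threadpoolctl``. The sketch below lists the native thread pools
loaded in the current process and restricts BLAS to one thread for a single
matrix product; the matrix size is an arbitrary choice for illustration.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the native thread pools (BLAS, OpenMP) loaded in this process.
for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"])

a = np.random.RandomState(0).rand(200, 200)

# Temporarily restrict BLAS calls to a single thread.
with threadpool_limits(limits=1, user_api="blas"):
    product = a @ a
```

Unlike the environment variables, the limit set this way only applies inside
the ``with`` block.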
Oversubscription: spawning too many threads
It is generally recommended to avoid using significantly more processes or
threads than the number of CPUs on a machine. Oversubscription happens when
a program is running too many threads at the same time.
Suppose you have a machine with 8 CPUs. Consider a case where you're running
a :class:`~sklearn.model_selection.GridSearchCV` (parallelized with joblib)
with ``n_jobs=8`` over a
:class:`~sklearn.ensemble.HistGradientBoostingClassifier` (parallelized with
OpenMP). Each instance of
:class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads
(since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which
leads to oversubscription of threads for physical CPU resources and thus
to scheduling overhead.
Oversubscription can arise in the exact same fashion with parallelized
routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
Starting from ``joblib >= 0.14``, when the ``loky`` backend is used (which
is the default), joblib will tell its child **processes** to limit the
number of threads they can use, so as to avoid oversubscription. In practice
the heuristic that joblib uses is to tell the processes to use ``max_threads
= n_cpus // n_jobs``, via their corresponding environment variable. Back to
our example from above, since the joblib backend of
:class:`~sklearn.model_selection.GridSearchCV` is ``loky``, each process will
only be able to use 1 thread instead of 8, thus mitigating the
oversubscription issue.
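The arithmetic of the example above can be written down as follows. The
function name is hypothetical, and the lower bound of one thread per worker is
an assumption added here; only the ``n_cpus // n_jobs`` rule comes from the
text.

```python
def threads_per_worker(n_cpus, n_jobs):
    """Heuristic quoted above: ``max_threads = n_cpus // n_jobs``.

    The lower bound of 1 is an assumption added here so that a worker
    always gets at least one thread.
    """
    return max(1, n_cpus // n_jobs)


n_cpus, n_jobs = 8, 8
unlimited_total = n_jobs * n_cpus  # every worker spawns one thread per CPU
limited_total = n_jobs * threads_per_worker(n_cpus, n_jobs)
print(unlimited_total, limited_total)
```

With 8 CPUs and ``n_jobs=8`` this gives 64 threads without the limit and 8
threads with it.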
Note that:
- Manually setting one of the environment variables (``OMP_NUM_THREADS``,
``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, or ``BLIS_NUM_THREADS``)
will take precedence over what joblib tries to do. The total number of
threads will be ``n_jobs * <LIB>_NUM_THREADS``. Note that setting this
limit will also impact your computations in the main process, which will
only use ``<LIB>_NUM_THREADS``. Joblib exposes a context manager for
finer control over the number of threads in its workers (see joblib docs
linked below).
- When joblib is configured to use the ``threading`` backend, there is no
mechanism to avoid oversubscription when calling into parallel native
libraries in the joblib-managed threads.
- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code
always use `threadpoolctl` internally to automatically adapt the numbers of
threads used by OpenMP and potentially nested BLAS calls so as to avoid
oversubscription.
You will find additional details about joblib mitigation of oversubscription
in `joblib documentation
You will find additional details about parallelism in numerical python libraries
in `this document from Thomas J. Fan <>`_.
Configuration switches
Python API
:func:`sklearn.set_config` and :func:`sklearn.config_context` can be used to change
parameters of the configuration that control aspects of parallelism.
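As a minimal sketch of the two entry points, the example below reads the
current configuration, changes ``assume_finite`` within a block, and changes
``working_memory`` globally. These two parameters are used only because they
appear elsewhere in this documentation; the value ``512`` is an arbitrary
choice.

```python
import sklearn

default_memory = sklearn.get_config()["working_memory"]

# Change a setting only within a block; it is restored on exit.
with sklearn.config_context(assume_finite=True):
    inside_context = sklearn.get_config()["assume_finite"]
after_context = sklearn.get_config()["assume_finite"]

# Change a setting globally for the rest of the process.
sklearn.set_config(working_memory=512)
updated_memory = sklearn.get_config()["working_memory"]
print(default_memory, inside_context, after_context, updated_memory)
```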
.. _environment_variable:
Environment variables
These environment variables should be set before importing scikit-learn.
Sets the default value for the `assume_finite` argument of
Sets the default value for the `working_memory` argument of
Sets the seed of the global random generator when running the tests, for
Note that scikit-learn tests are expected to run deterministically with
explicit seeding of their own independent RNG instances instead of relying on
the numpy or Python standard library RNG singletons to make sure that test
results are independent of the test execution order. However some tests might
forget to use explicit seeding and this variable is a way to control the initial
state of the aforementioned singletons.
Controls the seeding of the random number generator used in tests that rely on
the `global_random_seed` fixture.
All tests that use this fixture accept the contract that they should
deterministically pass for any seed value from 0 to 99 included.
If the `SKLEARN_TESTS_GLOBAL_RANDOM_SEED` environment variable is set to
`"any"` (which should be the case on nightly builds on the CI), the fixture
will choose an arbitrary seed in the above range (based on the BUILD_NUMBER or
the current day) and all fixtured tests will run for that specific seed. The
goal is to ensure that, over time, our CI will run all tests with different
seeds while keeping the test duration of a single run of the full test suite
limited. This will check that the assertions of tests written to use this
fixture are not dependent on a specific seed value.
The range of admissible seed values is limited to [0, 99] because it is often
not possible to write a test that can work for any possible seed and we want to
avoid having tests that randomly fail on the CI.
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42"`: run tests with a fixed seed of 42
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42"`: run the tests with all seeds
between 40 and 42 included
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any"`: run the tests with an arbitrary
seed selected between 0 and 99 included
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all"`: run the tests with all seeds
between 0 and 99 included. This can take a long time: only use for individual
tests, not the full test suite!
If the variable is not set, then 42 is used as the global seed in a
deterministic manner. This ensures that, by default, the scikit-learn test
suite is as deterministic as possible to avoid disrupting our friendly
third-party package maintainers. Similarly, this variable should not be set in
the CI config of pull-requests to make sure that our friendly contributors are
not the first people to encounter a seed-sensitivity regression in a test
unrelated to the changes of their own PR. Only the scikit-learn maintainers who
watch the results of the nightly builds are expected to be annoyed by this.
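The accepted values can be summarized by the following parser. It is a
hypothetical helper written to mirror the rules listed above, not
scikit-learn's own implementation; in particular, ``"any"`` is simplified to a
random draw instead of being derived from the build number or the current day.

```python
import random


def parse_global_random_seeds(value):
    """Return the list of seeds described by ``value``.

    Hypothetical helper mirroring the documented values; it is not
    scikit-learn's own implementation.
    """
    if value is None:
        return [42]  # variable not set
    if value == "any":
        return [random.randint(0, 99)]  # one arbitrary admissible seed
    if value == "all":
        return list(range(100))
    if "-" in value:
        start, stop = (int(part) for part in value.split("-"))
        seeds = list(range(start, stop + 1))
    else:
        seeds = [int(value)]
    if not seeds or min(seeds) < 0 or max(seeds) > 99:
        raise ValueError(f"seeds must lie in [0, 99], got {value!r}")
    return seeds


print(parse_global_random_seeds("40-42"))
```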
When writing a new test function that uses this fixture, please use the
following command to make sure that it passes deterministically for all
admissible seeds on your local machine:
.. prompt:: bash $
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name
When this environment variable is set to a non-zero value, the tests that need
network access are skipped. Network tests are also skipped when this
environment variable is not set.
When this environment variable is set to '1', the tests using the
`global_dtype` fixture are also run on float32 data.
When this environment variable is not set, the tests are only run on
float64 data.
When this environment variable is set to a non-zero value, the `Cython`
directive `boundscheck` is set to `True`. This is useful for finding
When this environment variable is set to a non-zero value, the debug symbols
will be included in the compiled C extensions. Only debug symbols for POSIX
systems are configured.
This sets the size of chunk to be used by the underlying `PairwiseDistancesReductions`
implementations. The default value is `256` which has been shown to be adequate on
most machines.
Users looking for the best performance might want to tune this variable using
powers of 2 so as to get the best parallelism behavior for their hardware,
especially with respect to their caches' sizes.
This environment variable is used to turn warnings into errors in tests and
documentation build.
Some CI (Continuous Integration) builds set `SKLEARN_WARNINGS_AS_ERRORS=1`, for
example to make sure that we catch deprecation warnings from our dependencies
and that we adapt our code.
To locally run with the same "warnings as errors" setting as in these CI builds
By default, warnings are not turned into errors. This is the case if
This environment variable uses specific warning filters to ignore some warnings,
since sometimes warnings originate from third-party libraries and there is not
much we can do about it. You can see the warning filters in the
`_get_warnings_filters_info_list` function in `sklearn/utils/`.
Note that for the documentation build, `SKLEARN_WARNINGS_AS_ERRORS=1` checks
that the documentation build, in particular running examples, does not produce
any warnings. This is different from the `-W` `sphinx-build` argument that
catches syntax warnings in the rst files.
@ -0,0 +1,136 @@
.. _scaling_strategies:
Strategies to scale computationally: bigger data
For some applications the number of examples, features (or both) and/or the
speed at which they need to be processed are challenging for traditional
approaches. In these cases scikit-learn has a number of options you can
consider to make your system scale.
Scaling with instances using out-of-core learning
Out-of-core (or "external memory") learning is a technique used to learn from
data that cannot fit in a computer's main memory (RAM).
Here is a sketch of a system designed to achieve this goal:
1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm
Streaming instances
Basically, 1. may be a reader that yields instances from files on a
hard drive, a database, from a network stream etc. However,
details on how to achieve this are beyond the scope of this documentation.
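As a minimal sketch of such a reader, the generator below groups any iterable
into mini-batches without loading it fully in memory. The function name is
hypothetical and ``range(10)`` stands in for a real data source.

```python
from itertools import islice


def iter_minibatches(instances, batch_size):
    """Yield lists of at most ``batch_size`` items from any iterable."""
    iterator = iter(instances)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# ``range(10)`` stands in for a file, database cursor or network stream.
batches = list(iter_minibatches(range(10), batch_size=4))
print(batches)
```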
Extracting features
\2. could be any relevant way to extract features among the
different :ref:`feature extraction <feature_extraction>` methods supported by
scikit-learn. However, when working with data that needs vectorization and
where the set of features or values is not known in advance one should take
explicit care. A good example is text classification where unknown terms are
likely to be found during training. It is possible to use a stateful
vectorizer if making multiple passes over the data is reasonable from an
application point of view. Otherwise, one can turn up the difficulty by using
a stateless feature extractor. Currently the preferred way to do this is to
use the so-called :ref:`hashing trick<feature_hashing>` as implemented by
:class:`sklearn.feature_extraction.FeatureHasher` for datasets with categorical
variables represented as a list of Python dicts or
:class:`sklearn.feature_extraction.text.HashingVectorizer` for text documents.
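The following sketch shows why a stateless extractor suits streaming: each
batch is transformed independently, without a ``fit`` pass, and the output
always has the same number of columns. The documents and the choice of
``n_features=2**10`` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: no ``fit`` pass is needed, so each mini-batch can be
# vectorized independently, even when it contains unseen terms.
vectorizer = HashingVectorizer(n_features=2**10)

first_batch = ["the quick brown fox", "jumps over the lazy dog"]
later_batch = ["completely unseen vocabulary appears here"]

X_first = vectorizer.transform(first_batch)
X_later = vectorizer.transform(later_batch)
print(X_first.shape, X_later.shape)
```

A smaller ``n_features`` saves memory but increases the chance that distinct
terms collide in the same column.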
Incremental learning
Finally, for 3. we have a number of options inside scikit-learn. Although not
all algorithms can learn incrementally (i.e. without seeing all the instances
at once), all estimators implementing the ``partial_fit`` API are candidates.
Actually, the ability to learn incrementally from a mini-batch of instances
(sometimes called "online learning") is key to out-of-core learning as it
guarantees that at any given time there will be only a small number of
instances in the main memory. Choosing a good size for the mini-batch that
balances relevancy and memory footprint could involve some tuning [1]_.
Here is a list of incremental estimators for different tasks:
- Classification
+ :class:`sklearn.naive_bayes.MultinomialNB`
+ :class:`sklearn.naive_bayes.BernoulliNB`
+ :class:`sklearn.linear_model.Perceptron`
+ :class:`sklearn.linear_model.SGDClassifier`
+ :class:`sklearn.linear_model.PassiveAggressiveClassifier`
+ :class:`sklearn.neural_network.MLPClassifier`
- Regression
+ :class:`sklearn.linear_model.SGDRegressor`
+ :class:`sklearn.linear_model.PassiveAggressiveRegressor`
+ :class:`sklearn.neural_network.MLPRegressor`
- Clustering
+ :class:`sklearn.cluster.MiniBatchKMeans`
+ :class:`sklearn.cluster.Birch`
- Decomposition / feature Extraction
+ :class:`sklearn.decomposition.MiniBatchDictionaryLearning`
+ :class:`sklearn.decomposition.IncrementalPCA`
+ :class:`sklearn.decomposition.LatentDirichletAllocation`
+ :class:`sklearn.decomposition.MiniBatchNMF`
- Preprocessing
+ :class:`sklearn.preprocessing.StandardScaler`
+ :class:`sklearn.preprocessing.MinMaxScaler`
+ :class:`sklearn.preprocessing.MaxAbsScaler`
For classification, a somewhat important thing to note is that although a
stateless feature extraction routine may be able to cope with new/unseen
attributes, the incremental learner itself may be unable to cope with
new/unseen target classes. In this case you have to pass all the possible
classes to the first ``partial_fit`` call using the ``classes=`` parameter.
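The sketch below illustrates this with ``SGDClassifier``: the first
mini-batches contain only two of the three classes, so the full list is
declared up front. The random data is a stand-in for mini-batches read from a
stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
all_classes = np.array([0, 1, 2])
clf = SGDClassifier(random_state=0)

for batch_index in range(5):
    # Stand-in for a mini-batch read from a stream; the first two batches
    # only contain classes 0 and 1.
    X_batch = rng.rand(20, 4)
    y_batch = rng.randint(0, 2 if batch_index < 2 else 3, size=20)
    if batch_index == 0:
        # All possible classes must be declared on the first call.
        clf.partial_fit(X_batch, y_batch, classes=all_classes)
    else:
        clf.partial_fit(X_batch, y_batch)

print(clf.classes_)
```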
Another aspect to consider when choosing a proper algorithm is that not all of
them put the same importance on each example over time. Namely, the
``Perceptron`` is still sensitive to badly labeled examples even after many
examples whereas the ``SGD*`` and ``PassiveAggressive*`` families are more
robust to this kind of artifact. Conversely, the latter also tend to give less
importance to remarkably different, yet properly labeled examples when they
come late in the stream as their learning rate decreases over time.
Finally, we have a full-fledged example of
:ref:``. It is aimed at
providing a starting point for people wanting to build out-of-core learning
systems and demonstrates most of the notions discussed above.
Furthermore, it also shows the evolution of the performance of different
algorithms with the number of processed examples.
.. |accuracy_over_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_001.png
:target: ../auto_examples/applications/plot_out_of_core_classification.html
:scale: 80
.. centered:: |accuracy_over_time|
Now looking at the computation time of the different parts, we see that the
vectorization is much more expensive than learning itself. Among the different
algorithms, ``MultinomialNB`` is the most expensive, but its overhead can be
mitigated by increasing the size of the mini-batches (exercise: change
``minibatch_size`` to 100 and 10000 in the program and compare).
.. |computation_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_003.png
:target: ../auto_examples/applications/plot_out_of_core_classification.html
:scale: 80
.. centered:: |computation_time|
.. [1] Depending on the algorithm the mini-batch size can influence results or
not. SGD*, PassiveAggressive*, and discrete NaiveBayes are truly online
and are not affected by batch size. Conversely, MiniBatchKMeans
convergence rate is affected by the batch size. Also, its memory
footprint can vary dramatically with batch size.
@ -0,0 +1,966 @@
# scikit-learn documentation build configuration file, created by
# sphinx-quickstart on Fri Jan 8 09:13:42 2010.
# This file is execfile()d with the current directory set to its containing
# dir.
# Note that not all possible configuration values are present in this
# autogenerated file.
# All configuration values have a default; values that are commented out
# serve to show the default.
import os
import re
import sys
import warnings
from datetime import datetime
from pathlib import Path
from sklearn.externals._packaging.version import parse
from sklearn.utils._testing import turn_warnings_into_errors
# If extensions (or modules to document with autodoc) are in another
# directory, add these directories to sys.path here. If the directory
# is relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.
sys.path.insert(0, os.path.abspath("."))
sys.path.insert(0, os.path.abspath("sphinxext"))
import jinja2
import sphinx_gallery
from github_link import make_linkcode_resolve
from sphinx_gallery.notebook import add_code_cell, add_markdown_cell
from sphinx_gallery.sorting import ExampleTitleSortKey
# Configure plotly to integrate its output into the HTML pages generated by
# sphinx-gallery.
try:
    import plotly.io as pio
    pio.renderers.default = "sphinx_gallery"
except ImportError:
    # Make it possible to render the doc when not running the examples
    # that need plotly.
    pass
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = [
# See sphinxext/
# Specify how to identify the prompt when copying code snippets
copybutton_prompt_text = r">>> |\.\.\. "
copybutton_prompt_is_regexp = True
copybutton_exclude = "style"
try:
    import jupyterlite_sphinx  # noqa: F401

    with_jupyterlite = True
except ImportError:
    # In some cases we don't want to require jupyterlite_sphinx to be installed,
    # e.g. the doc-min-dependencies build
    warnings.warn(
        "jupyterlite_sphinx is not installed, you need to install it "
        "if you want JupyterLite links to appear in each example"
    )
    with_jupyterlite = False
# Produce `plot::` directives for examples that contain `import matplotlib` or
# `from matplotlib import`.
numpydoc_use_plots = True
# Options for the `::plot` directive:
plot_formats = ["png"]
plot_include_source = True
plot_html_show_formats = False
plot_html_show_source_link = False
# We do not need the table of class members because `sphinxext/`
# will show them in the secondary sidebar
numpydoc_show_class_members = False
numpydoc_show_inherited_class_members = False
# We want in-page toc of class members instead of a separate page for each entry
numpydoc_class_members_toctree = False
# For maths, use mathjax by default and svg if NO_MATHJAX env variable is set
# (useful for viewing the doc offline)
if os.environ.get("NO_MATHJAX"):
    imgmath_image_format = "svg"
    mathjax_path = ""
else:
    mathjax_path = ""
# Add any paths that contain templates here, relative to this directory.
templates_path = ["templates"]
# generate autosummary even if no references
autosummary_generate = True
# The suffix of source filenames.
source_suffix = ".rst"
# The encoding of source files.
source_encoding = "utf-8"
# The main toctree document.
root_doc = "index"
# General information about the project.
project = "scikit-learn"
copyright = f"2007 - {datetime.now().year}, scikit-learn developers (BSD License)"
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
# The short X.Y version.
import sklearn
parsed_version = parse(sklearn.__version__)
version = ".".join(parsed_version.base_version.split(".")[:2])
# The full version, including alpha/beta/rc tags.
# Removes post from release name
if parsed_version.is_postrelease:
    release = parsed_version.base_version
else:
    release = sklearn.__version__
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
# language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = [
# The reST default role (used for this markup: `text`) to use for all
# documents.
default_role = "literal"
# If true, '()' will be appended to :func: etc. cross-reference text.
add_function_parentheses = False
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# A list of ignored prefixes for module index sorting.
# modindex_common_prefix = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. Major themes that come with
# Sphinx are currently 'default' and 'sphinxdoc'.
html_theme = "pydata_sphinx_theme"
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
html_theme_options = {
# -- General configuration ------------------------------------------------
"sidebar_includehidden": True,
"use_edit_page_button": True,
"external_links": [],
"icon_links_label": "Icon Links",
"icon_links": [
    {
        "name": "GitHub",
        "url": "",
        "icon": "fa-brands fa-square-github",
        "type": "fontawesome",
    },
],
"analytics": {
    "plausible_analytics_domain": "",
    "plausible_analytics_url": "",
},
# If "prev-next" is included in article_footer_items, then setting show_prev_next
# to True would repeat prev and next links. See
"show_prev_next": False,
"search_bar_text": "Search the docs ...",
"navigation_with_keys": False,
"collapse_navigation": False,
"navigation_depth": 2,
"show_nav_level": 1,
"show_toc_level": 1,
"navbar_align": "left",
"header_links_before_dropdown": 5,
"header_dropdown_text": "More",
# The switcher requires a JSON file with the list of documentation versions, which
# is generated by the script `build_tools/circle/` and placed under
# the `js/` static directory; it will then be copied to the `_static` directory in
# the built documentation
"switcher": {
"json_url": "",
"version_match": release,
# check_switcher may be set to False if docbuild pipeline fails. See
"check_switcher": True,
"pygments_light_style": "tango",
"pygments_dark_style": "monokai",
"logo": {
"alt_text": "scikit-learn homepage",
"image_relative": "logos/scikit-learn-logo-small.png",
"image_light": "logos/scikit-learn-logo-small.png",
"image_dark": "logos/scikit-learn-logo-small.png",
"surface_warnings": True,
# -- Template placement in theme layouts ----------------------------------
"navbar_start": ["navbar-logo"],
# Note that the alignment of navbar_center is controlled by navbar_align
"navbar_center": ["navbar-nav"],
"navbar_end": ["theme-switcher", "navbar-icon-links", "version-switcher"],
# navbar_persistent is persistent right (even when on mobiles)
"navbar_persistent": ["search-button"],
"article_header_start": ["breadcrumbs"],
"article_header_end": [],
"article_footer_items": ["prev-next"],
"content_footer_items": [],
# Use html_sidebars that map page patterns to list of sidebar templates
"primary_sidebar_end": [],
"footer_start": ["copyright"],
"footer_center": [],
"footer_end": [],
# When specified as a dictionary, the keys should follow glob-style patterns, as in
# In particular, "**" specifies the default for all pages
# Use :html_theme.sidebar_secondary.remove: for file-wide removal
"secondary_sidebar_items": {"**": ["page-toc", "sourcelink"]},
"show_version_warning_banner": True,
"announcement": None,
# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = ["themes"]
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
# html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
html_short_title = "scikit-learn"
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
html_favicon = "logos/favicon.ico"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["images", "css", "js"]
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
# html_last_updated_fmt = '%b %d, %Y'
# Custom sidebar templates, maps document names to template names.
# Workaround for removing the left sidebar on pages without TOC
# A better solution would be to follow the merge of:
html_sidebars = {
"install": [],
"getting_started": [],
"glossary": [],
"faq": [],
"support": [],
"related_projects": [],
"roadmap": [],
"governance": [],
"about": [],
# Additional templates that should be rendered to pages, maps page names to
# template names.
html_additional_pages = {"index": "index.html"}
# Additional files to copy
# html_extra_path = []
# Additional JS files
html_js_files = [
# Compile scss files into css files using sphinxcontrib-sass
sass_src_dir, sass_out_dir = "scss", "css/styles"
sass_targets = {
f"{file.stem}.scss": f"{file.stem}.css"
for file in Path(sass_src_dir).glob("*.scss")
# Additional CSS files, should be subset of the values of `sass_targets`
html_css_files = ["styles/colors.css", "styles/custom.css"]
def add_js_css_files(app, pagename, templatename, context, doctree):
"""Load additional JS and CSS files only for certain pages.
Note that `html_js_files` and `html_css_files` are included in all pages and
should be used for the ones that are used by multiple pages. All page-specific
JS and CSS files should be added here instead.
if pagename == "api/index":
# External: jQuery and DataTables
# Internal: API search initialization and styling
elif pagename == "index":
elif pagename == "install":
elif pagename.startswith("modules/generated/"):
# If false, no module index is generated.
html_domain_indices = False
# If false, no index is generated.
html_use_index = False
# If true, the index is split into individual pages for each letter.
# html_split_index = False
# If true, links to the reST sources are added to the pages.
# html_show_sourcelink = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''
# If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = ''
# Output file base name for HTML help builder.
htmlhelp_basename = "scikit-learndoc"
# If true, the reST sources are included in the HTML build as _sources/name.
html_copy_source = True
# Adds variables into templates
html_context = {}
# finds latest release highlights and places it into HTML context for
# index.html
release_highlights_dir = Path("..") / "examples" / "release_highlights"
# Finds the highlight with the latest version number
latest_highlights = sorted(release_highlights_dir.glob("plot_release_highlights_*.py"))[
latest_highlights = latest_highlights.with_suffix("").name
html_context["release_highlights"] = (
# get version from highlight name assuming highlights have the form
# plot_release_highlights_0_22_0
highlight_version = ".".join(latest_highlights.split("_")[-3:-1])
html_context["release_highlights_version"] = highlight_version
# redirects dictionary maps from old links to new links
redirects = {
"documentation": "index",
"contents": "index",
"preface": "index",
"modules/classes": "api/index",
"auto_examples/feature_selection/plot_permutation_test_for_classification": (
"modules/model_persistence": "model_persistence",
"auto_examples/linear_model/plot_bayesian_ridge": (
"auto_examples/model_selection/": (
"auto_examples/miscellaneous/plot_changed_only_pprint_parameter": (
"auto_examples/decomposition/plot_beta_divergence": (
"auto_examples/svm/plot_svm_nonlinear": "auto_examples/svm/plot_svm_kernels",
"auto_examples/ensemble/plot_adaboost_hastie_10_2": (
"auto_examples/decomposition/plot_pca_3d": (
"auto_examples/exercises/": (
"tutorial/machine_learning_map/index.html": "machine_learning_map/index.html",
html_context["redirects"] = redirects
for old_link in redirects:
html_additional_pages[old_link] = "redirects.html"
# See
html_context["is_devrelease"] = parsed_version.is_devrelease
# -- Options for LaTeX output ------------------------------------------------
latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
# 'papersize': 'letterpaper',
# The font size ('10pt', '11pt' or '12pt').
# 'pointsize': '10pt',
# Additional stuff for the LaTeX preamble.
"preamble": r"""
\usepackage{morefloats}\usepackage{enumitem} \setlistdepth{10}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, documentclass
# [howto/manual]).
latex_documents = [
"scikit-learn user guide",
"scikit-learn developers",
# The name of an image file (relative to this directory) to place at the top of
# the title page.
latex_logo = "logos/scikit-learn-logo.png"
# Documents to append as an appendix to all manuals.
# latex_appendices = []
# If false, no module index is generated.
latex_domain_indices = False
trim_doctests_flags = True
# intersphinx configuration
intersphinx_mapping = {
"python": ("{.major}".format(sys.version_info), None),
"numpy": ("", None),
"scipy": ("", None),
"matplotlib": ("", None),
"pandas": ("", None),
"joblib": ("", None),
"seaborn": ("", None),
"skops": ("", None),
v = parse(release)
if v.release is None:
raise ValueError(
"Ill-formed version: {!r}. Version should follow PEP440".format(version)
if v.is_devrelease:
binder_branch = "main"
major, minor = v.release[:2]
binder_branch = "{}.{}.X".format(major, minor)
class SubSectionTitleOrder:
"""Sort example gallery by title of subsection.
Assumes README.txt exists for all subsections and uses the subsection with
dashes, '---', as the adornment.
def __init__(self, src_dir):
self.src_dir = src_dir
self.regex = re.compile(r"^([\w ]+)\n-", re.MULTILINE)
def __repr__(self):
return "<%s>" % (self.__class__.__name__,)
def __call__(self, directory):
src_path = os.path.normpath(os.path.join(self.src_dir, directory))
# Forces Release Highlights to the top
if os.path.basename(src_path) == "release_highlights":
return "0"
readme = os.path.join(src_path, "README.txt")
with open(readme, "r") as f:
content =
except FileNotFoundError:
return directory
title_match =
if title_match is not None:
return directory
class SKExampleTitleSortKey(ExampleTitleSortKey):
"""Sorts release highlights based on version number."""
def __call__(self, filename):
title = super().__call__(filename)
prefix = "plot_release_highlights_"
# Use title to sort if not a release highlight
if not str(filename).startswith(prefix):
return title
major_minor = filename[len(prefix) :].split("_")[:2]
version_float = float(".".join(major_minor))
# negate to place the newest version highlights first
return -version_float
def notebook_modification_function(notebook_content, notebook_filename):
notebook_content_str = str(notebook_content)
warning_template = "\n".join(
"<div class='alert alert-{message_class}'>",
"# JupyterLite warning",
message_class = "warning"
message = (
"Running the scikit-learn examples in JupyterLite is experimental and you may"
" encounter some unexpected behavior.\n\nThe main difference is that imports"
" will take a lot longer than usual, for example the first `import sklearn` can"
" take roughly 10-20s.\n\nIf you notice problems, feel free to open an"
" [issue]("
" about it."
markdown = warning_template.format(message_class=message_class, message=message)
dummy_notebook_content = {"cells": []}
add_markdown_cell(dummy_notebook_content, markdown)
code_lines = []
if "seaborn" in notebook_content_str:
code_lines.append("%pip install seaborn")
if "" in notebook_content_str:
code_lines.append("%pip install plotly")
if "skimage" in notebook_content_str:
code_lines.append("%pip install scikit-image")
if "polars" in notebook_content_str:
code_lines.append("%pip install polars")
if "fetch_" in notebook_content_str:
"%pip install pyodide-http",
"import pyodide_http",
# always import matplotlib and pandas to avoid Pyodide limitation with
# imports inside functions
code_lines.extend(["import matplotlib", "import pandas"])
if code_lines:
code_lines = ["# JupyterLite-specific code"] + code_lines
code = "\n".join(code_lines)
add_code_cell(dummy_notebook_content, code)
notebook_content["cells"] = (
dummy_notebook_content["cells"] + notebook_content["cells"]
default_global_config = sklearn.get_config()
def reset_sklearn_config(gallery_conf, fname):
"""Reset sklearn config to default values."""
sg_examples_dir = "../examples"
sg_gallery_dir = "auto_examples"
sphinx_gallery_conf = {
"doc_module": "sklearn",
"backreferences_dir": os.path.join("modules", "generated"),
"show_memory": False,
"reference_url": {"sklearn": None},
"examples_dirs": [sg_examples_dir],
"gallery_dirs": [sg_gallery_dir],
"subsection_order": SubSectionTitleOrder(sg_examples_dir),
"within_subsection_order": SKExampleTitleSortKey,
"binder": {
"org": "scikit-learn",
"repo": "scikit-learn",
"binderhub_url": "",
"branch": binder_branch,
"dependencies": "./binder/requirements.txt",
"use_jupyter_lab": True,
# avoid generating too many cross links
"inspect_global_variables": False,
"remove_config_comments": True,
"plot_gallery": "True",
"recommender": {"enable": True, "n_examples": 4, "min_df": 12},
"reset_modules": ("matplotlib", "seaborn", reset_sklearn_config),
if with_jupyterlite:
sphinx_gallery_conf["jupyterlite"] = {
"notebook_modification_function": notebook_modification_function
# Secondary sidebar configuration for pages generated by sphinx-gallery
# For the index page of the gallery and each nested section, we hide the secondary
# sidebar by specifying an empty list (no components), because there is no meaningful
# in-page toc for these pages, and they are generated so "sourcelink" is not useful
# either.
# For each example page we keep default ["page-toc", "sourcelink"] specified by the
# "**" key. "page-toc" is wanted for these pages. "sourcelink" is also necessary since
# otherwise the secondary sidebar will degenerate when "page-toc" is empty, and the
# script `sphinxext/` will fail (it assumes the existence of the
# secondary sidebar). The script will remove "sourcelink" in the end.
html_theme_options["secondary_sidebar_items"][f"{sg_gallery_dir}/index"] = []
for sub_sg_dir in (Path(".") / sg_examples_dir).iterdir():
if sub_sg_dir.is_dir():
] = []
# The following dictionary contains the information used to create the
# thumbnails for the front page of the scikit-learn home page.
# key: first image in set
# value: maximum width of the thumbnail, in pixels
carousel_thumbs = {"sphx_glr_plot_classifier_comparison_001.png": 600}
# enable experimental module so that experimental estimators can be
# discovered properly by sphinx
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.experimental import enable_halving_search_cv # noqa
def make_carousel_thumbs(app, exception):
"""produces the final resized carousel images"""
if exception is not None:
print("Preparing carousel images")
image_dir = os.path.join(app.builder.outdir, "_images")
for glr_plot, max_width in carousel_thumbs.items():
image = os.path.join(image_dir, glr_plot)
if os.path.exists(image):
c_thumb = os.path.join(image_dir, glr_plot[:-4] + "_carousel.png")
sphinx_gallery.gen_rst.scale_image(image, c_thumb, max_width, 190)
def filter_search_index(app, exception):
if exception is not None:
# searchindex only exists when generating html
if != "html":
print("Removing methods from search index")
searchindex_path = os.path.join(app.builder.outdir, "searchindex.js")
with open(searchindex_path, "r") as f:
searchindex_text =
searchindex_text = re.sub(r"{__init__.+?}", "{}", searchindex_text)
searchindex_text = re.sub(r"{__call__.+?}", "{}", searchindex_text)
with open(searchindex_path, "w") as f:
# Config for sphinx_issues
# we use the issues path for PRs since the issues URL will forward
issues_github_path = "scikit-learn/scikit-learn"
def disable_plot_gallery_for_linkcheck(app):
if == "linkcheck":
sphinx_gallery_conf["plot_gallery"] = "False"
def setup(app):
# do not run the examples when using linkcheck by using a small priority
# (default priority is 500 and sphinx-gallery uses the builder-inited event too)
app.connect("builder-inited", disable_plot_gallery_for_linkcheck, priority=50)
# triggered just before the HTML for an individual page is created
app.connect("html-page-context", add_js_css_files)
# to hide/show the prompt in code examples
app.connect("build-finished", make_carousel_thumbs)
app.connect("build-finished", filter_search_index)
# The following is used by sphinx.ext.linkcode to provide links to github
linkcode_resolve = make_linkcode_resolve(
"Matplotlib is currently using agg, which is a"
" non-GUI backend, so cannot show the figure."
if os.environ.get("SKLEARN_WARNINGS_AS_ERRORS", "0") != "0":
# maps functions with a class name that is indistinguishable when case is
# ignored to another filename
autosummary_filename_map = {
"sklearn.cluster.dbscan": "dbscan-function",
"sklearn.covariance.oas": "oas-function",
"sklearn.decomposition.fastica": "fastica-function",
# Config for sphinxext.opengraph
ogp_site_url = "https://scikit-learn/stable/"
ogp_image = ""
ogp_use_first_image = True
ogp_site_name = "scikit-learn"
# Config for linkcheck that checks the documentation for broken links
# ignore all links in 'whats_new' to avoid doing many github requests and
# hitting the github rate threshold that makes linkcheck take a lot of time
linkcheck_exclude_documents = [r"whats_new/.*"]
# default timeout to make some site links fail faster
linkcheck_timeout = 10
# Allow redirects from
linkcheck_allowed_redirects = {r"": r".*"}
linkcheck_ignore = [
# ignore links to local html files e.g. in image directive :target: field
# ignore links to specific pdf pages because linkcheck does not handle them
# ('utf-8' codec can't decode byte error)
# links falsely flagged as broken
# Broken links from testimonials
# Ignore some dynamically created anchors. See
# for more details about
# the github example
# Config for sphinx-remove-toctrees
remove_from_toctrees = ["metadata_routing.rst"]
# Use a browser-like user agent to avoid some "403 Client Error: Forbidden for
# url" errors. This is taken from the variable navigator.userAgent inside a
# browser console.
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
# Use GitHub token from environment variable to avoid GitHub rate limits when
# checking GitHub links
github_token = os.getenv("GITHUB_TOKEN")
if github_token is None:
linkcheck_request_headers = {}
linkcheck_request_headers = {
"": {"Authorization": f"token {github_token}"},
# -- Convert .rst.template files to .rst ---------------------------------------
from sklearn._min_dependencies import dependent_packages
# If development build, link to local page in the top navbar; otherwise link to the
# development version; see
if parsed_version.is_devrelease:
development_link = "developers/index"
development_link = ""
# Define the templates and target files for conversion
# Each entry is in the format (template name, file name, kwargs for rendering)
rst_templates = [
("index", "index", {"development_link": development_link}),
{"dependent_packages": dependent_packages},
{"dependent_packages": dependent_packages},
"API_REFERENCE": sorted(API_REFERENCE.items(), key=lambda x: x[0]),
DEPRECATED_API_REFERENCE.items(), key=lambda x: x[0], reverse=True
# Convert each module API reference page
for module in API_REFERENCE:
{"module": module, "module_info": API_REFERENCE[module]},
# Convert the deprecated API reference page (if there exists any)
DEPRECATED_API_REFERENCE.items(), key=lambda x: x[0], reverse=True
for rst_template_name, rst_target_name, kwargs in rst_templates:
# Read the corresponding template file into jinja2
with (Path(".") / f"{rst_template_name}.rst.template").open(
"r", encoding="utf-8"
) as f:
t = jinja2.Template(
# Render the template and write to the target
with (Path(".") / f"{rst_target_name}.rst").open("w", encoding="utf-8") as f:
import os
import warnings
from os import environ
from os.path import exists, join
import pytest
from _pytest.doctest import DoctestItem
from sklearn.datasets import get_data_home
from sklearn.datasets._base import _pkl_filepath
from sklearn.datasets._twenty_newsgroups import CACHE_NAME
from sklearn.utils._testing import SkipTest, check_skip_network
from sklearn.utils.fixes import _IS_PYPY, np_base_version, parse_version
def setup_labeled_faces():
data_home = get_data_home()
if not exists(join(data_home, "lfw_home")):
raise SkipTest("Skipping dataset loading doctests")
def setup_rcv1():
# skip the test in rcv1.rst if the dataset is not already loaded
rcv1_dir = join(get_data_home(), "RCV1")
if not exists(rcv1_dir):
raise SkipTest("Download RCV1 dataset to run this test.")
def setup_twenty_newsgroups():
cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
if not exists(cache_path):
raise SkipTest("Skipping dataset loading doctests")
def setup_working_with_text_data():
if _IS_PYPY and os.environ.get("CI", None):
raise SkipTest("Skipping too slow test with PyPy on CI")
cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
if not exists(cache_path):
raise SkipTest("Skipping dataset loading doctests")
def setup_loading_other_datasets():
import pandas # noqa
except ImportError:
raise SkipTest("Skipping loading_other_datasets.rst, pandas not installed")
# checks SKLEARN_SKIP_NETWORK_TESTS to see if test should run
run_network_tests = environ.get("SKLEARN_SKIP_NETWORK_TESTS", "1") == "0"
if not run_network_tests:
raise SkipTest(
"Skipping loading_other_datasets.rst, tests can be "
"enabled by setting SKLEARN_SKIP_NETWORK_TESTS=0"
def setup_compose():
import pandas # noqa
except ImportError:
raise SkipTest("Skipping compose.rst, pandas not installed")
def setup_impute():
import pandas # noqa
except ImportError:
raise SkipTest("Skipping impute.rst, pandas not installed")
def setup_grid_search():
import pandas # noqa
except ImportError:
raise SkipTest("Skipping grid_search.rst, pandas not installed")
def setup_preprocessing():
import pandas # noqa
if parse_version(pandas.__version__) < parse_version("1.1.0"):
raise SkipTest("Skipping preprocessing.rst, pandas version < 1.1.0")
except ImportError:
raise SkipTest("Skipping preprocessing.rst, pandas not installed")
def setup_unsupervised_learning():
import skimage # noqa
except ImportError:
raise SkipTest("Skipping unsupervised_learning.rst, scikit-image not installed")
# ignore deprecation warnings from scipy.misc.face
"ignore", "The binary mode of fromstring", DeprecationWarning
def skip_if_matplotlib_not_installed(fname):
import matplotlib # noqa
except ImportError:
basename = os.path.basename(fname)
raise SkipTest(f"Skipping doctests for {basename}, matplotlib not installed")
def skip_if_cupy_not_installed(fname):
import cupy # noqa
except ImportError:
basename = os.path.basename(fname)
raise SkipTest(f"Skipping doctests for {basename}, cupy not installed")
def pytest_runtest_setup(item):
fname = item.fspath.strpath
# normalize filename to use forward slashes on Windows for easier handling
# later
fname = fname.replace(os.sep, "/")
is_index = fname.endswith("datasets/index.rst")
if fname.endswith("datasets/labeled_faces.rst") or is_index:
elif fname.endswith("datasets/rcv1.rst") or is_index:
elif fname.endswith("datasets/twenty_newsgroups.rst") or is_index:
elif fname.endswith("modules/compose.rst") or is_index:
elif fname.endswith("datasets/loading_other_datasets.rst"):
elif fname.endswith("modules/impute.rst"):
elif fname.endswith("modules/grid_search.rst"):
elif fname.endswith("modules/preprocessing.rst"):
elif fname.endswith("statistical_inference/unsupervised_learning.rst"):
rst_files_requiring_matplotlib = [
for each in rst_files_requiring_matplotlib:
if fname.endswith(each):
if fname.endswith("array_api.rst"):
def pytest_configure(config):
# Use matplotlib agg backend during the tests including doctests
import matplotlib
except ImportError:
def pytest_collection_modifyitems(config, items):
"""Called after collect is completed.
config : pytest config
items : list of collected items
skip_doctests = False
if np_base_version >= parse_version("2"):
# Skip doctests when using numpy 2 for now. See the following discussion
# to decide what to do in the longer term:
reason = "Due to NEP 51 numpy scalar repr has changed in numpy 2"
skip_doctests = True
# Normally doctest has the entire module's scope. Here we set globs to an empty dict
# to remove the module's scope:
for item in items:
if isinstance(item, DoctestItem):
item.dtest.globs = {}
if skip_doctests:
skip_marker = pytest.mark.skip(reason=reason)
for item in items:
if isinstance(item, DoctestItem):
.. raw :: html
<!-- Generated by -->
<div class="sk-authors-container">
img.avatar {border-radius: 10px;}
<a href=''><img src='' class='avatar' /></a> <br />
<p>Juan Carlos Alfaro Jiménez</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Lucy Liu</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Maxwell Liu</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Juan Martin Loyola</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Sylvain Marié</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Norbert Preining</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Reshama Shaikh</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Albert Thomas</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Maren Westermann</p>
- Chiara Marmo
@ -0,0 +1,35 @@
.. _data-transforms:
Dataset transformations
scikit-learn provides a library of transformers, which may clean (see
:ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see
:ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`)
feature representations.
Like other estimators, these are represented by classes with a ``fit`` method,
which learns model parameters (e.g. mean and standard deviation for
normalization) from a training set, and a ``transform`` method which applies
this transformation model to unseen data. ``fit_transform`` may be more
convenient and efficient for modelling and transforming the training data
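As a rough illustration of this contract, the sketch below implements a toy standardizing transformer in plain Python. ``ToyStandardizer`` is a hypothetical class written for this example, not a scikit-learn estimator; it only mirrors the ``fit`` / ``transform`` / ``fit_transform`` method names described above and handles a single feature.

```python
from statistics import fmean, pstdev


class ToyStandardizer:
    """Hypothetical single-feature transformer; not part of scikit-learn."""

    def fit(self, values):
        # Learn the parameters (mean and standard deviation) from training data.
        self.mean_ = fmean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        # Apply the learned parameters to any data, including unseen data.
        return [(v - self.mean_) / self.scale_ for v in values]

    def fit_transform(self, values):
        # Convenience: learn from and transform the same data in one call.
        return self.fit(values).transform(values)


train = [1.0, 2.0, 3.0, 4.0, 5.0]
unseen = [3.0, 6.0]

scaler = ToyStandardizer()
train_scaled = scaler.fit_transform(train)
unseen_scaled = scaler.transform(unseen)  # reuses the training mean and scale
```

The point of the split is that ``unseen`` is scaled with statistics learned from ``train`` only, so no information from new data leaks into the learned parameters.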
Combining such transformers, either in parallel or in series, is covered in
:ref:`combining_estimators`. :ref:`metrics` covers transforming feature
spaces into affinity matrices, while :ref:`preprocessing_targets` considers
transformations of the target space (e.g. categorical labels) for use in
.. toctree::
:maxdepth: 2
.. _datasets:
Dataset loading utilities
.. currentmodule:: sklearn.datasets
The ``sklearn.datasets`` package embeds some small toy datasets
as introduced in the :ref:`Getting Started <loading_example_dataset>` section.
This package also features helpers to fetch larger datasets commonly
used by the machine learning community to benchmark algorithms on data
that comes from the 'real world'.
To evaluate the impact of the scale of the dataset (``n_samples`` and
``n_features``) while controlling the statistical properties of the data
(typically the correlation and informativeness of the features), it is
also possible to generate synthetic data.
**General dataset API.** There are three main kinds of dataset interfaces that
can be used to get datasets depending on the desired type of dataset.
**The dataset loaders.** They can be used to load small standard datasets,
described in the :ref:`toy_datasets` section.
**The dataset fetchers.** They can be used to download and load larger datasets,
described in the :ref:`real_world_datasets` section.
Both loader and fetcher functions return a :class:`~sklearn.utils.Bunch`
object holding at least two items:
an array of shape ``n_samples`` * ``n_features`` with
key ``data`` (except for 20newsgroups) and a numpy array of
length ``n_samples``, containing the target values, with key ``target``.
The Bunch object is a dictionary that exposes its keys as attributes.
For more information about the Bunch object, see :class:`~sklearn.utils.Bunch`.
It's also possible for almost all of these functions to constrain the output
to be a tuple containing only the data and the target, by setting the
``return_X_y`` parameter to ``True``.
The datasets also contain a full description in their ``DESCR`` attribute and
some contain ``feature_names`` and ``target_names``. See the dataset
descriptions below for details.
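The interface described above can be sketched in plain Python. Both ``ToyBunch`` and ``load_toy`` are hypothetical names invented for this illustration; they imitate the shape of the return values (a dictionary with attribute access, or an ``(X, y)`` tuple) and are not the scikit-learn implementations.

```python
class ToyBunch(dict):
    """Hypothetical dict subclass exposing its keys as attributes."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


def load_toy(return_X_y=False):
    """Hypothetical loader mirroring the interface described above."""
    data = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # n_samples x n_features
    target = [0, 1, 1]  # one target value per sample
    if return_X_y:
        # Only the data and the target, as a plain tuple.
        return data, target
    return ToyBunch(
        data=data,
        target=target,
        feature_names=["f0", "f1"],
        DESCR="Toy dataset with 3 samples and 2 features.",
    )


bunch = load_toy()
X, y = load_toy(return_X_y=True)
same_access = bunch["data"] is bunch.data  # key and attribute access agree
```

The tuple form is convenient when only the arrays are needed, while the dictionary form keeps the description and names alongside the data.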
**The dataset generation functions.** They can be used to generate controlled
synthetic datasets, described in the :ref:`sample_generators` section.
These functions return a tuple ``(X, y)`` consisting of an ``n_samples`` *
``n_features`` numpy array ``X`` and an array of length ``n_samples``
containing the targets ``y``.
In addition, there are also miscellaneous tools to load datasets of other
formats or from other locations, described in the :ref:`loading_other_datasets`
.. toctree::
:maxdepth: 2
.. _loading_other_datasets:
Loading other datasets
.. currentmodule:: sklearn.datasets
.. _sample_images:
Sample images
Scikit-learn also embeds a couple of sample JPEG images published under a Creative
Commons license by their authors. Those images can be useful to test algorithms
and pipelines on 2D data.
.. autosummary::
.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
:target: ../auto_examples/cluster/plot_color_quantization.html
:scale: 30
:align: right
.. warning::
The default coding of images is based on the ``uint8`` dtype to
spare memory. Often machine learning algorithms work best if the
input is converted to a floating point representation first. Also,
if you plan to use ``matplotlib.pyplot.imshow``, don't forget to scale to the range
0 - 1 as done in the following example.
.. rubric:: Examples
* :ref:``
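The rescaling mentioned in the warning can be sketched without any dependency. The nested list below is a made-up 2x2 RGB image standing in for what an image loader would return; with a NumPy array the same conversion is usually written as a cast to float followed by a division by 255.

```python
# A tiny 2x2 RGB "image" stored as 8-bit integers in the range 0..255,
# standing in for the array that an image loader would return.
image_uint8 = [
    [(0, 128, 255), (255, 255, 255)],
    [(64, 64, 64), (0, 0, 0)],
]

# Convert every channel to a float in the range 0.0..1.0.
image_float = [
    [tuple(channel / 255 for channel in pixel) for pixel in row]
    for row in image_uint8
]
```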
.. _libsvm_loader:
Datasets in svmlight / libsvm format
scikit-learn includes utility functions for loading
datasets in the svmlight / libsvm format. In this format, each line
takes the form ``<label> <feature-id>:<feature-value>
<feature-id>:<feature-value> ...``. This format is especially suitable for sparse datasets.
In this module, scipy sparse CSR matrices are used for ``X`` and numpy arrays are used for ``y``.
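To make the line format concrete, here is a minimal sketch of a parser for a single line. ``parse_svmlight_line`` is a hypothetical helper written for this illustration, not the loader shipped with scikit-learn, and it assumes the simplest form of the format (no comments or other extensions).

```python
def parse_svmlight_line(line):
    """Parse one '<label> <id>:<value> ...' line into (label, {id: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return float(label), features


label, features = parse_svmlight_line("1 3:0.5 10:2.0 42:-1.25")
```

Only the listed feature ids are stored, which is why the format suits sparse data: every feature that does not appear on a line is implicitly zero.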
You may load a dataset as follows::
>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
... # doctest: +SKIP
You may also load two (or more) datasets at once::
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
... ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
... # doctest: +SKIP
In this case, ``X_train`` and ``X_test`` are guaranteed to have the same number
of features. Another way to achieve the same result is to fix the number of
features::
>>> X_test, y_test = load_svmlight_file(
... "/path/to/test_dataset.txt", n_features=X_train.shape[1])
... # doctest: +SKIP
.. rubric:: Related links
- `Public datasets in svmlight / libsvm format`:
- `Faster API-compatible implementation`:
For doctests:
>>> import numpy as np
>>> import os
.. _openml:
Downloading datasets from the repository
` <>`_ is a public repository for machine learning
data and experiments that allows everybody to upload open datasets.
The ``sklearn.datasets`` package is able to download datasets
from the repository using the function
:func:`sklearn.datasets.fetch_openml`.
For example, to download a dataset of gene expressions in mice brains::
>>> from sklearn.datasets import fetch_openml
>>> mice = fetch_openml(name='miceprotein', version=4)
To fully specify a dataset, you need to provide a name and a version, though
the version is optional, see :ref:`openml_versions` below.
The dataset contains a total of 1080 examples belonging to 8 different
(1080, 77)
>>> np.unique(
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)
You can get more information on the dataset by looking at the ``DESCR``
and ``details`` attributes::
>>> print(mice.DESCR) # doctest: +SKIP
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
**Source**: [UCI]( - 2015
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
Syndrome. PLoS ONE 10(6): e0129126...
>>> mice.details # doctest: +SKIP
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
'url': '',
'file_id': '17928620', 'default_target_attribute': 'class',
'row_id_attribute': 'MouseID',
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
'visibility': 'public', 'status': 'active',
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}
The ``DESCR`` contains a free-text description of the data, while ``details``
contains a dictionary of meta-data stored by openml, like the dataset id.
For more details, see the `OpenML documentation
<>`_ The ``data_id`` of the mice protein dataset
is 40966, and you can use this (or the name) to get more information on the
dataset on the openml website::
>>> mice.url
The ``data_id`` also uniquely identifies a dataset from OpenML::
>>> mice = fetch_openml(data_id=40966)
>>> mice.details # doctest: +SKIP
{'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
'creator': ...,
'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
'', 'file_id':
'1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6):
e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
.. _openml_versions:
Dataset Versions
A dataset is uniquely specified by its ``data_id``, but not necessarily by its
name. Several different "versions" of a dataset with the same name can exist
which can contain entirely different datasets.
If a particular version of a dataset has been found to contain significant
issues, it might be deactivated. Using a name to specify a dataset will yield
the earliest version of a dataset that is still active. That means that
``fetch_openml(name="miceprotein")`` can yield different results
at different times if earlier versions become inactive.
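The lookup rule described above can be sketched in plain Python. This is a toy model under stated assumptions, not scikit-learn's or OpenML's implementation: the ``CATALOG`` table, its ids and statuses, and the ``resolve_by_name`` helper are all hypothetical.

```python
# Toy model of name-based dataset lookup; not the scikit-learn implementation.
# Each record is (data_id, name, version, status); all values are made up.
CATALOG = [
    (101, "toy", 1, "deactivated"),
    (102, "toy", 2, "active"),
    (103, "toy", 3, "active"),
]


def resolve_by_name(catalog, name, version=None):
    """Return the data_id that a name (and optional version) would select."""
    candidates = [rec for rec in catalog if rec[1] == name]
    if version is not None:
        # A (name, version) pair identifies exactly one dataset.
        candidates = [rec for rec in candidates if rec[2] == version]
    else:
        # Without a version, only active datasets are considered...
        candidates = [rec for rec in candidates if rec[3] == "active"]
    if not candidates:
        raise LookupError(f"no matching dataset for {name!r}")
    # ...and the earliest remaining version wins.
    return min(candidates, key=lambda rec: rec[2])[0]


print(resolve_by_name(CATALOG, "toy"))             # 102: version 1 is inactive
print(resolve_by_name(CATALOG, "toy", version=3))  # 103
```

If version 1 were still active, the same name-only call would return 101, which is why a bare name is not a stable identifier over time.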
You can see that the dataset with ``data_id`` 40966 that we fetched above is
the first version of the "miceprotein" dataset::
>>> mice.details['version'] #doctest: +SKIP
In fact, this dataset only has one version. The iris dataset on the other hand
has multiple versions::
>>> iris = fetch_openml(name="iris")
>>> iris.details['version'] #doctest: +SKIP
>>> iris.details['id'] #doctest: +SKIP
>>> iris_61 = fetch_openml(data_id=61)
>>> iris_61.details['version']
>>> iris_61.details['id']
>>> iris_969 = fetch_openml(data_id=969)
>>> iris_969.details['version']
>>> iris_969.details['id']
Specifying the dataset by the name "iris" yields the lowest version, version 1,
with the ``data_id`` 61. To make sure you always get this exact dataset, it is
safest to specify it by the dataset ``data_id``. The other dataset, with
``data_id`` 969, is version 3 (version 2 has become inactive), and contains a
binarized version of the data::
>>> np.unique(
array(['N', 'P'], dtype=object)
You can also specify both the name and the version, which also uniquely
identifies the dataset::
>>> iris_version_3 = fetch_openml(name="iris", version=3)
>>> iris_version_3.details['version']
>>> iris_version_3.details['id']
.. rubric:: References
* :arxiv:`Vanschoren, van Rijn, Bischl and Torgo. "OpenML: networked science in
machine learning" ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.
.. _openml_parser:
ARFF parser
From version 1.2, scikit-learn provides a new keyword argument `parser` that
provides several options to parse the ARFF files provided by OpenML. The legacy
parser (i.e. `parser="liac-arff"`) is based on the project
`LIAC-ARFF <>`_. This parser is, however,
slow and consumes more memory than required. A new parser based on pandas
(i.e. `parser="pandas"`) is both faster and more memory efficient.
However, this parser does not support sparse data.
Therefore, we recommend using `parser="auto"` which will use the best parser
available for the requested dataset.
The `"pandas"` and `"liac-arff"` parsers can lead to different data types in
the output. The notable differences are the following:
- The `"liac-arff"` parser always encodes categorical features as `str`
objects. In contrast, the `"pandas"` parser infers the type while
reading, and numerical categories are cast to integers whenever possible.
- The `"liac-arff"` parser uses float64 to encode numerical features tagged as
'REAL' and 'NUMERICAL' in the metadata. The `"pandas"` parser instead infers
whether these numerical features correspond to integers and, if so, uses pandas'
Integer extension dtype.
- In particular, classification datasets with integer categories are typically
loaded as such `(0, 1, ...)` with the `"pandas"` parser while `"liac-arff"`
will force the use of string encoded class labels such as `"0"`, `"1"` and so
on.
- The `"pandas"` parser will not strip single quotes - i.e. `'` - from string
columns. For instance, a string `'my string'` will be kept as is while the
`"liac-arff"` parser will strip the single quotes. For categorical columns,
the single quotes are stripped from the values.
In addition, when `as_frame=False` is used, the `"liac-arff"` parser returns
ordinally encoded data where the categories are provided in the attribute
`categories` of the `Bunch` instance. Instead, `"pandas"` returns a NumPy array
where the categories are not encoded. It is then up to the user to design a feature
engineering pipeline with an instance of `OneHotEncoder` or
`OrdinalEncoder` typically wrapped in a `ColumnTransformer` to
preprocess the categorical columns explicitly. See for instance: :ref:``.
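To make the ``as_frame=False`` difference concrete, the plain-Python sketch below shows what "ordinally encoded data plus a list of categories" means. The ``ordinal_encode`` helper and the sample values are hypothetical, and the sorted category order is an arbitrary choice made for the illustration; this is not how either parser is implemented.

```python
def ordinal_encode(column):
    """Map each distinct string to an integer code and keep the category list."""
    categories = sorted(set(column))
    code_of = {cat: code for code, cat in enumerate(categories)}
    return [code_of[value] for value in column], categories


raw = ["red", "green", "red", "blue"]
codes, categories = ordinal_encode(raw)
print(codes)       # [2, 1, 2, 0]
print(categories)  # ['blue', 'green', 'red']
```

In this picture, the ``"liac-arff"`` result corresponds to ``codes`` together with ``categories``, while the ``"pandas"`` result corresponds to ``raw``, leaving the choice of encoding to the user.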
.. _external_datasets:
Loading from external datasets
scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
matrices. Other types that are convertible to numeric arrays such as pandas
DataFrame are also acceptable.
Here are some recommended ways to load standard columnar data into a
format usable by scikit-learn:
* ` <>`_
provides tools to read data from common formats including CSV, Excel, JSON
and SQL. DataFrames may also be constructed from lists of tuples or dicts.
Pandas handles heterogeneous data smoothly and provides tools for
manipulation and conversion into a numeric array suitable for scikit-learn.
* ` <>`_
specializes in binary formats often used in scientific computing
context such as .mat and .arff
* `numpy/ <>`_
for standard loading of columnar data into numpy arrays
* scikit-learn's :func:`load_svmlight_file` for the svmlight or libSVM
sparse format
* scikit-learn's :func:`load_files` for directories of text files where
the name of each directory is the name of each category and each file inside
of each directory corresponds to one sample from that category
For some miscellaneous data such as images, videos, and audio, you may wish to
refer to:
* ` <>`_ or
`Imageio <>`_
for loading images and videos into numpy arrays
* `
for reading WAV files into a numpy array
Categorical (or nominal) features stored as strings (common in pandas DataFrames)
will need converting to numerical features using :class:`~sklearn.preprocessing.OneHotEncoder`
or :class:`~sklearn.preprocessing.OrdinalEncoder` or similar.
See :ref:`preprocessing`.
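As a minimal, standard-library-only sketch of what "converting to numerical features" means, the example below reads a small CSV and one-hot encodes its string column by hand. The column names and values are made up; in practice you would more likely read the file with pandas and use :class:`~sklearn.preprocessing.OneHotEncoder`.

```python
import csv
import io

# Hypothetical data: one numeric column and one categorical column.
CSV_TEXT = """height,color
1.5,red
2.0,blue
1.8,red
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
colors = sorted({row["color"] for row in rows})  # ['blue', 'red']

# One numeric row per sample: the numeric column followed by one 0/1
# indicator per category.
X = [
    [float(row["height"])] + [1.0 if row["color"] == c else 0.0 for c in colors]
    for row in rows
]
print(X)  # [[1.5, 0.0, 1.0], [2.0, 1.0, 0.0], [1.8, 0.0, 1.0]]
```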
Note: if you manage your own numerical data it is recommended to use an
optimized file format such as HDF5 to reduce data load times. Various libraries
such as H5Py, PyTables and pandas provide a Python interface for reading and
writing data in that format.
.. _real_world_datasets:
Real world datasets
.. currentmodule:: sklearn.datasets
scikit-learn provides tools to load larger datasets, downloading them if
necessary.
They can be loaded using the following functions:
.. autosummary::
.. include:: ../../sklearn/datasets/descr/olivetti_faces.rst
.. include:: ../../sklearn/datasets/descr/twenty_newsgroups.rst
.. include:: ../../sklearn/datasets/descr/lfw.rst
.. include:: ../../sklearn/datasets/descr/covtype.rst
.. include:: ../../sklearn/datasets/descr/rcv1.rst
.. include:: ../../sklearn/datasets/descr/kddcup99.rst
.. include:: ../../sklearn/datasets/descr/california_housing.rst
.. include:: ../../sklearn/datasets/descr/species_distributions.rst
.. _sample_generators:
Generated datasets
.. currentmodule:: sklearn.datasets
In addition, scikit-learn includes various random sample generators that
can be used to build artificial datasets of controlled size and complexity.
Generators for classification and clustering
These generators produce a matrix of features and corresponding discrete
targets.
Single label
Both :func:`make_blobs` and :func:`make_classification` create multiclass
datasets by allocating each class one or more normally-distributed clusters of
points. :func:`make_blobs` provides greater control regarding the centers and
standard deviations of each cluster, and is used to demonstrate clustering.
:func:`make_classification` specializes in introducing noise by way of:
correlated, redundant and uninformative features; multiple Gaussian clusters
per class; and linear transformations of the feature space.
:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
near-equal-size classes separated by concentric hyperspheres.
:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.
.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_dataset_001.png
:target: ../auto_examples/datasets/plot_random_dataset.html
:scale: 50
:align: center
:func:`make_circles` and :func:`make_moons` generate 2d binary classification
datasets that are challenging to certain algorithms (e.g. centroid-based
clustering or linear classification), including optional Gaussian noise.
They are useful for visualization. :func:`make_circles` produces Gaussian data
with a spherical decision boundary for binary classification, while
:func:`make_moons` produces two interleaving half circles.
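The "two interleaving half circles" geometry can be sketched without any dependency. The ``two_half_circles`` helper below is hypothetical and noise-free; it illustrates the shape only and is not the code of :func:`make_moons`.

```python
import math


def two_half_circles(n_per_class):
    """Noise-free points on two interleaving half circles, with 0/1 labels.

    Requires n_per_class >= 2.
    """
    points, labels = [], []
    for i in range(n_per_class):
        t = math.pi * i / (n_per_class - 1)  # angle from 0 to pi
        # Upper half circle centred at the origin (class 0).
        points.append((math.cos(t), math.sin(t)))
        labels.append(0)
        # Lower half circle, shifted right and up by 0.5 so the arcs interleave (class 1).
        points.append((1.0 - math.cos(t), 0.5 - math.sin(t)))
        labels.append(1)
    return points, labels


points, labels = two_half_circles(5)
```

No straight line separates the two classes, which is why such data is a useful stress test for linear classifiers.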
:func:`make_multilabel_classification` generates random samples with multiple
labels, reflecting a bag of words drawn from a mixture of topics. The number of
topics for each document is drawn from a Poisson distribution, and the topics
themselves are drawn from a fixed random distribution. Similarly, the number of
words is drawn from Poisson, with words drawn from a multinomial, where each
topic defines a probability distribution over words. Simplifications with
respect to true bag-of-words mixtures include:
* Per-topic word distributions are independently drawn, where in reality all
would be affected by a sparse base distribution, and would be correlated.
* For a document generated from multiple topics, all topics are weighted
equally in generating its bag of words.
* Documents without labels draw words at random, rather than from a base
distribution.
.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_multilabel_dataset_001.png
:target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
:scale: 50
:align: center
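The generative process described for :func:`make_multilabel_classification` can be sketched with the standard library. Everything here is an illustration under assumptions: the helper names, the hand-written probability tables and the way duplicate topics are rejected are hypothetical choices, not the library's implementation.

```python
import math
import random


def poisson(lam, rng):
    """Draw from a Poisson distribution (Knuth's multiplication method)."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def make_document(topic_probs, word_probs, mean_topics, mean_words, rng):
    """Return (topics, word_counts) for one synthetic document."""
    n_topics, vocab_size = len(topic_probs), len(word_probs[0])
    # Number of topics ~ Poisson, capped at the number of available topics.
    k = min(poisson(mean_topics, rng), n_topics)
    topics = set()
    while len(topics) < k:
        topics.add(rng.choices(range(n_topics), weights=topic_probs)[0])
    counts = [0] * vocab_size
    # Number of words ~ Poisson.
    for _ in range(poisson(mean_words, rng)):
        if topics:
            # All topics of the document are weighted equally.
            topic = rng.choice(sorted(topics))
            word = rng.choices(range(vocab_size), weights=word_probs[topic])[0]
        else:
            # Documents without labels draw words uniformly at random.
            word = rng.randrange(vocab_size)
        counts[word] += 1
    return sorted(topics), counts


rng = random.Random(0)
topic_probs = [0.5, 0.3, 0.2]
word_probs = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
topics, counts = make_document(topic_probs, word_probs, 2, 20, rng)
print(topics, counts)
```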
.. autosummary::
Generators for regression
:func:`make_regression` produces regression targets as an optionally-sparse
random linear combination of random features, with noise. Its informative
features may be uncorrelated, or low rank (few features account for most of the
variance).
Other regression generators generate functions deterministically from
randomized features. :func:`make_sparse_uncorrelated` produces a target as a
linear combination of four features with fixed coefficients.
Others encode explicitly non-linear relations:
:func:`make_friedman1` is related by polynomial and sine transforms;
:func:`make_friedman2` includes feature multiplication and reciprocation; and
:func:`make_friedman3` is similar with an arctan transformation on the target.
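As an illustration of the "polynomial and sine transforms", the sketch below evaluates the Friedman #1 target for a single sample using only the standard library. The ``friedman1_target`` helper is hypothetical; it follows the formula documented for :func:`make_friedman1` but is not the library's code.

```python
import math
import random


def friedman1_target(x, noise=0.0, rng=random):
    """Friedman #1 target for one sample x with at least five features in [0, 1]."""
    return (
        10 * math.sin(math.pi * x[0] * x[1])  # sine transform of a product
        + 20 * (x[2] - 0.5) ** 2              # polynomial term
        + 10 * x[3]
        + 5 * x[4]
        + noise * rng.gauss(0.0, 1.0)         # optional Gaussian noise
    )


rng = random.Random(0)
# Ten features per sample; only the first five influence the target.
X = [[rng.random() for _ in range(10)] for _ in range(3)]
y = [friedman1_target(x) for x in X]
```

Only the first five features enter the formula, so any additional features are pure distractors for the regressor.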
Generators for manifold learning
.. autosummary::
Generators for decomposition
.. autosummary::
.. _toy_datasets:
Toy datasets
.. currentmodule:: sklearn.datasets
scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.
They can be loaded using the following functions:
.. autosummary::
These datasets are useful to quickly illustrate the behavior of the
various algorithms implemented in scikit-learn. They are however often too
small to be representative of real world machine learning tasks.
.. include:: ../../sklearn/datasets/descr/iris.rst
.. include:: ../../sklearn/datasets/descr/diabetes.rst
.. include:: ../../sklearn/datasets/descr/digits.rst
.. include:: ../../sklearn/datasets/descr/linnerud.rst
.. include:: ../../sklearn/datasets/descr/wine_data.rst
.. include:: ../../sklearn/datasets/descr/breast_cancer.rst
.. _advanced-installation:
.. include:: ../min_dependency_substitutions.rst
Installing the development version of scikit-learn
This section introduces how to install the **main branch** of scikit-learn.
This can be done by either installing a nightly build or building from source.
.. _install_nightly_builds:
Installing nightly builds
The continuous integration servers of the scikit-learn project build, test
and upload wheel packages for the most recent Python version on a nightly
basis.
Installing a nightly build is the quickest way to:
- try a new feature that will be shipped in the next release (that is, a
feature from a pull-request that was recently merged to the main branch);
- check whether a bug you encountered has been fixed since the last release.
You can install the nightly build of scikit-learn using the `scientific-python-nightly-wheels`
index from the PyPI registry of ``:
.. prompt:: bash $
pip install --pre --extra-index scikit-learn
Note that first uninstalling scikit-learn might be required to be able to
install nightly builds of scikit-learn.
.. _install_bleeding_edge:
Building from source
Building from source is required to work on a contribution (bug fix, new
feature, code or documentation improvement).
.. _git_repo:
#. Use `Git <>`_ to check out the latest source from the
`scikit-learn repository <>`_ on
.. prompt:: bash $
git clone git:// # add --depth 1 if your connection is slow
cd scikit-learn
If you plan on submitting a pull-request, you should clone from your fork
#. Install a recent version of Python (3.9 or later at the time of writing) for
instance using Miniforge3_. Miniforge provides a conda-based distribution of
Python and the most popular scientific libraries.
If you installed Python with conda, we recommend creating a dedicated
`conda environment`_ with all the build dependencies of scikit-learn
(namely NumPy_, SciPy_, Cython_, meson-python_ and Ninja_):
.. prompt:: bash $
conda create -n sklearn-env -c conda-forge python numpy scipy cython meson-python ninja
It is not always necessary but it is safer to open a new prompt before
activating the newly created conda environment.
.. prompt:: bash $
conda activate sklearn-env
#. **Alternative to conda:** You can use alternative installations of Python
provided they are recent enough (3.9 or higher at the time of writing).
Here is an example on how to create a build environment for a Linux system's
Python. Build dependencies are installed with `pip` in a dedicated virtualenv_
to avoid disrupting other Python programs installed on the system:
.. prompt:: bash $
python3 -m venv sklearn-env
source sklearn-env/bin/activate
pip install wheel numpy scipy cython meson-python ninja
#. Install a compiler with OpenMP_ support for your platform. See instructions
for :ref:`compiler_windows`, :ref:`compiler_macos`, :ref:`compiler_linux`
and :ref:`compiler_freebsd`.
#. Build the project with pip:
.. prompt:: bash $
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
#. Check that the installed scikit-learn has a version number ending with
.. prompt:: bash $
python -c "import sklearn; sklearn.show_versions()"
#. Please refer to the :ref:`developers_guide` and :ref:`pytest_tips` to run
the tests on the module of your choice.
.. note::
`--config-settings editable-verbose=true` is optional but recommended
to avoid surprises when you import `sklearn`. `meson-python` implements
editable installs by rebuilding `sklearn` when executing `import sklearn`.
With the recommended setting you will see a message when this happens,
rather than potentially waiting without feedback and wondering
what is taking so long. Bonus: this means you only have to run the `pip
install` command once, `sklearn` will automatically be rebuilt when
importing `sklearn`.
Runtime dependencies
Scikit-learn requires the following dependencies both at build time and at
runtime:
- Python (>= 3.8),
- NumPy (>= |NumpyMinVersion|),
- SciPy (>= |ScipyMinVersion|),
- Joblib (>= |JoblibMinVersion|),
- threadpoolctl (>= |ThreadpoolctlMinVersion|).
Build dependencies
Building Scikit-learn also requires:
# The following places need to be in sync with regard to Cython version:
# - .circleci config file
# - sklearn/_build_utils/
- Cython >= |CythonMinVersion|
- A C/C++ compiler and a matching OpenMP_ runtime library. See the
:ref:`platform system specific instructions
<platform_specific_instructions>` for more details.
.. note::
If OpenMP is not supported by the compiler, the build will be done with
OpenMP functionalities disabled. This is not recommended since it will force
some estimators to run in sequential mode instead of leveraging thread-based
parallelism. Setting the ``SKLEARN_FAIL_NO_OPENMP`` environment variable
(before cythonization) will force the build to fail if OpenMP is not
supported.
Since version 0.21, scikit-learn automatically detects and uses the linear
algebra library used by SciPy **at runtime**. Scikit-learn has therefore no
build dependency on BLAS/LAPACK implementations such as OpenBLAS, ATLAS, BLIS
or MKL.
Test dependencies
Running tests requires:
- pytest >= |PytestMinVersion|
Some tests also require `pandas <>`_.
Building a specific version from a tag
If you want to build a stable version, you can ``git checkout <VERSION>``
to get the code for that particular version, or download a zip archive of
the version from GitHub.
.. _platform_specific_instructions:
Platform-specific instructions
Here are instructions to install a working C/C++ compiler with OpenMP support
to build scikit-learn Cython extensions for each supported platform.
.. _compiler_windows:
First, download the `Build Tools for Visual Studio 2019 installer
Run the downloaded `vs_buildtools.exe` file. During the installation, make
sure you select "Desktop development with C++", as in this screenshot:
.. image:: ../images/visual-studio-build-tools-selection.png
Secondly, find out if you are running 64-bit or 32-bit Python. The building
command depends on the architecture of the Python interpreter. You can check
the architecture by running the following in ``cmd`` or ``powershell``
.. prompt:: bash $
python -c "import struct; print(struct.calcsize('P') * 8)"
For 64-bit Python, configure the build environment by running the following
commands in ``cmd`` or an Anaconda Prompt (if you use Anaconda):
.. sphinx-prompt 1.3.0 (used in doc-min-dependencies CI task) does not support `batch` prompt type,
.. so we work around by using a known prompt type and an explicit prompt text.
.. prompt:: bash C:\>
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" x64
Replace ``x64`` by ``x86`` to build for 32-bit Python.
Please be aware that the path above might be different from user to user. The
aim is to point to the "vcvarsall.bat" file that will set the necessary
environment variables in the current command prompt.
Finally, build scikit-learn with this command prompt:
.. prompt:: bash $
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
.. _compiler_macos:
The default C compiler on macOS, Apple clang (confusingly aliased as
`/usr/bin/gcc`), does not directly support OpenMP. We present two alternatives
to enable OpenMP support:
- either install `conda-forge::compilers` with conda;
- or install `libomp` with Homebrew to extend the default Apple clang compiler.
For Apple Silicon M1 hardware, only the conda-forge method below is known to
work at the time of writing (January 2021). You can install the `macos/arm64`
distribution of conda using the `miniforge installer
macOS compilers from conda-forge
If you use the conda package manager (version >= 4.7), you can install the
``compilers`` meta-package from the conda-forge channel, which provides
OpenMP-enabled C/C++ compilers based on the llvm toolchain.
First install the macOS command line tools:
.. prompt:: bash $
xcode-select --install
It is recommended to use a dedicated conda environment to build
scikit-learn from source:
.. prompt:: bash $
conda create -n sklearn-dev -c conda-forge python numpy scipy cython \
joblib threadpoolctl pytest compilers llvm-openmp meson-python ninja
It is not always necessary but it is safer to open a new prompt before
activating the newly created conda environment.
.. prompt:: bash $
conda activate sklearn-dev
make clean
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
.. note::
If you get any conflicting dependency error message, try commenting out
any custom conda configuration in the ``$HOME/.condarc`` file. In
particular the ``channel_priority: strict`` directive is known to cause
problems for this setup.
You can check that the custom compilers are properly installed from conda
forge using the following command:
.. prompt:: bash $
conda list
which should include ``compilers`` and ``llvm-openmp``.
The compilers meta-package will automatically set custom environment
variables:
.. prompt:: bash $
echo $CC
echo $CXX
echo $CFLAGS
They point to files and folders from your ``sklearn-dev`` conda environment
(in particular in the bin/, include/ and lib/ subfolders). For instance
``-L/path/to/conda/envs/sklearn-dev/lib`` should appear in ``LDFLAGS``.
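The same check can be scripted. The sketch below is a hypothetical helper, not part of scikit-learn or conda: it reports which of the usual build variables mention the environment prefix. It assumes ``CONDA_PREFIX`` is set by ``conda activate`` and falls back to a placeholder path otherwise.

```python
import os


def compiler_env_report(env, prefix):
    """For each build variable, say whether its value mentions `prefix`."""
    names = ("CC", "CXX", "CFLAGS", "LDFLAGS")
    return {name: prefix in env.get(name, "") for name in names}


# CONDA_PREFIX is set by `conda activate`; the fallback is only a placeholder.
prefix = os.environ.get("CONDA_PREFIX", "/path/to/conda/envs/sklearn-dev")
for name, ok in compiler_env_report(os.environ, prefix).items():
    print(f"{name}: {'ok' if ok else 'not pointing into ' + prefix}")
```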
In the log, you should see the compiled extension being built with the clang
and clang++ compilers installed by conda with the ``-fopenmp`` command line
flag.
macOS compilers from Homebrew
Another solution is to enable OpenMP support for the clang compiler shipped
by default on macOS.
First install the macOS command line tools:
.. prompt:: bash $
xcode-select --install
Install the Homebrew_ package manager for macOS.
Install the LLVM OpenMP library:
.. prompt:: bash $
brew install libomp
Set the following environment variables:
.. prompt:: bash $
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
export CPPFLAGS="$CPPFLAGS -Xpreprocessor -fopenmp"
export CFLAGS="$CFLAGS -I/usr/local/opt/libomp/include"
export CXXFLAGS="$CXXFLAGS -I/usr/local/opt/libomp/include"
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"
Finally, build scikit-learn in verbose mode (to check for the presence of the
``-fopenmp`` flag in the compiler commands):
.. prompt:: bash $
make clean
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
.. _compiler_linux:
Linux compilers from the system
Installing scikit-learn from source without using conda requires the Python
development headers and a working C/C++ compiler with OpenMP support
(typically the GCC toolchain).
Install build dependencies for Debian-based operating systems, e.g.
.. prompt:: bash $
sudo apt-get install build-essential python3-dev python3-pip
then proceed as usual:
.. prompt:: bash $
pip3 install cython
pip3 install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
Cython and the pre-compiled wheels for the runtime dependencies (numpy, scipy
and joblib) should automatically be installed in
``$HOME/.local/lib/pythonX.Y/site-packages``. Alternatively you can run the
above commands from a virtualenv_ or a `conda environment`_ to get full
isolation from the Python packages installed via the system packager. When
using an isolated environment, ``pip3`` should be replaced by ``pip`` in the
above commands.
When precompiled wheels of the runtime dependencies are not available for your
architecture (e.g. ARM), you can install the system versions:
.. prompt:: bash $
sudo apt-get install cython3 python3-numpy python3-scipy
On Red Hat and clones (e.g. CentOS), install the dependencies using:
.. prompt:: bash $
sudo yum -y install gcc gcc-c++ python3-devel numpy scipy
Linux compilers from conda-forge
Alternatively, install a recent version of the GNU C Compiler toolchain (GCC)
in the user folder using conda:
.. prompt:: bash $
conda create -n sklearn-dev -c conda-forge python numpy scipy cython \
joblib threadpoolctl pytest compilers meson-python ninja
It is not always necessary but it is safer to open a new prompt before
activating the newly created conda environment.
.. prompt:: bash $
conda activate sklearn-dev
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
.. _compiler_freebsd:
The clang compiler included in FreeBSD 12.0 and 11.2 base systems does not
include OpenMP support. You need to install the `openmp` library from packages
(or ports):
.. prompt:: bash $
sudo pkg install openmp
This will install header files in ``/usr/local/include`` and libs in
``/usr/local/lib``. Since these directories are not searched by default, you
can set the environment variables to these locations:
.. prompt:: bash $
export CFLAGS="$CFLAGS -I/usr/local/include"
export CXXFLAGS="$CXXFLAGS -I/usr/local/include"
export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lomp"
Finally, build the package using the standard command:
.. prompt:: bash $
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
For the upcoming FreeBSD 12.1 and 11.3 versions, OpenMP will be included in
the base system and these steps will not be necessary.
.. _OpenMP:
.. _Cython:
.. _meson-python:
.. _Ninja:
.. _NumPy:
.. _SciPy:
.. _Homebrew:
.. _virtualenv:
.. _conda environment:
.. _Miniforge3:
The following command will build scikit-learn using your default C/C++ compiler.
.. prompt:: bash $
pip install --editable . \
--verbose --no-build-isolation \
--config-settings editable-verbose=true
If you want to build scikit-learn with another compiler handled by ``setuptools``,
use the following command:
.. prompt:: bash $
python build_ext --compiler=<compiler> -i build_clib --compiler=<compiler>
To see the list of available compilers run:
.. prompt:: bash $
python build_ext --help-compiler
If your compiler is not listed here, you can specify it through some environment
variables (this does not work on Windows). This `section
of the setuptools documentation explains in detail which environment variables
are used by ``setuptools``, and at which stage of the compilation, to set the
compiler and linker options.
When setting these environment variables, it is advised to first check their
``sysconfig`` counterparts and adapt them to your compiler. For instance::
import sysconfig
In addition, since Scikit-learn uses OpenMP, you need to include the appropriate OpenMP
flag of your compiler into the ``CFLAGS`` and ``CPPFLAGS`` environment variables.
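As a sketch of the ``sysconfig`` lookup mentioned above, the following prints the values recorded when the running interpreter was built. The ``interpreter_build_flags`` helper name is hypothetical; ``sysconfig.get_config_var`` is the standard-library call and returns ``None`` for variables that are not defined on the platform.

```python
import sysconfig


def interpreter_build_flags(names=("CC", "CFLAGS", "LDFLAGS")):
    """Return the build-time value of each variable, or None if undefined."""
    return {name: sysconfig.get_config_var(name) for name in names}


for name, value in interpreter_build_flags().items():
    print(f"{name} = {value}")
```

These values are a starting point: copy them into the corresponding environment variables and then add the options your compiler needs, such as its OpenMP flag.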
.. _bug_triaging:
Bug triaging and issue curation
The `issue tracker <>`_
is important to the communication in the project: it helps
developers identify major projects to work on, as well as to discuss
priorities. For this reason, it is important to curate it, adding labels
to issues and closing issues that are not necessary.
Working on issues to improve them
Improving issues increases their chances of being successfully resolved.
Guidelines on submitting good issues can be found :ref:`here
A third party can give useful feedback or even add
comments on the issue.
The following actions are typically useful:
- documenting issues that are missing elements to reproduce the problem
such as code samples
- suggesting better use of code formatting
- suggesting to reformulate the title and description to make them more
explicit about the problem to be solved
- linking to related issues or discussions while briefly describing how
they are related, for instance "See also #xyz for a similar attempt
at this" or "See also #xyz where the same thing happened in
SomeEstimator" provides context and helps the discussion.
.. topic:: Fruitful discussions
Online discussions may be harder than they seem at first glance, in
particular given that a person new to open-source may have a very
different understanding of the process than a seasoned maintainer.
Overall, it is useful to stay positive and assume good will. `The
following article
explores how to lead online discussions in the context of open source.
Working on PRs to help review
Reviewing code is also encouraged. Contributors and users are welcome to
participate in the review process following our :ref:`review guidelines
Triaging operations for members of the core and contributor experience teams
In addition to the above, members of the core team and the contributor experience team
can do the following important tasks:
- Update :ref:`labels for issues and PRs <issue_tracker_tags>`: see the list of
the `available github labels
- :ref:`Determine if a PR must be relabeled as stalled <stalled_pull_request>`
or needs help (this is typically very important in the context
of sprints, where the risk is to create many unfinished PRs)
- If a stalled PR is taken over by a newer PR, then label the stalled PR as
"Superseded", leave a comment on the stalled PR linking to the new PR, and
likely close the stalled PR.
- Triage issues:
- **close usage questions** and politely point the reporter to use
Stack Overflow instead.
- **close duplicate issues**, after checking that they are
indeed duplicate. Ideally, the original submitter moves the
discussion to the older, duplicate issue
- **close issues that cannot be replicated**, after leaving time (at
least a week) to add extra information
:ref:`Saved replies <saved_replies>` are useful to gain time and yet be
welcoming and polite when triaging.
See the github description for `roles in the organization
.. topic:: Closing issues: a tough call
When uncertain on whether an issue should be closed or not, it is
best to strive for consensus with the original poster, and possibly
to seek relevant expertise. However, when the issue is a usage
question, or when it has been considered as unclear for many years it
should be closed.
A typical workflow for triaging issues
The following workflow [1]_ is a good way to approach issue triaging:
#. Thank the reporter for opening an issue
The issue tracker is many people's first interaction with the
scikit-learn project itself, beyond just using the library. As such,
we want it to be a welcoming, pleasant experience.
#. Is this a usage question? If so close it with a polite message
(:ref:`here is an example <saved_replies>`).
#. Is the necessary information provided?
If crucial information (like the version of scikit-learn used) is
missing, feel free to ask for it and label the issue with "Needs
#. Is this a duplicate issue?
We have many open issues. If a new issue seems to be a duplicate,
point to the original issue. If it is a clear duplicate, or consensus
is that it is redundant, close it. Make sure to still thank the
reporter, and encourage them to chime in on the original issue, and
perhaps try to fix it.
If the new issue provides relevant information, such as a better or
slightly different example, add it to the original issue as a comment
or an edit to the original post.
#. Make sure that the title accurately reflects the issue. If it is unclear
and you have the necessary permissions, edit it yourself.
#. Is the issue minimal and reproducible?
For bug reports, we ask that the reporter provide a minimal
reproducible example. See `this useful post
by Matthew Rocklin for a good explanation. If the example is not
reproducible, or if it's clearly not minimal, feel free to ask the reporter
if they can provide an example or simplify the provided one.
Do acknowledge that writing minimal reproducible examples is hard work.
If the reporter is struggling, you can try to write one yourself.
If a reproducible example is provided, but you see a simplification,
add your simpler reproducible example.
#. Add the relevant labels, such as "Documentation" when the issue is
about documentation, "Bug" if it is clearly a bug, "Enhancement" if it
is an enhancement request, ...
If the issue is clearly defined and the fix seems relatively
straightforward, label the issue as “Good first issue”.
An additional useful step can be to tag the corresponding module e.g.
`sklearn.linear_model` when relevant.
#. Remove the "Needs Triage" label from the issue if the label exists.
.. [1] Adapted from the pandas project `maintainers guide
.. _cython:
Cython Best Practices, Conventions and Knowledge
This document contains tips for developing Cython code in scikit-learn.
Tips for developing with Cython in scikit-learn
Tips to ease development
* Time spent reading `Cython's documentation <>`_ is not time lost.
* If you intend to use OpenMP: on macOS, the system's distribution of ``clang`` does not implement OpenMP.
You can install the ``compilers`` package available on ``conda-forge``, which comes with an implementation of OpenMP.
* Activating `checks <>`_ might help. E.g. for activating boundscheck use:
.. code-block:: bash
* `Start from scratch in a notebook <>`_ to understand how to use Cython and to get feedback on your work quickly.
If you plan to use OpenMP for your implementations in your Jupyter Notebook, add extra compiler and linker arguments in the Cython magic.
.. code-block:: python
# For GCC and for clang
%%cython --compile-args=-fopenmp --link-args=-fopenmp
# For Microsoft's compilers
%%cython --compile-args=/openmp --link-args=/openmp
* To debug C code (e.g. a segfault), do use ``gdb`` with:
.. code-block:: bash
gdb --ex r --args python ./
* To inspect a value in place while debugging in a ``cdef (nogil)`` context, use:
.. code-block:: cython
with gil:
* Note that Cython cannot parse f-strings with ``{var=}`` expressions, e.g.
.. code-block:: bash
* The scikit-learn codebase has many non-unified (fused) type (re)definitions.
There currently is `ongoing work to simplify and unify that across the codebase
For now, make sure you understand which concrete types are used ultimately.
* You might find this alias for compiling an individual Cython extension handy:
.. code-block::
# You might want to add this alias to your shell script config.
alias cythonX="cython -X language_level=3 -X boundscheck=False -X wraparound=False -X initializedcheck=False -X nonecheck=False -X cdivision=True"
# This generates `source.c` as if you had recompiled scikit-learn entirely.
cythonX --annotate source.pyx
* Using the ``--annotate`` option with this alias generates an HTML report of code annotation.
This report indicates interactions with the CPython interpreter on a line-by-line basis.
Interactions with the CPython interpreter must be avoided as much as possible in
the computationally intensive sections of the algorithms.
For more information, please refer to `this section of Cython's tutorial <>`_
.. code-block::
# This generates an HTML report (`source.html`) for `source.c`.
cythonX --annotate source.pyx
Tips for performance
* Understand the GIL in context for CPython (which problems it solves, what are its limitations)
and get a good understanding of when Cython will be mapped to C code free of interactions with
CPython, when it will not, and when it cannot (e.g. presence of interactions with Python
objects, which include functions). In this regard, `PEP 703 <>`_
provides a good overview of the GIL, its context, and pathways for its removal.
* Make sure you have deactivated `checks <>`_.
* Always prefer memoryviews over ``cnp.ndarray`` when possible: memoryviews are lightweight.
* Avoid memoryview slicing: it can be costly or misleading in some cases, so it is
better avoided, even if handling fewer dimensions in some context would be preferable.
* Decorate final classes or methods with ``@final`` (this allows removing virtual tables when needed)
* Inline methods and functions when it makes sense
* Make sure your Cython compilation units `use NumPy recent C API <>`_.
* When in doubt, read the generated C or C++ code if you can: "The fewer C instructions and indirections
for a line of Cython code, the better" is a good rule of thumb.
* ``nogil`` declarations are just hints: when declaring the ``cdef`` functions
as nogil, it means that they can be called without holding the GIL, but it does not release
the GIL when entering them. You have to do that yourself either by passing ``nogil=True`` to
``cython.parallel.prange`` explicitly, or by using an explicit context manager:
.. code-block:: cython
    cdef inline int my_func(self) nogil:
        # Some logic interacting with CPython, e.g. allocating arrays via NumPy.
        with nogil:
            # The code here is run as if it were written in C.
            pass
        return 0
This item is based on `this comment from Stefan Behnel <>`_
* Direct calls to BLAS routines are possible via interfaces defined in ``sklearn.utils._cython_blas``.
Using OpenMP
Since scikit-learn can be built without OpenMP, it's necessary to protect each
direct call to OpenMP.
The `_openmp_helpers` module, available in
`sklearn/utils/_openmp_helpers.pyx <>`_
provides protected versions of the OpenMP routines. To use OpenMP routines, they
must be ``cimported`` from this module and not from the OpenMP library directly:
.. code-block:: cython
from sklearn.utils._openmp_helpers cimport omp_get_max_threads
max_threads = omp_get_max_threads()
The parallel loop, `prange`, is already protected by Cython and can be used directly
from `cython.parallel`.
Cython code requires the use of explicit types. This is one of the reasons you get a
performance boost. In order to avoid code duplication, we have a central place
for the most used types in
`sklearn/utils/_typedefs.pxd <>`_.
Ideally you start by having a look there and `cimport` the types you need, for example:
.. code-block:: cython
from sklearn.utils._typedefs cimport float32, float64
.. _develop:
Developing scikit-learn estimators
Whether you are proposing an estimator for inclusion in scikit-learn,
developing a separate package compatible with scikit-learn, or
implementing custom components for your own projects, this chapter
details how to develop objects that safely interact with scikit-learn
Pipelines and model selection tools.
.. currentmodule:: sklearn
.. _api_overview:
APIs of scikit-learn objects
To have a uniform API, we try to have a common basic API for all the
objects. In addition, to avoid the proliferation of framework code, we
try to adopt simple conventions and limit to a minimum the number of
methods an object must implement.
Elements of the scikit-learn API are described more definitively in the
Different objects
The main objects in scikit-learn are (one class can implement
multiple interfaces):
The base object, implements a ``fit`` method to learn from data, either::
estimator = estimator.fit(data, targets)
estimator = estimator.fit(data)
For supervised learning, or some unsupervised problems, implements::
prediction = predictor.predict(data)
Classification algorithms usually also offer a way to quantify certainty
of a prediction, either using ``decision_function`` or ``predict_proba``::
probability = predictor.predict_proba(data)
For modifying the data in a supervised or unsupervised way (e.g. by adding, changing,
or removing columns, but not by adding or removing rows). Implements::
new_data = transformer.transform(data)
When fitting and transforming can be performed much more efficiently
together than separately, implements::
new_data = transformer.fit_transform(data)
A model that can give a `goodness of fit <>`_
measure or a likelihood of unseen data, implements (higher is better)::
score = model.score(data)
The API has one predominant object: the estimator. An estimator is an
object that fits a model based on some training data and is capable of
inferring some properties on new data. It can be, for instance, a
classifier or a regressor. All estimators implement the fit method::
estimator.fit(X, y)
All built-in estimators also have a ``set_params`` method, which sets
data-independent parameters (overriding previous parameter values passed
to ``__init__``).
All estimators in the main scikit-learn codebase should inherit from
This concerns the creation of an object. The object's ``__init__`` method
might accept constants as arguments that determine the estimator's behavior
(like the C constant in SVMs). It should not, however, take the actual training
data as an argument, as this is left to the ``fit()`` method::
clf2 = SVC(C=2.3)
clf3 = SVC([[1, 2], [2, 3]], [-1, 1]) # WRONG!
The arguments accepted by ``__init__`` should all be keyword arguments
with a default value. In other words, a user should be able to instantiate
an estimator without passing any arguments to it. The arguments should all
correspond to hyperparameters describing the model or the optimisation
problem the estimator tries to solve. These initial arguments (or parameters)
are always remembered by the estimator.
Also note that they should not be documented under the "Attributes" section,
but rather under the "Parameters" section for that estimator.
In addition, **every keyword argument accepted by** ``__init__`` **should
correspond to an attribute on the instance**. Scikit-learn relies on this to
find the relevant attributes to set on an estimator when doing model selection.
To summarize, an ``__init__`` should look like::
    def __init__(self, param1=1, param2=2):
        self.param1 = param1
        self.param2 = param2
There should be no logic, not even input validation,
and the parameters should not be changed.
The corresponding logic should be put where the parameters are used,
typically in ``fit``.
The following is wrong::
    def __init__(self, param1=1, param2=2, param3=3):
        # WRONG: parameters should not be modified
        if param1 > 1:
            param2 += 1
        self.param1 = param1
        # WRONG: the object's attributes should have exactly the name of
        # the argument in the constructor
        self.param3 = param2
The reason for postponing the validation is that the same validation
would have to be performed in ``set_params``,
which is used in algorithms like ``GridSearchCV``.
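As a minimal sketch of this convention, the hypothetical ``ShrunkenMean``
estimator below (not part of scikit-learn; the class and its ``shrinkage``
parameter are invented for illustration) stores its argument untouched in
``__init__`` and validates it in ``fit``, so a value assigned after
construction is checked as well:

```python
class ShrunkenMean:
    """Hypothetical estimator: predicts a shrunken mean of the training targets."""

    def __init__(self, shrinkage=0.0):
        # Store the argument verbatim: no validation, no modification.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        # Validation lives here, so it also covers values set after construction.
        if not 0.0 <= self.shrinkage <= 1.0:
            raise ValueError(f"shrinkage must be in [0, 1], got {self.shrinkage!r}")
        if len(X) != len(y):
            raise ValueError("X and y have inconsistent numbers of samples")
        self.mean_ = (1.0 - self.shrinkage) * sum(y) / len(y)
        return self


est = ShrunkenMean(shrinkage=2.0)  # constructing with a bad value does not raise
try:
    est.fit([[0], [1]], [1.0, 3.0])
except ValueError as exc:
    error_message = str(exc)

est.shrinkage = 0.5  # roughly what a ``set_params`` call would do
fitted_mean = est.fit([[0], [1]], [1.0, 3.0]).mean_
```

Because the check runs at ``fit`` time, it does not matter whether the value
arrived through the constructor or was assigned later.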
The next thing you will probably want to do is to estimate some
parameters in the model. This is implemented in the ``fit()`` method.
The ``fit()`` method takes the training data as arguments, which can be one
array in the case of unsupervised learning, or two arrays in the case
of supervised learning.
Note that the model is fitted using ``X`` and ``y``, but the object holds no
reference to ``X`` and ``y``. There are, however, some exceptions to this, as in
the case of precomputed kernels where this data must be stored for use by
the predict method.
============= ======================================================
============= ======================================================
X array-like of shape (n_samples, n_features)
y array-like of shape (n_samples,)
kwargs optional data-dependent parameters
============= ======================================================
``X.shape[0]`` should be the same as ``y.shape[0]``. If this requisite
is not met, an exception of type ``ValueError`` should be raised.
``y`` might be ignored in the case of unsupervised learning. However, to
make it possible to use the estimator as part of a pipeline that can
mix both supervised and unsupervised transformers, even unsupervised
estimators need to accept a ``y=None`` keyword argument in
the second position that is just ignored by the estimator.
For the same reason, ``fit_predict``, ``fit_transform``, ``score``
and ``partial_fit`` methods need to accept a ``y`` argument in
the second place if they are implemented.
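The signature convention can be sketched with a hypothetical ``Centerer``
transformer (an illustration only, written without any scikit-learn import):
``y`` is accepted in the second position by ``fit`` and ``fit_transform``
and is simply ignored.

```python
class Centerer:
    """Hypothetical unsupervised transformer that subtracts per-column means."""

    def fit(self, X, y=None):
        # y is accepted for pipeline compatibility and deliberately ignored.
        n_samples = len(X)
        self.means_ = [sum(column) / n_samples for column in zip(*X)]
        return self

    def transform(self, X):
        return [[value - mean for value, mean in zip(row, self.means_)] for row in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


X = [[1.0, 10.0], [3.0, 30.0]]
# Both calls work: a pipeline may pass y even to unsupervised steps.
without_y = Centerer().fit_transform(X)
with_y = Centerer().fit_transform(X, [0, 1])
```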
The method should return the object (``self``). This pattern is useful
to be able to implement quick one-liners in an IPython session such as::
y_predicted = SVC(C=100).fit(X_train, y_train).predict(X_test)
Depending on the nature of the algorithm, ``fit`` can sometimes also
accept additional keyword arguments. However, any parameter that can
have a value assigned prior to having access to the data should be an
``__init__`` keyword argument. **fit parameters should be restricted
to directly data dependent variables**. For instance a Gram matrix or
an affinity matrix which are precomputed from the data matrix ``X`` are
data dependent. A tolerance stopping criterion ``tol`` is not directly
data dependent (although the optimal value according to some scoring
function probably is).
When ``fit`` is called, any previous call to ``fit`` should be ignored. In
general, calling ``estimator.fit(X1)`` and then ``estimator.fit(X2)`` should
be the same as only calling ``estimator.fit(X2)``. However, this may not be
true in practice when ``fit`` depends on some random process, see
:term:`random_state`. Another exception to this rule is when the
hyper-parameter ``warm_start`` is set to ``True`` for estimators that
support it. ``warm_start=True`` means that the previous state of the
trainable parameters of the estimator is reused instead of using the
default initialization strategy.
Estimated Attributes
Attributes that have been estimated from the data must always have a name
ending with a trailing underscore, for example the coefficients of
some regression estimator would be stored in a ``coef_`` attribute after
``fit`` has been called.
The estimated attributes are expected to be overridden when you call ``fit``
a second time.
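The refit behavior can be illustrated with a hypothetical ``MeanRegressor``
(the class and the ``n_samples_seen_`` attribute are invented for this
sketch): every attribute with a trailing underscore is recomputed from
scratch, so fitting twice leaves the same state as fitting once on the second
dataset.

```python
class MeanRegressor:
    """Hypothetical regressor that always predicts the training mean of y."""

    def fit(self, X, y):
        # Every fitted attribute is recomputed from scratch on each call.
        self.mean_ = sum(y) / len(y)
        self.n_samples_seen_ = len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


X1, y1 = [[0], [1], [2]], [1.0, 2.0, 3.0]
X2, y2 = [[5], [6]], [10.0, 20.0]

refitted = MeanRegressor().fit(X1, y1).fit(X2, y2)
fresh = MeanRegressor().fit(X2, y2)
```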
Optional Arguments
In iterative algorithms, the number of iterations should be specified by
an integer called ``n_iter``.
Universal attributes
Estimators that expect tabular input should set a `n_features_in_`
attribute at `fit` time to indicate the number of features that the estimator
expects for subsequent calls to `predict` or `transform`.
for details.
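A minimal sketch of how such an attribute might be used, assuming a
hypothetical ``ColumnSum`` transformer and an illustrative error message
(neither is taken from scikit-learn): the width of ``X`` is recorded at
``fit`` time and compared against later inputs.

```python
class ColumnSum:
    """Hypothetical transformer that sums each row after checking its width."""

    def fit(self, X, y=None):
        # Record the number of columns seen during fit.
        self.n_features_in_ = len(X[0])
        return self

    def transform(self, X):
        for row in X:
            if len(row) != self.n_features_in_:
                raise ValueError(
                    f"X has {len(row)} features, but this estimator was "
                    f"fitted with {self.n_features_in_} features."
                )
        return [sum(row) for row in X]


summer = ColumnSum().fit([[1, 2, 3], [4, 5, 6]])
row_sums = summer.transform([[1, 1, 1]])
try:
    summer.transform([[1, 2]])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```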
.. _rolling_your_own_estimator:
Rolling your own estimator
If you want to implement a new estimator that is scikit-learn-compatible,
whether it is just for you or for contributing it to scikit-learn, there are
several internals of scikit-learn that you should be aware of in addition to
the scikit-learn API outlined above. You can check whether your estimator
adheres to the scikit-learn interface and standards by running
:func:`~sklearn.utils.estimator_checks.check_estimator` on an instance. The
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks` pytest
decorator can also be used (see its docstring for details and possible
interactions with `pytest`)::
>>> from sklearn.utils.estimator_checks import check_estimator
>>> from sklearn.svm import LinearSVC
>>> check_estimator(LinearSVC()) # passes
The main motivation to make a class compatible with the scikit-learn estimator
interface might be that you want to use it together with model evaluation and
selection tools such as :class:`model_selection.GridSearchCV` and
Before detailing the required interface below, we describe two ways to achieve
the correct interface more easily.
.. topic:: Project template:
We provide a `project template <>`_
which helps in the creation of Python packages containing scikit-learn compatible estimators.
It provides:
* an initial git repository with Python package directory structure
* a template of a scikit-learn estimator
* an initial test suite including use of ``check_estimator``
* directory structures and scripts to compile documentation and example
* scripts to manage continuous integration (testing on Linux and Windows)
* instructions from getting started to publishing on `PyPI <>`_
.. topic:: ``BaseEstimator`` and mixins:
We tend to use "duck typing", so building an estimator which follows
the API suffices for compatibility, without needing to inherit from or
even import any scikit-learn classes.
However, if a dependency on scikit-learn is acceptable in your code,
you can avoid a lot of boilerplate code
by deriving a class from ``BaseEstimator``
and optionally the mixin classes in ``sklearn.base``.
For example, below is a custom classifier, with more examples included
in the scikit-learn-contrib
`project template <>`__.
It is particularly important to notice that mixins should be "on the left" while
the ``BaseEstimator`` should be "on the right" in the inheritance list for proper
>>> import numpy as np
>>> from sklearn.base import BaseEstimator, ClassifierMixin
>>> from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
>>> from sklearn.utils.multiclass import unique_labels
>>> from sklearn.metrics import euclidean_distances
>>> class TemplateClassifier(ClassifierMixin, BaseEstimator):
...     def __init__(self, demo_param='demo'):
...         self.demo_param = demo_param
...     def fit(self, X, y):
...         # Check that X and y have correct shape
...         X, y = check_X_y(X, y)
...         # Store the classes seen during fit
...         self.classes_ = unique_labels(y)
...         self.X_ = X
...         self.y_ = y
...         # Return the classifier
...         return self
...     def predict(self, X):
...         # Check if fit has been called
...         check_is_fitted(self)
...         # Input validation
...         X = check_array(X)
...         closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
...         return self.y_[closest]
get_params and set_params
All scikit-learn estimators have ``get_params`` and ``set_params`` functions.
The ``get_params`` function takes no positional arguments and returns a dict of the
``__init__`` parameters of the estimator, together with their values.
It must take one keyword argument, ``deep``, which receives a boolean value
that determines whether the method should return the parameters of
sub-estimators (for most estimators, this can be ignored). The default value
for ``deep`` should be `True`. For instance considering the following
>>> from sklearn.base import BaseEstimator
>>> from sklearn.linear_model import LogisticRegression
>>> class MyEstimator(BaseEstimator):
...     def __init__(self, subestimator=None, my_extra_param="random"):
...         self.subestimator = subestimator
...         self.my_extra_param = my_extra_param
The parameter `deep` will control whether or not the parameters of the
`subestimator` should be reported. Thus when `deep=True`, the output will be::
>>> my_estimator = MyEstimator(subestimator=LogisticRegression())
>>> for param, value in my_estimator.get_params(deep=True).items():
...     print(f"{param} -> {value}")
my_extra_param -> random
subestimator__C -> 1.0
subestimator__class_weight -> None
subestimator__dual -> False
subestimator__fit_intercept -> True
subestimator__intercept_scaling -> 1
subestimator__l1_ratio -> None
subestimator__max_iter -> 100
subestimator__multi_class -> deprecated
subestimator__n_jobs -> None
subestimator__penalty -> l2
subestimator__random_state -> None
subestimator__solver -> lbfgs
subestimator__tol -> 0.0001
subestimator__verbose -> 0
subestimator__warm_start -> False
subestimator -> LogisticRegression()
Often, the `subestimator` has a name (as e.g. named steps in a
:class:`~sklearn.pipeline.Pipeline` object), in which case the key should
become `<name>__C`, `<name>__class_weight`, etc.
When `deep=False`, the output will instead be::
>>> for param, value in my_estimator.get_params(deep=False).items():
...     print(f"{param} -> {value}")
my_extra_param -> random
subestimator -> LogisticRegression()
On the other hand, ``set_params`` takes the parameters of ``__init__``
as keyword arguments, unpacks them into a dict of the form
``'parameter': value`` and sets the parameters of the estimator using this dict.
The return value must be the estimator itself.
While the ``get_params`` mechanism is not essential (see :ref:`cloning` below),
the ``set_params`` function is necessary as it is used to set parameters during
grid searches.
The easiest way to implement these functions, and to get a sensible
``__repr__`` method, is to inherit from ``sklearn.base.BaseEstimator``. If you
do not want to make your code dependent on scikit-learn, the easiest way to
implement the interface is::
    def get_params(self, deep=True):
        # suppose this estimator has parameters "alpha" and "recursive"
        return {"alpha": self.alpha, "recursive": self.recursive}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
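If hard-coding the parameter names is undesirable, one possible variation is
to derive them from the ``__init__`` signature. The sketch below is an
assumption-laden illustration, not scikit-learn's implementation: the
``ParamsMixin``, ``Inner`` and ``Outer`` names are hypothetical, and
``set_params`` here does not handle nested ``<name>__<param>`` keys.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin deriving parameter names from ``__init__``."""

    @classmethod
    def _param_names(cls):
        signature = inspect.signature(cls.__init__)
        return sorted(name for name in signature.parameters if name != "self")

    def get_params(self, deep=True):
        params = {}
        for name in self._param_names():
            value = getattr(self, name)
            if deep and hasattr(value, "get_params"):
                # Report nested parameters as "<name>__<sub_name>".
                for sub_name, sub_value in value.get_params(deep=True).items():
                    params[f"{name}__{sub_name}"] = sub_value
            params[name] = value
        return params

    def set_params(self, **parameters):
        valid = self._param_names()
        for name, value in parameters.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r} for {type(self).__name__}")
            setattr(self, name, value)
        return self


class Inner(ParamsMixin):
    def __init__(self, alpha=1.0):
        self.alpha = alpha


class Outer(ParamsMixin):
    def __init__(self, subestimator=None, recursive=False):
        self.subestimator = subestimator
        self.recursive = recursive


outer = Outer(subestimator=Inner(alpha=0.5))
deep_keys = list(outer.get_params(deep=True))
shallow_keys = list(outer.get_params(deep=False))
```

This only works because every ``__init__`` argument is stored under an
attribute of the same name, which is the convention described earlier.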
Parameters and init
As :class:`model_selection.GridSearchCV` uses ``set_params``
to apply parameter setting to estimators,
it is essential that calling ``set_params`` has the same effect
as setting parameters using the ``__init__`` method.
The easiest and recommended way to accomplish this is to
**not do any parameter validation in** ``__init__``.
All logic behind estimator parameters,
like translating string arguments into functions, should be done in ``fit``.
Also it is expected that parameters with trailing ``_`` are **not to be set
inside the** ``__init__`` **method**. All and only the public attributes set by
fit have a trailing ``_``. As a result the existence of parameters with
trailing ``_`` is used to check if the estimator has been fitted.
.. _cloning:
For use with the :mod:`~sklearn.model_selection` module,
an estimator must support the ``base.clone`` function to replicate an estimator.
This can be done by providing a ``get_params`` method.
If ``get_params`` is present, then ``clone(estimator)`` will be an instance of
``type(estimator)`` on which ``set_params`` has been called with clones of
the result of ``estimator.get_params()``.
Objects that do not provide this method will be deep-copied
(using the Python standard function ``copy.deepcopy``)
if ``safe=False`` is passed to ``clone``.
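The idea can be sketched with a hypothetical ``simple_clone`` helper. It is a
deliberately simplified stand-in for ``base.clone``: it copies parameter
values with ``copy.deepcopy`` rather than cloning nested estimators
recursively, and it performs no sanity checks.

```python
import copy


def simple_clone(estimator):
    """Hypothetical, simplified stand-in for ``sklearn.base.clone``."""
    params = estimator.get_params(deep=False)
    # Copy parameter values so the clone shares no mutable state.
    cloned_params = {name: copy.deepcopy(value) for name, value in params.items()}
    # A new instance carries only constructor parameters, no fitted state.
    return type(estimator)(**cloned_params)


class MeanRegressor:
    """Hypothetical estimator used only to exercise ``simple_clone``."""

    def __init__(self, offset=0.0):
        self.offset = offset

    def get_params(self, deep=True):
        return {"offset": self.offset}

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y) + self.offset
        return self


fitted = MeanRegressor(offset=1.0).fit([[0], [1]], [2.0, 4.0])
unfitted_copy = simple_clone(fitted)
```

The copy has the same parameters as the original but none of its fitted
attributes, which is what model selection tools rely on.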
Estimators can customize the behavior of :func:`base.clone` by defining a
`__sklearn_clone__` method. `__sklearn_clone__` must return an instance of the
estimator. `__sklearn_clone__` is useful when an estimator needs to hold on to
some state when :func:`base.clone` is called on the estimator. For example, a
frozen meta-estimator for transformers can be defined as follows::
    class FrozenTransformer(BaseEstimator):
        def __init__(self, fitted_transformer):
            self.fitted_transformer = fitted_transformer

        def __getattr__(self, name):
            # `fitted_transformer`'s attributes are now accessible
            return getattr(self.fitted_transformer, name)

        def __sklearn_clone__(self):
            return self

        def fit(self, X, y=None):
            # Fitting does not change the state of the estimator
            return self

        def fit_transform(self, X, y=None):
            # fit_transform only transforms the data
            return self.fitted_transformer.transform(X)
Pipeline compatibility
For an estimator to be usable together with ``pipeline.Pipeline`` in any but the
last step, it needs to provide a ``fit`` or ``fit_transform`` function.
To be able to evaluate the pipeline on any data but the training set,
it also needs to provide a ``transform`` function.
There are no special requirements for the last step in a pipeline, except that
it has a ``fit`` function. All ``fit`` and ``fit_transform`` functions must
take arguments ``X, y``, even if y is not used. Similarly, for ``score`` to be
usable, the last step of the pipeline needs to have a ``score`` function that
accepts an optional ``y``.
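To see where these requirements come from, consider the hypothetical
``TinyPipeline`` below. It is a heavily simplified illustration and not how
``pipeline.Pipeline`` is implemented; ``Doubler`` and ``RowSumPredictor`` are
likewise invented for the example.

```python
class TinyPipeline:
    """Hypothetical, heavily simplified pipeline (not sklearn's Pipeline)."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        *transformers, final = self.steps
        for step in transformers:
            # Intermediate steps must be able to fit and then transform.
            X = step.fit(X, y).transform(X)
        final.fit(X, y)  # the last step only needs ``fit``
        return self

    def predict(self, X):
        *transformers, final = self.steps
        for step in transformers:
            # New data only goes through ``transform``, never ``fit``.
            X = step.transform(X)
        return final.predict(X)


class Doubler:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[2 * value for value in row] for row in X]


class RowSumPredictor:
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [sum(row) for row in X]


pipe = TinyPipeline([Doubler(), RowSumPredictor()]).fit([[1, 2]], [0])
predictions = pipe.predict([[1, 2], [3, 4]])
```

Every step receives ``y`` during fitting, which is why even steps that
ignore it must accept it.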
Estimator types
Some common functionality depends on the kind of estimator passed.
For example, cross-validation in :class:`model_selection.GridSearchCV` and
:func:`model_selection.cross_val_score` defaults to being stratified when used
on a classifier, but not otherwise. Similarly, scorers for average precision
that take a continuous prediction need to call ``decision_function`` for classifiers,
but ``predict`` for regressors. This distinction between classifiers and regressors
is implemented using the ``_estimator_type`` attribute, which takes a string value.
It should be ``"classifier"`` for classifiers, ``"regressor"`` for
regressors, and ``"clusterer"`` for clustering methods, to work as expected.
Inheriting from ``ClassifierMixin``, ``RegressorMixin`` or ``ClusterMixin``
will set the attribute automatically. When a meta-estimator needs to distinguish
among estimator types, instead of checking ``_estimator_type`` directly, helpers
like :func:`base.is_classifier` should be used.
Specific models
Classifiers should accept ``y`` (target) arguments to ``fit`` that are
sequences (lists, arrays) of either strings or integers. They should not
assume that the class labels are a contiguous range of integers; instead, they
should store a list of classes in a ``classes_`` attribute or property. The
order of class labels in this attribute should match the order in which
``predict_proba``, ``predict_log_proba`` and ``decision_function`` return their
values. The easiest way to achieve this is to put::
self.classes_, y = np.unique(y, return_inverse=True)
in ``fit``. This returns a new ``y`` that contains class indexes, rather than
labels, in the range [0, ``n_classes``).
A classifier's ``predict`` method should return
arrays containing class labels from ``classes_``.
In a classifier that implements ``decision_function``,
this can be achieved with::
    def predict(self, X):
        D = self.decision_function(X)
        return self.classes_[np.argmax(D, axis=1)]
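The two snippets above can be tried on a small example. The labels and the
decision values ``D`` below are made up for illustration: ``np.unique``
produces the sorted class labels and the index of each sample's class, and
``np.argmax`` maps decision values back to labels.

```python
import numpy as np

y = np.array(["spam", "ham", "spam", "eggs"])
classes_, y_indices = np.unique(y, return_inverse=True)

# Hypothetical decision values: one row per sample, one column per class,
# with columns in the same order as ``classes_``.
D = np.array([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]])
predicted_labels = classes_[np.argmax(D, axis=1)]
```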
In linear models, coefficients are stored in an array called ``coef_``, and the
independent term is stored in ``intercept_``. ``sklearn.linear_model._base``
contains a few base classes and mixins that implement common linear model
The :mod:`~sklearn.utils.multiclass` module contains useful functions
for working with multiclass and multilabel problems.
.. _estimator_tags:
Estimator Tags
.. warning::
The estimator tags are experimental and the API is subject to change.
Scikit-learn introduced estimator tags in version 0.21. These are annotations
of estimators that allow programmatic inspection of their capabilities, such as
sparse matrix support, supported output types and supported methods. The
estimator tags are a dictionary returned by the method ``_get_tags()``. These
tags are used in the common checks run by the
:func:`~sklearn.utils.estimator_checks.check_estimator` function and the
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks` decorator.
Tags determine which checks to run and what input data is appropriate. Tags
can depend on estimator parameters or even system architecture and can in
general only be determined at runtime.
The current set of estimator tags are:
allow_nan (default=False)
whether the estimator supports data with missing values encoded as np.nan
array_api_support (default=False)
whether the estimator supports Array API compatible inputs.
binary_only (default=False)
whether estimator supports binary classification but lacks multi-class
classification support.
multilabel (default=False)
whether the estimator supports multilabel output
multioutput (default=False)
whether a regressor supports multi-target outputs or a classifier supports
multi-class multi-output.
multioutput_only (default=False)
whether estimator supports only multi-output classification or regression.
no_validation (default=False)
whether the estimator skips input-validation. This is only meant for
stateless and dummy transformers!
non_deterministic (default=False)
whether the estimator is not deterministic given a fixed ``random_state``
pairwise (default=False)
This boolean attribute indicates whether the data (`X`) passed to :term:`fit` and
similar methods consists of pairwise measures over samples rather than a
feature representation for each sample. It is usually `True` where an
estimator has a `metric` or `affinity` or `kernel` parameter with value
'precomputed'. Its primary purpose is to support a :term:`meta-estimator`
or a cross validation procedure that extracts a sub-sample of data intended
for a pairwise estimator, where the data needs to be indexed on both axes.
Specifically, this tag is used by
`sklearn.utils.metaestimators._safe_split` to slice rows and
preserves_dtype (default=``[np.float64]``)
applies only on transformers. It corresponds to the data types which will
be preserved such that `X_trans.dtype` is the same as `X.dtype` after
calling `transformer.transform(X)`. If this list is empty, then the
transformer is not expected to preserve the data type. The first value in
the list is considered as the default data type, corresponding to the data
type of the output when the input data type is not going to be preserved.
poor_score (default=False)
whether the estimator fails to provide a "reasonable" test-set score, which
currently for regression is an R2 of 0.5 on ``make_regression(n_samples=200,
n_features=10, n_informative=1, bias=5.0, noise=20, random_state=42)``, and
for classification an accuracy of 0.83 on
``make_blobs(n_samples=300, random_state=0)``. These datasets and values
are based on current estimators in sklearn and might be replaced by
something more systematic.
requires_fit (default=True)
whether the estimator must be fitted before calling one of
`transform`, `predict`, `predict_proba`, or `decision_function`.
requires_positive_X (default=False)
whether the estimator requires positive X.
requires_y (default=False)
whether the estimator requires y to be passed to `fit`, `fit_predict` or
`fit_transform` methods. The tag is True for estimators inheriting from
`~sklearn.base.RegressorMixin` and `~sklearn.base.ClassifierMixin`.
requires_positive_y (default=False)
whether the estimator requires a positive y (only applicable for regression).
_skip_test (default=False)
whether to skip common tests entirely. Don't use this unless you have a
*very good* reason.
_xfail_checks (default=False)
dictionary ``{check_name: reason}`` of common checks that will be marked
as `XFAIL` for pytest, when using
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks`. These
checks will be simply ignored and not run by
:func:`~sklearn.utils.estimator_checks.check_estimator`, but a
`SkipTestWarning` will be raised.
Don't use this unless there is a *very good* reason for your estimator
not to pass the check.
Also note that the usage of this tag is highly subject to change because
we are trying to make it more flexible: be prepared for breaking changes
in the future.
stateless (default=False)
whether the estimator needs access to data for fitting. Even though an
estimator is stateless, it might still need a call to ``fit`` for
X_types (default=['2darray'])
Supported input types for X as list of strings. Tests are currently only
run if '2darray' is contained in the list, signifying that the estimator
takes continuous 2d numpy arrays as input. The default value is
['2darray']. Other possible types are ``'string'``, ``'sparse'``,
``'categorical'``, ``dict``, ``'1dlabels'`` and ``'2dlabels'``. The goal is
that in the future the supported input type will determine the data used
during testing, in particular for ``'string'``, ``'sparse'`` and
``'categorical'`` data. For now, the tests for sparse data do not make use
of the ``'sparse'`` tag.
It is unlikely that the default values for each tag will suit the needs of your
specific estimator. Additional tags can be created or default tags can be
overridden by defining a `_more_tags()` method which returns a dict with the
desired overridden tags or new tags. For example::
    class MyMultiOutputEstimator(BaseEstimator):

        def _more_tags(self):
            return {'multioutput_only': True,
                    'non_deterministic': True}
Any tag that is not in `_more_tags()` will just fall back to the default values
documented above.
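The fallback behavior can be pictured with the following sketch. It assumes a
hypothetical ``TaggedBase`` class and a small, made-up subset of the default
tags; it illustrates the merging idea only and is not scikit-learn's actual
``_get_tags`` code.

```python
# Hypothetical subset of the default tags, for illustration only.
DEFAULT_TAGS = {
    "allow_nan": False,
    "multioutput_only": False,
    "non_deterministic": False,
    "requires_y": False,
}


class TaggedBase:
    """Hypothetical base class; not sklearn's BaseEstimator."""

    def _more_tags(self):
        return {}

    def _get_tags(self):
        tags = dict(DEFAULT_TAGS)
        # Apply overrides from the most generic class to the most specific one.
        for klass in reversed(type(self).__mro__):
            if "_more_tags" in vars(klass):
                tags.update(klass._more_tags(self))
        return tags


class MyMultiOutputEstimator(TaggedBase):
    def _more_tags(self):
        return {"multioutput_only": True, "non_deterministic": True}


tags = MyMultiOutputEstimator()._get_tags()
```

Tags that the subclass does not mention keep their default values, while the
two overridden ones take the values returned by ``_more_tags``.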
Even if it is not recommended, it is possible to override the method
`_get_tags()`. Note however that **all tags must be present in the dict**. If
any of the keys documented above is not present in the output of `_get_tags()`,
an error will occur.
In addition to the tags, estimators also need to declare any non-optional
parameters to ``__init__`` in the ``_required_parameters`` class attribute,
which is a list or tuple. If ``_required_parameters`` is only
``["estimator"]`` or ``["base_estimator"]``, then the estimator will be
instantiated with an instance of ``LogisticRegression`` (or
``Ridge`` if the estimator is a regressor) in the tests. The choice
of these two models is somewhat idiosyncratic but both should provide robust
closed-form solutions.
.. _developer_api_set_output:
Developer API for `set_output`
With `SLEP018 <>`__,
scikit-learn introduces the `set_output` API for configuring transformers to
output pandas DataFrames. The `set_output` API is automatically defined if the
transformer defines :term:`get_feature_names_out` and subclasses
:class:`base.TransformerMixin`. :term:`get_feature_names_out` is used to get the
column names of pandas output.
:class:`base.OneToOneFeatureMixin` and
:class:`base.ClassNamePrefixFeaturesOutMixin` are helpful mixins for defining
:term:`get_feature_names_out`. :class:`base.OneToOneFeatureMixin` is useful when
the transformer has a one-to-one correspondence between input features and output
features, such as :class:`~preprocessing.StandardScaler`.
:class:`base.ClassNamePrefixFeaturesOutMixin` is useful when the transformer
needs to generate its own feature names out, such as :class:`~decomposition.PCA`.
You can opt-out of the `set_output` API by setting `auto_wrap_output_keys=None`
when defining a custom subclass::
    class MyTransformer(TransformerMixin, BaseEstimator, auto_wrap_output_keys=None):

        def fit(self, X, y=None):
            return self
        def transform(self, X, y=None):
            return X
        def get_feature_names_out(self, input_features=None):
            ...
The default value for `auto_wrap_output_keys` is `("transform",)`, which automatically
wraps `fit_transform` and `transform`. The `TransformerMixin` uses the
`__init_subclass__` mechanism to consume `auto_wrap_output_keys` and pass all other
keyword arguments to its super class. Super classes' `__init_subclass__` should
**not** depend on `auto_wrap_output_keys`.
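The `__init_subclass__` mechanism mentioned above can be illustrated without scikit-learn. This is a minimal sketch under stated assumptions: `WrappingMixin` and the `_wrapped_methods` attribute are hypothetical stand-ins, and no actual wrapping of `transform` is performed. It only shows how a class keyword argument is consumed while all other keywords are forwarded to the super class.

```python
class WrappingMixin:
    """Toy stand-in for a mixin that consumes a class keyword argument."""

    def __init_subclass__(cls, auto_wrap_output_keys=("transform",), **kwargs):
        # Forward every other keyword to the next class in the MRO.
        super().__init_subclass__(**kwargs)
        # Record which methods would be wrapped; None means "opt out".
        cls._wrapped_methods = (
            () if auto_wrap_output_keys is None else tuple(auto_wrap_output_keys)
        )


class Wrapped(WrappingMixin):
    pass


class OptedOut(WrappingMixin, auto_wrap_output_keys=None):
    pass


print(Wrapped._wrapped_methods)   # ('transform',)
print(OptedOut._wrapped_methods)  # ()
```

Because the keyword is consumed by the mixin, classes further up the hierarchy never see it, which is why their own `__init_subclass__` should not rely on it.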
For transformers that return multiple arrays in `transform`, auto wrapping will
only wrap the first array and not alter the other arrays.
See :ref:``
for an example on how to use the API.
.. _developer_api_check_is_fitted:
Developer API for `check_is_fitted`
By default :func:`~sklearn.utils.validation.check_is_fitted` checks if there
are any attributes in the instance with a trailing underscore, e.g. `coef_`.
An estimator can change the behavior by implementing a `__sklearn_is_fitted__`
method taking no input and returning a boolean. If this method exists,
:func:`~sklearn.utils.validation.check_is_fitted` simply returns its output.
See :ref:``
for an example on how to use the API.
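The decision rule described above can be sketched in plain Python. This is a simplified illustration, not the library code: `sketch_is_fitted`, `WithAttribute` and `WithHook` are hypothetical names, and the sketch returns a boolean, whereas the real :func:`~sklearn.utils.validation.check_is_fitted` raises a `NotFittedError` when the estimator is not fitted.

```python
def sketch_is_fitted(estimator):
    """Simplified mirror of the rule described above (illustrative only)."""
    # An explicit hook takes precedence when the estimator defines one.
    if hasattr(estimator, "__sklearn_is_fitted__"):
        return estimator.__sklearn_is_fitted__()
    # Otherwise look for instance attributes with a trailing underscore.
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)
    )


class WithAttribute:
    def fit(self):
        self.coef_ = [0.0]
        return self


class WithHook:
    def __init__(self):
        self._fitted = False

    def fit(self):
        self._fitted = True
        return self

    def __sklearn_is_fitted__(self):
        return self._fitted


print(sketch_is_fitted(WithAttribute()))        # False
print(sketch_is_fitted(WithAttribute().fit()))  # True
print(sketch_is_fitted(WithHook()))             # False
print(sketch_is_fitted(WithHook().fit()))       # True
```

The hook is useful for estimators that learn nothing during `fit` and therefore have no trailing-underscore attribute to detect.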
Developer API for HTML representation
.. warning::
The HTML representation API is experimental and the API is subject to change.
Estimators inheriting from :class:`~sklearn.base.BaseEstimator` display
an HTML representation of themselves in interactive programming
environments such as Jupyter notebooks. For instance, we can display this HTML
from sklearn.base import BaseEstimator
The raw HTML representation is obtained by invoking the function
:func:`~sklearn.utils.estimator_html_repr` on an estimator instance.
To customize the URL linking to an estimator's documentation (i.e. when clicking on the
"?" icon), override the `_doc_link_module` and `_doc_link_template` attributes. In
addition, you can provide a `_doc_link_url_param_generator` method. Set
`_doc_link_module` to the name of the (top level) module that contains your estimator.
If the value does not match the top level module name, the HTML representation will not
contain a link to the documentation. For scikit-learn estimators this is set to
The `_doc_link_template` is used to construct the final URL. By default, it can contain
two variables: `estimator_module` (the full name of the module containing the estimator)
and `estimator_name` (the class name of the estimator). If you need more variables you
should implement the `_doc_link_url_param_generator` method which should return a
dictionary of the variables and their values. This dictionary will be used to render the
.. _coding-guidelines:
Coding guidelines
The following are some guidelines on how new code should be written for
inclusion in scikit-learn, and which may be appropriate to adopt in external
projects. Of course, there are special cases and there will be exceptions to
these rules. However, following these rules when submitting new code makes
the review easier so new code can be integrated in less time.
Uniformly formatted code makes it easier to share code ownership. The
scikit-learn project tries to closely follow the official Python guidelines
detailed in `PEP8 <>`_, which
describe how code should be formatted and indented. Please read it and
follow it.
In addition, we add the following guidelines:
* Use underscores to separate words in non class names: ``n_samples``
rather than ``nsamples``.
* Avoid multiple statements on one line. Prefer a line return after
a control flow statement (``if``/``for``).
* Use relative imports for references inside scikit-learn.
* Unit tests are an exception to the previous rule;
they should use absolute imports, exactly as client code would.
A corollary is that, if ```` exports a class or function
that is implemented in ````,
the test should import it from ````.
* **Please don't use** ``import *`` **in any case**. It is considered harmful
by the `official Python recommendations
It makes the code harder to read as the origin of symbols is no
longer explicitly referenced, but most importantly, it prevents
using a static analysis tool like `pyflakes
<>`_ to automatically
find bugs in scikit-learn.
* Use the `numpy docstring standard
in all your docstrings.
A good example of code that we like can be found `here
Input validation
.. currentmodule:: sklearn.utils
The module :mod:`sklearn.utils` contains various functions for doing input
validation and conversion. Sometimes, ``np.asarray`` suffices for validation;
do *not* use ``np.asanyarray`` or ``np.atleast_2d``, since those let NumPy's
``np.matrix`` through, which has a different API
(e.g., ``*`` means dot product on ``np.matrix``,
but Hadamard product on ``np.ndarray``).
In other cases, be sure to call :func:`check_array` on any array-like argument
passed to a scikit-learn API function. The exact parameters to use depend
mainly on whether and which ``scipy.sparse`` matrices must be accepted.
For more information, refer to the :ref:`developers-utils` page.
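The difference between ``np.asarray`` and ``np.asanyarray`` noted above can be checked directly with NumPy. The snippet below is a small illustration with made-up values; the warning filter is only there because NumPy itself discourages ``np.matrix`` and may emit a warning when one is created.

```python
import warnings

import numpy as np

with warnings.catch_warnings():
    # np.matrix is discouraged by NumPy itself and may emit a warning.
    warnings.simplefilter("ignore")
    m = np.matrix([[1, 2], [3, 4]])

kept = np.asanyarray(m)    # the subclass passes through: still np.matrix
converted = np.asarray(m)  # converted to a plain np.ndarray

matrix_product = kept * kept              # ``*`` is the dot product
hadamard_product = converted * converted  # ``*`` is element-wise

print(type(kept).__name__, type(converted).__name__)
print(matrix_product.tolist())
print(hadamard_product.tolist())
```

The same operator gives two different results depending on which conversion was used, which is the reason for preferring ``np.asarray`` or :func:`check_array`.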
Random Numbers
If your code depends on a random number generator, do not use
``numpy.random.random()`` or similar routines. To ensure
repeatability in error checking, the routine should accept a keyword
``random_state`` and use this to construct a
``numpy.random.RandomState`` object.
See :func:`sklearn.utils.check_random_state` in :ref:`developers-utils`.
Here's a simple example of code using some of the above guidelines::
    from sklearn.utils import check_array, check_random_state

    def choose_random_sample(X, random_state=0):
        """Choose a random point from X.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            An array representing the data.
        random_state : int or RandomState instance, default=0
            The seed of the pseudo random number generator that selects a
            random sample. Pass an int for reproducible output across multiple
            function calls.
            See :term:`Glossary <random_state>`.

        Returns
        -------
        x : ndarray of shape (n_features,)
            A random point selected from X.
        """
        X = check_array(X)
        random_state = check_random_state(random_state)
        i = random_state.randint(X.shape[0])
        return X[i]
If you use randomness in an estimator instead of a freestanding function,
some additional guidelines apply.
First off, the estimator should take a ``random_state`` argument to its
``__init__`` with a default value of ``None``.
It should store that argument's value, **unmodified**,
in an attribute ``random_state``.
``fit`` can call ``check_random_state`` on that attribute
to get an actual random number generator.
If, for some reason, randomness is needed after ``fit``,
the RNG should be stored in an attribute ``random_state_``.
The following example should make this clear::
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_random_state

    class GaussianNoise(BaseEstimator, TransformerMixin):
        """This estimator ignores its input and returns random Gaussian noise.

        It also does not adhere to all scikit-learn conventions,
        but showcases how to handle randomness.
        """

        def __init__(self, n_components=100, random_state=None):
            self.random_state = random_state
            self.n_components = n_components

        # the arguments are ignored anyway, so we make them optional
        def fit(self, X=None, y=None):
            self.random_state_ = check_random_state(self.random_state)
            return self

        def transform(self, X):
            n_samples = X.shape[0]
            return self.random_state_.randn(n_samples, self.n_components)
The reason for this setup is reproducibility:
when an estimator is ``fit`` twice to the same data,
it should produce an identical model both times,
hence the validation in ``fit``, not ``__init__``.
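The reproducibility argument can be checked with a small experiment. This is a sketch under simplifying assumptions: `NoiseModel` is a hypothetical class, and it builds a ``numpy.random.RandomState`` directly instead of calling ``check_random_state``, so it only handles an int seed or ``None``.

```python
import numpy as np


class NoiseModel:
    """Hypothetical estimator-like class, used only to show the pattern."""

    def __init__(self, random_state=None):
        # Stored unmodified: no RNG is created or consumed here.
        self.random_state = random_state

    def fit(self, X=None, y=None):
        # A fresh RNG is derived from the stored seed on every call.
        rng = np.random.RandomState(self.random_state)
        self.noise_ = rng.randn(3)
        return self


model = NoiseModel(random_state=0)
first = model.fit().noise_.copy()
second = model.fit().noise_.copy()
print(np.array_equal(first, second))
```

If the RNG were instead created once in ``__init__``, the second call to ``fit`` would continue from the consumed state and produce different values.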
Numerical assertions in tests
When asserting the quasi-equality of arrays of continuous values,
do use `sklearn.utils._testing.assert_allclose`.
The relative tolerance is automatically inferred from the provided arrays
dtypes (for float32 and float64 dtypes in particular) but you can override
via ``rtol``.
When comparing arrays of zero-elements, please do provide a non-zero value for
the absolute tolerance via ``atol``.
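The need for ``atol`` when comparing against zeros can be seen in a short experiment. The sketch below uses ``numpy.testing.assert_allclose`` as a stand-in, with ``rtol`` passed explicitly, since the automatic tolerance inference is specific to the scikit-learn helper; the array values are made up for illustration.

```python
import numpy as np
from numpy.testing import assert_allclose

computed = np.array([1e-12, -3e-13, 0.0])
expected = np.zeros(3)

# With a purely relative tolerance the allowed error is rtol * |expected|,
# which is exactly 0 here, so the check fails for any non-zero noise.
try:
    assert_allclose(computed, expected, rtol=1e-7, atol=0)
    relative_only_passed = True
except AssertionError:
    relative_only_passed = False

# An explicit absolute tolerance makes the comparison meaningful.
assert_allclose(computed, expected, rtol=1e-7, atol=1e-9)
print(relative_only_passed)
```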
For more information, please refer to the docstring of
.. _developers_guide:
Developer's Guide
.. toctree::
Maintainer/Core-Developer Information
This section is about preparing a major release, incrementing the minor
version, or a bug fix release incrementing the patch version. Our convention is
that we release one or more release candidates (0.RRrcN) before releasing the
final distributions. We follow the `PEP101
<>`_ to indicate release candidates,
post, and minor releases.
Before a release
1. Update authors table:
Create a `classic token on GitHub <>`_
with the ``read:org`` permission.
Run the following script, entering the token in:
.. prompt:: bash $
cd build_tools; make authors; cd ..
and commit. This is only needed if the authors have changed since the last
release. This step is sometimes done independently of the release. This
updates the maintainer list, not the contributor list for the release.
2. Confirm any blockers tagged for the milestone are resolved, and that other
issues tagged for the milestone can be postponed.
3. Ensure the change log and commits correspond (within reason!), and that the
change log is reasonably well curated. Some tools for these tasks include:
- ``maint_tools/`` can put what's new entries into
sections. It's not perfect, and requires manual checking of the changes.
If the what's new list is well curated, it may not be necessary.
- The ``maint_tools/`` script may be used to identify pull
requests that were merged but likely missing from What's New.
4. Make sure the deprecations, FIXME and TODOs tagged for the release have
been taken care of.
The release manager must be a *maintainer* of the ``scikit-learn/scikit-learn``
repository to be able to publish on ```` and ````
(via a manual trigger of a dedicated GitHub Actions workflow).
The release manager does not need extra permissions on ```` to publish a
release in particular.
The release manager must be a *maintainer* of the ``conda-forge/scikit-learn-feedstock``
repository. This can be changed by editing the ``recipe/meta.yaml`` file in the
first release pull-request.
.. _preparing_a_release_pr:
Preparing a release PR
Major version release
Prior to branching please do not forget to prepare a Release Highlights page as
a runnable example and check that its HTML rendering looks correct. These
release highlights should be linked from the ``doc/whats_new/v0.99.rst`` file
for the new version of scikit-learn.
Releasing the first RC of e.g. version `0.99.0` involves creating the release
branch `0.99.X` directly on the main repo, where `X` really is the letter X,
**not a placeholder**. The development for the major and minor releases of `0.99`
should **also** happen under `0.99.X`. Each release (rc, major, or minor) is a
tag under that branch.
This is done only once, as the major and minor releases happen on the same
branch:
.. prompt:: bash $
# Assuming upstream is an alias for the main scikit-learn repo:
git fetch upstream main
git checkout upstream/main
git checkout -b 0.99.X
git push --set-upstream upstream 0.99.X
Again, `X` is literal here, and `99` is replaced by the release number.
The branches are called ``0.19.X``, ``0.20.X``, etc.
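The naming convention above can be expressed as a small helper. This is only an illustrative sketch: `release_branch` is a hypothetical function, not part of the scikit-learn tooling, and it assumes version strings of the form ``0.99.0`` or ``0.99.0rc1``.

```python
import re


def release_branch(version):
    """Return the maintenance branch name for a version string."""
    match = re.match(r"^(\d+)\.(\d+)\.\d+(rc\d+)?$", version)
    if match is None:
        raise ValueError(f"unexpected version string: {version!r}")
    # The trailing "X" is a literal character, not a placeholder.
    return f"{match.group(1)}.{match.group(2)}.X"


print(release_branch("0.99.0rc1"))
print(release_branch("1.2.2"))
```

Release candidates, the final release and later bug-fix releases of the same series all map to the same branch name.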
In terms of including changes, the first RC ideally counts as a *feature
freeze*. Each coming release candidate and the final release afterwards will
include only minor documentation changes and bug fixes. Any major enhancement
or feature should be excluded.
Then you can prepare a local branch for the release itself, for instance:
``release-0.99.0rc1``, push it to your GitHub fork and open a PR **to the**
`scikit-learn/0.99.X` **branch**. Copy the :ref:`release_checklist` templates
in the description of the Pull Request to track progress.
This PR will be used to push commits related to the release as explained in
You can also create a second PR from main and targeting main to increment the
``__version__`` variable in `sklearn/` and in `pyproject.toml` to increment
the dev version. This means while we're in the release candidate period, the latest
stable is two versions behind the main branch, instead of one. In this PR targeting
main you should also include a new file for the matching version under the
``doc/whats_new/`` folder so PRs that target the next version can contribute their
changelog entries to this file in parallel to the release process.
Minor version release (also known as bug-fix release)
The minor releases should include bug fixes and some relevant documentation
changes only. Any PR resulting in a behavior change which is not a bug fix
should be excluded. As an example, instructions are given for the `1.2.2` release.
- Create a branch, **on your own fork** (here referred to as `fork`) for the release
from `upstream/main`.
.. prompt:: bash $
git fetch upstream main
git checkout -b release-1.2.2 upstream/main
git push -u fork release-1.2.2:release-1.2.2
- Create a **draft** PR to the `upstream/1.2.X` branch (not to `upstream/main`)
with all the desired changes.
- Do not push anything on that branch yet.
- Locally rebase `release-1.2.2` from the `upstream/1.2.X` branch using:
.. prompt:: bash $
git rebase -i upstream/1.2.X
This will open an interactive rebase with the `git-rebase-todo` containing all
the latest commits on `main`. At this stage, you have to perform
this interactive rebase with at least one other person (three people rebasing
together is better, so that nothing is forgotten and any doubt can be resolved).
- **Do not remove lines; instead, drop commits by replacing** ``pick`` **with** ``drop``
- Commits to pick for bug-fix release *generally* are prefixed with: `FIX`, `CI`,
`DOC`. They should at least include all the commits of the merged PRs
that were milestoned for this release on GitHub and/or documented as such in
the changelog. It's likely that some bugfixes were documented in the
changelog of the main major release instead of the next bugfix release,
in which case, the matching changelog entries will need to be moved,
first in the `main` branch then backported in the release PR.
- Commits to drop for bug-fix release *generally* are prefixed with: `FEAT`,
`MAINT`, `ENH`, `API`. The reason for not including them is to prevent changes of
behavior (which must only appear in breaking or major releases).
- After having dropped or picked commits, **do not exit** but paste the content
of the `git-rebase-todo` message in the PR.
This file is located at `.git/rebase-merge/git-rebase-todo`.
- Save and exit, starting the interactive rebase.
- Resolve merge conflicts when they happen.
- Force push the result of the rebase and the extra release commits to the release PR:
.. prompt:: bash $
git push -f fork release-1.2.2:release-1.2.2
- Copy the :ref:`release_checklist` template and paste it in the description of the
Pull Request to track progress.
- Review all the commits included in the release to make sure that they do not
introduce any new feature. We should not blindly trust the commit message prefixes.
- Remove the draft status of the release PR and invite other maintainers to review the
list of included commits.
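The pick/drop convention for bug-fix releases described above can be turned into a first-pass suggestion. This is a hypothetical helper, not an existing maintenance tool: `suggest_action` and the example commit titles are made up, and the result is only a starting point since commit prefixes should not be trusted blindly.

```python
PICK_PREFIXES = ("FIX", "CI", "DOC")
DROP_PREFIXES = ("FEAT", "MAINT", "ENH", "API")


def suggest_action(commit_title):
    """Suggest a rebase action from the commit title prefix."""
    words = commit_title.split()
    prefix = words[0].rstrip(":").upper() if words else ""
    if prefix in PICK_PREFIXES:
        return "pick"
    if prefix in DROP_PREFIXES:
        return "drop"
    return "review"  # unknown prefix: a human has to decide


titles = [
    "FIX handle empty input in a hypothetical estimator",
    "ENH add a hypothetical new option",
    "TST add a hypothetical regression test",
]
for title in titles:
    print(f"{suggest_action(title):6} {title}")
```

Anything labelled ``review`` still needs to be checked against the milestone and the changelog before deciding.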
.. _making_a_release:
Making a release
0. Ensure that you have checked out the branch of the release PR as explained
in :ref:`preparing_a_release_pr` above.
1. Update docs. Note that this is for the final release, not necessarily for
the RC releases. These changes should be made in main and cherry-picked
into the release branch, only before the final release.
- Edit the ``doc/whats_new/v0.99.rst`` file to add the release title and the list of
contributors.
You can retrieve the list of contributor names with:
$ git shortlog -s 0.98.33.. | cut -f2- | sort --ignore-case | tr '\n' ';' | sed 's/;/, /g;s/, $//' | fold -s
- For major releases, link the release highlights example from the ``doc/whats_new/v0.99.rst`` file.
- Update the release date in ``whats_new.rst``
- Edit the ``doc/templates/index.html`` to change the 'News' entry of the
front page (with the release month as well). Do not forget to remove
the old entries (two years or three releases are typically good
enough) and to update the on-going development entry.
2. On the branch for releasing, update the version number in ``sklearn/``,
the ``__version__`` variable, and in `pyproject.toml`.
For major releases, please add a 0 at the end: `0.99.0` instead of `0.99`.
For the first release candidate, use the `rc1` suffix on the expected final
release number: `0.99.0rc1`.
3. Trigger the wheel builder with the ``[cd build]`` commit marker using
the command:
.. prompt:: bash $
git commit --allow-empty -m "Trigger wheel builder workflow: [cd build]"
The wheel building workflow is managed by GitHub Actions and the results can be browsed at:
.. note::
Before building the wheels, make sure that the ``pyproject.toml`` file is
up to date and using the oldest version of ``numpy`` for each Python version
to avoid `ABI <>`_
incompatibility issues. Moreover, a new line has to be included in the
``pyproject.toml`` file for each new supported version of Python.
.. note::
The acronym CD in `[cd build]` stands for `Continuous Delivery
<>`_ and refers to the
automation used to generate the release artifacts (binary and source
packages). This can be seen as an extension to CI which stands for
`Continuous Integration
<>`_. The CD workflow on
GitHub Actions is also used to automatically create nightly builds and
publish packages for the development branch of scikit-learn. See
4. Once all the CD jobs have completed successfully in the PR, merge it,
again with the `[cd build]` marker in the commit message. This time
the results will be uploaded to the staging area.
You should then be able to upload the generated artifacts (.tar.gz and .whl
files) to using the "Run workflow" form for the
following GitHub Actions workflow:
5. If this went fine, you can proceed with tagging. Proceed with caution.
Ideally, tags should be created when you're almost certain that the release
is ready, since adding a tag to the main repo can trigger certain automated
Create the tag and push it (if it's an RC, it can be ``0.xx.0rc1`` for
.. prompt:: bash $
git tag -a 0.99.0 # in the 0.99.X branch
git push 0.99.0
6. Confirm that the bot has detected the tag on the conda-forge feedstock repo:
|||| If not, submit a PR for the
release. If you want to publish an RC release on conda-forge, the PR should target
the `rc` branch as opposed to the `main` branch. The two branches need to be kept
in sync otherwise.
7. Trigger the GitHub Actions workflow again but this time to upload the artifacts
to the real (replace "testpypi" by "pypi" in the "Run
workflow" form).
8. **Alternative to step 7**: it's possible to collect locally the generated binary
wheel packages and source tarball and upload them all to PyPI by running the
following commands in the scikit-learn source folder (checked out at the
release tag):
.. prompt:: bash $
rm -r dist
pip install -U wheelhouse_uploader twine
python -m wheelhouse_uploader fetch \
--version 0.99.0 \
--local-folder dist \
scikit-learn \
This command will download all the binary packages accumulated in the
`staging area on the hosting service
<>`_ and
put them in your local `./dist` folder.
Check the content of the `./dist` folder: it should contain all the wheels
along with the source tarball ("scikit-learn-RRR.tar.gz").
Make sure that you do not have developer versions or older versions of
the scikit-learn package in that folder.
Before uploading to pypi, you can test upload to
.. prompt:: bash $
twine upload --verbose --repository-url dist/*
Upload everything at once to
.. prompt:: bash $
twine upload dist/*
9. For major/minor (not bug-fix release or release candidates), update the symlink for
``stable`` and the ``latestStable`` variable in
.. prompt:: bash $
cd /tmp
git clone --depth 1 --no-checkout
echo stable > .git/info/sparse-checkout
git checkout main
rm stable
ln -s 0.999 stable
sed -i "s/latestStable = '.*/latestStable = '0.999';/" versionwarning.js
git add stable versionwarning.js
git commit -m "Update stable to point to 0.999"
git push origin main
10. Update ```` to reflect the latest supported version.
.. _release_checklist:
Release checklist
The following GitHub checklist might be helpful in a release PR::
* [ ] update news and what's new date in release branch
* [ ] update news and what's new date and sklearn dev0 version in main branch
* [ ] check that the wheels for the release can be built successfully
* [ ] merge the PR with `[cd build]` commit message to upload wheels to the staging repo
* [ ] upload the wheels and source tarball to
* [ ] create tag on the main github repo
* [ ] confirm bot detected at
|||| and wait for merge
* [ ] upload the wheels and source tarball to PyPI
* [ ] publish (except for RC)
* [ ] announce on mailing list and on Twitter, and LinkedIn
* [ ] update symlink for stable in
|||| (only major/minor)
* [ ] update in main branch (except for RC)
Merging Pull Requests
Individual commits are squashed when a Pull Request (PR) is merged on GitHub.
Before merging,
- the resulting commit title can be edited if necessary. Note
that this will rename the PR title by default.
- the detailed description, containing the titles of all the commits, can
be edited or deleted.
- for PRs with multiple code contributors care must be taken to keep
the `Co-authored-by: name <>` tags in the detailed
description. This will mark the PR as having `multiple co-authors
Whether code contributions are significant enough to merit co-authorship is
left to the maintainer's discretion, same as for the "what's new" entry.
The web site
The scikit-learn web site ( is hosted at GitHub,
but should rarely be updated manually by pushing to the
|||| repository. Most
updates can be made by pushing to master (for /dev) or a release branch
like 0.99.X, from which Circle CI builds and uploads the documentation
Experimental features
The :mod:`sklearn.experimental` module was introduced in 0.21 and contains
experimental features / estimators that are subject to change without
deprecation cycle.
To create an experimental module, you can just copy and modify the content of
.. note::
These are permalinks as of 0.24, where these estimators are still
experimental. They might be stable at the time of reading - hence the
permalink. See below for instructions on the transition from experimental
to stable.
Note that the public import path must be to a public subpackage (like
``sklearn/ensemble`` or ``sklearn/impute``), not just a ``.py`` module.
Also, the (private) experimental features that are imported must be in a
submodule/subpackage of the public subpackage, e.g.
``sklearn/ensemble/_hist_gradient_boosting/`` or
``sklearn/impute/``. This is needed so that pickles still work
in the future when the features aren't experimental anymore.
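The constraint on import paths can be illustrated with a toy model of the mechanism. Everything in this sketch is an assumption made for illustration: the `mypkg` names, `MyExperimentalEstimator` and `enable_my_experimental_feature` are hypothetical, and `types.ModuleType` objects stand in for real packages. The idea shown is that the enabling step only attaches an already existing class to the public subpackage.

```python
import types

# Stand-ins for a public subpackage and its private implementation module.
public_pkg = types.ModuleType("mypkg.ensemble")
public_pkg.__all__ = []
private_mod = types.ModuleType("mypkg.ensemble._my_feature")


class MyExperimentalEstimator:
    pass


# The class lives under the public subpackage from day one, so its pickled
# import path does not change when the feature becomes stable.
MyExperimentalEstimator.__module__ = private_mod.__name__
private_mod.MyExperimentalEstimator = MyExperimentalEstimator


def enable_my_experimental_feature():
    """What importing the hypothetical ``enable_*`` module would do."""
    setattr(public_pkg, "MyExperimentalEstimator", MyExperimentalEstimator)
    public_pkg.__all__ += ["MyExperimentalEstimator"]


print(hasattr(public_pkg, "MyExperimentalEstimator"))  # False
enable_my_experimental_feature()
print(hasattr(public_pkg, "MyExperimentalEstimator"))  # True
```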
To avoid type checker (e.g. mypy) errors a direct import of experimental
estimators should be done in the parent module, protected by the
``if typing.TYPE_CHECKING`` check. See `sklearn/ensemble/
or `sklearn/impute/
for an example.
Please also write basic tests following those in
Make sure all user-facing code you write explicitly mentions that the feature
is experimental, and add a ``# noqa`` comment to avoid pep8-related warnings::
# To use this experimental feature, we need to explicitly ask for it:
from sklearn.experimental import enable_hist_gradient_boosting # noqa
from sklearn.ensemble import HistGradientBoostingRegressor
For the docs to render properly, please also import
``enable_my_experimental_feature`` in ``doc/``, else sphinx won't be
able to import the corresponding modules. Note that using ``from
sklearn.experimental import *`` **does not work**.
Note that some experimental classes / functions are not included in the
:mod:`sklearn.experimental` module: ``sklearn.datasets.fetch_openml``.
Once the feature becomes stable, remove all `enable_my_experimental_feature`
in the scikit-learn code (even feature highlights etc.) and make the
`enable_my_experimental_feature` a no-op that just raises a warning:
The file should stay there indefinitely as we don't want to break users' code:
we just incentivize them to remove that import with the warning.
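A no-op of this kind can be sketched as follows. This is an assumption-laden illustration rather than the actual file content: the message text and the ``UserWarning`` category are placeholders, and the warning is wrapped in a function here only so that the snippet can be run on its own, whereas a real ``enable_*`` module would issue it at import time.

```python
import warnings


def enable_my_experimental_feature():
    """Body of the former ``enable_*`` module once the feature is stable."""
    warnings.warn(
        "my_experimental_feature is now stable: this import is no longer "
        "needed and can be removed.",
        UserWarning,
        stacklevel=2,
    )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    enable_my_experimental_feature()

print(caught[0].category.__name__)
```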
Also update the tests accordingly: `
.. _minimal_reproducer:
Crafting a minimal reproducer for scikit-learn
Whether submitting a bug report, designing a suite of tests, or simply posting a
question in the discussions, being able to craft minimal, reproducible examples
(or minimal, workable examples) is the key to communicating effectively and
efficiently with the community.
There are very good guidelines on the internet such as `this StackOverflow
document <>`_ or `this blogpost by Matthew
Rocklin <>`_
on crafting Minimal Complete Verifiable Examples (referred to below as MCVE).
Our goal is not to be repetitive with those references but rather to provide a
step-by-step guide on how to narrow down a bug until you have reached the
shortest possible code to reproduce it.
The first step before submitting a bug report to scikit-learn is to read the
`Issue template
It is already quite informative about the information you will be asked to
provide.
.. _good_practices:
Good practices
In this section we will focus on the **Steps/Code to Reproduce** section of the
`Issue template
We will start with a snippet of code that already provides a failing example but
that has room for readability improvement. We then craft an MCVE from it.
.. code-block:: python
# I am currently working in a ML project and when I tried to fit a
# GradientBoostingRegressor instance to my_data.csv I get a UserWarning:
# "X has feature names, but DecisionTreeRegressor was fitted without
# feature names". You can get a copy of my dataset from
# and verify my features do have
# names. The problem seems to arise during fit when I pass an integer
# to the n_iter_no_change parameter.
df = pd.read_csv('my_data.csv')
X = df[["feature_name"]] # my features do have names
y = df["target"]
# We set random_state=42 for the train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# An instance with default n_iter_no_change raises no error nor warnings
gbdt = GradientBoostingRegressor(random_state=0)
gbdt.fit(X_train, y_train)
default_score = gbdt.score(X_test, y_test)
# the bug appears when I change the value for n_iter_no_change
gbdt = GradientBoostingRegressor(random_state=0, n_iter_no_change=5)
gbdt.fit(X_train, y_train)
other_score = gbdt.score(X_test, y_test)
Provide a failing code example with minimal comments
Writing instructions to reproduce the problem in English is often ambiguous.
Better make sure that all the necessary details to reproduce the problem are
illustrated in the Python code snippet to avoid any ambiguity. Besides, by this
point you already provided a concise description in the **Describe the bug**
section of the `Issue template
The following code, while **still not minimal**, is already **much better**
because it can be copy-pasted in a Python terminal to reproduce the problem in
one step. In particular:
- it contains **all necessary imports statements**;
- it can fetch the public dataset without having to manually download a
file and put it in the expected location on the disk.
**Improved example**
.. code-block:: python
import pandas as pd
df = pd.read_csv("")
X = df[["feature_name"]]
y = df["target"]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
from sklearn.ensemble import GradientBoostingRegressor
gbdt = GradientBoostingRegressor(random_state=0)
gbdt.fit(X_train, y_train)  # no warning
default_score = gbdt.score(X_test, y_test)
gbdt = GradientBoostingRegressor(random_state=0, n_iter_no_change=5)
gbdt.fit(X_train, y_train)  # raises warning
other_score = gbdt.score(X_test, y_test)
Boil down your script to something as small as possible
You have to ask yourself which lines of code are relevant and which are not for
reproducing the bug. Deleting unnecessary lines of code or simplifying the
function calls by omitting unrelated non-default options will help you and other
contributors narrow down the cause of the bug.
In particular, for this specific example:
- the warning has nothing to do with the `train_test_split` since it already
appears in the training step, before we use the test set.
- similarly, the lines that compute the scores on the test set are not
necessary;
- the bug can be reproduced for any value of `random_state` so leave it to its
default;
- the bug can be reproduced without preprocessing the data with the
`StandardScaler`.
**Improved example**
.. code-block:: python
import pandas as pd
df = pd.read_csv("")
X = df[["feature_name"]]
y = df["target"]
from sklearn.ensemble import GradientBoostingRegressor
gbdt = GradientBoostingRegressor()
gbdt.fit(X, y)  # no warning
gbdt = GradientBoostingRegressor(n_iter_no_change=5)
gbdt.fit(X, y)  # raises warning
**DO NOT** report your data unless it is extremely necessary
The idea is to make the code as self-contained as possible. For doing so, you
can use a :ref:`synth_data`. It can be generated using numpy, pandas or the
:mod:`sklearn.datasets` module. Most of the time the bug is not related to a
particular structure of your data. Even if it is, try to find an available
dataset that has similar characteristics to yours and that reproduces the
problem. In this particular case, we are interested in data that has labeled
feature names.
**Improved example**
.. code-block:: python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
df = pd.DataFrame(
    {
        "feature_name": [-12.32, 1.43, 30.01, 22.17],
        "target": [72, 55, 32, 43],
    }
)
X = df[["feature_name"]]
y = df["target"]
gbdt = GradientBoostingRegressor()
gbdt.fit(X, y)  # no warning
gbdt = GradientBoostingRegressor(n_iter_no_change=5)
gbdt.fit(X, y)  # raises warning
As already mentioned, the key to communication is the readability of the code
and good formatting can really be a plus. Notice that in the previous snippet we:
- try to limit all lines to a maximum of 79 characters to avoid horizontal
scrollbars in the code snippets blocks rendered on the GitHub issue;
- use blank lines to separate groups of related functions;
- place all the imports in their own group at the beginning.
The simplification steps presented in this guide can be implemented in a
different order than the progression we have shown here. The important points
are:
- a minimal reproducer should be runnable by a simple copy-and-paste in a
python terminal;
- it should be simplified as much as possible by removing any code steps
that are not strictly needed to reproduce the original problem;
- it should ideally only rely on a minimal dataset generated on-the-fly by
running the code instead of relying on external data, if possible.
Use markdown formatting
To format code or text into its own distinct block, use triple backticks.
Markdown supports an optional language identifier to enable syntax highlighting in your
fenced code block. For example::
from sklearn.datasets import make_blobs
n_samples = 100
n_components = 3
X, y = make_blobs(n_samples=n_samples, centers=n_components)
will render a python formatted snippet as follows
.. code-block:: python
from sklearn.datasets import make_blobs
n_samples = 100
n_components = 3
X, y = make_blobs(n_samples=n_samples, centers=n_components)
It is not necessary to create several blocks of code when submitting a bug
report. Remember other reviewers are going to copy-paste your code and having a
single cell will make their task easier.
In the section named **Actual results** of the `Issue template
you are asked to provide the error message including the full traceback of the
exception. In this case, use the `python-traceback` qualifier. For example::
TypeError Traceback (most recent call last)
<ipython-input-1-a674e682c281> in <module>
4 vectorizer = CountVectorizer(input=docs, analyzer='word')
5 lda_features = vectorizer.fit_transform(docs)
----> 6 lda_model = LatentDirichletAllocation(
7 n_topics=10,
8 learning_method='online',
TypeError: __init__() got an unexpected keyword argument 'n_topics'
yields the following when rendered:
.. code-block:: python
TypeError Traceback (most recent call last)
<ipython-input-1-a674e682c281> in <module>
4 vectorizer = CountVectorizer(input=docs, analyzer='word')
5 lda_features = vectorizer.fit_transform(docs)
----> 6 lda_model = LatentDirichletAllocation(
7 n_topics=10,
8 learning_method='online',
TypeError: __init__() got an unexpected keyword argument 'n_topics'
.. _synth_data:
Synthetic dataset
Before choosing a particular synthetic dataset, first you have to identify the
type of problem you are solving: Is it a classification, a regression,
a clustering, etc?
Once you have narrowed down the type of problem, you need to provide a synthetic
dataset accordingly. Most of the time you only need a minimalistic dataset.
Here is a non-exhaustive list of tools that may help you.
NumPy tools such as `numpy.random.randn
and `numpy.random.randint
can be used to create dummy numeric data.
- regression
Regressions take continuous numeric data as features and target.
.. code-block:: python
import numpy as np
rng = np.random.RandomState(0)
n_samples, n_features = 5, 5
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)
A similar snippet can be used as synthetic data when testing scaling tools such
as :class:`sklearn.preprocessing.StandardScaler`.
- classification
If the bug is not raised when encoding a categorical variable, you can
feed numeric data to a classifier. Just remember to ensure that the target
is indeed an integer.
.. code-block:: python
import numpy as np
rng = np.random.RandomState(0)
n_samples, n_features = 5, 5
X = rng.randn(n_samples, n_features)
y = rng.randint(0, 2, n_samples) # binary target with values in {0, 1}
If the bug only happens with non-numeric class labels, you might want to
generate a random target with `numpy.random.choice
.. code-block:: python
import numpy as np
rng = np.random.RandomState(0)
n_samples, n_features = 50, 5
X = rng.randn(n_samples, n_features)
y = rng.choice(
    ["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
)
Some scikit-learn objects expect pandas dataframes as input. In this case you can
transform numpy arrays into pandas objects using ``pandas.DataFrame`` or
``pandas.Series``:
.. code-block:: python
import numpy as np
import pandas as pd
rng = np.random.RandomState(0)
n_samples, n_features = 5, 5
X = pd.DataFrame(
    {
        "continuous_feature": rng.randn(n_samples),
        "positive_feature": rng.uniform(low=0.0, high=100.0, size=n_samples),
        "categorical_feature": rng.choice(["a", "b", "c"], size=n_samples),
    }
)
y = pd.Series(rng.randn(n_samples))
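Returning to the regression snippet, the sketch below feeds the same kind of random matrix to :class:`sklearn.preprocessing.StandardScaler`. The shift and scale applied to ``X`` are arbitrary values chosen for this illustration.

```python
import numpy as np

from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_samples, n_features = 5, 5
# Shift and rescale the standard normal draws so that scaling has a visible effect.
X = 10.0 + 3.0 * rng.randn(n_samples, n_features)

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # close to 0 for every feature
print(X_scaled.std(axis=0))  # close to 1 for every feature
```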
In addition, scikit-learn includes various :ref:`sample_generators` that can be
used to build artificial datasets of controlled size and complexity.
As hinted by the name, :class:`sklearn.datasets.make_regression` produces
regression targets with noise as an optionally-sparse random linear combination
of random features.
.. code-block:: python
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=1000, n_features=20)
:class:`sklearn.datasets.make_classification` creates multiclass datasets with multiple Gaussian
clusters per class. Noise can be introduced by means of correlated, redundant or
uninformative features.
.. code-block:: python
from sklearn.datasets import make_classification
X, y = make_classification(
    n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1
)
Similarly to `make_classification`, :class:`sklearn.datasets.make_blobs` creates
multiclass datasets using normally-distributed clusters of points. It provides
greater control regarding the centers and standard deviations of each cluster,
and therefore it is useful to demonstrate clustering.
.. code-block:: python
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10, centers=3, n_features=2)
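The control over centers and standard deviations mentioned above can be sketched as follows. The center coordinates and the per-cluster standard deviations are illustrative values only.

```python
import numpy as np

from sklearn.datasets import make_blobs

# Explicit centers and one standard deviation per cluster (illustrative values).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X, y = make_blobs(
    n_samples=300,
    centers=centers,
    cluster_std=[0.5, 1.0, 2.0],
    random_state=0,
)

# Empirical mean and spread of each generated cluster.
for label in range(len(centers)):
    members = X[y == label]
    print(label, members.mean(axis=0), members.std(axis=0))
```

Passing ``random_state`` makes the generated dataset identical from one run to the next, which is what a reproducer needs.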
Dataset loading utilities
You can use the :ref:`datasets` to load and fetch several popular reference
datasets. This option is useful when the bug relates to the particular structure
of the data, e.g. dealing with missing values or image recognition.
.. code-block:: python
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
.. _performance-howto:
How to optimize for speed
The following gives some practical guidelines to help you write efficient
code for the scikit-learn project.
.. note::
While it is always useful to profile your code so as to **check
performance assumptions**, it is also highly recommended
to **review the literature** to ensure that the implemented algorithm
is the state of the art for the task before investing into costly
implementation optimization.
Time and again, hours of effort invested in optimizing complicated
implementation details have been rendered irrelevant by the subsequent
discovery of simple **algorithmic tricks**, or by using another algorithm
altogether that is better suited to the problem.
The section :ref:`warm-restarts` gives an example of such a trick.
Python, Cython or C/C++?
.. currentmodule:: sklearn
In general, the scikit-learn project emphasizes the **readability** of
the source code to make it easy for the project users to dive into the
source code so as to understand how the algorithm behaves on their data
but also for ease of maintainability (by the developers).
When implementing a new algorithm, it is thus recommended to **start
implementing it in Python using Numpy and Scipy** by taking care of avoiding
looping code using the vectorized idioms of those libraries. In practice
this means trying to **replace any nested for loops by calls to equivalent
Numpy array methods**. The goal is to avoid the CPU wasting time in the
Python interpreter rather than crunching numbers to fit your statistical
model. It's generally a good idea to consider NumPy and SciPy performance tips:
Sometimes however an algorithm cannot be expressed efficiently in simple
vectorized Numpy code. In this case, the recommended strategy is the following:
1. **Profile** the Python implementation to find the main bottleneck and
isolate it in a **dedicated module level function**. This function
will be reimplemented as a compiled extension module.
2. If there exists a well maintained BSD or MIT **C/C++** implementation
of the same algorithm that is not too big, you can write a
**Cython wrapper** for it and include a copy of the source code
of the library in the scikit-learn source tree: this strategy is
used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and
:class:`linear_model.LogisticRegression` (wrappers for liblinear
and libsvm).
3. Otherwise, write an optimized version of your Python function using
**Cython** directly. This strategy is used
for the :class:`linear_model.ElasticNet` and
:class:`linear_model.SGDClassifier` classes for instance.
4. **Move the Python version of the function into the tests** and use
it to check that the results of the compiled extension are consistent
with the gold standard, easy to debug Python version.
5. Once the code is optimized (no simple bottleneck spottable by
profiling), check whether it is possible to have **coarse grained
parallelism** that is amenable to **multi-processing** by using the
``joblib.Parallel`` class.
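The vectorization advice and step 4 can be sketched together as follows: a loop-based function is kept as an easy-to-debug reference, and a vectorized NumPy version is checked against it. Both function names are hypothetical, and the squared Euclidean distance is only an example computation.

```python
import numpy as np


def pairwise_sq_dists_reference(X, Y):
    """Loop-based version: slow, but easy to read and debug."""
    out = np.empty((X.shape[0], Y.shape[0]))
    for i in range(X.shape[0]):
        for j in range(Y.shape[0]):
            diff = X[i] - Y[j]
            out[i, j] = np.dot(diff, diff)
    return out


def pairwise_sq_dists_vectorized(X, Y):
    """Same result using ||x||^2 - 2 x.y + ||y||^2 and broadcasting."""
    XX = np.sum(X * X, axis=1)[:, np.newaxis]
    YY = np.sum(Y * Y, axis=1)[np.newaxis, :]
    dists = XX - 2.0 * (X @ Y.T) + YY
    # Rounding errors can produce tiny negative values; clip them to zero.
    return np.maximum(dists, 0.0)


rng = np.random.RandomState(0)
X = rng.randn(30, 4)
Y = rng.randn(20, 4)

# The reference implementation acts as the gold standard in the test.
expected = pairwise_sq_dists_reference(X, Y)
actual = pairwise_sq_dists_vectorized(X, Y)
np.testing.assert_allclose(actual, expected, atol=1e-10)
```

The same comparison applies unchanged when the fast version is a compiled extension rather than vectorized NumPy code.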
When using Cython, use either
.. prompt:: bash $
python build_ext -i
python install
to generate C files. You are responsible for adding .c/.cpp extensions along
with build parameters in each submodule ````.
C/C++ generated files are embedded in distributed stable packages. The goal is
to make it possible to install scikit-learn stable version
on any machine with Python, Numpy, Scipy and C/C++ compiler.
.. _profiling-python-code:
Profiling Python code
In order to profile Python code we recommend writing a script that
loads and prepares your data and then using the IPython integrated profiler
to interactively explore the relevant part of the code.
Suppose we want to profile the Non Negative Matrix Factorization module
of scikit-learn. Let us set up a new IPython session and load the digits
dataset as in the :ref:`` example::
In [1]: from sklearn.decomposition import NMF
In [2]: from sklearn.datasets import load_digits
In [3]: X, _ = load_digits(return_X_y=True)
Before starting the profiling session and engaging in tentative
optimization iterations, it is important to measure the total execution
time of the function we want to optimize without any kind of profiler
overhead and save it somewhere for later reference::
In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
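Outside IPython, a comparable baseline can be recorded with the standard library ``timeit`` module. In this sketch ``workload`` is a hypothetical stand-in for the function to optimize.

```python
import timeit


def workload():
    # Hypothetical stand-in for the code to optimize.
    return sum(i * i for i in range(100_000))


# Time three independent runs of one call each and keep the best one,
# similar in spirit to the "best of 3" figure shown above.
timings = timeit.repeat(workload, number=1, repeat=3)
baseline = min(timings)
print(f"best of {len(timings)}: {baseline:.4f} s per loop")
```

Keeping ``baseline`` around makes it possible to compare later optimization attempts against the same measurement.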
To have a look at the overall performance profile, use the ``%prun``
magic command::
In [5]: %prun -l NMF(n_components=16, tol=1e-2).fit(X)
14496 function calls in 1.682 CPU seconds
Ordered by: internal time
List reduced from 90 to 9 due to restriction <''>
ncalls tottime percall cumtime percall filename:lineno(function)
36 0.609 0.017 1.499 0.042
1263 0.157 0.000 0.157 0.000
1 0.053 0.053 1.681 1.681
673 0.008 0.000 0.057 0.000
1 0.006 0.006 0.047 0.047
36 0.001 0.000 0.010 0.000
30 0.001 0.000 0.001 0.000
1 0.000 0.000 0.000 0.000
1 0.000 0.000 1.681 1.681
The ``tottime`` column is the most interesting: it gives the total time spent
executing the code of a given function ignoring the time spent in executing the
sub-functions. The real total time (local code + sub-function calls) is given by
the ``cumtime`` column.
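The difference between the two columns can be reproduced outside IPython with the standard library ``cProfile`` and ``pstats`` modules. In this sketch ``inner`` and ``outer`` are hypothetical functions: ``outer`` does almost no work of its own, so its ``tottime`` is small while its ``cumtime`` includes the time spent in ``inner``.

```python
import cProfile
import io
import pstats


def inner(n):
    # The work done here is attributed to this function and its generator.
    return sum(i * i for i in range(n))


def outer(n_calls, n):
    # Mostly delegates to inner: small tottime, large cumtime.
    return [inner(n) for _ in range(n_calls)]


profiler = cProfile.Profile()
profiler.enable()
result = outer(20, 10_000)
profiler.disable()

# Sort by internal time, as ``%prun`` does by default, and keep 5 rows.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```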
Note the use of the ``-l`` option that restricts the output to lines that
contain the "" string. This is useful to have a quick look at the hotspot
of the nmf Python module itself, ignoring anything else.
Here is the beginning of the output of the same command without the ``-l``
filter::
In [5]: %prun NMF(n_components=16, tol=1e-2).fit(X)
16159 function calls in 1.840 CPU seconds
Ordered by: internal time
ncalls tottime percall cumtime percall filename:lineno(function)
2833 0.653 0.000 0.653 0.000 {}
46 0.651 0.014 1.636 0.036
1397 0.171 0.000 0.171 0.000
2780 0.167 0.000 0.167 0.000 {method 'sum' of 'numpy.ndarray' objects}
1 0.064 0.064 1.840 1.840
1542 0.043 0.000 0.043 0.000 {method 'flatten' of 'numpy.ndarray' objects}
337 0.019 0.000 0.019 0.000 {method 'all' of 'numpy.ndarray' objects}
2734 0.011 0.000 0.181 0.000
2 0.010 0.005 0.010 0.005 {numpy.linalg.lapack_lite.dgesdd}
748 0.009 0.000 0.065 0.000
The above results show that the execution is largely dominated by
dot product operations (delegated to BLAS). Hence there is probably
no huge gain to expect by rewriting this code in Cython or C/C++: in
this case out of the 1.7s total execution time, almost 0.7s are spent
in compiled code we can consider optimal. By rewriting the rest of the
Python code and assuming we could achieve a 1000% boost on this portion
(which is highly unlikely given the shallowness of the Python loops),
we would not gain more than a 2.4x speed-up globally.
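The arithmetic behind this bound can be checked with a few lines. The sketch assumes that a "1000% boost" means the Python portion runs 10 times faster; the function name is hypothetical.

```python
def global_speedup(total_time, optimizable_time, local_speedup):
    """Overall speed-up when only ``optimizable_time`` is made faster."""
    untouched_time = total_time - optimizable_time
    return total_time / (untouched_time + optimizable_time / local_speedup)


total_time = 1.7  # seconds, from the %timeit measurement
compiled_time = 0.7  # seconds already spent in compiled code
python_time = total_time - compiled_time

# Assumption: a "1000% boost" is read here as a 10x faster Python portion.
tenfold = global_speedup(total_time, python_time, 10)
# Upper bound: the Python portion takes no time at all.
upper_bound = global_speedup(total_time, python_time, float("inf"))
print(f"10x on the Python part: {tenfold:.2f}x overall")
print(f"upper bound: {upper_bound:.2f}x overall")
```

Under these assumptions the overall gain is about 2.1x, and even an infinitely fast Python portion caps it at 1.7 / 0.7, that is about 2.4x.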
Hence major improvements can only be achieved by **algorithmic
improvements** in this particular example (e.g. trying to find operations
that are both costly and useless, so as to avoid computing them rather than
trying to optimize their implementation).
It is however still interesting to check what's happening inside the
``_nls_subproblem`` function which is the hotspot if we only consider
Python code: it takes around 100% of the accumulated time of the module. In
order to better understand the profile of this specific function, let
us install ``line_profiler`` and wire it to IPython:
.. prompt:: bash $
pip install line_profiler
**Under IPython 0.13+**, first create a configuration profile:
.. prompt:: bash $
ipython profile create
Then register the line_profiler extension in
This will register the ``%lprun`` magic command in the IPython terminal application and the other frontends such as qtconsole and notebook.
Now restart IPython and let us use this new toy::
In [1]: from sklearn.datasets import load_digits
In [2]: from sklearn.decomposition import NMF
... : from sklearn.decomposition._nmf import _nls_subproblem
In [3]: X, _ = load_digits(return_X_y=True)
In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
Timer unit: 1e-06 s
File: sklearn/decomposition/
Function: _nls_subproblem at line 137
Total time: 1.73153 s
Line # Hits Time Per Hit % Time Line Contents
137 def _nls_subproblem(V, W, H_init, tol, max_iter):
138 """Non-negative least square solver
170 """
171 48 5863 122.1 0.3 if (H_init < 0).any():
172 raise ValueError("Negative values in H_init passed to NLS solver.")
174 48 139 2.9 0.0 H = H_init
175 48 112141 2336.3 5.8 WtV =, V)
176 48 16144 336.3 0.8 WtW =, W)
178 # values justified in the paper
179 48 144 3.0 0.0 alpha = 1
180 48 113 2.4 0.0 beta = 0.1
181 638 1880 2.9 0.1 for n_iter in range(1, max_iter + 1):
182 638 195133 305.9 10.2 grad =, H) - WtV
183 638 495761 777.1 25.9 proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
184 638 2449 3.8 0.1 if proj_gradient < tol:
185 48 130 2.7 0.0 break
187 1474 4474 3.0 0.2 for inner_iter in range(1, 20):
188 1474 83833 56.9 4.4 Hn = H - alpha * grad
189 # Hn = np.where(Hn > 0, Hn, 0)
190 1474 194239 131.8 10.1 Hn = _pos(Hn)
191 1474 48858 33.1 2.5 d = Hn - H
192 1474 150407 102.0 7.8 gradd = np.sum(grad * d)
193 1474 515390 349.7 26.9 dQd = np.sum(, d) * d)
By looking at the top values of the ``% Time`` column it is really easy to
pin-point the most expensive expressions that would deserve additional care.
Memory usage profiling
You can analyze in detail the memory usage of any Python code with the help of
`memory_profiler <>`_. First,
install the latest version:
.. prompt:: bash $
pip install -U memory_profiler
Then, setup the magics in a manner similar to ``line_profiler``.
**Under IPython 0.11+**, first create a configuration profile:
.. prompt:: bash $
ipython profile create
Then register the extension in
alongside the line profiler::
This will register the ``%memit`` and ``%mprun`` magic commands in the
IPython terminal application and the other frontends such as qtconsole and notebook.
``%mprun`` is useful to examine, line-by-line, the memory usage of key
functions in your program. It is very similar to ``%lprun``, discussed in the
previous section. For example, from the ``memory_profiler`` ``examples``
directory::
In [1]: from example import my_func
In [2]: %mprun -f my_func my_func()
Line # Mem usage Increment Line Contents
3 @profile
4 5.97 MB 0.00 MB def my_func():
5 13.61 MB 7.64 MB a = [1] * (10 ** 6)
6 166.20 MB 152.59 MB b = [2] * (2 * 10 ** 7)
7 13.61 MB -152.59 MB del b
8 13.61 MB 0.00 MB return a
Another useful magic that ``memory_profiler`` defines is ``%memit``, which is
analogous to ``%timeit``. It can be used as follows::
In [1]: import numpy as np
In [2]: %memit np.zeros(10**7)
maximum of 3: 76.402344 MB per loop
For more details, see the docstrings of the magics, using ``%memit?`` and
``%mprun?``.
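When ``memory_profiler`` is not available, a rough picture of current and peak memory can be obtained with the standard library ``tracemalloc`` module. This is a different tool from ``memory_profiler``: it only tracks allocations made through Python's memory allocators, and the list sizes below are smaller than in the example above to keep the sketch light.

```python
import tracemalloc


def my_func():
    a = [1] * (10 ** 6)
    b = [2] * (2 * 10 ** 6)  # temporary, released before returning
    del b
    return a


tracemalloc.start()
a = my_func()
# ``current`` counts what is still allocated, ``peak`` the maximum reached.
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 2**20:.1f} MiB, peak: {peak / 2**20:.1f} MiB")
```

The gap between ``peak`` and ``current`` corresponds to the temporary list ``b``, the same pattern as the increment followed by the release in the ``%mprun`` report.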
Using Cython
If profiling of the Python code reveals that the Python interpreter
overhead is larger by one order of magnitude or more than the cost of the
actual numerical computation (e.g. ``for`` loops over vector components,
nested evaluation of conditional expression, scalar arithmetic...), it
is probably adequate to extract the hotspot portion of the code as a
standalone function in a ``.pyx`` file, add static type declarations and
then use Cython to generate a C program suitable to be compiled as a
Python extension module.
The `Cython documentation <>`_ contains a tutorial and
reference guide for developing such a module.
For more information about developing in Cython for scikit-learn, see :ref:`cython`.
.. _profiling-compiled-extension:
Profiling compiled extensions
When working with compiled extensions (written in C/C++ with a wrapper or
directly as Cython extension), the default Python profiler is useless:
we need a dedicated tool to introspect what's happening inside the
compiled extension itself.
Using yep and gperftools
For easy profiling without special compilation options, use yep:
Using a debugger, gdb
* It is helpful to use ``gdb`` to debug. In order to do so, one must use
a Python interpreter built with debug support (debug symbols and proper
optimization). To create a new conda environment (which you might need
to deactivate and reactivate after building/installing) with a source-built
CPython interpreter:
.. code-block:: bash
git clone
conda create -n debug-scikit-dev
conda activate debug-scikit-dev
cd cpython
mkdir debug
cd debug
../configure --prefix=$CONDA_PREFIX --with-pydebug
make EXTRA_CFLAGS='-DPy_DEBUG' -j<num_cores>
make install
Using gprof
In order to profile compiled Python extensions one could use ``gprof``
after having recompiled the project with ``gcc -pg`` and using the
``python-dbg`` variant of the interpreter on debian / ubuntu: however
this approach also requires having ``numpy`` and ``scipy`` recompiled
with ``-pg`` which is rather complicated to get working.
Fortunately there exist two alternative profilers that don't require you to
recompile everything.
Using valgrind / callgrind / kcachegrind
``yep`` can be used to create a profiling report.
``kcachegrind`` provides a graphical environment to visualize this report:
.. prompt:: bash $
# Run yep to profile some python script
python -m yep -c
.. prompt:: bash $
# open with kcachegrind
.. note::
``yep`` can be executed with the argument ``--lines`` or ``-l`` to compile
a profiling report 'line by line'.
Multi-core parallelism using ``joblib.Parallel``
See `joblib documentation <>`_
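A minimal sketch of the ``joblib.Parallel`` calling pattern is shown below. The ``task`` function is a hypothetical stand-in for an independent unit of work, and the thread-based backend is chosen only to keep the example lightweight.

```python
from math import sqrt

from joblib import Parallel, delayed


def task(x):
    # Hypothetical stand-in for an independent, coarse-grained unit of work.
    return sqrt(x)


# ``prefer="threads"`` keeps this sketch lightweight; for CPU-bound pure
# Python work the default process-based backend is usually a better fit.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(task)(x) for x in range(10)
)
print(results)
```

Results are returned in the order of the input iterable, regardless of which worker finished first.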
.. _warm-restarts:
A simple algorithmic trick: warm restarts
See the glossary entry for :term:`warm_start`
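As an illustration, the sketch below uses the ``warm_start`` parameter of :class:`sklearn.ensemble.GradientBoostingRegressor` to add boosting stages to an already fitted model instead of refitting from scratch. The dataset and the number of stages are arbitrary choices for this example.

```python
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X[:, 0] + 0.1 * rng.randn(100)

# First fit: 10 boosting stages.
est = GradientBoostingRegressor(n_estimators=10, warm_start=True, random_state=0)
est.fit(X, y)
n_stages_first = len(est.estimators_)

# Second fit: the 10 existing stages are kept and only 10 new ones are trained.
est.set_params(n_estimators=20)
est.fit(X, y)
n_stages_second = len(est.estimators_)
print(n_stages_first, n_stages_second)
```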
.. _plotting_api:
Developing with the Plotting API
Scikit-learn defines a simple API for creating visualizations for machine
learning. The key feature of this API is to run calculations once and to have
the flexibility to adjust the visualizations after the fact. This section is
intended for developers who wish to develop or maintain plotting tools. For
usage, users should refer to the :ref:`User Guide <visualizations>`.
Plotting API Overview
This logic is encapsulated into a display object where the computed data is
stored and the plotting is done in a `plot` method. The display object's
`__init__` method contains only the data needed to create the visualization.
The `plot` method takes in parameters that only have to do with visualization,
such as a matplotlib axes. The `plot` method will store the matplotlib artists
as attributes allowing for style adjustments through the display object. The
`Display` class should define one or both class methods: `from_estimator` and
`from_predictions`. These methods allow creating the `Display` object from
the estimator and some data or from the true and predicted values. After these
class methods create the display object with the computed values, they call the
display's plot method. Note that the `plot` method defines attributes related
to matplotlib, such as the line artist. This allows for customizations after
calling the `plot` method.
For example, the `RocCurveDisplay` defines the following methods and
attributes::

    class RocCurveDisplay:
        def __init__(self, fpr, tpr, roc_auc, estimator_name):
            self.fpr = fpr
            self.tpr = tpr
            self.roc_auc = roc_auc
            self.estimator_name = estimator_name

        @classmethod
        def from_estimator(cls, estimator, X, y):
            # get the predictions
            y_pred = estimator.predict_proba(X)[:, 1]
            return cls.from_predictions(y, y_pred, estimator.__class__.__name__)

        @classmethod
        def from_predictions(cls, y, y_pred, estimator_name):
            # do ROC computation from y and y_pred
            fpr, tpr, roc_auc = ...
            viz = RocCurveDisplay(fpr, tpr, roc_auc, estimator_name)
            return viz.plot()

        def plot(self, ax=None, name=None, **kwargs):
            self.line_ = ...
            self.ax_ = ax
            self.figure_ = ax.figure
            # return the display so that the class methods can return it too
            return self
Read more in :ref:``
and the :ref:`User Guide <visualizations>`.
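The pattern described in this overview can be sketched with a small runnable display. ``SquaredErrorDisplay`` is a hypothetical class written for this illustration and is not part of scikit-learn; the ``Agg`` backend is selected only so that the sketch runs without a graphical session.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


class SquaredErrorDisplay:
    """Hypothetical display storing per-sample squared errors."""

    def __init__(self, *, errors, estimator_name=None):
        # Only the data needed to draw the visualization is stored.
        self.errors = errors
        self.estimator_name = estimator_name

    @classmethod
    def from_predictions(cls, y_true, y_pred, *, estimator_name=None, ax=None):
        # The computation happens once, here, and not in ``plot``.
        errors = (np.asarray(y_true) - np.asarray(y_pred)) ** 2
        viz = cls(errors=errors, estimator_name=estimator_name)
        return viz.plot(ax=ax)

    def plot(self, ax=None, **kwargs):
        if ax is None:
            _, ax = plt.subplots()
        # Keep the artists as attributes to allow later style adjustments.
        (self.line_,) = ax.plot(self.errors, label=self.estimator_name, **kwargs)
        self.ax_ = ax
        self.figure_ = ax.figure
        return self


display = SquaredErrorDisplay.from_predictions(
    [0.0, 1.0, 2.0], [0.5, 1.0, 4.0], estimator_name="dummy"
)
# Style adjustment after the fact, without recomputing anything.
display.line_.set_color("red")
```

Calling ``display.plot(ax=other_ax)`` again would redraw the stored errors on another axes without repeating the computation.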
Plotting with Multiple Axes
Some of the plotting tools like
:func:`~sklearn.inspection.PartialDependenceDisplay.from_estimator` and
:class:`~sklearn.inspection.PartialDependenceDisplay` support plotting on
multiple axes. Two different scenarios are supported:
1. If a list of axes is passed in, `plot` will check if the number of axes is
consistent with the number of axes it expects and then draws on those axes.
2. If a single axes is passed in, that axes defines a space for multiple axes to
be placed. In this case, we suggest using matplotlib's
`~matplotlib.gridspec.GridSpecFromSubplotSpec` to split up the space::
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpecFromSubplotSpec
fig, ax = plt.subplots()
gs = GridSpecFromSubplotSpec(2, 2, subplot_spec=ax.get_subplotspec())
ax_top_left = fig.add_subplot(gs[0, 0])
ax_top_right = fig.add_subplot(gs[0, 1])
ax_bottom = fig.add_subplot(gs[1, :])
By default, the `ax` keyword in `plot` is `None`. In this case, a single
axes is created and the gridspec api is used to create the regions to plot in.
See for example, :meth:`~sklearn.inspection.PartialDependenceDisplay.from_estimator`
which plots multiple lines and contours using this API. The axes defining the
bounding box is saved in a `bounding_ax_` attribute. The individual axes
created are stored in an `axes_` ndarray, corresponding to the axes position on
the grid. Positions that are not used are set to `None`. Furthermore, the
matplotlib Artists are stored in `lines_` and `contours_` where the key is the
position on the grid. When a list of axes is passed in, `axes_`, `lines_`,
and `contours_` are 1d ndarrays corresponding to the list of axes passed in.
.. _developers-tips:
Developers' Tips and Tricks
Productivity and sanity-preserving tips
In this section we gather some useful advice and tools that may increase your
quality-of-life when reviewing pull requests, running unit tests, and so forth.
Some of these tricks consist of userscripts that require a browser extension
such as `TamperMonkey`_ or `GreaseMonkey`_; to set up userscripts you must have
one of these extensions installed, enabled and running. We provide userscripts
as GitHub gists; to install them, click on the "Raw" button on the gist page.
.. _TamperMonkey:
.. _GreaseMonkey:
Folding and unfolding outdated diffs on pull requests
GitHub hides discussions on PRs when the corresponding lines of code have been
changed in the meantime. This `userscript
provides a shortcut (Control-Alt-P at the time of writing but look at the code
to be sure) to unfold all such hidden discussions at once, so you can catch up.
Checking out pull requests as remote-tracking branches
In your local fork, add to your ``.git/config``, under the ``[remote
"upstream"]`` heading, the line::
fetch = +refs/pull/*/head:refs/remotes/upstream/pr/*
You may then use ``git checkout pr/PR_NUMBER`` to navigate to the code of the
pull-request with the given number. (`Read more in this gist.
Display code coverage in pull requests
To overlay the code coverage reports generated by the CodeCov continuous
integration, consider `this browser extension
<>`_. The coverage of each line
will be displayed as a color background behind the line number.
.. _pytest_tips:
Useful pytest aliases and flags
The full test suite takes fairly long to run. For faster iterations,
it is possible to select a subset of tests using pytest selectors.
In particular, one can run a `single test based on its node ID
.. prompt:: bash $
pytest -v sklearn/linear_model/tests/
or use the `-k pytest parameter
to select tests based on their name. For instance:
.. prompt:: bash $
pytest sklearn/tests/ -v -k LogisticRegression
will run all :term:`common tests` for the ``LogisticRegression`` estimator.
When a unit test fails, the following tricks can make debugging easier:
1. The command line argument ``pytest -l`` instructs pytest to print the local
variables when a failure occurs.
2. The argument ``pytest --pdb`` drops into the Python debugger on failure. To
instead drop into the rich IPython debugger ``ipdb``, you may set up a
shell alias to:
.. prompt:: bash $
pytest --pdbcls=IPython.terminal.debugger:TerminalPdb --capture no
Other `pytest` options that may become useful include:
- ``-x`` which exits on the first failed test,
- ``--lf`` to rerun the tests that failed on the previous run,
- ``--ff`` to rerun all previous tests, running the ones that failed first,
- ``-s`` so that pytest does not capture the output of ``print()`` statements,
- ``--tb=short`` or ``--tb=line`` to control the length of the logs,
- ``--runxfail`` to also run tests marked as a known failure (XFAIL) and report errors.
Since our continuous integration tests will error if
``FutureWarning`` isn't properly caught,
it is also recommended to run ``pytest`` along with the
``-Werror::FutureWarning`` flag.
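The effect of turning ``FutureWarning`` into an error can be sketched with the standard library ``warnings`` module. ``deprecated_api`` is a hypothetical function used only for this illustration.

```python
import warnings


def deprecated_api():
    # Hypothetical function announcing a future behavior change.
    warnings.warn("this parameter will change", FutureWarning)
    return 42


# With the "error" filter, an uncaught FutureWarning becomes an exception,
# similar to what happens under ``-Werror::FutureWarning``.
with warnings.catch_warnings():
    warnings.simplefilter("error", FutureWarning)
    try:
        deprecated_api()
        raised = False
    except FutureWarning:
        raised = True

# Recording the warning explicitly lets the call complete normally.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = deprecated_api()

print(raised, value, [w.category.__name__ for w in caught])
```

In a test, the second form corresponds to asserting that the warning is raised, for instance with ``pytest.warns(FutureWarning)``.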
.. _saved_replies:
Standard replies for reviewing
It may be helpful to store some of these in GitHub's `saved
replies <>`_ for reviewing:
.. highlight:: none
Note that putting this content on a single line in a literal is the easiest way to make it copyable and wrapped on screen.
Issue: Usage questions
You are asking a usage question. The issue tracker is for bugs and new features. For usage questions, it is recommended to try [Stack Overflow]( or [the Mailing List](
Unfortunately, we need to close this issue as this issue tracker is a communication tool used for the development of scikit-learn. The additional activity created by usage questions crowds it too much and impedes this development. The conversation can continue here, however there is no guarantee that it will receive attention from core developers.
Issue: You're welcome to update the docs
Please feel free to offer a pull request updating the documentation if you feel it could be improved.
Issue: Self-contained example for bug
Please provide [self-contained example code](, including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.
Issue: Software versions
To help diagnose your issue, please paste the output of:
import sklearn; sklearn.show_versions()
Issue: Code blocks
Readability can be greatly improved if you [format]( your code snippets and complete error messages appropriately. For example:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named 'hello'
You can edit your issue descriptions and comments at any time to improve readability. This helps maintainers a lot. Thanks!
Issue/Comment: Linking to code
Friendly advice: for clarity's sake, you can link to code like [this](
Issue/Comment: Linking to comments
Please use links to comments, which make it a lot easier to see what you are referring to, rather than just linking to the issue. See [this]( for more details.
PR-NEW: Better description and title
Thanks for the pull request! Please make the title of the PR more descriptive. The title will become the commit message when this is merged. You should state what issue (or PR) it fixes/resolves in the description using the syntax described [here](
PR-NEW: Fix #
Please use "Fix #issueNumber" in your PR description (and you can do it more than once). This way the associated issue gets closed automatically when the PR is merged. For more details, look at [this](
PR-NEW or Issue: Maintenance cost
Every feature we include has a [maintenance cost]( Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, [well-established]( in the literature or in practice. Also, we expect PR authors to take part in the maintenance for the code they submit, at least initially. That doesn't stop you implementing it for yourself and publishing it in a separate repository, or even [scikit-learn-contrib](
PR-WIP: What's needed before merge?
Please clarify (perhaps as a TODO list in the PR description) what work you believe still needs to be done before it can be reviewed for merge. When it is ready, please prefix the PR title with `[MRG]`.
PR-WIP: Regression test needed
Please add a [non-regression test]( that would fail at main but pass in this PR.
You have some [PEP8]( violations, whose details you can see in the Circle CI `lint` job. It might be worth configuring your code editor to check for such errors on the fly, so you can catch them before committing.
PR-MRG: Patience
Before merging, we generally require two core developers to agree that your pull request is desirable and ready. [Please be patient](, as we mostly rely on volunteered time from busy core developers. (You are also welcome to help us out with [reviewing other PRs](
PR-MRG: Add to what's new
Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.
PR: Don't change unrelated
Please do not change unrelated lines. It makes your contribution harder to review and may introduce merge conflicts to other pull requests.
.. highlight:: default
Debugging memory errors in Cython with valgrind
While python/numpy's built-in memory management is relatively robust, it can
lead to performance penalties for some routines. For this reason, much of
the high-performance code in scikit-learn is written in cython. This
performance gain comes with a tradeoff, however: it is very easy for memory
bugs to crop up in cython code, especially in situations where that code
relies heavily on pointer arithmetic.
Memory errors can manifest themselves in a number of ways. The easiest ones to
debug are often segmentation faults and related glibc errors. Uninitialized
variables can lead to unexpected behavior that is difficult to track down.
A very useful tool when debugging these sorts of errors is valgrind_.
Valgrind is a command-line tool that can trace memory errors in a variety of
code. Follow these steps:
1. Install `valgrind`_ on your system.
2. Download the python valgrind suppression file: `valgrind-python.supp`_.
3. Follow the directions in the `README.valgrind`_ file to customize your
python suppressions. If you don't, you will have spurious output
related to the python interpreter instead of your own code.
4. Run valgrind as follows:
.. prompt:: bash $
valgrind -v --suppressions=valgrind-python.supp python
.. _valgrind:
.. _`README.valgrind`:
.. _`valgrind-python.supp`:
The result will be a list of all the memory-related errors, which reference
lines in the C-code generated by cython from your .pyx file. If you examine
the referenced lines in the .c file, you will see comments which indicate the
corresponding location in your .pyx source file. Hopefully the output will
give you clues as to the source of your memory error.
For more information on valgrind and the array of options it has, see the
tutorials and documentation on the `valgrind web site <>`_.
.. _arm64_dev_env:
Building and testing for the ARM64 platform on a x86_64 machine
ARM-based machines are a popular target for mobile, edge or other low-energy
deployments (including in the cloud, for instance on Scaleway or AWS Graviton).
Here are instructions to set up a local dev environment to reproduce
ARM-specific bugs or test failures on a x86_64 host laptop or workstation. This
is based on QEMU user mode emulation using docker for convenience (see
.. note::
The following instructions are illustrated for ARM64 but they also apply to
ppc64le, after changing the Docker image and Miniforge paths appropriately.
Prepare a folder on the host filesystem and download the necessary tools and
source code:
.. prompt:: bash $
mkdir arm64
pushd arm64
git clone
Use docker to install QEMU user mode and run an ARM64v8 container with access
to your shared folder under the `/io` mount point:
.. prompt:: bash $
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
docker run -v`pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
In the container, install miniforge3 for the ARM64 (a.k.a. aarch64)
.. prompt:: bash $
# Choose to install miniforge3 under: `/io/miniforge3`
Whenever you start a new container, you will need to reinitialize the conda env
previously installed under `/io/miniforge3`:
.. prompt:: bash $
/io/miniforge3/bin/conda init
source /root/.bashrc
as the `/root` home folder is part of the ephemeral docker container. Every
file or directory stored under `/io` is persistent on the other hand.
You can then build scikit-learn as usual (you will need to install compiler
tools and dependencies using apt or conda as usual). Building scikit-learn
takes a lot of time because of the emulation layer, however it needs to be
done only once if you put the scikit-learn folder under the `/io` mount
Then use pytest to run only the tests of the module you are interested in
.. _meson_build_backend:
The Meson Build Backend
Since scikit-learn 1.5.0 we use meson-python as the build tool. Meson is
a new tool for scikit-learn and the PyData ecosystem. It is used by several
other packages that have written good guides about what it is and how it works.
- `pandas setup doc
pandas has a similar setup as ours (no spin or
- `scipy Meson doc
<>`_ gives
more background about how Meson works behind the scenes
Normal file
@ -0,0 +1,242 @@
.. _developers-utils:
Utilities for Developers
Scikit-learn contains a number of utilities to help with development. These are
located in :mod:`sklearn.utils`, and include tools in a number of categories.
All the following functions and classes are in the module :mod:`sklearn.utils`.
.. warning ::
These utilities are meant to be used internally within the scikit-learn
package. They are not guaranteed to be stable between versions of
scikit-learn. Backports, in particular, will be removed as the scikit-learn
dependencies evolve.
.. currentmodule:: sklearn.utils
Validation Tools
These are tools used to check and validate input. When you write a function
which accepts arrays, matrices, or sparse matrices as arguments, the following
should be used when applicable.
- :func:`assert_all_finite`: Throw an error if array contains NaNs or Infs.
- :func:`as_float_array`: convert input to an array of floats. If a sparse
matrix is passed, a sparse matrix will be returned.
- :func:`check_array`: check that input is a 2D array, raise error on sparse
matrices. Allowed sparse matrix formats can be given optionally, as well as
allowing 1D or N-dimensional arrays. Calls :func:`assert_all_finite` by default.
- :func:`check_X_y`: check that X and y have consistent length, calls
check_array on X, and column_or_1d on y. For multilabel classification or
multitarget regression, specify multi_output=True, in which case check_array
will be called on y.
- :func:`indexable`: check that all input arrays have consistent length and can
be sliced or indexed using safe_index. This is used to validate input for
- :func:`validation.check_memory` checks that input is ``joblib.Memory``-like,
which means that it can be converted into a
``joblib.Memory`` instance (typically a str denoting
the ``cachedir``) or has the same interface.
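As a rough illustration of the validation helpers listed above, the sketch below validates input at the top of a small function. The function name ``column_means`` and the toy data are hypothetical and only serve the example; ``check_array`` and ``check_X_y`` are the utilities described in the list.

```python
import numpy as np
from sklearn.utils import check_array, check_X_y


def column_means(X):
    """Hypothetical helper: validate X, then average each column."""
    # check_array converts the input to a 2D ndarray and, by default,
    # rejects NaN and infinite values.
    X = check_array(X)
    return X.mean(axis=0)


X = [[1.0, 2.0], [3.0, 4.0]]
y = [0, 1]

means = column_means(X)

# check_X_y validates X and y together and checks that their lengths agree.
X_checked, y_checked = check_X_y(X, y)

# Non-finite values are rejected with a ValueError.
nan_error = None
try:
    check_array([[1.0, np.nan]])
except ValueError as exc:
    nan_error = str(exc)
```

Validating once at the entry point lets the rest of the function assume a well-formed NumPy array.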
If your code relies on a random number generator, it should never use
functions like ``numpy.random.random`` or ``numpy.random.normal``. This
approach can lead to repeatability issues in unit tests. Instead, a
``numpy.random.RandomState`` object should be used, which is built from
a ``random_state`` argument passed to the class or function. The function
:func:`check_random_state`, below, can then be used to create a random
number generator object.
- :func:`check_random_state`: create a ``np.random.RandomState`` object from
a parameter ``random_state``.
- If ``random_state`` is ``None`` or ``np.random``, then a
randomly-initialized ``RandomState`` object is returned.
- If ``random_state`` is an integer, then it is used to seed a new
``RandomState`` object.
- If ``random_state`` is a ``RandomState`` object, then it is passed through.
For example::
>>> from sklearn.utils import check_random_state
>>> random_state = 0
>>> random_state = check_random_state(random_state)
>>> random_state.rand(4)
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
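The same pattern can be sketched for a function that accepts a ``random_state`` argument. The helper name ``draw_noise`` is hypothetical; only ``check_random_state`` comes from scikit-learn.

```python
import numpy as np
from sklearn.utils import check_random_state


def draw_noise(n, random_state=None):
    """Hypothetical helper that draws n Gaussian samples reproducibly."""
    # Accepts None, an int seed, or an existing RandomState instance.
    rng = check_random_state(random_state)
    return rng.normal(size=n)


# The same integer seed gives the same draws.
a = draw_noise(3, random_state=42)
b = draw_noise(3, random_state=42)

# An existing RandomState instance is passed through unchanged.
shared = np.random.RandomState(0)
same_object = check_random_state(shared) is shared
```

Because the generator is built from the argument rather than from the global ``numpy.random`` state, two calls with the same seed are repeatable in unit tests.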
When developing your own scikit-learn compatible estimator, the following
helpers are available.
- :func:`validation.check_is_fitted`: check that the estimator has been fitted
before calling ``transform``, ``predict``, or similar methods. This helper
raises a standardized error message across estimators.
- :func:`validation.has_fit_parameter`: check that a given parameter is
supported in the ``fit`` method of a given estimator.
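A minimal sketch of how these two helpers might be used in a custom estimator follows. The class ``MeanPredictor`` is hypothetical and deliberately simplistic; it is not part of scikit-learn.

```python
from sklearn.base import BaseEstimator
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted, has_fit_parameter


class MeanPredictor(BaseEstimator):
    """Hypothetical estimator that always predicts the mean of y."""

    def fit(self, X, y):
        # Attributes with a trailing underscore mark the estimator as fitted.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self)
        return [self.mean_ for _ in X]


est = MeanPredictor()
try:
    est.predict([[0], [1]])
    raised = False
except NotFittedError:
    raised = True

predictions = est.fit([[0], [1]], [1.0, 3.0]).predict([[0], [1]])

# This fit method takes no sample_weight argument.
accepts_weights = has_fit_parameter(est, "sample_weight")
```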
Efficient Linear Algebra & Array Operations
- :func:`extmath.randomized_range_finder`: construct an orthonormal matrix
whose range approximates the range of the input. This is used in
:func:`extmath.randomized_svd`, below.
- :func:`extmath.randomized_svd`: compute the k-truncated randomized SVD.
This algorithm finds an approximate truncated singular value decomposition,
using randomization to speed up the computations. It is particularly
fast on large matrices from which you wish to extract only a small
number of components.
- `arrayfuncs.cholesky_delete`:
(used in :func:`~sklearn.linear_model.lars_path`) Remove an
item from a cholesky factorization.
- :func:`arrayfuncs.min_pos`: (used in ``sklearn.linear_model.least_angle``)
Find the minimum of the positive values within an array.
- :func:`extmath.fast_logdet`: efficiently compute the log of the determinant
of a matrix.
- :func:`extmath.density`: efficiently compute the density of a sparse vector
- :func:`extmath.safe_sparse_dot`: dot product which will correctly handle
``scipy.sparse`` inputs. If the inputs are dense, it is equivalent to
- :func:`extmath.weighted_mode`: an extension of ``scipy.stats.mode`` which
allows each item to have a real-valued weight.
- :func:`resample`: Resample arrays or sparse matrices in a consistent way.
Used in :func:`shuffle`, below.
- :func:`shuffle`: Shuffle arrays or sparse matrices in a consistent way.
Used in :func:`~sklearn.cluster.k_means`.
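To make two of the routines above more concrete, here is a small sketch that compares ``randomized_svd`` with NumPy's exact SVD on a matrix constructed to have rank 2, and uses ``safe_sparse_dot`` on mixed sparse and dense inputs. The matrix sizes and seeds are arbitrary choices for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.extmath import randomized_svd, safe_sparse_dot

rng = np.random.RandomState(0)

# Build a 50 x 20 matrix of rank 2, so two components capture everything.
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))

U, s, Vt = randomized_svd(M, n_components=2, random_state=0)

# Reference singular values from the exact (full) SVD.
s_exact = np.linalg.svd(M, compute_uv=False)[:2]

# safe_sparse_dot handles a sparse left operand and a dense right operand.
A = sparse.csr_matrix(np.eye(3))
B = np.arange(9.0).reshape(3, 3)
product = safe_sparse_dot(A, B)
```

On a matrix that is not exactly low rank, the randomized result is only an approximation of the leading singular values and vectors.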
Efficient Random Sampling
- :func:`random.sample_without_replacement`: implements efficient algorithms
for sampling ``n_samples`` integers from a population of size ``n_population``
without replacement.
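A short usage sketch, with an arbitrary population size and seed chosen for illustration:

```python
from sklearn.utils.random import sample_without_replacement

# Draw 5 distinct integers from range(100).
indices = sample_without_replacement(100, 5, random_state=0)

# Every index appears at most once.
n_unique = len(set(indices.tolist()))
```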
Efficient Routines for Sparse Matrices
The ``sklearn.utils.sparsefuncs`` cython module hosts compiled extensions to
efficiently process ``scipy.sparse`` data.
- :func:`sparsefuncs.mean_variance_axis`: compute the means and
variances along a specified axis of a CSR matrix.
Used for normalizing the tolerance stopping criterion in
- :func:`sparsefuncs_fast.inplace_csr_row_normalize_l1` and
:func:`sparsefuncs_fast.inplace_csr_row_normalize_l2`: can be used to normalize
individual sparse samples to unit L1 or L2 norm as done in
- :func:`sparsefuncs.inplace_csr_column_scale`: can be used to multiply the
columns of a CSR matrix by a constant scale (one scale per column).
Used for scaling features to unit standard deviation in
- :func:`~sklearn.neighbors.sort_graph_by_row_values`: can be used to sort a
CSR sparse matrix such that each row is stored with increasing values. This
is useful to improve efficiency when using precomputed sparse distance
matrices in estimators relying on nearest neighbors graph.
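The sketch below exercises two of the sparse helpers on a tiny CSR matrix and keeps a dense copy for comparison. The data is made up for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.utils.sparsefuncs import inplace_csr_column_scale, mean_variance_axis

dense = np.array([[1.0, 0.0, 2.0],
                  [0.0, 0.0, 3.0],
                  [4.0, 0.0, 0.0]])
X = sparse.csr_matrix(dense)

# Column-wise mean and variance, computed without densifying X.
means, variances = mean_variance_axis(X, axis=0)

# Multiply each column by its own scale factor, modifying X in place.
scale = np.array([2.0, 1.0, 0.5])
inplace_csr_column_scale(X, scale)
```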
Graph Routines
- :func:`graph.single_source_shortest_path_length`:
(not currently used in scikit-learn)
Return the shortest path from a single source
to all connected nodes on a graph. Code is adapted from `networkx
If this is ever needed again, it would be far faster to use a single
iteration of Dijkstra's algorithm from ``graph_shortest_path``.
Testing Functions
- :func:`discovery.all_estimators` : returns a list of all estimators in
scikit-learn to test for consistent behavior and interfaces.
- :func:`discovery.all_displays` : returns a list of all displays (related to
plotting API) in scikit-learn to test for consistent behavior and interfaces.
- :func:`discovery.all_functions` : returns a list of all functions in
scikit-learn to test for consistent behavior and interfaces.
Multiclass and multilabel utility functions
- :func:`multiclass.is_multilabel`: Helper function to check if the task
is a multi-label classification one.
- :func:`multiclass.unique_labels`: Helper function to extract an ordered
array of unique labels from different formats of target.
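A brief sketch of both helpers on toy targets:

```python
import numpy as np
from sklearn.utils.multiclass import is_multilabel, unique_labels

# Sorted union of the labels found in several target arrays.
labels = unique_labels([3, 5, 5, 7], [1, 3])

# A 1D target is not multilabel; a 2D binary indicator matrix is.
single = is_multilabel([0, 1, 0, 1])
multi = is_multilabel(np.array([[1, 0], [0, 1]]))
```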
Helper Functions
- :class:`gen_even_slices`: generator to create ``n``-packs of slices going up
to ``n``. Used in :func:`~sklearn.decomposition.dict_learning` and
- :class:`gen_batches`: generator to create slices containing ``batch_size`` elements
from 0 to ``n``
- :func:`safe_mask`: Helper function to convert a mask to the format expected
by the numpy array or scipy sparse matrix on which to use it (sparse
matrices support integer indices only while numpy arrays support both
boolean masks and integer indices).
- :func:`safe_sqr`: Helper function for unified squaring (``**2``) of
array-likes, matrices and sparse matrices.
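The two slice generators can be sketched as follows, with small sizes picked for readability:

```python
from sklearn.utils import gen_batches, gen_even_slices

# Slices of at most 3 elements covering range(7).
batches = list(gen_batches(7, 3))

# 3 slices of near-equal size covering range(10).
even = list(gen_even_slices(10, 3))

# The slices can be applied directly to any sequence or array.
data = list(range(7))
chunks = [data[s] for s in batches]
```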
Hash Functions
- :func:`murmurhash3_32` provides a python wrapper for the
``MurmurHash3_x86_32`` C++ non cryptographic hash function. This hash
function is suitable for implementing lookup tables, Bloom filters,
Count Min Sketch, feature hashing and implicitly defined sparse
random projections::
>>> from sklearn.utils import murmurhash3_32
>>> murmurhash3_32("some feature", seed=0) == -384616559
>>> murmurhash3_32("some feature", seed=0, positive=True) == 3910350737
The ``sklearn.utils.murmurhash`` module can also be "cimported" from
other cython modules so as to benefit from the high performance of
MurmurHash while skipping the overhead of the Python interpreter.
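As one illustration of the lookup-table use case, the sketch below maps string tokens to a fixed number of buckets. The bucket count, the ``bucket`` helper and the tokens are hypothetical choices for the example, not part of scikit-learn.

```python
from sklearn.utils import murmurhash3_32

N_BUCKETS = 16  # arbitrary table size for the example


def bucket(token, seed=0):
    """Hypothetical helper: map a string token to a bucket index."""
    # positive=True returns an unsigned value, so the modulo is non-negative.
    return murmurhash3_32(token, seed=seed, positive=True) % N_BUCKETS


# Count tokens per bucket, as a hashed feature vector would.
counts = [0] * N_BUCKETS
for token in ["apple", "banana", "apple", "cherry"]:
    counts[bucket(token)] += 1
```

Distinct tokens may collide in the same bucket; that is an expected property of hashing into a small table.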
Warnings and Exceptions
- :class:`deprecated`: Decorator to mark a function or class as deprecated.
- :class:`~sklearn.exceptions.ConvergenceWarning`: Custom warning to catch
convergence problems. Used in ``sklearn.covariance.graphical_lasso``.
Normal file
@ -0,0 +1,8 @@
.. toctree::
:maxdepth: 2
Normal file
@ -0,0 +1,20 @@
.. raw :: html
<!-- Generated by -->
<div class="sk-authors-container">
img.avatar {border-radius: 10px;}
<a href=''><img src='' class='avatar' /></a> <br />
<p>Arturo Amor</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Lucy Liu</p>
<a href=''><img src='' class='avatar' /></a> <br />
<p>Yao Xiao</p>
Normal file
@ -0,0 +1,530 @@
.. raw:: html

  <style>
    /* h3 headings on this page are the questions; make them rubric-like */
    h3 {
      font-size: 1rem;
      font-weight: bold;
      padding-bottom: 0.2rem;
      margin: 2rem 0 1.15rem 0;
      border-bottom: 1px solid var(--pst-color-border);
    }

    /* Increase top margin for first question in each section */
    h2 + section > h3 {
      margin-top: 2.5rem;
    }

    /* Make the headerlinks a bit more visible */
    h3 > a.headerlink {
      font-size: 0.9rem;
    }

    /* Remove the backlink decoration on the titles */
    h2 > a.toc-backref,
    h3 > a.toc-backref {
      text-decoration: none;
    }
  </style>

.. _faq:
Frequently Asked Questions
.. currentmodule:: sklearn
Here we try to give some answers to questions that regularly pop up on the mailing list.
.. contents:: Table of Contents
:depth: 2
About the project
What is the project name (a lot of people get it wrong)?
scikit-learn; not scikit, SciKit, or sci-kit learn.
Also not scikits.learn or scikits-learn, which were previously used.
How do you pronounce the project name?
sy-kit learn. sci stands for science!
Why scikit?
There are multiple scikits, which are scientific toolboxes built around SciPy.
Apart from scikit-learn, another popular one is `scikit-image <>`_.
Do you support PyPy?
scikit-learn is regularly tested and maintained to work with
`PyPy <>`_ (an alternative Python implementation with
a built-in just-in-time compiler).
Note however that this support is still considered experimental and specific
components might behave slightly differently. Please refer to the test
suite of the specific module of interest for more details.
How can I obtain permission to use the images in scikit-learn for my work?
The images contained in the `scikit-learn repository
<>`_ and the images generated within
the `scikit-learn documentation <>`_
can be used via the `BSD 3-Clause License
<>`_ for
your work. Citations of scikit-learn are highly encouraged and appreciated. See
:ref:`citing scikit-learn <citing-scikit-learn>`.
Implementation decisions
Why is there no support for deep or reinforcement learning? Will there be such support in the future?
Deep learning and reinforcement learning both require a rich vocabulary to
define an architecture, with deep learning additionally requiring
GPUs for efficient computing. However, neither of these fit within
the design constraints of scikit-learn. As a result, deep learning
and reinforcement learning are currently out of scope for what
scikit-learn seeks to achieve.
You can find more information about the addition of GPU support at
`Will you add GPU support?`_.
Note that scikit-learn currently implements a simple multilayer perceptron
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
If you want to implement more complex deep learning models, please turn to
popular deep learning frameworks such as
`tensorflow <>`_,
`keras <>`_,
and `pytorch <>`_.
.. _adding_graphical_models:
Will you add graphical models or sequence prediction to scikit-learn?
Not in the foreseeable future.
scikit-learn tries to provide a unified API for the basic tasks in machine
learning, with pipelines and meta-algorithms like grid search to tie
everything together. The required concepts, APIs, algorithms and
expertise required for structured learning are different from what
scikit-learn has to offer. If we started doing arbitrary structured
learning, we'd need to redesign the whole package and the project
would likely collapse under its own weight.
There are two projects with API similar to scikit-learn that
do structured prediction:
* `pystruct <>`_ handles general structured
learning (focuses on SSVMs on arbitrary graph structures with
approximate inference; defines the notion of sample as an instance of
the graph structure).
* `seqlearn <>`_ handles sequences only
(focuses on exact inference; has HMMs, but mostly for the sake of
completeness; treats a feature vector as a sample and uses an offset encoding
for the dependencies between feature vectors).
Why did you remove HMMs from scikit-learn?
See :ref:`adding_graphical_models`.
Will you add GPU support?
Adding GPU support by default would introduce heavy hardware-specific software
dependencies and existing algorithms would need to be reimplemented. This would
make it both harder for the average user to install scikit-learn and harder for
the developers to maintain the code.
However, since 2023, a limited but growing :ref:`list of scikit-learn
estimators <array_api_supported>` can already run on GPUs if the input data is
provided as a PyTorch or CuPy array and if scikit-learn has been configured to
accept such inputs as explained in :ref:`array_api`. This Array API support
allows scikit-learn to run on GPUs without introducing heavy and
hardware-specific software dependencies to the main package.
Most estimators that rely on NumPy for their computationally intensive operations
can be considered for Array API support and therefore GPU support.
However, not all scikit-learn estimators are amenable to efficiently running
on GPUs via the Array API for fundamental algorithmic reasons. For instance,
tree-based models currently implemented with Cython in scikit-learn are
fundamentally not array-based algorithms. Other algorithms such as k-means or
k-nearest neighbors rely on array-based algorithms but are also implemented in
Cython. Cython is used to manually interleave consecutive array operations to
avoid introducing performance killing memory access to large intermediate
arrays: this low-level algorithmic rewrite is called "kernel fusion" and cannot
be expressed via the Array API for the foreseeable future.
Adding efficient GPU support to estimators that cannot be efficiently
implemented with the Array API would require designing and adopting a more
flexible extension system for scikit-learn. This possibility is being
considered in the following GitHub issue (under discussion):
Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
of a single numeric dtype. These do not explicitly represent categorical
variables at present. Thus, unlike R's ``data.frames`` or :class:`pandas.DataFrame`,
we require explicit conversion of categorical features to numeric values, as
discussed in :ref:`preprocessing_categorical_features`.
See also :ref:`` for an
example of working with heterogeneous (e.g. categorical and numeric) data.
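A minimal sketch of such an explicit conversion, using :class:`~sklearn.preprocessing.OneHotEncoder` on a made-up single-column feature:

```python
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with two distinct values (toy data).
colors = [["red"], ["green"], ["red"]]

encoder = OneHotEncoder()
# The default output is a sparse matrix; densify it for display.
encoded = encoder.fit_transform(colors).toarray()

# Categories are stored in sorted order, one output column per category.
categories = encoder.categories_[0].tolist()
```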
Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
The homogeneous NumPy and SciPy data objects currently expected are most
efficient to process for most operations. Extensive work would also be needed
to support Pandas categorical types. Restricting input to homogeneous
types therefore reduces maintenance cost and encourages usage of efficient
data structures.
Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
Therefore :class:`~sklearn.compose.ColumnTransformer` is often used in the first
step of scikit-learn pipelines when dealing
with heterogeneous dataframes (see :ref:`pipeline` for more details).
See also :ref:``
for an example of working with heterogeneous (e.g. categorical and numeric) data.
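The following sketch shows the idea on a toy dataframe, assuming pandas is installed. The column names and values are invented for the example; ``sparse_threshold=0.0`` is set only so that the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Paris"],   # categorical column
    "age": [30.0, 40.0, 50.0],             # numeric column
})

# Route each homogeneous subset of columns to a dedicated transformer.
ct = ColumnTransformer(
    [
        ("cat", OneHotEncoder(), ["city"]),
        ("num", StandardScaler(), ["age"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Two one-hot columns for "city" followed by one scaled column for "age".
Xt = ct.fit_transform(df)
```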
Do you plan to implement transform for target ``y`` in a pipeline?
Currently transform only works for features ``X`` in a pipeline. There's a
long-standing discussion about not being able to transform ``y`` in a pipeline.
Follow on GitHub issue :issue:`4143`. Meanwhile, you can check out
`pipegraph <>`_,
and `imbalanced-learn <>`_.
Note that scikit-learn solved for the case where ``y``
has an invertible transformation applied before training
and inverted after prediction. scikit-learn intends to solve for
use cases where ``y`` should be transformed at training time
and not at test time, for resampling and similar uses, like at
`imbalanced-learn <>`_.
In general, these use cases can be solved
with a custom meta estimator rather than a :class:`~pipeline.Pipeline`.
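For the invertible-transformation case, one existing tool that appears to fit this description is :class:`~sklearn.compose.TransformedTargetRegressor`; naming it here is an assumption, since the text above does not mention a specific class. A minimal sketch on synthetic data where ``log(y)`` is exactly linear in ``X``:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 1))
# Synthetic target: log(y) is exactly linear in X.
y = np.exp(1.0 + 2.0 * X[:, 0])

# func is applied to y before fitting; inverse_func is applied to predictions.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)
pred = model.predict(X)
```

This is a meta-estimator wrapping a regressor rather than a pipeline step, in line with the remark above about custom meta estimators.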
Why are there so many different estimators for linear models?
Usually, there is one classifier and one regressor per model type, e.g.
:class:`~ensemble.GradientBoostingClassifier` and
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
both have the parameter `loss`, which is especially useful in the regression
case as it enables the estimation of conditional mean as well as conditional
For linear models, there are many estimator classes which are very close to
each other. Let us have a look at
- :class:`~linear_model.LinearRegression`, no penalty
- :class:`~linear_model.Ridge`, L2 penalty
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
- :class:`~linear_model.SGDRegressor` with `loss="squared_error"`
**Maintainer perspective:**
They all do the same thing in principle and differ only in the penalty they
impose. This, however, has a large impact on the way the underlying
optimization problem is solved. In the end, this amounts to usage of different
methods and tricks from linear algebra. A special case is
:class:`~linear_model.SGDRegressor`, which
covers all 4 previous models and differs in its optimization procedure.
A further side effect is that the different estimators favor different data
layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
of the seemingly simple linear models is the reason for having different
estimator classes for different penalties.
**User perspective:**
First, the current design is inspired by the scientific literature where linear
regression models with different regularization/penalty were given different
names, e.g. *ridge regression*. Having different model classes with corresponding
names makes it easier for users to find those regression models.
Secondly, if all the 5 above mentioned linear models were unified into a single
class, there would be parameters with a lot of options like the ``solver``
parameter. On top of that, there would be a lot of exclusive interactions
between different parameters. For example, the possible options of the
parameters ``solver``, ``precompute`` and ``selection`` would depend on the
chosen values of the penalty parameters ``alpha`` and ``l1_ratio``.
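To illustrate how closely related these classes are, the sketch below fits :class:`~linear_model.Lasso` and an :class:`~linear_model.ElasticNet` restricted to a pure L1 penalty on the same synthetic data. The data and the value of ``alpha`` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
true_coef = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_coef + 0.01 * rng.normal(size=50)

# With l1_ratio=1.0 the elastic-net penalty reduces to the L1 penalty.
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)

max_diff = np.abs(lasso.coef_ - enet.coef_).max()
```

The two estimators solve the same optimization problem here, which is the kind of parameter interaction a single unified class would have to expose.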
How can I contribute to scikit-learn?
See :ref:`contributing`. Before wanting to add a new algorithm, which is
usually a major and lengthy undertaking, it is recommended to start with
:ref:`known issues <new_contributors>`. Please do not contact the contributors
of scikit-learn directly regarding contributing to scikit-learn.
Why is my pull request not getting any attention?
The scikit-learn review process takes a significant amount of time, and
contributors should not be discouraged by a lack of activity or review on
their pull request. We care a lot about getting things right
the first time, as maintenance and later change comes at a high cost.
We rarely release any "experimental" code, so all of our contributions
will be subject to high use immediately and should be of the highest
quality possible initially.
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
reviewers and core developers are working on scikit-learn on their own time.
If a review of your pull request comes slowly, it is likely because the
reviewers are busy. We ask for your understanding and request that you
not close your pull request or discontinue your work solely because of
this reason.
.. _new_algorithms_inclusion_criteria:
What are the inclusion criteria for new algorithms?
We only consider well-established algorithms for inclusion. A rule of thumb is
at least 3 years since publication, 200+ citations, and wide use and
usefulness. A technique that provides a clear-cut improvement (e.g. an
enhanced data structure or a more efficient approximation technique) on
a widely-used method will also be considered for inclusion.
From the algorithms or techniques that meet the above criteria, only those
which fit well within the current API of scikit-learn, that is a ``fit``,
``predict/transform`` interface and ordinarily having input/output that is a
numpy array or sparse matrix, are accepted.
The contributor should support the importance of the proposed addition with
research papers and/or implementations in other similar packages, demonstrate
its usefulness via common use-cases/applications and corroborate performance
improvements, if any, with benchmarks and/or plots. It is expected that the
proposed algorithm should outperform the methods that are already implemented
in scikit-learn at least in some areas.
Inclusion of a new algorithm speeding up an existing model is easier if:
- it does not introduce new hyper-parameters (as it makes the library
more future-proof),
- it is easy to document clearly when the contribution improves the speed
and when it does not, for instance, "when ``n_features >>
- benchmarks clearly show a speed up.
Also, note that your implementation need not be in scikit-learn to be used
together with scikit-learn tools. You can implement your favorite algorithm
in a scikit-learn compatible way, upload it to GitHub and let us know. We
will be happy to list it under :ref:`related_projects`. If you already have
a package on GitHub following the scikit-learn API, you may also be
interested to look at `scikit-learn-contrib
.. _selectiveness:
Why are you so selective on what algorithms you include in scikit-learn?
Code comes with maintenance cost, and we need to balance the amount of
code we have with the size of the team (and add to this the fact that
complexity scales nonlinearly with the number of features).
The package relies on core developers using their free time to
fix bugs, maintain code and review contributions.
Any algorithm that is added needs future attention by the developers,
at which point the original author might long have lost interest.
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
long-term maintenance issues in open-source software, look at
`the Executive Summary of Roads and Bridges
Using scikit-learn
What's the best way to get help on scikit-learn usage?
* General machine learning questions: use `Cross Validated
<>`_ with the ``[machine-learning]`` tag.
* scikit-learn usage questions: use `Stack Overflow
<>`_ with the
``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
Please make sure to include a minimal reproduction code snippet (ideally shorter
than 10 lines) that highlights your problem on a toy dataset (for instance from
:mod:`sklearn.datasets` or randomly generated with functions of ``numpy.random`` with
a fixed random seed). Please remove any line of code that is not necessary to
reproduce your problem.
The problem should be reproducible by simply copy-pasting your code snippet in a Python
shell with scikit-learn installed. Do not forget to include the import statements.
More guidance to write good reproduction code snippets can be found at:
If your problem raises an exception that you do not understand (even after googling it),
please make sure to include the full traceback that you obtain when running the
reproduction script.
For bug reports or feature requests, please make use of the
`issue tracker on GitHub <>`_.
.. warning::
Please do not email any authors directly to ask for assistance, report bugs,
or for any other issue related to scikit-learn.
How should I save, export or deploy estimators for production?
See :ref:`model_persistence`.
How can I create a bunch object?
Bunch objects are sometimes used as an output for functions and methods. They
extend dictionaries by enabling values to be accessed by key,
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.
They should not be used as an input. Therefore you almost never need to create
a :class:`~utils.Bunch` object, unless you are extending scikit-learn's API.
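For reference, the two access styles look like this on a hand-made object; the keys ``scores`` and ``n_iter`` are made up for the example.

```python
from sklearn.utils import Bunch

# A Bunch behaves like a dict whose keys are also attributes.
result = Bunch(scores=[0.9, 0.8], n_iter=3)

by_key = result["n_iter"]
by_attr = result.n_iter
```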
How can I load my own datasets into a format usable by scikit-learn?
Generally, scikit-learn works on any numeric data stored as numpy arrays
or scipy sparse matrices. Other types that are convertible to numeric
arrays such as :class:`pandas.DataFrame` are also acceptable.
For more information on loading your data files into these usable data
structures, please refer to :ref:`loading external datasets <external_datasets>`.
How do I deal with string data (or trees, graphs...)?
scikit-learn estimators assume you'll feed them real-valued feature vectors.
This assumption is hard-coded in pretty much all of the library.
However, you can feed non-numerical inputs to estimators in several ways.
If you have text documents, you can use term frequency features; see
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
For more general feature extraction from any kind of data, see
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.
Another common case is when you have non-numerical data and a custom distance
(or similarity) metric on these data. Examples include strings with edit
distance (a.k.a. Levenshtein distance), for instance, DNA or RNA sequences. These can be
encoded as numbers, but doing so is painful and error-prone. Working with
distance metrics on arbitrary data can be done in two ways.
Firstly, many estimators take precomputed distance/similarity matrices, so if
the dataset is not too large, you can compute distances for all pairs of inputs.
If the dataset is large, you can use feature vectors with only one "feature",
which is an index into a separate data structure, and supply a custom metric
function that looks up the actual data in this data structure. For instance, to use
:func:`~cluster.dbscan` with Levenshtein distances::
>>> import numpy as np
>>> from leven import levenshtein  # doctest: +SKIP
>>> from sklearn.cluster import dbscan
>>> # Example sequences (illustrative); the metric looks them up by index.
>>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
>>> def lev_metric(x, y):
...     i, j = int(x[0]), int(y[0])  # extract indices
...     return levenshtein(data[i], data[j])
...
>>> X = np.arange(len(data)).reshape(-1, 1)
>>> X
array([[0],
       [1],
       [2]])
>>> # We need to specify algorithm='brute' as the default assumes
>>> # a continuous feature space.
>>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute')  # doctest: +SKIP
(array([0, 1]), array([ 0,  0, -1]))
Note that the example above uses the third-party edit distance package
`leven <>`_. Similar tricks can be used,
with some care, for tree kernels, graph kernels, etc.
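The first approach mentioned above, a precomputed distance matrix, can be sketched without any third-party edit distance package. The ``edit_distance`` function below is a plain dynamic-programming implementation written for this example, and the word list and ``eps`` value are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


words = ["kitten", "sitten", "sitting", "banana"]

# Pairwise distance matrix over all inputs; fine for small datasets.
D = np.array([[edit_distance(a, b) for b in words] for a in words], dtype=float)

# metric="precomputed" tells DBSCAN that D already holds distances.
labels = DBSCAN(metric="precomputed", eps=3, min_samples=2).fit(D).labels_
```

The matrix has one entry per pair of samples, so memory grows quadratically with the dataset size; the index-based metric shown earlier avoids storing it.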
Why do I sometimes get a crash/freeze with ``n_jobs > 1`` under OSX or Linux?
Several scikit-learn tools such as :class:`~model_selection.GridSearchCV` and
:func:`~model_selection.cross_val_score` rely internally on Python's
:mod:`multiprocessing` module to parallelize execution
across several Python processes when ``n_jobs > 1`` is passed as an argument.
The problem is that Python :mod:`multiprocessing` does a ``fork`` system call
without following it with an ``exec`` system call for performance reasons. Many
libraries like (some versions of) Accelerate or vecLib under OSX, (some versions
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
manage their own internal thread pool. Upon a call to `fork`, the thread pool
state in the child process is corrupted: the thread pool believes it has many
threads while only the main thread state has been forked. It is possible to
change the libraries to make them detect when a fork happens and reinitialize
the thread pool in that case: we did that for OpenBLAS (merged upstream in
main since 0.2.10) and we contributed a `patch
<>`_ to GCC's OpenMP runtime
(not yet reviewed).
But in the end the real culprit is Python's :mod:`multiprocessing` that does
``fork`` without ``exec`` to reduce the overhead of starting and using new
Python processes for parallel computing. Unfortunately this is a violation of
the POSIX standard and therefore some software vendors like Apple refuse to
consider the lack of fork-safety in Accelerate and vecLib as a bug.
In Python 3.4+ it is now possible to configure :mod:`multiprocessing` to
use the ``"forkserver"`` or ``"spawn"`` start methods (instead of the default
``"fork"``) to manage the process pools. To work around this issue when
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
variable to ``"forkserver"``. However, the user should be aware that using
the ``"forkserver"`` method prevents :class:`joblib.Parallel` from calling functions
defined interactively in a shell session.
If you have custom code that uses :mod:`multiprocessing` directly instead of using
it via :mod:`joblib` you can enable the ``"forkserver"`` mode globally for your
program. Insert the following instructions in your main script::
import multiprocessing
# other imports, custom code, load data, define model...
if __name__ == "__main__":
    multiprocessing.set_start_method("forkserver")
    # call scikit-learn utils with n_jobs > 1 here
You can find more details on the new start methods in the `multiprocessing
documentation <>`_.
.. _faq_mkl_threading:
Why does my job use more cores than specified with ``n_jobs``?
This is because ``n_jobs`` only controls the number of jobs for
routines that are parallelized with :mod:`joblib`, but parallel code can come
from other sources:
- some routines may be parallelized with OpenMP (for code written in C or
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
libraries like MKL, OpenBLAS or BLIS which can provide parallel
For more details, please refer to our :ref:`notes on parallelism <parallelism>`.
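One way to cap the threads used by the underlying numerical libraries, separately from ``n_jobs``, is the third-party ``threadpoolctl`` package. Its use here is an assumption for illustration and is not taken from the text above. A minimal sketch:

```python
import numpy as np
from threadpoolctl import threadpool_limits

A = np.ones((200, 200))

# Inside this block, BLAS libraries detected by threadpoolctl
# (for example OpenBLAS or MKL) are limited to a single thread.
with threadpool_limits(limits=1, user_api="blas"):
    product = A @ A
```

The limit only applies inside the ``with`` block; the previous thread settings are restored on exit.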
How do I set a ``random_state`` for an entire execution?
Please refer to :ref:`randomness`.
Getting Started
The purpose of this guide is to illustrate some of the main features that
``scikit-learn`` provides. It assumes a very basic working knowledge of
machine learning practices (model fitting, predicting, cross-validation,
etc.). Please refer to our :ref:`installation instructions
<installation-instructions>` for installing ``scikit-learn``.
``Scikit-learn`` is an open source machine learning library that supports
supervised and unsupervised learning. It also provides various tools for
model fitting, data preprocessing, model selection, model evaluation,
and many other utilities.
Fitting and predicting: estimator basics
``Scikit-learn`` provides dozens of built-in machine learning algorithms and
models, called :term:`estimators`. Each estimator can be fitted to some data
using its :term:`fit` method.
Here is a simple example where we fit a
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = RandomForestClassifier(random_state=0)
>>> X = [[ 1, 2, 3], # 2 samples, 3 features
... [11, 12, 13]]
>>> y = [0, 1] # classes of each sample
>>> clf.fit(X, y)
RandomForestClassifier(random_state=0)
The :term:`fit` method generally accepts 2 inputs:
- The samples matrix (or design matrix) :term:`X`. The size of ``X``
is typically ``(n_samples, n_features)``, which means that samples are
represented as rows and features are represented as columns.
- The target values :term:`y` which are real numbers for regression tasks, or
integers for classification (or any other discrete set of values). For
unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
usually a 1d array where the ``i`` th entry corresponds to the target of the
``i`` th sample (row) of ``X``.
Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
:term:`array-like` data types, though some estimators work with other
formats such as sparse matrices.
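To make the shape conventions above concrete, here is a minimal sketch using
numpy arrays. The variable names ``n_samples`` and ``n_features`` simply
mirror the notation used in the text.

```python
import numpy as np

# 2 samples (rows), 3 features (columns)
X = np.array([[1, 2, 3],
              [11, 12, 13]])
# one target per sample: y[i] is the target of row X[i]
y = np.array([0, 1])

n_samples, n_features = X.shape
print(X.shape, y.shape)
```

Checking that ``y.shape[0]`` equals ``X.shape[0]`` before calling ``fit`` is a
cheap way to catch misaligned data early.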
Once the estimator is fitted, it can be used for predicting target values of
new data. You don't need to re-train the estimator::
>>> clf.predict(X) # predict classes of the training data
array([0, 1])
>>> clf.predict([[4, 5, 6], [14, 15, 16]]) # predict classes of new data
array([0, 1])
You can check :ref:`ml_map` on how to choose the right model for your use case.
Transformers and pre-processors
Machine learning workflows are often composed of different parts. A typical
pipeline consists of a pre-processing step that transforms or imputes the
data, and a final predictor that predicts target values.
In ``scikit-learn``, pre-processors and transformers follow the same API as
the estimator objects (they actually all inherit from the same
``BaseEstimator`` class). The transformer objects don't have a
:term:`predict` method but rather a :term:`transform` method that outputs a
newly transformed sample matrix ``X``::
>>> from sklearn.preprocessing import StandardScaler
>>> X = [[0, 15],
... [1, -10]]
>>> # scale data according to computed scaling values
>>> StandardScaler().fit(X).transform(X)
array([[-1., 1.],
[ 1., -1.]])
Sometimes, you want to apply different transformations to different features:
the :ref:`ColumnTransformer<column_transformer>` is designed for these
use cases.
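A minimal sketch of per-column transformations, reusing the toy data from the
example above. The choice of scalers and the step names ``"standard"`` and
``"minmax"`` are illustrative assumptions, not a recommendation.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[0.0, 15.0],
              [1.0, -10.0]])

# Standardize column 0 and rescale column 1 to the [0, 1] range.
ct = ColumnTransformer([
    ("standard", StandardScaler(), [0]),
    ("minmax", MinMaxScaler(), [1]),
])
X_t = ct.fit_transform(X)
print(X_t)
```

The output columns follow the order of the transformers in the list, so the
standardized column comes first here.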
Pipelines: chaining pre-processors and estimators
Transformers and estimators (predictors) can be combined together into a
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
offers the same API as a regular estimator: it can be fitted and used for
prediction with ``fit`` and ``predict``. As we will see later, using a
pipeline will also protect you from data leakage, i.e. disclosing some
testing data in your training data.
In the following example, we :ref:`load the Iris dataset <datasets>`, split it
into train and test sets, and compute the accuracy score of a pipeline on
the test data::
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import accuracy_score
>>> # create a pipeline object
>>> pipe = make_pipeline(
... StandardScaler(),
... LogisticRegression()
... )
>>> # load the iris dataset and split it into train and test sets
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> # fit the whole pipeline
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression', LogisticRegression())])
>>> # we can now use it like any other estimator
>>> accuracy_score(y_test, pipe.predict(X_test))
Model evaluation
Fitting a model to some data does not entail that it will predict well on
unseen data. This needs to be directly evaluated. We have just seen the
:func:`~sklearn.model_selection.train_test_split` helper that splits a
dataset into train and test sets, but ``scikit-learn`` provides many other
tools for model evaluation, in particular for :ref:`cross-validation
<cross_validation>`.
Here we briefly show how to perform a 5-fold cross-validation procedure,
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
it is also possible to manually iterate over the folds, use different
data splitting strategies, and use custom scoring functions. Please refer to
our :ref:`User Guide <cross_validation>` for more details::
>>> from sklearn.datasets import make_regression
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import cross_validate
>>> X, y = make_regression(n_samples=1000, random_state=0)
>>> lr = LinearRegression()
>>> result = cross_validate(lr, X, y) # defaults to 5-fold CV
>>> result['test_score'] # r_squared score is high because dataset is easy
array([1., 1., 1., 1., 1.])
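The paragraph above mentions that the folds can also be iterated over
manually. The following sketch shows one way to do this with
:class:`~sklearn.model_selection.KFold`; the dataset parameters
(``n_samples=200``, ``n_features=10``, ``noise=1.0``) are arbitrary values
chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

# A different splitting strategy: shuffled 5-fold instead of the default.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in cv.split(X):
    # Fit a fresh model on the training fold only.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # score() returns R^2 for regressors, computed on the held-out fold.
    scores.append(model.score(X[test_idx], y[test_idx]))

scores = np.array(scores)
print(scores.mean())
```

Iterating manually is mostly useful when you need access to the fitted model
or the indices of each fold; otherwise ``cross_validate`` is more concise.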
Automatic parameter searches
All estimators have parameters (often called hyper-parameters in the
literature) that can be tuned. The generalization power of an estimator
often critically depends on a few parameters. For example a
:class:`~sklearn.ensemble.RandomForestRegressor` has a ``n_estimators``
parameter that determines the number of trees in the forest, and a
``max_depth`` parameter that determines the maximum depth of each tree.
Quite often, it is not clear what the exact values of these parameters
should be since they depend on the data at hand.
``Scikit-learn`` provides tools to automatically find the best parameter
combinations (via cross-validation). In the following example, we randomly
search over the parameter space of a random forest with a
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
the best set of parameters. Read more in the :ref:`User Guide
>>> from sklearn.datasets import fetch_california_housing
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
>>> X, y = fetch_california_housing(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> # define the parameter space that will be searched over
>>> param_distributions = {'n_estimators': randint(1, 5),
... 'max_depth': randint(5, 10)}
>>> # now create a searchCV object and fit it to the data
>>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
... n_iter=5,
... param_distributions=param_distributions,
... random_state=0)
>>> search.fit(X_train, y_train)
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
param_distributions={'max_depth': ...,
'n_estimators': ...},
random_state=0)
>>> search.best_params_
{'max_depth': 9, 'n_estimators': 4}
>>> # the search object now acts like a normal random forest estimator
>>> # with max_depth=9 and n_estimators=4
>>> search.score(X_test, y_test)
.. note::
In practice, you almost always want to :ref:`search over a pipeline
<composite_grid_search>`, instead of a single estimator. One of the main
reasons is that if you apply a pre-processing step to the whole dataset
without using a pipeline, and then perform any kind of cross-validation,
you would be breaking the fundamental assumption of independence between
training and testing data. Indeed, since you pre-processed the data
using the whole dataset, some information about the test sets is
available to the train sets. This will lead to over-estimating the
generalization power of the estimator (you can read more in this `Kaggle
post <>`_).
Using a pipeline for cross-validation and searching will largely keep
you from this common pitfall.
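As a rough sketch of what searching over a pipeline can look like, the example
below tunes the ``C`` parameter of a logistic regression inside a pipeline.
The grid values and ``max_iter=1000`` are assumptions chosen for illustration;
the ``logisticregression__C`` key follows the step names that
``make_pipeline`` generates.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is re-fitted on the training part of every CV split, so no
# statistics from the validation part leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Parameters of a step are addressed as "<step name>__<parameter name>".
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
```

Because the whole pipeline is cloned and fitted per split, the pre-processing
step only ever sees the training portion of each fold.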
Next steps
We have briefly covered estimator fitting and predicting, pre-processing
steps, pipelines, cross-validation tools and automatic hyper-parameter
searches. This guide should give you an overview of some of the main
features of the library, but there is much more to ``scikit-learn``!
Please refer to our :ref:`user_guide` for details on all the tools that we
provide. You can also find an exhaustive list of the public API in the
You can also look at our numerous :ref:`examples <general_examples>` that
illustrate the use of ``scikit-learn`` in many different contexts.
.. _governance:
Scikit-learn governance and decision-making
The purpose of this document is to formalize the governance process used by the
scikit-learn project, to clarify how decisions are made and how the various
elements of our community interact.
This document establishes a decision-making structure that takes into account
feedback from all members of the community and strives to find consensus, while
avoiding any deadlocks.
This is a meritocratic, consensus-based community project. Anyone with an
interest in the project can join the community, contribute to the project
design and participate in the decision making process. This document describes
how that participation takes place and how to set about earning merit within
the project community.
Roles And Responsibilities
We distinguish between contributors, core contributors, and the technical
committee. A key distinction between them is their voting rights: contributors
have no voting rights, whereas the other two groups all have voting rights,
as well as permissions to the tools relevant to their roles.
Contributors are community members who contribute in concrete ways to the
project. Anyone can become a contributor, and contributions can take many forms
– not only code – as detailed in the :ref:`contributors guide <contributing>`.
There is no process to become a contributor: once somebody contributes to the
project in any way, they are a contributor.
Core Contributors
All core contributor members have the same voting rights and right to propose
new members to any of the roles listed below. Their membership is represented
as being an organization member on the scikit-learn `GitHub organization
They are also welcome to join our `monthly core contributor meetings
New members can be nominated by any existing member. Once they have been
nominated, there will be a vote by the current core contributors. Voting on new
members is one of the few activities that takes place on the project's private
mailing list. While it is expected that most votes will be unanimous, a
two-thirds majority of the cast votes is enough. The vote needs to be open for
at least 1 week.
Core contributors that have not contributed to the project, corresponding to
their role, in the past 12 months will be asked if they want to become emeritus
members and relinquish their rights until they become active again. The list of
members, active and emeritus (with dates at which they became active) is public
on the scikit-learn website.
The following teams form the core contributors group:
* **Contributor Experience Team**
The contributor experience team improves the experience of contributors by
helping with the triage of issues and pull requests, as well as by noticing any
recurring patterns where people might struggle and helping to improve
those aspects of the project.
To this end, they have the required permissions on GitHub to label and close
issues. :ref:`Their work <bug_triaging>` is crucial to improve the
communication in the project and limit the crowding of the issue tracker.
.. _communication_team:
* **Communication Team**
Members of the communication team help with outreach and communication
for scikit-learn. The goal of the team is to develop public awareness of
scikit-learn, of its features and usage, as well as branding.
For this, they can operate the scikit-learn accounts on various social networks
and produce materials. They also have the required rights to our blog
repository and other relevant accounts and platforms.
* **Documentation Team**
Members of the documentation team engage with the documentation of the project
among other things. They might also be involved in other aspects of the
project, but their reviews on documentation contributions are considered
authoritative, and they can merge such contributions.
To this end, they have permissions to merge pull requests in scikit-learn's
* **Maintainers Team**
Maintainers are community members who have shown that they are dedicated to the
continued development of the project through ongoing engagement with the
community. They have shown they can be trusted to maintain scikit-learn with
care. Being a maintainer allows contributors to more easily carry on with their
project related activities by giving them direct access to the project's
repository. Maintainers are expected to review code contributions, merge
approved pull requests, cast votes for and against merging a pull-request,
and to be involved in deciding major changes to the API.
Technical Committee
The Technical Committee (TC) members are maintainers who have additional
responsibilities to ensure the smooth running of the project. TC members are
expected to participate in strategic planning, and approve changes to the
governance model. The purpose of the TC is to ensure smooth progress from the
big-picture perspective. Indeed, changes that impact the full project require a
synthetic analysis and a consensus that is both explicit and informed. In cases
where the core contributor community (which includes the TC members) fails to
reach such a consensus in the required time frame, the TC is the entity to
resolve the issue. Membership of the TC is by nomination by a core contributor.
A nomination will result in discussion which cannot take more than a month and
then a vote by the core contributors which will stay open for a week. TC
membership votes are subject to a two-thirds majority of all cast votes as well
as a simple majority approval of all the current TC members. TC members who do
not actively engage with the TC duties are expected to resign.
The Technical Committee of scikit-learn consists of :user:`Thomas Fan
<thomasjpfan>`, :user:`Alexandre Gramfort <agramfort>`, :user:`Olivier Grisel
<ogrisel>`, :user:`Adrin Jalali <adrinjalali>`, :user:`Andreas Müller
<amueller>`, :user:`Joel Nothman <jnothman>` and :user:`Gaël Varoquaux
Decision Making Process
Decisions about the future of the project are made through discussion with all
members of the community. All non-sensitive project management discussion takes
place on the project contributors' `mailing list <>`_
and the `issue tracker <>`_.
Occasionally, sensitive discussion occurs on a private list.
Scikit-learn uses a "consensus seeking" process for making decisions. The group
tries to find a resolution that has no open objections among core contributors.
At any point during the discussion, any core contributor can call for a vote,
which will conclude one month from the call for the vote. Most votes have to be
backed by a :ref:`SLEP <slep>`. If no option can gather two thirds of the votes
cast, the decision is escalated to the TC, which in turn will use consensus
seeking with the fallback option of a simple majority vote if no consensus can
be found within a month. This is what we hereafter may refer to as "**the
decision making process**".
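The two-thirds rule described above can be sketched as a small tally function.
This is only an illustration under stated assumptions: the function name
``resolve_vote`` is hypothetical, "two thirds" is read as "at least two
thirds", and abstentions are assumed not to count as cast votes.

```python
from fractions import Fraction


def resolve_vote(tally, threshold=Fraction(2, 3)):
    """Return the winning option, or None if the vote must be escalated.

    ``tally`` maps each option to the number of votes cast for it.
    """
    cast = sum(tally.values())
    if cast == 0:
        return None
    for option, count in tally.items():
        # Exact rational arithmetic avoids rounding issues at the threshold.
        if Fraction(count, cast) >= threshold:
            return option
    # No option gathered two thirds of the cast votes: escalate to the TC.
    return None


print(resolve_vote({"accept": 8, "reject": 4}))
print(resolve_vote({"accept": 7, "reject": 5}))
```

In the first call exactly two thirds of the cast votes support ``"accept"``,
so that option is returned; in the second call no option reaches the
threshold and the function returns ``None``, which stands for escalation.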
Decisions (in addition to adding core contributors and TC membership as above)
are made according to the following rules:
* **Minor Documentation changes**, such as typo fixes, or addition / correction
of a sentence, but no change of the project's landing page or the
“about” page: Requires +1 by a maintainer, no -1 by a maintainer (lazy
consensus), happens on the issue or pull request page. Maintainers are
expected to give “reasonable time” to others to give their opinion on the
pull request if they're not confident others would agree.
* **Code changes and major documentation changes**
require +1 by two maintainers, no -1 by a maintainer (lazy
consensus), happens on the issue or pull request page.
* **Changes to the API principles and changes to dependencies or supported
versions** happen via a :ref:`slep` and follow the decision-making process
outlined above.
* **Changes to the governance model** follow the process outlined in `SLEP020
If a veto -1 vote is cast on a lazy consensus, the proposer can appeal to the
community and maintainers and the change can be approved or rejected using
the decision making procedure outlined above.
Governance Model Changes
Governance model changes occur through an enhancement proposal or a GitHub Pull
Request. An enhancement proposal will go through "**the decision-making process**"
described in the previous section. Alternatively, an author may propose a change
directly to the governance model with a GitHub Pull Request. Logistically, an
author can open a Draft Pull Request for feedback and follow up with a new
revised Pull Request for voting. Once that author is happy with the state of the
Pull Request, they can call for a vote on the public mailing list. During the
one-month voting period, the Pull Request cannot change. A Pull Request
Approval will count as a positive vote, and a "Request Changes" review will
count as a negative vote. If two-thirds of the cast votes are positive, then
the governance model change is accepted.
.. _slep:
Enhancement proposals (SLEPs)
For all votes, a proposal must have been made public and discussed before the
vote. Such a proposal must be a consolidated document, in the form of a
"Scikit-Learn Enhancement Proposal" (SLEP), rather than a long discussion on an
issue. A SLEP must be submitted as a pull-request to `enhancement proposals
<>`_ using the `SLEP
describes the process in more detail.
