first commit

@@ -0,0 +1,16 @@

# Code of Conduct

We are a community based on openness, as well as friendly and didactic discussions.

We aspire to treat everybody equally, and value their contributions.

Decisions are made based on technical merit and consensus.

Code is not the only way to help the project. Reviewing pull requests,
answering questions to help others on mailing lists or issues, organizing and
teaching tutorials, working on the website, improving the documentation, are
all priceless contributions.

We abide by the principles of openness, respect, and consideration of others of
the Python Software Foundation: https://www.python.org/psf/codeofconduct/

@@ -0,0 +1,42 @@

Contributing to scikit-learn
============================

The latest contributing guide is available in the repository at
`doc/developers/contributing.rst`, or online at:

https://scikit-learn.org/dev/developers/contributing.html

There are many ways to contribute to scikit-learn, with the most common ones
being contributions of code or documentation to the project. Improving the
documentation is no less important than improving the library itself. If you
find a typo in the documentation, or have made improvements, do not hesitate to
send an email to the mailing list or, preferably, submit a GitHub pull request.
Documentation can be found under the
[doc/](https://github.com/scikit-learn/scikit-learn/tree/main/doc) directory.

But there are many other ways to help. In particular, answering queries on the
[issue tracker](https://github.com/scikit-learn/scikit-learn/issues),
investigating bugs, and [reviewing other developers' pull
requests](https://scikit-learn.org/dev/developers/contributing.html#code-review-guidelines)
are very valuable contributions that decrease the burden on the project
maintainers.

Another way to contribute is to report issues you're facing, and give a "thumbs
up" to issues that others reported and that are relevant to you. It also helps
us if you spread the word: reference the project from your blog and articles,
link to it from your website, or simply star it on GitHub to say "I use it".

Quick links
-----------

* [Submitting a bug report or feature request](https://scikit-learn.org/dev/developers/contributing.html#submitting-a-bug-report-or-a-feature-request)
* [Contributing code](https://scikit-learn.org/dev/developers/contributing.html#contributing-code)
* [Coding guidelines](https://scikit-learn.org/dev/developers/develop.html#coding-guidelines)
* [Tips to read current code](https://scikit-learn.org/dev/developers/contributing.html#reading-the-existing-code-base)

Code of Conduct
---------------

We abide by the principles of openness, respect, and consideration of others
of the Python Software Foundation: https://www.python.org/psf/codeofconduct/.

@@ -0,0 +1,29 @@

BSD 3-Clause License

Copyright (c) 2007-2024 The scikit-learn developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@@ -0,0 +1,36 @@

include *.rst
include *.build
recursive-include sklearn *.build
recursive-include doc *
recursive-include examples *
recursive-include sklearn *.c *.cpp *.h *.pyx *.pxd *.pxi *.tp
recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt *.arff.gz *.json.gz
include COPYING
include README.rst
include pyproject.toml
include sklearn/externals/README
include sklearn/svm/src/liblinear/COPYRIGHT
include sklearn/svm/src/libsvm/LIBSVM_CHANGES
include conftest.py
include Makefile
include MANIFEST.in
include .coveragerc

# exclude from sdist
recursive-exclude asv_benchmarks *
recursive-exclude benchmarks *
recursive-exclude build_tools *
recursive-exclude maint_tools *
recursive-exclude benchmarks *
recursive-exclude .binder *
recursive-exclude .circleci *
exclude .cirrus.star
exclude .codecov.yml
exclude .git-blame-ignore-revs
exclude .mailmap
exclude .pre-commit-config.yaml
exclude azure-pipelines.yml
exclude CODE_OF_CONDUCT.md
exclude CONTRIBUTING.md
exclude SECURITY.md
exclude PULL_REQUEST_TEMPLATE.md

@@ -0,0 +1,70 @@

# simple makefile to simplify repetitive build env management tasks under posix

# caution: testing won't work on Windows, see README

PYTHON ?= python
CYTHON ?= cython
PYTEST ?= pytest
CTAGS ?= ctags

# skip doctests on 32-bit Python
BITS := $(shell python -c 'import struct; print(8 * struct.calcsize("P"))')

all: clean inplace test

clean-ctags:
	rm -f tags

clean: clean-ctags
	$(PYTHON) setup.py clean
	rm -rf dist

in: inplace # just a shortcut
inplace:
	$(PYTHON) setup.py build_ext -i

dev-meson:
	pip install --verbose --no-build-isolation --editable . --config-settings editable-verbose=true

clean-meson:
	pip uninstall -y scikit-learn

test-code: in
	$(PYTEST) --showlocals -v sklearn --durations=20
test-sphinxext:
	$(PYTEST) --showlocals -v doc/sphinxext/
test-doc:
ifeq ($(BITS),64)
	$(PYTEST) $(shell find doc -name '*.rst' | sort)
endif
test-code-parallel: in
	$(PYTEST) -n auto --showlocals -v sklearn --durations=20

test-coverage:
	rm -rf coverage .coverage
	$(PYTEST) sklearn --showlocals -v --cov=sklearn --cov-report=html:coverage
test-coverage-parallel:
	rm -rf coverage .coverage .coverage.*
	$(PYTEST) sklearn -n auto --showlocals -v --cov=sklearn --cov-report=html:coverage

test: test-code test-sphinxext test-doc

trailing-spaces:
	find sklearn -name "*.py" -exec perl -pi -e 's/[ \t]*$$//' {} \;

cython:
	python setup.py build_src

ctags:
	# make tags for symbol based navigation in emacs and vim
	# Install with: sudo apt-get install exuberant-ctags
	$(CTAGS) --python-kinds=-i -R sklearn

doc: inplace
	$(MAKE) -C doc html

doc-noplot: inplace
	$(MAKE) -C doc html-noplot

code-analysis:
	build_tools/linting.sh

@@ -0,0 +1,301 @@

Metadata-Version: 2.1
Name: scikit-learn
Version: 1.5.1
Summary: A set of python modules for machine learning and data mining
Home-page: https://scikit-learn.org
Maintainer-Email: scikit-learn developers <scikit-learn@python.org>
License: new BSD
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Programming Language :: C
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
Classifier: Topic :: Scientific/Engineering
Classifier: Development Status :: 5 - Production/Stable
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Project-URL: Homepage, https://scikit-learn.org
Project-URL: Source, https://github.com/scikit-learn/scikit-learn
Project-URL: Download, https://pypi.org/project/scikit-learn/#files
Project-URL: Tracker, https://github.com/scikit-learn/scikit-learn/issues
Project-URL: Release notes, https://scikit-learn.org/stable/whats_new
Requires-Python: >=3.9
Requires-Dist: numpy>=1.19.5
Requires-Dist: scipy>=1.6.0
Requires-Dist: joblib>=1.2.0
Requires-Dist: threadpoolctl>=3.1.0
Requires-Dist: numpy>=1.19.5; extra == "build"
Requires-Dist: scipy>=1.6.0; extra == "build"
Requires-Dist: cython>=3.0.10; extra == "build"
Requires-Dist: meson-python>=0.16.0; extra == "build"
Requires-Dist: numpy>=1.19.5; extra == "install"
Requires-Dist: scipy>=1.6.0; extra == "install"
Requires-Dist: joblib>=1.2.0; extra == "install"
Requires-Dist: threadpoolctl>=3.1.0; extra == "install"
Requires-Dist: matplotlib>=3.3.4; extra == "benchmark"
Requires-Dist: pandas>=1.1.5; extra == "benchmark"
Requires-Dist: memory_profiler>=0.57.0; extra == "benchmark"
Requires-Dist: matplotlib>=3.3.4; extra == "docs"
Requires-Dist: scikit-image>=0.17.2; extra == "docs"
Requires-Dist: pandas>=1.1.5; extra == "docs"
Requires-Dist: seaborn>=0.9.0; extra == "docs"
Requires-Dist: memory_profiler>=0.57.0; extra == "docs"
Requires-Dist: sphinx>=7.3.7; extra == "docs"
Requires-Dist: sphinx-copybutton>=0.5.2; extra == "docs"
Requires-Dist: sphinx-gallery>=0.16.0; extra == "docs"
Requires-Dist: numpydoc>=1.2.0; extra == "docs"
Requires-Dist: Pillow>=7.1.2; extra == "docs"
Requires-Dist: pooch>=1.6.0; extra == "docs"
Requires-Dist: sphinx-prompt>=1.4.0; extra == "docs"
Requires-Dist: sphinxext-opengraph>=0.9.1; extra == "docs"
Requires-Dist: plotly>=5.14.0; extra == "docs"
Requires-Dist: polars>=0.20.23; extra == "docs"
Requires-Dist: sphinx-design>=0.5.0; extra == "docs"
Requires-Dist: sphinxcontrib-sass>=0.3.4; extra == "docs"
Requires-Dist: pydata-sphinx-theme>=0.15.3; extra == "docs"
Requires-Dist: sphinx-remove-toctrees>=1.0.0.post1; extra == "docs"
Requires-Dist: matplotlib>=3.3.4; extra == "examples"
Requires-Dist: scikit-image>=0.17.2; extra == "examples"
Requires-Dist: pandas>=1.1.5; extra == "examples"
Requires-Dist: seaborn>=0.9.0; extra == "examples"
Requires-Dist: pooch>=1.6.0; extra == "examples"
Requires-Dist: plotly>=5.14.0; extra == "examples"
Requires-Dist: matplotlib>=3.3.4; extra == "tests"
Requires-Dist: scikit-image>=0.17.2; extra == "tests"
Requires-Dist: pandas>=1.1.5; extra == "tests"
Requires-Dist: pytest>=7.1.2; extra == "tests"
Requires-Dist: pytest-cov>=2.9.0; extra == "tests"
Requires-Dist: ruff>=0.2.1; extra == "tests"
Requires-Dist: black>=24.3.0; extra == "tests"
Requires-Dist: mypy>=1.9; extra == "tests"
Requires-Dist: pyamg>=4.0.0; extra == "tests"
Requires-Dist: polars>=0.20.23; extra == "tests"
Requires-Dist: pyarrow>=12.0.0; extra == "tests"
Requires-Dist: numpydoc>=1.2.0; extra == "tests"
Requires-Dist: pooch>=1.6.0; extra == "tests"
Requires-Dist: conda-lock==2.5.6; extra == "maintenance"
Provides-Extra: build
Provides-Extra: install
Provides-Extra: benchmark
Provides-Extra: docs
Provides-Extra: examples
Provides-Extra: tests
Provides-Extra: maintenance
Description-Content-Type: text/x-rst

@@ -0,0 +1,206 @@

.. -*- mode: rst -*-

|Azure| |CirrusCI| |Codecov| |CircleCI| |Nightly wheels| |Black| |PythonVersion| |PyPi| |DOI| |Benchmark|

.. |Azure| image:: https://dev.azure.com/scikit-learn/scikit-learn/_apis/build/status/scikit-learn.scikit-learn?branchName=main
   :target: https://dev.azure.com/scikit-learn/scikit-learn/_build/latest?definitionId=1&branchName=main

.. |CircleCI| image:: https://circleci.com/gh/scikit-learn/scikit-learn/tree/main.svg?style=shield
   :target: https://circleci.com/gh/scikit-learn/scikit-learn

.. |CirrusCI| image:: https://img.shields.io/cirrus/github/scikit-learn/scikit-learn/main?label=Cirrus%20CI
   :target: https://cirrus-ci.com/github/scikit-learn/scikit-learn/main

.. |Codecov| image:: https://codecov.io/gh/scikit-learn/scikit-learn/branch/main/graph/badge.svg?token=Pk8G9gg3y9
   :target: https://codecov.io/gh/scikit-learn/scikit-learn

.. |Nightly wheels| image:: https://github.com/scikit-learn/scikit-learn/workflows/Wheel%20builder/badge.svg?event=schedule
   :target: https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22+event%3Aschedule

.. |PythonVersion| image:: https://img.shields.io/pypi/pyversions/scikit-learn.svg
   :target: https://pypi.org/project/scikit-learn/

.. |PyPi| image:: https://img.shields.io/pypi/v/scikit-learn
   :target: https://pypi.org/project/scikit-learn

.. |Black| image:: https://img.shields.io/badge/code%20style-black-000000.svg
   :target: https://github.com/psf/black

.. |DOI| image:: https://zenodo.org/badge/21369/scikit-learn/scikit-learn.svg
   :target: https://zenodo.org/badge/latestdoi/21369/scikit-learn/scikit-learn

.. |Benchmark| image:: https://img.shields.io/badge/Benchmarked%20by-asv-blue
   :target: https://scikit-learn.org/scikit-learn-benchmarks

.. |PythonMinVersion| replace:: 3.9
.. |NumPyMinVersion| replace:: 1.19.5
.. |SciPyMinVersion| replace:: 1.6.0
.. |JoblibMinVersion| replace:: 1.2.0
.. |ThreadpoolctlMinVersion| replace:: 3.1.0
.. |MatplotlibMinVersion| replace:: 3.3.4
.. |Scikit-ImageMinVersion| replace:: 0.17.2
.. |PandasMinVersion| replace:: 1.1.5
.. |SeabornMinVersion| replace:: 0.9.0
.. |PytestMinVersion| replace:: 7.1.2
.. |PlotlyMinVersion| replace:: 5.14.0

.. image:: https://raw.githubusercontent.com/scikit-learn/scikit-learn/main/doc/logos/scikit-learn-logo.png
   :target: https://scikit-learn.org/

**scikit-learn** is a Python module for machine learning built on top of
SciPy and is distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
for a list of core contributors.

It is currently maintained by a team of volunteers.

Website: https://scikit-learn.org

Installation
------------

Dependencies
~~~~~~~~~~~~

scikit-learn requires:

- Python (>= |PythonMinVersion|)
- NumPy (>= |NumPyMinVersion|)
- SciPy (>= |SciPyMinVersion|)
- joblib (>= |JoblibMinVersion|)
- threadpoolctl (>= |ThreadpoolctlMinVersion|)

**Scikit-learn 0.20 was the last version to support Python 2.7 and Python 3.4.**
scikit-learn 1.0 and later require Python 3.7 or newer.
scikit-learn 1.1 and later require Python 3.8 or newer.

Scikit-learn plotting capabilities (i.e., functions starting with ``plot_`` and
classes ending with ``Display``) require Matplotlib (>= |MatplotlibMinVersion|).
For running the examples, Matplotlib >= |MatplotlibMinVersion| is required.
A few examples require scikit-image >= |Scikit-ImageMinVersion|, a few examples
require pandas >= |PandasMinVersion|, and some examples require seaborn >=
|SeabornMinVersion| and plotly >= |PlotlyMinVersion|.
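
As a quick illustration (a minimal sketch, assuming scikit-learn and Matplotlib
are installed), one of the ``Display`` classes can be used like this::

    # Plot a confusion matrix using one of the ``*Display`` plotting classes.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import ConfusionMatrixDisplay

    X, y = load_iris(return_X_y=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    ConfusionMatrixDisplay.from_estimator(clf, X, y)  # draws on a Matplotlib figure
    plt.show()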

User installation
~~~~~~~~~~~~~~~~~

If you already have a working installation of NumPy and SciPy,
the easiest way to install scikit-learn is using ``pip``::

    pip install -U scikit-learn

or ``conda``::

    conda install -c conda-forge scikit-learn

The documentation includes more detailed `installation instructions <https://scikit-learn.org/stable/install.html>`_.
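
To check that the installation succeeded, you can print the installed version
and its main dependencies from Python (a minimal sketch; the exact output
depends on your environment)::

    import sklearn

    print(sklearn.__version__)   # the installed scikit-learn version
    sklearn.show_versions()      # versions of Python, NumPy, SciPy and other dependencies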

Changelog
---------

See the `changelog <https://scikit-learn.org/dev/whats_new.html>`__
for a history of notable changes to scikit-learn.

Development
-----------

We welcome new contributors of all experience levels. The scikit-learn
community goals are to be helpful, welcoming, and effective. The
`Development Guide <https://scikit-learn.org/stable/developers/index.html>`_
has detailed information about contributing code, documentation, tests, and
more. We've included some basic information in this README.

Important links
~~~~~~~~~~~~~~~

- Official source code repo: https://github.com/scikit-learn/scikit-learn
- Download releases: https://pypi.org/project/scikit-learn/
- Issue tracker: https://github.com/scikit-learn/scikit-learn/issues

Source code
~~~~~~~~~~~

You can check out the latest sources with the command::

    git clone https://github.com/scikit-learn/scikit-learn.git

Contributing
~~~~~~~~~~~~

To learn more about making a contribution to scikit-learn, please see our
`Contributing guide
<https://scikit-learn.org/dev/developers/contributing.html>`_.

Testing
~~~~~~~

After installation, you can launch the test suite from outside the source
directory (you will need to have ``pytest`` >= |PytestMinVersion| installed)::

    pytest sklearn

See the web page https://scikit-learn.org/dev/developers/contributing.html#testing-and-improving-test-coverage
for more information.

Random number generation can be controlled during testing by setting
the ``SKLEARN_SEED`` environment variable.
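
For example (a minimal sketch, assuming ``pytest`` is installed and that the
in-process invocation picks up the environment variable), the seed can be fixed
before running a subset of the suite::

    import os
    import pytest

    os.environ["SKLEARN_SEED"] = "42"  # fix the seed used by the randomized checks
    pytest.main(["--pyargs", "sklearn.tests.test_common", "-q"])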

Submitting a Pull Request
~~~~~~~~~~~~~~~~~~~~~~~~~

Before opening a Pull Request, have a look at the
full Contributing page to make sure your code complies
with our guidelines: https://scikit-learn.org/stable/developers/index.html

Project History
---------------

The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the `About us <https://scikit-learn.org/dev/about.html#authors>`__ page
for a list of core contributors.

The project is currently maintained by a team of volunteers.

**Note**: ``scikit-learn`` was previously referred to as ``scikits.learn``.

Help and Support
----------------

Documentation
~~~~~~~~~~~~~

- HTML documentation (stable release): https://scikit-learn.org
- HTML documentation (development version): https://scikit-learn.org/dev/
- FAQ: https://scikit-learn.org/stable/faq.html

Communication
~~~~~~~~~~~~~

- Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn
- Logos & Branding: https://github.com/scikit-learn/scikit-learn/tree/main/doc/logos
- Blog: https://blog.scikit-learn.org
- Calendar: https://blog.scikit-learn.org/calendar/
- Twitter: https://twitter.com/scikit_learn
- Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn
- GitHub Discussions: https://github.com/scikit-learn/scikit-learn/discussions
- Website: https://scikit-learn.org
- LinkedIn: https://www.linkedin.com/company/scikit-learn
- YouTube: https://www.youtube.com/channel/UCJosFjYm0ZYVUARxuOZqnnw/playlists
- Facebook: https://www.facebook.com/scikitlearnofficial/
- Instagram: https://www.instagram.com/scikitlearnofficial/
- TikTok: https://www.tiktok.com/@scikit.learn
- Mastodon: https://mastodon.social/@sklearn@fosstodon.org
- Discord: https://discord.gg/h9qyrK8Jc8

Citation
~~~~~~~~

If you use scikit-learn in a scientific publication, we would appreciate citations: https://scikit-learn.org/stable/about.html#citing-scikit-learn

@@ -0,0 +1,20 @@

# Security Policy

## Supported Versions

| Version       | Supported          |
| ------------- | ------------------ |
| 1.4.2         | :white_check_mark: |
| < 1.4.2       | :x:                |

## Reporting a Vulnerability

Please report security vulnerabilities by email to `security@scikit-learn.org`.
This email is an alias for a subset of the scikit-learn maintainers' team.

If the security vulnerability is accepted, a patch will be crafted privately
in order to prepare a dedicated bugfix release as quickly as possible (depending
on the complexity of the fix).

In addition to sending the report by email, you can also report security
vulnerabilities to [Tidelift](https://tidelift.com/security).

@@ -0,0 +1,152 @@

# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    = -T
SPHINXBUILD  ?= sphinx-build
PAPER         =
BUILDDIR      = _build

ifneq ($(EXAMPLES_PATTERN),)
    EXAMPLES_PATTERN_OPTS := -D sphinx_gallery_conf.filename_pattern="$(EXAMPLES_PATTERN)"
endif

ifeq ($(CI), true)
    # On CircleCI using -j2 does not seem to speed up the html-noplot build
    SPHINX_NUMJOBS_NOPLOT_DEFAULT=1
else ifeq ($(shell uname), Darwin)
    # Avoid stalling issues on MacOS
    SPHINX_NUMJOBS_NOPLOT_DEFAULT=1
else
    SPHINX_NUMJOBS_NOPLOT_DEFAULT=auto
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS)\
	$(EXAMPLES_PATTERN_OPTS) .


.PHONY: help clean html dirhtml ziphtml pickle json latex latexpdf changes linkcheck doctest optipng

all: html-noplot

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  ziphtml    to make a ZIP of the HTML"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"

clean:
	-rm -rf $(BUILDDIR)/*
	@echo "Removed $(BUILDDIR)/*"
	-rm -rf auto_examples/
	@echo "Removed auto_examples/"
	-rm -rf generated/*
	@echo "Removed generated/"
	-rm -rf modules/generated/
	@echo "Removed modules/generated/"
	-rm -rf css/styles/
	@echo "Removed css/styles/"
	-rm -rf api/*.rst
	@echo "Removed api/*.rst"

# Default to SPHINX_NUMJOBS=1 for full documentation build. Using
# SPHINX_NUMJOBS!=1 may actually slow down the build, or cause weird issues in
# the CI (job stalling or EOFError), see
# https://github.com/scikit-learn/scikit-learn/pull/25836 or
# https://github.com/scikit-learn/scikit-learn/pull/25809
html: SPHINX_NUMJOBS ?= 1
html:
	# These two lines make the build a bit more lengthy, and the
	# embedding of images more robust
	rm -rf $(BUILDDIR)/html/_images
	#rm -rf _build/doctrees/
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) -j$(SPHINX_NUMJOBS) $(BUILDDIR)/html/stable
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable."

# Default to SPHINX_NUMJOBS=auto (except on MacOS and CI) since this makes
# the html-noplot build faster
html-noplot: SPHINX_NUMJOBS ?= $(SPHINX_NUMJOBS_NOPLOT_DEFAULT)
html-noplot:
	$(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) -j$(SPHINX_NUMJOBS) \
		$(BUILDDIR)/html/stable
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

ziphtml:
	@if [ ! -d "$(BUILDDIR)/html/stable/" ]; then \
		make html; \
	fi
	# Optimize the images to reduce the size of the ZIP
	optipng $(BUILDDIR)/html/stable/_images/*.png
	# Exclude the output directory to avoid infinite recursion
	cd $(BUILDDIR)/html/stable; \
	zip -q -x _downloads \
		-r _downloads/scikit-learn-docs.zip .
	@echo
	@echo "Build finished. The ZIP of the HTML is in $(BUILDDIR)/html/stable/_downloads."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	make -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

download-data:
	python -c "from sklearn.datasets._lfw import _check_fetch_lfw; _check_fetch_lfw()"

# Optimize PNG files. Needs OptiPNG. Change the -P argument to the number of
# cores you have available, so -P 64 if you have a real computer ;)
optipng:
	find _build auto_examples */generated -name '*.png' -print0 \
	| xargs -0 -n 1 -P 4 optipng -o10

dist: html ziphtml

@@ -0,0 +1,6 @@

# Documentation for scikit-learn

This directory contains the full manual and website as displayed at
https://scikit-learn.org. See
https://scikit-learn.org/dev/developers/contributing.html#documentation for
detailed information about the documentation.

@@ -0,0 +1,599 @@

.. _about:

About us
========

History
-------

This project was started in 2007 as a Google Summer of Code project by
David Cournapeau. Later that year, Matthieu Brucher started work on
this project as part of his thesis.

In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent
Michel of INRIA took leadership of the project and made the first public
release on February 1st, 2010. Since then, several releases have appeared
following an approximately 3-month cycle, and a thriving international
community has been leading the development.

Governance
----------

The decision-making process and governance structure of scikit-learn is laid
out in the :ref:`governance document <governance>`.

.. The "author" anchors below are there to ensure that old html links (in
   the form of "about.html#author") still work.

.. _authors:

The people behind scikit-learn
------------------------------

Scikit-learn is a community project, developed by a large group of
people, all across the world. A few teams, listed below, have central
roles; however, a more complete list of contributors can be found `on
GitHub
<https://github.com/scikit-learn/scikit-learn/graphs/contributors>`__.

Maintainers Team
................

The following people are currently maintainers, in charge of
consolidating scikit-learn's development and maintenance:

.. include:: maintainers.rst

.. note::

   Please do not email the authors directly to ask for assistance or report issues.
   Instead, please see `What's the best way to ask questions about scikit-learn
   <https://scikit-learn.org/stable/faq.html#what-s-the-best-way-to-get-help-on-scikit-learn-usage>`_
   in the FAQ.

.. seealso::

   How you can :ref:`contribute to the project <contributing>`.

Documentation Team
..................

The following people help with documenting the project:

.. include:: documentation_team.rst

Contributor Experience Team
...........................

The following people are active contributors who also help with
:ref:`triaging issues <bug_triaging>`, PRs, and general
maintenance:

.. include:: contributor_experience_team.rst

Communication Team
..................

The following people help with :ref:`communication around scikit-learn
<communication_team>`.

.. include:: communication_team.rst

Emeritus Core Developers
........................

The following people have been active contributors in the past, but are no
longer active in the project:

.. include:: maintainers_emeritus.rst

Emeritus Communication Team
...........................

The following people have been active in the communication team in the
past, but no longer have communication responsibilities:

.. include:: communication_team_emeritus.rst

Emeritus Contributor Experience Team
....................................

The following people have been active in the contributor experience team in the
past:

.. include:: contributor_experience_team_emeritus.rst

.. _citing-scikit-learn:

Citing scikit-learn
-------------------

If you use scikit-learn in a scientific publication, we would appreciate
citations to the following paper:

`Scikit-learn: Machine Learning in Python
<https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html>`_, Pedregosa
*et al.*, JMLR 12, pp. 2825-2830, 2011.

Bibtex entry::

  @article{scikit-learn,
    title={Scikit-learn: Machine Learning in {P}ython},
    author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
            and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
            and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
            Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
    journal={Journal of Machine Learning Research},
    volume={12},
    pages={2825--2830},
    year={2011}
  }

If you want to cite scikit-learn for its API or design, you may also want to consider the
following paper:

:arxiv:`API design for machine learning software: experiences from the scikit-learn
project <1309.0238>`, Buitinck *et al.*, 2013.

Bibtex entry::

  @inproceedings{sklearn_api,
    author    = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and
                 Fabian Pedregosa and Andreas Mueller and Olivier Grisel and
                 Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort
                 and Jaques Grobler and Robert Layton and Jake VanderPlas and
                 Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux},
    title     = {{API} design for machine learning software: experiences from the scikit-learn
                 project},
    booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning},
    year      = {2013},
    pages     = {108--122},
  }

Artwork
-------

High quality PNG and SVG logos are available in the `doc/logos/
<https://github.com/scikit-learn/scikit-learn/tree/main/doc/logos>`_
source directory.

.. image:: images/scikit-learn-logo-notext.png
   :align: center
Funding
|
||||||
|
-------
|
||||||
|
|
||||||
|
Scikit-learn is a community driven project, however institutional and private
|
||||||
|
grants help to assure its sustainability.
|
||||||
|
|
||||||
|
The project would like to thank the following funders.
|
||||||
|
|
||||||
|
...................................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`:probabl. <https://probabl.ai>`_ funds Adrin Jalali, Arturo Amor, François Goupil,
|
||||||
|
Guillaume Lemaitre, Jérémie du Boisberranger, Olivier Grisel, and Stefanie Senger.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/probabl.png
|
||||||
|
:target: https://probabl.ai
|
||||||
|
|
||||||
|
..........
|
||||||
|
|
||||||
|
.. |chanel| image:: images/chanel.png
|
||||||
|
:target: https://www.chanel.com
|
||||||
|
|
||||||
|
.. |axa| image:: images/axa.png
|
||||||
|
:target: https://www.axa.fr/
|
||||||
|
|
||||||
|
.. |bnp| image:: images/bnp.png
|
||||||
|
:target: https://www.bnpparibascardif.com/
|
||||||
|
|
||||||
|
.. |dataiku| image:: images/dataiku.png
|
||||||
|
:target: https://www.dataiku.com/
|
||||||
|
|
||||||
|
.. |hf| image:: images/huggingface_logo-noborder.png
|
||||||
|
:target: https://huggingface.co
|
||||||
|
|
||||||
|
.. |nvidia| image:: images/nvidia.png
|
||||||
|
:target: https://www.nvidia.com
|
||||||
|
|
||||||
|
.. |inria| image:: images/inria-logo.jpg
|
||||||
|
:target: https://www.inria.fr
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<style>
|
||||||
|
table.image-subtable tr {
|
||||||
|
border-color: transparent;
|
||||||
|
}
|
||||||
|
|
||||||
|
table.image-subtable td {
|
||||||
|
width: 50%;
|
||||||
|
vertical-align: middle;
|
||||||
|
text-align: center;
|
||||||
|
}
|
||||||
|
|
||||||
|
table.image-subtable td img {
|
||||||
|
max-height: 40px !important;
|
||||||
|
max-width: 90% !important;
|
||||||
|
}
|
||||||
|
</style>
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
The `Members <https://scikit-learn.fondation-inria.fr/en/home/#sponsors>`_ of
|
||||||
|
the `Scikit-learn Consortium at Inria Foundation
|
||||||
|
<https://scikit-learn.fondation-inria.fr/en/home/>`_ help at maintaining and
|
||||||
|
improving the project through their financial support.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. table::
|
||||||
|
:class: image-subtable
|
||||||
|
|
||||||
|
+----------+-----------+
|
||||||
|
| |chanel| |
|
||||||
|
+----------+-----------+
|
||||||
|
| |axa| | |bnp| |
|
||||||
|
+----------+-----------+
|
||||||
|
| |nvidia| | |hf| |
|
||||||
|
+----------+-----------+
|
||||||
|
| |dataiku| |
|
||||||
|
+----------+-----------+
|
||||||
|
| |inria| |
|
||||||
|
+----------+-----------+
|
||||||
|
|
||||||
|
..........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`NVidia <https://nvidia.com>`_ funds Tim Head since 2022
|
||||||
|
and is part of the scikit-learn consortium at Inria.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/nvidia.png
|
||||||
|
:target: https://nvidia.com
|
||||||
|
|
||||||
|
..........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Microsoft <https://microsoft.com/>`_ funds Andreas Müller since 2020.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/microsoft.png
|
||||||
|
:target: https://microsoft.com
|
||||||
|
|
||||||
|
...........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Quansight Labs <https://labs.quansight.org>`_ funds Lucy Liu since 2022.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/quansight-labs.png
|
||||||
|
:target: https://labs.quansight.org
|
||||||
|
|
||||||
|
...........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Tidelift <https://tidelift.com/>`_ supports the project via their service
|
||||||
|
agreement.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/Tidelift-logo-on-light.svg
|
||||||
|
:target: https://tidelift.com/
|
||||||
|
|
||||||
|
...........
|
||||||
|
|
||||||
|
|
||||||
|
Past Sponsors
|
||||||
|
.............
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Quansight Labs <https://labs.quansight.org>`_ funded Meekail Zain in 2022 and 2023,
|
||||||
|
and funded Thomas J. Fan from 2021 to 2023.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/quansight-labs.png
|
||||||
|
:target: https://labs.quansight.org
|
||||||
|
|
||||||
|
...........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Columbia University <https://columbia.edu/>`_ funded Andreas Müller
|
||||||
|
(2016-2020).
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/columbia.png
|
||||||
|
:target: https://columbia.edu
|
||||||
|
|
||||||
|
........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`The University of Sydney <https://sydney.edu.au/>`_ funded Joel Nothman
|
||||||
|
(2017-2021).
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/sydney-primary.jpeg
|
||||||
|
:target: https://sydney.edu.au/
|
||||||
|
|
||||||
|
...........
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
Andreas Müller received a grant to improve scikit-learn from the
|
||||||
|
`Alfred P. Sloan Foundation <https://sloan.org>`_ .
|
||||||
|
This grant supported the position of Nicolas Hug and Thomas J. Fan.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/sloan_banner.png
|
||||||
|
:target: https://sloan.org/
|
||||||
|
|
||||||
|
.............
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`INRIA <https://www.inria.fr>`_ actively supports this project. It has
|
||||||
|
provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler
|
||||||
|
(2012-2013) and Olivier Grisel (2013-2017) to work on this project
|
||||||
|
full-time. It also hosts coding sprints and other events.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/inria-logo.jpg
|
||||||
|
:target: https://www.inria.fr
|
||||||
|
|
||||||
|
.....................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Paris-Saclay Center for Data Science <http://www.datascience-paris-saclay.fr/>`_
|
||||||
|
funded one year for a developer to work on the project full-time (2014-2015), 50%
|
||||||
|
of the time of Guillaume Lemaitre (2016-2017) and 50% of the time of Joris van den
|
||||||
|
Bossche (2017-2018).
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/cds-logo.png
|
||||||
|
:target: http://www.datascience-paris-saclay.fr/
|
||||||
|
|
||||||
|
..........................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`NYU Moore-Sloan Data Science Environment <https://cds.nyu.edu/mooresloan/>`_
|
||||||
|
funded Andreas Mueller (2014-2016) to work on this project. The Moore-Sloan
|
||||||
|
Data Science Environment also funds several students to work on the project
|
||||||
|
part-time.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/nyu_short_color.png
|
||||||
|
:target: https://cds.nyu.edu/mooresloan/
|
||||||
|
|
||||||
|
........................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`Télécom Paristech <https://www.telecom-paristech.fr/>`_ funded Manoj Kumar
|
||||||
|
(2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry Guillemot
|
||||||
|
(2016-2017) and Albert Thomas (2017) to work on scikit-learn.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/telecom.png
|
||||||
|
:target: https://www.telecom-paristech.fr/
|
||||||
|
|
||||||
|
.....................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`The Labex DigiCosme <https://digicosme.lri.fr>`_ funded Nicolas Goix
|
||||||
|
(2015-2016), Tom Dupré la Tour (2015-2016 and 2017-2018), Mathurin Massias
|
||||||
|
(2018-2019) to work part time on scikit-learn during their PhDs. It also
|
||||||
|
funded a scikit-learn coding sprint in 2015.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/digicosme.png
|
||||||
|
:target: https://digicosme.lri.fr
|
||||||
|
|
||||||
|
.....................
|
||||||
|
|
||||||
|
.. div:: sk-text-image-grid-small
|
||||||
|
|
||||||
|
.. div:: text-box
|
||||||
|
|
||||||
|
`The Chan-Zuckerberg Initiative <https://chanzuckerberg.com/>`_ funded Nicolas
|
||||||
|
Hug to work full-time on scikit-learn in 2020.
|
||||||
|
|
||||||
|
.. div:: image-box
|
||||||
|
|
||||||
|
.. image:: images/czi_logo.svg
|
||||||
|
:target: https://chanzuckerberg.com
|
||||||
|
|
||||||
|
......................
|
||||||
|
|
||||||
|
The following students were sponsored by `Google
|
||||||
|
<https://opensource.google/>`_ to work on scikit-learn through
|
||||||
|
the `Google Summer of Code <https://en.wikipedia.org/wiki/Google_Summer_of_Code>`_
|
||||||
|
program.
|
||||||
|
|
||||||
|
- 2007 - David Cournapeau
|
||||||
|
- 2011 - `Vlad Niculae`_
|
||||||
|
- 2012 - `Vlad Niculae`_, Immanuel Bayer
|
||||||
|
- 2013 - Kemal Eren, Nicolas Trésegnie
|
||||||
|
- 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar
|
||||||
|
- 2015 - `Raghav RV <https://github.com/raghavrv>`_, Wei Xue
|
||||||
|
- 2016 - `Nelson Liu <http://nelsonliu.me>`_, `YenChen Lin <https://yenchenlin.me/>`_
|
||||||
|
|
||||||
|
.. _Vlad Niculae: https://vene.ro/
|
||||||
|
|
||||||
|
...................
|
||||||
|
|
||||||
|
The `NeuroDebian <http://neuro.debian.net>`_ project, providing `Debian
|
||||||
|
<https://www.debian.org/>`_ packaging and contributions, is supported by
|
||||||
|
`Dr. James V. Haxby <http://haxbylab.dartmouth.edu/>`_ (`Dartmouth
|
||||||
|
College <https://pbs.dartmouth.edu/>`_).
|
||||||
|
|
||||||
|
...................
|
||||||
|
|
||||||
|
The following organizations funded the scikit-learn consortium at Inria in
|
||||||
|
the past:
|
||||||
|
|
||||||
|
.. |msn| image:: images/microsoft.png
|
||||||
|
:target: https://www.microsoft.com/
|
||||||
|
|
||||||
|
.. |bcg| image:: images/bcg.png
|
||||||
|
:target: https://www.bcg.com/beyond-consulting/bcg-gamma/default.aspx
|
||||||
|
|
||||||
|
.. |fujitsu| image:: images/fujitsu.png
|
||||||
|
:target: https://www.fujitsu.com/global/
|
||||||
|
|
||||||
|
.. |aphp| image:: images/logo_APHP_text.png
|
||||||
|
:target: https://aphp.fr/
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<style>
|
||||||
|
div.image-subgrid img {
|
||||||
|
max-height: 50px;
|
||||||
|
max-width: 90%;
|
||||||
|
}
|
||||||
|
</style>
|
||||||
|
|
||||||
|
.. grid:: 2 2 4 4
|
||||||
|
:class-row: image-subgrid
|
||||||
|
:gutter: 1
|
||||||
|
|
||||||
|
.. grid-item::
|
||||||
|
:class: sd-text-center
|
||||||
|
:child-align: center
|
||||||
|
|
||||||
|
|msn|
|
||||||
|
|
||||||
|
.. grid-item::
|
||||||
|
:class: sd-text-center
|
||||||
|
:child-align: center
|
||||||
|
|
||||||
|
|bcg|
|
||||||
|
|
||||||
|
.. grid-item::
|
||||||
|
:class: sd-text-center
|
||||||
|
:child-align: center
|
||||||
|
|
||||||
|
|fujitsu|
|
||||||
|
|
||||||
|
.. grid-item::
|
||||||
|
:class: sd-text-center
|
||||||
|
:child-align: center
|
||||||
|
|
||||||
|
|aphp|
|
||||||
|
|
||||||
|
|
||||||
|
Sprints
|
||||||
|
-------
|
||||||
|
|
||||||
|
- The International 2019 Paris sprint was kindly hosted by `AXA <https://www.axa.fr/>`_.
|
||||||
|
Some participants could also attend thanks to the support of the `Alfred P.
|
||||||
|
Sloan Foundation <https://sloan.org>`_, the `Python Software
|
||||||
|
Foundation <https://www.python.org/psf/>`_ (PSF) and the `DATAIA Institute
|
||||||
|
<https://dataia.eu/en>`_.
|
||||||
|
|
||||||
|
- The 2013 International Paris Sprint was made possible thanks to the support of
|
||||||
|
`Télécom Paristech <https://www.telecom-paristech.fr/>`_, `tinyclues
|
||||||
|
<https://www.tinyclues.com/>`_, the `French Python Association
|
||||||
|
<https://www.afpy.org/>`_ and the `Fonds de la Recherche Scientifique
|
||||||
|
<https://www.frs-fnrs.be>`_.
|
||||||
|
|
||||||
|
- The 2011 International Granada sprint was made possible thanks to the support
|
||||||
|
of the `PSF <https://www.python.org/psf/>`_ and `tinyclues
|
||||||
|
<https://www.tinyclues.com/>`_.
|
||||||
|
|
||||||
|
|
||||||
|
Donating to the project
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
If you are interested in donating to the project or to one of our code-sprints,
|
||||||
|
please donate via the `NumFOCUS Donations Page
|
||||||
|
<https://numfocus.org/donate-to-scikit-learn>`_.
|
||||||
|
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<p class="text-center">
|
||||||
|
<a class="btn sk-btn-orange mb-1" href="https://numfocus.org/donate-to-scikit-learn">
|
||||||
|
Help us, <strong>donate!</strong>
|
||||||
|
</a>
|
||||||
|
</p>
|
||||||
|
|
||||||
|
All donations will be handled by `NumFOCUS <https://numfocus.org/>`_, a non-profit
|
||||||
|
organization which is managed by a board of `Scipy community members
|
||||||
|
<https://numfocus.org/board.html>`_. NumFOCUS's mission is to foster scientific
|
||||||
|
computing software, in particular in Python. As a fiscal home of scikit-learn, it
|
||||||
|
ensures that money is available when needed to keep the project funded and available
|
||||||
|
while in compliance with tax regulations.
|
||||||
|
|
||||||
|
Donations received for the scikit-learn project will mostly go towards covering
|
||||||
|
travel expenses for code sprints, as well as towards the organization budget of the
|
||||||
|
project [#f1]_.
|
||||||
|
|
||||||
|
.. rubric:: Notes
|
||||||
|
|
||||||
|
.. [#f1] Regarding the organization budget, in particular, we might use some of
|
||||||
|
the donated funds to pay for other project expenses such as DNS,
|
||||||
|
hosting or continuous integration services.
|
||||||
|
|
||||||
|
|
||||||
|
Infrastructure support
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
We would also like to thank `Microsoft Azure <https://azure.microsoft.com/en-us/>`_,
|
||||||
|
`Cirrus CI <https://cirrus-ci.org>`_ and `CircleCI <https://circleci.com/>`_ for free CPU
|
||||||
|
time on their Continuous Integration servers, and `Anaconda Inc. <https://www.anaconda.com>`_
|
||||||
|
for the storage they provide for our staging and nightly builds.
|
|
@ -0,0 +1,24 @@
|
||||||
|
:html_theme.sidebar_secondary.remove:
|
||||||
|
|
||||||
|
.. _api_depr_ref:
|
||||||
|
|
||||||
|
Recently Deprecated
|
||||||
|
===================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn
|
||||||
|
|
||||||
|
{% for ver, objs in DEPRECATED_API_REFERENCE %}
|
||||||
|
.. _api_depr_ref-{{ ver|replace(".", "-") }}:
|
||||||
|
|
||||||
|
.. rubric:: To be removed in {{ ver }}
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
:nosignatures:
|
||||||
|
:toctree: ../modules/generated/
|
||||||
|
:template: base.rst
|
||||||
|
|
||||||
|
{% for obj in objs %}
|
||||||
|
{{ obj }}
|
||||||
|
{%- endfor %}
|
||||||
|
|
||||||
|
{% endfor %}
|
|
@ -0,0 +1,77 @@
|
||||||
|
:html_theme.sidebar_secondary.remove:
|
||||||
|
|
||||||
|
.. _api_ref:
|
||||||
|
|
||||||
|
=============
|
||||||
|
API Reference
|
||||||
|
=============
|
||||||
|
|
||||||
|
This is the class and function reference of scikit-learn. Please refer to the
|
||||||
|
:ref:`full user guide <user_guide>` for further details, as the raw specifications of
|
||||||
|
classes and functions may not be enough to give full guidelines on their uses. For
|
||||||
|
reference on concepts repeated across the API, see :ref:`glossary`.
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
:hidden:
|
||||||
|
|
||||||
|
{% for module, _ in API_REFERENCE %}
|
||||||
|
{{ module }}
|
||||||
|
{%- endfor %}
|
||||||
|
{%- if DEPRECATED_API_REFERENCE %}
|
||||||
|
deprecated
|
||||||
|
{%- endif %}
|
||||||
|
|
||||||
|
.. list-table::
|
||||||
|
:header-rows: 1
|
||||||
|
:class: apisearch-table
|
||||||
|
|
||||||
|
* - Object
|
||||||
|
- Description
|
||||||
|
|
||||||
|
{% for module, module_info in API_REFERENCE %}
|
||||||
|
{% for section in module_info["sections"] %}
|
||||||
|
{% for obj in section["autosummary"] %}
|
||||||
|
{% set parts = obj.rsplit(".", 1) %}
|
||||||
|
{% if parts|length > 1 %}
|
||||||
|
{% set full_module = module + "." + parts[0] %}
|
||||||
|
{% else %}
|
||||||
|
{% set full_module = module %}
|
||||||
|
{% endif %}
|
||||||
|
* - :obj:`~{{ module }}.{{ obj }}`
|
||||||
|
|
||||||
|
- .. div:: sk-apisearch-desc
|
||||||
|
|
||||||
|
.. currentmodule:: {{ full_module }}
|
||||||
|
|
||||||
|
.. autoshortsummary:: {{ module }}.{{ obj }}
|
||||||
|
|
||||||
|
.. div:: caption
|
||||||
|
|
||||||
|
:mod:`{{ full_module }}`
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
{% for ver, objs in DEPRECATED_API_REFERENCE %}
|
||||||
|
{% for obj in objs %}
|
||||||
|
{% set parts = obj.rsplit(".", 1) %}
|
||||||
|
{% if parts|length > 1 %}
|
||||||
|
{% set full_module = "sklearn." + parts[0] %}
|
||||||
|
{% else %}
|
||||||
|
{% set full_module = "sklearn" %}
|
||||||
|
{% endif %}
|
||||||
|
* - :obj:`~sklearn.{{ obj }}`
|
||||||
|
|
||||||
|
- .. div:: sk-apisearch-desc
|
||||||
|
|
||||||
|
.. currentmodule:: {{ full_module }}
|
||||||
|
|
||||||
|
.. autoshortsummary:: sklearn.{{ obj }}
|
||||||
|
|
||||||
|
.. div:: caption
|
||||||
|
|
||||||
|
:mod:`{{ full_module }}`
|
||||||
|
:bdg-ref-danger-line:`Deprecated in version {{ ver }} <api_depr_ref-{{ ver|replace(".", "-") }}>`
|
||||||
|
{% endfor %}
|
||||||
|
{% endfor %}
|
|
@ -0,0 +1,46 @@
|
||||||
|
:html_theme.sidebar_secondary.remove:
|
||||||
|
|
||||||
|
{% if module == "sklearn" -%}
|
||||||
|
{%- set module_hook = "sklearn" -%}
|
||||||
|
{%- elif module.startswith("sklearn.") -%}
|
||||||
|
{%- set module_hook = module[8:] -%}
|
||||||
|
{%- else -%}
|
||||||
|
{%- set module_hook = None -%}
|
||||||
|
{%- endif -%}
|
||||||
|
|
||||||
|
{% if module_hook %}
|
||||||
|
.. _{{ module_hook }}_ref:
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{{ module }}
|
||||||
|
{{ "=" * module|length }}
|
||||||
|
|
||||||
|
.. automodule:: {{ module }}
|
||||||
|
|
||||||
|
{% if module_info["description"] %}
|
||||||
|
{{ module_info["description"] }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% for section in module_info["sections"] %}
|
||||||
|
{% if section["title"] and module_hook %}
|
||||||
|
.. _{{ module_hook }}_ref-{{ section["title"]|lower|replace(" ", "-") }}:
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% if section["title"] %}
|
||||||
|
{{ section["title"] }}
|
||||||
|
{{ "-" * section["title"]|length }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{% if section["description"] %}
|
||||||
|
{{ section["description"] }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
:nosignatures:
|
||||||
|
:toctree: ../modules/generated/
|
||||||
|
:template: base.rst
|
||||||
|
|
||||||
|
{% for obj in section["autosummary"] %}
|
||||||
|
{{ obj }}
|
||||||
|
{%- endfor %}
|
||||||
|
{% endfor %}
|
|
@ -0,0 +1,5 @@
|
||||||
|
# A binder requirement file is required by sphinx-gallery.
|
||||||
|
# We don't really need one since our binder requirement file lives in the
|
||||||
|
# .binder directory.
|
||||||
|
# This file can be removed if 'dependencies' is made an optional key for
|
||||||
|
# binder in sphinx-gallery.
|
|
@ -0,0 +1,574 @@
|
||||||
|
.. _common_pitfalls:
|
||||||
|
|
||||||
|
=========================================
|
||||||
|
Common pitfalls and recommended practices
|
||||||
|
=========================================
|
||||||
|
|
||||||
|
The purpose of this chapter is to illustrate some common pitfalls and
|
||||||
|
anti-patterns that occur when using scikit-learn. It provides
|
||||||
|
examples of what **not** to do, along with a corresponding correct
|
||||||
|
example.
|
||||||
|
|
||||||
|
Inconsistent preprocessing
|
||||||
|
==========================
|
||||||
|
|
||||||
|
scikit-learn provides a library of :ref:`data-transforms`, which
|
||||||
|
may clean (see :ref:`preprocessing`), reduce
|
||||||
|
(see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`)
|
||||||
|
or generate (see :ref:`feature_extraction`) feature representations.
|
||||||
|
If these data transforms are used when training a model, they also
|
||||||
|
must be used on subsequent datasets, whether it's test data or
|
||||||
|
data in a production system. Otherwise, the feature space will change,
|
||||||
|
and the model will not be able to perform effectively.
|
||||||
|
|
||||||
|
For the following example, let's create a synthetic dataset with a
|
||||||
|
single feature::
|
||||||
|
|
||||||
|
>>> from sklearn.datasets import make_regression
|
||||||
|
>>> from sklearn.model_selection import train_test_split
|
||||||
|
|
||||||
|
>>> random_state = 42
|
||||||
|
>>> X, y = make_regression(random_state=random_state, n_features=1, noise=1)
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
... X, y, test_size=0.4, random_state=random_state)
|
||||||
|
|
||||||
|
**Wrong**
|
||||||
|
|
||||||
|
The train dataset is scaled, but not the test dataset, so model
|
||||||
|
performance on the test dataset is worse than expected::
|
||||||
|
|
||||||
|
>>> from sklearn.metrics import mean_squared_error
|
||||||
|
>>> from sklearn.linear_model import LinearRegression
|
||||||
|
>>> from sklearn.preprocessing import StandardScaler
|
||||||
|
|
||||||
|
>>> scaler = StandardScaler()
|
||||||
|
>>> X_train_transformed = scaler.fit_transform(X_train)
|
||||||
|
>>> model = LinearRegression().fit(X_train_transformed, y_train)
|
||||||
|
>>> mean_squared_error(y_test, model.predict(X_test))
|
||||||
|
62.80...
|
||||||
|
|
||||||
|
**Right**
|
||||||
|
|
||||||
|
Instead of passing the non-transformed `X_test` to `predict`, we should
|
||||||
|
transform the test data, the same way we transformed the training data::
|
||||||
|
|
||||||
|
>>> X_test_transformed = scaler.transform(X_test)
|
||||||
|
>>> mean_squared_error(y_test, model.predict(X_test_transformed))
|
||||||
|
0.90...
|
||||||
|
|
||||||
|
Alternatively, we recommend using a :class:`Pipeline
|
||||||
|
<sklearn.pipeline.Pipeline>`, which makes it easier to chain transformations
|
||||||
|
with estimators, and reduces the possibility of forgetting a transformation::
|
||||||
|
|
||||||
|
>>> from sklearn.pipeline import make_pipeline
|
||||||
|
|
||||||
|
>>> model = make_pipeline(StandardScaler(), LinearRegression())
|
||||||
|
>>> model.fit(X_train, y_train)
|
||||||
|
Pipeline(steps=[('standardscaler', StandardScaler()),
|
||||||
|
('linearregression', LinearRegression())])
|
||||||
|
>>> mean_squared_error(y_test, model.predict(X_test))
|
||||||
|
0.90...
|
||||||
|
|
||||||
|
Pipelines also help to avoid another common pitfall: leaking the test data
|
||||||
|
into the training data.
|
||||||
|
|
||||||
|
.. _data_leakage:
|
||||||
|
|
||||||
|
Data leakage
|
||||||
|
============
|
||||||
|
|
||||||
|
Data leakage occurs when information that would not be available at prediction
|
||||||
|
time is used when building the model. This results in overly optimistic
|
||||||
|
performance estimates, for example from :ref:`cross-validation
|
||||||
|
<cross_validation>`, and thus poorer performance when the model is used
|
||||||
|
on actually novel data, for example during production.
|
||||||
|
|
||||||
|
A common cause is not keeping the test and train data subsets separate.
|
||||||
|
Test data should never be used to make choices about the model.
|
||||||
|
**The general rule is to never call** `fit` **on the test data**. While this
|
||||||
|
may sound obvious, this is easy to miss in some cases, for example when
|
||||||
|
applying certain pre-processing steps.
|
||||||
|
|
||||||
|
Although both train and test data subsets should receive the same
|
||||||
|
preprocessing transformation (as described in the previous section), it is
|
||||||
|
important that these transformations are only learnt from the training data.
|
||||||
|
For example, if you have a
|
||||||
|
normalization step where you divide by the average value, the average should
|
||||||
|
be the average of the train subset, **not** the average of all the data. If the
|
||||||
|
test subset is included in the average calculation, information from the test
|
||||||
|
subset is influencing the model.
|
||||||
|
|
||||||
|
How to avoid data leakage
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
Below are some tips on avoiding data leakage:
|
||||||
|
|
||||||
|
* Always split the data into train and test subsets first, particularly
|
||||||
|
before any preprocessing steps.
|
||||||
|
* Never include test data when using the `fit` and `fit_transform`
|
||||||
|
methods. Using all the data, e.g., `fit(X)`, can result in overly optimistic
|
||||||
|
scores.
|
||||||
|
|
||||||
|
Conversely, the `transform` method should be used on both train and test
|
||||||
|
subsets as the same preprocessing should be applied to all the data.
|
||||||
|
This can be achieved by using `fit_transform` on the train subset and
|
||||||
|
`transform` on the test subset.
|
||||||
|
* The scikit-learn :ref:`pipeline <pipeline>` is a great way to prevent data
|
||||||
|
leakage as it ensures that the appropriate method is performed on the
|
||||||
|
correct data subset. The pipeline is ideal for use in cross-validation
|
||||||
|
and hyper-parameter tuning functions.
|
||||||
|
|
||||||
|
An example of data leakage during preprocessing is detailed below.
|
||||||
|
|
||||||
|
Data leakage during pre-processing
|
||||||
|
----------------------------------
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
We here choose to illustrate data leakage with a feature selection step.
|
||||||
|
This risk of leakage is however relevant with almost all transformations
|
||||||
|
in scikit-learn, including (but not limited to)
|
||||||
|
:class:`~sklearn.preprocessing.StandardScaler`,
|
||||||
|
:class:`~sklearn.impute.SimpleImputer`, and
|
||||||
|
:class:`~sklearn.decomposition.PCA`.
|
||||||
|
|
||||||
|
A number of :ref:`feature_selection` functions are available in scikit-learn.
|
||||||
|
They can help remove irrelevant, redundant and noisy features as well as
|
||||||
|
improve your model build time and performance. As with any other type of
|
||||||
|
preprocessing, feature selection should **only** use the training data.
|
||||||
|
Including the test data in feature selection will optimistically bias your
|
||||||
|
model.
|
||||||
|
|
||||||
|
To demonstrate, we will create a binary classification problem with
|
||||||
|
10,000 randomly generated features::
|
||||||
|
|
||||||
|
>>> import numpy as np
|
||||||
|
>>> n_samples, n_features, n_classes = 200, 10000, 2
|
||||||
|
>>> rng = np.random.RandomState(42)
|
||||||
|
>>> X = rng.standard_normal((n_samples, n_features))
|
||||||
|
>>> y = rng.choice(n_classes, n_samples)
|
||||||
|
|
||||||
|
**Wrong**
|
||||||
|
|
||||||
|
Using all the data to perform feature selection results in an accuracy score
|
||||||
|
much higher than chance, even though our targets are completely random.
|
||||||
|
This randomness means that our `X` and `y` are independent and we thus expect
|
||||||
|
the accuracy to be around 0.5. However, since the feature selection step
|
||||||
|
'sees' the test data, the model has an unfair advantage. In the incorrect
|
||||||
|
example below we first use all the data for feature selection and then split
|
||||||
|
the data into training and test subsets for model fitting. The result is a
|
||||||
|
much higher than expected accuracy score::
|
||||||
|
|
||||||
|
>>> from sklearn.model_selection import train_test_split
|
||||||
|
>>> from sklearn.feature_selection import SelectKBest
|
||||||
|
>>> from sklearn.ensemble import GradientBoostingClassifier
|
||||||
|
>>> from sklearn.metrics import accuracy_score
|
||||||
|
|
||||||
|
>>> # Incorrect preprocessing: the entire data is transformed
|
||||||
|
>>> X_selected = SelectKBest(k=25).fit_transform(X, y)
|
||||||
|
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
... X_selected, y, random_state=42)
|
||||||
|
>>> gbc = GradientBoostingClassifier(random_state=1)
|
||||||
|
>>> gbc.fit(X_train, y_train)
|
||||||
|
GradientBoostingClassifier(random_state=1)
|
||||||
|
|
||||||
|
>>> y_pred = gbc.predict(X_test)
|
||||||
|
>>> accuracy_score(y_test, y_pred)
|
||||||
|
0.76
|
||||||
|
|
||||||
|
**Right**
|
||||||
|
|
||||||
|
To prevent data leakage, it is good practice to split your data into train
|
||||||
|
and test subsets **first**. Feature selection can then be performed using just
|
||||||
|
the train dataset. Notice that whenever we use `fit` or `fit_transform`, we
|
||||||
|
only use the train dataset. The score is now what we would expect for the
|
||||||
|
data, close to chance::
|
||||||
|
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
... X, y, random_state=42)
|
||||||
|
>>> select = SelectKBest(k=25)
|
||||||
|
>>> X_train_selected = select.fit_transform(X_train, y_train)
|
||||||
|
|
||||||
|
>>> gbc = GradientBoostingClassifier(random_state=1)
|
||||||
|
>>> gbc.fit(X_train_selected, y_train)
|
||||||
|
GradientBoostingClassifier(random_state=1)
|
||||||
|
|
||||||
|
>>> X_test_selected = select.transform(X_test)
|
||||||
|
>>> y_pred = gbc.predict(X_test_selected)
|
||||||
|
>>> accuracy_score(y_test, y_pred)
|
||||||
|
0.46
|
||||||
|
|
||||||
|
Here again, we recommend using a :class:`~sklearn.pipeline.Pipeline` to chain
|
||||||
|
together the feature selection and model estimators. The pipeline ensures
|
||||||
|
that only the training data is used when performing `fit` and the test data
|
||||||
|
is used only for calculating the accuracy score::
|
||||||
|
|
||||||
|
>>> from sklearn.pipeline import make_pipeline
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
... X, y, random_state=42)
|
||||||
|
>>> pipeline = make_pipeline(SelectKBest(k=25),
|
||||||
|
... GradientBoostingClassifier(random_state=1))
|
||||||
|
>>> pipeline.fit(X_train, y_train)
|
||||||
|
Pipeline(steps=[('selectkbest', SelectKBest(k=25)),
|
||||||
|
('gradientboostingclassifier',
|
||||||
|
GradientBoostingClassifier(random_state=1))])
|
||||||
|
|
||||||
|
>>> y_pred = pipeline.predict(X_test)
|
||||||
|
>>> accuracy_score(y_test, y_pred)
|
||||||
|
0.46
|
||||||
|
|
||||||
|
The pipeline can also be fed into a cross-validation
|
||||||
|
function such as :func:`~sklearn.model_selection.cross_val_score`.
|
||||||
|
Again, the pipeline ensures that the correct data subset and estimator
|
||||||
|
method is used during fitting and predicting::
|
||||||
|
|
||||||
|
>>> from sklearn.model_selection import cross_val_score
|
||||||
|
>>> scores = cross_val_score(pipeline, X, y)
|
||||||
|
>>> print(f"Mean accuracy: {scores.mean():.2f}+/-{scores.std():.2f}")
|
||||||
|
Mean accuracy: 0.46+/-0.07
|
||||||
|
|
||||||
|
|
||||||
|
.. _randomness:
|
||||||
|
|
||||||
|
Controlling randomness
|
||||||
|
======================
|
||||||
|
|
||||||
|
Some scikit-learn objects are inherently random. These are usually estimators
|
||||||
|
(e.g. :class:`~sklearn.ensemble.RandomForestClassifier`) and cross-validation
|
||||||
|
splitters (e.g. :class:`~sklearn.model_selection.KFold`). The randomness of
|
||||||
|
these objects is controlled via their `random_state` parameter, as described
|
||||||
|
in the :term:`Glossary <random_state>`. This section expands on the glossary
|
||||||
|
entry, and describes good practices and common pitfalls w.r.t. this
|
||||||
|
subtle parameter.
|
||||||
|
|
||||||
|
.. note:: Recommendation summary
|
||||||
|
|
||||||
|
For optimal robustness of cross-validation (CV) results, pass
|
||||||
|
`RandomState` instances when creating estimators, or leave `random_state`
|
||||||
|
to `None`. Passing integers to CV splitters is usually the safest option
|
||||||
|
and is preferable; passing `RandomState` instances to splitters may
|
||||||
|
sometimes be useful to achieve very specific use-cases.
|
||||||
|
For both estimators and splitters, passing an integer vs passing an
|
||||||
|
instance (or `None`) leads to subtle but significant differences,
|
||||||
|
especially for CV procedures. These differences are important to
|
||||||
|
understand when reporting results.
|
||||||
|
|
||||||
|
For reproducible results across executions, remove any use of
|
||||||
|
`random_state=None`.
|
||||||
|
|
||||||
|
Using `None` or `RandomState` instances, and repeated calls to `fit` and `split`
|
||||||
|
--------------------------------------------------------------------------------
|
||||||
|
|
||||||
|
The `random_state` parameter determines whether multiple calls to :term:`fit`
|
||||||
|
(for estimators) or to :term:`split` (for CV splitters) will produce the same
|
||||||
|
results, according to these rules:
|
||||||
|
|
||||||
|
- If an integer is passed, calling `fit` or `split` multiple times always
|
||||||
|
yields the same results.
|
||||||
|
- If `None` or a `RandomState` instance is passed: `fit` and `split` will
|
||||||
|
yield different results each time they are called, and the succession of
|
||||||
|
calls explores all sources of entropy. `None` is the default value for all
|
||||||
|
`random_state` parameters.
|
||||||
|
|
||||||
|
We here illustrate these rules for both estimators and CV splitters.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
Since passing `random_state=None` is equivalent to passing the global
|
||||||
|
`RandomState` instance from `numpy`
|
||||||
|
(`random_state=np.random.mtrand._rand`), we will not explicitly mention
|
||||||
|
`None` here. Everything that applies to instances also applies to using
|
||||||
|
`None`.
|
||||||
|
|
||||||
|
Estimators
|
||||||
|
..........
|
||||||
|
|
||||||
|
Passing instances means that calling `fit` multiple times will not yield the
|
||||||
|
same results, even if the estimator is fitted on the same data and with the
|
||||||
|
same hyper-parameters::
|
||||||
|
|
||||||
|
>>> from sklearn.linear_model import SGDClassifier
|
||||||
|
>>> from sklearn.datasets import make_classification
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> rng = np.random.RandomState(0)
|
||||||
|
>>> X, y = make_classification(n_features=5, random_state=rng)
|
||||||
|
>>> sgd = SGDClassifier(random_state=rng)
|
||||||
|
|
||||||
|
>>> sgd.fit(X, y).coef_
|
||||||
|
array([[ 8.85418642, 4.79084103, -3.13077794, 8.11915045, -0.56479934]])
|
||||||
|
|
||||||
|
>>> sgd.fit(X, y).coef_
|
||||||
|
array([[ 6.70814003, 5.25291366, -7.55212743, 5.18197458, 1.37845099]])
|
||||||
|
|
||||||
|
We can see from the snippet above that repeatedly calling `sgd.fit` has
|
||||||
|
produced different models, even though the data was the same. This is because the
|
||||||
|
Random Number Generator (RNG) of the estimator is consumed (i.e. mutated)
|
||||||
|
when `fit` is called, and this mutated RNG will be used in the subsequent
|
||||||
|
calls to `fit`. In addition, the `rng` object is shared across all objects
|
||||||
|
that use it, and as a consequence, these objects become somewhat
|
||||||
|
inter-dependent. For example, two estimators that share the same
|
||||||
|
`RandomState` instance will influence each other, as we will see later when
|
||||||
|
we discuss cloning. This point is important to keep in mind when debugging.
|
||||||
|
|
||||||
|
If we had passed an integer to the `random_state` parameter of the
|
||||||
|
:class:`~sklearn.linear_model.SGDClassifier`, we would have obtained the
|
||||||
|
same models, and thus the same scores each time. When we pass an integer, the
|
||||||
|
same RNG is used across all calls to `fit`. What internally happens is that
|
||||||
|
even though the RNG is consumed when `fit` is called, it is always reset to
|
||||||
|
its original state at the beginning of `fit`.
|
||||||
|
|
||||||
|
CV splitters
|
||||||
|
............
|
||||||
|
|
||||||
|
Randomized CV splitters have a similar behavior when a `RandomState`
|
||||||
|
instance is passed; calling `split` multiple times yields different data
|
||||||
|
splits::
|
||||||
|
|
||||||
|
>>> from sklearn.model_selection import KFold
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> X = y = np.arange(10)
|
||||||
|
>>> rng = np.random.RandomState(0)
|
||||||
|
>>> cv = KFold(n_splits=2, shuffle=True, random_state=rng)
|
||||||
|
|
||||||
|
>>> for train, test in cv.split(X, y):
|
||||||
|
... print(train, test)
|
||||||
|
[0 3 5 6 7] [1 2 4 8 9]
|
||||||
|
[1 2 4 8 9] [0 3 5 6 7]
|
||||||
|
|
||||||
|
>>> for train, test in cv.split(X, y):
|
||||||
|
... print(train, test)
|
||||||
|
[0 4 6 7 8] [1 2 3 5 9]
|
||||||
|
[1 2 3 5 9] [0 4 6 7 8]
|
||||||
|
|
||||||
|
We can see that the splits are different the second time `split` is
|
||||||
|
called. This may lead to unexpected results if you compare the performance of
|
||||||
|
multiple estimators by calling `split` many times, as we will see in the next
|
||||||
|
section.
|
||||||
|
|
||||||
|
Common pitfalls and subtleties
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
While the rules that govern the `random_state` parameter are seemingly simple,
|
||||||
|
they do however have some subtle implications. In some cases, this can even
|
||||||
|
lead to wrong conclusions.
|
||||||
|
|
||||||
|
Estimators
|
||||||
|
..........
|
||||||
|
|
||||||
|
**Different `random_state` types lead to different cross-validation
|
||||||
|
procedures**
|
||||||
|
|
||||||
|
Depending on the type of the `random_state` parameter, estimators will behave
|
||||||
|
differently, especially in cross-validation procedures. Consider the
|
||||||
|
following snippet::
|
||||||
|
|
||||||
|
>>> from sklearn.ensemble import RandomForestClassifier
|
||||||
|
>>> from sklearn.datasets import make_classification
|
||||||
|
>>> from sklearn.model_selection import cross_val_score
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> X, y = make_classification(random_state=0)
|
||||||
|
|
||||||
|
>>> rf_123 = RandomForestClassifier(random_state=123)
|
||||||
|
>>> cross_val_score(rf_123, X, y)
|
||||||
|
array([0.85, 0.95, 0.95, 0.9 , 0.9 ])
|
||||||
|
|
||||||
|
>>> rf_inst = RandomForestClassifier(random_state=np.random.RandomState(0))
|
||||||
|
>>> cross_val_score(rf_inst, X, y)
|
||||||
|
array([0.9 , 0.95, 0.95, 0.9 , 0.9 ])
|
||||||
|
|
||||||
|
We see that the cross-validated scores of `rf_123` and `rf_inst` are
|
||||||
|
different, as should be expected since we didn't pass the same `random_state`
|
||||||
|
parameter. However, the difference between these scores is more subtle than
|
||||||
|
it looks, and **the cross-validation procedures that were performed by**
|
||||||
|
:func:`~sklearn.model_selection.cross_val_score` **significantly differ in
|
||||||
|
each case**:
|
||||||
|
|
||||||
|
- Since `rf_123` was passed an integer, every call to `fit` uses the same RNG:
|
||||||
|
this means that all random characteristics of the random forest estimator
|
||||||
|
will be the same for each of the 5 folds of the CV procedure. In
|
||||||
|
particular, the (randomly chosen) subset of features of the estimator will
|
||||||
|
be the same across all folds.
|
||||||
|
- Since `rf_inst` was passed a `RandomState` instance, each call to `fit`
|
||||||
|
starts from a different RNG. As a result, the random subset of features
|
||||||
|
will be different for each fold.
|
||||||
|
|
||||||
|
While having a constant estimator RNG across folds isn't inherently wrong, we
|
||||||
|
usually want CV results that are robust w.r.t. the estimator's randomness. As
|
||||||
|
a result, passing an instance instead of an integer may be preferable, since
|
||||||
|
it will allow the estimator RNG to vary for each fold.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
Here, :func:`~sklearn.model_selection.cross_val_score` will use a
|
||||||
|
non-randomized CV splitter (as is the default), so both estimators will
|
||||||
|
be evaluated on the same splits. This section is not about variability in
|
||||||
|
the splits. Also, whether we pass an integer or an instance to
|
||||||
|
:func:`~sklearn.datasets.make_classification` isn't relevant for our
|
||||||
|
illustration purpose: what matters is what we pass to the
|
||||||
|
:class:`~sklearn.ensemble.RandomForestClassifier` estimator.
|
||||||
|
|
||||||
|
.. dropdown:: Cloning
|
||||||
|
|
||||||
|
Another subtle side effect of passing `RandomState` instances is how
|
||||||
|
:func:`~sklearn.base.clone` will work::
|
||||||
|
|
||||||
|
>>> from sklearn import clone
|
||||||
|
>>> from sklearn.ensemble import RandomForestClassifier
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> rng = np.random.RandomState(0)
|
||||||
|
>>> a = RandomForestClassifier(random_state=rng)
|
||||||
|
>>> b = clone(a)
|
||||||
|
|
||||||
|
Since a `RandomState` instance was passed to `a`, `a` and `b` are not clones
|
||||||
|
in the strict sense, but rather clones in the statistical sense: `a` and `b`
|
||||||
|
will still be different models, even when calling `fit(X, y)` on the same
|
||||||
|
data. Moreover, `a` and `b` will influence each other since they share the
|
||||||
|
same internal RNG: calling `a.fit` will consume `b`'s RNG, and calling
|
||||||
|
`b.fit` will consume `a`'s RNG, since they are the same. This is true for
|
||||||
|
any estimators that share a `random_state` parameter; it is not specific to
|
||||||
|
clones.
|
||||||
|
|
||||||
|
If an integer were passed, `a` and `b` would be exact clones and they would not
|
||||||
|
influence each other.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
Even though :func:`~sklearn.base.clone` is rarely used in user code, it is
|
||||||
|
called pervasively throughout the scikit-learn codebase: in particular, most
|
||||||
|
meta-estimators that accept non-fitted estimators call
|
||||||
|
:func:`~sklearn.base.clone` internally
|
||||||
|
(:class:`~sklearn.model_selection.GridSearchCV`,
|
||||||
|
:class:`~sklearn.ensemble.StackingClassifier`,
|
||||||
|
:class:`~sklearn.calibration.CalibratedClassifierCV`, etc.).
|
||||||
|
|
||||||
|
|
||||||
|
CV splitters
|
||||||
|
............
|
||||||
|
|
||||||
|
When passed a `RandomState` instance, CV splitters yield different splits
|
||||||
|
each time `split` is called. When comparing different estimators, this can
|
||||||
|
lead to overestimating the variance of the difference in performance between
|
||||||
|
the estimators::
|
||||||
|
|
||||||
|
>>> from sklearn.naive_bayes import GaussianNB
|
||||||
|
>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
|
||||||
|
>>> from sklearn.datasets import make_classification
|
||||||
|
>>> from sklearn.model_selection import KFold
|
||||||
|
>>> from sklearn.model_selection import cross_val_score
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> rng = np.random.RandomState(0)
|
||||||
|
>>> X, y = make_classification(random_state=rng)
|
||||||
|
>>> cv = KFold(shuffle=True, random_state=rng)
|
||||||
|
>>> lda = LinearDiscriminantAnalysis()
|
||||||
|
>>> nb = GaussianNB()
|
||||||
|
|
||||||
|
>>> for est in (lda, nb):
|
||||||
|
... print(cross_val_score(est, X, y, cv=cv))
|
||||||
|
[0.8 0.75 0.75 0.7 0.85]
|
||||||
|
[0.85 0.95 0.95 0.85 0.95]
|
||||||
|
|
||||||
|
|
||||||
|
Directly comparing the performance of the
|
||||||
|
:class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` estimator
|
||||||
|
vs the :class:`~sklearn.naive_bayes.GaussianNB` estimator **on each fold** would
|
||||||
|
be a mistake: **the splits on which the estimators are evaluated are
|
||||||
|
different**. Indeed, :func:`~sklearn.model_selection.cross_val_score` will
|
||||||
|
internally call `cv.split` on the same
|
||||||
|
:class:`~sklearn.model_selection.KFold` instance, but the splits will be
|
||||||
|
different each time. This is also true for any tool that performs model
|
||||||
|
selection via cross-validation, e.g.
|
||||||
|
:class:`~sklearn.model_selection.GridSearchCV` and
|
||||||
|
:class:`~sklearn.model_selection.RandomizedSearchCV`: scores are not
|
||||||
|
comparable fold-to-fold across different calls to `search.fit`, since
|
||||||
|
`cv.split` would have been called multiple times. Within a single call to
|
||||||
|
`search.fit`, however, fold-to-fold comparison is possible since the search
|
||||||
|
estimator only calls `cv.split` once.
|
||||||
|
|
||||||
|
For comparable fold-to-fold results in all scenarios, one should pass an
|
||||||
|
integer to the CV splitter: `cv = KFold(shuffle=True, random_state=0)`.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
While fold-to-fold comparison is not advisable with `RandomState`
|
||||||
|
instances, one can however expect that average scores allow one to conclude
|
||||||
|
whether one estimator is better than another, as long as enough folds and
|
||||||
|
data are used.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
What matters in this example is what was passed to
|
||||||
|
:class:`~sklearn.model_selection.KFold`. Whether we pass a `RandomState`
|
||||||
|
instance or an integer to :func:`~sklearn.datasets.make_classification`
|
||||||
|
is not relevant for our illustration purpose. Also, neither
|
||||||
|
:class:`~sklearn.discriminant_analysis.LinearDiscriminantAnalysis` nor
|
||||||
|
:class:`~sklearn.naive_bayes.GaussianNB` are randomized estimators.
|
||||||
|
|
||||||
|
General recommendations
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
Getting reproducible results across multiple executions
|
||||||
|
.......................................................
|
||||||
|
|
||||||
|
In order to obtain reproducible (i.e. constant) results across multiple
|
||||||
|
*program executions*, we need to remove all uses of `random_state=None`, which
|
||||||
|
is the default. The recommended way is to declare a `rng` variable at the top
|
||||||
|
of the program, and pass it down to any object that accepts a `random_state`
|
||||||
|
parameter::
|
||||||
|
|
||||||
|
>>> from sklearn.ensemble import RandomForestClassifier
|
||||||
|
>>> from sklearn.datasets import make_classification
|
||||||
|
>>> from sklearn.model_selection import train_test_split
|
||||||
|
>>> import numpy as np
|
||||||
|
|
||||||
|
>>> rng = np.random.RandomState(0)
|
||||||
|
>>> X, y = make_classification(random_state=rng)
|
||||||
|
>>> rf = RandomForestClassifier(random_state=rng)
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
|
||||||
|
... random_state=rng)
|
||||||
|
>>> rf.fit(X_train, y_train).score(X_test, y_test)
|
||||||
|
0.84
|
||||||
|
|
||||||
|
We are now guaranteed that the result of this script will always be 0.84, no
|
||||||
|
matter how many times we run it. Changing the global `rng` variable to a
|
||||||
|
different value should affect the results, as expected.
|
||||||
|
|
||||||
|
It is also possible to declare the `rng` variable as an integer. This may
|
||||||
|
however lead to less robust cross-validation results, as we will see in the
|
||||||
|
next section.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
We do not recommend setting the global `numpy` seed by calling
|
||||||
|
`np.random.seed(0)`. See `here
|
||||||
|
<https://stackoverflow.com/questions/5836335/consistently-create-same-random-numpy-array/5837352#comment6712034_5837352>`_
|
||||||
|
for a discussion.
|
||||||
|
|
||||||
|
Robustness of cross-validation results
|
||||||
|
......................................
|
||||||
|
|
||||||
|
When we evaluate a randomized estimator's performance by cross-validation, we
|
||||||
|
want to make sure that the estimator can yield accurate predictions for new
|
||||||
|
data, but we also want to make sure that the estimator is robust w.r.t. its
|
||||||
|
random initialization. For example, we would like the random weights
|
||||||
|
initialization of a :class:`~sklearn.linear_model.SGDClassifier` to be
|
||||||
|
consistently good across all folds: otherwise, when we train that estimator
|
||||||
|
on new data, we might get unlucky and the random initialization may lead to
|
||||||
|
bad performance. Similarly, we want a random forest to be robust w.r.t. the
|
||||||
|
set of randomly selected features that each tree will be using.
|
||||||
|
|
||||||
|
For these reasons, it is preferable to evaluate the cross-validation
|
||||||
|
performance by letting the estimator use a different RNG on each fold. This
|
||||||
|
is done by passing a `RandomState` instance (or `None`) to the estimator
|
||||||
|
initialization.
|
||||||
|
|
||||||
|
When we pass an integer, the estimator will use the same RNG on each fold:
|
||||||
|
if the estimator performs well (or badly), as evaluated by CV, it might just be
|
||||||
|
because we got lucky (or unlucky) with that specific seed. Passing instances
|
||||||
|
leads to more robust CV results, and makes the comparison between various
|
||||||
|
algorithms fairer. It also helps limit the temptation to treat the
|
||||||
|
estimator's RNG as a hyper-parameter that can be tuned.
|
||||||
|
|
||||||
|
Whether we pass `RandomState` instances or integers to CV splitters has no
|
||||||
|
impact on robustness, as long as `split` is only called once. When `split`
|
||||||
|
is called multiple times, fold-to-fold comparison isn't possible anymore. As
|
||||||
|
a result, passing an integer to CV splitters is usually safer and covers most
|
||||||
|
use-cases.
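
As a minimal sketch of these recommendations (a `RandomState` instance for the
estimator, so that its RNG varies across folds, and an integer for the splitter,
so that the splits stay reproducible), assuming a small synthetic dataset::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> from sklearn.datasets import make_classification
    >>> from sklearn.model_selection import KFold, cross_val_score
    >>> import numpy as np

    >>> rng = np.random.RandomState(0)
    >>> X, y = make_classification(random_state=0)
    >>> rf = RandomForestClassifier(random_state=rng)  # estimator RNG varies per fold
    >>> cv = KFold(n_splits=5, shuffle=True, random_state=0)  # reproducible splits
    >>> scores = cross_val_score(rf, X, y, cv=cv)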
|
|
@ -0,0 +1,16 @@
|
||||||
|
.. raw :: html
|
||||||
|
|
||||||
|
<!-- Generated by generate_authors_table.py -->
|
||||||
|
<div class="sk-authors-container">
|
||||||
|
<style>
|
||||||
|
img.avatar {border-radius: 10px;}
|
||||||
|
</style>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/laurburke'><img src='https://avatars.githubusercontent.com/u/35973528?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Lauren Burke</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/francoisgoupil'><img src='https://avatars.githubusercontent.com/u/98105626?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>François Goupil</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
|
@ -0,0 +1 @@
|
||||||
|
- Reshama Shaikh
|
|
@ -0,0 +1,10 @@
|
||||||
|
============================
|
||||||
|
Computing with scikit-learn
|
||||||
|
============================
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
|
||||||
|
computing/scaling_strategies
|
||||||
|
computing/computational_performance
|
||||||
|
computing/parallelism
|
|
@ -0,0 +1,366 @@
|
||||||
|
.. _computational_performance:
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn
|
||||||
|
|
||||||
|
Computational Performance
|
||||||
|
=========================
|
||||||
|
|
||||||
|
For some applications the performance (mainly latency and throughput at
|
||||||
|
prediction time) of estimators is crucial. It may also be of interest to
|
||||||
|
consider the training throughput but this is often less important in a
|
||||||
|
production setup (where it often takes place offline).
|
||||||
|
|
||||||
|
We will review here the orders of magnitude you can expect from a number of
|
||||||
|
scikit-learn estimators in different contexts and provide some tips and
|
||||||
|
tricks for overcoming performance bottlenecks.
|
||||||
|
|
||||||
|
Prediction latency is measured as the elapsed time necessary to make a
|
||||||
|
prediction (e.g. in microseconds). Latency is often viewed as a distribution
|
||||||
|
and operations engineers often focus on the latency at a given percentile of
|
||||||
|
this distribution (e.g. the 90th percentile).
|
||||||
|
|
||||||
|
Prediction throughput is defined as the number of predictions the software can
|
||||||
|
deliver in a given amount of time (e.g. in predictions per second).
|
||||||
|
|
||||||
|
An important aspect of performance optimization is also that it can hurt
|
||||||
|
prediction accuracy. Indeed, simpler models (e.g. linear instead of
|
||||||
|
non-linear, or with fewer parameters) often run faster but are not always able
|
||||||
|
to take into account the same exact properties of the data as more complex ones.
|
||||||
|
|
||||||
|
Prediction Latency
|
||||||
|
------------------
|
||||||
|
|
||||||
|
One of the most straightforward concerns one may have when using/choosing a
|
||||||
|
machine learning toolkit is the latency at which predictions can be made in a
|
||||||
|
production environment.
|
||||||
|
|
||||||
|
The main factors that influence the prediction latency are
|
||||||
|
|
||||||
|
1. Number of features
|
||||||
|
2. Input data representation and sparsity
|
||||||
|
3. Model complexity
|
||||||
|
4. Feature extraction
|
||||||
|
|
||||||
|
A final major factor is the possibility of doing predictions in bulk or
|
||||||
|
one-at-a-time mode.
|
||||||
|
|
||||||
|
Bulk versus Atomic mode
|
||||||
|
........................
|
||||||
|
|
||||||
|
In general doing predictions in bulk (many instances at the same time) is
|
||||||
|
more efficient for a number of reasons (branching predictability, CPU cache,
|
||||||
|
linear algebra library optimizations, etc.). Here we see, in a setting
|
||||||
|
with few features, that independently of the estimator choice the bulk mode is
|
||||||
|
always faster, and for some of them by 1 to 2 orders of magnitude:
|
||||||
|
|
||||||
|
.. |atomic_prediction_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_001.png
|
||||||
|
:target: ../auto_examples/applications/plot_prediction_latency.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |atomic_prediction_latency|
|
||||||
|
|
||||||
|
.. |bulk_prediction_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_002.png
|
||||||
|
:target: ../auto_examples/applications/plot_prediction_latency.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |bulk_prediction_latency|
|
||||||
|
|
||||||
|
To benchmark different estimators for your case you can simply change the
|
||||||
|
``n_features`` parameter in this example:
|
||||||
|
:ref:`sphx_glr_auto_examples_applications_plot_prediction_latency.py`. This should give
|
||||||
|
you an estimate of the order of magnitude of the prediction latency.
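
As a rough, hand-rolled sketch of the same comparison (assuming a simple linear
model fitted on synthetic data; absolute timings will vary with your hardware),
atomic and bulk calls to ``predict`` can be timed directly::

    import time

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=10_000, n_features=50, random_state=0)
    model = LinearRegression().fit(X, y)

    # Atomic mode: one sample per call to predict
    start = time.perf_counter()
    for row in X[:1000]:
        model.predict(row.reshape(1, -1))
    atomic_time = time.perf_counter() - start

    # Bulk mode: the same 1000 samples in a single call
    start = time.perf_counter()
    model.predict(X[:1000])
    bulk_time = time.perf_counter() - start

    print(f"atomic: {atomic_time:.4f} s, bulk: {bulk_time:.4f} s")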
|
||||||
|
|
||||||
|
Configuring Scikit-learn for reduced validation overhead
|
||||||
|
.........................................................
|
||||||
|
|
||||||
|
Scikit-learn does some validation on data that increases the overhead per
|
||||||
|
call to ``predict`` and similar functions. In particular, checking that
|
||||||
|
features are finite (not NaN or infinite) involves a full pass over the
|
||||||
|
data. If you ensure that your data is acceptable, you may suppress
|
||||||
|
checking for finiteness by setting the environment variable
|
||||||
|
``SKLEARN_ASSUME_FINITE`` to a non-empty string before importing
|
||||||
|
scikit-learn, or configure it in Python with :func:`set_config`.
|
||||||
|
For more control than these global settings, a :func:`config_context`
|
||||||
|
allows you to set this configuration within a specified context::
|
||||||
|
|
||||||
|
>>> import sklearn
|
||||||
|
>>> with sklearn.config_context(assume_finite=True):
|
||||||
|
... pass # do learning/prediction here with reduced validation
|
||||||
|
|
||||||
|
Note that this will affect all uses of
|
||||||
|
:func:`~utils.assert_all_finite` within the context.
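
The same setting can also be changed globally with :func:`set_config`. The
snippet below is only a small illustration; use it only when you are certain
your data contains no NaN or infinite values::

    >>> import sklearn
    >>> sklearn.set_config(assume_finite=True)   # skip finiteness checks globally
    >>> sklearn.get_config()['assume_finite']
    True
    >>> sklearn.set_config(assume_finite=False)  # restore the default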
|
||||||
|
|
||||||
|
Influence of the Number of Features
|
||||||
|
....................................
|
||||||
|
|
||||||
|
Obviously, when the number of features increases, so does the memory
|
||||||
|
consumption of each example. Indeed, for a matrix of :math:`M` instances
|
||||||
|
with :math:`N` features, the space complexity is in :math:`O(NM)`.
|
||||||
|
From a computing perspective it also means that the number of basic operations
|
||||||
|
(e.g., multiplications for vector-matrix products in linear models) increases
|
||||||
|
too. Here is a graph of the evolution of the prediction latency with the
|
||||||
|
number of features:
|
||||||
|
|
||||||
|
.. |influence_of_n_features_on_latency| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_003.png
|
||||||
|
:target: ../auto_examples/applications/plot_prediction_latency.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |influence_of_n_features_on_latency|
|
||||||
|
|
||||||
|
Overall you can expect the prediction time to increase at least linearly with
|
||||||
|
the number of features (non-linear cases can happen depending on the global
|
||||||
|
memory footprint and estimator).
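
A minimal sketch for measuring this effect yourself (assuming a
:class:`~linear_model.Ridge` model on dense synthetic data; timings are
illustrative only and depend on your hardware)::

    import time

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge

    for n_features in (100, 1_000, 10_000):
        X, y = make_regression(n_samples=500, n_features=n_features, random_state=0)
        model = Ridge().fit(X, y)
        start = time.perf_counter()
        model.predict(X)
        elapsed = time.perf_counter() - start
        print(f"{n_features} features: {elapsed * 1e3:.2f} ms for 500 predictions")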
|
||||||
|
|
||||||
|
Influence of the Input Data Representation
|
||||||
|
...........................................
|
||||||
|
|
||||||
|
Scipy provides sparse matrix data structures which are optimized for storing
|
||||||
|
sparse data. The main feature of sparse formats is that you don't store zeros
|
||||||
|
so if your data is sparse then you use much less memory. A non-zero value in
|
||||||
|
a sparse (`CSR or CSC <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_)
|
||||||
|
representation will only take on average one 32-bit integer position + the 64-bit
|
||||||
|
floating point value + an additional 32 bits per row or column in the matrix.
|
||||||
|
Using sparse input on a dense (or sparse) linear model can speed up prediction
|
||||||
|
by quite a bit, as only the non-zero valued features impact the dot product
|
||||||
|
and thus the model predictions. Hence if you have 100 non-zeros in a 1e6
|
||||||
|
dimensional space, you only need 100 multiply-and-add operations instead of 1e6.
|
||||||
|
|
||||||
|
Calculation over a dense representation, however, may leverage highly optimized
|
||||||
|
vector operations and multithreading in BLAS, and tends to result in fewer CPU
|
||||||
|
cache misses. So the sparsity should typically be quite high (10% non-zeros
|
||||||
|
max, to be checked depending on the hardware) for the sparse input
|
||||||
|
representation to be faster than the dense input representation on a machine
|
||||||
|
with many CPUs and an optimized BLAS implementation.
|
||||||
|
|
||||||
|
Here is sample code to test the sparsity of your input::
|
||||||
|
|
||||||
|
    import numpy as np

    def sparsity_ratio(X):
        # fraction of entries in the 2D array X that are exactly zero
        return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])

    print("input sparsity ratio:", sparsity_ratio(X))
|
||||||
|
|
||||||
|
As a rule of thumb you can consider that if the sparsity ratio is greater
|
||||||
|
than 90% you can probably benefit from sparse formats. Check Scipy's sparse
|
||||||
|
matrix formats `documentation <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_
|
||||||
|
for more information on how to build (or convert your data to) sparse matrix
|
||||||
|
formats. Most of the time the ``CSR`` and ``CSC`` formats work best.
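
As a hedged sketch (assuming a highly sparse random input and a linear model
that supports sparse input, such as :class:`~linear_model.SGDClassifier`), one
can build a ``CSR`` matrix directly and check its sparsity before predicting::

    import numpy as np
    import scipy.sparse as sp
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = sp.random(1_000, 10_000, density=0.001, format="csr", random_state=rng)
    y = rng.randint(0, 2, size=1_000)

    clf = SGDClassifier(random_state=0).fit(X, y)

    # prediction only touches the stored non-zero entries of the CSR matrix
    predictions = clf.predict(X)
    print("input sparsity ratio:", 1.0 - X.nnz / (X.shape[0] * X.shape[1]))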
|
||||||
|
|
||||||
|
Influence of the Model Complexity
|
||||||
|
..................................
|
||||||
|
|
||||||
|
Generally speaking, when model complexity increases, predictive power and
|
||||||
|
latency are supposed to increase. Increasing predictive power is usually
|
||||||
|
interesting, but for many applications we would better not increase
|
||||||
|
prediction latency too much. We will now review this idea for different
|
||||||
|
families of supervised models.
|
||||||
|
|
||||||
|
For :mod:`sklearn.linear_model` (e.g. Lasso, ElasticNet,
|
||||||
|
SGDClassifier/Regressor, Ridge & RidgeClassifier,
|
||||||
|
PassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression...) the
|
||||||
|
decision function that is applied at prediction time is the same (a dot product)
|
||||||
|
so latency should be equivalent.
|
||||||
|
|
||||||
|
Here is an example using
|
||||||
|
:class:`~linear_model.SGDClassifier` with the
|
||||||
|
``elasticnet`` penalty. The regularization strength is globally controlled by
|
||||||
|
the ``alpha`` parameter. With a sufficiently high ``alpha``,
|
||||||
|
one can then increase the ``l1_ratio`` parameter of ``elasticnet`` to
|
||||||
|
enforce various levels of sparsity in the model coefficients. Higher sparsity
|
||||||
|
here is interpreted as lower model complexity, as we need fewer coefficients to
|
||||||
|
describe it fully. Of course sparsity influences in turn the prediction time
|
||||||
|
as the sparse dot-product takes time roughly proportional to the number of
|
||||||
|
non-zero coefficients.
|
||||||
|
|
||||||
|
.. |en_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_001.png
|
||||||
|
:target: ../auto_examples/applications/plot_model_complexity_influence.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |en_model_complexity|
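
A small sketch of this trade-off (assuming synthetic classification data; the
exact coefficient counts depend on the data and on ``alpha``) could look as
follows::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier

    X, y = make_classification(n_samples=500, n_features=100, random_state=0)

    for l1_ratio in (0.1, 0.5, 0.9):
        clf = SGDClassifier(penalty="elasticnet", alpha=0.1,
                            l1_ratio=l1_ratio, random_state=0).fit(X, y)
        print(f"l1_ratio={l1_ratio}: "
              f"{np.count_nonzero(clf.coef_)} non-zero coefficients")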
|
||||||
|
|
||||||
|
For the :mod:`sklearn.svm` family of algorithms with a non-linear kernel,
|
||||||
|
the latency is tied to the number of support vectors (the fewer the faster).
|
||||||
|
Latency and throughput should (asymptotically) grow linearly with the number
|
||||||
|
of support vectors in an SVC or SVR model. The kernel will also influence the
|
||||||
|
latency as it is used to compute the projection of the input vector once per
|
||||||
|
support vector. In the following graph the ``nu`` parameter of
|
||||||
|
:class:`~svm.NuSVR` was used to influence the number of
|
||||||
|
support vectors.
|
||||||
|
|
||||||
|
.. |nusvr_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_002.png
|
||||||
|
:target: ../auto_examples/applications/plot_model_complexity_influence.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |nusvr_model_complexity|
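
A minimal sketch of the same experiment (assuming synthetic regression data;
the exact counts will differ)::

    from sklearn.datasets import make_regression
    from sklearn.svm import NuSVR

    X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

    for nu in (0.1, 0.5, 0.9):
        model = NuSVR(nu=nu).fit(X, y)
        # nu is an upper bound on the fraction of margin errors and a lower
        # bound on the fraction of support vectors
        print(f"nu={nu}: {len(model.support_)} support vectors")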
|
||||||
|
|
||||||
|
For :mod:`sklearn.ensemble` of trees (e.g. RandomForest, GBT,
|
||||||
|
ExtraTrees, etc.) the number of trees and their depth play the most
|
||||||
|
important role. Latency and throughput should scale linearly with the number
|
||||||
|
of trees. In this case we directly used the ``n_estimators`` parameter of
|
||||||
|
:class:`~ensemble.GradientBoostingRegressor`.
|
||||||
|
|
||||||
|
.. |gbt_model_complexity| image:: ../auto_examples/applications/images/sphx_glr_plot_model_complexity_influence_003.png
|
||||||
|
:target: ../auto_examples/applications/plot_model_complexity_influence.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |gbt_model_complexity|
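
As a rough sketch (assuming synthetic data; timings are illustrative only),
prediction latency can be measured for different numbers of trees as follows::

    import time

    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=1_000, n_features=20, random_state=0)

    for n_estimators in (10, 100, 500):
        model = GradientBoostingRegressor(n_estimators=n_estimators,
                                          random_state=0).fit(X, y)
        start = time.perf_counter()
        model.predict(X)
        elapsed = time.perf_counter() - start
        print(f"{n_estimators} trees: {elapsed * 1e3:.2f} ms for 1000 predictions")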
|
||||||
|
|
||||||
|
In any case be warned that decreasing model complexity can hurt accuracy as
|
||||||
|
mentioned above. For instance a non-linearly separable problem can be handled
|
||||||
|
with a fast linear model, but predictive power will very likely suffer in
|
||||||
|
the process.
|
||||||
|
|
||||||
|
Feature Extraction Latency
|
||||||
|
..........................
|
||||||
|
|
||||||
|
Most scikit-learn models are usually pretty fast as they are implemented
|
||||||
|
either with compiled Cython extensions or optimized computing libraries.
|
||||||
|
On the other hand, in many real world applications the feature extraction
|
||||||
|
process (i.e. turning raw data like database rows or network packets into
|
||||||
|
numpy arrays) governs the overall prediction time. For example on the Reuters
|
||||||
|
text classification task the whole preparation (reading and parsing SGML
|
||||||
|
files, tokenizing the text and hashing it into a common vector space) takes
|
||||||
|
100 to 500 times more time than the actual prediction code, depending on
|
||||||
|
the chosen model.
|
||||||
|
|
||||||
|
.. |prediction_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_004.png
|
||||||
|
:target: ../auto_examples/applications/plot_out_of_core_classification.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |prediction_time|
|
||||||
|
|
||||||
|
In many cases it is thus recommended to carefully time and profile your
|
||||||
|
feature extraction code as it may be a good place to start optimizing when
|
||||||
|
your overall latency is too high for your application.
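
As a hedged sketch of such profiling (assuming a text pipeline built on
:class:`~feature_extraction.text.HashingVectorizer` with placeholder documents),
the two stages can be timed separately::

    import time

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    raw_documents = ["spam spam spam", "ham and eggs", "more spam"] * 1_000
    labels = [1, 0, 1] * 1_000

    vectorizer = HashingVectorizer()
    clf = SGDClassifier(random_state=0)

    # feature extraction: raw strings -> sparse feature matrix
    start = time.perf_counter()
    X = vectorizer.transform(raw_documents)
    extraction_time = time.perf_counter() - start

    clf.fit(X, labels)

    # prediction on the already extracted features
    start = time.perf_counter()
    clf.predict(X)
    prediction_time = time.perf_counter() - start

    print(f"extraction: {extraction_time:.3f} s, prediction: {prediction_time:.3f} s")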
|
||||||
|
|
||||||
|
Prediction Throughput
|
||||||
|
----------------------
|
||||||
|
|
||||||
|
Another important metric to care about when sizing production systems is the
|
||||||
|
throughput, i.e. the number of predictions you can make in a given amount of
|
||||||
|
time. Here is a benchmark from the
|
||||||
|
:ref:`sphx_glr_auto_examples_applications_plot_prediction_latency.py` example that measures
|
||||||
|
this quantity for a number of estimators on synthetic data:
|
||||||
|
|
||||||
|
.. |throughput_benchmark| image:: ../auto_examples/applications/images/sphx_glr_plot_prediction_latency_004.png
|
||||||
|
:target: ../auto_examples/applications/plot_prediction_latency.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |throughput_benchmark|
|
||||||
|
|
||||||
|
These throughputs are achieved on a single process. An obvious way to
|
||||||
|
increase the throughput of your application is to spawn additional instances
|
||||||
|
(usually processes in Python because of the
|
||||||
|
`GIL <https://wiki.python.org/moin/GlobalInterpreterLock>`_) that share the
|
||||||
|
same model. One might also add machines to spread the load. A detailed
|
||||||
|
explanation on how to achieve this is beyond the scope of this documentation
|
||||||
|
though.
|
||||||
|
|
||||||
|
Tips and Tricks
|
||||||
|
----------------
|
||||||
|
|
||||||
|
Linear algebra libraries
|
||||||
|
.........................
|
||||||
|
|
||||||
|
As scikit-learn relies heavily on NumPy/SciPy and linear algebra in general, it
|
||||||
|
makes sense to take explicit care of the versions of these libraries.
|
||||||
|
Basically, you ought to make sure that NumPy is built using an optimized `BLAS
|
||||||
|
<https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms>`_ /
|
||||||
|
`LAPACK <https://en.wikipedia.org/wiki/LAPACK>`_ library.
|
||||||
|
|
||||||
|
Not all models benefit from optimized BLAS and LAPACK implementations. For
|
||||||
|
instance models based on (randomized) decision trees typically do not rely on
|
||||||
|
BLAS calls in their inner loops, nor do kernel SVMs (``SVC``, ``SVR``,
|
||||||
|
``NuSVC``, ``NuSVR``). On the other hand a linear model implemented with a
|
||||||
|
BLAS DGEMM call (via ``numpy.dot``) will typically benefit hugely from a tuned
|
||||||
|
BLAS implementation and lead to orders of magnitude speedup over a
|
||||||
|
non-optimized BLAS.
|
||||||
|
|
||||||
|
You can display the BLAS / LAPACK implementation used by your NumPy / SciPy /
|
||||||
|
scikit-learn install with the following command::
|
||||||
|
|
||||||
|
python -c "import sklearn; sklearn.show_versions()"
|
||||||
|
|
||||||
|
Optimized BLAS / LAPACK implementations include:
|
||||||
|
|
||||||
|
- Atlas (needs hardware-specific tuning by rebuilding on the target machine)
|
||||||
|
- OpenBLAS
|
||||||
|
- MKL
|
||||||
|
- Apple Accelerate and vecLib frameworks (macOS only)
|
||||||
|
|
||||||
|
More information can be found on the `NumPy install page <https://numpy.org/install/>`_
|
||||||
|
and in this
|
||||||
|
`blog post <https://danielnouri.org/notes/2012/12/19/libblas-and-liblapack-issues-and-speed,-with-scipy-and-ubuntu/>`_
|
||||||
|
from Daniel Nouri which has some nice step by step install instructions for
|
||||||
|
Debian / Ubuntu.
|
||||||
|
|
||||||
|
.. _working_memory:
|
||||||
|
|
||||||
|
Limiting Working Memory
|
||||||
|
........................
|
||||||
|
|
||||||
|
Some calculations when implemented using standard numpy vectorized operations
|
||||||
|
involve using a large amount of temporary memory. This may potentially exhaust
|
||||||
|
system memory. Where computations can be performed in fixed-memory chunks, we
|
||||||
|
attempt to do so, and allow the user to hint at the maximum size of this
|
||||||
|
working memory (defaulting to 1GB) using :func:`set_config` or
|
||||||
|
:func:`config_context`. The following suggests limiting temporary working
|
||||||
|
memory to 128 MiB::
|
||||||
|
|
||||||
|
>>> import sklearn
|
||||||
|
>>> with sklearn.config_context(working_memory=128):
|
||||||
|
... pass # do chunked work here
|
||||||
|
|
||||||
|
An example of a chunked operation adhering to this setting is
|
||||||
|
:func:`~metrics.pairwise_distances_chunked`, which facilitates computing
|
||||||
|
row-wise reductions of a pairwise distance matrix.
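
For instance, a minimal sketch (random data) that computes row-wise means of a
pairwise distance matrix without ever allocating the full matrix::

    import numpy as np

    from sklearn import config_context
    from sklearn.metrics import pairwise_distances_chunked

    X = np.random.RandomState(0).rand(2000, 50)

    with config_context(working_memory=64):
        # each chunk is a block of consecutive rows of the distance matrix
        row_means = np.concatenate(
            [chunk.mean(axis=1) for chunk in pairwise_distances_chunked(X)]
        )

    print(row_means.shape)  # (2000,)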
|
||||||
|
|
||||||
|
Model Compression
|
||||||
|
..................
|
||||||
|
|
||||||
|
Model compression in scikit-learn only concerns linear models for the moment.
|
||||||
|
In this context it means that we want to control the model sparsity (i.e. the
|
||||||
|
number of non-zero coordinates in the model vectors). It is generally a good
|
||||||
|
idea to combine model sparsity with sparse input data representation.
|
||||||
|
|
||||||
|
Here is sample code that illustrates the use of the ``sparsify()`` method
(assuming ``SGDRegressor`` is imported from :mod:`sklearn.linear_model` and the
train/test arrays already exist)::
|
||||||
|
|
||||||
|
clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)
|
||||||
|
clf.fit(X_train, y_train).sparsify()
|
||||||
|
clf.predict(X_test)
|
||||||
|
|
||||||
|
In this example we prefer the ``elasticnet`` penalty as it is often a good
|
||||||
|
compromise between model compactness and prediction power. One can also
|
||||||
|
further tune the ``l1_ratio`` parameter (in combination with the
|
||||||
|
regularization strength ``alpha``) to control this tradeoff.
|
||||||
|
|
||||||
|
A typical `benchmark <https://github.com/scikit-learn/scikit-learn/blob/main/benchmarks/bench_sparsify.py>`_
|
||||||
|
on synthetic data yields a >30% decrease in latency when both the model and
|
||||||
|
input are sparse (with non-zero coefficient ratios of 0.000024 and 0.027400,
respectively). Your mileage may vary depending on the sparsity and size of
|
||||||
|
your data and model.
|
||||||
|
Furthermore, sparsifying can be very useful to reduce the memory usage of
|
||||||
|
predictive models deployed on production servers.
|
||||||
|
|
||||||
|
Model Reshaping
|
||||||
|
................
|
||||||
|
|
||||||
|
Model reshaping consists of selecting only a portion of the available features
|
||||||
|
to fit a model. In other words, if a model discards features during the
|
||||||
|
learning phase we can then strip those from the input. This has several
|
||||||
|
benefits. Firstly it reduces memory (and therefore time) overhead of the
|
||||||
|
model itself. It also makes it possible to discard explicit
|
||||||
|
feature selection components in a pipeline once we know which features to
|
||||||
|
keep from a previous run. Finally, it can help reduce processing time and I/O
|
||||||
|
usage upstream in the data access and feature extraction layers by not
|
||||||
|
collecting and building features that are discarded by the model. For instance
|
||||||
|
if the raw data come from a database, it can make it possible to write simpler
|
||||||
|
and faster queries or reduce I/O usage by making the queries return lighter
|
||||||
|
records.
|
||||||
|
At the moment, reshaping needs to be performed manually in scikit-learn.
|
||||||
|
In the case of sparse input (particularly in ``CSR`` format), it is generally
|
||||||
|
sufficient to not generate the relevant features, leaving their columns empty.
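
As a minimal sketch of manual reshaping for an L1-penalized linear model
(synthetic data; the zero-coefficient criterion is an assumption that fits this
penalty, not a general-purpose rule)::

    import numpy as np

    from sklearn.datasets import make_regression
    from sklearn.linear_model import SGDRegressor

    X, y = make_regression(
        n_samples=1000, n_features=100, n_informative=10, random_state=0
    )

    sparse_model = SGDRegressor(penalty="l1", alpha=0.01, random_state=0).fit(X, y)

    # keep only the features that the sparse model actually uses ...
    kept = np.flatnonzero(sparse_model.coef_)

    # ... and refit on the reduced input so upstream layers can skip the rest
    small_model = SGDRegressor(penalty="l1", alpha=0.01, random_state=0).fit(
        X[:, kept], y
    )
    print(X.shape[1], "->", kept.shape[0])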
|
||||||
|
|
||||||
|
Links
|
||||||
|
......
|
||||||
|
|
||||||
|
- :ref:`scikit-learn developer performance documentation <performance-howto>`
|
||||||
|
- `Scipy sparse matrix formats documentation <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_
|
|
@ -0,0 +1,340 @@
|
||||||
|
Parallelism, resource management, and configuration
|
||||||
|
===================================================
|
||||||
|
|
||||||
|
.. _parallelism:
|
||||||
|
|
||||||
|
Parallelism
|
||||||
|
-----------
|
||||||
|
|
||||||
|
Some scikit-learn estimators and utilities parallelize costly operations
|
||||||
|
using multiple CPU cores.
|
||||||
|
|
||||||
|
Depending on the type of estimator and sometimes the values of the
|
||||||
|
constructor parameters, this is either done:
|
||||||
|
|
||||||
|
- with higher-level parallelism via `joblib <https://joblib.readthedocs.io/en/latest/>`_.
|
||||||
|
- with lower-level parallelism via OpenMP, used in C or Cython code.
|
||||||
|
- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations
|
||||||
|
on arrays.
|
||||||
|
|
||||||
|
The `n_jobs` parameter of estimators always controls the amount of parallelism
|
||||||
|
managed by joblib (processes or threads depending on the joblib backend).
|
||||||
|
The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code
|
||||||
|
or by BLAS & LAPACK libraries used by NumPy and SciPy operations used in scikit-learn
|
||||||
|
is always controlled by environment variables or `threadpoolctl` as explained below.
|
||||||
|
Note that some estimators can leverage all three kinds of parallelism at different
|
||||||
|
points of their training and prediction methods.
|
||||||
|
|
||||||
|
We describe these 3 types of parallelism in the following subsections in more detail.
|
||||||
|
|
||||||
|
Higher-level parallelism with joblib
|
||||||
|
....................................
|
||||||
|
|
||||||
|
When the underlying implementation uses joblib, the number of workers
|
||||||
|
(threads or processes) that are spawned in parallel can be controlled via the
|
||||||
|
``n_jobs`` parameter.
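
For instance, a minimal sketch (synthetic data; the parameter grid is only
illustrative)::

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)

    # n_jobs=2 asks joblib to evaluate parameter candidates with two workers
    search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, n_jobs=2).fit(X, y)
    print(search.best_params_)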
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Where (and how) parallelization happens in the estimators using joblib by
|
||||||
|
specifying `n_jobs` is currently poorly documented.
|
||||||
|
Please help us by improving our docs and tackle `issue 14228
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
|
||||||
|
|
||||||
|
Joblib is able to support both multi-processing and multi-threading. Whether
|
||||||
|
joblib chooses to spawn a thread or a process depends on the **backend**
|
||||||
|
that it's using.
|
||||||
|
|
||||||
|
scikit-learn generally relies on the ``loky`` backend, which is joblib's
|
||||||
|
default backend. Loky is a multi-processing backend. When doing
|
||||||
|
multi-processing, in order to avoid duplicating the memory in each process
|
||||||
|
(which isn't reasonable with big datasets), joblib will create a `memmap
|
||||||
|
<https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html>`_
|
||||||
|
that all processes can share, when the data is bigger than 1MB.
|
||||||
|
|
||||||
|
In some specific cases (when the code that is run in parallel releases the
|
||||||
|
GIL), scikit-learn will indicate to ``joblib`` that a multi-threading
|
||||||
|
backend is preferable.
|
||||||
|
|
||||||
|
As a user, you may control the backend that joblib will use (regardless of
|
||||||
|
what scikit-learn recommends) by using a context manager::
|
||||||
|
|
||||||
|
from joblib import parallel_backend
|
||||||
|
|
||||||
|
with parallel_backend('threading', n_jobs=2):
|
||||||
|
# Your scikit-learn code here
|
||||||
|
|
||||||
|
Please refer to `joblib's docs
|
||||||
|
<https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism>`_
|
||||||
|
for more details.
|
||||||
|
|
||||||
|
In practice, whether parallelism is helpful at improving runtime depends on
|
||||||
|
many factors. It is usually a good idea to experiment rather than assuming
|
||||||
|
that increasing the number of workers is always a good thing. In some cases
|
||||||
|
it can be highly detrimental to performance to run multiple copies of some
|
||||||
|
estimators or functions in parallel (see oversubscription below).
|
||||||
|
|
||||||
|
Lower-level parallelism with OpenMP
|
||||||
|
...................................
|
||||||
|
|
||||||
|
OpenMP is used to parallelize code written in Cython or C, relying on
|
||||||
|
multi-threading exclusively. By default, the implementations using OpenMP
|
||||||
|
will use as many threads as possible, i.e. as many threads as logical cores.
|
||||||
|
|
||||||
|
You can control the exact number of threads that are used either:
|
||||||
|
|
||||||
|
- via the ``OMP_NUM_THREADS`` environment variable, for instance when
  running a python script:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
OMP_NUM_THREADS=4 python my_script.py
|
||||||
|
|
||||||
|
- or via `threadpoolctl` as explained by `this piece of documentation
|
||||||
|
<https://github.com/joblib/threadpoolctl/#setting-the-maximum-size-of-thread-pools>`_.
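
For instance, a minimal sketch of the `threadpoolctl` route (assuming the
package is installed; the estimator is only an illustrative OpenMP user)::

    from threadpoolctl import threadpool_limits

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier

    X, y = make_classification(n_samples=1000, random_state=0)

    # cap the OpenMP thread pools at 2 threads for the duration of this block
    with threadpool_limits(limits=2, user_api="openmp"):
        HistGradientBoostingClassifier().fit(X, y)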
|
||||||
|
|
||||||
|
Parallel NumPy and SciPy routines from numerical libraries
|
||||||
|
..........................................................
|
||||||
|
|
||||||
|
scikit-learn relies heavily on NumPy and SciPy, which internally call
|
||||||
|
multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries
|
||||||
|
such as MKL, OpenBLAS or BLIS.
|
||||||
|
|
||||||
|
You can control the exact number of threads used by BLAS for each library
|
||||||
|
using environment variables, namely:
|
||||||
|
|
||||||
|
- ``MKL_NUM_THREADS`` sets the number of threads MKL uses,
|
||||||
|
- ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses
|
||||||
|
- ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses
|
||||||
|
|
||||||
|
Note that BLAS & LAPACK implementations can also be impacted by
|
||||||
|
`OMP_NUM_THREADS`. To check whether this is the case in your environment,
|
||||||
|
you can inspect how the number of threads effectively used by those libraries
|
||||||
|
is affected when running the following command in a bash or zsh terminal
|
||||||
|
for different values of `OMP_NUM_THREADS`:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy
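
The same information can also be inspected from within Python (a minimal
sketch, assuming `threadpoolctl` is installed)::

    from pprint import pprint

    import numpy  # noqa: F401  (importing NumPy loads its BLAS thread pool)
    from threadpoolctl import threadpool_info

    # one entry per detected thread pool, including its current num_threads
    pprint(threadpool_info())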
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
At the time of writing (2022), NumPy and SciPy packages which are
|
||||||
|
distributed on pypi.org (i.e. the ones installed via ``pip install``)
|
||||||
|
and on the conda-forge channel (i.e. the ones installed via
|
||||||
|
``conda install --channel conda-forge``) are linked with OpenBLAS, while
|
||||||
|
    NumPy and SciPy packages shipped on the ``defaults`` conda
|
||||||
|
channel from Anaconda.org (i.e. the ones installed via ``conda install``)
|
||||||
|
are linked by default with MKL.
|
||||||
|
|
||||||
|
|
||||||
|
Oversubscription: spawning too many threads
|
||||||
|
...........................................
|
||||||
|
|
||||||
|
It is generally recommended to avoid using significantly more processes or
|
||||||
|
threads than the number of CPUs on a machine. Over-subscription happens when
|
||||||
|
a program is running too many threads at the same time.
|
||||||
|
|
||||||
|
Suppose you have a machine with 8 CPUs. Consider a case where you're running
|
||||||
|
a :class:`~sklearn.model_selection.GridSearchCV` (parallelized with joblib)
|
||||||
|
with ``n_jobs=8`` over a
|
||||||
|
:class:`~sklearn.ensemble.HistGradientBoostingClassifier` (parallelized with
|
||||||
|
OpenMP). Each instance of
|
||||||
|
:class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads
|
||||||
|
(since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which
|
||||||
|
leads to oversubscription of threads for physical CPU resources and thus
|
||||||
|
to scheduling overhead.
|
||||||
|
|
||||||
|
Oversubscription can arise in the exact same fashion with parallelized
|
||||||
|
routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
|
||||||
|
|
||||||
|
Starting from ``joblib >= 0.14``, when the ``loky`` backend is used (which
|
||||||
|
is the default), joblib will tell its child **processes** to limit the
|
||||||
|
number of threads they can use, so as to avoid oversubscription. In practice
|
||||||
|
the heuristic that joblib uses is to tell the processes to use ``max_threads
|
||||||
|
= n_cpus // n_jobs``, via their corresponding environment variable. Back to
|
||||||
|
our example from above, since the joblib backend of
|
||||||
|
:class:`~sklearn.model_selection.GridSearchCV` is ``loky``, each process will
|
||||||
|
only be able to use 1 thread instead of 8, thus mitigating the
|
||||||
|
oversubscription issue.
|
||||||
|
|
||||||
|
Note that:
|
||||||
|
|
||||||
|
- Manually setting one of the environment variables (``OMP_NUM_THREADS``,
|
||||||
|
``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, or ``BLIS_NUM_THREADS``)
|
||||||
|
will take precedence over what joblib tries to do. The total number of
|
||||||
|
threads will be ``n_jobs * <LIB>_NUM_THREADS``. Note that setting this
|
||||||
|
limit will also impact your computations in the main process, which will
|
||||||
|
only use ``<LIB>_NUM_THREADS``. Joblib exposes a context manager for
|
||||||
|
finer control over the number of threads in its workers (see joblib docs
|
||||||
|
linked below).
|
||||||
|
- When joblib is configured to use the ``threading`` backend, there is no
|
||||||
|
  mechanism to avoid oversubscription when calling into parallel native
|
||||||
|
libraries in the joblib-managed threads.
|
||||||
|
- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code
|
||||||
|
  always use `threadpoolctl` internally to automatically adapt the number of
|
||||||
|
threads used by OpenMP and potentially nested BLAS calls so as to avoid
|
||||||
|
oversubscription.
|
||||||
|
|
||||||
|
You will find additional details about joblib mitigation of oversubscription
|
||||||
|
in `joblib documentation
|
||||||
|
<https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources>`_.
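
For instance, a minimal sketch of that finer control (assuming
`joblib >= 0.14`; the estimator and grid are only illustrative)::

    from joblib import parallel_backend

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)

    # two joblib workers, each limited to two threads for nested native calls
    with parallel_backend("loky", n_jobs=2, inner_max_num_threads=2):
        GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, n_jobs=2).fit(X, y)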
|
||||||
|
|
||||||
|
You will find additional details about parallelism in numerical python libraries
|
||||||
|
in `this document from Thomas J. Fan <https://thomasjpfan.github.io/parallelism-python-libraries-design/>`_.
|
||||||
|
|
||||||
|
Configuration switches
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
Python API
|
||||||
|
..........
|
||||||
|
|
||||||
|
:func:`sklearn.set_config` and :func:`sklearn.config_context` can be used to change
|
||||||
|
parameters of the configuration that control aspects of parallelism.
|
||||||
|
|
||||||
|
.. _environment_variable:
|
||||||
|
|
||||||
|
Environment variables
|
||||||
|
.....................
|
||||||
|
|
||||||
|
These environment variables should be set before importing scikit-learn.
|
||||||
|
|
||||||
|
`SKLEARN_ASSUME_FINITE`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Sets the default value for the `assume_finite` argument of
|
||||||
|
:func:`sklearn.set_config`.
|
||||||
|
|
||||||
|
`SKLEARN_WORKING_MEMORY`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Sets the default value for the `working_memory` argument of
|
||||||
|
:func:`sklearn.set_config`.
|
||||||
|
|
||||||
|
`SKLEARN_SEED`
|
||||||
|
~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Sets the seed of the global random generator when running the tests, for
|
||||||
|
reproducibility.
|
||||||
|
|
||||||
|
Note that scikit-learn tests are expected to run deterministically with
|
||||||
|
explicit seeding of their own independent RNG instances instead of relying on
|
||||||
|
the numpy or Python standard library RNG singletons to make sure that test
|
||||||
|
results are independent of the test execution order. However some tests might
|
||||||
|
forget to use explicit seeding and this variable is a way to control the initial
|
||||||
|
state of the aforementioned singletons.
|
||||||
|
|
||||||
|
`SKLEARN_TESTS_GLOBAL_RANDOM_SEED`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Controls the seeding of the random number generator used in tests that rely on
|
||||||
|
the ``global_random_seed`` fixture.
|
||||||
|
|
||||||
|
All tests that use this fixture accept the contract that they should
|
||||||
|
deterministically pass for any seed value from 0 to 99 included.
|
||||||
|
|
||||||
|
If the `SKLEARN_TESTS_GLOBAL_RANDOM_SEED` environment variable is set to
|
||||||
|
`"any"` (which should be the case on nightly builds on the CI), the fixture
|
||||||
|
will choose an arbitrary seed in the above range (based on the BUILD_NUMBER or
|
||||||
|
the current day) and all fixtured tests will run for that specific seed. The
|
||||||
|
goal is to ensure that, over time, our CI will run all tests with different
|
||||||
|
seeds while keeping the test duration of a single run of the full test suite
|
||||||
|
limited. This will check that the assertions of tests written to use this
|
||||||
|
fixture are not dependent on a specific seed value.
|
||||||
|
|
||||||
|
The range of admissible seed values is limited to [0, 99] because it is often
|
||||||
|
not possible to write a test that can work for any possible seed and we want to
|
||||||
|
avoid having tests that randomly fail on the CI.
|
||||||
|
|
||||||
|
Valid values for `SKLEARN_TESTS_GLOBAL_RANDOM_SEED`:
|
||||||
|
|
||||||
|
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42"`: run tests with a fixed seed of 42
|
||||||
|
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42"`: run the tests with all seeds
|
||||||
|
between 40 and 42 included
|
||||||
|
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="any"`: run the tests with an arbitrary
|
||||||
|
seed selected between 0 and 99 included
|
||||||
|
- `SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all"`: run the tests with all seeds
|
||||||
|
between 0 and 99 included. This can take a long time: only use for individual
|
||||||
|
tests, not the full test suite!
|
||||||
|
|
||||||
|
If the variable is not set, then 42 is used as the global seed in a
|
||||||
|
deterministic manner. This ensures that, by default, the scikit-learn test
|
||||||
|
suite is as deterministic as possible to avoid disrupting our friendly
|
||||||
|
third-party package maintainers. Similarly, this variable should not be set in
|
||||||
|
the CI config of pull-requests to make sure that our friendly contributors are
|
||||||
|
not the first people to encounter a seed-sensitivity regression in a test
|
||||||
|
unrelated to the changes of their own PR. Only the scikit-learn maintainers who
|
||||||
|
watch the results of the nightly builds are expected to be annoyed by this.
|
||||||
|
|
||||||
|
When writing a new test function that uses this fixture, please use the
|
||||||
|
following command to make sure that it passes deterministically for all
|
||||||
|
admissible seeds on your local machine:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name
|
||||||
|
|
||||||
|
`SKLEARN_SKIP_NETWORK_TESTS`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
When this environment variable is set to a non zero value, the tests that need
|
||||||
|
network access are skipped. When this environment variable is not set then
|
||||||
|
network tests are skipped.
|
||||||
|
|
||||||
|
`SKLEARN_RUN_FLOAT32_TESTS`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
When this environment variable is set to '1', the tests using the
|
||||||
|
`global_dtype` fixture are also run on float32 data.
|
||||||
|
When this environment variable is not set, the tests are only run on
|
||||||
|
float64 data.
|
||||||
|
|
||||||
|
`SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
When this environment variable is set to a non zero value, the Cython
compiler directive `boundscheck` is set to `True`. This is useful for finding
|
||||||
|
segfaults.
|
||||||
|
|
||||||
|
`SKLEARN_BUILD_ENABLE_DEBUG_SYMBOLS`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
When this environment variable is set to a non zero value, the debug symbols
|
||||||
|
will be included in the compiled C extensions. Only debug symbols for POSIX
|
||||||
|
systems are configured.
|
||||||
|
|
||||||
|
`SKLEARN_PAIRWISE_DIST_CHUNK_SIZE`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This sets the chunk size to be used by the underlying `PairwiseDistancesReductions`
implementations. The default value is `256`, which has been shown to be adequate on
|
||||||
|
most machines.
|
||||||
|
|
||||||
|
Users looking for the best performance might want to tune this variable using
|
||||||
|
powers of 2 so as to get the best parallelism behavior for their hardware,
|
||||||
|
especially with respect to their caches' sizes.
|
||||||
|
|
||||||
|
`SKLEARN_WARNINGS_AS_ERRORS`
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
This environment variable is used to turn warnings into errors in tests and
|
||||||
|
the documentation build.
|
||||||
|
|
||||||
|
Some CI (Continuous Integration) builds set `SKLEARN_WARNINGS_AS_ERRORS=1`, for
|
||||||
|
example to make sure that we catch deprecation warnings from our dependencies
|
||||||
|
and that we adapt our code.
|
||||||
|
|
||||||
|
To locally run with the same "warnings as errors" setting as in these CI builds
|
||||||
|
you can set `SKLEARN_WARNINGS_AS_ERRORS=1`.
|
||||||
|
|
||||||
|
By default, warnings are not turned into errors. This is the case if
|
||||||
|
`SKLEARN_WARNINGS_AS_ERRORS` is unset, or `SKLEARN_WARNINGS_AS_ERRORS=0`.
|
||||||
|
|
||||||
|
This environment variable uses specific warning filters to ignore some warnings,
|
||||||
|
since sometimes warnings originate from third-party libraries and there is not
|
||||||
|
much we can do about it. You can see the warning filters in the
|
||||||
|
`_get_warnings_filters_info_list` function in `sklearn/utils/_testing.py`.
|
||||||
|
|
||||||
|
Note that for the documentation build, `SKLEARN_WARNINGS_AS_ERRORS=1` checks
|
||||||
|
that the documentation build, in particular running examples, does not produce
|
||||||
|
any warnings. This is different from the `-W` `sphinx-build` argument that
|
||||||
|
catches syntax warnings in the rst files.
|
|
@ -0,0 +1,136 @@
|
||||||
|
.. _scaling_strategies:
|
||||||
|
|
||||||
|
Strategies to scale computationally: bigger data
|
||||||
|
=================================================
|
||||||
|
|
||||||
|
For some applications the number of examples, features (or both) and/or the
|
||||||
|
speed at which they need to be processed are challenging for traditional
|
||||||
|
approaches. In these cases scikit-learn has a number of options you can
|
||||||
|
consider to make your system scale.
|
||||||
|
|
||||||
|
Scaling with instances using out-of-core learning
|
||||||
|
--------------------------------------------------
|
||||||
|
|
||||||
|
Out-of-core (or "external memory") learning is a technique used to learn from
|
||||||
|
data that cannot fit in a computer's main memory (RAM).
|
||||||
|
|
||||||
|
Here is a sketch of a system designed to achieve this goal:
|
||||||
|
|
||||||
|
1. a way to stream instances
|
||||||
|
2. a way to extract features from instances
|
||||||
|
3. an incremental algorithm
|
||||||
|
|
||||||
|
Streaming instances
|
||||||
|
....................
|
||||||
|
|
||||||
|
Basically, 1. may be a reader that yields instances from files on a
|
||||||
|
hard drive, a database, a network stream, etc. However,
|
||||||
|
details on how to achieve this are beyond the scope of this documentation.
|
||||||
|
|
||||||
|
Extracting features
|
||||||
|
...................
|
||||||
|
|
||||||
|
\2. could be any relevant way to extract features among the
|
||||||
|
different :ref:`feature extraction <feature_extraction>` methods supported by
|
||||||
|
scikit-learn. However, when working with data that needs vectorization and
|
||||||
|
where the set of features or values is not known in advance one should take
|
||||||
|
explicit care. A good example is text classification where unknown terms are
|
||||||
|
likely to be found during training. It is possible to use a stateful
|
||||||
|
vectorizer if making multiple passes over the data is reasonable from an
|
||||||
|
application point of view. Otherwise, one can turn up the difficulty by using
|
||||||
|
a stateless feature extractor. Currently the preferred way to do this is to
|
||||||
|
use the so-called :ref:`hashing trick<feature_hashing>` as implemented by
|
||||||
|
:class:`sklearn.feature_extraction.FeatureHasher` for datasets with categorical
|
||||||
|
variables represented as list of Python dicts or
|
||||||
|
:class:`sklearn.feature_extraction.text.HashingVectorizer` for text documents.
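
For instance, a minimal sketch of stateless text vectorization on toy
documents::

    from sklearn.feature_extraction.text import HashingVectorizer

    vectorizer = HashingVectorizer(n_features=2**18)

    # stateless: no fit needed, unseen terms simply hash to fixed columns
    X_batch = vectorizer.transform(["first mini-batch of documents", "another one"])
    print(X_batch.shape)  # (2, 262144)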
|
||||||
|
|
||||||
|
Incremental learning
|
||||||
|
.....................
|
||||||
|
|
||||||
|
Finally, for 3. we have a number of options inside scikit-learn. Although not
|
||||||
|
all algorithms can learn incrementally (i.e. without seeing all the instances
|
||||||
|
at once), all estimators implementing the ``partial_fit`` API are candidates.
|
||||||
|
Actually, the ability to learn incrementally from a mini-batch of instances
|
||||||
|
(sometimes called "online learning") is key to out-of-core learning as it
|
||||||
|
guarantees that at any given time there will be only a small number of
|
||||||
|
instances in the main memory. Choosing a good size for the mini-batch that
|
||||||
|
balances relevancy and memory footprint could involve some tuning [1]_.
|
||||||
|
|
||||||
|
Here is a list of incremental estimators for different tasks:
|
||||||
|
|
||||||
|
- Classification
|
||||||
|
+ :class:`sklearn.naive_bayes.MultinomialNB`
|
||||||
|
+ :class:`sklearn.naive_bayes.BernoulliNB`
|
||||||
|
+ :class:`sklearn.linear_model.Perceptron`
|
||||||
|
+ :class:`sklearn.linear_model.SGDClassifier`
|
||||||
|
+ :class:`sklearn.linear_model.PassiveAggressiveClassifier`
|
||||||
|
+ :class:`sklearn.neural_network.MLPClassifier`
|
||||||
|
- Regression
|
||||||
|
+ :class:`sklearn.linear_model.SGDRegressor`
|
||||||
|
+ :class:`sklearn.linear_model.PassiveAggressiveRegressor`
|
||||||
|
+ :class:`sklearn.neural_network.MLPRegressor`
|
||||||
|
- Clustering
|
||||||
|
+ :class:`sklearn.cluster.MiniBatchKMeans`
|
||||||
|
+ :class:`sklearn.cluster.Birch`
|
||||||
|
- Decomposition / feature Extraction
|
||||||
|
+ :class:`sklearn.decomposition.MiniBatchDictionaryLearning`
|
||||||
|
+ :class:`sklearn.decomposition.IncrementalPCA`
|
||||||
|
+ :class:`sklearn.decomposition.LatentDirichletAllocation`
|
||||||
|
+ :class:`sklearn.decomposition.MiniBatchNMF`
|
||||||
|
- Preprocessing
|
||||||
|
+ :class:`sklearn.preprocessing.StandardScaler`
|
||||||
|
+ :class:`sklearn.preprocessing.MinMaxScaler`
|
||||||
|
+ :class:`sklearn.preprocessing.MaxAbsScaler`
|
||||||
|
|
||||||
|
For classification, a somewhat important thing to note is that although a
|
||||||
|
stateless feature extraction routine may be able to cope with new/unseen
|
||||||
|
attributes, the incremental learner itself may be unable to cope with
|
||||||
|
new/unseen target classes. In this case you have to pass all the possible
|
||||||
|
classes to the first ``partial_fit`` call using the ``classes=`` parameter.
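
A minimal sketch of that pattern (toy mini-batches; the vectorizer and
classifier choices are only illustrative)::

    import numpy as np

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer()
    clf = SGDClassifier()
    all_classes = np.array([0, 1, 2])  # known up front, even if a batch lacks some

    batches = [
        (["spam spam", "ham and eggs"], [0, 1]),
        (["eggs and ham", "more spam"], [2, 0]),
    ]
    for i, (docs, labels) in enumerate(batches):
        X_batch = vectorizer.transform(docs)  # stateless feature extraction
        if i == 0:
            # the full set of classes must be given on the first call
            clf.partial_fit(X_batch, labels, classes=all_classes)
        else:
            clf.partial_fit(X_batch, labels)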
|
||||||
|
|
||||||
|
Another aspect to consider when choosing a proper algorithm is that not all of
|
||||||
|
them put the same importance on each example over time. Namely, the
|
||||||
|
``Perceptron`` is still sensitive to badly labeled examples even after many
|
||||||
|
examples whereas the ``SGD*`` and ``PassiveAggressive*`` families are more
|
||||||
|
robust to this kind of artifact. Conversely, the latter also tend to give less
|
||||||
|
importance to remarkably different, yet properly labeled examples when they
|
||||||
|
come late in the stream as their learning rate decreases over time.
|
||||||
|
|
||||||
|
Examples
|
||||||
|
..........
|
||||||
|
|
||||||
|
Finally, we have a full-fledged example of
|
||||||
|
:ref:`sphx_glr_auto_examples_applications_plot_out_of_core_classification.py`. It is aimed at
|
||||||
|
providing a starting point for people wanting to build out-of-core learning
|
||||||
|
systems and demonstrates most of the notions discussed above.
|
||||||
|
|
||||||
|
Furthermore, it also shows the evolution of the performance of different
|
||||||
|
algorithms with the number of processed examples.
|
||||||
|
|
||||||
|
.. |accuracy_over_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_001.png
|
||||||
|
:target: ../auto_examples/applications/plot_out_of_core_classification.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |accuracy_over_time|
|
||||||
|
|
||||||
|
Now looking at the computation time of the different parts, we see that the
|
||||||
|
vectorization is much more expensive than learning itself. Of the different
|
||||||
|
algorithms, ``MultinomialNB`` is the most expensive, but its overhead can be
|
||||||
|
mitigated by increasing the size of the mini-batches (exercise: change
|
||||||
|
``minibatch_size`` to 100 and 10000 in the program and compare).
|
||||||
|
|
||||||
|
.. |computation_time| image:: ../auto_examples/applications/images/sphx_glr_plot_out_of_core_classification_003.png
|
||||||
|
:target: ../auto_examples/applications/plot_out_of_core_classification.html
|
||||||
|
:scale: 80
|
||||||
|
|
||||||
|
.. centered:: |computation_time|
|
||||||
|
|
||||||
|
|
||||||
|
Notes
|
||||||
|
......
|
||||||
|
|
||||||
|
.. [1] Depending on the algorithm the mini-batch size can influence results or
|
||||||
|
not. SGD*, PassiveAggressive*, and discrete NaiveBayes are truly online
|
||||||
|
and are not affected by batch size. Conversely, MiniBatchKMeans
|
||||||
|
convergence rate is affected by the batch size. Also, its memory
|
||||||
|
footprint can vary dramatically with batch size.
|
|
@ -0,0 +1,966 @@
|
||||||
|
# scikit-learn documentation build configuration file, created by
|
||||||
|
# sphinx-quickstart on Fri Jan 8 09:13:42 2010.
|
||||||
|
#
|
||||||
|
# This file is execfile()d with the current directory set to its containing
|
||||||
|
# dir.
|
||||||
|
#
|
||||||
|
# Note that not all possible configuration values are present in this
|
||||||
|
# autogenerated file.
|
||||||
|
#
|
||||||
|
# All configuration values have a default; values that are commented out
|
||||||
|
# serve to show the default.
|
||||||
|
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import warnings
|
||||||
|
from datetime import datetime
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
from sklearn.externals._packaging.version import parse
|
||||||
|
from sklearn.utils._testing import turn_warnings_into_errors
|
||||||
|
|
||||||
|
# If extensions (or modules to document with autodoc) are in another
|
||||||
|
# directory, add these directories to sys.path here. If the directory
|
||||||
|
# is relative to the documentation root, use os.path.abspath to make it
|
||||||
|
# absolute, like shown here.
|
||||||
|
sys.path.insert(0, os.path.abspath("."))
|
||||||
|
sys.path.insert(0, os.path.abspath("sphinxext"))
|
||||||
|
|
||||||
|
import jinja2
|
||||||
|
import sphinx_gallery
|
||||||
|
from github_link import make_linkcode_resolve
|
||||||
|
from sphinx_gallery.notebook import add_code_cell, add_markdown_cell
|
||||||
|
from sphinx_gallery.sorting import ExampleTitleSortKey
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Configure plotly to integrate its output into the HTML pages generated by
|
||||||
|
# sphinx-gallery.
|
||||||
|
import plotly.io as pio
|
||||||
|
|
||||||
|
pio.renderers.default = "sphinx_gallery"
|
||||||
|
except ImportError:
|
||||||
|
# Make it possible to render the doc when not running the examples
|
||||||
|
# that need plotly.
|
||||||
|
pass
|
||||||
|
|
||||||
|
# -- General configuration ---------------------------------------------------
|
||||||
|
|
||||||
|
# Add any Sphinx extension module names here, as strings. They can be
|
||||||
|
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
|
||||||
|
extensions = [
|
||||||
|
"sphinx.ext.autodoc",
|
||||||
|
"sphinx.ext.autosummary",
|
||||||
|
"numpydoc",
|
||||||
|
"sphinx.ext.linkcode",
|
||||||
|
"sphinx.ext.doctest",
|
||||||
|
"sphinx.ext.intersphinx",
|
||||||
|
"sphinx.ext.imgconverter",
|
||||||
|
"sphinx_gallery.gen_gallery",
|
||||||
|
"sphinx-prompt",
|
||||||
|
"sphinx_copybutton",
|
||||||
|
"sphinxext.opengraph",
|
||||||
|
"matplotlib.sphinxext.plot_directive",
|
||||||
|
"sphinxcontrib.sass",
|
||||||
|
"sphinx_remove_toctrees",
|
||||||
|
"sphinx_design",
|
||||||
|
# See sphinxext/
|
||||||
|
"allow_nan_estimators",
|
||||||
|
"autoshortsummary",
|
||||||
|
"doi_role",
|
||||||
|
"dropdown_anchors",
|
||||||
|
"move_gallery_links",
|
||||||
|
"override_pst_pagetoc",
|
||||||
|
"sphinx_issues",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Specify how to identify the prompt when copying code snippets
|
||||||
|
copybutton_prompt_text = r">>> |\.\.\. "
|
||||||
|
copybutton_prompt_is_regexp = True
|
||||||
|
copybutton_exclude = "style"
|
||||||
|
|
||||||
|
try:
|
||||||
|
import jupyterlite_sphinx # noqa: F401
|
||||||
|
|
||||||
|
extensions.append("jupyterlite_sphinx")
|
||||||
|
with_jupyterlite = True
|
||||||
|
except ImportError:
|
||||||
|
# In some cases we don't want to require jupyterlite_sphinx to be installed,
|
||||||
|
# e.g. the doc-min-dependencies build
|
||||||
|
warnings.warn(
|
||||||
|
"jupyterlite_sphinx is not installed, you need to install it "
|
||||||
|
"if you want JupyterLite links to appear in each example"
|
||||||
|
)
|
||||||
|
with_jupyterlite = False
|
||||||
|
|
||||||
|
# Produce `plot::` directives for examples that contain `import matplotlib` or
|
||||||
|
# `from matplotlib import`.
|
||||||
|
numpydoc_use_plots = True
|
||||||
|
|
||||||
|
# Options for the `::plot` directive:
|
||||||
|
# https://matplotlib.org/stable/api/sphinxext_plot_directive_api.html
|
||||||
|
plot_formats = ["png"]
|
||||||
|
plot_include_source = True
|
||||||
|
plot_html_show_formats = False
|
||||||
|
plot_html_show_source_link = False
|
||||||
|
|
||||||
|
# We do not need the table of class members because `sphinxext/override_pst_pagetoc.py`
|
||||||
|
# will show them in the secondary sidebar
|
||||||
|
numpydoc_show_class_members = False
|
||||||
|
numpydoc_show_inherited_class_members = False
|
||||||
|
|
||||||
|
# We want in-page toc of class members instead of a separate page for each entry
|
||||||
|
numpydoc_class_members_toctree = False
|
||||||
|
|
||||||
|
|
||||||
|
# For maths, use mathjax by default and svg if NO_MATHJAX env variable is set
|
||||||
|
# (useful for viewing the doc offline)
|
||||||
|
if os.environ.get("NO_MATHJAX"):
|
||||||
|
extensions.append("sphinx.ext.imgmath")
|
||||||
|
imgmath_image_format = "svg"
|
||||||
|
mathjax_path = ""
|
||||||
|
else:
|
||||||
|
extensions.append("sphinx.ext.mathjax")
|
||||||
|
mathjax_path = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js"
|
||||||
|
|
||||||
|
# Add any paths that contain templates here, relative to this directory.
|
||||||
|
templates_path = ["templates"]
|
||||||
|
|
||||||
|
# generate autosummary even if no references
|
||||||
|
autosummary_generate = True
|
||||||
|
|
||||||
|
# The suffix of source filenames.
|
||||||
|
source_suffix = ".rst"
|
||||||
|
|
||||||
|
# The encoding of source files.
|
||||||
|
source_encoding = "utf-8"
|
||||||
|
|
||||||
|
# The main toctree document.
|
||||||
|
root_doc = "index"
|
||||||
|
|
||||||
|
# General information about the project.
|
||||||
|
project = "scikit-learn"
|
||||||
|
copyright = f"2007 - {datetime.now().year}, scikit-learn developers (BSD License)"
|
||||||
|
|
||||||
|
# The version info for the project you're documenting, acts as replacement for
|
||||||
|
# |version| and |release|, also used in various other places throughout the
|
||||||
|
# built documents.
|
||||||
|
#
|
||||||
|
# The short X.Y version.
|
||||||
|
import sklearn
|
||||||
|
|
||||||
|
parsed_version = parse(sklearn.__version__)
|
||||||
|
version = ".".join(parsed_version.base_version.split(".")[:2])
|
||||||
|
# The full version, including alpha/beta/rc tags.
|
||||||
|
# Removes post from release name
|
||||||
|
if parsed_version.is_postrelease:
|
||||||
|
release = parsed_version.base_version
|
||||||
|
else:
|
||||||
|
release = sklearn.__version__
|
||||||
|
|
||||||
|
# The language for content autogenerated by Sphinx. Refer to documentation
|
||||||
|
# for a list of supported languages.
|
||||||
|
# language = None
|
||||||
|
|
||||||
|
# There are two options for replacing |today|: either, you set today to some
|
||||||
|
# non-false value, then it is used:
|
||||||
|
# today = ''
|
||||||
|
# Else, today_fmt is used as the format for a strftime call.
|
||||||
|
# today_fmt = '%B %d, %Y'
|
||||||
|
|
||||||
|
# List of patterns, relative to source directory, that match files and
|
||||||
|
# directories to ignore when looking for source files.
|
||||||
|
exclude_patterns = [
|
||||||
|
"_build",
|
||||||
|
"templates",
|
||||||
|
"includes",
|
||||||
|
"**/sg_execution_times.rst",
|
||||||
|
]
|
||||||
|
|
||||||
|
# The reST default role (used for this markup: `text`) to use for all
|
||||||
|
# documents.
|
||||||
|
default_role = "literal"
|
||||||
|
|
||||||
|
# If true, '()' will be appended to :func: etc. cross-reference text.
|
||||||
|
add_function_parentheses = False
|
||||||
|
|
||||||
|
# If true, the current module name will be prepended to all description
|
||||||
|
# unit titles (such as .. function::).
|
||||||
|
# add_module_names = True
|
||||||
|
|
||||||
|
# If true, sectionauthor and moduleauthor directives will be shown in the
|
||||||
|
# output. They are ignored by default.
|
||||||
|
# show_authors = False
|
||||||
|
|
||||||
|
# A list of ignored prefixes for module index sorting.
|
||||||
|
# modindex_common_prefix = []
|
||||||
|
|
||||||
|
|
||||||
|
# -- Options for HTML output -------------------------------------------------
|
||||||
|
|
||||||
|
# The theme to use for HTML and HTML Help pages. Major themes that come with
|
||||||
|
# Sphinx are currently 'default' and 'sphinxdoc'.
|
||||||
|
html_theme = "pydata_sphinx_theme"
|
||||||
|
|
||||||
|
# Theme options are theme-specific and customize the look and feel of a theme
|
||||||
|
# further. For a list of options available for each theme, see the
|
||||||
|
# documentation.
|
||||||
|
html_theme_options = {
|
||||||
|
# -- General configuration ------------------------------------------------
|
||||||
|
"sidebar_includehidden": True,
|
||||||
|
"use_edit_page_button": True,
|
||||||
|
"external_links": [],
|
||||||
|
"icon_links_label": "Icon Links",
|
||||||
|
"icon_links": [
|
||||||
|
{
|
||||||
|
"name": "GitHub",
|
||||||
|
"url": "https://github.com/scikit-learn/scikit-learn",
|
||||||
|
"icon": "fa-brands fa-square-github",
|
||||||
|
"type": "fontawesome",
|
||||||
|
},
|
||||||
|
],
|
||||||
|
"analytics": {
|
||||||
|
"plausible_analytics_domain": "scikit-learn.org",
|
||||||
|
"plausible_analytics_url": "https://views.scientific-python.org/js/script.js",
|
||||||
|
},
|
||||||
|
# If "prev-next" is included in article_footer_items, then setting show_prev_next
|
||||||
|
# to True would repeat prev and next links. See
|
||||||
|
# https://github.com/pydata/pydata-sphinx-theme/blob/b731dc230bc26a3d1d1bb039c56c977a9b3d25d8/src/pydata_sphinx_theme/theme/pydata_sphinx_theme/layout.html#L118-L129
|
||||||
|
"show_prev_next": False,
|
||||||
|
"search_bar_text": "Search the docs ...",
|
||||||
|
"navigation_with_keys": False,
|
||||||
|
"collapse_navigation": False,
|
||||||
|
"navigation_depth": 2,
|
||||||
|
"show_nav_level": 1,
|
||||||
|
"show_toc_level": 1,
|
||||||
|
"navbar_align": "left",
|
||||||
|
"header_links_before_dropdown": 5,
|
||||||
|
"header_dropdown_text": "More",
|
||||||
|
# The switcher requires a JSON file with the list of documentation versions, which
|
||||||
|
# is generated by the script `build_tools/circle/list_versions.py` and placed under
|
||||||
|
# the `js/` static directory; it will then be copied to the `_static` directory in
|
||||||
|
# the built documentation
|
||||||
|
"switcher": {
|
||||||
|
"json_url": "https://scikit-learn.org/dev/_static/versions.json",
|
||||||
|
"version_match": release,
|
||||||
|
},
|
||||||
|
# check_switcher may be set to False if docbuild pipeline fails. See
|
||||||
|
# https://pydata-sphinx-theme.readthedocs.io/en/stable/user_guide/version-dropdown.html#configure-switcher-json-url
|
||||||
|
"check_switcher": True,
|
||||||
|
"pygments_light_style": "tango",
|
||||||
|
"pygments_dark_style": "monokai",
|
||||||
|
"logo": {
|
||||||
|
"alt_text": "scikit-learn homepage",
|
||||||
|
"image_relative": "logos/scikit-learn-logo-small.png",
|
||||||
|
"image_light": "logos/scikit-learn-logo-small.png",
|
||||||
|
"image_dark": "logos/scikit-learn-logo-small.png",
|
||||||
|
},
|
||||||
|
"surface_warnings": True,
|
||||||
|
# -- Template placement in theme layouts ----------------------------------
|
||||||
|
"navbar_start": ["navbar-logo"],
|
||||||
|
# Note that the alignment of navbar_center is controlled by navbar_align
|
||||||
|
"navbar_center": ["navbar-nav"],
|
||||||
|
"navbar_end": ["theme-switcher", "navbar-icon-links", "version-switcher"],
|
||||||
|
# navbar_persistent is persistent right (even when on mobiles)
|
||||||
|
"navbar_persistent": ["search-button"],
|
||||||
|
"article_header_start": ["breadcrumbs"],
|
||||||
|
"article_header_end": [],
|
||||||
|
"article_footer_items": ["prev-next"],
|
||||||
|
"content_footer_items": [],
|
||||||
|
# Use html_sidebars that map page patterns to list of sidebar templates
|
||||||
|
"primary_sidebar_end": [],
|
||||||
|
"footer_start": ["copyright"],
|
||||||
|
"footer_center": [],
|
||||||
|
"footer_end": [],
|
||||||
|
# When specified as a dictionary, the keys should follow glob-style patterns, as in
|
||||||
|
# https://www.sphinx-doc.org/en/master/usage/configuration.html#confval-exclude_patterns
|
||||||
|
# In particular, "**" specifies the default for all pages
|
||||||
|
# Use :html_theme.sidebar_secondary.remove: for file-wide removal
|
||||||
|
"secondary_sidebar_items": {"**": ["page-toc", "sourcelink"]},
|
||||||
|
"show_version_warning_banner": True,
|
||||||
|
"announcement": None,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Add any paths that contain custom themes here, relative to this directory.
|
||||||
|
# html_theme_path = ["themes"]
|
||||||
|
|
||||||
|
# The name for this set of Sphinx documents. If None, it defaults to
|
||||||
|
# "<project> v<release> documentation".
|
||||||
|
# html_title = None
|
||||||
|
|
||||||
|
# A shorter title for the navigation bar. Default is the same as html_title.
|
||||||
|
html_short_title = "scikit-learn"
|
||||||
|
|
||||||
|
# The name of an image file (within the static path) to use as favicon of the
|
||||||
|
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
|
||||||
|
# pixels large.
|
||||||
|
html_favicon = "logos/favicon.ico"
|
||||||
|
|
||||||
|
# Add any paths that contain custom static files (such as style sheets) here,
|
||||||
|
# relative to this directory. They are copied after the builtin static files,
|
||||||
|
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||||
|
html_static_path = ["images", "css", "js"]
|
||||||
|
|
||||||
|
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
|
||||||
|
# using the given strftime format.
|
||||||
|
# html_last_updated_fmt = '%b %d, %Y'
|
||||||
|
|
||||||
|
# Custom sidebar templates, maps document names to template names.
|
||||||
|
# Workaround for removing the left sidebar on pages without TOC
|
||||||
|
# A better solution would be to follow the merge of:
|
||||||
|
# https://github.com/pydata/pydata-sphinx-theme/pull/1682
|
||||||
|
html_sidebars = {
|
||||||
|
"install": [],
|
||||||
|
"getting_started": [],
|
||||||
|
"glossary": [],
|
||||||
|
"faq": [],
|
||||||
|
"support": [],
|
||||||
|
"related_projects": [],
|
||||||
|
"roadmap": [],
|
||||||
|
"governance": [],
|
||||||
|
"about": [],
|
||||||
|
}
|
||||||
|
|
||||||
|
# Additional templates that should be rendered to pages, maps page names to
|
||||||
|
# template names.
|
||||||
|
html_additional_pages = {"index": "index.html"}
|
||||||
|
|
||||||
|
# Additional files to copy
|
||||||
|
# html_extra_path = []
|
||||||
|
|
||||||
|
# Additional JS files
|
||||||
|
html_js_files = [
|
||||||
|
"scripts/dropdown.js",
|
||||||
|
"scripts/version-switcher.js",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Compile scss files into css files using sphinxcontrib-sass
|
||||||
|
sass_src_dir, sass_out_dir = "scss", "css/styles"
|
||||||
|
sass_targets = {
|
||||||
|
f"{file.stem}.scss": f"{file.stem}.css"
|
||||||
|
for file in Path(sass_src_dir).glob("*.scss")
|
||||||
|
}
|
||||||
|
|
||||||
|
# Additional CSS files, should be subset of the values of `sass_targets`
|
||||||
|
html_css_files = ["styles/colors.css", "styles/custom.css"]
|
||||||
|
|
||||||
|
|
||||||
|
def add_js_css_files(app, pagename, templatename, context, doctree):
|
||||||
|
"""Load additional JS and CSS files only for certain pages.
|
||||||
|
|
||||||
|
Note that `html_js_files` and `html_css_files` are included in all pages and
|
||||||
|
should be used for the ones that are used by multiple pages. All page-specific
|
||||||
|
JS and CSS files should be added here instead.
|
||||||
|
"""
|
||||||
|
if pagename == "api/index":
|
||||||
|
# External: jQuery and DataTables
|
||||||
|
app.add_js_file("https://code.jquery.com/jquery-3.7.0.js")
|
||||||
|
app.add_js_file("https://cdn.datatables.net/2.0.0/js/dataTables.min.js")
|
||||||
|
app.add_css_file(
|
||||||
|
"https://cdn.datatables.net/2.0.0/css/dataTables.dataTables.min.css"
|
||||||
|
)
|
||||||
|
        # Internal: API search initialization and styling
|
||||||
|
app.add_js_file("scripts/api-search.js")
|
||||||
|
app.add_css_file("styles/api-search.css")
|
||||||
|
elif pagename == "index":
|
||||||
|
app.add_css_file("styles/index.css")
|
||||||
|
elif pagename == "install":
|
||||||
|
app.add_css_file("styles/install.css")
|
||||||
|
elif pagename.startswith("modules/generated/"):
|
||||||
|
app.add_css_file("styles/api.css")
|
||||||
|
|
||||||
|
|
||||||
|
# If false, no module index is generated.
|
||||||
|
html_domain_indices = False
|
||||||
|
|
||||||
|
# If false, no index is generated.
|
||||||
|
html_use_index = False
|
||||||
|
|
||||||
|
# If true, the index is split into individual pages for each letter.
|
||||||
|
# html_split_index = False
|
||||||
|
|
||||||
|
# If true, links to the reST sources are added to the pages.
|
||||||
|
# html_show_sourcelink = True
|
||||||
|
|
||||||
|
# If true, an OpenSearch description file will be output, and all pages will
|
||||||
|
# contain a <link> tag referring to it. The value of this option must be the
|
||||||
|
# base URL from which the finished HTML is served.
|
||||||
|
# html_use_opensearch = ''
|
||||||
|
|
||||||
|
# If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
|
||||||
|
# html_file_suffix = ''
|
||||||
|
|
||||||
|
# Output file base name for HTML help builder.
|
||||||
|
htmlhelp_basename = "scikit-learndoc"
|
||||||
|
|
||||||
|
# If true, the reST sources are included in the HTML build as _sources/name.
|
||||||
|
html_copy_source = True
|
||||||
|
|
||||||
|
# Adds variables into templates
|
||||||
|
html_context = {}
|
||||||
|
# finds latest release highlights and places it into HTML context for
|
||||||
|
# index.html
|
||||||
|
release_highlights_dir = Path("..") / "examples" / "release_highlights"
|
||||||
|
# Finds the highlight with the latest version number
|
||||||
|
latest_highlights = sorted(release_highlights_dir.glob("plot_release_highlights_*.py"))[
|
||||||
|
-1
|
||||||
|
]
|
||||||
|
latest_highlights = latest_highlights.with_suffix("").name
|
||||||
|
html_context["release_highlights"] = (
|
||||||
|
f"auto_examples/release_highlights/{latest_highlights}"
|
||||||
|
)
|
||||||
|
|
||||||
|
# get version from highlight name assuming highlights have the form
|
||||||
|
# plot_release_highlights_0_22_0
|
||||||
|
highlight_version = ".".join(latest_highlights.split("_")[-3:-1])
|
||||||
|
html_context["release_highlights_version"] = highlight_version
|
||||||
|
|
||||||
|
|
||||||
|
# redirects dictionary maps from old links to new links
|
||||||
|
redirects = {
|
||||||
|
"documentation": "index",
|
||||||
|
"contents": "index",
|
||||||
|
"preface": "index",
|
||||||
|
"modules/classes": "api/index",
|
||||||
|
"auto_examples/feature_selection/plot_permutation_test_for_classification": (
|
||||||
|
"auto_examples/model_selection/plot_permutation_tests_for_classification"
|
||||||
|
),
|
||||||
|
"modules/model_persistence": "model_persistence",
|
||||||
|
"auto_examples/linear_model/plot_bayesian_ridge": (
|
||||||
|
"auto_examples/linear_model/plot_ard"
|
||||||
|
),
|
||||||
|
"auto_examples/model_selection/grid_search_text_feature_extraction.py": (
|
||||||
|
"auto_examples/model_selection/plot_grid_search_text_feature_extraction.py"
|
||||||
|
),
|
||||||
|
"auto_examples/miscellaneous/plot_changed_only_pprint_parameter": (
|
||||||
|
"auto_examples/miscellaneous/plot_estimator_representation"
|
||||||
|
),
|
||||||
|
"auto_examples/decomposition/plot_beta_divergence": (
|
||||||
|
"auto_examples/applications/plot_topics_extraction_with_nmf_lda"
|
||||||
|
),
|
||||||
|
"auto_examples/svm/plot_svm_nonlinear": "auto_examples/svm/plot_svm_kernels",
|
||||||
|
"auto_examples/ensemble/plot_adaboost_hastie_10_2": (
|
||||||
|
"auto_examples/ensemble/plot_adaboost_multiclass"
|
||||||
|
),
|
||||||
|
"auto_examples/decomposition/plot_pca_3d": (
|
||||||
|
"auto_examples/decomposition/plot_pca_iris"
|
||||||
|
),
|
||||||
|
"auto_examples/exercises/plot_cv_digits.py": (
|
||||||
|
"auto_examples/model_selection/plot_nested_cross_validation_iris.py"
|
||||||
|
),
|
||||||
|
"tutorial/machine_learning_map/index.html": "machine_learning_map/index.html",
|
||||||
|
}
|
||||||
|
html_context["redirects"] = redirects
|
||||||
|
for old_link in redirects:
|
||||||
|
html_additional_pages[old_link] = "redirects.html"
|
||||||
|
|
||||||
|
# See https://github.com/scikit-learn/scikit-learn/pull/22550
|
||||||
|
html_context["is_devrelease"] = parsed_version.is_devrelease
|
||||||
|
|
||||||
|
|
||||||
|
# -- Options for LaTeX output ------------------------------------------------
|
||||||
|
latex_elements = {
|
||||||
|
# The paper size ('letterpaper' or 'a4paper').
|
||||||
|
# 'papersize': 'letterpaper',
|
||||||
|
# The font size ('10pt', '11pt' or '12pt').
|
||||||
|
# 'pointsize': '10pt',
|
||||||
|
# Additional stuff for the LaTeX preamble.
|
||||||
|
"preamble": r"""
|
||||||
|
\usepackage{amsmath}\usepackage{amsfonts}\usepackage{bm}
|
||||||
|
\usepackage{morefloats}\usepackage{enumitem} \setlistdepth{10}
|
||||||
|
\let\oldhref\href
|
||||||
|
\renewcommand{\href}[2]{\oldhref{#1}{\hbox{#2}}}
|
||||||
|
"""
|
||||||
|
}
|
||||||
|
|
||||||
|
# Grouping the document tree into LaTeX files. List of tuples
|
||||||
|
# (source start file, target name, title, author, documentclass
|
||||||
|
# [howto/manual]).
|
||||||
|
latex_documents = [
|
||||||
|
(
|
||||||
|
"contents",
|
||||||
|
"user_guide.tex",
|
||||||
|
"scikit-learn user guide",
|
||||||
|
"scikit-learn developers",
|
||||||
|
"manual",
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
# The name of an image file (relative to this directory) to place at the top of
|
||||||
|
# the title page.
|
||||||
|
latex_logo = "logos/scikit-learn-logo.png"
|
||||||
|
|
||||||
|
# Documents to append as an appendix to all manuals.
|
||||||
|
# latex_appendices = []
|
||||||
|
|
||||||
|
# If false, no module index is generated.
|
||||||
|
latex_domain_indices = False
|
||||||
|
|
||||||
|
trim_doctests_flags = True
|
||||||
|
|
||||||
|
# intersphinx configuration
|
||||||
|
intersphinx_mapping = {
|
||||||
|
"python": ("https://docs.python.org/{.major}".format(sys.version_info), None),
|
||||||
|
"numpy": ("https://numpy.org/doc/stable", None),
|
||||||
|
"scipy": ("https://docs.scipy.org/doc/scipy/", None),
|
||||||
|
"matplotlib": ("https://matplotlib.org/", None),
|
||||||
|
"pandas": ("https://pandas.pydata.org/pandas-docs/stable/", None),
|
||||||
|
"joblib": ("https://joblib.readthedocs.io/en/latest/", None),
|
||||||
|
"seaborn": ("https://seaborn.pydata.org/", None),
|
||||||
|
"skops": ("https://skops.readthedocs.io/en/stable/", None),
|
||||||
|
}
|
||||||
|
|
||||||
|
v = parse(release)
|
||||||
|
if v.release is None:
|
||||||
|
raise ValueError(
|
||||||
|
"Ill-formed version: {!r}. Version should follow PEP440".format(version)
|
||||||
|
)
|
||||||
|
|
||||||
|
if v.is_devrelease:
|
||||||
|
binder_branch = "main"
|
||||||
|
else:
|
||||||
|
major, minor = v.release[:2]
|
||||||
|
binder_branch = "{}.{}.X".format(major, minor)
|
||||||
|
|
||||||
|
|
||||||
|
class SubSectionTitleOrder:
|
||||||
|
"""Sort example gallery by title of subsection.
|
||||||
|
|
||||||
|
Assumes README.txt exists for all subsections and uses the subsection with
|
||||||
|
dashes, '---', as the adornment.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, src_dir):
|
||||||
|
self.src_dir = src_dir
|
||||||
|
self.regex = re.compile(r"^([\w ]+)\n-", re.MULTILINE)
|
||||||
|
|
||||||
|
def __repr__(self):
|
||||||
|
return "<%s>" % (self.__class__.__name__,)
|
||||||
|
|
||||||
|
def __call__(self, directory):
|
||||||
|
src_path = os.path.normpath(os.path.join(self.src_dir, directory))
|
||||||
|
|
||||||
|
# Forces Release Highlights to the top
|
||||||
|
if os.path.basename(src_path) == "release_highlights":
|
||||||
|
return "0"
|
||||||
|
|
||||||
|
readme = os.path.join(src_path, "README.txt")
|
||||||
|
|
||||||
|
try:
|
||||||
|
with open(readme, "r") as f:
|
||||||
|
content = f.read()
|
||||||
|
except FileNotFoundError:
|
||||||
|
return directory
|
||||||
|
|
||||||
|
title_match = self.regex.search(content)
|
||||||
|
if title_match is not None:
|
||||||
|
return title_match.group(1)
|
||||||
|
return directory
|
||||||
|
|
||||||
|
|
||||||
|
class SKExampleTitleSortKey(ExampleTitleSortKey):
|
||||||
|
"""Sorts release highlights based on version number."""
|
||||||
|
|
||||||
|
def __call__(self, filename):
|
||||||
|
title = super().__call__(filename)
|
||||||
|
prefix = "plot_release_highlights_"
|
||||||
|
|
||||||
|
# Use title to sort if not a release highlight
|
||||||
|
if not str(filename).startswith(prefix):
|
||||||
|
return title
|
||||||
|
|
||||||
|
major_minor = filename[len(prefix) :].split("_")[:2]
|
||||||
|
version_float = float(".".join(major_minor))
|
||||||
|
|
||||||
|
# negate to place the newest version highlights first
|
||||||
|
return -version_float
|
||||||
|
|
||||||
|
|
||||||
|
def notebook_modification_function(notebook_content, notebook_filename):
|
||||||
|
notebook_content_str = str(notebook_content)
|
||||||
|
warning_template = "\n".join(
|
||||||
|
[
|
||||||
|
"<div class='alert alert-{message_class}'>",
|
||||||
|
"",
|
||||||
|
"# JupyterLite warning",
|
||||||
|
"",
|
||||||
|
"{message}",
|
||||||
|
"</div>",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
|
||||||
|
message_class = "warning"
|
||||||
|
message = (
|
||||||
|
"Running the scikit-learn examples in JupyterLite is experimental and you may"
|
||||||
|
" encounter some unexpected behavior.\n\nThe main difference is that imports"
|
||||||
|
" will take a lot longer than usual, for example the first `import sklearn` can"
|
||||||
|
" take roughly 10-20s.\n\nIf you notice problems, feel free to open an"
|
||||||
|
" [issue](https://github.com/scikit-learn/scikit-learn/issues/new/choose)"
|
||||||
|
" about it."
|
||||||
|
)
|
||||||
|
|
||||||
|
markdown = warning_template.format(message_class=message_class, message=message)
|
||||||
|
|
||||||
|
dummy_notebook_content = {"cells": []}
|
||||||
|
add_markdown_cell(dummy_notebook_content, markdown)
|
||||||
|
|
||||||
|
code_lines = []
|
||||||
|
|
||||||
|
if "seaborn" in notebook_content_str:
|
||||||
|
code_lines.append("%pip install seaborn")
|
||||||
|
if "plotly.express" in notebook_content_str:
|
||||||
|
code_lines.append("%pip install plotly")
|
||||||
|
if "skimage" in notebook_content_str:
|
||||||
|
code_lines.append("%pip install scikit-image")
|
||||||
|
if "polars" in notebook_content_str:
|
||||||
|
code_lines.append("%pip install polars")
|
||||||
|
if "fetch_" in notebook_content_str:
|
||||||
|
code_lines.extend(
|
||||||
|
[
|
||||||
|
"%pip install pyodide-http",
|
||||||
|
"import pyodide_http",
|
||||||
|
"pyodide_http.patch_all()",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
# always import matplotlib and pandas to avoid Pyodide limitation with
|
||||||
|
# imports inside functions
|
||||||
|
code_lines.extend(["import matplotlib", "import pandas"])
|
||||||
|
|
||||||
|
if code_lines:
|
||||||
|
code_lines = ["# JupyterLite-specific code"] + code_lines
|
||||||
|
code = "\n".join(code_lines)
|
||||||
|
add_code_cell(dummy_notebook_content, code)
|
||||||
|
|
||||||
|
notebook_content["cells"] = (
|
||||||
|
dummy_notebook_content["cells"] + notebook_content["cells"]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
default_global_config = sklearn.get_config()
|
||||||
|
|
||||||
|
|
||||||
|
def reset_sklearn_config(gallery_conf, fname):
|
||||||
|
"""Reset sklearn config to default values."""
|
||||||
|
sklearn.set_config(**default_global_config)
|
||||||
|
|
||||||
|
|
||||||
|
sg_examples_dir = "../examples"
|
||||||
|
sg_gallery_dir = "auto_examples"
|
||||||
|
sphinx_gallery_conf = {
|
||||||
|
"doc_module": "sklearn",
|
||||||
|
"backreferences_dir": os.path.join("modules", "generated"),
|
||||||
|
"show_memory": False,
|
||||||
|
"reference_url": {"sklearn": None},
|
||||||
|
"examples_dirs": [sg_examples_dir],
|
||||||
|
"gallery_dirs": [sg_gallery_dir],
|
||||||
|
"subsection_order": SubSectionTitleOrder(sg_examples_dir),
|
||||||
|
"within_subsection_order": SKExampleTitleSortKey,
|
||||||
|
"binder": {
|
||||||
|
"org": "scikit-learn",
|
||||||
|
"repo": "scikit-learn",
|
||||||
|
"binderhub_url": "https://mybinder.org",
|
||||||
|
"branch": binder_branch,
|
||||||
|
"dependencies": "./binder/requirements.txt",
|
||||||
|
"use_jupyter_lab": True,
|
||||||
|
},
|
||||||
|
# avoid generating too many cross links
|
||||||
|
"inspect_global_variables": False,
|
||||||
|
"remove_config_comments": True,
|
||||||
|
"plot_gallery": "True",
|
||||||
|
"recommender": {"enable": True, "n_examples": 4, "min_df": 12},
|
||||||
|
"reset_modules": ("matplotlib", "seaborn", reset_sklearn_config),
|
||||||
|
}
|
||||||
|
if with_jupyterlite:
|
||||||
|
sphinx_gallery_conf["jupyterlite"] = {
|
||||||
|
"notebook_modification_function": notebook_modification_function
|
||||||
|
}
|
||||||
|
|
||||||
|
# Secondary sidebar configuration for pages generated by sphinx-gallery
|
||||||
|
|
||||||
|
# For the index page of the gallery and each nested section, we hide the secondary
|
||||||
|
# sidebar by specifying an empty list (no components), because there is no meaningful
|
||||||
|
# in-page toc for these pages, and they are generated so "sourcelink" is not useful
|
||||||
|
# either.
|
||||||
|
|
||||||
|
# For each example page we keep default ["page-toc", "sourcelink"] specified by the
|
||||||
|
# "**" key. "page-toc" is wanted for these pages. "sourcelink" is also necessary since
|
||||||
|
# otherwise the secondary sidebar will degenerate when "page-toc" is empty, and the
|
||||||
|
# script `sphinxext/move_gallery_links.py` will fail (it assumes the existence of the
|
||||||
|
# secondary sidebar). The script will remove "sourcelink" in the end.
|
||||||
|
|
||||||
|
html_theme_options["secondary_sidebar_items"][f"{sg_gallery_dir}/index"] = []
|
||||||
|
for sub_sg_dir in (Path(".") / sg_examples_dir).iterdir():
|
||||||
|
if sub_sg_dir.is_dir():
|
||||||
|
html_theme_options["secondary_sidebar_items"][
|
||||||
|
f"{sg_gallery_dir}/{sub_sg_dir.name}/index"
|
||||||
|
] = []
|
||||||
|
|
||||||
|
|
||||||
|
# The following dictionary contains the information used to create the
|
||||||
|
# thumbnails for the front page of the scikit-learn home page.
|
||||||
|
# key: first image in set
|
||||||
|
# values: (number of plot in set, height of thumbnail)
|
||||||
|
carousel_thumbs = {"sphx_glr_plot_classifier_comparison_001.png": 600}
|
||||||
|
|
||||||
|
|
||||||
|
# enable experimental module so that experimental estimators can be
|
||||||
|
# discovered properly by sphinx
|
||||||
|
from sklearn.experimental import enable_iterative_imputer # noqa
|
||||||
|
from sklearn.experimental import enable_halving_search_cv # noqa
|
||||||
|
|
||||||
|
|
||||||
|
def make_carousel_thumbs(app, exception):
|
||||||
|
"""produces the final resized carousel images"""
|
||||||
|
if exception is not None:
|
||||||
|
return
|
||||||
|
print("Preparing carousel images")
|
||||||
|
|
||||||
|
image_dir = os.path.join(app.builder.outdir, "_images")
|
||||||
|
for glr_plot, max_width in carousel_thumbs.items():
|
||||||
|
image = os.path.join(image_dir, glr_plot)
|
||||||
|
if os.path.exists(image):
|
||||||
|
c_thumb = os.path.join(image_dir, glr_plot[:-4] + "_carousel.png")
|
||||||
|
sphinx_gallery.gen_rst.scale_image(image, c_thumb, max_width, 190)
|
||||||
|
|
||||||
|
|
||||||
|
def filter_search_index(app, exception):
|
||||||
|
if exception is not None:
|
||||||
|
return
|
||||||
|
|
||||||
|
# searchindex only exists when generating html
|
||||||
|
if app.builder.name != "html":
|
||||||
|
return
|
||||||
|
|
||||||
|
print("Removing methods from search index")
|
||||||
|
|
||||||
|
searchindex_path = os.path.join(app.builder.outdir, "searchindex.js")
|
||||||
|
with open(searchindex_path, "r") as f:
|
||||||
|
searchindex_text = f.read()
|
||||||
|
|
||||||
|
searchindex_text = re.sub(r"{__init__.+?}", "{}", searchindex_text)
|
||||||
|
searchindex_text = re.sub(r"{__call__.+?}", "{}", searchindex_text)
|
||||||
|
|
||||||
|
with open(searchindex_path, "w") as f:
|
||||||
|
f.write(searchindex_text)
|
||||||
|
|
||||||
|
|
||||||
|
# Config for sphinx_issues
|
||||||
|
|
||||||
|
# we use the issues path for PRs since the issues URL will forward
|
||||||
|
issues_github_path = "scikit-learn/scikit-learn"
|
||||||
|
|
||||||
|
|
||||||
|
def disable_plot_gallery_for_linkcheck(app):
|
||||||
|
if app.builder.name == "linkcheck":
|
||||||
|
sphinx_gallery_conf["plot_gallery"] = "False"
|
||||||
|
|
||||||
|
|
||||||
|
def setup(app):
|
||||||
|
# do not run the examples when using linkcheck by using a small priority
|
||||||
|
# (default priority is 500 and sphinx-gallery uses the builder-inited event too)
|
||||||
|
app.connect("builder-inited", disable_plot_gallery_for_linkcheck, priority=50)
|
||||||
|
|
||||||
|
# triggered just before the HTML for an individual page is created
|
||||||
|
app.connect("html-page-context", add_js_css_files)
|
||||||
|
|
||||||
|
# to hide/show the prompt in code examples
|
||||||
|
app.connect("build-finished", make_carousel_thumbs)
|
||||||
|
app.connect("build-finished", filter_search_index)
|
||||||
|
|
||||||
|
|
||||||
|
# The following is used by sphinx.ext.linkcode to provide links to github
|
||||||
|
linkcode_resolve = make_linkcode_resolve(
|
||||||
|
"sklearn",
|
||||||
|
(
|
||||||
|
"https://github.com/scikit-learn/"
|
||||||
|
"scikit-learn/blob/{revision}/"
|
||||||
|
"{package}/{path}#L{lineno}"
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
warnings.filterwarnings(
|
||||||
|
"ignore",
|
||||||
|
category=UserWarning,
|
||||||
|
message=(
|
||||||
|
"Matplotlib is currently using agg, which is a"
|
||||||
|
" non-GUI backend, so cannot show the figure."
|
||||||
|
),
|
||||||
|
)
|
||||||
|
if os.environ.get("SKLEARN_WARNINGS_AS_ERRORS", "0") != "0":
|
||||||
|
turn_warnings_into_errors()
|
||||||
|
|
||||||
|
# maps functions with a class name that is indistinguishable when case is
# ignored to another filename
|
||||||
|
autosummary_filename_map = {
|
||||||
|
"sklearn.cluster.dbscan": "dbscan-function",
|
||||||
|
"sklearn.covariance.oas": "oas-function",
|
||||||
|
"sklearn.decomposition.fastica": "fastica-function",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# Config for sphinxext.opengraph
|
||||||
|
|
||||||
|
ogp_site_url = "https://scikit-learn.org/stable/"
|
||||||
|
ogp_image = "https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png"
|
||||||
|
ogp_use_first_image = True
|
||||||
|
ogp_site_name = "scikit-learn"
|
||||||
|
|
||||||
|
# Config for linkcheck that checks the documentation for broken links
|
||||||
|
|
||||||
|
# ignore all links in 'whats_new' to avoid doing many github requests and
|
||||||
|
# hitting the github rate threshold that makes linkcheck take a lot of time
|
||||||
|
linkcheck_exclude_documents = [r"whats_new/.*"]
|
||||||
|
|
||||||
|
# default timeout to make some sites links fail faster
|
||||||
|
linkcheck_timeout = 10
|
||||||
|
|
||||||
|
# Allow redirects from doi.org
|
||||||
|
linkcheck_allowed_redirects = {r"https://doi.org/.+": r".*"}
|
||||||
|
linkcheck_ignore = [
|
||||||
|
# ignore links to local html files e.g. in image directive :target: field
|
||||||
|
r"^..?/",
|
||||||
|
# ignore links to specific pdf pages because linkcheck does not handle them
|
||||||
|
# ('utf-8' codec can't decode byte error)
|
||||||
|
r"http://www.utstat.toronto.edu/~rsalakhu/sta4273/notes/Lecture2.pdf#page=.*",
|
||||||
|
(
|
||||||
|
"https://www.fordfoundation.org/media/2976/roads-and-bridges"
|
||||||
|
"-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=.*"
|
||||||
|
),
|
||||||
|
# links falsely flagged as broken
|
||||||
|
(
|
||||||
|
"https://www.researchgate.net/publication/"
|
||||||
|
"233096619_A_Dendrite_Method_for_Cluster_Analysis"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"https://www.researchgate.net/publication/221114584_Random_Fourier"
|
||||||
|
"_Approximations_for_Skewed_Multiplicative_Histogram_Kernels"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"https://www.researchgate.net/publication/4974606_"
|
||||||
|
"Hedonic_housing_prices_and_the_demand_for_clean_air"
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"https://www.researchgate.net/profile/Anh-Huy-Phan/publication/220241471_Fast_"
|
||||||
|
"Local_Algorithms_for_Large_Scale_Nonnegative_Matrix_and_Tensor_Factorizations"
|
||||||
|
),
|
||||||
|
"https://doi.org/10.13140/RG.2.2.35280.02565",
|
||||||
|
(
|
||||||
|
"https://www.microsoft.com/en-us/research/uploads/prod/2006/01/"
|
||||||
|
"Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf"
|
||||||
|
),
|
||||||
|
"https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-99-87.pdf",
|
||||||
|
"https://microsoft.com/",
|
||||||
|
"https://www.jstor.org/stable/2984099",
|
||||||
|
"https://stat.uw.edu/sites/default/files/files/reports/2000/tr371.pdf",
|
||||||
|
# Broken links from testimonials
|
||||||
|
"http://www.bestofmedia.com",
|
||||||
|
"http://www.data-publica.com/",
|
||||||
|
"https://livelovely.com",
|
||||||
|
"https://www.mars.com/global",
|
||||||
|
"https://www.yhat.com",
|
||||||
|
# Ignore some dynamically created anchors. See
|
||||||
|
# https://github.com/sphinx-doc/sphinx/issues/9016 for more details about
|
||||||
|
# the github example
|
||||||
|
r"https://github.com/conda-forge/miniforge#miniforge",
|
||||||
|
r"https://github.com/joblib/threadpoolctl/"
|
||||||
|
"#setting-the-maximum-size-of-thread-pools",
|
||||||
|
r"https://stackoverflow.com/questions/5836335/"
|
||||||
|
"consistently-create-same-random-numpy-array/5837352#comment6712034_5837352",
|
||||||
|
]
|
||||||
|
|
||||||
|
# Config for sphinx-remove-toctrees
|
||||||
|
|
||||||
|
remove_from_toctrees = ["metadata_routing.rst"]
|
||||||
|
|
||||||
|
# Use a browser-like user agent to avoid some "403 Client Error: Forbidden for
|
||||||
|
# url" errors. This is taken from the variable navigator.userAgent inside a
|
||||||
|
# browser console.
|
||||||
|
user_agent = (
|
||||||
|
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Use Github token from environment variable to avoid Github rate limits when
|
||||||
|
# checking Github links
|
||||||
|
github_token = os.getenv("GITHUB_TOKEN")
|
||||||
|
|
||||||
|
if github_token is None:
|
||||||
|
linkcheck_request_headers = {}
|
||||||
|
else:
|
||||||
|
linkcheck_request_headers = {
|
||||||
|
"https://github.com/": {"Authorization": f"token {github_token}"},
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
# -- Convert .rst.template files to .rst ---------------------------------------
|
||||||
|
|
||||||
|
from api_reference import API_REFERENCE, DEPRECATED_API_REFERENCE
|
||||||
|
|
||||||
|
from sklearn._min_dependencies import dependent_packages
|
||||||
|
|
||||||
|
# If development build, link to local page in the top navbar; otherwise link to the
|
||||||
|
# development version; see https://github.com/scikit-learn/scikit-learn/pull/22550
|
||||||
|
if parsed_version.is_devrelease:
|
||||||
|
development_link = "developers/index"
|
||||||
|
else:
|
||||||
|
development_link = "https://scikit-learn.org/dev/developers/index.html"
|
||||||
|
|
||||||
|
# Define the templates and target files for conversion
|
||||||
|
# Each entry is in the format (template name, file name, kwargs for rendering)
|
||||||
|
rst_templates = [
|
||||||
|
("index", "index", {"development_link": development_link}),
|
||||||
|
(
|
||||||
|
"min_dependency_table",
|
||||||
|
"min_dependency_table",
|
||||||
|
{"dependent_packages": dependent_packages},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"min_dependency_substitutions",
|
||||||
|
"min_dependency_substitutions",
|
||||||
|
{"dependent_packages": dependent_packages},
|
||||||
|
),
|
||||||
|
(
|
||||||
|
"api/index",
|
||||||
|
"api/index",
|
||||||
|
{
|
||||||
|
"API_REFERENCE": sorted(API_REFERENCE.items(), key=lambda x: x[0]),
|
||||||
|
"DEPRECATED_API_REFERENCE": sorted(
|
||||||
|
DEPRECATED_API_REFERENCE.items(), key=lambda x: x[0], reverse=True
|
||||||
|
),
|
||||||
|
},
|
||||||
|
),
|
||||||
|
]
|
||||||
|
|
||||||
|
# Convert each module API reference page
|
||||||
|
for module in API_REFERENCE:
|
||||||
|
rst_templates.append(
|
||||||
|
(
|
||||||
|
"api/module",
|
||||||
|
f"api/{module}",
|
||||||
|
{"module": module, "module_info": API_REFERENCE[module]},
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
# Convert the deprecated API reference page (if there exists any)
|
||||||
|
if DEPRECATED_API_REFERENCE:
|
||||||
|
rst_templates.append(
|
||||||
|
(
|
||||||
|
"api/deprecated",
|
||||||
|
"api/deprecated",
|
||||||
|
{
|
||||||
|
"DEPRECATED_API_REFERENCE": sorted(
|
||||||
|
DEPRECATED_API_REFERENCE.items(), key=lambda x: x[0], reverse=True
|
||||||
|
)
|
||||||
|
},
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
for rst_template_name, rst_target_name, kwargs in rst_templates:
|
||||||
|
# Read the corresponding template file into jinja2
|
||||||
|
with (Path(".") / f"{rst_template_name}.rst.template").open(
|
||||||
|
"r", encoding="utf-8"
|
||||||
|
) as f:
|
||||||
|
t = jinja2.Template(f.read())
|
||||||
|
|
||||||
|
# Render the template and write to the target
|
||||||
|
with (Path(".") / f"{rst_target_name}.rst").open("w", encoding="utf-8") as f:
|
||||||
|
f.write(t.render(**kwargs))
|
|
@ -0,0 +1,194 @@
|
||||||
|
import os
|
||||||
|
import warnings
|
||||||
|
from os import environ
|
||||||
|
from os.path import exists, join
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from _pytest.doctest import DoctestItem
|
||||||
|
|
||||||
|
from sklearn.datasets import get_data_home
|
||||||
|
from sklearn.datasets._base import _pkl_filepath
|
||||||
|
from sklearn.datasets._twenty_newsgroups import CACHE_NAME
|
||||||
|
from sklearn.utils._testing import SkipTest, check_skip_network
|
||||||
|
from sklearn.utils.fixes import _IS_PYPY, np_base_version, parse_version
|
||||||
|
|
||||||
|
|
||||||
|
def setup_labeled_faces():
|
||||||
|
data_home = get_data_home()
|
||||||
|
if not exists(join(data_home, "lfw_home")):
|
||||||
|
raise SkipTest("Skipping dataset loading doctests")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_rcv1():
|
||||||
|
check_skip_network()
|
||||||
|
# skip the test in rcv1.rst if the dataset is not already loaded
|
||||||
|
rcv1_dir = join(get_data_home(), "RCV1")
|
||||||
|
if not exists(rcv1_dir):
|
||||||
|
raise SkipTest("Download RCV1 dataset to run this test.")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_twenty_newsgroups():
|
||||||
|
cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
|
||||||
|
if not exists(cache_path):
|
||||||
|
raise SkipTest("Skipping dataset loading doctests")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_working_with_text_data():
|
||||||
|
if _IS_PYPY and os.environ.get("CI", None):
|
||||||
|
raise SkipTest("Skipping too slow test with PyPy on CI")
|
||||||
|
check_skip_network()
|
||||||
|
cache_path = _pkl_filepath(get_data_home(), CACHE_NAME)
|
||||||
|
if not exists(cache_path):
|
||||||
|
raise SkipTest("Skipping dataset loading doctests")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_loading_other_datasets():
|
||||||
|
try:
|
||||||
|
import pandas # noqa
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping loading_other_datasets.rst, pandas not installed")
|
||||||
|
|
||||||
|
# checks SKLEARN_SKIP_NETWORK_TESTS to see if test should run
|
||||||
|
run_network_tests = environ.get("SKLEARN_SKIP_NETWORK_TESTS", "1") == "0"
|
||||||
|
if not run_network_tests:
|
||||||
|
raise SkipTest(
|
||||||
|
"Skipping loading_other_datasets.rst, tests can be "
|
||||||
|
"enabled by setting SKLEARN_SKIP_NETWORK_TESTS=0"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def setup_compose():
|
||||||
|
try:
|
||||||
|
import pandas # noqa
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping compose.rst, pandas not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_impute():
|
||||||
|
try:
|
||||||
|
import pandas # noqa
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping impute.rst, pandas not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_grid_search():
|
||||||
|
try:
|
||||||
|
import pandas # noqa
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping grid_search.rst, pandas not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_preprocessing():
|
||||||
|
try:
|
||||||
|
import pandas # noqa
|
||||||
|
|
||||||
|
if parse_version(pandas.__version__) < parse_version("1.1.0"):
|
||||||
|
raise SkipTest("Skipping preprocessing.rst, pandas version < 1.1.0")
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping preprocessing.rst, pandas not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def setup_unsupervised_learning():
|
||||||
|
try:
|
||||||
|
import skimage # noqa
|
||||||
|
except ImportError:
|
||||||
|
raise SkipTest("Skipping unsupervised_learning.rst, scikit-image not installed")
|
||||||
|
# ignore deprecation warnings from scipy.misc.face
|
||||||
|
warnings.filterwarnings(
|
||||||
|
"ignore", "The binary mode of fromstring", DeprecationWarning
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def skip_if_matplotlib_not_installed(fname):
|
||||||
|
try:
|
||||||
|
import matplotlib # noqa
|
||||||
|
except ImportError:
|
||||||
|
basename = os.path.basename(fname)
|
||||||
|
raise SkipTest(f"Skipping doctests for {basename}, matplotlib not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def skip_if_cupy_not_installed(fname):
|
||||||
|
try:
|
||||||
|
import cupy # noqa
|
||||||
|
except ImportError:
|
||||||
|
basename = os.path.basename(fname)
|
||||||
|
raise SkipTest(f"Skipping doctests for {basename}, cupy not installed")
|
||||||
|
|
||||||
|
|
||||||
|
def pytest_runtest_setup(item):
|
||||||
|
fname = item.fspath.strpath
|
||||||
|
# normalize filename to use forward slashes on Windows for easier handling
|
||||||
|
# later
|
||||||
|
fname = fname.replace(os.sep, "/")
|
||||||
|
|
||||||
|
is_index = fname.endswith("datasets/index.rst")
|
||||||
|
if fname.endswith("datasets/labeled_faces.rst") or is_index:
|
||||||
|
setup_labeled_faces()
|
||||||
|
elif fname.endswith("datasets/rcv1.rst") or is_index:
|
||||||
|
setup_rcv1()
|
||||||
|
elif fname.endswith("datasets/twenty_newsgroups.rst") or is_index:
|
||||||
|
setup_twenty_newsgroups()
|
||||||
|
elif fname.endswith("modules/compose.rst") or is_index:
|
||||||
|
setup_compose()
|
||||||
|
elif fname.endswith("datasets/loading_other_datasets.rst"):
|
||||||
|
setup_loading_other_datasets()
|
||||||
|
elif fname.endswith("modules/impute.rst"):
|
||||||
|
setup_impute()
|
||||||
|
elif fname.endswith("modules/grid_search.rst"):
|
||||||
|
setup_grid_search()
|
||||||
|
elif fname.endswith("modules/preprocessing.rst"):
|
||||||
|
setup_preprocessing()
|
||||||
|
elif fname.endswith("statistical_inference/unsupervised_learning.rst"):
|
||||||
|
setup_unsupervised_learning()
|
||||||
|
|
||||||
|
rst_files_requiring_matplotlib = [
|
||||||
|
"modules/partial_dependence.rst",
|
||||||
|
"modules/tree.rst",
|
||||||
|
]
|
||||||
|
for each in rst_files_requiring_matplotlib:
|
||||||
|
if fname.endswith(each):
|
||||||
|
skip_if_matplotlib_not_installed(fname)
|
||||||
|
|
||||||
|
if fname.endswith("array_api.rst"):
|
||||||
|
skip_if_cupy_not_installed(fname)
|
||||||
|
|
||||||
|
|
||||||
|
def pytest_configure(config):
|
||||||
|
# Use matplotlib agg backend during the tests including doctests
|
||||||
|
try:
|
||||||
|
import matplotlib
|
||||||
|
|
||||||
|
matplotlib.use("agg")
|
||||||
|
except ImportError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def pytest_collection_modifyitems(config, items):
|
||||||
|
"""Called after collect is completed.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
config : pytest config
|
||||||
|
items : list of collected items
|
||||||
|
"""
|
||||||
|
skip_doctests = False
|
||||||
|
if np_base_version >= parse_version("2"):
|
||||||
|
# Skip doctests when using numpy 2 for now. See the following discussion
|
||||||
|
# to decide what to do in the longer term:
|
||||||
|
# https://github.com/scikit-learn/scikit-learn/issues/27339
|
||||||
|
reason = "Due to NEP 51 numpy scalar repr has changed in numpy 2"
|
||||||
|
skip_doctests = True
|
||||||
|
|
||||||
|
# Normally doctest has the entire module's scope. Here we set globs to an empty dict
|
||||||
|
# to remove the module's scope:
|
||||||
|
# https://docs.python.org/3/library/doctest.html#what-s-the-execution-context
|
||||||
|
for item in items:
|
||||||
|
if isinstance(item, DoctestItem):
|
||||||
|
item.dtest.globs = {}
|
||||||
|
|
||||||
|
if skip_doctests:
|
||||||
|
skip_marker = pytest.mark.skip(reason=reason)
|
||||||
|
|
||||||
|
for item in items:
|
||||||
|
if isinstance(item, DoctestItem):
|
||||||
|
item.add_marker(skip_marker)
|
|
@ -0,0 +1,44 @@
|
||||||
|
.. raw :: html
|
||||||
|
|
||||||
|
<!-- Generated by generate_authors_table.py -->
|
||||||
|
<div class="sk-authors-container">
|
||||||
|
<style>
|
||||||
|
img.avatar {border-radius: 10px;}
|
||||||
|
</style>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/alfaro96'><img src='https://avatars.githubusercontent.com/u/32649176?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Juan Carlos Alfaro Jiménez</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/lucyleeow'><img src='https://avatars.githubusercontent.com/u/23182829?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Lucy Liu</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/MaxwellLZH'><img src='https://avatars.githubusercontent.com/u/16646940?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Maxwell Liu</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/jmloyola'><img src='https://avatars.githubusercontent.com/u/2133361?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Juan Martin Loyola</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/smarie'><img src='https://avatars.githubusercontent.com/u/3236794?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Sylvain Marié</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/norbusan'><img src='https://avatars.githubusercontent.com/u/1735589?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Norbert Preining</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/reshamas'><img src='https://avatars.githubusercontent.com/u/2507232?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Reshama Shaikh</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/albertcthomas'><img src='https://avatars.githubusercontent.com/u/15966638?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Albert Thomas</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/marenwestermann'><img src='https://avatars.githubusercontent.com/u/17019042?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Maren Westermann</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
|
@ -0,0 +1 @@
|
||||||
|
- Chiara Marmo
|
|
@ -0,0 +1,35 @@
|
||||||
|
.. _data-transforms:
|
||||||
|
|
||||||
|
Dataset transformations
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
scikit-learn provides a library of transformers, which may clean (see
|
||||||
|
:ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see
|
||||||
|
:ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`)
|
||||||
|
feature representations.
|
||||||
|
|
||||||
|
Like other estimators, these are represented by classes with a ``fit`` method,
|
||||||
|
which learns model parameters (e.g. mean and standard deviation for
|
||||||
|
normalization) from a training set, and a ``transform`` method which applies
|
||||||
|
this transformation model to unseen data. ``fit_transform`` may be more
|
||||||
|
convenient and efficient for modelling and transforming the training data
|
||||||
|
simultaneously.
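
As a minimal sketch of this pattern (using
:class:`~sklearn.preprocessing.StandardScaler` purely for illustration, with
made-up training data)::

  >>> import numpy as np
  >>> from sklearn.preprocessing import StandardScaler
  >>> X_train = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
  >>> scaler = StandardScaler().fit(X_train)  # learn the per-feature mean and std
  >>> scaler.transform([[2.0, 3.0]])          # apply the learned transformation
  array([[0., 0.]])
  >>> X_scaled = scaler.fit_transform(X_train)  # fit and transform in one step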
|
||||||
|
|
||||||
|
Combining such transformers, either in parallel or in series, is covered in
|
||||||
|
:ref:`combining_estimators`. :ref:`metrics` covers transforming feature
|
||||||
|
spaces into affinity matrices, while :ref:`preprocessing_targets` considers
|
||||||
|
transformations of the target space (e.g. categorical labels) for use in
|
||||||
|
scikit-learn.
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
|
||||||
|
modules/compose
|
||||||
|
modules/feature_extraction
|
||||||
|
modules/preprocessing
|
||||||
|
modules/impute
|
||||||
|
modules/unsupervised_reduction
|
||||||
|
modules/random_projection
|
||||||
|
modules/kernel_approximation
|
||||||
|
modules/metrics
|
||||||
|
modules/preprocessing_targets
|
|
@ -0,0 +1,65 @@
|
||||||
|
.. _datasets:
|
||||||
|
|
||||||
|
=========================
|
||||||
|
Dataset loading utilities
|
||||||
|
=========================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.datasets
|
||||||
|
|
||||||
|
The ``sklearn.datasets`` package embeds some small toy datasets
|
||||||
|
as introduced in the :ref:`Getting Started <loading_example_dataset>` section.
|
||||||
|
|
||||||
|
This package also features helpers to fetch larger datasets commonly
|
||||||
|
used by the machine learning community to benchmark algorithms on data
|
||||||
|
that comes from the 'real world'.
|
||||||
|
|
||||||
|
To evaluate the impact of the scale of the dataset (``n_samples`` and
|
||||||
|
``n_features``) while controlling the statistical properties of the data
|
||||||
|
(typically the correlation and informativeness of the features), it is
|
||||||
|
also possible to generate synthetic data.
|
||||||
|
|
||||||
|
**General dataset API.** There are three main kinds of dataset interfaces that
|
||||||
|
can be used to get datasets depending on the desired type of dataset.
|
||||||
|
|
||||||
|
**The dataset loaders.** They can be used to load small standard datasets,
|
||||||
|
described in the :ref:`toy_datasets` section.
|
||||||
|
|
||||||
|
**The dataset fetchers.** They can be used to download and load larger datasets,
|
||||||
|
described in the :ref:`real_world_datasets` section.
|
||||||
|
|
||||||
|
Both the dataset loaders and fetchers return a :class:`~sklearn.utils.Bunch`
object holding at least two items:
|
||||||
|
an array of shape ``n_samples`` * ``n_features`` with
|
||||||
|
key ``data`` (except for 20newsgroups) and a numpy array of
|
||||||
|
length ``n_samples``, containing the target values, with key ``target``.
|
||||||
|
|
||||||
|
The Bunch object is a dictionary that exposes its keys as attributes.
|
||||||
|
For more information about the Bunch object, see :class:`~sklearn.utils.Bunch`.
|
||||||
|
|
||||||
|
It's also possible for almost all of these functions to constrain the output
|
||||||
|
to be a tuple containing only the data and the target, by setting the
|
||||||
|
``return_X_y`` parameter to ``True``.
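
For instance, a minimal illustration with the iris loader described in
:ref:`toy_datasets`::

  >>> from sklearn.datasets import load_iris
  >>> X, y = load_iris(return_X_y=True)
  >>> X.shape, y.shape
  ((150, 4), (150,))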
|
||||||
|
|
||||||
|
The datasets also contain a full description in their ``DESCR`` attribute and
|
||||||
|
some contain ``feature_names`` and ``target_names``. See the dataset
|
||||||
|
descriptions below for details.
|
||||||
|
|
||||||
|
**The dataset generation functions.** They can be used to generate controlled
|
||||||
|
synthetic datasets, described in the :ref:`sample_generators` section.
|
||||||
|
|
||||||
|
These functions return a tuple ``(X, y)`` consisting of a ``n_samples`` *
|
||||||
|
``n_features`` numpy array ``X`` and an array of length ``n_samples``
|
||||||
|
containing the targets ``y``.
|
||||||
|
|
||||||
|
In addition, there are also miscellaneous tools to load datasets of other
|
||||||
|
formats or from other locations, described in the :ref:`loading_other_datasets`
|
||||||
|
section.
|
||||||
|
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
|
||||||
|
datasets/toy_dataset
|
||||||
|
datasets/real_world
|
||||||
|
datasets/sample_generators
|
||||||
|
datasets/loading_other_datasets
|
|
@ -0,0 +1,312 @@
|
||||||
|
.. _loading_other_datasets:
|
||||||
|
|
||||||
|
Loading other datasets
|
||||||
|
======================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.datasets
|
||||||
|
|
||||||
|
.. _sample_images:
|
||||||
|
|
||||||
|
Sample images
|
||||||
|
-------------
|
||||||
|
|
||||||
|
Scikit-learn also embeds a couple of sample JPEG images published under Creative
|
||||||
|
Commons license by their authors. Those images can be useful to test algorithms
|
||||||
|
and pipelines on 2D data.
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
load_sample_images
|
||||||
|
load_sample_image
|
||||||
|
|
||||||
|
.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
|
||||||
|
:target: ../auto_examples/cluster/plot_color_quantization.html
|
||||||
|
:scale: 30
|
||||||
|
:align: right
|
||||||
|
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
|
||||||
|
The default coding of images is based on the ``uint8`` dtype to
|
||||||
|
spare memory. Often machine learning algorithms work best if the
|
||||||
|
input is converted to a floating point representation first. Also,
|
||||||
|
if you plan to use ``matplotlib.pyplot.imshow``, don't forget to scale to the range
|
||||||
|
0 - 1 as done in the following example.
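
A minimal sketch of that conversion (using the bundled ``china.jpg`` sample
image)::

  >>> from sklearn.datasets import load_sample_image
  >>> china = load_sample_image("china.jpg")
  >>> china.dtype
  dtype('uint8')
  >>> china_float = china / 255.0  # rescale to the [0, 1] range for imshow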
|
||||||
|
|
||||||
|
.. rubric:: Examples
|
||||||
|
|
||||||
|
* :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`
|
||||||
|
|
||||||
|
.. _libsvm_loader:
|
||||||
|
|
||||||
|
Datasets in svmlight / libsvm format
|
||||||
|
------------------------------------
|
||||||
|
|
||||||
|
scikit-learn includes utility functions for loading
|
||||||
|
datasets in the svmlight / libsvm format. In this format, each line
|
||||||
|
takes the form ``<label> <feature-id>:<feature-value>
|
||||||
|
<feature-id>:<feature-value> ...``. This format is especially suitable for sparse datasets.
|
||||||
|
In this module, scipy sparse CSR matrices are used for ``X`` and numpy arrays are used for ``y``.
|
||||||
|
|
||||||
|
You may load a dataset as follows::
|
||||||
|
|
||||||
|
>>> from sklearn.datasets import load_svmlight_file
|
||||||
|
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
|
||||||
|
... # doctest: +SKIP
|
||||||
|
|
||||||
|
You may also load two (or more) datasets at once::
|
||||||
|
|
||||||
|
>>> X_train, y_train, X_test, y_test = load_svmlight_files(
|
||||||
|
... ("/path/to/train_dataset.txt", "/path/to/test_dataset.txt"))
|
||||||
|
... # doctest: +SKIP
|
||||||
|
|
||||||
|
In this case, ``X_train`` and ``X_test`` are guaranteed to have the same number
|
||||||
|
of features. Another way to achieve the same result is to fix the number of
|
||||||
|
features::
|
||||||
|
|
||||||
|
>>> X_test, y_test = load_svmlight_file(
|
||||||
|
... "/path/to/test_dataset.txt", n_features=X_train.shape[1])
|
||||||
|
... # doctest: +SKIP
|
||||||
|
|
||||||
|
.. rubric:: Related links
|
||||||
|
|
||||||
|
- `Public datasets in svmlight / libsvm format <https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets>`_
- `Faster API-compatible implementation <https://github.com/mblondel/svmlight-loader>`_
|
||||||
|
|
||||||
|
..
|
||||||
|
For doctests:
|
||||||
|
|
||||||
|
>>> import numpy as np
|
||||||
|
>>> import os
|
||||||
|
|
||||||
|
.. _openml:
|
||||||
|
|
||||||
|
Downloading datasets from the openml.org repository
|
||||||
|
---------------------------------------------------
|
||||||
|
|
||||||
|
`openml.org <https://openml.org>`_ is a public repository for machine learning
|
||||||
|
data and experiments that allows everybody to upload open datasets.
|
||||||
|
|
||||||
|
The ``sklearn.datasets`` package is able to download datasets
|
||||||
|
from the repository using the function
|
||||||
|
:func:`sklearn.datasets.fetch_openml`.
|
||||||
|
|
||||||
|
For example, to download a dataset of gene expressions in mice brains::
|
||||||
|
|
||||||
|
>>> from sklearn.datasets import fetch_openml
|
||||||
|
>>> mice = fetch_openml(name='miceprotein', version=4)
|
||||||
|
|
||||||
|
To fully specify a dataset, you need to provide a name and a version, though
|
||||||
|
the version is optional, see :ref:`openml_versions` below.
|
||||||
|
The dataset contains a total of 1080 examples belonging to 8 different
|
||||||
|
classes::
|
||||||
|
|
||||||
|
>>> mice.data.shape
|
||||||
|
(1080, 77)
|
||||||
|
>>> mice.target.shape
|
||||||
|
(1080,)
|
||||||
|
>>> np.unique(mice.target)
|
||||||
|
array(['c-CS-m', 'c-CS-s', 'c-SC-m', 'c-SC-s', 't-CS-m', 't-CS-s', 't-SC-m', 't-SC-s'], dtype=object)
|
||||||
|
|
||||||
|
You can get more information on the dataset by looking at the ``DESCR``
|
||||||
|
and ``details`` attributes::
|
||||||
|
|
||||||
|
>>> print(mice.DESCR) # doctest: +SKIP
|
||||||
|
**Author**: Clara Higuera, Katheleen J. Gardiner, Krzysztof J. Cios
|
||||||
|
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression) - 2015
|
||||||
|
**Please cite**: Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing
|
||||||
|
Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down
|
||||||
|
Syndrome. PLoS ONE 10(6): e0129126...
|
||||||
|
|
||||||
|
>>> mice.details # doctest: +SKIP
|
||||||
|
{'id': '40966', 'name': 'MiceProtein', 'version': '4', 'format': 'ARFF',
|
||||||
|
'upload_date': '2017-11-08T16:00:15', 'licence': 'Public',
|
||||||
|
'url': 'https://www.openml.org/data/v1/download/17928620/MiceProtein.arff',
|
||||||
|
'file_id': '17928620', 'default_target_attribute': 'class',
|
||||||
|
'row_id_attribute': 'MouseID',
|
||||||
|
'ignore_attribute': ['Genotype', 'Treatment', 'Behavior'],
|
||||||
|
'tag': ['OpenML-CC18', 'study_135', 'study_98', 'study_99'],
|
||||||
|
'visibility': 'public', 'status': 'active',
|
||||||
|
'md5_checksum': '3c479a6885bfa0438971388283a1ce32'}
|
||||||
|
|
||||||
|
|
||||||
|
The ``DESCR`` contains a free-text description of the data, while ``details``
|
||||||
|
contains a dictionary of meta-data stored by openml, like the dataset id.
|
||||||
|
For more details, see the `OpenML documentation
|
||||||
|
<https://docs.openml.org/#data>`_. The ``data_id`` of the mice protein dataset
|
||||||
|
is 40966, and you can use this (or the name) to get more information on the
|
||||||
|
dataset on the openml website::
|
||||||
|
|
||||||
|
>>> mice.url
|
||||||
|
'https://www.openml.org/d/40966'
|
||||||
|
|
||||||
|
The ``data_id`` also uniquely identifies a dataset from OpenML::
|
||||||
|
|
||||||
|
>>> mice = fetch_openml(data_id=40966)
|
||||||
|
>>> mice.details # doctest: +SKIP
|
||||||
|
{'id': '4550', 'name': 'MiceProtein', 'version': '1', 'format': 'ARFF',
|
||||||
|
'creator': ...,
|
||||||
|
'upload_date': '2016-02-17T14:32:49', 'licence': 'Public', 'url':
|
||||||
|
'https://www.openml.org/data/v1/download/1804243/MiceProtein.ARFF', 'file_id':
|
||||||
|
'1804243', 'default_target_attribute': 'class', 'citation': 'Higuera C,
|
||||||
|
Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins
|
||||||
|
Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6):
|
||||||
|
e0129126. [Web Link] journal.pone.0129126', 'tag': ['OpenML100', 'study_14',
|
||||||
|
'study_34'], 'visibility': 'public', 'status': 'active', 'md5_checksum':
|
||||||
|
'3c479a6885bfa0438971388283a1ce32'}
|
||||||
|
|
||||||
|
.. _openml_versions:
|
||||||
|
|
||||||
|
Dataset Versions
|
||||||
|
~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
A dataset is uniquely specified by its ``data_id``, but not necessarily by its
|
||||||
|
name. Several different "versions" of a dataset with the same name can exist
|
||||||
|
which can contain entirely different datasets.
|
||||||
|
If a particular version of a dataset has been found to contain significant
|
||||||
|
issues, it might be deactivated. Using a name to specify a dataset will yield
|
||||||
|
the earliest version of a dataset that is still active. That means that
|
||||||
|
``fetch_openml(name="miceprotein")`` can yield different results
|
||||||
|
at different times if earlier versions become inactive.
|
||||||
|
You can see that the dataset with ``data_id`` 40966 that we fetched above is
|
||||||
|
the first version of the "miceprotein" dataset::
|
||||||
|
|
||||||
|
>>> mice.details['version'] #doctest: +SKIP
|
||||||
|
'1'
|
||||||
|
|
||||||
|
In fact, this dataset only has one version. The iris dataset on the other hand
|
||||||
|
has multiple versions::
|
||||||
|
|
||||||
|
>>> iris = fetch_openml(name="iris")
|
||||||
|
>>> iris.details['version'] #doctest: +SKIP
|
||||||
|
'1'
|
||||||
|
>>> iris.details['id'] #doctest: +SKIP
|
||||||
|
'61'
|
||||||
|
|
||||||
|
>>> iris_61 = fetch_openml(data_id=61)
|
||||||
|
>>> iris_61.details['version']
|
||||||
|
'1'
|
||||||
|
>>> iris_61.details['id']
|
||||||
|
'61'
|
||||||
|
|
||||||
|
>>> iris_969 = fetch_openml(data_id=969)
|
||||||
|
>>> iris_969.details['version']
|
||||||
|
'3'
|
||||||
|
>>> iris_969.details['id']
|
||||||
|
'969'
|
||||||
|
|
||||||
|
Specifying the dataset by the name "iris" yields the lowest version, version 1,
|
||||||
|
with the ``data_id`` 61. To make sure you always get this exact dataset, it is
|
||||||
|
safest to specify it by the dataset ``data_id``. The other dataset, with
|
||||||
|
``data_id`` 969, is version 3 (version 2 has become inactive), and contains a
|
||||||
|
binarized version of the data::
|
||||||
|
|
||||||
|
>>> np.unique(iris_969.target)
|
||||||
|
array(['N', 'P'], dtype=object)
|
||||||
|
|
||||||
|
You can also specify both the name and the version, which also uniquely
|
||||||
|
identifies the dataset::
|
||||||
|
|
||||||
|
>>> iris_version_3 = fetch_openml(name="iris", version=3)
|
||||||
|
>>> iris_version_3.details['version']
|
||||||
|
'3'
|
||||||
|
>>> iris_version_3.details['id']
|
||||||
|
'969'
|
||||||
|
|
||||||
|
|
||||||
|
.. rubric:: References
|
||||||
|
|
||||||
|
* :arxiv:`Vanschoren, van Rijn, Bischl and Torgo. "OpenML: networked science in
|
||||||
|
machine learning" ACM SIGKDD Explorations Newsletter, 15(2), 49-60, 2014.
|
||||||
|
<1407.7722>`
|
||||||
|
|
||||||
|
.. _openml_parser:
|
||||||
|
|
||||||
|
ARFF parser
|
||||||
|
~~~~~~~~~~~
|
||||||
|
|
||||||
|
From version 1.2, scikit-learn provides a new keyword argument `parser` that
offers several options to parse the ARFF files provided by OpenML. The legacy
parser (i.e. `parser="liac-arff"`) is based on the project
`LIAC-ARFF <https://github.com/renatopp/liac-arff>`_. This parser is however
slow and consumes more memory than required. A new parser based on pandas
|
||||||
|
(i.e. `parser="pandas"`) is both faster and more memory efficient.
|
||||||
|
However, this parser does not support sparse data.
|
||||||
|
Therefore, we recommend using `parser="auto"` which will use the best parser
|
||||||
|
available for the requested dataset.
|
||||||
|
|
||||||
|
The `"pandas"` and `"liac-arff"` parsers can lead to different data types in
|
||||||
|
the output. The notable differences are the following:
|
||||||
|
|
||||||
|
- The `"liac-arff"` parser always encodes categorical features as `str`
|
||||||
|
objects. On the contrary, the `"pandas"` parser instead infers the type while
reading, and numerical categories will be cast into integers whenever
|
||||||
|
possible.
|
||||||
|
- The `"liac-arff"` parser uses float64 to encode numerical features tagged as
|
||||||
|
'REAL' and 'NUMERICAL' in the metadata. The `"pandas"` parser instead infers
|
||||||
|
if these numerical features correspond to integers and uses pandas' Integer
|
||||||
|
extension dtype.
|
||||||
|
- In particular, classification datasets with integer categories are typically
|
||||||
|
loaded as such `(0, 1, ...)` with the `"pandas"` parser while `"liac-arff"`
|
||||||
|
will force the use of string encoded class labels such as `"0"`, `"1"` and so
|
||||||
|
on.
|
||||||
|
- The `"pandas"` parser will not strip single quotes - i.e. `'` - from string
|
||||||
|
columns. For instance, a string `'my string'` will be kept as is while the
|
||||||
|
`"liac-arff"` parser will strip the single quotes. For categorical columns,
|
||||||
|
the single quotes are stripped from the values.
|
||||||
|
|
||||||
|
In addition, when `as_frame=False` is used, the `"liac-arff"` parser returns
|
||||||
|
ordinally encoded data where the categories are provided in the attribute
|
||||||
|
`categories` of the `Bunch` instance. Instead, `"pandas"` returns a NumPy array
where the categories are kept as strings. It is then up to the user to design a feature
|
||||||
|
engineering pipeline with an instance of `OneHotEncoder` or
|
||||||
|
`OrdinalEncoder` typically wrapped in a `ColumnTransformer` to
|
||||||
|
preprocess the categorical columns explicitly. See for instance: :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
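
A short sketch of selecting the parser explicitly (the resulting dtypes depend
on the dataset and on the installed pandas version)::

  >>> from sklearn.datasets import fetch_openml
  >>> mice = fetch_openml(name='miceprotein', version=4, parser="pandas")  # doctest: +SKIP
  >>> mice.data.dtypes  # doctest: +SKIP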
|
||||||
|
|
||||||
|
.. _external_datasets:
|
||||||
|
|
||||||
|
Loading from external datasets
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
scikit-learn works on any numeric data stored as numpy arrays or scipy sparse
|
||||||
|
matrices. Other types that are convertible to numeric arrays such as pandas
|
||||||
|
DataFrame are also acceptable.
|
||||||
|
|
||||||
|
Here are some recommended ways to load standard columnar data into a
|
||||||
|
format usable by scikit-learn:
|
||||||
|
|
||||||
|
* `pandas.io <https://pandas.pydata.org/pandas-docs/stable/io.html>`_
|
||||||
|
provides tools to read data from common formats including CSV, Excel, JSON
|
||||||
|
and SQL. DataFrames may also be constructed from lists of tuples or dicts.
|
||||||
|
Pandas handles heterogeneous data smoothly and provides tools for
|
||||||
|
manipulation and conversion into a numeric array suitable for scikit-learn.
|
||||||
|
* `scipy.io <https://docs.scipy.org/doc/scipy/reference/io.html>`_
|
||||||
|
specializes in binary formats often used in scientific computing
contexts, such as .mat and .arff
|
||||||
|
* `numpy/routines.io <https://docs.scipy.org/doc/numpy/reference/routines.io.html>`_
|
||||||
|
for standard loading of columnar data into numpy arrays
|
||||||
|
* scikit-learn's :func:`load_svmlight_file` for the svmlight or libSVM
|
||||||
|
sparse format
|
||||||
|
* scikit-learn's :func:`load_files` for directories of text files where
|
||||||
|
the name of each directory is the name of each category and each file inside
|
||||||
|
of each directory corresponds to one sample from that category
|
||||||
|
|
||||||
|
For some miscellaneous data such as images, videos, and audio, you may wish to
|
||||||
|
refer to:
|
||||||
|
|
||||||
|
* `skimage.io <https://scikit-image.org/docs/dev/api/skimage.io.html>`_ or
|
||||||
|
`Imageio <https://imageio.readthedocs.io/en/stable/reference/core_v3.html>`_
|
||||||
|
for loading images and videos into numpy arrays
|
||||||
|
* `scipy.io.wavfile.read
|
||||||
|
<https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.wavfile.read.html>`_
|
||||||
|
for reading WAV files into a numpy array
|
||||||
|
|
||||||
|
Categorical (or nominal) features stored as strings (common in pandas DataFrames)
|
||||||
|
will need converting to numerical features using :class:`~sklearn.preprocessing.OneHotEncoder`
|
||||||
|
or :class:`~sklearn.preprocessing.OrdinalEncoder` or similar.
|
||||||
|
See :ref:`preprocessing`.
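
A minimal sketch of such a conversion (the column name and values are made up
for the example)::

  >>> import pandas as pd
  >>> from sklearn.preprocessing import OneHotEncoder
  >>> df = pd.DataFrame({"color": ["red", "green", "red"]})
  >>> enc = OneHotEncoder(sparse_output=False).fit(df)
  >>> enc.transform(df)  # one column per category, in alphabetical order
  array([[0., 1.],
         [1., 0.],
         [0., 1.]])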
|
||||||
|
|
||||||
|
Note: if you manage your own numerical data it is recommended to use an
|
||||||
|
optimized file format such as HDF5 to reduce data load times. Various libraries
|
||||||
|
such as H5Py, PyTables and pandas provide a Python interface for reading and
|
||||||
|
writing data in that format.
|
|
@ -0,0 +1,40 @@
|
||||||
|
.. _real_world_datasets:
|
||||||
|
|
||||||
|
Real world datasets
|
||||||
|
===================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.datasets
|
||||||
|
|
||||||
|
scikit-learn provides tools to load larger datasets, downloading them if
|
||||||
|
necessary.
|
||||||
|
|
||||||
|
They can be loaded using the following functions:
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
fetch_olivetti_faces
|
||||||
|
fetch_20newsgroups
|
||||||
|
fetch_20newsgroups_vectorized
|
||||||
|
fetch_lfw_people
|
||||||
|
fetch_lfw_pairs
|
||||||
|
fetch_covtype
|
||||||
|
fetch_rcv1
|
||||||
|
fetch_kddcup99
|
||||||
|
fetch_california_housing
|
||||||
|
fetch_species_distributions
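
For instance, fetching one of these datasets typically looks like the
following; the data is downloaded on first use and cached in
``~/scikit_learn_data`` by default::

  >>> from sklearn.datasets import fetch_california_housing
  >>> housing = fetch_california_housing()  # doctest: +SKIP
  >>> housing.data.shape                    # doctest: +SKIP
  (20640, 8)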
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/olivetti_faces.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/twenty_newsgroups.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/lfw.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/covtype.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/rcv1.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/kddcup99.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/california_housing.rst
|
||||||
|
|
||||||
|
.. include:: ../../sklearn/datasets/descr/species_distributions.rst
|
|
@ -0,0 +1,108 @@
|
||||||
|
.. _sample_generators:
|
||||||
|
|
||||||
|
Generated datasets
|
||||||
|
==================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.datasets
|
||||||
|
|
||||||
|
In addition, scikit-learn includes various random sample generators that
|
||||||
|
can be used to build artificial datasets of controlled size and complexity.
|
||||||
|
|
||||||
|
Generators for classification and clustering
|
||||||
|
--------------------------------------------
|
||||||
|
|
||||||
|
These generators produce a matrix of features and corresponding discrete
|
||||||
|
targets.
|
||||||
|
|
||||||
|
Single label
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Both :func:`make_blobs` and :func:`make_classification` create multiclass
|
||||||
|
datasets by allocating each class one or more normally-distributed clusters of
|
||||||
|
points. :func:`make_blobs` provides greater control regarding the centers and
|
||||||
|
standard deviations of each cluster, and is used to demonstrate clustering.
|
||||||
|
:func:`make_classification` specializes in introducing noise by way of:
|
||||||
|
correlated, redundant and uninformative features; multiple Gaussian clusters
|
||||||
|
per class; and linear transformations of the feature space.
|
||||||
|
|
||||||
|
:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
|
||||||
|
near-equal-size classes separated by concentric hyperspheres.
|
||||||
|
:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.
|
||||||
|
|
||||||
|
.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_dataset_001.png
|
||||||
|
:target: ../auto_examples/datasets/plot_random_dataset.html
|
||||||
|
:scale: 50
|
||||||
|
:align: center
|
||||||
|
|
||||||
|
:func:`make_circles` and :func:`make_moons` generate 2d binary classification
|
||||||
|
datasets that are challenging to certain algorithms (e.g. centroid-based
|
||||||
|
clustering or linear classification), including optional Gaussian noise.
|
||||||
|
They are useful for visualization. :func:`make_circles` produces Gaussian data
|
||||||
|
with a spherical decision boundary for binary classification, while
|
||||||
|
:func:`make_moons` produces two interleaving half circles.
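
A minimal sketch of generating such data (the parameter values are chosen only
for illustration)::

  >>> from sklearn.datasets import make_classification, make_moons
  >>> X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
  ...                            n_redundant=2, random_state=0)
  >>> X.shape, y.shape
  ((100, 20), (100,))
  >>> X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)
  >>> X_moons.shape
  (100, 2)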
|
||||||
|
|
||||||
|
Multilabel
|
||||||
|
~~~~~~~~~~
|
||||||
|
|
||||||
|
:func:`make_multilabel_classification` generates random samples with multiple
|
||||||
|
labels, reflecting a bag of words drawn from a mixture of topics. The number of
|
||||||
|
topics for each document is drawn from a Poisson distribution, and the topics
|
||||||
|
themselves are drawn from a fixed random distribution. Similarly, the number of
|
||||||
|
words is drawn from Poisson, with words drawn from a multinomial, where each
|
||||||
|
topic defines a probability distribution over words. Simplifications with
|
||||||
|
respect to true bag-of-words mixtures include:
|
||||||
|
|
||||||
|
* Per-topic word distributions are independently drawn, where in reality all
|
||||||
|
would be affected by a sparse base distribution, and would be correlated.
|
||||||
|
* For a document generated from multiple topics, all topics are weighted
|
||||||
|
equally in generating its bag of words.
|
||||||
|
* Documents without labels have words drawn at random, rather than from a base
|
||||||
|
distribution.
|
||||||
|
|
||||||
|
.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_multilabel_dataset_001.png
|
||||||
|
:target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
|
||||||
|
:scale: 50
|
||||||
|
:align: center
|
||||||
|
|
||||||
|
Biclustering
|
||||||
|
~~~~~~~~~~~~
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
make_biclusters
|
||||||
|
make_checkerboard
|
||||||
|
|
||||||
|
|
||||||
|
Generators for regression
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
:func:`make_regression` produces regression targets as an optionally-sparse
|
||||||
|
random linear combination of random features, with noise. Its informative
|
||||||
|
features may be uncorrelated, or low rank (few features account for most of the
|
||||||
|
variance).
|
||||||
|
|
||||||
|
Other regression generators generate functions deterministically from
|
||||||
|
randomized features. :func:`make_sparse_uncorrelated` produces a target as a
|
||||||
|
linear combination of four features with fixed coefficients.
|
||||||
|
Others encode explicitly non-linear relations:
|
||||||
|
:func:`make_friedman1` is related by polynomial and sine transforms;
|
||||||
|
:func:`make_friedman2` includes feature multiplication and reciprocation; and
|
||||||
|
:func:`make_friedman3` is similar with an arctan transformation on the target.
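
For instance, a minimal sketch with :func:`make_regression` (values picked
arbitrarily)::

  >>> from sklearn.datasets import make_regression
  >>> X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
  ...                        noise=1.0, random_state=0)
  >>> X.shape, y.shape
  ((100, 10), (100,))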
|
||||||
|
|
||||||
|
Generators for manifold learning
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
make_s_curve
|
||||||
|
make_swiss_roll
|
||||||
|
|
||||||
|
Generators for decomposition
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
make_low_rank_matrix
|
||||||
|
make_sparse_coded_signal
|
||||||
|
make_spd_matrix
|
||||||
|
make_sparse_spd_matrix
|
|
@ -0,0 +1,36 @@
|
||||||
|
.. _toy_datasets:
|
||||||
|
|
||||||
|
Toy datasets
|
||||||
|
============
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.datasets
|
||||||
|
|
||||||
|
scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.
|
||||||
|
|
||||||
|
They can be loaded using the following functions:
|
||||||
|
|
||||||
|
.. autosummary::
|
||||||
|
|
||||||
|
load_iris
|
||||||
|
load_diabetes
|
||||||
|
load_digits
|
||||||
|
load_linnerud
|
||||||
|
load_wine
|
||||||
|
load_breast_cancer
|
||||||
|
|
||||||
|
These datasets are useful to quickly illustrate the behavior of the
|
||||||
|
various algorithms implemented in scikit-learn. They are however often too
|
||||||
|
small to be representative of real world machine learning tasks.
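
For instance, loading the iris dataset and inspecting it only takes a couple
of lines::

  >>> from sklearn.datasets import load_iris
  >>> iris = load_iris()
  >>> iris.data.shape
  (150, 4)
  >>> iris.target_names
  array(['setosa', 'versicolor', 'virginica'], dtype='<U10')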

.. include:: ../../sklearn/datasets/descr/iris.rst

.. include:: ../../sklearn/datasets/descr/diabetes.rst

.. include:: ../../sklearn/datasets/descr/digits.rst

.. include:: ../../sklearn/datasets/descr/linnerud.rst

.. include:: ../../sklearn/datasets/descr/wine_data.rst

.. include:: ../../sklearn/datasets/descr/breast_cancer.rst

@ -0,0 +1,523 @@

.. _advanced-installation:

.. include:: ../min_dependency_substitutions.rst

==================================================
Installing the development version of scikit-learn
==================================================

This section introduces how to install the **main branch** of scikit-learn.
This can be done by either installing a nightly build or building from source.

.. _install_nightly_builds:

Installing nightly builds
=========================

The continuous integration servers of the scikit-learn project build, test
and upload wheel packages for the most recent Python version on a nightly
basis.

Installing a nightly build is the quickest way to:

- try a new feature that will be shipped in the next release (that is, a
  feature from a pull request that was recently merged to the main branch);

- check whether a bug you encountered has been fixed since the last release.

You can install the nightly build of scikit-learn using the
`scientific-python-nightly-wheels` index from the PyPI registry of
`anaconda.org`:

.. prompt:: bash $

  pip install --pre --extra-index-url https://pypi.anaconda.org/scientific-python-nightly-wheels/simple scikit-learn

Note that you might first need to uninstall scikit-learn to be able to
install its nightly builds.

.. _install_bleeding_edge:

Building from source
====================

Building from source is required to work on a contribution (bug fix, new
feature, code or documentation improvement).

.. _git_repo:

#. Use `Git <https://git-scm.com/>`_ to check out the latest source from the
   `scikit-learn repository <https://github.com/scikit-learn/scikit-learn>`_ on
   GitHub:

   .. prompt:: bash $

     git clone https://github.com/scikit-learn/scikit-learn.git  # add --depth 1 if your connection is slow
     cd scikit-learn

   If you plan on submitting a pull request, you should clone from your fork
   instead.

#. Install a recent version of Python (3.9 or later at the time of writing),
   for instance using Miniforge3_. Miniforge provides a conda-based
   distribution of Python and the most popular scientific libraries.

   If you installed Python with conda, we recommend creating a dedicated
   `conda environment`_ with all the build dependencies of scikit-learn
   (namely NumPy_, SciPy_, Cython_, meson-python_ and Ninja_):

   .. prompt:: bash $

     conda create -n sklearn-env -c conda-forge python numpy scipy cython meson-python ninja

   It is not always necessary, but it is safer to open a new prompt before
   activating the newly created conda environment.

   .. prompt:: bash $

     conda activate sklearn-env

#. **Alternative to conda:** You can use alternative installations of Python
   provided they are recent enough (3.9 or higher at the time of writing).
   Here is an example of how to create a build environment for a Linux system's
   Python. Build dependencies are installed with `pip` in a dedicated virtualenv_
   to avoid disrupting other Python programs installed on the system:

   .. prompt:: bash $

     python3 -m venv sklearn-env
     source sklearn-env/bin/activate
     pip install wheel numpy scipy cython meson-python ninja

#. Install a compiler with OpenMP_ support for your platform. See instructions
   for :ref:`compiler_windows`, :ref:`compiler_macos`, :ref:`compiler_linux`
   and :ref:`compiler_freebsd`.

#. Build the project with pip:

   .. prompt:: bash $

     pip install --editable . \
        --verbose --no-build-isolation \
        --config-settings editable-verbose=true

#. Check that the installed scikit-learn has a version number ending with
   `.dev0`:

   .. prompt:: bash $

     python -c "import sklearn; sklearn.show_versions()"

#. Please refer to the :ref:`developers_guide` and :ref:`pytest_tips` to run
   the tests on the module of your choice.

.. note::

   `--config-settings editable-verbose=true` is optional but recommended
   to avoid surprises when you import `sklearn`. `meson-python` implements
   editable installs by rebuilding `sklearn` when executing `import sklearn`.
   With the recommended setting you will see a message when this happens,
   rather than potentially waiting without feedback and wondering
   what is taking so long. Bonus: this means you only have to run the `pip
   install` command once; `sklearn` will automatically be rebuilt when
   importing `sklearn`.

Dependencies
------------

Runtime dependencies
~~~~~~~~~~~~~~~~~~~~

Scikit-learn requires the following dependencies both at build time and at
runtime:

- Python (>= 3.8),
- NumPy (>= |NumpyMinVersion|),
- SciPy (>= |ScipyMinVersion|),
- Joblib (>= |JoblibMinVersion|),
- threadpoolctl (>= |ThreadpoolctlMinVersion|).

Build dependencies
~~~~~~~~~~~~~~~~~~

Building Scikit-learn also requires:

..
    # The following places need to be in sync with regard to Cython version:
    # - .circleci config file
    # - sklearn/_build_utils/__init__.py
    # - advanced installation guide

- Cython >= |CythonMinVersion|
- A C/C++ compiler and a matching OpenMP_ runtime library. See the
  :ref:`platform system specific instructions
  <platform_specific_instructions>` for more details.

.. note::

   If OpenMP is not supported by the compiler, the build will be done with
   OpenMP functionalities disabled. This is not recommended since it will force
   some estimators to run in sequential mode instead of leveraging thread-based
   parallelism. Setting the ``SKLEARN_FAIL_NO_OPENMP`` environment variable
   (before cythonization) will force the build to fail if OpenMP is not
   supported.

Since version 0.21, scikit-learn automatically detects and uses the linear
algebra library used by SciPy **at runtime**. Scikit-learn therefore has no
build dependency on BLAS/LAPACK implementations such as OpenBLAS, Atlas, BLIS
or MKL.
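
To check which BLAS implementation is picked up at runtime (a quick
diagnostic, not a build step), you can inspect the threadpoolctl section of
the output of:

.. prompt:: bash $

  python -c "import sklearn; sklearn.show_versions()"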

Test dependencies
~~~~~~~~~~~~~~~~~

Running tests requires:

- pytest >= |PytestMinVersion|

Some tests also require `pandas <https://pandas.pydata.org>`_.


Building a specific version from a tag
--------------------------------------

If you want to build a stable version, you can ``git checkout <VERSION>``
to get the code for that particular version, or download a zip archive of
the version from GitHub.

.. _platform_specific_instructions:

Platform-specific instructions
==============================

Here are instructions to install a working C/C++ compiler with OpenMP support
to build scikit-learn Cython extensions for each supported platform.

.. _compiler_windows:

Windows
-------

First, download the `Build Tools for Visual Studio 2019 installer
<https://aka.ms/vs/17/release/vs_buildtools.exe>`_.

Run the downloaded `vs_buildtools.exe` file. During the installation, make sure
you select "Desktop development with C++", similarly to this screenshot:

.. image:: ../images/visual-studio-build-tools-selection.png

Secondly, find out if you are running 64-bit or 32-bit Python. The build
command depends on the architecture of the Python interpreter. You can check
the architecture by running the following in a ``cmd`` or ``powershell``
console:

.. prompt:: bash $

  python -c "import struct; print(struct.calcsize('P') * 8)"

For 64-bit Python, configure the build environment by running the following
commands in ``cmd`` or an Anaconda Prompt (if you use Anaconda):

.. sphinx-prompt 1.3.0 (used in doc-min-dependencies CI task) does not support `batch` prompt type,
.. so we work around by using a known prompt type and an explicit prompt text.
..
.. prompt:: bash C:\>

  SET DISTUTILS_USE_SDK=1
  "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvarsall.bat" x64

Replace ``x64`` by ``x86`` to build for 32-bit Python.

Please be aware that the path above might be different from user to user. The
aim is to point to the "vcvarsall.bat" file that will set the necessary
environment variables in the current command prompt.

Finally, build scikit-learn from this command prompt:

.. prompt:: bash $

  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

.. _compiler_macos:

macOS
-----

The default C compiler on macOS, Apple clang (confusingly aliased as
`/usr/bin/gcc`), does not directly support OpenMP. We present two alternatives
to enable OpenMP support:

- either install `conda-forge::compilers` with conda;

- or install `libomp` with Homebrew to extend the default Apple clang compiler.

For Apple Silicon M1 hardware, only the conda-forge method below is known to
work at the time of writing (January 2021). You can install the `macos/arm64`
distribution of conda using the `miniforge installer
<https://github.com/conda-forge/miniforge#miniforge>`_.

macOS compilers from conda-forge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you use the conda package manager (version >= 4.7), you can install the
``compilers`` meta-package from the conda-forge channel, which provides
OpenMP-enabled C/C++ compilers based on the llvm toolchain.

First install the macOS command line tools:

.. prompt:: bash $

  xcode-select --install

It is recommended to use a dedicated `conda environment`_ to build
scikit-learn from source:

.. prompt:: bash $

  conda create -n sklearn-dev -c conda-forge python numpy scipy cython \
    joblib threadpoolctl pytest compilers llvm-openmp meson-python ninja

It is not always necessary, but it is safer to open a new prompt before
activating the newly created conda environment.

.. prompt:: bash $

  conda activate sklearn-dev
  make clean
  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

.. note::

   If you get any conflicting dependency error message, try commenting out
   any custom conda configuration in the ``$HOME/.condarc`` file. In
   particular the ``channel_priority: strict`` directive is known to cause
   problems for this setup.

You can check that the custom compilers are properly installed from conda
forge using the following command:

.. prompt:: bash $

  conda list

which should include ``compilers`` and ``llvm-openmp``.

The compilers meta-package will automatically set custom environment
variables:

.. prompt:: bash $

  echo $CC
  echo $CXX
  echo $CFLAGS
  echo $CXXFLAGS
  echo $LDFLAGS

They point to files and folders from your ``sklearn-dev`` conda environment
(in particular in the bin/, include/ and lib/ subfolders). For instance
``-L/path/to/conda/envs/sklearn-dev/lib`` should appear in ``LDFLAGS``.

In the log, you should see the compiled extension being built with the clang
and clang++ compilers installed by conda with the ``-fopenmp`` command line
flag.

macOS compilers from Homebrew
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another solution is to enable OpenMP support for the clang compiler shipped
by default on macOS.

First install the macOS command line tools:

.. prompt:: bash $

  xcode-select --install

Install the Homebrew_ package manager for macOS.

Install the LLVM OpenMP library:

.. prompt:: bash $

  brew install libomp

Set the following environment variables:

.. prompt:: bash $

  export CC=/usr/bin/clang
  export CXX=/usr/bin/clang++
  export CPPFLAGS="$CPPFLAGS -Xpreprocessor -fopenmp"
  export CFLAGS="$CFLAGS -I/usr/local/opt/libomp/include"
  export CXXFLAGS="$CXXFLAGS -I/usr/local/opt/libomp/include"
  export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/opt/libomp/lib -L/usr/local/opt/libomp/lib -lomp"

Finally, build scikit-learn in verbose mode (to check for the presence of the
``-fopenmp`` flag in the compiler commands):

.. prompt:: bash $

  make clean
  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

.. _compiler_linux:

Linux
-----

Linux compilers from the system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Installing scikit-learn from source without using conda requires you to have
installed the Python development headers and a working C/C++
compiler with OpenMP support (typically the GCC toolchain).

Install build dependencies for Debian-based operating systems, e.g.
Ubuntu:

.. prompt:: bash $

  sudo apt-get install build-essential python3-dev python3-pip

then proceed as usual:

.. prompt:: bash $

  pip3 install cython
  pip3 install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

Cython and the pre-compiled wheels for the runtime dependencies (numpy, scipy
and joblib) should automatically be installed in
``$HOME/.local/lib/pythonX.Y/site-packages``. Alternatively you can run the
above commands from a virtualenv_ or a `conda environment`_ to get full
isolation from the Python packages installed via the system packager. When
using an isolated environment, ``pip3`` should be replaced by ``pip`` in the
above commands.

When precompiled wheels of the runtime dependencies are not available for your
architecture (e.g. ARM), you can install the system versions:

.. prompt:: bash $

  sudo apt-get install cython3 python3-numpy python3-scipy

On Red Hat and clones (e.g. CentOS), install the dependencies using:

.. prompt:: bash $

  sudo yum -y install gcc gcc-c++ python3-devel numpy scipy

Linux compilers from conda-forge
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Alternatively, install a recent version of the GNU C Compiler toolchain (GCC)
in the user folder using conda:

.. prompt:: bash $

  conda create -n sklearn-dev -c conda-forge python numpy scipy cython \
    joblib threadpoolctl pytest compilers meson-python ninja

It is not always necessary, but it is safer to open a new prompt before
activating the newly created conda environment.

.. prompt:: bash $

  conda activate sklearn-dev
  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

.. _compiler_freebsd:

FreeBSD
-------

The clang compiler included in FreeBSD 12.0 and 11.2 base systems does not
include OpenMP support. You need to install the `openmp` library from packages
(or ports):

.. prompt:: bash $

  sudo pkg install openmp

This will install header files in ``/usr/local/include`` and libs in
``/usr/local/lib``. Since these directories are not searched by default, you
can set the environment variables to these locations:

.. prompt:: bash $

  export CFLAGS="$CFLAGS -I/usr/local/include"
  export CXXFLAGS="$CXXFLAGS -I/usr/local/include"
  export LDFLAGS="$LDFLAGS -Wl,-rpath,/usr/local/lib -L/usr/local/lib -lomp"

Finally, build the package using the standard command:

.. prompt:: bash $

  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

For the upcoming FreeBSD 12.1 and 11.3 versions, OpenMP will be included in
the base system and these steps will not be necessary.

.. _OpenMP: https://en.wikipedia.org/wiki/OpenMP
.. _Cython: https://cython.org
.. _meson-python: https://mesonbuild.com/meson-python
.. _Ninja: https://ninja-build.org/
.. _NumPy: https://numpy.org
.. _SciPy: https://www.scipy.org
.. _Homebrew: https://brew.sh
.. _virtualenv: https://docs.python.org/3/tutorial/venv.html
.. _conda environment: https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html
.. _Miniforge3: https://github.com/conda-forge/miniforge#miniforge3

Alternative compilers
=====================

The following command will build scikit-learn using your default C/C++ compiler.

.. prompt:: bash $

  pip install --editable . \
     --verbose --no-build-isolation \
     --config-settings editable-verbose=true

If you want to build scikit-learn with another compiler handled by ``setuptools``,
use the following command:

.. prompt:: bash $

  python setup.py build_ext --compiler=<compiler> -i build_clib --compiler=<compiler>

To see the list of available compilers run:

.. prompt:: bash $

  python setup.py build_ext --help-compiler

If your compiler is not listed here, you can specify it through some environment
variables (this does not work on Windows). This `section
<https://setuptools.pypa.io/en/stable/userguide/ext_modules.html#compiler-and-linker-options>`_
of the setuptools documentation explains in detail which environment variables
are used by ``setuptools``, and at which stage of the compilation, to set the
compiler and linker options.

When setting these environment variables, it is advised to first check their
``sysconfig`` counterpart variables and adapt them to your compiler. For instance::

    import sysconfig
    print(sysconfig.get_config_var('CC'))
    print(sysconfig.get_config_var('LDFLAGS'))

In addition, since scikit-learn uses OpenMP, you need to include the appropriate OpenMP
flag of your compiler in the ``CFLAGS`` and ``CPPFLAGS`` environment variables.
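
For instance, with GCC or clang the flag is ``-fopenmp`` (this is only a
sketch; the exact flag is compiler-specific):

.. prompt:: bash $

  export CFLAGS="$CFLAGS -fopenmp"
  export CPPFLAGS="$CPPFLAGS -fopenmp"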

@ -0,0 +1,159 @@

.. _bug_triaging:

Bug triaging and issue curation
===============================

The `issue tracker <https://github.com/scikit-learn/scikit-learn/issues>`_
is important to the communication in the project: it helps
developers identify major projects to work on, as well as discuss
priorities. For this reason, it is important to curate it, adding labels
to issues and closing issues that are not necessary.

Working on issues to improve them
---------------------------------

Improving issues increases their chances of being successfully resolved.
Guidelines on submitting good issues can be found :ref:`here
<filing_bugs>`.
A third party can give useful feedback or even add
comments on the issue.
The following actions are typically useful:

- documenting issues that are missing elements to reproduce the problem
  such as code samples

- suggesting better use of code formatting

- suggesting to reformulate the title and description to make them more
  explicit about the problem to be solved

- linking to related issues or discussions while briefly describing how
  they are related, for instance "See also #xyz for a similar attempt
  at this" or "See also #xyz where the same thing happened in
  SomeEstimator" provides context and helps the discussion.

.. topic:: Fruitful discussions

   Online discussions may be harder than they seem at first glance, in
   particular given that a person new to open source may have a very
   different understanding of the process than a seasoned maintainer.

   Overall, it is useful to stay positive and assume good will. `The
   following article
   <https://gael-varoquaux.info/programming/technical-discussions-are-hard-a-few-tips.html>`_
   explores how to lead online discussions in the context of open source.

Working on PRs to help review
-----------------------------

Reviewing code is also encouraged. Contributors and users are welcome to
participate in the review process following our :ref:`review guidelines
<code_review>`.

Triaging operations for members of the core and contributor experience teams
-----------------------------------------------------------------------------

In addition to the above, members of the core team and the contributor experience team
can do the following important tasks:

- Update :ref:`labels for issues and PRs <issue_tracker_tags>`: see the list of
  the `available github labels
  <https://github.com/scikit-learn/scikit-learn/labels>`_.

- :ref:`Determine if a PR must be relabeled as stalled <stalled_pull_request>`
  or needs help (this is typically very important in the context
  of sprints, where the risk is to create many unfinished PRs).

- If a stalled PR is taken over by a newer PR, then label the stalled PR as
  "Superseded", leave a comment on the stalled PR linking to the new PR, and
  likely close the stalled PR.

- Triage issues:

  - **close usage questions** and politely point the reporter to use
    Stack Overflow instead.

  - **close duplicate issues**, after checking that they are
    indeed duplicates. Ideally, the original submitter moves the
    discussion to the older, duplicate issue.

  - **close issues that cannot be replicated**, after leaving time (at
    least a week) to add extra information.

:ref:`Saved replies <saved_replies>` are useful to gain time and yet be
welcoming and polite when triaging.

See the GitHub description for `roles in the organization
<https://docs.github.com/en/github/setting-up-and-managing-organizations-and-teams/repository-permission-levels-for-an-organization>`_.

.. topic:: Closing issues: a tough call

   When uncertain on whether an issue should be closed or not, it is
   best to strive for consensus with the original poster, and possibly
   to seek relevant expertise. However, when the issue is a usage
   question, or when it has been considered unclear for many years, it
   should be closed.

A typical workflow for triaging issues
--------------------------------------

The following workflow [1]_ is a good way to approach issue triaging:

#. Thank the reporter for opening an issue.

   The issue tracker is many people's first interaction with the
   scikit-learn project itself, beyond just using the library. As such,
   we want it to be a welcoming, pleasant experience.

#. Is this a usage question? If so, close it with a polite message
   (:ref:`here is an example <saved_replies>`).

#. Is the necessary information provided?

   If crucial information (like the version of scikit-learn used) is
   missing, feel free to ask for it and label the issue with "Needs
   info".

#. Is this a duplicate issue?

   We have many open issues. If a new issue seems to be a duplicate,
   point to the original issue. If it is a clear duplicate, or consensus
   is that it is redundant, close it. Make sure to still thank the
   reporter, and encourage them to chime in on the original issue, and
   perhaps try to fix it.

   If the new issue provides relevant information, such as a better or
   slightly different example, add it to the original issue as a comment
   or an edit to the original post.

#. Make sure that the title accurately reflects the issue. If it is not
   clear and you have the necessary permissions, edit it yourself.

#. Is the issue minimal and reproducible?

   For bug reports, we ask that the reporter provide a minimal
   reproducible example. See `this useful post
   <https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports>`_
   by Matthew Rocklin for a good explanation. If the example is not
   reproducible, or if it's clearly not minimal, feel free to ask the reporter
   if they can provide an example or simplify the provided one.
   Do acknowledge that writing minimal reproducible examples is hard work.
   If the reporter is struggling, you can try to write one yourself.

   If a reproducible example is provided, but you see a simplification,
   add your simpler reproducible example.

#. Add the relevant labels, such as "Documentation" when the issue is
   about documentation, "Bug" if it is clearly a bug, "Enhancement" if it
   is an enhancement request, ...

   If the issue is clearly defined and the fix seems relatively
   straightforward, label the issue as "Good first issue".

   An additional useful step can be to tag the corresponding module, e.g.
   `sklearn.linear_model`, when relevant.

#. Remove the "Needs Triage" label from the issue if the label exists.

.. [1] Adapted from the pandas project `maintainers guide
   <https://pandas.pydata.org/docs/development/maintaining.html>`_

@ -0,0 +1,156 @@

.. _cython:

Cython Best Practices, Conventions and Knowledge
================================================

This document provides tips for developing Cython code in scikit-learn.

Tips for developing with Cython in scikit-learn
-----------------------------------------------

Tips to ease development
^^^^^^^^^^^^^^^^^^^^^^^^

* Time spent reading `Cython's documentation <https://cython.readthedocs.io/en/latest/>`_ is not time lost.

* If you intend to use OpenMP: on macOS, the system distribution of ``clang`` does not implement OpenMP.
  You can install the ``compilers`` package available on ``conda-forge`` which comes with an implementation of OpenMP.

* Activating `checks <https://github.com/scikit-learn/scikit-learn/blob/62a017efa047e9581ae7df8bbaa62cf4c0544ee4/sklearn/_build_utils/__init__.py#L68-L87>`_ might help. E.g. for activating boundscheck use:

  .. code-block:: bash

     export SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES=1

* `Start from scratch in a notebook <https://cython.readthedocs.io/en/latest/src/quickstart/build.html#using-the-jupyter-notebook>`_ to understand how to use Cython and to get feedback on your work quickly.
  If you plan to use OpenMP for your implementations in your Jupyter Notebook, do add extra compiler and linker arguments in the Cython magic.

  .. code-block:: python

     # For GCC and for clang
     %%cython --compile-args=-fopenmp --link-args=-fopenmp

     # For Microsoft's compilers
     %%cython --compile-args=/openmp --link-args=/openmp

* To debug C code (e.g. a segfault), do use ``gdb`` with:

  .. code-block:: bash

     gdb --ex r --args python ./entrypoint_to_bug_reproducer.py

* To inspect some value in place when debugging in a ``cdef (nogil)`` context, use:

  .. code-block:: cython

     with gil:
         print(state_to_print)

* Note that Cython cannot parse f-strings with ``{var=}`` expressions, e.g.

  .. code-block:: python

     print(f"{test_val=}")

* The scikit-learn codebase has a lot of non-unified (fused) type (re)definitions.
  There currently is `ongoing work to simplify and unify that across the codebase
  <https://github.com/scikit-learn/scikit-learn/issues/25572>`_.
  For now, make sure you understand which concrete types are used ultimately.

* You might find this alias to compile individual Cython extensions handy:

  .. code-block:: bash

     # You might want to add this alias to your shell script config.
     alias cythonX="cython -X language_level=3 -X boundscheck=False -X wraparound=False -X initializedcheck=False -X nonecheck=False -X cdivision=True"

     # This generates `source.c` as if you had recompiled scikit-learn entirely.
     cythonX --annotate source.pyx

* Using the ``--annotate`` option with this flag allows generating an HTML report of code annotation.
  This report indicates interactions with the CPython interpreter on a line-by-line basis.
  Interactions with the CPython interpreter must be avoided as much as possible in
  the computationally intensive sections of the algorithms.
  For more information, please refer to `this section of Cython's tutorial <https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html#primes>`_.

  .. code-block:: bash

     # This generates an HTML report (`source.html`) for `source.c`.
     cythonX --annotate source.pyx

Tips for performance
^^^^^^^^^^^^^^^^^^^^

* Understand the GIL in context for CPython (which problems it solves, what its limitations are)
  and get a good understanding of when Cython will be mapped to C code free of interactions with
  CPython, when it will not, and when it cannot (e.g. presence of interactions with Python
  objects, which include functions). In this regard, `PEP 703 <https://peps.python.org/pep-0703/>`_
  provides a good overview, context and pathways for removal.

* Make sure you have deactivated `checks <https://github.com/scikit-learn/scikit-learn/blob/62a017efa047e9581ae7df8bbaa62cf4c0544ee4/sklearn/_build_utils/__init__.py#L68-L87>`_.

* Always prefer memoryviews over ``cnp.ndarray`` when possible: memoryviews are lightweight.

* Avoid memoryview slicing: memoryview slicing might be costly or misleading in some cases and
  is better avoided, even if handling fewer dimensions in some contexts would be preferable.

* Decorate final classes or methods with ``@final`` (this allows removing virtual tables when needed).

* Inline methods and functions when it makes sense.

* Make sure your Cython compilation units `use NumPy recent C API <https://github.com/scikit-learn/scikit-learn/blob/62a017efa047e9581ae7df8bbaa62cf4c0544ee4/setup.py#L64-L70>`_.

* In doubt, read the generated C or C++ code if you can: "The fewer C instructions and indirections
  for a line of Cython code, the better" is a good rule of thumb.

* ``nogil`` declarations are just hints: when declaring the ``cdef`` functions
  as nogil, it means that they can be called without holding the GIL, but it does not release
  the GIL when entering them. You have to do that yourself either by passing ``nogil=True`` to
  ``cython.parallel.prange`` explicitly, or by using an explicit context manager:

  .. code-block:: cython

     cdef inline int my_func(self) nogil:

         # Some logic interacting with CPython, e.g. allocating arrays via NumPy.

         with nogil:
             # The code here is run as if it were written in C.

             return 0

  This item is based on `this comment from Stefan Behnel <https://github.com/cython/cython/issues/2798#issuecomment-459971828>`_.

* Direct calls to BLAS routines are possible via interfaces defined in ``sklearn.utils._cython_blas``.

Using OpenMP
^^^^^^^^^^^^

Since scikit-learn can be built without OpenMP, it's necessary to protect each
direct call to OpenMP.

The `_openmp_helpers` module, available in
`sklearn/utils/_openmp_helpers.pyx <https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_openmp_helpers.pyx>`_
provides protected versions of the OpenMP routines. To use OpenMP routines, they
must be ``cimported`` from this module and not from the OpenMP library directly:

.. code-block:: cython

   from sklearn.utils._openmp_helpers cimport omp_get_max_threads
   max_threads = omp_get_max_threads()


The parallel loop, `prange`, is already protected by Cython and can be used directly
from `cython.parallel`.

Types
~~~~~

Cython code requires the use of explicit types. This is one of the reasons you get a
performance boost. In order to avoid code duplication, we have a central place
for the most used types in
`sklearn/utils/_typedefs.pxd <https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_typedefs.pxd>`_.
Ideally you start by having a look there and `cimport` types you need, for example

.. code-block:: cython

    from sklearn.utils._typedefs cimport float32, float64

@ -0,0 +1,920 @@

.. _develop:

==================================
Developing scikit-learn estimators
==================================

Whether you are proposing an estimator for inclusion in scikit-learn,
developing a separate package compatible with scikit-learn, or
implementing custom components for your own projects, this chapter
details how to develop objects that safely interact with scikit-learn
Pipelines and model selection tools.

.. currentmodule:: sklearn

.. _api_overview:

APIs of scikit-learn objects
============================

To have a uniform API, we try to have a common basic API for all the
objects. In addition, to avoid the proliferation of framework code, we
try to adopt simple conventions and limit to a minimum the number of
methods an object must implement.

Elements of the scikit-learn API are described more definitively in the
:ref:`glossary`.

Different objects
-----------------

The main objects in scikit-learn are (one class can implement
multiple interfaces):

:Estimator:

    The base object, implements a ``fit`` method to learn from data, either::

      estimator = estimator.fit(data, targets)

    or::

      estimator = estimator.fit(data)

:Predictor:

    For supervised learning, or some unsupervised problems, implements::

      prediction = predictor.predict(data)

    Classification algorithms usually also offer a way to quantify certainty
    of a prediction, either using ``decision_function`` or ``predict_proba``::

      probability = predictor.predict_proba(data)

:Transformer:

    For modifying the data in a supervised or unsupervised way (e.g. by adding, changing,
    or removing columns, but not by adding or removing rows). Implements::

      new_data = transformer.transform(data)

    When fitting and transforming can be performed much more efficiently
    together than separately, implements::

      new_data = transformer.fit_transform(data)

:Model:

    A model that can give a `goodness of fit <https://en.wikipedia.org/wiki/Goodness_of_fit>`_
    measure or a likelihood of unseen data, implements (higher is better)::

      score = model.score(data)

Estimators
----------

The API has one predominant object: the estimator. An estimator is an
object that fits a model based on some training data and is capable of
inferring some properties on new data. It can be, for instance, a
classifier or a regressor. All estimators implement the fit method::

    estimator.fit(X, y)

All built-in estimators also have a ``set_params`` method, which sets
data-independent parameters (overriding previous parameter values passed
to ``__init__``).

All estimators in the main scikit-learn codebase should inherit from
``sklearn.base.BaseEstimator``.

Instantiation
^^^^^^^^^^^^^

This concerns the creation of an object. The object's ``__init__`` method
might accept constants as arguments that determine the estimator's behavior
(like the C constant in SVMs). It should not, however, take the actual training
data as an argument, as this is left to the ``fit()`` method::

    clf2 = SVC(C=2.3)
    clf3 = SVC([[1, 2], [2, 3]], [-1, 1])  # WRONG!


The arguments accepted by ``__init__`` should all be keyword arguments
with a default value. In other words, a user should be able to instantiate
an estimator without passing any arguments to it. The arguments should all
correspond to hyperparameters describing the model or the optimisation
problem the estimator tries to solve. These initial arguments (or parameters)
are always remembered by the estimator.
Also note that they should not be documented under the "Attributes" section,
but rather under the "Parameters" section for that estimator.

In addition, **every keyword argument accepted by** ``__init__`` **should
correspond to an attribute on the instance**. Scikit-learn relies on this to
find the relevant attributes to set on an estimator when doing model selection.

To summarize, an ``__init__`` should look like::

    def __init__(self, param1=1, param2=2):
        self.param1 = param1
        self.param2 = param2

There should be no logic, not even input validation,
and the parameters should not be changed.
The corresponding logic should be put where the parameters are used,
typically in ``fit``.
The following is wrong::

    def __init__(self, param1=1, param2=2, param3=3):
        # WRONG: parameters should not be modified
        if param1 > 1:
            param2 += 1
        self.param1 = param1
        # WRONG: the object's attributes should have exactly the name of
        # the argument in the constructor
        self.param3 = param2

The reason for postponing the validation is that the same validation
would have to be performed in ``set_params``,
which is used in algorithms like ``GridSearchCV``.

Fitting
^^^^^^^

The next thing you will probably want to do is to estimate some
parameters in the model. This is implemented in the ``fit()`` method.

The ``fit()`` method takes the training data as arguments, which can be one
array in the case of unsupervised learning, or two arrays in the case
of supervised learning.

Note that the model is fitted using ``X`` and ``y``, but the object holds no
reference to ``X`` and ``y``. There are, however, some exceptions to this, as in
the case of precomputed kernels where this data must be stored for use by
the predict method.

============= ======================================================
Parameters
============= ======================================================
X             array-like of shape (n_samples, n_features)

y             array-like of shape (n_samples,)

kwargs        optional data-dependent parameters
============= ======================================================

``X.shape[0]`` should be the same as ``y.shape[0]``. If this requisite
is not met, an exception of type ``ValueError`` should be raised.

``y`` might be ignored in the case of unsupervised learning. However, to
make it possible to use the estimator as part of a pipeline that can
mix both supervised and unsupervised transformers, even unsupervised
estimators need to accept a ``y=None`` keyword argument in
the second position that is just ignored by the estimator.
For the same reason, ``fit_predict``, ``fit_transform``, ``score``
and ``partial_fit`` methods need to accept a ``y`` argument in
the second place if they are implemented.

The method should return the object (``self``). This pattern is useful
to be able to implement quick one-liners in an IPython session such as::

    y_predicted = SVC(C=100).fit(X_train, y_train).predict(X_test)

Depending on the nature of the algorithm, ``fit`` can sometimes also
accept additional keyword arguments. However, any parameter that can
have a value assigned prior to having access to the data should be an
``__init__`` keyword argument. **fit parameters should be restricted
to directly data dependent variables**. For instance, a Gram matrix or
an affinity matrix which are precomputed from the data matrix ``X`` are
data dependent. A tolerance stopping criterion ``tol`` is not directly
data dependent (although the optimal value according to some scoring
function probably is).

When ``fit`` is called, any previous call to ``fit`` should be ignored. In
general, calling ``estimator.fit(X1)`` and then ``estimator.fit(X2)`` should
be the same as only calling ``estimator.fit(X2)``. However, this may not be
true in practice when ``fit`` depends on some random process, see
:term:`random_state`. Another exception to this rule is when the
hyper-parameter ``warm_start`` is set to ``True`` for estimators that
support it. ``warm_start=True`` means that the previous state of the
trainable parameters of the estimator is reused instead of using the
default initialization strategy.

Estimated Attributes
^^^^^^^^^^^^^^^^^^^^

Attributes that have been estimated from the data must always have a name
ending with a trailing underscore, for example the coefficients of
some regression estimator would be stored in a ``coef_`` attribute after
``fit`` has been called.

The estimated attributes are expected to be overridden when you call ``fit``
a second time.
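
As an illustration, a minimal sketch of a ``fit`` method (``_solve`` is a
hypothetical helper, not a scikit-learn function)::

    def fit(self, X, y):
        # Anything learned from the data gets a trailing underscore,
        # e.g. the fitted coefficients:
        self.coef_ = _solve(X, y)  # hypothetical solver
        return self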

Optional Arguments
^^^^^^^^^^^^^^^^^^

In iterative algorithms, the number of iterations should be specified by
an integer called ``n_iter``.

Universal attributes
^^^^^^^^^^^^^^^^^^^^

Estimators that expect tabular input should set a `n_features_in_`
attribute at `fit` time to indicate the number of features that the estimator
expects for subsequent calls to `predict` or `transform`.
See
`SLEP010
<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html>`_
for details.
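
For instance, a minimal sketch, assuming the input is validated with
:func:`~sklearn.utils.validation.check_array`::

    def fit(self, X, y=None):
        # check_array is sklearn.utils.validation.check_array
        X = check_array(X)
        # Record the number of input features seen during fit
        self.n_features_in_ = X.shape[1]
        return self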
|
||||||
|
|
||||||
|
.. _rolling_your_own_estimator:
|
||||||
|
|
||||||
|
Rolling your own estimator
|
||||||
|
==========================
|
||||||
|
If you want to implement a new estimator that is scikit-learn-compatible,
|
||||||
|
whether it is just for you or for contributing it to scikit-learn, there are
|
||||||
|
several internals of scikit-learn that you should be aware of in addition to
|
||||||
|
the scikit-learn API outlined above. You can check whether your estimator
|
||||||
|
adheres to the scikit-learn interface and standards by running
|
||||||
|
:func:`~sklearn.utils.estimator_checks.check_estimator` on an instance. The
|
||||||
|
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks` pytest
|
||||||
|
decorator can also be used (see its docstring for details and possible
|
||||||
|
interactions with `pytest`)::
|
||||||
|
|
||||||
|
>>> from sklearn.utils.estimator_checks import check_estimator
|
||||||
|
>>> from sklearn.svm import LinearSVC
|
||||||
|
>>> check_estimator(LinearSVC()) # passes
|
||||||
|
|
||||||
|
The main motivation to make a class compatible to the scikit-learn estimator
|
||||||
|
interface might be that you want to use it together with model evaluation and
|
||||||
|
selection tools such as :class:`model_selection.GridSearchCV` and
|
||||||
|
:class:`pipeline.Pipeline`.
|
||||||
|
|
||||||
|
Before detailing the required interface below, we describe two ways to achieve
|
||||||
|
the correct interface more easily.
|
||||||
|
|
||||||
|
.. topic:: Project template:
|
||||||
|
|
||||||
|
We provide a `project template <https://github.com/scikit-learn-contrib/project-template/>`_
|
||||||
|
which helps in the creation of Python packages containing scikit-learn compatible estimators.
|
||||||
|
It provides:
|
||||||
|
|
||||||
|
* an initial git repository with Python package directory structure
|
||||||
|
* a template of a scikit-learn estimator
|
||||||
|
* an initial test suite including use of ``check_estimator``
|
||||||
|
* directory structures and scripts to compile documentation and example
|
||||||
|
galleries
|
||||||
|
* scripts to manage continuous integration (testing on Linux and Windows)
|
||||||
|
* instructions from getting started to publishing on `PyPi <https://pypi.org/>`_
|
||||||
|
|
||||||
|
.. topic:: ``BaseEstimator`` and mixins:
|
||||||
|
|
||||||
|
We tend to use "duck typing", so building an estimator which follows
|
||||||
|
the API suffices for compatibility, without needing to inherit from or
|
||||||
|
even import any scikit-learn classes.
|
||||||
|
|
||||||
|
However, if a dependency on scikit-learn is acceptable in your code,
|
||||||
|
you can prevent a lot of boilerplate code
|
||||||
|
by deriving a class from ``BaseEstimator``
|
||||||
|
and optionally the mixin classes in ``sklearn.base``.
|
||||||
|
For example, below is a custom classifier, with more examples included
|
||||||
|
in the scikit-learn-contrib
|
||||||
|
`project template <https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/_template.py>`__.
|
||||||
|
|
||||||
|
It is particularly important to notice that mixins should be "on the left" while
|
||||||
|
the ``BaseEstimator`` should be "on the right" in the inheritance list for proper
|
||||||
|
MRO.
|
||||||
|
|
||||||
|
>>> import numpy as np
|
||||||
|
>>> from sklearn.base import BaseEstimator, ClassifierMixin
|
||||||
|
>>> from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
|
||||||
|
>>> from sklearn.utils.multiclass import unique_labels
|
||||||
|
>>> from sklearn.metrics import euclidean_distances
|
||||||
|
>>> class TemplateClassifier(ClassifierMixin, BaseEstimator):
|
||||||
|
...
|
||||||
|
... def __init__(self, demo_param='demo'):
|
||||||
|
... self.demo_param = demo_param
|
||||||
|
...
|
||||||
|
... def fit(self, X, y):
|
||||||
|
...
|
||||||
|
... # Check that X and y have correct shape
|
||||||
|
... X, y = check_X_y(X, y)
|
||||||
|
... # Store the classes seen during fit
|
||||||
|
... self.classes_ = unique_labels(y)
|
||||||
|
...
|
||||||
|
... self.X_ = X
|
||||||
|
... self.y_ = y
|
||||||
|
... # Return the classifier
|
||||||
|
... return self
|
||||||
|
...
|
||||||
|
... def predict(self, X):
|
||||||
|
...
|
||||||
|
... # Check if fit has been called
|
||||||
|
... check_is_fitted(self)
|
||||||
|
...
|
||||||
|
... # Input validation
|
||||||
|
... X = check_array(X)
|
||||||
|
...
|
||||||
|
... closest = np.argmin(euclidean_distances(X, self.X_), axis=1)
|
||||||
|
... return self.y_[closest]
|
||||||
|
|
||||||
|
|
||||||
|
get_params and set_params
|
||||||
|
-------------------------
|
||||||
|
All scikit-learn estimators have ``get_params`` and ``set_params`` functions.
|
||||||
|
The ``get_params`` function takes no arguments and returns a dict of the
|
||||||
|
``__init__`` parameters of the estimator, together with their values.
|
||||||
|
|
||||||
|
It must take one keyword argument, ``deep``, which receives a boolean value
|
||||||
|
that determines whether the method should return the parameters of
|
||||||
|
sub-estimators (for most estimators, this can be ignored). The default value
|
||||||
|
for ``deep`` should be `True`. For instance considering the following
|
||||||
|
estimator::
|
||||||
|
|
||||||
|
>>> from sklearn.base import BaseEstimator
|
||||||
|
>>> from sklearn.linear_model import LogisticRegression
|
||||||
|
>>> class MyEstimator(BaseEstimator):
|
||||||
|
... def __init__(self, subestimator=None, my_extra_param="random"):
|
||||||
|
... self.subestimator = subestimator
|
||||||
|
... self.my_extra_param = my_extra_param
|
||||||
|
|
||||||
|
The parameter `deep` will control whether or not the parameters of the
|
||||||
|
`subestimator` should be reported. Thus when `deep=True`, the output will be::
|
||||||
|
|
||||||
|
>>> my_estimator = MyEstimator(subestimator=LogisticRegression())
|
||||||
|
>>> for param, value in my_estimator.get_params(deep=True).items():
|
||||||
|
... print(f"{param} -> {value}")
|
||||||
|
my_extra_param -> random
|
||||||
|
subestimator__C -> 1.0
|
||||||
|
subestimator__class_weight -> None
|
||||||
|
subestimator__dual -> False
|
||||||
|
subestimator__fit_intercept -> True
|
||||||
|
subestimator__intercept_scaling -> 1
|
||||||
|
subestimator__l1_ratio -> None
|
||||||
|
subestimator__max_iter -> 100
|
||||||
|
subestimator__multi_class -> deprecated
|
||||||
|
subestimator__n_jobs -> None
|
||||||
|
subestimator__penalty -> l2
|
||||||
|
subestimator__random_state -> None
|
||||||
|
subestimator__solver -> lbfgs
|
||||||
|
subestimator__tol -> 0.0001
|
||||||
|
subestimator__verbose -> 0
|
||||||
|
subestimator__warm_start -> False
|
||||||
|
subestimator -> LogisticRegression()
|
||||||
|
|
||||||
|
Often, the `subestimator` has a name (as e.g. named steps in a
|
||||||
|
:class:`~sklearn.pipeline.Pipeline` object), in which case the key should
|
||||||
|
become `<name>__C`, `<name>__class_weight`, etc.
|
||||||
|
|
||||||
|
While when `deep=False`, the output will be::
|
||||||
|
|
||||||
|
>>> for param, value in my_estimator.get_params(deep=False).items():
|
||||||
|
... print(f"{param} -> {value}")
|
||||||
|
my_extra_param -> random
|
||||||
|
subestimator -> LogisticRegression()
|
||||||
|
|
||||||
|
On the other hand, ``set_params`` takes the parameters of ``__init__``
|
||||||
|
as keyword arguments, unpacks them into a dict of the form
|
||||||
|
``'parameter': value`` and sets the parameters of the estimator using this dict.
|
||||||
|
Return value must be the estimator itself.
|
||||||
|
|
||||||
|
While the ``get_params`` mechanism is not essential (see :ref:`cloning` below),
|
||||||
|
the ``set_params`` function is necessary as it is used to set parameters during
|
||||||
|
grid searches.
|
||||||
|
|
||||||
|
The easiest way to implement these functions, and to get a sensible
|
||||||
|
``__repr__`` method, is to inherit from ``sklearn.base.BaseEstimator``. If you
|
||||||
|
do not want to make your code dependent on scikit-learn, the easiest way to
|
||||||
|
implement the interface is::
|
||||||
|
|
||||||
|
def get_params(self, deep=True):
|
||||||
|
# suppose this estimator has parameters "alpha" and "recursive"
|
||||||
|
return {"alpha": self.alpha, "recursive": self.recursive}
|
||||||
|
|
||||||
|
def set_params(self, **parameters):
|
||||||
|
for parameter, value in parameters.items():
|
||||||
|
setattr(self, parameter, value)
|
||||||
|
return self
|
||||||
|
|
||||||
|
|
||||||
|
Parameters and init
|
||||||
|
-------------------
|
||||||
|
As :class:`model_selection.GridSearchCV` uses ``set_params``
|
||||||
|
to apply parameter setting to estimators,
|
||||||
|
it is essential that calling ``set_params`` has the same effect
|
||||||
|
as setting parameters using the ``__init__`` method.
|
||||||
|
The easiest and recommended way to accomplish this is to
|
||||||
|
**not do any parameter validation in** ``__init__``.
|
||||||
|
All logic behind estimator parameters,
|
||||||
|
like translating string arguments into functions, should be done in ``fit``.
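
As a minimal sketch (using a hypothetical estimator, not part of scikit-learn),
the separation between storing parameters in ``__init__`` and validating them in
``fit`` looks like this::

    from sklearn.base import BaseEstimator
    from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

    class MyKernelEstimator(BaseEstimator):

        def __init__(self, kernel="linear"):
            # __init__ only stores the parameter, unmodified and unvalidated
            self.kernel = kernel

        def fit(self, X, y=None):
            # validation and translation of string arguments happen at fit time
            kernels = {"linear": linear_kernel, "rbf": rbf_kernel}
            if self.kernel not in kernels:
                raise ValueError(f"Unknown kernel: {self.kernel!r}")
            self.kernel_function_ = kernels[self.kernel]
            return self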
|
||||||
|
|
||||||
|
It is also expected that attributes with a trailing ``_`` are **not to be set
|
||||||
|
inside the** ``__init__`` **method**. All and only the public attributes set by
|
||||||
|
fit have a trailing ``_``. As a result, the existence of attributes with
|
||||||
|
trailing ``_`` is used to check if the estimator has been fitted.
|
||||||
|
|
||||||
|
.. _cloning:
|
||||||
|
|
||||||
|
Cloning
|
||||||
|
-------
|
||||||
|
For use with the :mod:`~sklearn.model_selection` module,
|
||||||
|
an estimator must support the ``base.clone`` function to replicate an estimator.
|
||||||
|
This can be done by providing a ``get_params`` method.
|
||||||
|
If ``get_params`` is present, then ``clone(estimator)`` will be an instance of
|
||||||
|
``type(estimator)`` on which ``set_params`` has been called with clones of
|
||||||
|
the result of ``estimator.get_params()``.
|
||||||
|
|
||||||
|
Objects that do not provide this method will be deep-copied
|
||||||
|
(using the Python standard function ``copy.deepcopy``)
|
||||||
|
if ``safe=False`` is passed to ``clone``.
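
The following sketch illustrates the resulting behavior::

    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    original = LogisticRegression(C=10.0).fit([[0.0], [1.0]], [0, 1])
    cloned = clone(original)

    print(cloned.get_params()["C"])  # 10.0: constructor parameters are copied
    print(hasattr(cloned, "coef_"))  # False: the fitted state is not copied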
|
||||||
|
|
||||||
|
Estimators can customize the behavior of :func:`base.clone` by defining a
|
||||||
|
`__sklearn_clone__` method. `__sklearn_clone__` must return an instance of the
|
||||||
|
estimator. `__sklearn_clone__` is useful when an estimator needs to hold on to
|
||||||
|
some state when :func:`base.clone` is called on the estimator. For example, a
|
||||||
|
frozen meta-estimator for transformers can be defined as follows::
|
||||||
|
|
||||||
|
class FrozenTransformer(BaseEstimator):
|
||||||
|
def __init__(self, fitted_transformer):
|
||||||
|
self.fitted_transformer = fitted_transformer
|
||||||
|
|
||||||
|
def __getattr__(self, name):
|
||||||
|
# `fitted_transformer`'s attributes are now accessible
|
||||||
|
return getattr(self.fitted_transformer, name)
|
||||||
|
|
||||||
|
def __sklearn_clone__(self):
|
||||||
|
return self
|
||||||
|
|
||||||
|
def fit(self, X, y):
|
||||||
|
# Fitting does not change the state of the estimator
|
||||||
|
return self
|
||||||
|
|
||||||
|
def fit_transform(self, X, y=None):
|
||||||
|
# fit_transform only transforms the data
|
||||||
|
            return self.fitted_transformer.transform(X)
|
||||||
|
|
||||||
|
Pipeline compatibility
|
||||||
|
----------------------
|
||||||
|
For an estimator to be usable together with ``pipeline.Pipeline`` in any but the
|
||||||
|
last step, it needs to provide a ``fit`` or ``fit_transform`` function.
|
||||||
|
To be able to evaluate the pipeline on any data but the training set,
|
||||||
|
it also needs to provide a ``transform`` function.
|
||||||
|
There are no special requirements for the last step in a pipeline, except that
|
||||||
|
it has a ``fit`` function. All ``fit`` and ``fit_transform`` functions must
|
||||||
|
take arguments ``X, y``, even if y is not used. Similarly, for ``score`` to be
|
||||||
|
usable, the last step of the pipeline needs to have a ``score`` function that
|
||||||
|
accepts an optional ``y``.
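
As a minimal sketch (a hypothetical transformer, assuming NumPy array input),
an estimator satisfying these requirements for a non-final pipeline step could
look like::

    from sklearn.base import BaseEstimator, TransformerMixin

    class CenteringTransformer(TransformerMixin, BaseEstimator):

        def fit(self, X, y=None):
            # accept y even though it is unused, as required by Pipeline
            self.mean_ = X.mean(axis=0)
            return self

        def transform(self, X):
            return X - self.mean_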
|
||||||
|
|
||||||
|
Estimator types
|
||||||
|
---------------
|
||||||
|
Some common functionality depends on the kind of estimator passed.
|
||||||
|
For example, cross-validation in :class:`model_selection.GridSearchCV` and
|
||||||
|
:func:`model_selection.cross_val_score` defaults to being stratified when used
|
||||||
|
on a classifier, but not otherwise. Similarly, scorers for average precision
|
||||||
|
that take a continuous prediction need to call ``decision_function`` for classifiers,
|
||||||
|
but ``predict`` for regressors. This distinction between classifiers and regressors
|
||||||
|
is implemented using the ``_estimator_type`` attribute, which takes a string value.
|
||||||
|
It should be ``"classifier"`` for classifiers and ``"regressor"`` for
|
||||||
|
regressors and ``"clusterer"`` for clustering methods, to work as expected.
|
||||||
|
Inheriting from ``ClassifierMixin``, ``RegressorMixin`` or ``ClusterMixin``
|
||||||
|
will set the attribute automatically. When a meta-estimator needs to distinguish
|
||||||
|
among estimator types, instead of checking ``_estimator_type`` directly, helpers
|
||||||
|
like :func:`base.is_classifier` should be used.
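
For instance, meta-estimator code can rely on these helpers as follows::

    >>> from sklearn.base import is_classifier, is_regressor
    >>> from sklearn.linear_model import LinearRegression, LogisticRegression
    >>> is_classifier(LogisticRegression())
    True
    >>> is_regressor(LinearRegression())
    True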
|
||||||
|
|
||||||
|
Specific models
|
||||||
|
---------------
|
||||||
|
|
||||||
|
Classifiers should accept ``y`` (target) arguments to ``fit`` that are
|
||||||
|
sequences (lists, arrays) of either strings or integers. They should not
|
||||||
|
assume that the class labels are a contiguous range of integers; instead, they
|
||||||
|
should store a list of classes in a ``classes_`` attribute or property. The
|
||||||
|
order of class labels in this attribute should match the order in which
|
||||||
|
``predict_proba``, ``predict_log_proba`` and ``decision_function`` return their
|
||||||
|
values. The easiest way to achieve this is to put::
|
||||||
|
|
||||||
|
self.classes_, y = np.unique(y, return_inverse=True)
|
||||||
|
|
||||||
|
in ``fit``. This returns a new ``y`` that contains class indexes, rather than
|
||||||
|
labels, in the range [0, ``n_classes``).
|
||||||
|
|
||||||
|
A classifier's ``predict`` method should return
|
||||||
|
arrays containing class labels from ``classes_``.
|
||||||
|
In a classifier that implements ``decision_function``,
|
||||||
|
this can be achieved with::
|
||||||
|
|
||||||
|
def predict(self, X):
|
||||||
|
D = self.decision_function(X)
|
||||||
|
return self.classes_[np.argmax(D, axis=1)]
|
||||||
|
|
||||||
|
In linear models, coefficients are stored in an array called ``coef_``, and the
|
||||||
|
independent term is stored in ``intercept_``. ``sklearn.linear_model._base``
|
||||||
|
contains a few base classes and mixins that implement common linear model
|
||||||
|
patterns.
|
||||||
|
|
||||||
|
The :mod:`~sklearn.utils.multiclass` module contains useful functions
|
||||||
|
for working with multiclass and multilabel problems.
|
||||||
|
|
||||||
|
.. _estimator_tags:
|
||||||
|
|
||||||
|
Estimator Tags
|
||||||
|
--------------
|
||||||
|
.. warning::
|
||||||
|
|
||||||
|
The estimator tags are experimental and the API is subject to change.
|
||||||
|
|
||||||
|
Scikit-learn introduced estimator tags in version 0.21. These are annotations
|
||||||
|
of estimators that allow programmatic inspection of their capabilities, such as
|
||||||
|
sparse matrix support, supported output types and supported methods. The
|
||||||
|
estimator tags are a dictionary returned by the method ``_get_tags()``. These
|
||||||
|
tags are used in the common checks run by the
|
||||||
|
:func:`~sklearn.utils.estimator_checks.check_estimator` function and the
|
||||||
|
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks` decorator.
|
||||||
|
Tags determine which checks to run and what input data is appropriate. Tags
|
||||||
|
can depend on estimator parameters or even system architecture and can in
|
||||||
|
general only be determined at runtime.
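
For instance, a rough sketch of inspecting tags at runtime (the exact values
can differ between estimators and scikit-learn versions)::

    from sklearn.linear_model import LogisticRegression

    tags = LogisticRegression()._get_tags()
    print(tags["requires_y"])  # True: classifiers need y to be passed to fit
    print(tags["allow_nan"])   # False: this estimator rejects NaN input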
|
||||||
|
|
||||||
|
The current set of estimator tags is:
|
||||||
|
|
||||||
|
allow_nan (default=False)
|
||||||
|
whether the estimator supports data with missing values encoded as np.nan
|
||||||
|
|
||||||
|
array_api_support (default=False)
|
||||||
|
whether the estimator supports Array API compatible inputs.
|
||||||
|
|
||||||
|
binary_only (default=False)
|
||||||
|
whether estimator supports binary classification but lacks multi-class
|
||||||
|
classification support.
|
||||||
|
|
||||||
|
multilabel (default=False)
|
||||||
|
whether the estimator supports multilabel output
|
||||||
|
|
||||||
|
multioutput (default=False)
|
||||||
|
whether a regressor supports multi-target outputs or a classifier supports
|
||||||
|
multi-class multi-output.
|
||||||
|
|
||||||
|
multioutput_only (default=False)
|
||||||
|
whether estimator supports only multi-output classification or regression.
|
||||||
|
|
||||||
|
no_validation (default=False)
|
||||||
|
whether the estimator skips input-validation. This is only meant for
|
||||||
|
stateless and dummy transformers!
|
||||||
|
|
||||||
|
non_deterministic (default=False)
|
||||||
|
whether the estimator is not deterministic given a fixed ``random_state``
|
||||||
|
|
||||||
|
pairwise (default=False)
|
||||||
|
This boolean attribute indicates whether the data (`X`) :term:`fit` and
|
||||||
|
similar methods consists of pairwise measures over samples rather than a
|
||||||
|
feature representation for each sample. It is usually `True` where an
|
||||||
|
estimator has a `metric` or `affinity` or `kernel` parameter with value
|
||||||
|
'precomputed'. Its primary purpose is to support a :term:`meta-estimator`
|
||||||
|
or a cross validation procedure that extracts a sub-sample of data intended
|
||||||
|
for a pairwise estimator, where the data needs to be indexed on both axes.
|
||||||
|
Specifically, this tag is used by
|
||||||
|
`sklearn.utils.metaestimators._safe_split` to slice rows and
|
||||||
|
columns.
|
||||||
|
|
||||||
|
preserves_dtype (default=``[np.float64]``)
|
||||||
|
applies only to transformers. It corresponds to the data types which will
|
||||||
|
be preserved such that `X_trans.dtype` is the same as `X.dtype` after
|
||||||
|
calling `transformer.transform(X)`. If this list is empty, then the
|
||||||
|
transformer is not expected to preserve the data type. The first value in
|
||||||
|
the list is considered as the default data type, corresponding to the data
|
||||||
|
type of the output when the input data type is not going to be preserved.
|
||||||
|
|
||||||
|
poor_score (default=False)
|
||||||
|
whether the estimator fails to provide a "reasonable" test-set score, which
|
||||||
|
currently for regression is an R2 of 0.5 on ``make_regression(n_samples=200,
|
||||||
|
n_features=10, n_informative=1, bias=5.0, noise=20, random_state=42)``, and
|
||||||
|
for classification an accuracy of 0.83 on
|
||||||
|
``make_blobs(n_samples=300, random_state=0)``. These datasets and values
|
||||||
|
are based on current estimators in sklearn and might be replaced by
|
||||||
|
something more systematic.
|
||||||
|
|
||||||
|
requires_fit (default=True)
|
||||||
|
whether the estimator requires to be fitted before calling one of
|
||||||
|
`transform`, `predict`, `predict_proba`, or `decision_function`.
|
||||||
|
|
||||||
|
requires_positive_X (default=False)
|
||||||
|
whether the estimator requires positive X.
|
||||||
|
|
||||||
|
requires_y (default=False)
|
||||||
|
whether the estimator requires y to be passed to `fit`, `fit_predict` or
|
||||||
|
`fit_transform` methods. The tag is True for estimators inheriting from
|
||||||
|
`~sklearn.base.RegressorMixin` and `~sklearn.base.ClassifierMixin`.
|
||||||
|
|
||||||
|
requires_positive_y (default=False)
|
||||||
|
whether the estimator requires a positive y (only applicable for regression).
|
||||||
|
|
||||||
|
_skip_test (default=False)
|
||||||
|
whether to skip common tests entirely. Don't use this unless you have a
|
||||||
|
*very good* reason.
|
||||||
|
|
||||||
|
_xfail_checks (default=False)
|
||||||
|
dictionary ``{check_name: reason}`` of common checks that will be marked
|
||||||
|
as `XFAIL` for pytest, when using
|
||||||
|
:func:`~sklearn.utils.estimator_checks.parametrize_with_checks`. These
|
||||||
|
checks will be simply ignored and not run by
|
||||||
|
:func:`~sklearn.utils.estimator_checks.check_estimator`, but a
|
||||||
|
`SkipTestWarning` will be raised.
|
||||||
|
Don't use this unless there is a *very good* reason for your estimator
|
||||||
|
not to pass the check.
|
||||||
|
Also note that the usage of this tag is highly subject to change because
|
||||||
|
we are trying to make it more flexible: be prepared for breaking changes
|
||||||
|
in the future.
|
||||||
|
|
||||||
|
stateless (default=False)
|
||||||
|
whether the estimator is stateless, i.e. does not learn anything from the data passed to ``fit``. Even though an
|
||||||
|
estimator is stateless, it might still need a call to ``fit`` for
|
||||||
|
initialization.
|
||||||
|
|
||||||
|
X_types (default=['2darray'])
|
||||||
|
Supported input types for X as list of strings. Tests are currently only
|
||||||
|
run if '2darray' is contained in the list, signifying that the estimator
|
||||||
|
takes continuous 2d numpy arrays as input. The default value is
|
||||||
|
['2darray']. Other possible types are ``'string'``, ``'sparse'``,
|
||||||
|
``'categorical'``, ``dict``, ``'1dlabels'`` and ``'2dlabels'``. The goal is
|
||||||
|
that in the future the supported input type will determine the data used
|
||||||
|
during testing, in particular for ``'string'``, ``'sparse'`` and
|
||||||
|
``'categorical'`` data. For now, the tests for sparse data do not make use
|
||||||
|
of the ``'sparse'`` tag.
|
||||||
|
|
||||||
|
It is unlikely that the default values for each tag will suit the needs of your
|
||||||
|
specific estimator. Additional tags can be created or default tags can be
|
||||||
|
overridden by defining a `_more_tags()` method which returns a dict with the
|
||||||
|
desired overridden tags or new tags. For example::
|
||||||
|
|
||||||
|
class MyMultiOutputEstimator(BaseEstimator):
|
||||||
|
|
||||||
|
def _more_tags(self):
|
||||||
|
return {'multioutput_only': True,
|
||||||
|
'non_deterministic': True}
|
||||||
|
|
||||||
|
Any tag that is not in `_more_tags()` will simply fall back to the default values
|
||||||
|
documented above.
|
||||||
|
|
||||||
|
Although it is not recommended, it is possible to override the method
|
||||||
|
`_get_tags()`. Note however that **all tags must be present in the dict**. If
|
||||||
|
any of the keys documented above is not present in the output of `_get_tags()`,
|
||||||
|
an error will occur.
|
||||||
|
|
||||||
|
In addition to the tags, estimators also need to declare any non-optional
|
||||||
|
parameters to ``__init__`` in the ``_required_parameters`` class attribute,
|
||||||
|
which is a list or tuple. If ``_required_parameters`` is only
|
||||||
|
``["estimator"]`` or ``["base_estimator"]``, then the estimator will be
|
||||||
|
instantiated with an instance of ``LogisticRegression`` (or
|
||||||
|
``Ridge`` if the estimator is a regressor) in the tests. The choice
|
||||||
|
of these two models is somewhat idiosyncratic but both should provide robust
|
||||||
|
closed-form solutions.
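
For example, a hypothetical meta-estimator whose ``estimator`` argument has no
default value would declare it as follows (sketch)::

    class MyMetaEstimator(BaseEstimator):

        # ``estimator`` has no default value, so it is declared as required
        _required_parameters = ["estimator"]

        def __init__(self, estimator):
            self.estimator = estimator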
|
||||||
|
|
||||||
|
.. _developer_api_set_output:
|
||||||
|
|
||||||
|
Developer API for `set_output`
|
||||||
|
==============================
|
||||||
|
|
||||||
|
With
|
||||||
|
`SLEP018 <https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep018/proposal.html>`__,
|
||||||
|
scikit-learn introduces the `set_output` API for configuring transformers to
|
||||||
|
output pandas DataFrames. The `set_output` API is automatically defined if the
|
||||||
|
transformer defines :term:`get_feature_names_out` and subclasses
|
||||||
|
:class:`base.TransformerMixin`. :term:`get_feature_names_out` is used to get the
|
||||||
|
column names of pandas output.
|
||||||
|
|
||||||
|
:class:`base.OneToOneFeatureMixin` and
|
||||||
|
:class:`base.ClassNamePrefixFeaturesOutMixin` are helpful mixins for defining
|
||||||
|
:term:`get_feature_names_out`. :class:`base.OneToOneFeatureMixin` is useful when
|
||||||
|
the transformer has a one-to-one correspondence between input features and output
|
||||||
|
features, such as :class:`~preprocessing.StandardScaler`.
|
||||||
|
:class:`base.ClassNamePrefixFeaturesOutMixin` is useful when the transformer
|
||||||
|
needs to generate its own feature names out, such as :class:`~decomposition.PCA`.
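
For instance, with a built-in transformer that uses
:class:`base.OneToOneFeatureMixin`, the output column names follow the input
column names (a small sketch with made-up data)::

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    df = pd.DataFrame({"age": [10.0, 20.0, 30.0], "height": [1.2, 1.5, 1.8]})
    scaler = StandardScaler().set_output(transform="pandas")
    print(scaler.fit_transform(df).columns.tolist())  # ['age', 'height']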
|
||||||
|
|
||||||
|
You can opt-out of the `set_output` API by setting `auto_wrap_output_keys=None`
|
||||||
|
when defining a custom subclass::
|
||||||
|
|
||||||
|
class MyTransformer(TransformerMixin, BaseEstimator, auto_wrap_output_keys=None):
|
||||||
|
|
||||||
|
def fit(self, X, y=None):
|
||||||
|
return self
|
||||||
|
def transform(self, X, y=None):
|
||||||
|
return X
|
||||||
|
def get_feature_names_out(self, input_features=None):
|
||||||
|
...
|
||||||
|
|
||||||
|
The default value for `auto_wrap_output_keys` is `("transform",)`, which automatically
|
||||||
|
wraps `fit_transform` and `transform`. The `TransformerMixin` uses the
|
||||||
|
`__init_subclass__` mechanism to consume `auto_wrap_output_keys` and pass all other
|
||||||
|
keyword arguments to its super class. Super classes' `__init_subclass__` should
|
||||||
|
**not** depend on `auto_wrap_output_keys`.
|
||||||
|
|
||||||
|
For transformers that return multiple arrays in `transform`, auto wrapping will
|
||||||
|
only wrap the first array and not alter the other arrays.
|
||||||
|
|
||||||
|
See :ref:`sphx_glr_auto_examples_miscellaneous_plot_set_output.py`
|
||||||
|
for an example on how to use the API.
|
||||||
|
|
||||||
|
.. _developer_api_check_is_fitted:
|
||||||
|
|
||||||
|
Developer API for `check_is_fitted`
|
||||||
|
===================================
|
||||||
|
|
||||||
|
By default :func:`~sklearn.utils.validation.check_is_fitted` checks if there
|
||||||
|
are any attributes in the instance with a trailing underscore, e.g. `coef_`.
|
||||||
|
An estimator can change the behavior by implementing a `__sklearn_is_fitted__`
|
||||||
|
method taking no input and returning a boolean. If this method exists,
|
||||||
|
:func:`~sklearn.utils.validation.check_is_fitted` simply returns its output.
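
A minimal sketch of a hypothetical estimator using this hook::

    from sklearn.base import BaseEstimator
    from sklearn.utils.validation import check_is_fitted

    class MyFlaggedEstimator(BaseEstimator):

        def fit(self, X=None, y=None):
            self._is_fitted = True  # private flag, not a trailing-underscore attribute
            return self

        def __sklearn_is_fitted__(self):
            return getattr(self, "_is_fitted", False)

    check_is_fitted(MyFlaggedEstimator().fit())  # passes, thanks to the hook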
|
||||||
|
|
||||||
|
See :ref:`sphx_glr_auto_examples_developing_estimators_sklearn_is_fitted.py`
|
||||||
|
for an example on how to use the API.
|
||||||
|
|
||||||
|
Developer API for HTML representation
|
||||||
|
=====================================
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
|
||||||
|
The HTML representation API is experimental and the API is subject to change.
|
||||||
|
|
||||||
|
Estimators inheriting from :class:`~sklearn.base.BaseEstimator` display
|
||||||
|
an HTML representation of themselves in interactive programming
|
||||||
|
environments such as Jupyter notebooks. For instance, we can display this HTML
|
||||||
|
diagram::
|
||||||
|
|
||||||
|
from sklearn.base import BaseEstimator
|
||||||
|
|
||||||
|
BaseEstimator()
|
||||||
|
|
||||||
|
The raw HTML representation is obtained by invoking the function
|
||||||
|
:func:`~sklearn.utils.estimator_html_repr` on an estimator instance.
|
||||||
|
|
||||||
|
To customize the URL linking to an estimator's documentation (i.e. when clicking on the
|
||||||
|
"?" icon), override the `_doc_link_module` and `_doc_link_template` attributes. In
|
||||||
|
addition, you can provide a `_doc_link_url_param_generator` method. Set
|
||||||
|
`_doc_link_module` to the name of the (top level) module that contains your estimator.
|
||||||
|
If the value does not match the top level module name, the HTML representation will not
|
||||||
|
contain a link to the documentation. For scikit-learn estimators this is set to
|
||||||
|
`"sklearn"`.
|
||||||
|
|
||||||
|
The `_doc_link_template` is used to construct the final URL. By default, it can contain
|
||||||
|
two variables: `estimator_module` (the full name of the module containing the estimator)
|
||||||
|
and `estimator_name` (the class name of the estimator). If you need more variables you
|
||||||
|
should implement the `_doc_link_url_param_generator` method which should return a
|
||||||
|
dictionary of the variables and their values. This dictionary will be used to render the
|
||||||
|
`_doc_link_template`.
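
As an illustration, here is a sketch for a hypothetical package ``my_package``
whose documentation would live at ``https://my_package.example.com`` (both
names are made up for the example)::

    class MyPackageEstimator(BaseEstimator):

        # hypothetical values; adapt them to your own package and doc host
        _doc_link_module = "my_package"
        _doc_link_template = (
            "https://my_package.example.com/{estimator_module}.{estimator_name}.html"
        )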
|
||||||
|
|
||||||
|
.. _coding-guidelines:
|
||||||
|
|
||||||
|
Coding guidelines
|
||||||
|
=================
|
||||||
|
|
||||||
|
The following are some guidelines on how new code should be written for
|
||||||
|
inclusion in scikit-learn, and which may be appropriate to adopt in external
|
||||||
|
projects. Of course, there are special cases and there will be exceptions to
|
||||||
|
these rules. However, following these rules when submitting new code makes
|
||||||
|
the review easier so new code can be integrated in less time.
|
||||||
|
|
||||||
|
Uniformly formatted code makes it easier to share code ownership. The
|
||||||
|
scikit-learn project tries to closely follow the official Python guidelines
|
||||||
|
given in `PEP8 <https://www.python.org/dev/peps/pep-0008>`_, which
|
||||||
|
details how code should be formatted and indented. Please read it and
|
||||||
|
follow it.
|
||||||
|
|
||||||
|
In addition, we add the following guidelines:
|
||||||
|
|
||||||
|
* Use underscores to separate words in non-class names: ``n_samples``
|
||||||
|
rather than ``nsamples``.
|
||||||
|
|
||||||
|
* Avoid multiple statements on one line. Prefer a line return after
|
||||||
|
a control flow statement (``if``/``for``).
|
||||||
|
|
||||||
|
* Use relative imports for references inside scikit-learn.
|
||||||
|
|
||||||
|
* Unit tests are an exception to the previous rule;
|
||||||
|
they should use absolute imports, exactly as client code would.
|
||||||
|
A corollary is that, if ``sklearn.foo`` exports a class or function
|
||||||
|
that is implemented in ``sklearn.foo.bar.baz``,
|
||||||
|
the test should import it from ``sklearn.foo``.
|
||||||
|
|
||||||
|
* **Please don't use** ``import *`` **in any case**. It is considered harmful
|
||||||
|
by the `official Python recommendations
|
||||||
|
<https://docs.python.org/3.1/howto/doanddont.html#at-module-level>`_.
|
||||||
|
It makes the code harder to read as the origin of symbols is no
|
||||||
|
longer explicitly referenced, but most important, it prevents
|
||||||
|
using a static analysis tool like `pyflakes
|
||||||
|
<https://divmod.readthedocs.io/en/latest/products/pyflakes.html>`_ to automatically
|
||||||
|
find bugs in scikit-learn.
|
||||||
|
|
||||||
|
* Use the `numpy docstring standard
|
||||||
|
<https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard>`_
|
||||||
|
in all your docstrings.
|
||||||
|
|
||||||
|
|
||||||
|
A good example of code that we like can be found `here
|
||||||
|
<https://gist.github.com/nateGeorge/5455d2c57fb33c1ae04706f2dc4fee01>`_.
|
||||||
|
|
||||||
|
Input validation
|
||||||
|
----------------
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.utils
|
||||||
|
|
||||||
|
The module :mod:`sklearn.utils` contains various functions for doing input
|
||||||
|
validation and conversion. Sometimes, ``np.asarray`` suffices for validation;
|
||||||
|
do *not* use ``np.asanyarray`` or ``np.atleast_2d``, since those let NumPy's
|
||||||
|
``np.matrix`` through, which has a different API
|
||||||
|
(e.g., ``*`` means dot product on ``np.matrix``,
|
||||||
|
but Hadamard product on ``np.ndarray``).
|
||||||
|
|
||||||
|
In other cases, be sure to call :func:`check_array` on any array-like argument
|
||||||
|
passed to a scikit-learn API function. The exact parameters to use depend
|
||||||
|
mainly on whether and which ``scipy.sparse`` matrices must be accepted.
|
||||||
|
|
||||||
|
For more information, refer to the :ref:`developers-utils` page.
|
||||||
|
|
||||||
|
Random Numbers
|
||||||
|
--------------
|
||||||
|
|
||||||
|
If your code depends on a random number generator, do not use
|
||||||
|
``numpy.random.random()`` or similar routines. To ensure
|
||||||
|
repeatability in error checking, the routine should accept a keyword
|
||||||
|
``random_state`` and use this to construct a
|
||||||
|
``numpy.random.RandomState`` object.
|
||||||
|
See :func:`sklearn.utils.check_random_state` in :ref:`developers-utils`.
|
||||||
|
|
||||||
|
Here's a simple example of code using some of the above guidelines::
|
||||||
|
|
||||||
|
from sklearn.utils import check_array, check_random_state
|
||||||
|
|
||||||
|
def choose_random_sample(X, random_state=0):
|
||||||
|
"""Choose a random point from X.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
X : array-like of shape (n_samples, n_features)
|
||||||
|
An array representing the data.
|
||||||
|
random_state : int or RandomState instance, default=0
|
||||||
|
The seed of the pseudo random number generator that selects a
|
||||||
|
random sample. Pass an int for reproducible output across multiple
|
||||||
|
function calls.
|
||||||
|
See :term:`Glossary <random_state>`.
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
x : ndarray of shape (n_features,)
|
||||||
|
A random point selected from X.
|
||||||
|
"""
|
||||||
|
X = check_array(X)
|
||||||
|
random_state = check_random_state(random_state)
|
||||||
|
i = random_state.randint(X.shape[0])
|
||||||
|
return X[i]
|
||||||
|
|
||||||
|
If you use randomness in an estimator instead of a freestanding function,
|
||||||
|
some additional guidelines apply.
|
||||||
|
|
||||||
|
First off, the estimator should take a ``random_state`` argument to its
|
||||||
|
``__init__`` with a default value of ``None``.
|
||||||
|
It should store that argument's value, **unmodified**,
|
||||||
|
in an attribute ``random_state``.
|
||||||
|
``fit`` can call ``check_random_state`` on that attribute
|
||||||
|
to get an actual random number generator.
|
||||||
|
If, for some reason, randomness is needed after ``fit``,
|
||||||
|
the RNG should be stored in an attribute ``random_state_``.
|
||||||
|
The following example should make this clear::
|
||||||
|
|
||||||
|
class GaussianNoise(BaseEstimator, TransformerMixin):
|
||||||
|
"""This estimator ignores its input and returns random Gaussian noise.
|
||||||
|
|
||||||
|
It also does not adhere to all scikit-learn conventions,
|
||||||
|
but showcases how to handle randomness.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, n_components=100, random_state=None):
|
||||||
|
self.random_state = random_state
|
||||||
|
self.n_components = n_components
|
||||||
|
|
||||||
|
# the arguments are ignored anyway, so we make them optional
|
||||||
|
def fit(self, X=None, y=None):
|
||||||
|
            self.random_state_ = check_random_state(self.random_state)
            return self
|
||||||
|
|
||||||
|
def transform(self, X):
|
||||||
|
n_samples = X.shape[0]
|
||||||
|
return self.random_state_.randn(n_samples, self.n_components)
|
||||||
|
|
||||||
|
The reason for this setup is reproducibility:
|
||||||
|
when an estimator is ``fit`` twice to the same data,
|
||||||
|
it should produce an identical model both times,
|
||||||
|
hence the validation in ``fit``, not ``__init__``.
|
||||||
|
|
||||||
|
Numerical assertions in tests
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
|
When asserting the quasi-equality of arrays of continuous values,
|
||||||
|
do use `sklearn.utils._testing.assert_allclose`.
|
||||||
|
|
||||||
|
The relative tolerance is automatically inferred from the provided arrays
|
||||||
|
dtypes (for float32 and float64 dtypes in particular) but you can override
|
||||||
|
via ``rtol``.
|
||||||
|
|
||||||
|
When comparing arrays of zero-elements, please do provide a non-zero value for
|
||||||
|
the absolute tolerance via ``atol``.
|
||||||
|
|
||||||
|
For more information, please refer to the docstring of
|
||||||
|
`sklearn.utils._testing.assert_allclose`.
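
For instance, a short sketch::

    import numpy as np
    from sklearn.utils._testing import assert_allclose

    actual = np.array([0.0, 1e-12], dtype=np.float64)
    desired = np.zeros(2, dtype=np.float64)

    # comparing against zeros: a relative tolerance alone is meaningless here,
    # so provide an explicit absolute tolerance
    assert_allclose(actual, desired, atol=1e-9)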
|
|
@ -0,0 +1,19 @@
|
||||||
|
.. _developers_guide:
|
||||||
|
|
||||||
|
=================
|
||||||
|
Developer's Guide
|
||||||
|
=================
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
|
||||||
|
contributing
|
||||||
|
minimal_reproducer
|
||||||
|
develop
|
||||||
|
tips
|
||||||
|
utilities
|
||||||
|
performance
|
||||||
|
cython
|
||||||
|
advanced_installation
|
||||||
|
bug_triaging
|
||||||
|
maintainer
|
||||||
|
plotting
|
|
@ -0,0 +1,458 @@
|
||||||
|
Maintainer/Core-Developer Information
|
||||||
|
======================================
|
||||||
|
|
||||||
|
Releasing
|
||||||
|
---------
|
||||||
|
|
||||||
|
This section is about preparing a major release, incrementing the minor
|
||||||
|
version, or a bug fix release incrementing the patch version. Our convention is
|
||||||
|
that we release one or more release candidates (0.RRrcN) before releasing the
|
||||||
|
final distributions. We follow the `PEP101
|
||||||
|
<https://www.python.org/dev/peps/pep-0101/>`_ to indicate release candidates,
|
||||||
|
post, and minor releases.
|
||||||
|
|
||||||
|
Before a release
|
||||||
|
................
|
||||||
|
|
||||||
|
1. Update authors table:
|
||||||
|
|
||||||
|
Create a `classic token on GitHub <https://github.com/settings/tokens/new>`_
|
||||||
|
with the ``read:org`` permission.
|
||||||
|
|
||||||
|
Run the following script, entering the token when prompted:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
cd build_tools; make authors; cd ..
|
||||||
|
|
||||||
|
and commit. This is only needed if the authors have changed since the last
|
||||||
|
release. This step is sometimes done independently of the release. This
|
||||||
|
updates the maintainer list and is not the contributor list for the release.
|
||||||
|
|
||||||
|
2. Confirm any blockers tagged for the milestone are resolved, and that other
|
||||||
|
issues tagged for the milestone can be postponed.
|
||||||
|
|
||||||
|
3. Ensure the change log and commits correspond (within reason!), and that the
|
||||||
|
change log is reasonably well curated. Some tools for these tasks include:
|
||||||
|
|
||||||
|
- ``maint_tools/sort_whats_new.py`` can put what's new entries into
|
||||||
|
sections. It's not perfect, and requires manual checking of the changes.
|
||||||
|
If the what's new list is well curated, it may not be necessary.
|
||||||
|
|
||||||
|
- The ``maint_tools/whats_missing.sh`` script may be used to identify pull
|
||||||
|
requests that were merged but likely missing from What's New.
|
||||||
|
|
||||||
|
4. Make sure the deprecations, FIXMEs and TODOs tagged for the release have
|
||||||
|
been taken care of.
|
||||||
|
|
||||||
|
**Permissions**
|
||||||
|
|
||||||
|
The release manager must be a *maintainer* of the ``scikit-learn/scikit-learn``
|
||||||
|
repository to be able to publish on ``pypi.org`` and ``test.pypi.org``
|
||||||
|
(via a manual trigger of a dedicated Github Actions workflow).
|
||||||
|
|
||||||
|
The release manager does not need extra permissions on ``pypi.org`` to publish a
|
||||||
|
particular release.
|
||||||
|
|
||||||
|
The release manager must be a *maintainer* of the ``conda-forge/scikit-learn-feedstock``
|
||||||
|
repository. This can be changed by editing the ``recipe/meta.yaml`` file in the
|
||||||
|
first release pull-request.
|
||||||
|
|
||||||
|
.. _preparing_a_release_pr:
|
||||||
|
|
||||||
|
Preparing a release PR
|
||||||
|
......................
|
||||||
|
|
||||||
|
Major version release
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
Prior to branching please do not forget to prepare a Release Highlights page as
|
||||||
|
a runnable example and check that its HTML rendering looks correct. These
|
||||||
|
release highlights should be linked from the ``doc/whats_new/v0.99.rst`` file
|
||||||
|
for the new version of scikit-learn.
|
||||||
|
|
||||||
|
Releasing the first RC of e.g. version `0.99.0` involves creating the release
|
||||||
|
branch `0.99.X` directly on the main repo, where `X` really is the letter X,
|
||||||
|
**not a placeholder**. The development for the major and minor releases of `0.99`
|
||||||
|
should **also** happen under `0.99.X`. Each release (rc, major, or minor) is a
|
||||||
|
tag under that branch.
|
||||||
|
|
||||||
|
This is done only once, as the major and minor releases happen on the same
|
||||||
|
branch:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
# Assuming upstream is an alias for the main scikit-learn repo:
|
||||||
|
git fetch upstream main
|
||||||
|
git checkout upstream/main
|
||||||
|
git checkout -b 0.99.X
|
||||||
|
git push --set-upstream upstream 0.99.X
|
||||||
|
|
||||||
|
Again, `X` is literal here, and `99` is replaced by the release number.
|
||||||
|
The branches are called ``0.19.X``, ``0.20.X``, etc.
|
||||||
|
|
||||||
|
In terms of including changes, the first RC ideally counts as a *feature
|
||||||
|
freeze*. Each coming release candidate and the final release afterwards will
|
||||||
|
include only minor documentation changes and bug fixes. Any major enhancement
|
||||||
|
or feature should be excluded.
|
||||||
|
|
||||||
|
Then you can prepare a local branch for the release itself, for instance:
|
||||||
|
``release-0.99.0rc1``, push it to your github fork and open a PR **to the**
|
||||||
|
`scikit-learn/0.99.X` **branch**. Copy the :ref:`release_checklist` templates
|
||||||
|
in the description of the Pull Request to track progress.
|
||||||
|
|
||||||
|
This PR will be used to push commits related to the release as explained in
|
||||||
|
:ref:`making_a_release`.
|
||||||
|
|
||||||
|
You can also create a second PR from main and targeting main to increment the
|
||||||
|
``__version__`` variable in `sklearn/__init__.py` and in `pyproject.toml` to bump
|
||||||
|
the dev version. This means while we're in the release candidate period, the latest
|
||||||
|
stable is two versions behind the main branch, instead of one. In this PR targeting
|
||||||
|
main you should also include a new file for the matching version under the
|
||||||
|
``doc/whats_new/`` folder so PRs that target the next version can contribute their
|
||||||
|
changelog entries to this file in parallel to the release process.
|
||||||
|
|
||||||
|
Minor version release (also known as bug-fix release)
|
||||||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
|
The minor releases should include bug fixes and some relevant documentation
|
||||||
|
changes only. Any PR resulting in a behavior change which is not a bug fix
|
||||||
|
should be excluded. As an example, instructions are given for the `1.2.2` release.
|
||||||
|
|
||||||
|
- Create a branch, **on your own fork** (here referred to as `fork`) for the release
|
||||||
|
from `upstream/main`.
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
    git fetch upstream main
|
||||||
|
git checkout -b release-1.2.2 upstream/main
|
||||||
|
git push -u fork release-1.2.2:release-1.2.2
|
||||||
|
|
||||||
|
- Create a **draft** PR to the `upstream/1.2.X` branch (not to `upstream/main`)
|
||||||
|
with all the desired changes.
|
||||||
|
|
||||||
|
- Do not push anything on that branch yet.
|
||||||
|
|
||||||
|
- Locally rebase `release-1.2.2` from the `upstream/1.2.X` branch using:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
git rebase -i upstream/1.2.X
|
||||||
|
|
||||||
|
This will open an interactive rebase with the `git-rebase-todo` containing all
|
||||||
|
the latest commits on `main`. At this stage, you have to perform
|
||||||
|
this interactive rebase with at least one other person (three people rebasing
|
||||||
|
together is even better, so that nothing is forgotten and any doubt is resolved).
|
||||||
|
|
||||||
|
- **Do not remove lines; drop commits by replacing** ``pick`` **with** ``drop``.
|
||||||
|
|
||||||
|
- Commits to pick for bug-fix release *generally* are prefixed with: `FIX`, `CI`,
|
||||||
|
`DOC`. They should at least include all the commits of the merged PRs
|
||||||
|
that were milestoned for this release on GitHub and/or documented as such in
|
||||||
|
the changelog. It's likely that some bugfixes were documented in the
|
||||||
|
changelog of the main major release instead of the next bugfix release,
|
||||||
|
in which case, the matching changelog entries will need to be moved,
|
||||||
|
first in the `main` branch then backported in the release PR.
|
||||||
|
|
||||||
|
- Commits to drop for bug-fix release *generally* are prefixed with: `FEAT`,
|
||||||
|
`MAINT`, `ENH`, `API`. The reason for not including them is to prevent changes of
|
||||||
|
behavior (which must only appear in breaking or major releases).
|
||||||
|
|
||||||
|
- After having dropped or picked the commits, **do not exit yet**; paste the content
|
||||||
|
of the `git-rebase-todo` file into the PR description.
|
||||||
|
This file is located at `.git/rebase-merge/git-rebase-todo`.
|
||||||
|
|
||||||
|
- Save and exit, starting the interactive rebase.
|
||||||
|
|
||||||
|
- Resolve merge conflicts when they happen.
|
||||||
|
|
||||||
|
- Force push the result of the rebase and the extra release commits to the release PR:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
git push -f fork release-1.2.2:release-1.2.2
|
||||||
|
|
||||||
|
- Copy the :ref:`release_checklist` template and paste it in the description of the
|
||||||
|
Pull Request to track progress.
|
||||||
|
|
||||||
|
- Review all the commits included in the release to make sure that they do not
|
||||||
|
introduce any new feature. We should not blindly trust the commit message prefixes.
|
||||||
|
|
||||||
|
- Remove the draft status of the release PR and invite other maintainers to review the
|
||||||
|
list of included commits.
|
||||||
|
|
||||||
|
.. _making_a_release:
|
||||||
|
|
||||||
|
Making a release
|
||||||
|
................
|
||||||
|
|
||||||
|
0. Ensure that you have checked out the branch of the release PR as explained
|
||||||
|
in :ref:`preparing_a_release_pr` above.
|
||||||
|
|
||||||
|
1. Update docs. Note that this is for the final release, not necessarily for
|
||||||
|
the RC releases. These changes should be made in main and cherry-picked
|
||||||
|
into the release branch, only before the final release.
|
||||||
|
|
||||||
|
- Edit the ``doc/whats_new/v0.99.rst`` file to add release title and list of
|
||||||
|
contributors.
|
||||||
|
You can retrieve the list of contributor names with:
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
$ git shortlog -s 0.98.33.. | cut -f2- | sort --ignore-case | tr '\n' ';' | sed 's/;/, /g;s/, $//' | fold -s
|
||||||
|
|
||||||
|
- For major releases, link the release highlights example from the ``doc/whats_new/v0.99.rst`` file.
|
||||||
|
|
||||||
|
- Update the release date in ``whats_new.rst``
|
||||||
|
|
||||||
|
- Edit the ``doc/templates/index.html`` to change the 'News' entry of the
|
||||||
|
front page (with the release month as well). Do not forget to remove
|
||||||
|
the old entries (two years or three releases are typically good
|
||||||
|
enough) and to update the on-going development entry.
|
||||||
|
|
||||||
|
2. On the branch for releasing, update the version number in ``sklearn/__init__.py``,
|
||||||
|
the ``__version__`` variable, and in `pyproject.toml`.
|
||||||
|
|
||||||
|
For major releases, please add a 0 at the end: `0.99.0` instead of `0.99`.
|
||||||
|
|
||||||
|
For the first release candidate, use the `rc1` suffix on the expected final
|
||||||
|
release number: `0.99.0rc1`.
|
||||||
|
|
||||||
|
3. Trigger the wheel builder with the ``[cd build]`` commit marker using
|
||||||
|
the command:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
git commit --allow-empty -m "Trigger wheel builder workflow: [cd build]"
|
||||||
|
|
||||||
|
The wheel building workflow is managed by GitHub Actions and the results can be browsed at:
|
||||||
|
https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Wheel+builder%22
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
Before building the wheels, make sure that the ``pyproject.toml`` file is
|
||||||
|
up to date and using the oldest version of ``numpy`` for each Python version
|
||||||
|
to avoid `ABI <https://en.wikipedia.org/wiki/Application_binary_interface>`_
|
||||||
|
incompatibility issues. Moreover, a new line has to be included in the
|
||||||
|
``pyproject.toml`` file for each new supported version of Python.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The acronym CD in `[cd build]` stands for `Continuous Delivery
|
||||||
|
<https://en.wikipedia.org/wiki/Continuous_delivery>`_ and refers to the
|
||||||
|
automation used to generate the release artifacts (binary and source
|
||||||
|
packages). This can be seen as an extension to CI which stands for
|
||||||
|
`Continuous Integration
|
||||||
|
<https://en.wikipedia.org/wiki/Continuous_integration>`_. The CD workflow on
|
||||||
|
GitHub Actions is also used to automatically create nightly builds and
|
||||||
|
publish packages for the development branch of scikit-learn. See
|
||||||
|
:ref:`install_nightly_builds`.
|
||||||
|
|
||||||
|
4. Once all the CD jobs have completed successfully in the PR, merge it,
|
||||||
|
again with the `[cd build]` marker in the commit message. This time
|
||||||
|
the results will be uploaded to the staging area.
|
||||||
|
|
||||||
|
You should then be able to upload the generated artifacts (.tar.gz and .whl
|
||||||
|
files) to https://test.pypi.org using the "Run workflow" form for the
|
||||||
|
following GitHub Actions workflow:
|
||||||
|
|
||||||
|
https://github.com/scikit-learn/scikit-learn/actions?query=workflow%3A%22Publish+to+Pypi%22
|
||||||
|
|
||||||
|
5. If this went fine, you can proceed with tagging. Proceed with caution.
|
||||||
|
Ideally, tags should be created when you're almost certain that the release
|
||||||
|
is ready, since adding a tag to the main repo can trigger certain automated
|
||||||
|
processes.
|
||||||
|
|
||||||
|
Create the tag and push it (if it's an RC, it can be ``0.xx.0rc1`` for
|
||||||
|
instance):
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
git tag -a 0.99.0 # in the 0.99.X branch
|
||||||
|
git push git@github.com:scikit-learn/scikit-learn.git 0.99.0
|
||||||
|
|
||||||
|
6. Confirm that the bot has detected the tag on the conda-forge feedstock repo:
|
||||||
|
https://github.com/conda-forge/scikit-learn-feedstock. If not, submit a PR for the
|
||||||
|
release. If you want to publish an RC release on conda-forge, the PR should target
|
||||||
|
the `rc` branch as opposed to the `main` branch. The two branches need to be kept
|
||||||
|
in sync otherwise.
|
||||||
|
|
||||||
|
7. Trigger the GitHub Actions workflow again but this time to upload the artifacts
|
||||||
|
to the real https://pypi.org (replace "testpypi" by "pypi" in the "Run
|
||||||
|
workflow" form).
|
||||||
|
|
||||||
|
8. **Alternative to step 7**: it's possible to collect locally the generated binary
|
||||||
|
wheel packages and source tarball and upload them all to PyPI by running the
|
||||||
|
following commands in the scikit-learn source folder (checked out at the
|
||||||
|
release tag):
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
rm -r dist
|
||||||
|
pip install -U wheelhouse_uploader twine
|
||||||
|
python -m wheelhouse_uploader fetch \
|
||||||
|
--version 0.99.0 \
|
||||||
|
--local-folder dist \
|
||||||
|
scikit-learn \
|
||||||
|
https://pypi.anaconda.org/scikit-learn-wheels-staging/simple/scikit-learn/
|
||||||
|
|
||||||
|
This command will download all the binary packages accumulated in the
|
||||||
|
`staging area on the anaconda.org hosting service
|
||||||
|
<https://anaconda.org/scikit-learn-wheels-staging/scikit-learn/files>`_ and
|
||||||
|
put them in your local `./dist` folder.
|
||||||
|
|
||||||
|
Check the content of the `./dist` folder: it should contain all the wheels
|
||||||
|
along with the source tarball ("scikit-learn-RRR.tar.gz").
|
||||||
|
|
||||||
|
Make sure that you do not have developer versions or older versions of
|
||||||
|
the scikit-learn package in that folder.
|
||||||
|
|
||||||
|
Before uploading to pypi, you can test upload to test.pypi.org:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
twine upload --verbose --repository-url https://test.pypi.org/legacy/ dist/*
|
||||||
|
|
||||||
|
Upload everything at once to https://pypi.org:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
twine upload dist/*
|
||||||
|
|
||||||
|
9. For major/minor releases (not bug-fix releases or release candidates), update the symlink for
|
||||||
|
``stable`` and the ``latestStable`` variable in
|
||||||
|
https://github.com/scikit-learn/scikit-learn.github.io:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
cd /tmp
|
||||||
|
git clone --depth 1 --no-checkout git@github.com:scikit-learn/scikit-learn.github.io.git
|
||||||
|
cd scikit-learn.github.io
|
||||||
|
echo stable > .git/info/sparse-checkout
|
||||||
|
git checkout main
|
||||||
|
rm stable
|
||||||
|
ln -s 0.999 stable
|
||||||
|
sed -i "s/latestStable = '.*/latestStable = '0.999';/" versionwarning.js
|
||||||
|
git add stable versionwarning.js
|
||||||
|
git commit -m "Update stable to point to 0.999"
|
||||||
|
git push origin main
|
||||||
|
|
||||||
|
10. Update ``SECURITY.md`` to reflect the latest supported version.
|
||||||
|
|
||||||
|
.. _release_checklist:
|
||||||
|
|
||||||
|
Release checklist
|
||||||
|
.................
|
||||||
|
|
||||||
|
The following GitHub checklist might be helpful in a release PR::
|
||||||
|
|
||||||
|
* [ ] update news and what's new date in release branch
|
||||||
|
* [ ] update news and what's new date and sklearn dev0 version in main branch
|
||||||
|
* [ ] check that the wheels for the release can be built successfully
|
||||||
|
* [ ] merge the PR with `[cd build]` commit message to upload wheels to the staging repo
|
||||||
|
* [ ] upload the wheels and source tarball to https://test.pypi.org
|
||||||
|
* [ ] create tag on the main github repo
|
||||||
|
* [ ] confirm bot detected at
|
||||||
|
https://github.com/conda-forge/scikit-learn-feedstock and wait for merge
|
||||||
|
* [ ] upload the wheels and source tarball to PyPI
|
||||||
|
* [ ] https://github.com/scikit-learn/scikit-learn/releases publish (except for RC)
|
||||||
|
* [ ] announce on mailing list and on Twitter, and LinkedIn
|
||||||
|
* [ ] update symlink for stable in
|
||||||
|
https://github.com/scikit-learn/scikit-learn.github.io (only major/minor)
|
||||||
|
* [ ] update SECURITY.md in main branch (except for RC)
|
||||||
|
|
||||||
|
Merging Pull Requests
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
Individual commits are squashed when a Pull Request (PR) is merged on GitHub.
|
||||||
|
Before merging,
|
||||||
|
|
||||||
|
- the resulting commit title can be edited if necessary. Note
|
||||||
|
that this will rename the PR title by default.
|
||||||
|
- the detailed description, containing the titles of all the commits, can
|
||||||
|
be edited or deleted.
|
||||||
|
- for PRs with multiple code contributors care must be taken to keep
|
||||||
|
the `Co-authored-by: name <name@example.com>` tags in the detailed
|
||||||
|
description. This will mark the PR as having `multiple co-authors
|
||||||
|
<https://help.github.com/en/github/committing-changes-to-your-project/creating-a-commit-with-multiple-authors>`_.
|
||||||
|
Whether code contributions are significant enough to merit co-authorship is
|
||||||
|
left to the maintainer's discretion, same as for the "what's new" entry.
|
||||||
|
|
||||||
|
|
||||||
|
The scikit-learn.org web site
|
||||||
|
-----------------------------
|
||||||
|
|
||||||
|
The scikit-learn web site (https://scikit-learn.org) is hosted at GitHub,
|
||||||
|
but should rarely be updated manually by pushing to the
|
||||||
|
https://github.com/scikit-learn/scikit-learn.github.io repository. Most
|
||||||
|
updates can be made by pushing to main (for /dev) or a release branch
|
||||||
|
like 0.99.X, from which Circle CI builds and uploads the documentation
|
||||||
|
automatically.
|
||||||
|
|
||||||
|
Experimental features
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
The :mod:`sklearn.experimental` module was introduced in 0.21 and contains
|
||||||
|
experimental features / estimators that are subject to change without
|
||||||
|
deprecation cycle.
|
||||||
|
|
||||||
|
To create an experimental module, you can just copy and modify the content of
|
||||||
|
`enable_halving_search_cv.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/362cb92bb2f5b878229ea4f59519ad31c2fcee76/sklearn/experimental/enable_halving_search_cv.py>`__,
|
||||||
|
or
|
||||||
|
`enable_iterative_imputer.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/c9c89cfc85dd8dfefd7921c16c87327d03140a06/sklearn/experimental/enable_iterative_imputer.py>`_.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
These are permalinks as of 0.24, where these estimators are still
|
||||||
|
experimental. They might be stable at the time of reading - hence the
|
||||||
|
permalink. See below for instructions on the transition from experimental
|
||||||
|
to stable.
|
||||||
|
|
||||||
|
Note that the public import path must be to a public subpackage (like
|
||||||
|
``sklearn/ensemble`` or ``sklearn/impute``), not just a ``.py`` module.
|
||||||
|
Also, the (private) experimental features that are imported must be in a
|
||||||
|
submodule/subpackage of the public subpackage, e.g.
|
||||||
|
``sklearn/ensemble/_hist_gradient_boosting/`` or
|
||||||
|
``sklearn/impute/_iterative.py``. This is needed so that pickles still work
|
||||||
|
in the future when the features aren't experimental anymore.
|
||||||
|
|
||||||
|
To avoid type checker (e.g. mypy) errors a direct import of experimental
|
||||||
|
estimators should be done in the parent module, protected by the
|
||||||
|
``if typing.TYPE_CHECKING`` check. See `sklearn/ensemble/__init__.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/c9c89cfc85dd8dfefd7921c16c87327d03140a06/sklearn/ensemble/__init__.py>`_,
|
||||||
|
or `sklearn/impute/__init__.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/c9c89cfc85dd8dfefd7921c16c87327d03140a06/sklearn/impute/__init__.py>`_
|
||||||
|
for an example.
|
||||||
|
|
||||||
|
Please also write basic tests following those in
|
||||||
|
`test_enable_hist_gradient_boosting.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/c9c89cfc85dd8dfefd7921c16c87327d03140a06/sklearn/experimental/tests/test_enable_hist_gradient_boosting.py>`__.
|
||||||
|
|
||||||
|
|
||||||
|
Make sure every user-facing code you write explicitly mentions that the feature
|
||||||
|
is experimental, and add a ``# noqa`` comment to avoid pep8-related warnings::
|
||||||
|
|
||||||
|
# To use this experimental feature, we need to explicitly ask for it:
|
||||||
|
from sklearn.experimental import enable_hist_gradient_boosting # noqa
|
||||||
|
from sklearn.ensemble import HistGradientBoostingRegressor
|
||||||
|
|
||||||
|
For the docs to render properly, please also import
|
||||||
|
``enable_my_experimental_feature`` in ``doc/conf.py``, else sphinx won't be
|
||||||
|
able to import the corresponding modules. Note that using ``from
|
||||||
|
sklearn.experimental import *`` **does not work**.
|
||||||
|
|
||||||
|
Note that some experimental classes / functions are not included in the
|
||||||
|
:mod:`sklearn.experimental` module: ``sklearn.datasets.fetch_openml``.
|
||||||
|
|
||||||
|
Once the feature becomes stable, remove all uses of `enable_my_experimental_feature`
|
||||||
|
in the scikit-learn code (even feature highlights etc.) and make the
|
||||||
|
`enable_my_experimental_feature` a no-op that just raises a warning:
|
||||||
|
`enable_hist_gradient_boosting.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/experimental/enable_hist_gradient_boosting.py>`__.
|
||||||
|
The file should stay there indefinitely as we don't want to break users' code:
|
||||||
|
we just incentivize them to remove that import with the warning.
|
||||||
|
|
||||||
|
Also update the tests accordingly: `test_enable_hist_gradient_boosting.py
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/experimental/tests/test_enable_hist_gradient_boosting.py>`__.
|
|
@ -0,0 +1,434 @@
|
||||||
|
.. _minimal_reproducer:
|
||||||
|
|
||||||
|
==============================================
|
||||||
|
Crafting a minimal reproducer for scikit-learn
|
||||||
|
==============================================
|
||||||
|
|
||||||
|
|
||||||
|
Whether submitting a bug report, designing a suite of tests, or simply posting a
|
||||||
|
question in the discussions, being able to craft minimal, reproducible examples
|
||||||
|
(or minimal, workable examples) is the key to communicating effectively and
|
||||||
|
efficiently with the community.
|
||||||
|
|
||||||
|
There are very good guidelines on the internet such as `this StackOverflow
|
||||||
|
document <https://stackoverflow.com/help/mcve>`_ or `this blogpost by Matthew
|
||||||
|
Rocklin <https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports>`_
|
||||||
|
on crafting Minimal Complete Verifiable Examples (referred to below as MCVE).
|
||||||
|
Our goal is not to be repetitive with those references but rather to provide a
|
||||||
|
step-by-step guide on how to narrow down a bug until you have reached the
|
||||||
|
shortest possible code to reproduce it.
|
||||||
|
|
||||||
|
The first step before submitting a bug report to scikit-learn is to read the
|
||||||
|
`Issue template
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/.github/ISSUE_TEMPLATE/bug_report.yml>`_.
|
||||||
|
It already gives a good overview of the information you will be asked to
|
||||||
|
provide.
|
||||||
|
|
||||||
|
|
||||||
|
.. _good_practices:
|
||||||
|
|
||||||
|
Good practices
|
||||||
|
==============
|
||||||
|
|
||||||
|
In this section we will focus on the **Steps/Code to Reproduce** section of the
|
||||||
|
`Issue template
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/.github/ISSUE_TEMPLATE/bug_report.yml>`_.
|
||||||
|
We will start with a snippet of code that already provides a failing example but
|
||||||
|
that has room for readability improvement. We then craft an MCVE from it.
|
||||||
|
|
||||||
|
**Example**
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
# I am currently working in a ML project and when I tried to fit a
|
||||||
|
# GradientBoostingRegressor instance to my_data.csv I get a UserWarning:
|
||||||
|
# "X has feature names, but DecisionTreeRegressor was fitted without
|
||||||
|
# feature names". You can get a copy of my dataset from
|
||||||
|
# https://example.com/my_data.csv and verify my features do have
|
||||||
|
# names. The problem seems to arise during fit when I pass an integer
|
||||||
|
# to the n_iter_no_change parameter.
|
||||||
|
|
||||||
|
df = pd.read_csv('my_data.csv')
|
||||||
|
X = df[["feature_name"]] # my features do have names
|
||||||
|
y = df["target"]
|
||||||
|
|
||||||
|
# We set random_state=42 for the train_test_split
|
||||||
|
X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
X, y, test_size=0.33, random_state=42
|
||||||
|
)
|
||||||
|
|
||||||
|
scaler = StandardScaler(with_mean=False)
|
||||||
|
X_train = scaler.fit_transform(X_train)
|
||||||
|
X_test = scaler.transform(X_test)
|
||||||
|
|
||||||
|
# An instance with default n_iter_no_change raises no error nor warnings
|
||||||
|
gbdt = GradientBoostingRegressor(random_state=0)
|
||||||
|
gbdt.fit(X_train, y_train)
|
||||||
|
default_score = gbdt.score(X_test, y_test)
|
||||||
|
|
||||||
|
# the bug appears when I change the value for n_iter_no_change
|
||||||
|
gbdt = GradientBoostingRegressor(random_state=0, n_iter_no_change=5)
|
||||||
|
gbdt.fit(X_train, y_train)
|
||||||
|
other_score = gbdt.score(X_test, y_test)
|
||||||
|
|
||||||
|
other_score = gbdt.score(X_test, y_test)
|
||||||
|
|
||||||
|
|
||||||
|
Provide a failing code example with minimal comments
|
||||||
|
----------------------------------------------------
|
||||||
|
|
||||||
|
Writing instructions to reproduce the problem in English is often ambiguous.
|
||||||
|
Better make sure that all the necessary details to reproduce the problem are
|
||||||
|
illustrated in the Python code snippet to avoid any ambiguity. Besides, by this
|
||||||
|
point you already provided a concise description in the **Describe the bug**
|
||||||
|
section of the `Issue template
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/.github/ISSUE_TEMPLATE/bug_report.yml>`_.
|
||||||
|
|
||||||
|
The following code, while **still not minimal**, is already **much better**
|
||||||
|
because it can be copy-pasted in a Python terminal to reproduce the problem in
|
||||||
|
one step. In particular:
|
||||||
|
|
||||||
|
- it contains **all necessary import statements**;
|
||||||
|
- it can fetch the public dataset without having to manually download a
|
||||||
|
file and put it in the expected location on the disk.
|
||||||
|
|
||||||
|
**Improved example**
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
df = pd.read_csv("https://example.com/my_data.csv")
|
||||||
|
X = df[["feature_name"]]
|
||||||
|
y = df["target"]
|
||||||
|
|
||||||
|
from sklearn.model_selection import train_test_split
|
||||||
|
|
||||||
|
X_train, X_test, y_train, y_test = train_test_split(
|
||||||
|
X, y, test_size=0.33, random_state=42
|
||||||
|
)
|
||||||
|
|
||||||
|
from sklearn.preprocessing import StandardScaler
|
||||||
|
|
||||||
|
scaler = StandardScaler(with_mean=False)
|
||||||
|
X_train = scaler.fit_transform(X_train)
|
||||||
|
X_test = scaler.transform(X_test)
|
||||||
|
|
||||||
|
from sklearn.ensemble import GradientBoostingRegressor
|
||||||
|
|
||||||
|
gbdt = GradientBoostingRegressor(random_state=0)
|
||||||
|
gbdt.fit(X_train, y_train) # no warning
|
||||||
|
default_score = gbdt.score(X_test, y_test)
|
||||||
|
|
||||||
|
gbdt = GradientBoostingRegressor(random_state=0, n_iter_no_change=5)
|
||||||
|
gbdt.fit(X_train, y_train) # raises warning
|
||||||
|
other_score = gbdt.score(X_test, y_test)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Boil down your script to something as small as possible
|
||||||
|
-------------------------------------------------------
|
||||||
|
|
||||||
|
You have to ask yourself which lines of code are relevant and which are not for
|
||||||
|
reproducing the bug. Deleting unnecessary lines of code or simplifying the
|
||||||
|
function calls by omitting unrelated non-default options will help you and other
|
||||||
|
contributors narrow down the cause of the bug.
|
||||||
|
|
||||||
|
In particular, for this specific example:
|
||||||
|
|
||||||
|
- the warning has nothing to do with the `train_test_split` since it already
|
||||||
|
appears in the training step, before we use the test set;
|
||||||
|
- similarly, the lines that compute the scores on the test set are not
|
||||||
|
necessary;
|
||||||
|
- the bug can be reproduced for any value of `random_state` so leave it to its
|
||||||
|
default;
|
||||||
|
- the bug can be reproduced without preprocessing the data with the
|
||||||
|
`StandardScaler`.
|
||||||
|
|
||||||
|
**Improved example**
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
df = pd.read_csv("https://example.com/my_data.csv")
|
||||||
|
X = df[["feature_name"]]
|
||||||
|
y = df["target"]
|
||||||
|
|
||||||
|
from sklearn.ensemble import GradientBoostingRegressor
|
||||||
|
|
||||||
|
gbdt = GradientBoostingRegressor()
|
||||||
|
gbdt.fit(X, y) # no warning
|
||||||
|
|
||||||
|
gbdt = GradientBoostingRegressor(n_iter_no_change=5)
|
||||||
|
gbdt.fit(X, y) # raises warning
|
||||||
|
|
||||||
|
|
||||||
|
**DO NOT** report your data unless it is extremely necessary
|
||||||
|
------------------------------------------------------------
|
||||||
|
|
||||||
|
The idea is to make the code as self-contained as possible. To do so, you
|
||||||
|
can use a :ref:`synth_data`. It can be generated using NumPy, pandas, or the
|
||||||
|
:mod:`sklearn.datasets` module. Most of the time the bug is not related to a
|
||||||
|
particular structure of your data. Even if it is, try to find an available
|
||||||
|
dataset that has similar characteristics to yours and that reproduces the
|
||||||
|
problem. In this particular case, we are interested in data that has labeled
|
||||||
|
feature names.
|
||||||
|
|
||||||
|
**Improved example**
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import pandas as pd
|
||||||
|
from sklearn.ensemble import GradientBoostingRegressor
|
||||||
|
|
||||||
|
df = pd.DataFrame(
|
||||||
|
{
|
||||||
|
"feature_name": [-12.32, 1.43, 30.01, 22.17],
|
||||||
|
"target": [72, 55, 32, 43],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
X = df[["feature_name"]]
|
||||||
|
y = df["target"]
|
||||||
|
|
||||||
|
gbdt = GradientBoostingRegressor()
|
||||||
|
gbdt.fit(X, y) # no warning
|
||||||
|
gbdt = GradientBoostingRegressor(n_iter_no_change=5)
|
||||||
|
gbdt.fit(X, y) # raises warning
|
||||||
|
|
||||||
|
As already mentioned, the key to communication is the readability of the code
|
||||||
|
and good formatting can really be a plus. Notice that in the previous snippet
|
||||||
|
we:
|
||||||
|
|
||||||
|
- try to limit all lines to a maximum of 79 characters to avoid horizontal
|
||||||
|
scrollbars in the code snippet blocks rendered on the GitHub issue;
|
||||||
|
- use blank lines to separate groups of related functions;
|
||||||
|
- place all the imports in their own group at the beginning.
|
||||||
|
|
||||||
|
The simplification steps presented in this guide can be implemented in a
|
||||||
|
different order than the progression we have shown here. The important points
|
||||||
|
are:
|
||||||
|
|
||||||
|
- a minimal reproducer should be runnable by a simple copy-and-paste in a
|
||||||
|
Python terminal;
|
||||||
|
- it should be simplified as much as possible by removing any code steps
|
||||||
|
that are not strictly needed to reproduce the original problem;
|
||||||
|
- it should ideally only rely on a minimal dataset generated on-the-fly by
|
||||||
|
running the code instead of relying on external data, if possible.
|
||||||
|
|
||||||
|
|
||||||
|
Use markdown formatting
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
To format code or text into its own distinct block, use triple backticks.
|
||||||
|
`Markdown
|
||||||
|
<https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax>`_
|
||||||
|
supports an optional language identifier to enable syntax highlighting in your
|
||||||
|
fenced code block. For example::
|
||||||
|
|
||||||
|
```python
|
||||||
|
from sklearn.datasets import make_blobs
|
||||||
|
|
||||||
|
n_samples = 100
|
||||||
|
n_components = 3
|
||||||
|
X, y = make_blobs(n_samples=n_samples, centers=n_components)
|
||||||
|
```
|
||||||
|
|
||||||
|
will render a Python-formatted snippet as follows:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from sklearn.datasets import make_blobs
|
||||||
|
|
||||||
|
n_samples = 100
|
||||||
|
n_components = 3
|
||||||
|
X, y = make_blobs(n_samples=n_samples, centers=n_components)
|
||||||
|
|
||||||
|
It is not necessary to create several blocks of code when submitting a bug
|
||||||
|
report. Remember other reviewers are going to copy-paste your code and having a
|
||||||
|
single cell will make their task easier.
|
||||||
|
|
||||||
|
In the section named **Actual results** of the `Issue template
|
||||||
|
<https://github.com/scikit-learn/scikit-learn/blob/main/.github/ISSUE_TEMPLATE/bug_report.yml>`_
|
||||||
|
you are asked to provide the error message including the full traceback of the
|
||||||
|
exception. In this case, use the `python-traceback` qualifier. For example::
|
||||||
|
|
||||||
|
```python-traceback
|
||||||
|
---------------------------------------------------------------------------
|
||||||
|
TypeError Traceback (most recent call last)
|
||||||
|
<ipython-input-1-a674e682c281> in <module>
|
||||||
|
4 vectorizer = CountVectorizer(input=docs, analyzer='word')
|
||||||
|
5 lda_features = vectorizer.fit_transform(docs)
|
||||||
|
----> 6 lda_model = LatentDirichletAllocation(
|
||||||
|
7 n_topics=10,
|
||||||
|
8 learning_method='online',
|
||||||
|
|
||||||
|
TypeError: __init__() got an unexpected keyword argument 'n_topics'
|
||||||
|
```
|
||||||
|
|
||||||
|
yields the following when rendered:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
---------------------------------------------------------------------------
|
||||||
|
TypeError Traceback (most recent call last)
|
||||||
|
<ipython-input-1-a674e682c281> in <module>
|
||||||
|
4 vectorizer = CountVectorizer(input=docs, analyzer='word')
|
||||||
|
5 lda_features = vectorizer.fit_transform(docs)
|
||||||
|
----> 6 lda_model = LatentDirichletAllocation(
|
||||||
|
7 n_topics=10,
|
||||||
|
8 learning_method='online',
|
||||||
|
|
||||||
|
TypeError: __init__() got an unexpected keyword argument 'n_topics'
|
||||||
|
|
||||||
|
|
||||||
|
.. _synth_data:
|
||||||
|
|
||||||
|
Synthetic dataset
|
||||||
|
=================
|
||||||
|
|
||||||
|
Before choosing a particular synthetic dataset, first you have to identify the
|
||||||
|
type of problem you are solving: Is it a classification, a regression,
|
||||||
|
a clustering, etc.?
|
||||||
|
|
||||||
|
Once you have narrowed down the type of problem, you need to provide a synthetic
|
||||||
|
dataset accordingly. Most of the time you only need a minimalistic dataset.
|
||||||
|
Here is a non-exhaustive list of tools that may help you.
|
||||||
|
|
||||||
|
NumPy
|
||||||
|
-----
|
||||||
|
|
||||||
|
NumPy tools such as `numpy.random.randn
|
||||||
|
<https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html>`_
|
||||||
|
and `numpy.random.randint
|
||||||
|
<https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html>`_
|
||||||
|
can be used to create dummy numeric data.
|
||||||
|
|
||||||
|
- regression
|
||||||
|
|
||||||
|
Regressions take continuous numeric data as features and target.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
rng = np.random.RandomState(0)
|
||||||
|
n_samples, n_features = 5, 5
|
||||||
|
X = rng.randn(n_samples, n_features)
|
||||||
|
y = rng.randn(n_samples)
|
||||||
|
|
||||||
|
A similar snippet can be used as synthetic data when testing scaling tools such
|
||||||
|
as :class:`sklearn.preprocessing.StandardScaler`.
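
For instance, a minimal check of such a scaling tool on synthetic data could
look like the following sketch (a hedged illustration, not taken from the
scikit-learn test suite):

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    rng = np.random.RandomState(0)
    n_samples, n_features = 5, 5
    X = rng.randn(n_samples, n_features)

    # StandardScaler only needs the feature matrix, no target is required
    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0))  # approximately 0 per column after centering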
|
||||||
|
|
||||||
|
- classification
|
||||||
|
|
||||||
|
If the bug is not raised when encoding a categorical variable, you can
|
||||||
|
feed numeric data to a classifier. Just remember to ensure that the target
|
||||||
|
is indeed an integer.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
rng = np.random.RandomState(0)
|
||||||
|
n_samples, n_features = 5, 5
|
||||||
|
X = rng.randn(n_samples, n_features)
|
||||||
|
y = rng.randint(0, 2, n_samples) # binary target with values in {0, 1}
|
||||||
|
|
||||||
|
|
||||||
|
If the bug only happens with non-numeric class labels, you might want to
|
||||||
|
generate a random target with `numpy.random.choice
|
||||||
|
<https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html>`_.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
|
||||||
|
rng = np.random.RandomState(0)
|
||||||
|
n_samples, n_features = 50, 5
|
||||||
|
X = rng.randn(n_samples, n_features)
|
||||||
|
y = np.random.choice(
|
||||||
|
["male", "female", "other"], size=n_samples, p=[0.49, 0.49, 0.02]
|
||||||
|
)
|
||||||
|
|
||||||
|
Pandas
|
||||||
|
------
|
||||||
|
|
||||||
|
Some scikit-learn objects expect pandas dataframes as input. In this case you can
|
||||||
|
transform numpy arrays into pandas objects using `pandas.DataFrame
|
||||||
|
<https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`_, or
|
||||||
|
`pandas.Series
|
||||||
|
<https://pandas.pydata.org/docs/reference/api/pandas.Series.html>`_.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
|
|
||||||
|
rng = np.random.RandomState(0)
|
||||||
|
n_samples, n_features = 5, 5
|
||||||
|
X = pd.DataFrame(
|
||||||
|
{
|
||||||
|
"continuous_feature": rng.randn(n_samples),
|
||||||
|
"positive_feature": rng.uniform(low=0.0, high=100.0, size=n_samples),
|
||||||
|
"categorical_feature": rng.choice(["a", "b", "c"], size=n_samples),
|
||||||
|
}
|
||||||
|
)
|
||||||
|
y = pd.Series(rng.randn(n_samples))
|
||||||
|
|
||||||
|
In addition, scikit-learn includes various :ref:`sample_generators` that can be
|
||||||
|
used to build artificial datasets of controlled size and complexity.
|
||||||
|
|
||||||
|
`make_regression`
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
As hinted by the name, :func:`sklearn.datasets.make_regression` produces
|
||||||
|
regression targets with noise as an optionally-sparse random linear combination
|
||||||
|
of random features.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from sklearn.datasets import make_regression
|
||||||
|
|
||||||
|
X, y = make_regression(n_samples=1000, n_features=20)
|
||||||
|
|
||||||
|
`make_classification`
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
:func:`sklearn.datasets.make_classification` creates multiclass datasets with
multiple Gaussian
|
||||||
|
clusters per class. Noise can be introduced by means of correlated, redundant or
|
||||||
|
uninformative features.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from sklearn.datasets import make_classification
|
||||||
|
|
||||||
|
X, y = make_classification(
|
||||||
|
n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1
|
||||||
|
)
|
||||||
|
|
||||||
|
`make_blobs`
|
||||||
|
------------
|
||||||
|
|
||||||
|
Similarly to `make_classification`, :func:`sklearn.datasets.make_blobs` creates
|
||||||
|
multiclass datasets using normally-distributed clusters of points. It provides
|
||||||
|
greater control regarding the centers and standard deviations of each cluster,
|
||||||
|
and therefore it is useful to demonstrate clustering.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from sklearn.datasets import make_blobs
|
||||||
|
|
||||||
|
X, y = make_blobs(n_samples=10, centers=3, n_features=2)
|
||||||
|
|
||||||
|
Dataset loading utilities
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
You can use the :ref:`datasets` to load and fetch several popular reference
|
||||||
|
datasets. This option is useful when the bug relates to the particular structure
|
||||||
|
of the data, e.g. dealing with missing values or image recognition.
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
from sklearn.datasets import load_breast_cancer
|
||||||
|
|
||||||
|
X, y = load_breast_cancer(return_X_y=True)
|
|
@ -0,0 +1,420 @@
|
||||||
|
.. _performance-howto:
|
||||||
|
|
||||||
|
=========================
|
||||||
|
How to optimize for speed
|
||||||
|
=========================
|
||||||
|
|
||||||
|
The following gives some practical guidelines to help you write efficient
|
||||||
|
code for the scikit-learn project.
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
While it is always useful to profile your code so as to **check
|
||||||
|
performance assumptions**, it is also highly recommended
|
||||||
|
to **review the literature** to ensure that the implemented algorithm
|
||||||
|
is the state of the art for the task before investing into costly
|
||||||
|
implementation optimization.
|
||||||
|
|
||||||
|
Time and again, hours of effort invested in optimizing complicated
|
||||||
|
implementation details have been rendered irrelevant by the subsequent
|
||||||
|
discovery of simple **algorithmic tricks**, or by using another algorithm
|
||||||
|
altogether that is better suited to the problem.
|
||||||
|
|
||||||
|
The section :ref:`warm-restarts` gives an example of such a trick.
|
||||||
|
|
||||||
|
|
||||||
|
Python, Cython or C/C++?
|
||||||
|
========================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn
|
||||||
|
|
||||||
|
In general, the scikit-learn project emphasizes the **readability** of
|
||||||
|
the source code to make it easy for the project users to dive into the
|
||||||
|
source code so as to understand how the algorithm behaves on their data
|
||||||
|
but also for ease of maintainability (by the developers).
|
||||||
|
|
||||||
|
When implementing a new algorithm, it is thus recommended to **start
|
||||||
|
implementing it in Python using NumPy and SciPy**, taking care to avoid
|
||||||
|
looping code by using the vectorized idioms of those libraries. In practice
|
||||||
|
this means trying to **replace any nested for loops by calls to equivalent
|
||||||
|
Numpy array methods**. The goal is to avoid the CPU wasting time in the
|
||||||
|
Python interpreter rather than crunching numbers to fit your statistical
|
||||||
|
model. It's generally a good idea to consider NumPy and SciPy performance tips:
|
||||||
|
https://scipy.github.io/old-wiki/pages/PerformanceTips
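
As a rough, hedged illustration of this idea (a toy computation, not
scikit-learn code), compare a nested Python loop with its vectorized NumPy
equivalent for pairwise squared distances:

.. code-block:: python

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)

    # Naive nested loops: the Python interpreter dominates the runtime
    D_loop = np.zeros((X.shape[0], X.shape[0]))
    for i in range(X.shape[0]):
        for j in range(X.shape[0]):
            D_loop[i, j] = np.sum((X[i] - X[j]) ** 2)

    # Vectorized version: the same result from a few NumPy array operations
    sq_norms = (X ** 2).sum(axis=1)
    D_vec = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T

    assert np.allclose(D_loop, D_vec)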
|
||||||
|
|
||||||
|
Sometimes however an algorithm cannot be expressed efficiently in simple
|
||||||
|
vectorized Numpy code. In this case, the recommended strategy is the
|
||||||
|
following:
|
||||||
|
|
||||||
|
1. **Profile** the Python implementation to find the main bottleneck and
|
||||||
|
isolate it in a **dedicated module level function**. This function
|
||||||
|
will be reimplemented as a compiled extension module.
|
||||||
|
|
||||||
|
2. If there exists a well maintained BSD or MIT **C/C++** implementation
|
||||||
|
of the same algorithm that is not too big, you can write a
|
||||||
|
**Cython wrapper** for it and include a copy of the source code
|
||||||
|
of the library in the scikit-learn source tree: this strategy is
|
||||||
|
used for the classes :class:`svm.LinearSVC`, :class:`svm.SVC` and
|
||||||
|
:class:`linear_model.LogisticRegression` (wrappers for liblinear
|
||||||
|
and libsvm).
|
||||||
|
|
||||||
|
3. Otherwise, write an optimized version of your Python function using
|
||||||
|
**Cython** directly. This strategy is used
|
||||||
|
for the :class:`linear_model.ElasticNet` and
|
||||||
|
:class:`linear_model.SGDClassifier` classes for instance.
|
||||||
|
|
||||||
|
4. **Move the Python version of the function in the tests** and use
|
||||||
|
it to check that the results of the compiled extension are consistent
|
||||||
|
with the gold-standard, easy-to-debug Python version (see the sketch after this list).
|
||||||
|
|
||||||
|
5. Once the code is optimized (no simple bottleneck remains spottable by
|
||||||
|
profiling), check whether it is possible to have **coarse grained
|
||||||
|
parallelism** that is amenable to **multi-processing** by using the
|
||||||
|
``joblib.Parallel`` class.
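
For step 4, a minimal sketch of such a consistency test could look as follows;
both function names below are purely illustrative and do not exist in
scikit-learn (the "fast" one stands in for the compiled extension):

.. code-block:: python

    import numpy as np
    from numpy.testing import assert_allclose


    def _pairwise_sum_python(X):
        # Slow but easy-to-debug reference implementation kept in the tests
        result = np.zeros((X.shape[0], X.shape[0]))
        for i in range(X.shape[0]):
            for j in range(X.shape[0]):
                result[i, j] = np.sum(X[i] + X[j])
        return result


    def _pairwise_sum_fast(X):
        # Stand-in for the compiled extension (here simply vectorized)
        row_sums = X.sum(axis=1)
        return row_sums[:, None] + row_sums[None, :]


    def test_pairwise_sum_consistency():
        rng = np.random.RandomState(0)
        X = rng.randn(10, 3)
        assert_allclose(_pairwise_sum_fast(X), _pairwise_sum_python(X))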
|
||||||
|
|
||||||
|
When using Cython, use either
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
python setup.py build_ext -i
|
||||||
|
python setup.py install
|
||||||
|
|
||||||
|
to generate C files. You are responsible for adding .c/.cpp extensions along
|
||||||
|
with build parameters in each submodule ``setup.py``.
|
||||||
|
|
||||||
|
C/C++ generated files are embedded in distributed stable packages. The goal is
|
||||||
|
to make it possible to install the scikit-learn stable version
|
||||||
|
on any machine with Python, NumPy, SciPy and a C/C++ compiler.
|
||||||
|
|
||||||
|
.. _profiling-python-code:
|
||||||
|
|
||||||
|
Profiling Python code
|
||||||
|
=====================
|
||||||
|
|
||||||
|
In order to profile Python code we recommend writing a script that
|
||||||
|
loads and prepares your data and then using the IPython integrated profiler
|
||||||
|
to interactively explore the relevant parts of the code.
|
||||||
|
|
||||||
|
Suppose we want to profile the Non Negative Matrix Factorization module
|
||||||
|
of scikit-learn. Let us set up a new IPython session and load the digits
|
||||||
|
dataset as in the :ref:`sphx_glr_auto_examples_classification_plot_digits_classification.py` example::
|
||||||
|
|
||||||
|
In [1]: from sklearn.decomposition import NMF
|
||||||
|
|
||||||
|
In [2]: from sklearn.datasets import load_digits
|
||||||
|
|
||||||
|
In [3]: X, _ = load_digits(return_X_y=True)
|
||||||
|
|
||||||
|
Before starting the profiling session and engaging in tentative
|
||||||
|
optimization iterations, it is important to measure the total execution
|
||||||
|
time of the function we want to optimize without any kind of profiler
|
||||||
|
overhead and save it somewhere for later reference::
|
||||||
|
|
||||||
|
In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
|
||||||
|
1 loops, best of 3: 1.7 s per loop
|
||||||
|
|
||||||
|
To have a look at the overall performance profile using the ``%prun``
|
||||||
|
magic command::
|
||||||
|
|
||||||
|
In [5]: %prun -l nmf.py NMF(n_components=16, tol=1e-2).fit(X)
|
||||||
|
14496 function calls in 1.682 CPU seconds
|
||||||
|
|
||||||
|
Ordered by: internal time
|
||||||
|
List reduced from 90 to 9 due to restriction <'nmf.py'>
|
||||||
|
|
||||||
|
ncalls tottime percall cumtime percall filename:lineno(function)
|
||||||
|
36 0.609 0.017 1.499 0.042 nmf.py:151(_nls_subproblem)
|
||||||
|
1263 0.157 0.000 0.157 0.000 nmf.py:18(_pos)
|
||||||
|
1 0.053 0.053 1.681 1.681 nmf.py:352(fit_transform)
|
||||||
|
673 0.008 0.000 0.057 0.000 nmf.py:28(norm)
|
||||||
|
1 0.006 0.006 0.047 0.047 nmf.py:42(_initialize_nmf)
|
||||||
|
36 0.001 0.000 0.010 0.000 nmf.py:36(_sparseness)
|
||||||
|
30 0.001 0.000 0.001 0.000 nmf.py:23(_neg)
|
||||||
|
1 0.000 0.000 0.000 0.000 nmf.py:337(__init__)
|
||||||
|
1 0.000 0.000 1.681 1.681 nmf.py:461(fit)
|
||||||
|
|
||||||
|
The ``tottime`` column is the most interesting: it gives the total time spent
|
||||||
|
executing the code of a given function ignoring the time spent in executing the
|
||||||
|
sub-functions. The real total time (local code + sub-function calls) is given by
|
||||||
|
the ``cumtime`` column.
|
||||||
|
|
||||||
|
Note the use of the ``-l nmf.py`` filter that restricts the output to lines that
|
||||||
|
contains the "nmf.py" string. This is useful to have a quick look at the hotspot
|
||||||
|
of the nmf Python module itself, ignoring anything else.
|
||||||
|
|
||||||
|
Here is the beginning of the output of the same command without the ``-l nmf.py``
|
||||||
|
filter::
|
||||||
|
|
||||||
|
In [5] %prun NMF(n_components=16, tol=1e-2).fit(X)
|
||||||
|
16159 function calls in 1.840 CPU seconds
|
||||||
|
|
||||||
|
Ordered by: internal time
|
||||||
|
|
||||||
|
ncalls tottime percall cumtime percall filename:lineno(function)
|
||||||
|
2833 0.653 0.000 0.653 0.000 {numpy.core._dotblas.dot}
|
||||||
|
46 0.651 0.014 1.636 0.036 nmf.py:151(_nls_subproblem)
|
||||||
|
1397 0.171 0.000 0.171 0.000 nmf.py:18(_pos)
|
||||||
|
2780 0.167 0.000 0.167 0.000 {method 'sum' of 'numpy.ndarray' objects}
|
||||||
|
1 0.064 0.064 1.840 1.840 nmf.py:352(fit_transform)
|
||||||
|
1542 0.043 0.000 0.043 0.000 {method 'flatten' of 'numpy.ndarray' objects}
|
||||||
|
337 0.019 0.000 0.019 0.000 {method 'all' of 'numpy.ndarray' objects}
|
||||||
|
2734 0.011 0.000 0.181 0.000 fromnumeric.py:1185(sum)
|
||||||
|
2 0.010 0.005 0.010 0.005 {numpy.linalg.lapack_lite.dgesdd}
|
||||||
|
748 0.009 0.000 0.065 0.000 nmf.py:28(norm)
|
||||||
|
...
|
||||||
|
|
||||||
|
The above results show that the execution is largely dominated by
|
||||||
|
dot product operations (delegated to BLAS). Hence there is probably
|
||||||
|
no huge gain to expect by rewriting this code in Cython or C/C++: in
|
||||||
|
this case out of the 1.7s total execution time, almost 0.7s are spent
|
||||||
|
in compiled code we can consider optimal. By rewriting the rest of the
|
||||||
|
Python code and assuming we could achieve a 1000% boost on this portion
|
||||||
|
(which is highly unlikely given the shallowness of the Python loops),
|
||||||
|
we would not gain more than a 2.4x speed-up globally.
|
||||||
|
|
||||||
|
Hence major improvements can only be achieved by **algorithmic
|
||||||
|
improvements** in this particular example (e.g. trying to find operations
|
||||||
|
that are both costly and useless in order to avoid computing them, rather than
|
||||||
|
trying to optimize their implementation).
|
||||||
|
|
||||||
|
It is however still interesting to check what's happening inside the
|
||||||
|
``_nls_subproblem`` function which is the hotspot if we only consider
|
||||||
|
Python code: it takes around 100% of the accumulated time of the module. In
|
||||||
|
order to better understand the profile of this specific function, let
|
||||||
|
us install ``line_profiler`` and wire it to IPython:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
pip install line_profiler
|
||||||
|
|
||||||
|
**Under IPython 0.13+**, first create a configuration profile:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
ipython profile create
|
||||||
|
|
||||||
|
Then register the line_profiler extension in
|
||||||
|
``~/.ipython/profile_default/ipython_config.py``::
|
||||||
|
|
||||||
|
c.TerminalIPythonApp.extensions.append('line_profiler')
|
||||||
|
c.InteractiveShellApp.extensions.append('line_profiler')
|
||||||
|
|
||||||
|
This will register the ``%lprun`` magic command in the IPython terminal application and the other frontends such as qtconsole and notebook.
|
||||||
|
|
||||||
|
Now restart IPython and let us use this new toy::
|
||||||
|
|
||||||
|
In [1]: from sklearn.datasets import load_digits
|
||||||
|
|
||||||
|
In [2]: from sklearn.decomposition import NMF
|
||||||
|
... : from sklearn.decomposition._nmf import _nls_subproblem
|
||||||
|
|
||||||
|
In [3]: X, _ = load_digits(return_X_y=True)
|
||||||
|
|
||||||
|
In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
|
||||||
|
Timer unit: 1e-06 s
|
||||||
|
|
||||||
|
File: sklearn/decomposition/nmf.py
|
||||||
|
Function: _nls_subproblem at line 137
|
||||||
|
Total time: 1.73153 s
|
||||||
|
|
||||||
|
Line # Hits Time Per Hit % Time Line Contents
|
||||||
|
==============================================================
|
||||||
|
137 def _nls_subproblem(V, W, H_init, tol, max_iter):
|
||||||
|
138 """Non-negative least square solver
|
||||||
|
...
|
||||||
|
170 """
|
||||||
|
171 48 5863 122.1 0.3 if (H_init < 0).any():
|
||||||
|
172 raise ValueError("Negative values in H_init passed to NLS solver.")
|
||||||
|
173
|
||||||
|
174 48 139 2.9 0.0 H = H_init
|
||||||
|
175 48 112141 2336.3 5.8 WtV = np.dot(W.T, V)
|
||||||
|
176 48 16144 336.3 0.8 WtW = np.dot(W.T, W)
|
||||||
|
177
|
||||||
|
178 # values justified in the paper
|
||||||
|
179 48 144 3.0 0.0 alpha = 1
|
||||||
|
180 48 113 2.4 0.0 beta = 0.1
|
||||||
|
181 638 1880 2.9 0.1 for n_iter in range(1, max_iter + 1):
|
||||||
|
182 638 195133 305.9 10.2 grad = np.dot(WtW, H) - WtV
|
||||||
|
183 638 495761 777.1 25.9 proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
|
||||||
|
184 638 2449 3.8 0.1 if proj_gradient < tol:
|
||||||
|
185 48 130 2.7 0.0 break
|
||||||
|
186
|
||||||
|
187 1474 4474 3.0 0.2 for inner_iter in range(1, 20):
|
||||||
|
188 1474 83833 56.9 4.4 Hn = H - alpha * grad
|
||||||
|
189 # Hn = np.where(Hn > 0, Hn, 0)
|
||||||
|
190 1474 194239 131.8 10.1 Hn = _pos(Hn)
|
||||||
|
191 1474 48858 33.1 2.5 d = Hn - H
|
||||||
|
192 1474 150407 102.0 7.8 gradd = np.sum(grad * d)
|
||||||
|
193 1474 515390 349.7 26.9 dQd = np.sum(np.dot(WtW, d) * d)
|
||||||
|
...
|
||||||
|
|
||||||
|
By looking at the top values of the ``% Time`` column it is really easy to
|
||||||
|
pin-point the most expensive expressions that would deserve additional care.
|
||||||
|
|
||||||
|
|
||||||
|
Memory usage profiling
|
||||||
|
======================
|
||||||
|
|
||||||
|
You can analyze in detail the memory usage of any Python code with the help of
|
||||||
|
`memory_profiler <https://pypi.org/project/memory_profiler/>`_. First,
|
||||||
|
install the latest version:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
pip install -U memory_profiler
|
||||||
|
|
||||||
|
Then, set up the magics in a manner similar to ``line_profiler``.
|
||||||
|
|
||||||
|
**Under IPython 0.11+**, first create a configuration profile:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
ipython profile create
|
||||||
|
|
||||||
|
|
||||||
|
Then register the extension in
|
||||||
|
``~/.ipython/profile_default/ipython_config.py``
|
||||||
|
alongside the line profiler::
|
||||||
|
|
||||||
|
c.TerminalIPythonApp.extensions.append('memory_profiler')
|
||||||
|
c.InteractiveShellApp.extensions.append('memory_profiler')
|
||||||
|
|
||||||
|
This will register the ``%memit`` and ``%mprun`` magic commands in the
|
||||||
|
IPython terminal application and the other frontends such as qtconsole and notebook.
|
||||||
|
|
||||||
|
``%mprun`` is useful to examine, line-by-line, the memory usage of key
|
||||||
|
functions in your program. It is very similar to ``%lprun``, discussed in the
|
||||||
|
previous section. For example, from the ``memory_profiler`` ``examples``
|
||||||
|
directory::
|
||||||
|
|
||||||
|
In [1] from example import my_func
|
||||||
|
|
||||||
|
In [2] %mprun -f my_func my_func()
|
||||||
|
Filename: example.py
|
||||||
|
|
||||||
|
Line # Mem usage Increment Line Contents
|
||||||
|
==============================================
|
||||||
|
3 @profile
|
||||||
|
4 5.97 MB 0.00 MB def my_func():
|
||||||
|
5 13.61 MB 7.64 MB a = [1] * (10 ** 6)
|
||||||
|
6 166.20 MB 152.59 MB b = [2] * (2 * 10 ** 7)
|
||||||
|
7 13.61 MB -152.59 MB del b
|
||||||
|
8 13.61 MB 0.00 MB return a
|
||||||
|
|
||||||
|
Another useful magic that ``memory_profiler`` defines is ``%memit``, which is
|
||||||
|
analogous to ``%timeit``. It can be used as follows::
|
||||||
|
|
||||||
|
In [1]: import numpy as np
|
||||||
|
|
||||||
|
In [2]: %memit np.zeros(1e7)
|
||||||
|
maximum of 3: 76.402344 MB per loop
|
||||||
|
|
||||||
|
For more details, see the docstrings of the magics, using ``%memit?`` and
|
||||||
|
``%mprun?``.
|
||||||
|
|
||||||
|
|
||||||
|
Using Cython
|
||||||
|
============
|
||||||
|
|
||||||
|
If profiling of the Python code reveals that the Python interpreter
|
||||||
|
overhead is larger by one order of magnitude or more than the cost of the
|
||||||
|
actual numerical computation (e.g. ``for`` loops over vector components,
|
||||||
|
nested evaluation of conditional expression, scalar arithmetic...), it
|
||||||
|
is probably adequate to extract the hotspot portion of the code as a
|
||||||
|
standalone function in a ``.pyx`` file, add static type declarations and
|
||||||
|
then use Cython to generate a C program suitable to be compiled as a
|
||||||
|
Python extension module.
|
||||||
|
|
||||||
|
The `Cython's documentation <http://docs.cython.org/>`_ contains a tutorial and
|
||||||
|
reference guide for developing such a module.
|
||||||
|
For more information about developing in Cython for scikit-learn, see :ref:`cython`.
|
||||||
|
|
||||||
|
|
||||||
|
.. _profiling-compiled-extension:
|
||||||
|
|
||||||
|
Profiling compiled extensions
|
||||||
|
=============================
|
||||||
|
|
||||||
|
When working with compiled extensions (written in C/C++ with a wrapper or
|
||||||
|
directly as Cython extension), the default Python profiler is useless:
|
||||||
|
we need a dedicated tool to introspect what's happening inside the
|
||||||
|
compiled extension it-self.
|
||||||
|
|
||||||
|
Using yep and gperftools
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
Easy profiling without special compilation options use yep:
|
||||||
|
|
||||||
|
- https://pypi.org/project/yep/
|
||||||
|
- https://fa.bianp.net/blog/2011/a-profiler-for-python-extensions
|
||||||
|
|
||||||
|
Using a debugger, gdb
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
* It is helpful to use ``gdb`` to debug. In order to do so, one must use
|
||||||
|
a Python interpreter built with debug support (debug symbols and proper
|
||||||
|
optimization). To create a new conda environment (which you might need
|
||||||
|
to deactivate and reactivate after building/installing) with a source-built
|
||||||
|
CPython interpreter:
|
||||||
|
|
||||||
|
.. code-block:: bash
|
||||||
|
|
||||||
|
git clone https://github.com/python/cpython.git
|
||||||
|
conda create -n debug-scikit-dev
|
||||||
|
conda activate debug-scikit-dev
|
||||||
|
cd cpython
|
||||||
|
mkdir debug
|
||||||
|
cd debug
|
||||||
|
../configure --prefix=$CONDA_PREFIX --with-pydebug
|
||||||
|
make EXTRA_CFLAGS='-DPy_DEBUG' -j<num_cores>
|
||||||
|
make install
|
||||||
|
|
||||||
|
|
||||||
|
Using gprof
|
||||||
|
-----------
|
||||||
|
|
||||||
|
In order to profile compiled Python extensions one could use ``gprof``
|
||||||
|
after having recompiled the project with ``gcc -pg`` and using the
|
||||||
|
``python-dbg`` variant of the interpreter on debian / ubuntu: however
|
||||||
|
this approach also requires having ``numpy`` and ``scipy`` recompiled
|
||||||
|
with ``-pg`` which is rather complicated to get working.
|
||||||
|
|
||||||
|
Fortunately there exist two alternative profilers that don't require you to
|
||||||
|
recompile everything.
|
||||||
|
|
||||||
|
Using valgrind / callgrind / kcachegrind
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
kcachegrind
|
||||||
|
~~~~~~~~~~~
|
||||||
|
|
||||||
|
``yep`` can be used to create a profiling report.
|
||||||
|
``kcachegrind`` provides a graphical environment to visualize this report:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
# Run yep to profile some python script
|
||||||
|
python -m yep -c my_file.py
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
# open my_file.py.prof with kcachegrind
|
||||||
|
kcachegrind my_file.py.prof
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
``yep`` can be executed with the argument ``--lines`` or ``-l`` to produce
|
||||||
|
a line-by-line profiling report.
|
||||||
|
|
||||||
|
Multi-core parallelism using ``joblib.Parallel``
|
||||||
|
================================================
|
||||||
|
|
||||||
|
See `joblib documentation <https://joblib.readthedocs.io>`_
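
As a minimal, hedged illustration of coarse-grained parallelism with
``joblib.Parallel`` (the workload below is a toy function, not scikit-learn
code):

.. code-block:: python

    from joblib import Parallel, delayed


    def expensive_task(seed):
        # Toy stand-in for an independent, costly computation (e.g. one CV fold)
        return sum(i * seed for i in range(1000))


    # Run the independent tasks on two worker processes
    results = Parallel(n_jobs=2)(delayed(expensive_task)(s) for s in range(8))
    print(results)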
|
||||||
|
|
||||||
|
|
||||||
|
.. _warm-restarts:
|
||||||
|
|
||||||
|
A simple algorithmic trick: warm restarts
|
||||||
|
=========================================
|
||||||
|
|
||||||
|
See the glossary entry for :term:`warm_start`
|
|
@ -0,0 +1,97 @@
|
||||||
|
.. _plotting_api:
|
||||||
|
|
||||||
|
================================
|
||||||
|
Developing with the Plotting API
|
||||||
|
================================
|
||||||
|
|
||||||
|
Scikit-learn defines a simple API for creating visualizations for machine
|
||||||
|
learning. The key feature of this API is to run calculations once and to have
|
||||||
|
the flexibility to adjust the visualizations after the fact. This section is
|
||||||
|
intended for developers who wish to develop or maintain plotting tools. For
|
||||||
|
usage, users should refer to the :ref:`User Guide <visualizations>`.
|
||||||
|
|
||||||
|
Plotting API Overview
|
||||||
|
---------------------
|
||||||
|
|
||||||
|
This logic is encapsulated into a display object where the computed data is
|
||||||
|
stored and the plotting is done in a `plot` method. The display object's
|
||||||
|
`__init__` method contains only the data needed to create the visualization.
|
||||||
|
The `plot` method takes in parameters that only have to do with visualization,
|
||||||
|
such as a matplotlib axes. The `plot` method will store the matplotlib artists
|
||||||
|
as attributes allowing for style adjustments through the display object. The
|
||||||
|
`Display` class should define one or both class methods: `from_estimator` and
|
||||||
|
`from_predictions`. These methods allow creating the `Display` object from
|
||||||
|
the estimator and some data or from the true and predicted values. After these
|
||||||
|
class methods create the display object with the computed values, they call the
|
||||||
|
display's plot method. Note that the `plot` method defines attributes related
|
||||||
|
to matplotlib, such as the line artist. This allows for customizations after
|
||||||
|
calling the `plot` method.
|
||||||
|
|
||||||
|
For example, the `RocCurveDisplay` defines the following methods and
|
||||||
|
attributes::
|
||||||
|
|
||||||
|
class RocCurveDisplay:
|
||||||
|
def __init__(self, fpr, tpr, roc_auc, estimator_name):
|
||||||
|
...
|
||||||
|
self.fpr = fpr
|
||||||
|
self.tpr = tpr
|
||||||
|
self.roc_auc = roc_auc
|
||||||
|
self.estimator_name = estimator_name
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_estimator(cls, estimator, X, y):
|
||||||
|
# get the predictions
|
||||||
|
y_pred = estimator.predict_proba(X)[:, 1]
|
||||||
|
return cls.from_predictions(y, y_pred, estimator.__class__.__name__)
|
||||||
|
|
||||||
|
@classmethod
|
||||||
|
def from_predictions(cls, y, y_pred, estimator_name):
|
||||||
|
# do ROC computation from y and y_pred
|
||||||
|
fpr, tpr, roc_auc = ...
|
||||||
|
viz = RocCurveDisplay(fpr, tpr, roc_auc, estimator_name)
|
||||||
|
return viz.plot()
|
||||||
|
|
||||||
|
def plot(self, ax=None, name=None, **kwargs):
|
||||||
|
...
|
||||||
|
self.line_ = ...
|
||||||
|
self.ax_ = ax
|
||||||
|
self.figure_ = ax.figure_
|
||||||
|
|
||||||
|
Read more in :ref:`sphx_glr_auto_examples_miscellaneous_plot_roc_curve_visualization_api.py`
|
||||||
|
and the :ref:`User Guide <visualizations>`.
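
As a usage sketch of this pattern (using the public `RocCurveDisplay`, with a
toy dataset chosen purely for illustration), the expensive computation happens
once in the class method and the stored matplotlib artists can be adjusted
afterwards:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import RocCurveDisplay

    X, y = make_classification(random_state=0)
    clf = LogisticRegression().fit(X, y)

    # The ROC computation is done once by the class method
    display = RocCurveDisplay.from_estimator(clf, X, y)

    # Style adjustments through the stored matplotlib artists
    display.line_.set_linestyle("--")
    display.ax_.set_title("ROC curve (training data, for illustration only)")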
|
||||||
|
|
||||||
|
Plotting with Multiple Axes
|
||||||
|
---------------------------
|
||||||
|
|
||||||
|
Some of the plotting tools like
|
||||||
|
:func:`~sklearn.inspection.PartialDependenceDisplay.from_estimator` and
|
||||||
|
:class:`~sklearn.inspection.PartialDependenceDisplay` support plotting on
|
||||||
|
multiple axes. Two different scenarios are supported:
|
||||||
|
|
||||||
|
1. If a list of axes is passed in, `plot` will check if the number of axes is
   consistent with the number of axes it expects and then draws on those axes.
2. If a single axes is passed in, that axes defines a space for multiple axes to
   be placed. In this case, we suggest using matplotlib's
   `~matplotlib.gridspec.GridSpecFromSubplotSpec` to split up the space::
|
||||||
|
|
||||||
|
import matplotlib.pyplot as plt
|
||||||
|
from matplotlib.gridspec import GridSpecFromSubplotSpec
|
||||||
|
|
||||||
|
fig, ax = plt.subplots()
|
||||||
|
gs = GridSpecFromSubplotSpec(2, 2, subplot_spec=ax.get_subplotspec())
|
||||||
|
|
||||||
|
ax_top_left = fig.add_subplot(gs[0, 0])
|
||||||
|
ax_top_right = fig.add_subplot(gs[0, 1])
|
||||||
|
ax_bottom = fig.add_subplot(gs[1, :])
|
||||||
|
|
||||||
|
By default, the `ax` keyword in `plot` is `None`. In this case, the single
|
||||||
|
axes is created and the gridspec api is used to create the regions to plot in.
|
||||||
|
|
||||||
|
See for example, :meth:`~sklearn.inspection.PartialDependenceDisplay.from_estimator`
|
||||||
|
which plots multiple lines and contours using this API. The axes defining the
|
||||||
|
bounding box is saved in a `bounding_ax_` attribute. The individual axes
|
||||||
|
created are stored in an `axes_` ndarray, corresponding to the axes position on
|
||||||
|
the grid. Positions that are not used are set to `None`. Furthermore, the
|
||||||
|
matplotlib Artists are stored in `lines_` and `contours_` where the key is the
|
||||||
|
position on the grid. When a list of axes is passed in, the `axes_`, `lines_`,
|
||||||
|
and `contours_` are 1d ndarrays corresponding to the list of axes passed in.
|
|
@ -0,0 +1,373 @@
|
||||||
|
.. _developers-tips:
|
||||||
|
|
||||||
|
===========================
|
||||||
|
Developers' Tips and Tricks
|
||||||
|
===========================
|
||||||
|
|
||||||
|
Productivity and sanity-preserving tips
|
||||||
|
=======================================
|
||||||
|
|
||||||
|
In this section we gather some useful advice and tools that may increase your
|
||||||
|
quality-of-life when reviewing pull requests, running unit tests, and so forth.
|
||||||
|
Some of these tricks consist of userscripts that require a browser extension
|
||||||
|
such as `TamperMonkey`_ or `GreaseMonkey`_; to set up userscripts you must have
|
||||||
|
one of these extensions installed, enabled and running. We provide userscripts
|
||||||
|
as GitHub gists; to install them, click on the "Raw" button on the gist page.
|
||||||
|
|
||||||
|
.. _TamperMonkey: https://tampermonkey.net/
|
||||||
|
.. _GreaseMonkey: https://www.greasespot.net/
|
||||||
|
|
||||||
|
Folding and unfolding outdated diffs on pull requests
|
||||||
|
-----------------------------------------------------
|
||||||
|
|
||||||
|
GitHub hides discussions on PRs when the corresponding lines of code have been
|
||||||
|
changed in the meantime. This `userscript
|
||||||
|
<https://raw.githubusercontent.com/lesteve/userscripts/master/github-expand-all.user.js>`__
|
||||||
|
provides a shortcut (Control-Alt-P at the time of writing but look at the code
|
||||||
|
to be sure) to unfold all such hidden discussions at once, so you can catch up.
|
||||||
|
|
||||||
|
Checking out pull requests as remote-tracking branches
|
||||||
|
------------------------------------------------------
|
||||||
|
|
||||||
|
In your local fork, add to your ``.git/config``, under the ``[remote
|
||||||
|
"upstream"]`` heading, the line::
|
||||||
|
|
||||||
|
fetch = +refs/pull/*/head:refs/remotes/upstream/pr/*
|
||||||
|
|
||||||
|
You may then use ``git checkout pr/PR_NUMBER`` to navigate to the code of the
|
||||||
|
pull-request with the given number. (`Read more in this gist.
|
||||||
|
<https://gist.github.com/piscisaureus/3342247>`_)
|
||||||
|
|
||||||
|
Display code coverage in pull requests
|
||||||
|
--------------------------------------
|
||||||
|
|
||||||
|
To overlay the code coverage reports generated by the CodeCov continuous
|
||||||
|
integration, consider `this browser extension
|
||||||
|
<https://github.com/codecov/browser-extension>`_. The coverage of each line
|
||||||
|
will be displayed as a color background behind the line number.
|
||||||
|
|
||||||
|
|
||||||
|
.. _pytest_tips:
|
||||||
|
|
||||||
|
Useful pytest aliases and flags
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
The full test suite takes fairly long to run. For faster iterations,
|
||||||
|
it is possible to select a subset of tests using pytest selectors.
|
||||||
|
In particular, one can run a `single test based on its node ID
|
||||||
|
<https://docs.pytest.org/en/latest/example/markers.html#selecting-tests-based-on-their-node-id>`_:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
pytest -v sklearn/linear_model/tests/test_logistic.py::test_sparsify
|
||||||
|
|
||||||
|
or use the `-k pytest parameter
|
||||||
|
<https://docs.pytest.org/en/latest/example/markers.html#using-k-expr-to-select-tests-based-on-their-name>`_
|
||||||
|
to select tests based on their name. For instance:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
pytest sklearn/tests/test_common.py -v -k LogisticRegression
|
||||||
|
|
||||||
|
will run all :term:`common tests` for the ``LogisticRegression`` estimator.
|
||||||
|
|
||||||
|
When a unit test fails, the following tricks can make debugging easier:
|
||||||
|
|
||||||
|
1. The command line argument ``pytest -l`` instructs pytest to print the local
|
||||||
|
variables when a failure occurs.
|
||||||
|
|
||||||
|
2. The argument ``pytest --pdb`` drops into the Python debugger on failure. To
|
||||||
|
instead drop into the rich IPython debugger ``ipdb``, you may set up a
|
||||||
|
shell alias to:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
pytest --pdbcls=IPython.terminal.debugger:TerminalPdb --capture no
|
||||||
|
|
||||||
|
Other `pytest` options that may become useful include:
|
||||||
|
|
||||||
|
- ``-x`` which exits on the first failed test,
|
||||||
|
- ``--lf`` to rerun the tests that failed on the previous run,
|
||||||
|
- ``--ff`` to rerun all previous tests, running the ones that failed first,
|
||||||
|
- ``-s`` so that pytest does not capture the output of ``print()`` statements,
|
||||||
|
- ``--tb=short`` or ``--tb=line`` to control the length of the logs,
|
||||||
|
- ``--runxfail`` also run tests marked as a known failure (XFAIL) and report errors.
|
||||||
|
|
||||||
|
Since our continuous integration tests will error if
|
||||||
|
``FutureWarning`` isn't properly caught,
|
||||||
|
it is also recommended to run ``pytest`` along with the
|
||||||
|
``-Werror::FutureWarning`` flag.
|
||||||
|
|
||||||
|
.. _saved_replies:
|
||||||
|
|
||||||
|
Standard replies for reviewing
|
||||||
|
------------------------------
|
||||||
|
|
||||||
|
It may be helpful to store some of these in GitHub's `saved
|
||||||
|
replies <https://github.com/settings/replies/>`_ for reviewing:
|
||||||
|
|
||||||
|
.. highlight:: none
|
||||||
|
|
||||||
|
..
|
||||||
|
Note that putting this content on a single line in a literal is the easiest way to make it copyable and wrapped on screen.
|
||||||
|
|
||||||
|
Issue: Usage questions
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
You are asking a usage question. The issue tracker is for bugs and new features. For usage questions, it is recommended to try [Stack Overflow](https://stackoverflow.com/questions/tagged/scikit-learn) or [the Mailing List](https://mail.python.org/mailman/listinfo/scikit-learn).
|
||||||
|
|
||||||
|
Unfortunately, we need to close this issue as this issue tracker is a communication tool used for the development of scikit-learn. The additional activity created by usage questions crowds it too much and impedes this development. The conversation can continue here, however there is no guarantee that it will receive attention from core developers.
|
||||||
|
|
||||||
|
|
||||||
|
Issue: You're welcome to update the docs
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please feel free to offer a pull request updating the documentation if you feel it could be improved.
|
||||||
|
|
||||||
|
Issue: Self-contained example for bug
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please provide [self-contained example code](https://scikit-learn.org/dev/developers/minimal_reproducer.html), including imports and data (if possible), so that other contributors can just run it and reproduce your issue. Ideally your example code should be minimal.
|
||||||
|
|
||||||
|
Issue: Software versions
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
To help diagnose your issue, please paste the output of:
|
||||||
|
```py
|
||||||
|
import sklearn; sklearn.show_versions()
|
||||||
|
```
|
||||||
|
Thanks.
|
||||||
|
|
||||||
|
Issue: Code blocks
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Readability can be greatly improved if you [format](https://help.github.com/articles/creating-and-highlighting-code-blocks/) your code snippets and complete error messages appropriately. For example:
|
||||||
|
|
||||||
|
```python
|
||||||
|
print(something)
|
||||||
|
```
|
||||||
|
|
||||||
|
generates:
|
||||||
|
|
||||||
|
```python
|
||||||
|
print(something)
|
||||||
|
```
|
||||||
|
|
||||||
|
And:
|
||||||
|
|
||||||
|
```pytb
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in <module>
|
||||||
|
ImportError: No module named 'hello'
|
||||||
|
```
|
||||||
|
|
||||||
|
generates:
|
||||||
|
|
||||||
|
```pytb
|
||||||
|
Traceback (most recent call last):
|
||||||
|
File "<stdin>", line 1, in <module>
|
||||||
|
ImportError: No module named 'hello'
|
||||||
|
```
|
||||||
|
|
||||||
|
You can edit your issue descriptions and comments at any time to improve readability. This helps maintainers a lot. Thanks!
|
||||||
|
|
||||||
|
Issue/Comment: Linking to code
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Friendly advice: for clarity's sake, you can link to code like [this](https://help.github.com/articles/creating-a-permanent-link-to-a-code-snippet/).
|
||||||
|
|
||||||
|
Issue/Comment: Linking to comments
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please use links to comments, which make it a lot easier to see what you are referring to, rather than just linking to the issue. See [this](https://stackoverflow.com/questions/25163598/how-do-i-reference-a-specific-issue-comment-on-github) for more details.
|
||||||
|
|
||||||
|
PR-NEW: Better description and title
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Thanks for the pull request! Please make the title of the PR more descriptive. The title will become the commit message when this is merged. You should state what issue (or PR) it fixes/resolves in the description using the syntax described [here](https://scikit-learn.org/dev/developers/contributing.html#contributing-pull-requests).
|
||||||
|
|
||||||
|
PR-NEW: Fix #
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please use "Fix #issueNumber" in your PR description (and you can do it more than once). This way the associated issue gets closed automatically when the PR is merged. For more details, look at [this](https://github.com/blog/1506-closing-issues-via-pull-requests).
|
||||||
|
|
||||||
|
PR-NEW or Issue: Maintenance cost
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Every feature we include has a [maintenance cost](https://scikit-learn.org/dev/faq.html#why-are-you-so-selective-on-what-algorithms-you-include-in-scikit-learn). Our maintainers are mostly volunteers. For a new feature to be included, we need evidence that it is often useful and, ideally, [well-established](https://scikit-learn.org/dev/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) in the literature or in practice. Also, we expect PR authors to take part in the maintenance for the code they submit, at least initially. That doesn't stop you implementing it for yourself and publishing it in a separate repository, or even [scikit-learn-contrib](https://scikit-learn-contrib.github.io).
|
||||||
|
|
||||||
|
PR-WIP: What's needed before merge?
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please clarify (perhaps as a TODO list in the PR description) what work you believe still needs to be done before it can be reviewed for merge. When it is ready, please prefix the PR title with `[MRG]`.
|
||||||
|
|
||||||
|
PR-WIP: Regression test needed
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please add a [non-regression test](https://en.wikipedia.org/wiki/Non-regression_testing) that would fail at main but pass in this PR.
|
||||||
|
|
||||||
|
PR-WIP: PEP8
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
You have some [PEP8](https://www.python.org/dev/peps/pep-0008/) violations, whose details you can see in the Circle CI `lint` job. It might be worth configuring your code editor to check for such errors on the fly, so you can catch them before committing.
|
||||||
|
|
||||||
|
PR-MRG: Patience
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Before merging, we generally require two core developers to agree that your pull request is desirable and ready. [Please be patient](https://scikit-learn.org/dev/faq.html#why-is-my-pull-request-not-getting-any-attention), as we mostly rely on volunteered time from busy core developers. (You are also welcome to help us out with [reviewing other PRs](https://scikit-learn.org/dev/developers/contributing.html#code-review-guidelines).)
|
||||||
|
|
||||||
|
PR-MRG: Add to what's new
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please add an entry to the change log at `doc/whats_new/v*.rst`. Like the other entries there, please reference this pull request with `:pr:` and credit yourself (and other contributors if applicable) with `:user:`.
|
||||||
|
|
||||||
|
PR: Don't change unrelated
|
||||||
|
|
||||||
|
::
|
||||||
|
|
||||||
|
Please do not change unrelated lines. It makes your contribution harder to review and may introduce merge conflicts to other pull requests.
|
||||||
|
|
||||||
|
.. highlight:: default
|
||||||
|
|
||||||
|
Debugging memory errors in Cython with valgrind
|
||||||
|
===============================================
|
||||||
|
|
||||||
|
While Python/NumPy's built-in memory management is relatively robust, it can
|
||||||
|
lead to performance penalties for some routines. For this reason, much of
|
||||||
|
the high-performance code in scikit-learn is written in cython. This
|
||||||
|
performance gain comes with a tradeoff, however: it is very easy for memory
|
||||||
|
bugs to crop up in cython code, especially in situations where that code
|
||||||
|
relies heavily on pointer arithmetic.
|
||||||
|
|
||||||
|
Memory errors can manifest themselves a number of ways. The easiest ones to
|
||||||
|
debug are often segmentation faults and related glibc errors. Uninitialized
|
||||||
|
variables can lead to unexpected behavior that is difficult to track down.
|
||||||
|
A very useful tool when debugging these sorts of errors is
|
||||||
|
valgrind_.
|
||||||
|
|
||||||
|
|
||||||
|
Valgrind is a command-line tool that can trace memory errors in a variety of
|
||||||
|
code. Follow these steps:
|
||||||
|
|
||||||
|
1. Install `valgrind`_ on your system.
|
||||||
|
|
||||||
|
2. Download the python valgrind suppression file: `valgrind-python.supp`_.
|
||||||
|
|
||||||
|
3. Follow the directions in the `README.valgrind`_ file to customize your
|
||||||
|
Python suppressions. If you don't, you will have spurious output
|
||||||
|
related to the Python interpreter instead of your own code.
|
||||||
|
|
||||||
|
4. Run valgrind as follows:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
valgrind -v --suppressions=valgrind-python.supp python my_test_script.py
|
||||||
|
|
||||||
|
.. _valgrind: https://valgrind.org
|
||||||
|
.. _`README.valgrind`: https://github.com/python/cpython/blob/master/Misc/README.valgrind
|
||||||
|
.. _`valgrind-python.supp`: https://github.com/python/cpython/blob/master/Misc/valgrind-python.supp
|
||||||
|
|
||||||
|
|
||||||
|
The result will be a list of all the memory-related errors, which reference
|
||||||
|
lines in the C-code generated by cython from your .pyx file. If you examine
|
||||||
|
the referenced lines in the .c file, you will see comments which indicate the
|
||||||
|
corresponding location in your .pyx source file. Hopefully the output will
|
||||||
|
give you clues as to the source of your memory error.
|
||||||
|
|
||||||
|
For more information on valgrind and the array of options it has, see the
|
||||||
|
tutorials and documentation on the `valgrind web site <https://valgrind.org>`_.
|
||||||
|
|
||||||
|
.. _arm64_dev_env:
|
||||||
|
|
||||||
|
Building and testing for the ARM64 platform on a x86_64 machine
|
||||||
|
===============================================================
|
||||||
|
|
||||||
|
ARM-based machines are a popular target for mobile, edge or other low-energy
|
||||||
|
deployments (including in the cloud, for instance on Scaleway or AWS Graviton).
|
||||||
|
|
||||||
|
Here are instructions to set up a local dev environment to reproduce
|
||||||
|
ARM-specific bugs or test failures on a x86_64 host laptop or workstation. This
|
||||||
|
is based on QEMU user mode emulation using docker for convenience (see
|
||||||
|
https://github.com/multiarch/qemu-user-static).
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
The following instructions are illustrated for ARM64 but they also apply to
|
||||||
|
ppc64le, after changing the Docker image and Miniforge paths appropriately.
|
||||||
|
|
||||||
|
Prepare a folder on the host filesystem and download the necessary tools and
|
||||||
|
source code:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
mkdir arm64
|
||||||
|
pushd arm64
|
||||||
|
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-aarch64.sh
|
||||||
|
git clone https://github.com/scikit-learn/scikit-learn.git
|
||||||
|
|
||||||
|
Use docker to install QEMU user mode and run an ARM64v8 container with access
|
||||||
|
to your shared folder under the `/io` mount point:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
|
||||||
|
docker run -v`pwd`:/io --rm -it arm64v8/ubuntu /bin/bash
|
||||||
|
|
||||||
|
In the container, install miniforge3 for the ARM64 (a.k.a. aarch64)
|
||||||
|
architecture:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
bash Miniforge3-Linux-aarch64.sh
|
||||||
|
# Choose to install miniforge3 under: `/io/miniforge3`
|
||||||
|
|
||||||
|
Whenever you restart a new container, you will need to reinit the conda env
|
||||||
|
previously installed under `/io/miniforge3`:
|
||||||
|
|
||||||
|
.. prompt:: bash $
|
||||||
|
|
||||||
|
/io/miniforge3/bin/conda init
|
||||||
|
source /root/.bashrc
|
||||||
|
|
||||||
|
as the `/root` home folder is part of the ephemeral docker container. Every
|
||||||
|
file or directory stored under `/io`, on the other hand, is persistent.
|
||||||
|
|
||||||
|
You can then build scikit-learn as usual (you will need to install compiler
|
||||||
|
tools and dependencies using apt or conda). Building scikit-learn
|
||||||
|
takes a lot of time because of the emulation layer; however, it needs to be
|
||||||
|
done only once if you put the scikit-learn folder under the `/io` mount
|
||||||
|
point.
|
||||||
|
|
||||||
|
Then use pytest to run only the tests of the module you are interested in
|
||||||
|
debugging.
|
||||||
|
|
||||||
|
.. _meson_build_backend:
|
||||||
|
|
||||||
|
The Meson Build Backend
|
||||||
|
=======================
|
||||||
|
|
||||||
|
Since scikit-learn 1.5.0 we use meson-python as the build tool. Meson is
|
||||||
|
a new tool for scikit-learn and the PyData ecosystem. It is used by several
|
||||||
|
other packages that have written good guides about what it is and how it works.
|
||||||
|
|
||||||
|
- `pandas setup doc
|
||||||
|
<https://pandas.pydata.org/docs/development/contributing_environment.html#step-3-build-and-install-pandas>`_:
|
||||||
|
pandas has a setup similar to ours (no spin or dev.py)
|
||||||
|
- `scipy Meson doc
|
||||||
|
<https://scipy.github.io/devdocs/building/understanding_meson.html>`_ gives
|
||||||
|
more background about how Meson works behind the scenes
|
|
@ -0,0 +1,242 @@
|
||||||
|
.. _developers-utils:
|
||||||
|
|
||||||
|
========================
|
||||||
|
Utilities for Developers
|
||||||
|
========================
|
||||||
|
|
||||||
|
Scikit-learn contains a number of utilities to help with development. These are
|
||||||
|
located in :mod:`sklearn.utils`, and include tools in a number of categories.
|
||||||
|
All the following functions and classes are in the module :mod:`sklearn.utils`.
|
||||||
|
|
||||||
|
.. warning ::
|
||||||
|
|
||||||
|
These utilities are meant to be used internally within the scikit-learn
|
||||||
|
package. They are not guaranteed to be stable between versions of
|
||||||
|
scikit-learn. Backports, in particular, will be removed as the scikit-learn
|
||||||
|
dependencies evolve.
|
||||||
|
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn.utils
|
||||||
|
|
||||||
|
Validation Tools
|
||||||
|
================
|
||||||
|
|
||||||
|
These are tools used to check and validate input. When you write a function
|
||||||
|
which accepts arrays, matrices, or sparse matrices as arguments, the following
|
||||||
|
should be used when applicable.
|
||||||
|
|
||||||
|
- :func:`assert_all_finite`: Throw an error if array contains NaNs or Infs.
|
||||||
|
|
||||||
|
- :func:`as_float_array`: convert input to an array of floats. If a sparse
|
||||||
|
matrix is passed, a sparse matrix will be returned.
|
||||||
|
|
||||||
|
- :func:`check_array`: check that input is a 2D array, raise error on sparse
|
||||||
|
matrices. Allowed sparse matrix formats can be given optionally, as well as
|
||||||
|
allowing 1D or N-dimensional arrays. Calls :func:`assert_all_finite` by
|
||||||
|
default.
|
||||||
|
|
||||||
|
- :func:`check_X_y`: check that X and y have consistent length, calls
|
||||||
|
check_array on X, and column_or_1d on y. For multilabel classification or
|
||||||
|
multitarget regression, specify multi_output=True, in which case check_array
|
||||||
|
will be called on y.
|
||||||
|
|
||||||
|
- :func:`indexable`: check that all input arrays have consistent length and can
|
||||||
|
be sliced or indexed using safe_index. This is used to validate input for
|
||||||
|
cross-validation.
|
||||||
|
|
||||||
|
- :func:`validation.check_memory` checks that input is ``joblib.Memory``-like,
|
||||||
|
which means that it can be converted into a
|
||||||
|
``joblib.Memory`` instance (typically a str denoting
|
||||||
|
the ``cachedir``) or has the same interface.
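For illustration, here is a minimal sketch (with arbitrary toy values) of how
:func:`check_array` and :func:`check_X_y` are typically used at the start of a
``fit`` method::

    >>> from sklearn.utils import check_array, check_X_y
    >>> X = [[1, 2], [3, 4], [5, 6]]
    >>> y = [0, 1, 0]
    >>> # check_array converts the nested list into a validated 2D ndarray
    >>> check_array(X).shape
    (3, 2)
    >>> # check_X_y additionally checks that X and y have consistent lengths
    >>> X_checked, y_checked = check_X_y(X, y)
    >>> y_checked
    array([0, 1, 0])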
|
||||||
|
|
||||||
|
If your code relies on a random number generator, it should never use
|
||||||
|
functions like ``numpy.random.random`` or ``numpy.random.normal``. This
|
||||||
|
approach can lead to repeatability issues in unit tests. Instead, a
|
||||||
|
``numpy.random.RandomState`` object should be used, which is built from
|
||||||
|
a ``random_state`` argument passed to the class or function. The function
|
||||||
|
:func:`check_random_state`, below, can then be used to create a random
|
||||||
|
number generator object.
|
||||||
|
|
||||||
|
- :func:`check_random_state`: create a ``np.random.RandomState`` object from
|
||||||
|
a parameter ``random_state``.
|
||||||
|
|
||||||
|
- If ``random_state`` is ``None`` or ``np.random``, then a
|
||||||
|
randomly-initialized ``RandomState`` object is returned.
|
||||||
|
- If ``random_state`` is an integer, then it is used to seed a new
|
||||||
|
``RandomState`` object.
|
||||||
|
- If ``random_state`` is a ``RandomState`` object, then it is passed through.
|
||||||
|
|
||||||
|
For example::
|
||||||
|
|
||||||
|
>>> from sklearn.utils import check_random_state
|
||||||
|
>>> random_state = 0
|
||||||
|
>>> random_state = check_random_state(random_state)
|
||||||
|
>>> random_state.rand(4)
|
||||||
|
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318])
|
||||||
|
|
||||||
|
When developing your own scikit-learn compatible estimator, the following
|
||||||
|
helpers are available.
|
||||||
|
|
||||||
|
- :func:`validation.check_is_fitted`: check that the estimator has been fitted
|
||||||
|
before calling ``transform``, ``predict``, or similar methods. This helper
|
||||||
|
makes it possible to raise a standardized error message across estimators (a short sketch follows this list).
|
||||||
|
|
||||||
|
- :func:`validation.has_fit_parameter`: check that a given parameter is
|
||||||
|
supported in the ``fit`` method of a given estimator.
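A minimal sketch of both helpers on a made-up estimator (the class below is
purely illustrative)::

    >>> from sklearn.base import BaseEstimator
    >>> from sklearn.utils.validation import check_is_fitted, has_fit_parameter
    >>> class TinyEstimator(BaseEstimator):
    ...     def fit(self, X, y, sample_weight=None):
    ...         self.is_fitted_ = True   # trailing underscore marks a fitted attribute
    ...         return self
    ...     def predict(self, X):
    ...         check_is_fitted(self)    # raises NotFittedError if fit was never called
    ...         return [0] * len(X)
    >>> has_fit_parameter(TinyEstimator(), "sample_weight")
    True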
|
||||||
|
|
||||||
|
Efficient Linear Algebra & Array Operations
|
||||||
|
===========================================
|
||||||
|
|
||||||
|
- :func:`extmath.randomized_range_finder`: construct an orthonormal matrix
|
||||||
|
whose range approximates the range of the input. This is used in
|
||||||
|
:func:`extmath.randomized_svd`, below.
|
||||||
|
|
||||||
|
- :func:`extmath.randomized_svd`: compute the k-truncated randomized SVD.
|
||||||
|
This algorithm finds the exact truncated singular value decomposition
|
||||||
|
using randomization to speed up the computations. It is particularly
|
||||||
|
fast on large matrices on which you wish to extract only a small
|
||||||
|
number of components.
|
||||||
|
|
||||||
|
- `arrayfuncs.cholesky_delete`:
|
||||||
|
(used in :func:`~sklearn.linear_model.lars_path`) Remove an
|
||||||
|
item from a Cholesky factorization.
|
||||||
|
|
||||||
|
- :func:`arrayfuncs.min_pos`: (used in ``sklearn.linear_model.least_angle``)
|
||||||
|
Find the minimum of the positive values within an array.
|
||||||
|
|
||||||
|
|
||||||
|
- :func:`extmath.fast_logdet`: efficiently compute the log of the determinant
|
||||||
|
of a matrix.
|
||||||
|
|
||||||
|
- :func:`extmath.density`: efficiently compute the density of a sparse vector
|
||||||
|
|
||||||
|
- :func:`extmath.safe_sparse_dot`: dot product which will correctly handle
|
||||||
|
``scipy.sparse`` inputs. If the inputs are dense, it is equivalent to
|
||||||
|
``numpy.dot``.
|
||||||
|
|
||||||
|
- :func:`extmath.weighted_mode`: an extension of ``scipy.stats.mode`` which
|
||||||
|
allows each item to have a real-valued weight.
|
||||||
|
|
||||||
|
- :func:`resample`: Resample arrays or sparse matrices in a consistent way.
|
||||||
|
Used in :func:`shuffle`, below.
|
||||||
|
|
||||||
|
- :func:`shuffle`: Shuffle arrays or sparse matrices in a consistent way.
|
||||||
|
Used in :func:`~sklearn.cluster.k_means`.
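For instance, a small sketch of :func:`extmath.safe_sparse_dot` and
:func:`extmath.randomized_svd` on randomly generated data (shapes chosen
arbitrarily)::

    >>> import numpy as np
    >>> from scipy import sparse
    >>> from sklearn.utils.extmath import safe_sparse_dot, randomized_svd
    >>> rng = np.random.RandomState(0)
    >>> A = sparse.random(30, 20, density=0.1, random_state=rng)
    >>> B = rng.rand(20, 5)
    >>> safe_sparse_dot(A, B).shape   # handles the sparse left operand transparently
    (30, 5)
    >>> U, s, Vt = randomized_svd(A, n_components=3, random_state=0)
    >>> U.shape, s.shape, Vt.shape
    ((30, 3), (3,), (3, 20))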
|
||||||
|
|
||||||
|
|
||||||
|
Efficient Random Sampling
|
||||||
|
=========================
|
||||||
|
|
||||||
|
- :func:`random.sample_without_replacement`: implements efficient algorithms
|
||||||
|
for sampling ``n_samples`` integers from a population of size ``n_population``
|
||||||
|
without replacement.
|
||||||
|
|
||||||
|
|
||||||
|
Efficient Routines for Sparse Matrices
|
||||||
|
======================================
|
||||||
|
|
||||||
|
The ``sklearn.utils.sparsefuncs`` cython module hosts compiled extensions to
|
||||||
|
efficiently process ``scipy.sparse`` data.
|
||||||
|
|
||||||
|
- :func:`sparsefuncs.mean_variance_axis`: compute the means and
|
||||||
|
variances along a specified axis of a CSR matrix.
|
||||||
|
Used for normalizing the tolerance stopping criterion in
|
||||||
|
:class:`~sklearn.cluster.KMeans`.
|
||||||
|
|
||||||
|
- :func:`sparsefuncs_fast.inplace_csr_row_normalize_l1` and
|
||||||
|
:func:`sparsefuncs_fast.inplace_csr_row_normalize_l2`: can be used to normalize
|
||||||
|
individual sparse samples to unit L1 or L2 norm as done in
|
||||||
|
:class:`~sklearn.preprocessing.Normalizer`.
|
||||||
|
|
||||||
|
- :func:`sparsefuncs.inplace_csr_column_scale`: can be used to multiply the
|
||||||
|
columns of a CSR matrix by a constant scale (one scale per column).
|
||||||
|
Used for scaling features to unit standard deviation in
|
||||||
|
:class:`~sklearn.preprocessing.StandardScaler`.
|
||||||
|
|
||||||
|
- :func:`~sklearn.neighbors.sort_graph_by_row_values`: can be used to sort a
|
||||||
|
CSR sparse matrix such that each row is stored with increasing values. This
|
||||||
|
is useful to improve efficiency when using precomputed sparse distance
|
||||||
|
matrices in estimators relying on nearest neighbors graph.
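A short illustrative sketch (with arbitrary values) of the mean/variance
computation and the in-place row normalization::

    >>> import numpy as np
    >>> from scipy import sparse
    >>> from sklearn.utils.sparsefuncs import mean_variance_axis
    >>> from sklearn.utils.sparsefuncs_fast import inplace_csr_row_normalize_l2
    >>> X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))
    >>> means, variances = mean_variance_axis(X, axis=0)
    >>> means
    array([0.5, 1.5, 1. ])
    >>> inplace_csr_row_normalize_l2(X)   # X now has unit L2-norm rows
    >>> np.allclose(np.asarray(X.multiply(X).sum(axis=1)).ravel(), 1.0)
    True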
|
||||||
|
|
||||||
|
|
||||||
|
Graph Routines
|
||||||
|
==============
|
||||||
|
|
||||||
|
- :func:`graph.single_source_shortest_path_length`:
|
||||||
|
(not currently used in scikit-learn)
|
||||||
|
Return the shortest path from a single source
|
||||||
|
to all connected nodes on a graph. Code is adapted from `networkx
|
||||||
|
<https://networkx.github.io/>`_.
|
||||||
|
If this is ever needed again, it would be far faster to use a single
|
||||||
|
iteration of Dijkstra's algorithm from ``graph_shortest_path``.
|
||||||
|
|
||||||
|
|
||||||
|
Testing Functions
|
||||||
|
=================
|
||||||
|
|
||||||
|
- :func:`discovery.all_estimators` : returns a list of all estimators in
|
||||||
|
scikit-learn to test for consistent behavior and interfaces.
|
||||||
|
|
||||||
|
- :func:`discovery.all_displays` : returns a list of all displays (related to
|
||||||
|
plotting API) in scikit-learn to test for consistent behavior and interfaces.
|
||||||
|
|
||||||
|
- :func:`discovery.all_functions` : returns a list of all functions in
|
||||||
|
scikit-learn to test for consistent behavior and interfaces.
|
||||||
|
|
||||||
|
Multiclass and multilabel utility functions
|
||||||
|
===========================================
|
||||||
|
|
||||||
|
- :func:`multiclass.is_multilabel`: Helper function to check if the task
|
||||||
|
is a multi-label classification one.
|
||||||
|
|
||||||
|
- :func:`multiclass.unique_labels`: Helper function to extract an ordered
|
||||||
|
array of unique labels from different formats of target.
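A brief sketch of both helpers on toy targets::

    >>> import numpy as np
    >>> from sklearn.utils.multiclass import is_multilabel, unique_labels
    >>> unique_labels([3, 5, 5, 7], [5, 7, 9])
    array([3, 5, 7, 9])
    >>> is_multilabel(np.array([[1, 0, 1], [0, 1, 0]]))   # label indicator matrix
    True
    >>> is_multilabel(np.array([0, 1, 2]))                # plain multiclass target
    False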
|
||||||
|
|
||||||
|
|
||||||
|
Helper Functions
|
||||||
|
================
|
||||||
|
|
||||||
|
- :class:`gen_even_slices`: generator to create ``n``-packs of slices going up
|
||||||
|
to ``n``. Used in :func:`~sklearn.decomposition.dict_learning` and
|
||||||
|
:func:`~sklearn.cluster.k_means`.
|
||||||
|
|
||||||
|
- :class:`gen_batches`: generator to create slices containing ``batch_size`` elements
|
||||||
|
from 0 to ``n``.
|
||||||
|
|
||||||
|
- :func:`safe_mask`: Helper function to convert a mask to the format expected
|
||||||
|
by the numpy array or scipy sparse matrix on which to use it (sparse
|
||||||
|
matrices support integer indices only while numpy arrays support both
|
||||||
|
boolean masks and integer indices).
|
||||||
|
|
||||||
|
- :func:`safe_sqr`: Helper function for unified squaring (``**2``) of
|
||||||
|
array-likes, matrices and sparse matrices.
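For illustration, a small sketch of these helpers (array contents are arbitrary)::

    >>> import numpy as np
    >>> from sklearn.utils import gen_batches, safe_mask, safe_sqr
    >>> list(gen_batches(7, 3))      # slices of at most 3 elements covering range(7)
    [slice(0, 3, None), slice(3, 6, None), slice(6, 7, None)]
    >>> X = np.arange(10).reshape(5, 2)
    >>> mask = np.array([True, False, True, False, False])
    >>> X[safe_mask(X, mask)].shape  # a boolean mask is fine for a dense array
    (2, 2)
    >>> safe_sqr(np.array([-2, 3]))
    array([4, 9])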
|
||||||
|
|
||||||
|
|
||||||
|
Hash Functions
|
||||||
|
==============
|
||||||
|
|
||||||
|
- :func:`murmurhash3_32` provides a python wrapper for the
|
||||||
|
``MurmurHash3_x86_32`` C++ non-cryptographic hash function. This hash
|
||||||
|
function is suitable for implementing lookup tables, Bloom filters,
|
||||||
|
Count Min Sketch, feature hashing and implicitly defined sparse
|
||||||
|
random projections::
|
||||||
|
|
||||||
|
>>> from sklearn.utils import murmurhash3_32
|
||||||
|
>>> murmurhash3_32("some feature", seed=0) == -384616559
|
||||||
|
True
|
||||||
|
|
||||||
|
>>> murmurhash3_32("some feature", seed=0, positive=True) == 3910350737
|
||||||
|
True
|
||||||
|
|
||||||
|
The ``sklearn.utils.murmurhash`` module can also be "cimported" from
|
||||||
|
other cython modules so as to benefit from the high performance of
|
||||||
|
MurmurHash while skipping the overhead of the Python interpreter.
|
||||||
|
|
||||||
|
|
||||||
|
Warnings and Exceptions
|
||||||
|
=======================
|
||||||
|
|
||||||
|
- :class:`deprecated`: Decorator to mark a function or class as deprecated.
|
||||||
|
|
||||||
|
- :class:`~sklearn.exceptions.ConvergenceWarning`: Custom warning to catch
|
||||||
|
convergence problems. Used in ``sklearn.covariance.graphical_lasso``.
|
|
@ -0,0 +1,8 @@
|
||||||
|
===========
|
||||||
|
Dispatching
|
||||||
|
===========
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
|
||||||
|
modules/array_api
|
|
@ -0,0 +1,20 @@
|
||||||
|
.. raw :: html
|
||||||
|
|
||||||
|
<!-- Generated by generate_authors_table.py -->
|
||||||
|
<div class="sk-authors-container">
|
||||||
|
<style>
|
||||||
|
img.avatar {border-radius: 10px;}
|
||||||
|
</style>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/ArturoAmorQ'><img src='https://avatars.githubusercontent.com/u/86408019?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Arturo Amor</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/lucyleeow'><img src='https://avatars.githubusercontent.com/u/23182829?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Lucy Liu</p>
|
||||||
|
</div>
|
||||||
|
<div>
|
||||||
|
<a href='https://github.com/Charlie-XIAO'><img src='https://avatars.githubusercontent.com/u/108576690?v=4' class='avatar' /></a> <br />
|
||||||
|
<p>Yao Xiao</p>
|
||||||
|
</div>
|
||||||
|
</div>
|
|
@ -0,0 +1,530 @@
|
||||||
|
.. raw:: html
|
||||||
|
|
||||||
|
<style>
|
||||||
|
/* h3 headings on this page are the questions; make them rubric-like */
|
||||||
|
h3 {
|
||||||
|
font-size: 1rem;
|
||||||
|
font-weight: bold;
|
||||||
|
padding-bottom: 0.2rem;
|
||||||
|
margin: 2rem 0 1.15rem 0;
|
||||||
|
border-bottom: 1px solid var(--pst-color-border);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Increase top margin for first question in each section */
|
||||||
|
h2 + section > h3 {
|
||||||
|
margin-top: 2.5rem;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Make the headerlinks a bit more visible */
|
||||||
|
h3 > a.headerlink {
|
||||||
|
font-size: 0.9rem;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Remove the backlink decoration on the titles */
|
||||||
|
h2 > a.toc-backref,
|
||||||
|
h3 > a.toc-backref {
|
||||||
|
text-decoration: none;
|
||||||
|
}
|
||||||
|
</style>
|
||||||
|
|
||||||
|
.. _faq:
|
||||||
|
|
||||||
|
==========================
|
||||||
|
Frequently Asked Questions
|
||||||
|
==========================
|
||||||
|
|
||||||
|
.. currentmodule:: sklearn
|
||||||
|
|
||||||
|
Here we try to give some answers to questions that regularly pop up on the mailing list.
|
||||||
|
|
||||||
|
.. contents:: Table of Contents
|
||||||
|
:local:
|
||||||
|
:depth: 2
|
||||||
|
|
||||||
|
|
||||||
|
About the project
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
What is the project name (a lot of people get it wrong)?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
scikit-learn, but not scikit or SciKit nor sci-kit learn.
|
||||||
|
Also not scikits.learn or scikits-learn, which were previously used.
|
||||||
|
|
||||||
|
How do you pronounce the project name?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
sy-kit learn. sci stands for science!
|
||||||
|
|
||||||
|
Why scikit?
|
||||||
|
^^^^^^^^^^^
|
||||||
|
There are multiple scikits, which are scientific toolboxes built around SciPy.
|
||||||
|
Apart from scikit-learn, another popular one is `scikit-image <https://scikit-image.org/>`_.
|
||||||
|
|
||||||
|
Do you support PyPy?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
scikit-learn is regularly tested and maintained to work with
|
||||||
|
`PyPy <https://pypy.org/>`_ (an alternative Python implementation with
|
||||||
|
a built-in just-in-time compiler).
|
||||||
|
|
||||||
|
Note however that this support is still considered experimental and specific
|
||||||
|
components might behave slightly differently. Please refer to the test
|
||||||
|
suite of the specific module of interest for more details.
|
||||||
|
|
||||||
|
How can I obtain permission to use the images in scikit-learn for my work?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The images contained in the `scikit-learn repository
|
||||||
|
<https://github.com/scikit-learn/scikit-learn>`_ and the images generated within
|
||||||
|
the `scikit-learn documentation <https://scikit-learn.org/stable/index.html>`_
|
||||||
|
can be used via the `BSD 3-Clause License
|
||||||
|
<https://github.com/scikit-learn/scikit-learn?tab=BSD-3-Clause-1-ov-file>`_ for
|
||||||
|
your work. Citations of scikit-learn are highly encouraged and appreciated. See
|
||||||
|
:ref:`citing scikit-learn <citing-scikit-learn>`.
|
||||||
|
|
||||||
|
Implementation decisions
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
Why is there no support for deep or reinforcement learning? Will there be such support in the future?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Deep learning and reinforcement learning both require a rich vocabulary to
|
||||||
|
define an architecture, with deep learning additionally requiring
|
||||||
|
GPUs for efficient computing. However, neither of these fits within
|
||||||
|
the design constraints of scikit-learn. As a result, deep learning
|
||||||
|
and reinforcement learning are currently out of scope for what
|
||||||
|
scikit-learn seeks to achieve.
|
||||||
|
|
||||||
|
You can find more information about the addition of GPU support at
|
||||||
|
`Will you add GPU support?`_.
|
||||||
|
|
||||||
|
Note that scikit-learn currently implements a simple multilayer perceptron
|
||||||
|
in :mod:`sklearn.neural_network`. We will only accept bug fixes for this module.
|
||||||
|
If you want to implement more complex deep learning models, please turn to
|
||||||
|
popular deep learning frameworks such as
|
||||||
|
`tensorflow <https://www.tensorflow.org/>`_,
|
||||||
|
`keras <https://keras.io/>`_,
|
||||||
|
and `pytorch <https://pytorch.org/>`_.
|
||||||
|
|
||||||
|
.. _adding_graphical_models:
|
||||||
|
|
||||||
|
Will you add graphical models or sequence prediction to scikit-learn?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Not in the foreseeable future.
|
||||||
|
scikit-learn tries to provide a unified API for the basic tasks in machine
|
||||||
|
learning, with pipelines and meta-algorithms like grid search to tie
|
||||||
|
everything together. The concepts, APIs, algorithms and
|
||||||
|
expertise required for structured learning are different from what
|
||||||
|
scikit-learn has to offer. If we started doing arbitrary structured
|
||||||
|
learning, we'd need to redesign the whole package and the project
|
||||||
|
would likely collapse under its own weight.
|
||||||
|
|
||||||
|
There are two projects with APIs similar to scikit-learn that
|
||||||
|
do structured prediction:
|
||||||
|
|
||||||
|
* `pystruct <https://pystruct.github.io/>`_ handles general structured
|
||||||
|
learning (focuses on SSVMs on arbitrary graph structures with
|
||||||
|
approximate inference; defines the notion of sample as an instance of
|
||||||
|
the graph structure).
|
||||||
|
|
||||||
|
* `seqlearn <https://larsmans.github.io/seqlearn/>`_ handles sequences only
|
||||||
|
(focuses on exact inference; has HMMs, but mostly for the sake of
|
||||||
|
completeness; treats a feature vector as a sample and uses an offset encoding
|
||||||
|
for the dependencies between feature vectors).
|
||||||
|
|
||||||
|
Why did you remove HMMs from scikit-learn?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
See :ref:`adding_graphical_models`.
|
||||||
|
|
||||||
|
|
||||||
|
Will you add GPU support?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Adding GPU support by default would introduce heavy hardware-specific software
|
||||||
|
dependencies and existing algorithms would need to be reimplemented. This would
|
||||||
|
make it both harder for the average user to install scikit-learn and harder for
|
||||||
|
the developers to maintain the code.
|
||||||
|
|
||||||
|
However, since 2023, a limited but growing :ref:`list of scikit-learn
|
||||||
|
estimators <array_api_supported>` can already run on GPUs if the input data is
|
||||||
|
provided as a PyTorch or CuPy array and if scikit-learn has been configured to
|
||||||
|
accept such inputs as explained in :ref:`array_api`. This Array API support
|
||||||
|
allows scikit-learn to run on GPUs without introducing heavy and
|
||||||
|
hardware-specific software dependencies to the main package.
|
||||||
|
|
||||||
|
Most estimators that rely on NumPy for their computationally intensive operations
|
||||||
|
can be considered for Array API support and therefore GPU support.
|
||||||
|
|
||||||
|
However, not all scikit-learn estimators are amenable to efficiently running
|
||||||
|
on GPUs via the Array API for fundamental algorithmic reasons. For instance,
|
||||||
|
tree-based models currently implemented with Cython in scikit-learn are
|
||||||
|
fundamentally not array-based algorithms. Other algorithms such as k-means or
|
||||||
|
k-nearest neighbors rely on array-based algorithms but are also implemented in
|
||||||
|
Cython. Cython is used to manually interleave consecutive array operations to
|
||||||
|
avoid introducing performance-killing memory accesses to large intermediate
|
||||||
|
arrays: this low-level algorithmic rewrite is called "kernel fusion" and cannot
|
||||||
|
be expressed via the Array API for the foreseeable future.
|
||||||
|
|
||||||
|
Adding efficient GPU support to estimators that cannot be efficiently
|
||||||
|
implemented with the Array API would require designing and adopting a more
|
||||||
|
flexible extension system for scikit-learn. This possibility is being
|
||||||
|
considered in the following GitHub issue (under discussion):
|
||||||
|
|
||||||
|
- https://github.com/scikit-learn/scikit-learn/issues/22438
|
||||||
|
|
||||||
|
|
||||||
|
Why do categorical variables need preprocessing in scikit-learn, compared to other tools?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Most of scikit-learn assumes data is in NumPy arrays or SciPy sparse matrices
|
||||||
|
of a single numeric dtype. These do not explicitly represent categorical
|
||||||
|
variables at present. Thus, unlike R's ``data.frames`` or :class:`pandas.DataFrame`,
|
||||||
|
we require explicit conversion of categorical features to numeric values, as
|
||||||
|
discussed in :ref:`preprocessing_categorical_features`.
|
||||||
|
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py` for an
|
||||||
|
example of working with heterogeneous (e.g. categorical and numeric) data.
|
||||||
|
|
||||||
|
Why does scikit-learn not directly work with, for example, :class:`pandas.DataFrame`?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The homogeneous NumPy and SciPy data objects currently expected are most
|
||||||
|
efficient to process for most operations. Extensive work would also be needed
|
||||||
|
to support Pandas categorical types. Restricting input to homogeneous
|
||||||
|
types therefore reduces maintenance cost and encourages usage of efficient
|
||||||
|
data structures.
|
||||||
|
|
||||||
|
Note however that :class:`~sklearn.compose.ColumnTransformer` makes it
|
||||||
|
convenient to handle heterogeneous pandas dataframes by mapping homogeneous subsets of
|
||||||
|
dataframe columns selected by name or dtype to dedicated scikit-learn transformers.
|
||||||
|
Therefore :class:`~sklearn.compose.ColumnTransformer` is often used in the first
|
||||||
|
step of scikit-learn pipelines when dealing
|
||||||
|
with heterogeneous dataframes (see :ref:`pipeline` for more details).
|
||||||
|
|
||||||
|
See also :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`
|
||||||
|
for an example of working with heterogeneous (e.g. categorical and numeric) data.
|
||||||
|
|
||||||
|
Do you plan to implement transform for target ``y`` in a pipeline?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
Currently transform only works for features ``X`` in a pipeline. There's a
|
||||||
|
long-standing discussion about not being able to transform ``y`` in a pipeline.
|
||||||
|
Follow on GitHub issue :issue:`4143`. Meanwhile, you can check out
|
||||||
|
:class:`~compose.TransformedTargetRegressor`,
|
||||||
|
`pipegraph <https://github.com/mcasl/PipeGraph>`_,
|
||||||
|
and `imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
|
||||||
|
Note that scikit-learn already handles the case where ``y``
|
||||||
|
has an invertible transformation applied before training
|
||||||
|
and inverted after prediction. scikit-learn intends to solve for
|
||||||
|
use cases where ``y`` should be transformed at training time
|
||||||
|
and not at test time, for resampling and similar uses, like at
|
||||||
|
`imbalanced-learn <https://github.com/scikit-learn-contrib/imbalanced-learn>`_.
|
||||||
|
In general, these use cases can be solved
|
||||||
|
with a custom meta estimator rather than a :class:`~pipeline.Pipeline`.
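For example, a minimal sketch (with synthetic data) of the invertible-transformation
case handled by :class:`~compose.TransformedTargetRegressor`::

    >>> import numpy as np
    >>> from sklearn.compose import TransformedTargetRegressor
    >>> from sklearn.linear_model import LinearRegression
    >>> X = np.arange(1, 11).reshape(-1, 1)
    >>> y = np.exp(0.3 * X.ravel())            # target is exponential in the feature
    >>> reg = TransformedTargetRegressor(regressor=LinearRegression(),
    ...                                  func=np.log, inverse_func=np.exp)
    >>> reg = reg.fit(X, y)                    # y is log-transformed internally before fitting
    >>> float(reg.predict([[5]]).round(2)[0])  # predictions come back on the original scale
    4.48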
|
||||||
|
|
||||||
|
Why are there so many different estimators for linear models?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
Usually, there is one classifier and one regressor per model type, e.g.
|
||||||
|
:class:`~ensemble.GradientBoostingClassifier` and
|
||||||
|
:class:`~ensemble.GradientBoostingRegressor`. Both have similar options and
|
||||||
|
both have the parameter `loss`, which is especially useful in the regression
|
||||||
|
case as it enables the estimation of conditional mean as well as conditional
|
||||||
|
quantiles.
|
||||||
|
|
||||||
|
For linear models, there are many estimator classes which are very close to
|
||||||
|
each other. Let us have a look at
|
||||||
|
|
||||||
|
- :class:`~linear_model.LinearRegression`, no penalty
|
||||||
|
- :class:`~linear_model.Ridge`, L2 penalty
|
||||||
|
- :class:`~linear_model.Lasso`, L1 penalty (sparse models)
|
||||||
|
- :class:`~linear_model.ElasticNet`, L1 + L2 penalty (less sparse models)
|
||||||
|
- :class:`~linear_model.SGDRegressor` with `loss="squared_loss"`
|
||||||
|
|
||||||
|
**Maintainer perspective:**
|
||||||
|
They all do the same thing in principle and differ only in the penalty they
|
||||||
|
impose. This, however, has a large impact on the way the underlying
|
||||||
|
optimization problem is solved. In the end, this amounts to usage of different
|
||||||
|
methods and tricks from linear algebra. A special case is
|
||||||
|
:class:`~linear_model.SGDRegressor` which
|
||||||
|
comprises all 4 previous models and differs in its optimization procedure.
|
||||||
|
A further side effect is that the different estimators favor different data
|
||||||
|
layouts (`X` C-contiguous or F-contiguous, sparse csr or csc). This complexity
|
||||||
|
of the seemingly simple linear models is the reason for having different
|
||||||
|
estimator classes for different penalties.
|
||||||
|
|
||||||
|
**User perspective:**
|
||||||
|
First, the current design is inspired by the scientific literature where linear
|
||||||
|
regression models with different regularization/penalty were given different
|
||||||
|
names, e.g. *ridge regression*. Having different model classes with according
|
||||||
|
names makes it easier for users to find those regression models.
|
||||||
|
Secondly, if all the 5 above mentioned linear models were unified into a single
|
||||||
|
class, there would be parameters with a lot of options like the ``solver``
|
||||||
|
parameter. On top of that, there would be a lot of exclusive interactions
|
||||||
|
between different parameters. For example, the possible options of the
|
||||||
|
parameters ``solver``, ``precompute`` and ``selection`` would depend on the
|
||||||
|
chosen values of the penalty parameters ``alpha`` and ``l1_ratio``.
|
||||||
|
|
||||||
|
|
||||||
|
Contributing
|
||||||
|
------------
|
||||||
|
|
||||||
|
How can I contribute to scikit-learn?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
See :ref:`contributing`. Before adding a new algorithm, which is
|
||||||
|
usually a major and lengthy undertaking, it is recommended to start with
|
||||||
|
:ref:`known issues <new_contributors>`. Please do not contact the contributors
|
||||||
|
of scikit-learn directly regarding contributing to scikit-learn.
|
||||||
|
|
||||||
|
Why is my pull request not getting any attention?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
The scikit-learn review process takes a significant amount of time, and
|
||||||
|
contributors should not be discouraged by a lack of activity or review on
|
||||||
|
their pull request. We care a lot about getting things right
|
||||||
|
the first time, as maintenance and later change comes at a high cost.
|
||||||
|
We rarely release any "experimental" code, so all of our contributions
|
||||||
|
will be subject to high use immediately and should be of the highest
|
||||||
|
quality possible initially.
|
||||||
|
|
||||||
|
Beyond that, scikit-learn is limited in its reviewing bandwidth; many of the
|
||||||
|
reviewers and core developers are working on scikit-learn on their own time.
|
||||||
|
If a review of your pull request comes slowly, it is likely because the
|
||||||
|
reviewers are busy. We ask for your understanding and request that you
|
||||||
|
not close your pull request or discontinue your work solely because of
|
||||||
|
this reason.
|
||||||
|
|
||||||
|
.. _new_algorithms_inclusion_criteria:
|
||||||
|
|
||||||
|
What are the inclusion criteria for new algorithms?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
We only consider well-established algorithms for inclusion. A rule of thumb is
|
||||||
|
at least 3 years since publication, 200+ citations, and wide use and
|
||||||
|
usefulness. A technique that provides a clear-cut improvement (e.g. an
|
||||||
|
enhanced data structure or a more efficient approximation technique) on
|
||||||
|
a widely-used method will also be considered for inclusion.
|
||||||
|
|
||||||
|
From the algorithms or techniques that meet the above criteria, only those
|
||||||
|
which fit well within the current API of scikit-learn, that is a ``fit``,
|
||||||
|
``predict/transform`` interface and ordinarily having input/output that is a
|
||||||
|
numpy array or sparse matrix, are accepted.
|
||||||
|
|
||||||
|
The contributor should support the importance of the proposed addition with
|
||||||
|
research papers and/or implementations in other similar packages, demonstrate
|
||||||
|
its usefulness via common use-cases/applications and corroborate performance
|
||||||
|
improvements, if any, with benchmarks and/or plots. It is expected that the
|
||||||
|
proposed algorithm should outperform the methods that are already implemented
|
||||||
|
in scikit-learn at least in some areas.
|
||||||
|
|
||||||
|
Inclusion of a new algorithm speeding up an existing model is easier if:
|
||||||
|
|
||||||
|
- it does not introduce new hyper-parameters (as it makes the library
|
||||||
|
more future-proof),
|
||||||
|
- it is easy to document clearly when the contribution improves the speed
|
||||||
|
and when it does not, for instance, "when ``n_features >>
|
||||||
|
n_samples``",
|
||||||
|
- benchmarks clearly show a speed up.
|
||||||
|
|
||||||
|
Also, note that your implementation need not be in scikit-learn to be used
|
||||||
|
together with scikit-learn tools. You can implement your favorite algorithm
|
||||||
|
in a scikit-learn compatible way, upload it to GitHub and let us know. We
|
||||||
|
will be happy to list it under :ref:`related_projects`. If you already have
|
||||||
|
a package on GitHub following the scikit-learn API, you may also be
|
||||||
|
interested to look at `scikit-learn-contrib
|
||||||
|
<https://scikit-learn-contrib.github.io>`_.
|
||||||
|
|
||||||
|
.. _selectiveness:
|
||||||
|
|
||||||
|
Why are you so selective on what algorithms you include in scikit-learn?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
Code comes with maintenance cost, and we need to balance the amount of
|
||||||
|
code we have with the size of the team (and add to this the fact that
|
||||||
|
complexity scales non-linearly with the number of features).
|
||||||
|
The package relies on core developers using their free time to
|
||||||
|
fix bugs, maintain code and review contributions.
|
||||||
|
Any algorithm that is added needs future attention by the developers,
|
||||||
|
at which point the original author might long have lost interest.
|
||||||
|
See also :ref:`new_algorithms_inclusion_criteria`. For a great read about
|
||||||
|
long-term maintenance issues in open-source software, look at
|
||||||
|
`the Executive Summary of Roads and Bridges
|
||||||
|
<https://www.fordfoundation.org/media/2976/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure.pdf#page=8>`_.
|
||||||
|
|
||||||
|
|
||||||
|
Using scikit-learn
|
||||||
|
------------------
|
||||||
|
|
||||||
|
What's the best way to get help on scikit-learn usage?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
* General machine learning questions: use `Cross Validated
|
||||||
|
<https://stats.stackexchange.com/>`_ with the ``[machine-learning]`` tag.
|
||||||
|
|
||||||
|
* scikit-learn usage questions: use `Stack Overflow
|
||||||
|
<https://stackoverflow.com/questions/tagged/scikit-learn>`_ with the
|
||||||
|
``[scikit-learn]`` and ``[python]`` tags. You can alternatively use the `mailing list
|
||||||
|
<https://mail.python.org/mailman/listinfo/scikit-learn>`_.
|
||||||
|
|
||||||
|
Please make sure to include a minimal reproduction code snippet (ideally shorter
|
||||||
|
than 10 lines) that highlights your problem on a toy dataset (for instance from
|
||||||
|
:mod:`sklearn.datasets` or randomly generated with functions of ``numpy.random`` with
|
||||||
|
a fixed random seed). Please remove any line of code that is not necessary to
|
||||||
|
reproduce your problem.
|
||||||
|
|
||||||
|
The problem should be reproducible by simply copy-pasting your code snippet in a Python
|
||||||
|
shell with scikit-learn installed. Do not forget to include the import statements.
|
||||||
|
More guidance to write good reproduction code snippets can be found at:
|
||||||
|
https://stackoverflow.com/help/mcve.
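As an illustration only (dataset and estimator picked arbitrarily), a minimal
snippet in this spirit could look like::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # small synthetic dataset with a fixed random seed, so the behavior is reproducible
    X, y = make_classification(n_samples=100, n_features=5, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)
    print(clf.score(X, y))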
|
||||||
|
|
||||||
|
If your problem raises an exception that you do not understand (even after googling it),
|
||||||
|
please make sure to include the full traceback that you obtain when running the
|
||||||
|
reproduction script.
|
||||||
|
|
||||||
|
For bug reports or feature requests, please make use of the
|
||||||
|
`issue tracker on GitHub <https://github.com/scikit-learn/scikit-learn/issues>`_.
|
||||||
|
|
||||||
|
.. warning::
|
||||||
|
Please do not email any authors directly to ask for assistance, report bugs,
|
||||||
|
or for any other issue related to scikit-learn.
|
||||||
|
|
||||||
|
How should I save, export or deploy estimators for production?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
See :ref:`model_persistence`.
|
||||||
|
|
||||||
|
How can I create a bunch object?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Bunch objects are sometimes used as an output for functions and methods. They
|
||||||
|
extend dictionaries by enabling values to be accessed by key,
|
||||||
|
`bunch["value_key"]`, or by an attribute, `bunch.value_key`.
|
||||||
|
|
||||||
|
They should not be used as an input. Therefore you almost never need to create
|
||||||
|
a :class:`~utils.Bunch` object, unless you are extending scikit-learn's API.
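A tiny sketch of the access behavior described above (the key name is arbitrary)::

    >>> from sklearn.utils import Bunch
    >>> b = Bunch(value_key=42)
    >>> b["value_key"], b.value_key
    (42, 42)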
|
||||||
|
|
||||||
|
How can I load my own datasets into a format usable by scikit-learn?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Generally, scikit-learn works on any numeric data stored as numpy arrays
|
||||||
|
or scipy sparse matrices. Other types that are convertible to numeric
|
||||||
|
arrays such as :class:`pandas.DataFrame` are also acceptable.
|
||||||
|
|
||||||
|
For more information on loading your data files into these usable data
|
||||||
|
structures, please refer to :ref:`loading external datasets <external_datasets>`.
|
||||||
|
|
||||||
|
How do I deal with string data (or trees, graphs...)?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
scikit-learn estimators assume you'll feed them real-valued feature vectors.
|
||||||
|
This assumption is hard-coded in pretty much all of the library.
|
||||||
|
However, you can feed non-numerical inputs to estimators in several ways.
|
||||||
|
|
||||||
|
If you have text documents, you can use term frequency features; see
|
||||||
|
:ref:`text_feature_extraction` for the built-in *text vectorizers*.
|
||||||
|
For more general feature extraction from any kind of data, see
|
||||||
|
:ref:`dict_feature_extraction` and :ref:`feature_hashing`.
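For instance, a minimal sketch with the built-in
:class:`~feature_extraction.text.CountVectorizer` (documents invented for
illustration)::

    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> docs = ["the cat sat", "the cat sat on the mat"]
    >>> X = CountVectorizer().fit_transform(docs)
    >>> X.shape   # one row per document, one column per vocabulary term
    (2, 5)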
|
||||||
|
|
||||||
|
Another common case is when you have non-numerical data and a custom distance
|
||||||
|
(or similarity) metric on these data. Examples include strings with edit
|
||||||
|
distance (a.k.a. Levenshtein distance), for instance, DNA or RNA sequences. These can be
|
||||||
|
encoded as numbers, but doing so is painful and error-prone. Working with
|
||||||
|
distance metrics on arbitrary data can be done in two ways.
|
||||||
|
|
||||||
|
Firstly, many estimators take precomputed distance/similarity matrices, so if
|
||||||
|
the dataset is not too large, you can compute distances for all pairs of inputs.
|
||||||
|
If the dataset is large, you can use feature vectors with only one "feature",
|
||||||
|
which is an index into a separate data structure, and supply a custom metric
|
||||||
|
function that looks up the actual data in this data structure. For instance, to use
|
||||||
|
:func:`~cluster.dbscan` with Levenshtein distances::
|
||||||
|
|
||||||
|
>>> import numpy as np
|
||||||
|
>>> from leven import levenshtein # doctest: +SKIP
|
||||||
|
>>> from sklearn.cluster import dbscan
|
||||||
|
>>> data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
|
||||||
|
>>> def lev_metric(x, y):
|
||||||
|
... i, j = int(x[0]), int(y[0]) # extract indices
|
||||||
|
... return levenshtein(data[i], data[j])
|
||||||
|
...
|
||||||
|
>>> X = np.arange(len(data)).reshape(-1, 1)
|
||||||
|
>>> X
|
||||||
|
array([[0],
|
||||||
|
[1],
|
||||||
|
[2]])
|
||||||
|
>>> # We need to specify algorithm='brute' as the default assumes
|
||||||
|
>>> # a continuous feature space.
|
||||||
|
>>> dbscan(X, metric=lev_metric, eps=5, min_samples=2, algorithm='brute') # doctest: +SKIP
|
||||||
|
(array([0, 1]), array([ 0, 0, -1]))
|
||||||
|
|
||||||
|
Note that the example above uses the third-party edit distance package
|
||||||
|
`leven <https://pypi.org/project/leven/>`_. Similar tricks can be used,
|
||||||
|
with some care, for tree kernels, graph kernels, etc.
|
||||||
|
|
||||||
|
Why do I sometimes get a crash/freeze with ``n_jobs > 1`` under OSX or Linux?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Several scikit-learn tools such as :class:`~model_selection.GridSearchCV` and
|
||||||
|
:func:`~model_selection.cross_val_score` rely internally on Python's
|
||||||
|
:mod:`multiprocessing` module to parallelize execution
|
||||||
|
onto several Python processes by passing ``n_jobs > 1`` as an argument.
|
||||||
|
|
||||||
|
The problem is that Python :mod:`multiprocessing` does a ``fork`` system call
|
||||||
|
without following it with an ``exec`` system call for performance reasons. Many
|
||||||
|
libraries like (some versions of) Accelerate or vecLib under OSX, (some versions
|
||||||
|
of) MKL, the OpenMP runtime of GCC, nvidia's Cuda (and probably many others),
|
||||||
|
manage their own internal thread pool. Upon a call to `fork`, the thread pool
|
||||||
|
state in the child process is corrupted: the thread pool believes it has many
|
||||||
|
threads while only the main thread state has been forked. It is possible to
|
||||||
|
change the libraries to make them detect when a fork happens and reinitialize
|
||||||
|
the thread pool in that case: we did that for OpenBLAS (merged upstream in
|
||||||
|
main since 0.2.10) and we contributed a `patch
|
||||||
|
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035>`_ to GCC's OpenMP runtime
|
||||||
|
(not yet reviewed).
|
||||||
|
|
||||||
|
But in the end the real culprit is Python's :mod:`multiprocessing` that does
|
||||||
|
``fork`` without ``exec`` to reduce the overhead of starting and using new
|
||||||
|
Python processes for parallel computing. Unfortunately this is a violation of
|
||||||
|
the POSIX standard and therefore some software editors like Apple refuse to
|
||||||
|
consider the lack of fork-safety in Accelerate and vecLib as a bug.
|
||||||
|
|
||||||
|
In Python 3.4+ it is now possible to configure :mod:`multiprocessing` to
|
||||||
|
use the ``"forkserver"`` or ``"spawn"`` start methods (instead of the default
|
||||||
|
``"fork"``) to manage the process pools. To work around this issue when
|
||||||
|
using scikit-learn, you can set the ``JOBLIB_START_METHOD`` environment
|
||||||
|
variable to ``"forkserver"``. However, the user should be aware that using
|
||||||
|
the ``"forkserver"`` method prevents :class:`joblib.Parallel` from calling functions
|
||||||
|
interactively defined in a shell session.
|
||||||
|
|
||||||
|
If you have custom code that uses :mod:`multiprocessing` directly instead of using
|
||||||
|
it via :mod:`joblib` you can enable the ``"forkserver"`` mode globally for your
|
||||||
|
program. Insert the following instructions in your main script::
|
||||||
|
|
||||||
|
import multiprocessing
|
||||||
|
|
||||||
|
# other imports, custom code, load data, define model...
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
multiprocessing.set_start_method("forkserver")
|
||||||
|
|
||||||
|
# call scikit-learn utils with n_jobs > 1 here
|
||||||
|
|
||||||
|
You can find more details on the new start methods in the `multiprocessing
|
||||||
|
documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.
|
||||||
|
|
||||||
|
.. _faq_mkl_threading:
|
||||||
|
|
||||||
|
Why does my job use more cores than specified with ``n_jobs``?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
This is because ``n_jobs`` only controls the number of jobs for
|
||||||
|
routines that are parallelized with :mod:`joblib`, but parallel code can come
|
||||||
|
from other sources:
|
||||||
|
|
||||||
|
- some routines may be parallelized with OpenMP (for code written in C or
|
||||||
|
Cython),
|
||||||
|
- scikit-learn relies a lot on numpy, which in turn may rely on numerical
|
||||||
|
libraries like MKL, OpenBLAS or BLIS which can provide parallel
|
||||||
|
implementations.
|
||||||
|
|
||||||
|
For more details, please refer to our :ref:`notes on parallelism <parallelism>`.
|
||||||
|
|
||||||
|
How do I set a ``random_state`` for an entire execution?
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
Please refer to :ref:`randomness`.
|
|
@ -0,0 +1,231 @@
|
||||||
|
Getting Started
|
||||||
|
===============
|
||||||
|
|
||||||
|
The purpose of this guide is to illustrate some of the main features that
|
||||||
|
``scikit-learn`` provides. It assumes a very basic working knowledge of
|
||||||
|
machine learning practices (model fitting, predicting, cross-validation,
|
||||||
|
etc.). Please refer to our :ref:`installation instructions
|
||||||
|
<installation-instructions>` for installing ``scikit-learn``.
|
||||||
|
|
||||||
|
``Scikit-learn`` is an open source machine learning library that supports
|
||||||
|
supervised and unsupervised learning. It also provides various tools for
|
||||||
|
model fitting, data preprocessing, model selection, model evaluation,
|
||||||
|
and many other utilities.
|
||||||
|
|
||||||
|
Fitting and predicting: estimator basics
|
||||||
|
----------------------------------------
|
||||||
|
|
||||||
|
``Scikit-learn`` provides dozens of built-in machine learning algorithms and
|
||||||
|
models, called :term:`estimators`. Each estimator can be fitted to some data
|
||||||
|
using its :term:`fit` method.
|
||||||
|
|
||||||
|
Here is a simple example where we fit a
|
||||||
|
:class:`~sklearn.ensemble.RandomForestClassifier` to some very basic data::
|
||||||
|
|
||||||
|
>>> from sklearn.ensemble import RandomForestClassifier
|
||||||
|
>>> clf = RandomForestClassifier(random_state=0)
|
||||||
|
>>> X = [[ 1, 2, 3], # 2 samples, 3 features
|
||||||
|
... [11, 12, 13]]
|
||||||
|
>>> y = [0, 1] # classes of each sample
|
||||||
|
>>> clf.fit(X, y)
|
||||||
|
RandomForestClassifier(random_state=0)
|
||||||
|
|
||||||
|
The :term:`fit` method generally accepts 2 inputs:
|
||||||
|
|
||||||
|
- The samples matrix (or design matrix) :term:`X`. The size of ``X``
|
||||||
|
is typically ``(n_samples, n_features)``, which means that samples are
|
||||||
|
represented as rows and features are represented as columns.
|
||||||
|
- The target values :term:`y` which are real numbers for regression tasks, or
|
||||||
|
integers for classification (or any other discrete set of values). For
|
||||||
|
unsupervised learning tasks, ``y`` does not need to be specified. ``y`` is
|
||||||
|
usually a 1d array where the ``i`` th entry corresponds to the target of the
|
||||||
|
``i`` th sample (row) of ``X``.
|
||||||
|
|
||||||
|
Both ``X`` and ``y`` are usually expected to be numpy arrays or equivalent
|
||||||
|
:term:`array-like` data types, though some estimators work with other
|
||||||
|
formats such as sparse matrices.
|
||||||
|
|
||||||
|
Once the estimator is fitted, it can be used for predicting target values of
|
||||||
|
new data. You don't need to re-train the estimator::
|
||||||
|
|
||||||
|
>>> clf.predict(X) # predict classes of the training data
|
||||||
|
array([0, 1])
|
||||||
|
>>> clf.predict([[4, 5, 6], [14, 15, 16]]) # predict classes of new data
|
||||||
|
array([0, 1])
|
||||||
|
|
||||||
|
You can check :ref:`ml_map` on how to choose the right model for your use case.
|
||||||
|
|
||||||
|
Transformers and pre-processors
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
Machine learning workflows are often composed of different parts. A typical
|
||||||
|
pipeline consists of a pre-processing step that transforms or imputes the
|
||||||
|
data, and a final predictor that predicts target values.
|
||||||
|
|
||||||
|
In ``scikit-learn``, pre-processors and transformers follow the same API as
|
||||||
|
the estimator objects (they actually all inherit from the same
|
||||||
|
``BaseEstimator`` class). The transformer objects don't have a
|
||||||
|
:term:`predict` method but rather a :term:`transform` method that outputs a
|
||||||
|
newly transformed sample matrix ``X``::
|
||||||
|
|
||||||
|
>>> from sklearn.preprocessing import StandardScaler
|
||||||
|
>>> X = [[0, 15],
|
||||||
|
... [1, -10]]
|
||||||
|
>>> # scale data according to computed scaling values
|
||||||
|
>>> StandardScaler().fit(X).transform(X)
|
||||||
|
array([[-1., 1.],
|
||||||
|
[ 1., -1.]])
|
||||||
|
|
||||||
|
Sometimes, you want to apply different transformations to different features:
|
||||||
|
the :ref:`ColumnTransformer<column_transformer>` is designed for these
|
||||||
|
use-cases.
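Here is a small sketch (columns and transformers chosen arbitrarily) where a
numeric column is scaled and a categorical column is one-hot encoded::

    >>> import numpy as np
    >>> from sklearn.compose import ColumnTransformer
    >>> from sklearn.preprocessing import OneHotEncoder, StandardScaler
    >>> X = np.array([[0.5, "a"], [1.5, "b"], [2.5, "a"]], dtype=object)
    >>> ct = ColumnTransformer([
    ...     ("num", StandardScaler(), [0]),  # scale the first (numeric) column
    ...     ("cat", OneHotEncoder(), [1]),   # one-hot encode the second (categorical) column
    ... ])
    >>> ct.fit_transform(X).shape            # 1 scaled column + 2 one-hot columns
    (3, 3)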
|
||||||
|
|
||||||
|
Pipelines: chaining pre-processors and estimators
|
||||||
|
--------------------------------------------------
|
||||||
|
|
||||||
|
Transformers and estimators (predictors) can be combined together into a
|
||||||
|
single unifying object: a :class:`~sklearn.pipeline.Pipeline`. The pipeline
|
||||||
|
offers the same API as a regular estimator: it can be fitted and used for
|
||||||
|
prediction with ``fit`` and ``predict``. As we will see later, using a
|
||||||
|
pipeline will also help prevent data leakage, i.e. disclosing some
|
||||||
|
testing data in your training data.
|
||||||
|
|
||||||
|
In the following example, we :ref:`load the Iris dataset <datasets>`, split it
|
||||||
|
into train and test sets, and compute the accuracy score of a pipeline on
|
||||||
|
the test data::
|
||||||
|
|
||||||
|
>>> from sklearn.preprocessing import StandardScaler
|
||||||
|
>>> from sklearn.linear_model import LogisticRegression
|
||||||
|
>>> from sklearn.pipeline import make_pipeline
|
||||||
|
>>> from sklearn.datasets import load_iris
|
||||||
|
>>> from sklearn.model_selection import train_test_split
|
||||||
|
>>> from sklearn.metrics import accuracy_score
|
||||||
|
...
|
||||||
|
>>> # create a pipeline object
|
||||||
|
>>> pipe = make_pipeline(
|
||||||
|
... StandardScaler(),
|
||||||
|
... LogisticRegression()
|
||||||
|
... )
|
||||||
|
...
|
||||||
|
>>> # load the iris dataset and split it into train and test sets
|
||||||
|
>>> X, y = load_iris(return_X_y=True)
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
|
||||||
|
...
|
||||||
|
>>> # fit the whole pipeline
|
||||||
|
>>> pipe.fit(X_train, y_train)
|
||||||
|
Pipeline(steps=[('standardscaler', StandardScaler()),
|
||||||
|
('logisticregression', LogisticRegression())])
|
||||||
|
>>> # we can now use it like any other estimator
|
||||||
|
>>> accuracy_score(pipe.predict(X_test), y_test)
|
||||||
|
0.97...
|
||||||
|
|
||||||
|
Model evaluation
|
||||||
|
----------------
|
||||||
|
|
||||||
|
Fitting a model to some data does not entail that it will predict well on
|
||||||
|
unseen data. This needs to be directly evaluated. We have just seen the
|
||||||
|
:func:`~sklearn.model_selection.train_test_split` helper that splits a
|
||||||
|
dataset into train and test sets, but ``scikit-learn`` provides many other
|
||||||
|
tools for model evaluation, in particular for :ref:`cross-validation
|
||||||
|
<cross_validation>`.
|
||||||
|
|
||||||
|
We here briefly show how to perform a 5-fold cross-validation procedure,
|
||||||
|
using the :func:`~sklearn.model_selection.cross_validate` helper. Note that
|
||||||
|
it is also possible to manually iterate over the folds, use different
|
||||||
|
data splitting strategies, and use custom scoring functions. Please refer to
|
||||||
|
our :ref:`User Guide <cross_validation>` for more details::
|
||||||
|
|
||||||
|
>>> from sklearn.datasets import make_regression
|
||||||
|
>>> from sklearn.linear_model import LinearRegression
|
||||||
|
>>> from sklearn.model_selection import cross_validate
|
||||||
|
...
|
||||||
|
>>> X, y = make_regression(n_samples=1000, random_state=0)
|
||||||
|
>>> lr = LinearRegression()
|
||||||
|
...
|
||||||
|
>>> result = cross_validate(lr, X, y) # defaults to 5-fold CV
|
||||||
|
>>> result['test_score'] # r_squared score is high because dataset is easy
|
||||||
|
array([1., 1., 1., 1., 1.])
|
||||||
|
|
||||||
|
Automatic parameter searches
|
||||||
|
----------------------------
|
||||||
|
|
||||||
|
All estimators have parameters (often called hyper-parameters in the
|
||||||
|
literature) that can be tuned. The generalization power of an estimator
|
||||||
|
often critically depends on a few parameters. For example a
|
||||||
|
:class:`~sklearn.ensemble.RandomForestRegressor` has an ``n_estimators``
|
||||||
|
parameter that determines the number of trees in the forest, and a
|
||||||
|
``max_depth`` parameter that determines the maximum depth of each tree.
|
||||||
|
Quite often, it is not clear what the exact values of these parameters
|
||||||
|
should be since they depend on the data at hand.
|
||||||
|
|
||||||
|
``Scikit-learn`` provides tools to automatically find the best parameter
|
||||||
|
combinations (via cross-validation). In the following example, we randomly
|
||||||
|
search over the parameter space of a random forest with a
|
||||||
|
:class:`~sklearn.model_selection.RandomizedSearchCV` object. When the search
|
||||||
|
is over, the :class:`~sklearn.model_selection.RandomizedSearchCV` behaves as
|
||||||
|
a :class:`~sklearn.ensemble.RandomForestRegressor` that has been fitted with
|
||||||
|
the best set of parameters. Read more in the :ref:`User Guide
|
||||||
|
<grid_search>`::
|
||||||
|
|
||||||
|
>>> from sklearn.datasets import fetch_california_housing
|
||||||
|
>>> from sklearn.ensemble import RandomForestRegressor
|
||||||
|
>>> from sklearn.model_selection import RandomizedSearchCV
|
||||||
|
>>> from sklearn.model_selection import train_test_split
|
||||||
|
>>> from scipy.stats import randint
|
||||||
|
...
|
||||||
|
>>> X, y = fetch_california_housing(return_X_y=True)
|
||||||
|
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
|
||||||
|
...
|
||||||
|
>>> # define the parameter space that will be searched over
|
||||||
|
>>> param_distributions = {'n_estimators': randint(1, 5),
|
||||||
|
... 'max_depth': randint(5, 10)}
|
||||||
|
...
|
||||||
|
>>> # now create a searchCV object and fit it to the data
|
||||||
|
>>> search = RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0),
|
||||||
|
... n_iter=5,
|
||||||
|
... param_distributions=param_distributions,
|
||||||
|
... random_state=0)
|
||||||
|
>>> search.fit(X_train, y_train)
|
||||||
|
RandomizedSearchCV(estimator=RandomForestRegressor(random_state=0), n_iter=5,
|
||||||
|
param_distributions={'max_depth': ...,
|
||||||
|
'n_estimators': ...},
|
||||||
|
random_state=0)
|
||||||
|
>>> search.best_params_
|
||||||
|
{'max_depth': 9, 'n_estimators': 4}
|
||||||
|
|
||||||
|
>>> # the search object now acts like a normal random forest estimator
|
||||||
|
>>> # with max_depth=9 and n_estimators=4
|
||||||
|
>>> search.score(X_test, y_test)
|
||||||
|
0.73...
|
||||||
|
|
||||||
|
.. note::
|
||||||
|
|
||||||
|
In practice, you almost always want to :ref:`search over a pipeline
|
||||||
|
<composite_grid_search>`, instead of a single estimator. One of the main
|
||||||
|
reasons is that if you apply a pre-processing step to the whole dataset
|
||||||
|
without using a pipeline, and then perform any kind of cross-validation,
|
||||||
|
you would be breaking the fundamental assumption of independence between
|
||||||
|
training and testing data. Indeed, since you pre-processed the data
|
||||||
|
using the whole dataset, some information about the test sets is
|
||||||
|
available to the train sets. This will lead to over-estimating the
|
||||||
|
generalization power of the estimator (you can read more in this `Kaggle
|
||||||
|
post <https://www.kaggle.com/alexisbcook/data-leakage>`_).
|
||||||
|
|
||||||
|
Using a pipeline for cross-validation and searching will largely keep
|
||||||
|
you from this common pitfall.
|
||||||
|
|
||||||
|
|
||||||
|
Next steps
|
||||||
|
----------
|
||||||
|
|
||||||
|
We have briefly covered estimator fitting and predicting, pre-processing
|
||||||
|
steps, pipelines, cross-validation tools and automatic hyper-parameter
|
||||||
|
searches. This guide should give you an overview of some of the main
|
||||||
|
features of the library, but there is much more to ``scikit-learn``!
|
||||||
|
|
||||||
|
Please refer to our :ref:`user_guide` for details on all the tools that we
|
||||||
|
provide. You can also find an exhaustive list of the public API in the
|
||||||
|
:ref:`api_ref`.
|
||||||
|
|
||||||
|
You can also look at our numerous :ref:`examples <general_examples>` that
|
||||||
|
illustrate the use of ``scikit-learn`` in many different contexts.
|
|
@ -0,0 +1,198 @@
|
||||||
|
.. _governance:
|
||||||
|
|
||||||
|
===========================================
|
||||||
|
Scikit-learn governance and decision-making
|
||||||
|
===========================================
|
||||||
|
|
||||||
|
The purpose of this document is to formalize the governance process used by the
|
||||||
|
scikit-learn project, to clarify how decisions are made and how the various
|
||||||
|
elements of our community interact.
|
||||||
|
This document establishes a decision-making structure that takes into account
|
||||||
|
feedback from all members of the community and strives to find consensus, while
|
||||||
|
avoiding any deadlocks.
|
||||||
|
|
||||||
|
This is a meritocratic, consensus-based community project. Anyone with an
|
||||||
|
interest in the project can join the community, contribute to the project
|
||||||
|
design and participate in the decision making process. This document describes
|
||||||
|
how that participation takes place and how to set about earning merit within
|
||||||
|
the project community.
|
||||||
|
|
||||||
|
Roles And Responsibilities
|
||||||
|
==========================
|
||||||
|
|
||||||
|
We distinguish between contributors, core contributors, and the technical
|
||||||
|
committee. A key distinction between them is their voting rights: contributors
|
||||||
|
have no voting rights, whereas the other two groups all have voting rights,
|
||||||
|
as well as permissions to the tools relevant to their roles.
|
||||||
|
|
||||||
|
Contributors
|
||||||
|
------------
|
||||||
|
|
||||||
|
Contributors are community members who contribute in concrete ways to the
|
||||||
|
project. Anyone can become a contributor, and contributions can take many forms
|
||||||
|
– not only code – as detailed in the :ref:`contributors guide <contributing>`.
|
||||||
|
There is no process to become a contributor: once somebody contributes to the
|
||||||
|
project in any way, they are a contributor.
|
||||||
|
|
||||||
|
Core Contributors
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
All core contributor members have the same voting rights and the right to propose
|
||||||
|
new members to any of the roles listed below. Their membership is represented
|
||||||
|
as being an organization member on the scikit-learn `GitHub organization
|
||||||
|
<https://github.com/orgs/scikit-learn/people>`_.
|
||||||
|
|
||||||
|
They are also welcome to join our `monthly core contributor meetings
|
||||||
|
<https://github.com/scikit-learn/administrative/tree/master/meeting_notes>`_.
|
||||||
|
|
||||||
|
New members can be nominated by any existing member. Once they have been
|
||||||
|
nominated, there will be a vote by the current core contributors. Voting on new
|
||||||
|
members is one of the few activities that takes place on the project's private
|
||||||
|
mailing list. While it is expected that most votes will be unanimous, a
|
||||||
|
two-thirds majority of the cast votes is enough. The vote needs to be open for
|
||||||
|
at least 1 week.
|
||||||
|
|
||||||
|
Core contributors who have not contributed to the project, in a way that
|
||||||
|
corresponds to their role, in the past 12 months will be asked if they want to
|
||||||
|
become emeritus members and relinquish their rights until they become active again. The list of
|
||||||
|
members, active and emeritus (with dates at which they became active) is public
|
||||||
|
on the scikit-learn website.
|
||||||
|
|
||||||
|
The following teams form the core contributors group:
|
||||||
|
|
||||||
|
* **Contributor Experience Team**
|
||||||
|
The contributor experience team improves the experience of contributors by
|
||||||
|
helping with the triage of issues and pull requests, noticing any recurring
|
||||||
|
patterns where people might struggle, and helping to improve those aspects of
|
||||||
|
the project.
|
||||||
|
|
||||||
|
To this end, they have the required permissions on GitHub to label and close
|
||||||
|
issues. :ref:`Their work <bug_triaging>` is crucial to improving
|
||||||
|
communication in the project and limiting the crowding of the issue tracker.
|
||||||
|
|
||||||
|
.. _communication_team:
|
||||||
|
|
||||||
|
* **Communication Team**
|
||||||
|
Members of the communication team help with outreach and communication
|
||||||
|
for scikit-learn. The goal of the team is to develop public awareness of
|
||||||
|
scikit-learn, of its features and usage, as well as branding.
|
||||||
|
|
||||||
|
For this, they can operate the scikit-learn accounts on various social networks
|
||||||
|
and produce materials. They also have the required rights to our blog
|
||||||
|
repository and other relevant accounts and platforms.
|
||||||
|
|
||||||
|
* **Documentation Team**
|
||||||
|
Members of the documentation team engage with the documentation of the project
|
||||||
|
among other things. They might also be involved in other aspects of the
|
||||||
|
project, but their reviews on documentation contributions are considered
|
||||||
|
authoritative, and they can merge such contributions.
|
||||||
|
|
||||||
|
To this end, they have permissions to merge pull requests in scikit-learn's
|
||||||
|
repository.
|
||||||
|
|
||||||
|
* **Maintainers Team**
|
||||||
|
Maintainers are community members who have shown that they are dedicated to the
|
||||||
|
continued development of the project through ongoing engagement with the
|
||||||
|
community. They have shown they can be trusted to maintain scikit-learn with
|
||||||
|
care. Being a maintainer allows contributors to more easily carry on with their
|
||||||
|
project-related activities by giving them direct access to the project's
|
||||||
|
repository. Maintainers are expected to review code contributions, merge
|
||||||
|
approved pull requests, cast votes for and against merging a pull request,
|
||||||
|
and to be involved in deciding major changes to the API.
|
||||||
|
|
||||||
|
Technical Committee
|
||||||
|
-------------------
|
||||||
|
|
||||||
|
The Technical Committee (TC) members are maintainers who have additional
|
||||||
|
responsibilities to ensure the smooth running of the project. TC members are
|
||||||
|
expected to participate in strategic planning, and approve changes to the
|
||||||
|
governance model. The purpose of the TC is to ensure smooth progress from the
|
||||||
|
big-picture perspective. Indeed, changes that impact the full project require a
|
||||||
|
holistic analysis and a consensus that is both explicit and informed. In cases
|
||||||
|
where the core contributor community (which includes the TC members) fails to
|
||||||
|
reach such a consensus in the required time frame, the TC is the entity to
|
||||||
|
resolve the issue. Membership of the TC is by nomination by a core contributor.
|
||||||
|
A nomination will result in discussion which cannot take more than a month and
|
||||||
|
then a vote by the core contributors which will stay open for a week. TC
|
||||||
|
membership votes are subject to a two-thirds majority of all cast votes as well
|
||||||
|
as a simple majority approval of all the current TC members. TC members who do
|
||||||
|
not actively engage with the TC duties are expected to resign.
|
||||||
|
|
||||||
|
The Technical Committee of scikit-learn consists of :user:`Thomas Fan
|
||||||
|
<thomasjpfan>`, :user:`Alexandre Gramfort <agramfort>`, :user:`Olivier Grisel
|
||||||
|
<ogrisel>`, :user:`Adrin Jalali <adrinjalali>`, :user:`Andreas Müller
|
||||||
|
<amueller>`, :user:`Joel Nothman <jnothman>` and :user:`Gaël Varoquaux
|
||||||
|
<GaelVaroquaux>`.
|
||||||
|
|
||||||
|
Decision Making Process
|
||||||
|
=======================
|
||||||
|
Decisions about the future of the project are made through discussion with all
|
||||||
|
members of the community. All non-sensitive project management discussion takes
|
||||||
|
place on the project contributors' `mailing list <mailto:scikit-learn@python.org>`_
|
||||||
|
and the `issue tracker <https://github.com/scikit-learn/scikit-learn/issues>`_.
|
||||||
|
Occasionally, sensitive discussion occurs on a private list.
|
||||||
|
|
||||||
|
Scikit-learn uses a "consensus seeking" process for making decisions. The group
|
||||||
|
tries to find a resolution that has no open objections among core contributors.
|
||||||
|
At any point during the discussion, any core contributor can call for a vote,
|
||||||
|
which will conclude one month from the call for the vote. Most votes have to be
|
||||||
|
backed by a :ref:`SLEP <slep>`. If no option can gather two thirds of the votes
|
||||||
|
cast, the decision is escalated to the TC, which in turn will use consensus
|
||||||
|
seeking with the fallback option of a simple majority vote if no consensus can
|
||||||
|
be found within a month. This is what we hereafter may refer to as "**the
|
||||||
|
decision-making process**".
|
||||||
|
|
||||||
|
Decisions (in addition to adding core contributors and TC membership as above)
|
||||||
|
are made according to the following rules:
|
||||||
|
|
||||||
|
* **Minor Documentation changes**, such as typo fixes, or addition / correction
|
||||||
|
of a sentence, but no change of the ``scikit-learn.org`` landing page or the
|
||||||
|
“about” page: Requires +1 by a maintainer, no -1 by a maintainer (lazy
|
||||||
|
consensus), happens on the issue or pull request page. Maintainers are
|
||||||
|
expected to give “reasonable time” to others to give their opinion on the
|
||||||
|
pull request if they're not confident others would agree.
|
||||||
|
|
||||||
|
* **Code changes and major documentation changes**
|
||||||
|
require +1 by two maintainers, no -1 by a maintainer (lazy
|
||||||
|
consensus), and happen on the issue or pull request page.
|
||||||
|
|
||||||
|
* **Changes to the API principles and changes to dependencies or supported
|
||||||
|
versions** happen via a :ref:`slep` and follow the decision-making process
|
||||||
|
outlined above.
|
||||||
|
|
||||||
|
* **Changes to the governance model** follow the process outlined in `SLEP020
|
||||||
|
<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep020/proposal.html>`__.
|
||||||
|
|
||||||
|
If a veto -1 vote is cast on a lazy consensus, the proposer can appeal to the
|
||||||
|
community and maintainers and the change can be approved or rejected using
|
||||||
|
the decision-making process outlined above.
|
||||||
|
|
||||||
|
Governance Model Changes
|
||||||
|
------------------------
|
||||||
|
|
||||||
|
Governance model changes occur through an enhancement proposal or a GitHub Pull
|
||||||
|
Request. An enhancement proposal will go through "**the decision-making process**"
|
||||||
|
described in the previous section. Alternatively, an author may propose a change
|
||||||
|
directly to the governance model with a GitHub Pull Request. Logistically, an
|
||||||
|
author can open a Draft Pull Request for feedback and follow up with a new
|
||||||
|
revised Pull Request for voting. Once the author is happy with the state of the
|
||||||
|
Pull Request, they can call for a vote on the public mailing list. During the
|
||||||
|
one-month voting period, the Pull Request cannot change. A Pull Request
|
||||||
|
Approval will count as a positive vote, and a "Request Changes" review will
|
||||||
|
count as a negative vote. If two-thirds of the cast votes are positive, then
|
||||||
|
the governance model change is accepted.
|
||||||
|
|
||||||
|
.. _slep:
|
||||||
|
|
||||||
|
Enhancement proposals (SLEPs)
|
||||||
|
==============================
|
||||||
|
For all votes, a proposal must have been made public and discussed before the
|
||||||
|
vote. Such a proposal must be a consolidated document, in the form of a
|
||||||
|
"Scikit-Learn Enhancement Proposal" (SLEP), rather than a long discussion on an
|
||||||
|
issue. A SLEP must be submitted as a pull request to `enhancement proposals
|
||||||
|
<https://scikit-learn-enhancement-proposals.readthedocs.io>`_ using the `SLEP
|
||||||
|
template
|
||||||
|
<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep_template.html>`_.
|
||||||
|
`SLEP000
|
||||||
|
<https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep000/proposal.html>`__
|
||||||
|
describes the process in more detail.
|
|
@ -0,0 +1,33 @@
|
||||||
|
<?xml version="1.0" encoding="utf-8"?>
|
||||||
|
<!-- Generator: Adobe Illustrator 21.1.0, SVG Export Plug-In . SVG Version: 6.00 Build 0) -->
|
||||||
|
<svg version="1.1" id="Artwork" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
|
||||||
|
viewBox="0 0 190.1 33" style="enable-background:new 0 0 190.1 33;" xml:space="preserve">
|
||||||
|
<style type="text/css">
|
||||||
|
.st0{fill:#4B5168;}
|
||||||
|
.st1{fill:#F6914D;}
|
||||||
|
</style>
|
||||||
|
<g>
|
||||||
|
<path class="st0" d="M33.4,27.7V5.3c0-2.3,0-2.3,2.4-2.3c2.4,0,2.4,0,2.4,2.3v22.4c0,2.3,0,2.3-2.4,2.3
|
||||||
|
C33.4,29.9,33.4,29.9,33.4,27.7z"/>
|
||||||
|
<path class="st0" d="M45,26.4V6.6c0-3.6,0-3.6,3.6-3.6h5.8c7.8,0,12.5,3.9,13,10.2c0.2,2.2,0.2,3.4,0,5.5
|
||||||
|
c-0.5,6.3-5.3,11.2-13,11.2h-5.8C45,29.9,45,29.9,45,26.4z M54.3,25.4c5.3,0,8-3,8.3-7.1c0.1-1.8,0.1-2.8,0-4.6
|
||||||
|
c-0.3-4.2-3-6.1-8.3-6.1h-4.5v17.8H54.3z"/>
|
||||||
|
<path class="st0" d="M73.8,26.4V6.6c0-3.6,0-3.6,3.6-3.6h13.5c2.3,0,2.3,0,2.3,2.2c0,2.2,0,2.2-2.3,2.2H78.6v6.9h11
|
||||||
|
c2.2,0,2.2,0,2.2,2.1c0,2.1,0,2.1-2.2,2.1h-11v6.9h12.3c2.3,0,2.3,0,2.3,2.2c0,2.3,0,2.3-2.3,2.3H77.4
|
||||||
|
C73.8,29.9,73.8,29.9,73.8,26.4z"/>
|
||||||
|
<path class="st0" d="M100,26.4v-21c0-2.3,0-2.3,2.4-2.3c2.4,0,2.4,0,2.4,2.3v20.2h11.9c2.4,0,2.4,0,2.4,2.2c0,2.2,0,2.2-2.4,2.2
|
||||||
|
h-13.1C100,29.9,100,29.9,100,26.4z"/>
|
||||||
|
<path class="st0" d="M125.8,27.7V5.3c0-2.3,0-2.3,2.4-2.3c2.4,0,2.4,0,2.4,2.3v22.4c0,2.3,0,2.3-2.4,2.3
|
||||||
|
C125.8,29.9,125.8,29.9,125.8,27.7z"/>
|
||||||
|
<path class="st0" d="M137.4,27.7V6.6c0-3.6,0-3.6,3.6-3.6h13.5c2.3,0,2.3,0,2.3,2.2c0,2.2,0,2.2-2.3,2.2h-12.2v7.2h11.3
|
||||||
|
c2.3,0,2.3,0,2.3,2.2c0,2.2,0,2.2-2.3,2.2h-11.3v8.6c0,2.3,0,2.3-2.4,2.3S137.4,29.9,137.4,27.7z"/>
|
||||||
|
<path class="st0" d="M24.2,3.1H5.5c-2.4,0-2.4,0-2.4,2.2c0,2.2,0,2.2,2.4,2.2h7v4.7v3.2l4.8-3.7v-1.1V7.5h7c2.4,0,2.4,0,2.4-2.2
|
||||||
|
C26.6,3.1,26.6,3.1,24.2,3.1z"/>
|
||||||
|
<path class="st1" d="M12.5,20v7.6c0,2.3,0,2.3,2.4,2.3c2.4,0,2.4,0,2.4-2.3V16.3L12.5,20z"/>
|
||||||
|
<g>
|
||||||
|
<path class="st0" d="M165.9,3.1h18.7c2.4,0,2.4,0,2.4,2.2c0,2.2,0,2.2-2.4,2.2h-7v4.7v3.2l-4.8-3.7v-1.1V7.5h-7
|
||||||
|
c-2.4,0-2.4,0-2.4-2.2C163.5,3.1,163.5,3.1,165.9,3.1z"/>
|
||||||
|
<path class="st1" d="M177.6,20v7.6c0,2.3,0,2.3-2.4,2.3c-2.4,0-2.4,0-2.4-2.3V16.3L177.6,20z"/>
|
||||||
|
</g>
|
||||||
|
</g>
|
||||||
|
</svg>
|
|
@ -0,0 +1,19 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8"?>
|
||||||
|
<svg width="192px" height="192px" viewBox="0 0 192 192" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
|
||||||
|
<!-- Generator: Sketch 52.2 (67145) - http://www.bohemiancoding.com/sketch -->
|
||||||
|
<title>nav / elements / czi_mark_red</title>
|
||||||
|
<desc>Created with Sketch.</desc>
|
||||||
|
<defs>
|
||||||
|
<polygon id="path-1" points="0 0 192 0 192 192 0 192"></polygon>
|
||||||
|
</defs>
|
||||||
|
<g id="nav-/-elements-/-czi_mark_red" stroke="none" stroke-width="1" fill="none" fill-rule="evenodd">
|
||||||
|
<g id="czi_mark">
|
||||||
|
<mask id="mask-2" fill="white">
|
||||||
|
<use xlink:href="#path-1"></use>
|
||||||
|
</mask>
|
||||||
|
<g id="Clip-2"></g>
|
||||||
|
<path d="M69.7933712,96.0856792 C56.904554,96.3514262 47.2394287,87.8624926 46.342904,75.3156235 C45.8651731,68.644557 48.110735,62.2697375 52.6627795,57.388862 C57.219641,52.500079 63.427876,49.7657946 70.1438772,49.71016 C73.5426804,49.6796598 77.1369963,50.3684555 80.213062,51.6949308 C80.213062,51.6949308 79.3749077,58.7000872 79.0980732,61.8545962 L89.6903251,61.9153142 L91.5927482,46.1405096 L88.6107553,44.259383 C83.0403449,40.8543771 76.6238464,39.0543018 70.0475376,39.1124781 C60.4838522,39.1960712 51.2757731,43.2215297 44.7856033,50.1809359 C38.201644,57.2442685 34.9578341,66.424257 35.6463786,76.0473453 C36.2558681,84.5890893 39.9417065,92.3790605 46.0178996,97.9919403 C52.1725812,103.677964 60.4583506,106.741255 69.4148134,106.665287 C69.6332775,106.663028 69.8542918,106.657662 70.0753061,106.650319 C75.5060241,106.50855 81.6227365,105.483123 88.354322,102.824806 L96,88.0373038 C96,88.0373038 95.8450066,87.6955889 94.7606198,88.4473617 C88.1840277,93.0068558 80.4898965,95.8651178 69.7933712,96.0856792 Z" id="Fill-1" fill="#FF414B" mask="url(#mask-2)"></path>
|
||||||
|
<path d="M128.264258,140.830158 C127.731065,146.452835 124.81253,151.094434 120.437535,153.404009 C116.963637,155.237918 113.167297,155.227815 109.745876,153.371175 C106.186106,151.433995 104.498127,148.533417 103.864188,144.868125 C102.862906,139.054059 106.168707,132.991356 110.67195,129.934748 L181.049041,84.1510133 C181.585041,88.0250929 181.869318,91.9799935 181.869318,96 C181.869318,143.38388 143.348592,181.932164 95.9998597,181.932164 C48.6516891,181.932164 10.1309628,143.38388 10.1309628,96 C10.1309628,48.616401 48.6516891,10.0655911 95.9998597,10.0655911 C131.406173,10.0655911 161.85659,31.6327505 174.973438,62.3195017 L183.562348,56.7394801 C168.526003,23.330911 134.948264,0 95.9998597,0 C43.0640987,0 0,43.0641617 0,96 C0,148.933313 43.0640987,192 95.9998597,192 C148.933376,192 192,148.933313 192,96 C192,89.8893095 191.418819,83.9121983 190.322123,78.115812 C189.660402,74.3219922 188.211237,69.2931255 187.972422,68.477899 L167.980181,80.9835569 L141.509354,97.7435463 C140.575984,94.2213751 138.445173,90.6540228 133.924531,88.6012237 C128.571266,86.1709789 119.901815,88.0427725 113.539691,91.6603574 C113.539691,91.6603574 130.963622,57.9473061 133.854094,52.4051694 C134.042957,52.0454034 133.77636,51.6202509 133.368607,51.6227765 L132.299413,51.6328792 L100.784853,51.6076226 L99.4768445,62.5030328 L104.405239,62.4856339 L117.132014,62.4856339 L92.1861209,110.251449 C91.7006339,111.182575 92.706967,112.183578 93.6428625,111.700616 L106.95587,104.624001 C113.383661,101.326053 124.083177,94.5586909 129.373582,98.473181 C130.143346,99.0414541 131.129473,100.545905 131.192615,102.123599 C131.220116,102.734528 130.910864,103.318236 130.39984,103.660322 L103.841457,121.605126 C95.2152229,127.771101 91.8415093,136.945976 92.6415806,145.600005 C93.3105985,152.875585 97.8390977,159.144831 104.560988,162.797775 C108.085679,164.71475 111.899418,165.617532 115.708386,165.512016 C119.048986,165.418847 122.385095,164.546092 125.514381,162.893189 C133.086856,158.890862 138.125818,151.11464 139.001378,142.078114 L141.139766,117.778753 L129.537188,126.397704 L128.264258,140.830158 Z" id="Fill-3" fill="#FF414B" mask="url(#mask-2)"></path>
|
||||||
|
</g>
|
||||||
|
</g>
|
||||||
|
</svg>
|
|
@ -0,0 +1,239 @@
|
||||||
|
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
|
||||||
|
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
|
||||||
|
"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
|
||||||
|
<!-- Generated by graphviz version 2.36.0 (20140111.2315)
|
||||||
|
-->
|
||||||
|
<!-- Title: Tree Pages: 1 -->
|
||||||
|
<svg width="866pt" height="676pt"
|
||||||
|
viewBox="0.00 0.00 866.00 676.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
|
||||||
|
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 672)">
|
||||||
|
<title>Tree</title>
|
||||||
|
<polygon fill="white" stroke="none" points="-4,4 -4,-672 862,-672 862,4 -4,4"/>
|
||||||
|
<!-- 0 -->
|
||||||
|
<g id="node1" class="node"><title>0</title>
|
||||||
|
<path fill="none" stroke="black" d="M520,-667.5C520,-667.5 384,-667.5 384,-667.5 378,-667.5 372,-661.5 372,-655.5 372,-655.5 372,-596.5 372,-596.5 372,-590.5 378,-584.5 384,-584.5 384,-584.5 520,-584.5 520,-584.5 526,-584.5 532,-590.5 532,-596.5 532,-596.5 532,-655.5 532,-655.5 532,-661.5 526,-667.5 520,-667.5"/>
|
||||||
|
<text text-anchor="start" x="380" y="-652.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal length (cm) ≤ 2.45</text>
|
||||||
|
<text text-anchor="start" x="412.5" y="-637.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.6667</text>
|
||||||
|
<text text-anchor="start" x="407" y="-622.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 150</text>
|
||||||
|
<text text-anchor="start" x="394" y="-607.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [50, 50, 50]</text>
|
||||||
|
<text text-anchor="start" x="408.5" y="-592.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = setosa</text>
|
||||||
|
</g>
|
||||||
|
<!-- 1 -->
|
||||||
|
<g id="node2" class="node"><title>1</title>
|
||||||
|
<path fill="#e58139" stroke="black" d="M421.25,-540C421.25,-540 328.75,-540 328.75,-540 322.75,-540 316.75,-534 316.75,-528 316.75,-528 316.75,-484 316.75,-484 316.75,-478 322.75,-472 328.75,-472 328.75,-472 421.25,-472 421.25,-472 427.25,-472 433.25,-478 433.25,-484 433.25,-484 433.25,-528 433.25,-528 433.25,-534 427.25,-540 421.25,-540"/>
|
||||||
|
<text text-anchor="start" x="347" y="-524.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="334" y="-509.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 50</text>
|
||||||
|
<text text-anchor="start" x="324.5" y="-494.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [50, 0, 0]</text>
|
||||||
|
<text text-anchor="start" x="331.5" y="-479.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = setosa</text>
|
||||||
|
</g>
|
||||||
|
<!-- 0->1 -->
|
||||||
|
<g id="edge1" class="edge"><title>0->1</title>
|
||||||
|
<path fill="none" stroke="black" d="M425.501,-584.391C418.013,-572.916 409.852,-560.41 402.317,-548.863"/>
|
||||||
|
<polygon fill="black" stroke="black" points="405.095,-546.714 396.699,-540.252 399.232,-550.54 405.095,-546.714"/>
|
||||||
|
<text text-anchor="middle" x="391.555" y="-561.017" font-family="Helvetica,sans-Serif" font-size="14.00">True</text>
|
||||||
|
</g>
|
||||||
|
<!-- 2 -->
|
||||||
|
<g id="node3" class="node"><title>2</title>
|
||||||
|
<path fill="none" stroke="black" d="M594.25,-547.5C594.25,-547.5 463.75,-547.5 463.75,-547.5 457.75,-547.5 451.75,-541.5 451.75,-535.5 451.75,-535.5 451.75,-476.5 451.75,-476.5 451.75,-470.5 457.75,-464.5 463.75,-464.5 463.75,-464.5 594.25,-464.5 594.25,-464.5 600.25,-464.5 606.25,-470.5 606.25,-476.5 606.25,-476.5 606.25,-535.5 606.25,-535.5 606.25,-541.5 600.25,-547.5 594.25,-547.5"/>
|
||||||
|
<text text-anchor="start" x="459.5" y="-532.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal width (cm) ≤ 1.75</text>
|
||||||
|
<text text-anchor="start" x="501" y="-517.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.5</text>
|
||||||
|
<text text-anchor="start" x="484" y="-502.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 100</text>
|
||||||
|
<text text-anchor="start" x="474.5" y="-487.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 50, 50]</text>
|
||||||
|
<text text-anchor="start" x="476.5" y="-472.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 0->2 -->
|
||||||
|
<g id="edge2" class="edge"><title>0->2</title>
|
||||||
|
<path fill="none" stroke="black" d="M478.499,-584.391C484.428,-575.306 490.778,-565.574 496.906,-556.183"/>
|
||||||
|
<polygon fill="black" stroke="black" points="499.918,-557.971 502.452,-547.684 494.056,-554.146 499.918,-557.971"/>
|
||||||
|
<text text-anchor="middle" x="507.596" y="-568.449" font-family="Helvetica,sans-Serif" font-size="14.00">False</text>
|
||||||
|
</g>
|
||||||
|
<!-- 3 -->
|
||||||
|
<g id="node4" class="node"><title>3</title>
|
||||||
|
<path fill="#39e581" fill-opacity="0.894118" stroke="black" d="M484,-427.5C484,-427.5 348,-427.5 348,-427.5 342,-427.5 336,-421.5 336,-415.5 336,-415.5 336,-356.5 336,-356.5 336,-350.5 342,-344.5 348,-344.5 348,-344.5 484,-344.5 484,-344.5 490,-344.5 496,-350.5 496,-356.5 496,-356.5 496,-415.5 496,-415.5 496,-421.5 490,-427.5 484,-427.5"/>
|
||||||
|
<text text-anchor="start" x="344" y="-412.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal length (cm) ≤ 4.95</text>
|
||||||
|
<text text-anchor="start" x="380.5" y="-397.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.168</text>
|
||||||
|
<text text-anchor="start" x="375" y="-382.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 54</text>
|
||||||
|
<text text-anchor="start" x="365.5" y="-367.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 49, 5]</text>
|
||||||
|
<text text-anchor="start" x="363.5" y="-352.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 2->3 -->
|
||||||
|
<g id="edge3" class="edge"><title>2->3</title>
|
||||||
|
<path fill="none" stroke="black" d="M490.112,-464.391C481.056,-454.935 471.33,-444.778 462,-435.035"/>
|
||||||
|
<polygon fill="black" stroke="black" points="464.404,-432.485 454.96,-427.684 459.348,-437.327 464.404,-432.485"/>
|
||||||
|
</g>
|
||||||
|
<!-- 12 -->
|
||||||
|
<g id="node13" class="node"><title>12</title>
|
||||||
|
<path fill="#8139e5" fill-opacity="0.976471" stroke="black" d="M710,-427.5C710,-427.5 574,-427.5 574,-427.5 568,-427.5 562,-421.5 562,-415.5 562,-415.5 562,-356.5 562,-356.5 562,-350.5 568,-344.5 574,-344.5 574,-344.5 710,-344.5 710,-344.5 716,-344.5 722,-350.5 722,-356.5 722,-356.5 722,-415.5 722,-415.5 722,-421.5 716,-427.5 710,-427.5"/>
|
||||||
|
<text text-anchor="start" x="570" y="-412.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal length (cm) ≤ 4.85</text>
|
||||||
|
<text text-anchor="start" x="602.5" y="-397.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0425</text>
|
||||||
|
<text text-anchor="start" x="601" y="-382.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 46</text>
|
||||||
|
<text text-anchor="start" x="591.5" y="-367.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 1, 45]</text>
|
||||||
|
<text text-anchor="start" x="593.5" y="-352.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 2->12 -->
|
||||||
|
<g id="edge12" class="edge"><title>2->12</title>
|
||||||
|
<path fill="none" stroke="black" d="M567.888,-464.391C576.944,-454.935 586.67,-444.778 596,-435.035"/>
|
||||||
|
<polygon fill="black" stroke="black" points="598.652,-437.327 603.04,-427.684 593.596,-432.485 598.652,-437.327"/>
|
||||||
|
</g>
|
||||||
|
<!-- 4 -->
|
||||||
|
<g id="node5" class="node"><title>4</title>
|
||||||
|
<path fill="#39e581" fill-opacity="0.976471" stroke="black" d="M260.25,-307.5C260.25,-307.5 129.75,-307.5 129.75,-307.5 123.75,-307.5 117.75,-301.5 117.75,-295.5 117.75,-295.5 117.75,-236.5 117.75,-236.5 117.75,-230.5 123.75,-224.5 129.75,-224.5 129.75,-224.5 260.25,-224.5 260.25,-224.5 266.25,-224.5 272.25,-230.5 272.25,-236.5 272.25,-236.5 272.25,-295.5 272.25,-295.5 272.25,-301.5 266.25,-307.5 260.25,-307.5"/>
|
||||||
|
<text text-anchor="start" x="125.5" y="-292.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal width (cm) ≤ 1.65</text>
|
||||||
|
<text text-anchor="start" x="155.5" y="-277.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0408</text>
|
||||||
|
<text text-anchor="start" x="154" y="-262.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 48</text>
|
||||||
|
<text text-anchor="start" x="144.5" y="-247.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 47, 1]</text>
|
||||||
|
<text text-anchor="start" x="142.5" y="-232.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 3->4 -->
|
||||||
|
<g id="edge4" class="edge"><title>3->4</title>
|
||||||
|
<path fill="none" stroke="black" d="M339.944,-344.391C320.671,-334.101 299.845,-322.981 280.154,-312.467"/>
|
||||||
|
<polygon fill="black" stroke="black" points="281.666,-309.306 271.196,-307.684 278.369,-315.481 281.666,-309.306"/>
|
||||||
|
</g>
|
||||||
|
<!-- 7 -->
|
||||||
|
<g id="node8" class="node"><title>7</title>
|
||||||
|
<path fill="#8139e5" fill-opacity="0.498039" stroke="black" d="M481.25,-307.5C481.25,-307.5 350.75,-307.5 350.75,-307.5 344.75,-307.5 338.75,-301.5 338.75,-295.5 338.75,-295.5 338.75,-236.5 338.75,-236.5 338.75,-230.5 344.75,-224.5 350.75,-224.5 350.75,-224.5 481.25,-224.5 481.25,-224.5 487.25,-224.5 493.25,-230.5 493.25,-236.5 493.25,-236.5 493.25,-295.5 493.25,-295.5 493.25,-301.5 487.25,-307.5 481.25,-307.5"/>
|
||||||
|
<text text-anchor="start" x="346.5" y="-292.3" font-family="Helvetica,sans-Serif" font-size="14.00">petal width (cm) ≤ 1.55</text>
|
||||||
|
<text text-anchor="start" x="376.5" y="-277.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.4444</text>
|
||||||
|
<text text-anchor="start" x="378.5" y="-262.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 6</text>
|
||||||
|
<text text-anchor="start" x="369" y="-247.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 2, 4]</text>
|
||||||
|
<text text-anchor="start" x="367.5" y="-232.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 3->7 -->
|
||||||
|
<g id="edge7" class="edge"><title>3->7</title>
|
||||||
|
<path fill="none" stroke="black" d="M416,-344.391C416,-335.862 416,-326.763 416,-317.912"/>
|
||||||
|
<polygon fill="black" stroke="black" points="419.5,-317.684 416,-307.684 412.5,-317.684 419.5,-317.684"/>
|
||||||
|
</g>
|
||||||
|
<!-- 5 -->
|
||||||
|
<g id="node6" class="node"><title>5</title>
|
||||||
|
<path fill="#39e581" stroke="black" d="M108.25,-180C108.25,-180 11.75,-180 11.75,-180 5.75,-180 -0.25,-174 -0.25,-168 -0.25,-168 -0.25,-124 -0.25,-124 -0.25,-118 5.75,-112 11.75,-112 11.75,-112 108.25,-112 108.25,-112 114.25,-112 120.25,-118 120.25,-124 120.25,-124 120.25,-168 120.25,-168 120.25,-174 114.25,-180 108.25,-180"/>
|
||||||
|
<text text-anchor="start" x="32" y="-164.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="19" y="-149.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 47</text>
|
||||||
|
<text text-anchor="start" x="9.5" y="-134.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 47, 0]</text>
|
||||||
|
<text text-anchor="start" x="7.5" y="-119.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 4->5 -->
|
||||||
|
<g id="edge5" class="edge"><title>4->5</title>
|
||||||
|
<path fill="none" stroke="black" d="M148.541,-224.391C134.64,-212.241 119.417,-198.935 105.574,-186.835"/>
|
||||||
|
<polygon fill="black" stroke="black" points="107.876,-184.198 98.043,-180.252 103.269,-189.469 107.876,-184.198"/>
|
||||||
|
</g>
|
||||||
|
<!-- 6 -->
|
||||||
|
<g id="node7" class="node"><title>6</title>
|
||||||
|
<path fill="#8139e5" stroke="black" d="M239.25,-180C239.25,-180 150.75,-180 150.75,-180 144.75,-180 138.75,-174 138.75,-168 138.75,-168 138.75,-124 138.75,-124 138.75,-118 144.75,-112 150.75,-112 150.75,-112 239.25,-112 239.25,-112 245.25,-112 251.25,-118 251.25,-124 251.25,-124 251.25,-168 251.25,-168 251.25,-174 245.25,-180 239.25,-180"/>
|
||||||
|
<text text-anchor="start" x="167" y="-164.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="157.5" y="-149.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 1</text>
|
||||||
|
<text text-anchor="start" x="148" y="-134.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 0, 1]</text>
|
||||||
|
<text text-anchor="start" x="146.5" y="-119.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 4->6 -->
|
||||||
|
<g id="edge6" class="edge"><title>4->6</title>
|
||||||
|
<path fill="none" stroke="black" d="M195,-224.391C195,-213.479 195,-201.634 195,-190.568"/>
|
||||||
|
<polygon fill="black" stroke="black" points="198.5,-190.252 195,-180.252 191.5,-190.252 198.5,-190.252"/>
|
||||||
|
</g>
|
||||||
|
<!-- 8 -->
|
||||||
|
<g id="node9" class="node"><title>8</title>
|
||||||
|
<path fill="#8139e5" stroke="black" d="M370.25,-180C370.25,-180 281.75,-180 281.75,-180 275.75,-180 269.75,-174 269.75,-168 269.75,-168 269.75,-124 269.75,-124 269.75,-118 275.75,-112 281.75,-112 281.75,-112 370.25,-112 370.25,-112 376.25,-112 382.25,-118 382.25,-124 382.25,-124 382.25,-168 382.25,-168 382.25,-174 376.25,-180 370.25,-180"/>
|
||||||
|
<text text-anchor="start" x="298" y="-164.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="288.5" y="-149.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 3</text>
|
||||||
|
<text text-anchor="start" x="279" y="-134.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 0, 3]</text>
|
||||||
|
<text text-anchor="start" x="277.5" y="-119.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 7->8 -->
|
||||||
|
<g id="edge8" class="edge"><title>7->8</title>
|
||||||
|
<path fill="none" stroke="black" d="M385.027,-224.391C376.189,-212.804 366.549,-200.165 357.67,-188.523"/>
|
||||||
|
<polygon fill="black" stroke="black" points="360.209,-186.081 351.362,-180.252 354.644,-190.326 360.209,-186.081"/>
|
||||||
|
</g>
|
||||||
|
<!-- 9 -->
|
||||||
|
<g id="node10" class="node"><title>9</title>
|
||||||
|
<path fill="#39e581" fill-opacity="0.498039" stroke="black" d="M551.25,-187.5C551.25,-187.5 412.75,-187.5 412.75,-187.5 406.75,-187.5 400.75,-181.5 400.75,-175.5 400.75,-175.5 400.75,-116.5 400.75,-116.5 400.75,-110.5 406.75,-104.5 412.75,-104.5 412.75,-104.5 551.25,-104.5 551.25,-104.5 557.25,-104.5 563.25,-110.5 563.25,-116.5 563.25,-116.5 563.25,-175.5 563.25,-175.5 563.25,-181.5 557.25,-187.5 551.25,-187.5"/>
|
||||||
|
<text text-anchor="start" x="408.5" y="-172.3" font-family="Helvetica,sans-Serif" font-size="14.00">sepal length (cm) ≤ 6.95</text>
|
||||||
|
<text text-anchor="start" x="442.5" y="-157.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.4444</text>
|
||||||
|
<text text-anchor="start" x="444.5" y="-142.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 3</text>
|
||||||
|
<text text-anchor="start" x="435" y="-127.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 2, 1]</text>
|
||||||
|
<text text-anchor="start" x="429.5" y="-112.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 7->9 -->
|
||||||
|
<g id="edge9" class="edge"><title>7->9</title>
|
||||||
|
<path fill="none" stroke="black" d="M438.713,-224.391C443.743,-215.398 449.127,-205.772 454.33,-196.471"/>
|
||||||
|
<polygon fill="black" stroke="black" points="457.418,-198.12 459.245,-187.684 451.308,-194.703 457.418,-198.12"/>
|
||||||
|
</g>
|
||||||
|
<!-- 10 -->
|
||||||
|
<g id="node11" class="node"><title>10</title>
|
||||||
|
<path fill="#39e581" stroke="black" d="M462.25,-68C462.25,-68 365.75,-68 365.75,-68 359.75,-68 353.75,-62 353.75,-56 353.75,-56 353.75,-12 353.75,-12 353.75,-6 359.75,-0 365.75,-0 365.75,-0 462.25,-0 462.25,-0 468.25,-0 474.25,-6 474.25,-12 474.25,-12 474.25,-56 474.25,-56 474.25,-62 468.25,-68 462.25,-68"/>
|
||||||
|
<text text-anchor="start" x="386" y="-52.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="376.5" y="-37.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 2</text>
|
||||||
|
<text text-anchor="start" x="367" y="-22.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 2, 0]</text>
|
||||||
|
<text text-anchor="start" x="361.5" y="-7.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 9->10 -->
|
||||||
|
<g id="edge10" class="edge"><title>9->10</title>
|
||||||
|
<path fill="none" stroke="black" d="M456.873,-104.353C451.368,-95.4478 445.529,-86.0034 439.992,-77.0452"/>
|
||||||
|
<polygon fill="black" stroke="black" points="442.813,-74.9525 434.577,-68.287 436.859,-78.6333 442.813,-74.9525"/>
|
||||||
|
</g>
|
||||||
|
<!-- 11 -->
|
||||||
|
<g id="node12" class="node"><title>11</title>
|
||||||
|
<path fill="#8139e5" stroke="black" d="M593.25,-68C593.25,-68 504.75,-68 504.75,-68 498.75,-68 492.75,-62 492.75,-56 492.75,-56 492.75,-12 492.75,-12 492.75,-6 498.75,-0 504.75,-0 504.75,-0 593.25,-0 593.25,-0 599.25,-0 605.25,-6 605.25,-12 605.25,-12 605.25,-56 605.25,-56 605.25,-62 599.25,-68 593.25,-68"/>
|
||||||
|
<text text-anchor="start" x="521" y="-52.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="511.5" y="-37.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 1</text>
|
||||||
|
<text text-anchor="start" x="502" y="-22.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 0, 1]</text>
|
||||||
|
<text text-anchor="start" x="500.5" y="-7.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 9->11 -->
|
||||||
|
<g id="edge11" class="edge"><title>9->11</title>
|
||||||
|
<path fill="none" stroke="black" d="M506.758,-104.353C512.182,-95.4478 517.934,-86.0034 523.391,-77.0452"/>
|
||||||
|
<polygon fill="black" stroke="black" points="526.512,-78.6482 528.725,-68.287 520.534,-75.0068 526.512,-78.6482"/>
|
||||||
|
</g>
|
||||||
|
<!-- 13 -->
|
||||||
|
<g id="node14" class="node"><title>13</title>
|
||||||
|
<path fill="#8139e5" fill-opacity="0.498039" stroke="black" d="M711.25,-307.5C711.25,-307.5 572.75,-307.5 572.75,-307.5 566.75,-307.5 560.75,-301.5 560.75,-295.5 560.75,-295.5 560.75,-236.5 560.75,-236.5 560.75,-230.5 566.75,-224.5 572.75,-224.5 572.75,-224.5 711.25,-224.5 711.25,-224.5 717.25,-224.5 723.25,-230.5 723.25,-236.5 723.25,-236.5 723.25,-295.5 723.25,-295.5 723.25,-301.5 717.25,-307.5 711.25,-307.5"/>
|
||||||
|
<text text-anchor="start" x="568.5" y="-292.3" font-family="Helvetica,sans-Serif" font-size="14.00">sepal length (cm) ≤ 5.95</text>
|
||||||
|
<text text-anchor="start" x="602.5" y="-277.3" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.4444</text>
|
||||||
|
<text text-anchor="start" x="604.5" y="-262.3" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 3</text>
|
||||||
|
<text text-anchor="start" x="595" y="-247.3" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 1, 2]</text>
|
||||||
|
<text text-anchor="start" x="593.5" y="-232.3" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 12->13 -->
|
||||||
|
<g id="edge13" class="edge"><title>12->13</title>
|
||||||
|
<path fill="none" stroke="black" d="M642,-344.391C642,-335.862 642,-326.763 642,-317.912"/>
|
||||||
|
<polygon fill="black" stroke="black" points="645.5,-317.684 642,-307.684 638.5,-317.684 645.5,-317.684"/>
|
||||||
|
</g>
|
||||||
|
<!-- 16 -->
|
||||||
|
<g id="node17" class="node"><title>16</title>
|
||||||
|
<path fill="#8139e5" stroke="black" d="M846.25,-300C846.25,-300 753.75,-300 753.75,-300 747.75,-300 741.75,-294 741.75,-288 741.75,-288 741.75,-244 741.75,-244 741.75,-238 747.75,-232 753.75,-232 753.75,-232 846.25,-232 846.25,-232 852.25,-232 858.25,-238 858.25,-244 858.25,-244 858.25,-288 858.25,-288 858.25,-294 852.25,-300 846.25,-300"/>
|
||||||
|
<text text-anchor="start" x="772" y="-284.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="759" y="-269.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 43</text>
|
||||||
|
<text text-anchor="start" x="749.5" y="-254.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 0, 43]</text>
|
||||||
|
<text text-anchor="start" x="751.5" y="-239.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 12->16 -->
|
||||||
|
<g id="edge16" class="edge"><title>12->16</title>
|
||||||
|
<path fill="none" stroke="black" d="M696.375,-344.391C712.87,-332.072 730.957,-318.564 747.337,-306.33"/>
|
||||||
|
<polygon fill="black" stroke="black" points="749.813,-308.85 755.73,-300.062 745.624,-303.242 749.813,-308.85"/>
|
||||||
|
</g>
|
||||||
|
<!-- 14 -->
|
||||||
|
<g id="node15" class="node"><title>14</title>
|
||||||
|
<path fill="#39e581" stroke="black" d="M690.25,-180C690.25,-180 593.75,-180 593.75,-180 587.75,-180 581.75,-174 581.75,-168 581.75,-168 581.75,-124 581.75,-124 581.75,-118 587.75,-112 593.75,-112 593.75,-112 690.25,-112 690.25,-112 696.25,-112 702.25,-118 702.25,-124 702.25,-124 702.25,-168 702.25,-168 702.25,-174 696.25,-180 690.25,-180"/>
|
||||||
|
<text text-anchor="start" x="614" y="-164.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="604.5" y="-149.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 1</text>
|
||||||
|
<text text-anchor="start" x="595" y="-134.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 1, 0]</text>
|
||||||
|
<text text-anchor="start" x="589.5" y="-119.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = versicolor</text>
|
||||||
|
</g>
|
||||||
|
<!-- 13->14 -->
|
||||||
|
<g id="edge14" class="edge"><title>13->14</title>
|
||||||
|
<path fill="none" stroke="black" d="M642,-224.391C642,-213.479 642,-201.634 642,-190.568"/>
|
||||||
|
<polygon fill="black" stroke="black" points="645.5,-190.252 642,-180.252 638.5,-190.252 645.5,-190.252"/>
|
||||||
|
</g>
|
||||||
|
<!-- 15 -->
|
||||||
|
<g id="node16" class="node"><title>15</title>
|
||||||
|
<path fill="#8139e5" stroke="black" d="M821.25,-180C821.25,-180 732.75,-180 732.75,-180 726.75,-180 720.75,-174 720.75,-168 720.75,-168 720.75,-124 720.75,-124 720.75,-118 726.75,-112 732.75,-112 732.75,-112 821.25,-112 821.25,-112 827.25,-112 833.25,-118 833.25,-124 833.25,-124 833.25,-168 833.25,-168 833.25,-174 827.25,-180 821.25,-180"/>
|
||||||
|
<text text-anchor="start" x="749" y="-164.8" font-family="Helvetica,sans-Serif" font-size="14.00">gini = 0.0</text>
|
||||||
|
<text text-anchor="start" x="739.5" y="-149.8" font-family="Helvetica,sans-Serif" font-size="14.00">samples = 2</text>
|
||||||
|
<text text-anchor="start" x="730" y="-134.8" font-family="Helvetica,sans-Serif" font-size="14.00">value = [0, 0, 2]</text>
|
||||||
|
<text text-anchor="start" x="728.5" y="-119.8" font-family="Helvetica,sans-Serif" font-size="14.00">class = virginica</text>
|
||||||
|
</g>
|
||||||
|
<!-- 13->15 -->
|
||||||
|
<g id="edge15" class="edge"><title>13->15</title>
|
||||||
|
<path fill="none" stroke="black" d="M688.459,-224.391C702.36,-212.241 717.583,-198.935 731.426,-186.835"/>
|
||||||
|
<polygon fill="black" stroke="black" points="733.731,-189.469 738.957,-180.252 729.124,-184.198 733.731,-189.469"/>
|
||||||
|
</g>
|
||||||
|
</g>
|
||||||
|
</svg>
|
|
@ -0,0 +1,20 @@
|
||||||
|
The scikit-learn machine learning cheat sheet was originally created by Andreas Mueller:
|
||||||
|
https://peekaboo-vision.blogspot.de/2013/01/machine-learning-cheat-sheet-for-scikit.html
|
||||||
|
|
||||||
|
The current version of the chart is located at `doc/images/ml_map.svg` in SVG+XML
|
||||||
|
format, created using [draw.io](https://draw.io/). To edit the chart, open the file in
|
||||||
|
draw.io, make changes, and save. This should update the chart in-place. Another option
|
||||||
|
would be to re-export the chart as SVG and replace the existing file. The options used
|
||||||
|
for exporting the chart are:
|
||||||
|
|
||||||
|
- Zoom: 100%
|
||||||
|
- Border width: 15
|
||||||
|
- Size: Diagram
|
||||||
|
- Transparent Background: False
|
||||||
|
- Appearance: Light
|
||||||
|
|
||||||
|
Each node in the chart that contains an estimator should have a link, where the root
|
||||||
|
directory is at `../../`. Note that after updating or re-exporting the SVG, the links
|
||||||
|
may be prefixed with e.g. `https://app.diagrams.net/`. Remember to check and remove
|
||||||
|
them, for instance by replacing all occurrences of `https://app.diagrams.net/../../`
|
||||||
|
with `../../`.
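
A quick way to do this replacement (a hypothetical sketch, not part of the
original instructions; it assumes the script is run from the repository root):

```python
# Hypothetical cleanup sketch: strip the draw.io prefix that re-exporting
# may prepend to the estimator links in the chart.
from pathlib import Path

path = Path("doc/images/ml_map.svg")
svg = path.read_text(encoding="utf-8")
path.write_text(svg.replace("https://app.diagrams.net/../../", "../../"), encoding="utf-8")
```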
|