torch.utils.data
===================================

.. automodule:: torch.utils.data

At the heart of PyTorch data loading utility is the :class:`torch.utils.data.DataLoader`
class. It represents a Python iterable over a dataset, with support for

* `map-style and iterable-style datasets <Dataset Types_>`_,

* `customizing data loading order <Data Loading Order and Sampler_>`_,

* `automatic batching <Loading Batched and Non-Batched Data_>`_,

* `single- and multi-process data loading <Single- and Multi-process Data Loading_>`_,

* `automatic memory pinning <Memory Pinning_>`_.

These options are configured by the constructor arguments of a
:class:`~torch.utils.data.DataLoader`, which has signature::

    DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
               batch_sampler=None, num_workers=0, collate_fn=None,
               pin_memory=False, drop_last=False, timeout=0,
               worker_init_fn=None, *, prefetch_factor=2,
               persistent_workers=False)

The sections below describe in detail the effects and usages of these options.

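To make these defaults concrete, here is a minimal sketch (the toy dataset and
tensor values are invented for illustration) of iterating a
:class:`~torch.utils.data.DataLoader` over a small map-style dataset::

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # a toy map-style dataset of 10 (feature, label) pairs
    features = torch.arange(20, dtype=torch.float32).view(10, 2)
    labels = torch.arange(10)
    dataset = TensorDataset(features, labels)

    # batch_size=4 groups samples; the final batch holds the 2 leftovers
    loader = DataLoader(dataset, batch_size=4, shuffle=False)
    batch_shapes = [tuple(x.shape) for x, y in loader]
    print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]

Passing ``drop_last=True`` instead would discard the final, incomplete batch.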
Dataset Types
-------------

The most important argument of the :class:`~torch.utils.data.DataLoader`
constructor is :attr:`dataset`, which indicates a dataset object to load data
from. PyTorch supports two different types of datasets:

* `map-style datasets <Map-style datasets_>`_,

* `iterable-style datasets <Iterable-style datasets_>`_.

Map-style datasets
^^^^^^^^^^^^^^^^^^

A map-style dataset is one that implements the :meth:`__getitem__` and
:meth:`__len__` protocols, and represents a map from (possibly non-integral)
indices/keys to data samples.

For example, such a dataset, when accessed with ``dataset[idx]``, could read
the ``idx``-th image and its corresponding label from a folder on the disk.

See :class:`~torch.utils.data.Dataset` for more details.

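As an illustration, a minimal map-style dataset (the class and its contents are
hypothetical) only needs the two methods above::

    import torch
    from torch.utils.data import Dataset

    class SquaresDataset(Dataset):
        """A toy map-style dataset: index ``i`` maps to the sample ``(i, i**2)``."""

        def __len__(self):
            return 5

        def __getitem__(self, idx):
            return torch.tensor([idx, idx ** 2])

    dataset = SquaresDataset()
    print(len(dataset))         # 5
    print(dataset[3].tolist())  # [3, 9]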
Iterable-style datasets
^^^^^^^^^^^^^^^^^^^^^^^

An iterable-style dataset is an instance of a subclass of :class:`~torch.utils.data.IterableDataset`
that implements the :meth:`__iter__` protocol, and represents an iterable over
data samples. This type of dataset is particularly suitable for cases where
random reads are expensive or even improbable, and where the batch size depends
on the fetched data.

For example, such a dataset, when called with ``iter(dataset)``, could return a
stream of data read from a database, a remote server, or even logs generated
in real time.

See :class:`~torch.utils.data.IterableDataset` for more details.

.. note:: When using an :class:`~torch.utils.data.IterableDataset` with
          `multi-process data loading <Multi-process data loading_>`_, the same
          dataset object is replicated on each worker process, and thus the
          replicas must be configured differently to avoid duplicated data. See
          :class:`~torch.utils.data.IterableDataset` documentation for how to
          achieve this.

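As an illustration, a minimal iterable-style dataset (the class is hypothetical,
standing in for, e.g., a stream read from a socket) only needs :meth:`__iter__`::

    import torch
    from torch.utils.data import IterableDataset

    class CountStream(IterableDataset):
        """A toy iterable-style dataset streaming the numbers ``0 .. n-1``."""

        def __init__(self, n):
            self.n = n

        def __iter__(self):
            for i in range(self.n):
                yield torch.tensor(i)

    stream = CountStream(4)
    print([x.item() for x in stream])  # [0, 1, 2, 3]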
Data Loading Order and :class:`~torch.utils.data.Sampler`
----------------------------------------------------------

For `iterable-style datasets <Iterable-style datasets_>`_, data loading order
is entirely controlled by the user-defined iterable. This allows easier
implementations of chunk-reading and dynamic batch size (e.g., by yielding a
batched sample each time).

The rest of this section concerns the case with
`map-style datasets <Map-style datasets_>`_. :class:`torch.utils.data.Sampler`
classes are used to specify the sequence of indices/keys used in data loading.
They represent iterable objects over the indices to datasets. E.g., in the
common case with stochastic gradient descent (SGD), a
:class:`~torch.utils.data.Sampler` could randomly permute a list of indices
and yield each one at a time, or yield a small number of them for mini-batch
SGD.

A sequential or shuffled sampler will be automatically constructed based on the
:attr:`shuffle` argument to a :class:`~torch.utils.data.DataLoader`.
Alternatively, users may use the :attr:`sampler` argument to specify a
custom :class:`~torch.utils.data.Sampler` object that each time yields
the next index/key to fetch.

A custom :class:`~torch.utils.data.Sampler` that yields a list of batch
indices at a time can be passed as the :attr:`batch_sampler` argument.
Automatic batching can also be enabled via the :attr:`batch_size` and
:attr:`drop_last` arguments. See
`the next section <Loading Batched and Non-Batched Data_>`_ for more details
on this.

.. note::
    Neither :attr:`sampler` nor :attr:`batch_sampler` is compatible with
    iterable-style datasets, since such datasets have no notion of a key or an
    index.

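For illustration, a custom :class:`~torch.utils.data.Sampler` (hypothetical; it
simply yields indices in reverse order) can look like this::

    from torch.utils.data import Sampler

    class ReverseSampler(Sampler):
        """A toy Sampler yielding the dataset indices from last to first."""

        def __init__(self, data_source):
            self.data_source = data_source

        def __iter__(self):
            return iter(range(len(self.data_source) - 1, -1, -1))

        def __len__(self):
            return len(self.data_source)

    indices = list(ReverseSampler(["a", "b", "c", "d", "e"]))
    print(indices)  # [4, 3, 2, 1, 0]

Such a sampler would be passed as ``DataLoader(dataset, sampler=ReverseSampler(dataset))``.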
Loading Batched and Non-Batched Data
------------------------------------

:class:`~torch.utils.data.DataLoader` supports automatically collating
individual fetched data samples into batches via the arguments
:attr:`batch_size`, :attr:`drop_last`, :attr:`batch_sampler`, and
:attr:`collate_fn` (which has a default function).

Automatic batching (default)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the most common case, and corresponds to fetching a minibatch of
data and collating them into batched samples, i.e., containing Tensors with
one dimension being the batch dimension (usually the first).

When :attr:`batch_size` (default ``1``) is not ``None``, the data loader yields
batched samples instead of individual samples. The :attr:`batch_size` and
:attr:`drop_last` arguments are used to specify how the data loader obtains
batches of dataset keys. For map-style datasets, users can alternatively
specify :attr:`batch_sampler`, which yields a list of keys at a time.

.. note::
    The :attr:`batch_size` and :attr:`drop_last` arguments are essentially used
    to construct a :attr:`batch_sampler` from the :attr:`sampler`. For map-style
    datasets, the :attr:`sampler` is either provided by the user or constructed
    based on the :attr:`shuffle` argument. For iterable-style datasets, the
    :attr:`sampler` is a dummy infinite one. See
    `this section <Data Loading Order and Sampler_>`_ for more details on
    samplers.

.. note::
    When fetching from
    `iterable-style datasets <Iterable-style datasets_>`_ with
    `multi-processing <Multi-process data loading_>`_, the :attr:`drop_last`
    argument drops the last non-full batch of each worker's dataset replica.

After fetching a list of samples using the indices from the sampler, the function
passed as the :attr:`collate_fn` argument is used to collate lists of samples
into batches.

In this case, loading from a map-style dataset is roughly equivalent to::

    for indices in batch_sampler:
        yield collate_fn([dataset[i] for i in indices])

and loading from an iterable-style dataset is roughly equivalent to::

    dataset_iter = iter(dataset)
    for indices in batch_sampler:
        yield collate_fn([next(dataset_iter) for _ in indices])

A custom :attr:`collate_fn` can be used to customize collation, e.g., padding
sequential data to the max length of a batch. See
`this section <dataloader-collate_fn_>`_ for more about :attr:`collate_fn`.

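The equivalence above can be exercised with plain Python stand-ins (the toy
dataset, the batch-sampler output, and the trivial ``collate_fn`` below are all
invented for illustration)::

    def collate_fn(samples):
        # trivial stand-in for default_collate: keep the list as-is
        return samples

    dataset = [10, 11, 12, 13, 14]         # toy map-style dataset
    batch_sampler = [[0, 1], [2, 3], [4]]  # what a batch sampler might yield

    batches = [collate_fn([dataset[i] for i in indices])
               for indices in batch_sampler]
    print(batches)  # [[10, 11], [12, 13], [14]]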
Disable automatic batching
^^^^^^^^^^^^^^^^^^^^^^^^^^

In certain cases, users may want to handle batching manually in dataset code,
or simply load individual samples. For example, it could be cheaper to directly
load batched data (e.g., bulk reads from a database or reading continuous
chunks of memory), or the batch size may be data dependent, or the program may be
designed to work on individual samples. Under these scenarios, it's likely
better to not use automatic batching (where :attr:`collate_fn` is used to
collate the samples), but let the data loader directly return each member of
the :attr:`dataset` object.

When both :attr:`batch_size` and :attr:`batch_sampler` are ``None`` (the default
value for :attr:`batch_sampler` is already ``None``), automatic batching is
disabled. Each sample obtained from the :attr:`dataset` is processed with the
function passed as the :attr:`collate_fn` argument.

**When automatic batching is disabled**, the default :attr:`collate_fn` simply
converts NumPy arrays into PyTorch Tensors, and keeps everything else untouched.

In this case, loading from a map-style dataset is roughly equivalent to::

    for index in sampler:
        yield collate_fn(dataset[index])

and loading from an iterable-style dataset is roughly equivalent to::

    for data in iter(dataset):
        yield collate_fn(data)

See `this section <dataloader-collate_fn_>`_ for more about :attr:`collate_fn`.

.. _dataloader-collate_fn:

Working with :attr:`collate_fn`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The use of :attr:`collate_fn` is slightly different depending on whether
automatic batching is enabled or disabled.

**When automatic batching is disabled**, :attr:`collate_fn` is called with
each individual data sample, and the output is yielded from the data loader
iterator. In this case, the default :attr:`collate_fn` simply converts NumPy
arrays into PyTorch tensors.

**When automatic batching is enabled**, :attr:`collate_fn` is called with a list
of data samples each time. It is expected to collate the input samples into
a batch for yielding from the data loader iterator. The rest of this section
describes the behavior of the default :attr:`collate_fn`
(:func:`~torch.utils.data.default_collate`).

For instance, if each data sample consists of a 3-channel image and an integral
class label, i.e., each element of the dataset returns a tuple
``(image, class_index)``, the default :attr:`collate_fn` collates a list of
such tuples into a single tuple of a batched image Tensor and a batched class
label Tensor. In particular, the default :attr:`collate_fn` has the following
properties:

* It always prepends a new dimension as the batch dimension.

* It automatically converts NumPy arrays and Python numerical values into
  PyTorch Tensors.

* It preserves the data structure, e.g., if each sample is a dictionary, it
  outputs a dictionary with the same set of keys but batched Tensors as values
  (or lists if the values can not be converted into Tensors). Same
  for ``list`` s, ``tuple`` s, ``namedtuple`` s, etc.

Users may use a customized :attr:`collate_fn` to achieve custom batching, e.g.,
collating along a dimension other than the first, padding sequences of
various lengths, or adding support for custom data types.

If you run into a situation where the outputs of :class:`~torch.utils.data.DataLoader`
have dimensions or types that differ from your expectations, you may
want to check your :attr:`collate_fn`.

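As a sketch of the padding use case (the toy sequences below are invented;
:func:`torch.nn.utils.rnn.pad_sequence` does the actual padding)::

    import torch
    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    def pad_collate(batch):
        """Pad variable-length 1-D sequences to the max length in the batch."""
        return pad_sequence(batch, batch_first=True, padding_value=0)

    # a plain list works as a map-style dataset of variable-length sequences
    dataset = [torch.tensor([1, 2, 3]), torch.tensor([4]), torch.tensor([5, 6])]
    loader = DataLoader(dataset, batch_size=3, collate_fn=pad_collate)

    batch = next(iter(loader))
    print(batch.tolist())  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]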
Single- and Multi-process Data Loading
--------------------------------------

A :class:`~torch.utils.data.DataLoader` uses single-process data loading by
default.

Within a Python process, the
`Global Interpreter Lock (GIL) <https://wiki.python.org/moin/GlobalInterpreterLock>`_
prevents fully parallelizing Python code across threads. To avoid blocking
computation code with data loading, PyTorch provides an easy switch to perform
multi-process data loading by simply setting the argument :attr:`num_workers`
to a positive integer.

Single-process data loading (default)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this mode, data fetching is done in the same process in which a
:class:`~torch.utils.data.DataLoader` is initialized. Therefore, data loading
may block computing. However, this mode may be preferred when the resource(s)
used for sharing data among processes (e.g., shared memory, file descriptors)
are limited, or when the entire dataset is small and can be loaded entirely in
memory. Additionally, single-process loading often shows more readable error
traces and thus is useful for debugging.

Multi-process data loading
^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting the argument :attr:`num_workers` to a positive integer will
turn on multi-process data loading with the specified number of loader worker
processes.

.. warning::
    After several iterations, the loader worker processes will consume
    the same amount of CPU memory as the parent process for all Python
    objects in the parent process which are accessed from the worker
    processes. This can be problematic if the Dataset contains a lot of
    data (e.g., you are loading a very large list of filenames at Dataset
    construction time) and/or you are using a lot of workers (overall
    memory usage is ``number of workers * size of parent process``). The
    simplest workaround is to replace Python objects with non-refcounted
    representations such as Pandas, NumPy or PyArrow objects. Check out
    `issue #13246
    <https://github.com/pytorch/pytorch/issues/13246#issuecomment-905703662>`_
    for more details on why this occurs and example code for how to
    work around these problems.

In this mode, each time an iterator of a :class:`~torch.utils.data.DataLoader`
is created (e.g., when you call ``enumerate(dataloader)``), :attr:`num_workers`
worker processes are created. At this point, the :attr:`dataset`,
:attr:`collate_fn`, and :attr:`worker_init_fn` are passed to each
worker, where they are used to initialize and fetch data. This means that
dataset access together with its internal IO and transforms
(including :attr:`collate_fn`) runs in the worker process.

:func:`torch.utils.data.get_worker_info()` returns various useful information
in a worker process (including the worker id, dataset replica, initial seed,
etc.), and returns ``None`` in the main process. Users may use this function in
dataset code and/or :attr:`worker_init_fn` to individually configure each
dataset replica, and to determine whether the code is running in a worker
process. For example, this can be particularly helpful in sharding the dataset.

For map-style datasets, the main process generates the indices using
:attr:`sampler` and sends them to the workers. So any shuffle randomization is
done in the main process, which guides loading by assigning indices to load.

For iterable-style datasets, since each worker process gets a replica of the
:attr:`dataset` object, naive multi-process loading will often result in
duplicated data. Using :func:`torch.utils.data.get_worker_info()` and/or
:attr:`worker_init_fn`, users may configure each replica independently. (See
:class:`~torch.utils.data.IterableDataset` documentation for how to achieve
this.) For similar reasons, in multi-process loading, the :attr:`drop_last`
argument drops the last non-full batch of each worker's iterable-style dataset
replica.

Workers are shut down once the end of the iteration is reached, or when the
iterator becomes garbage collected.

.. warning::
    It is generally not recommended to return CUDA tensors in multi-process
    loading because of many subtleties in using CUDA and sharing CUDA tensors in
    multiprocessing (see :ref:`multiprocessing-cuda-note`). Instead, we recommend
    using `automatic memory pinning <Memory Pinning_>`_ (i.e., setting
    :attr:`pin_memory=True`), which enables fast data transfer to CUDA-enabled
    GPUs.

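A common pattern is to shard an iterable-style dataset inside :meth:`__iter__`
using :func:`~torch.utils.data.get_worker_info` (a sketch; the dataset class and
its range are invented for illustration)::

    import math
    from torch.utils.data import IterableDataset, get_worker_info

    class RangeDataset(IterableDataset):
        """Streams integers in ``[start, end)``; each worker yields only its shard."""

        def __init__(self, start, end):
            self.start, self.end = start, end

        def __iter__(self):
            info = get_worker_info()
            if info is None:   # single-process loading: yield the full range
                lo, hi = self.start, self.end
            else:              # in a worker: yield only this worker's slice
                per_worker = math.ceil((self.end - self.start) / info.num_workers)
                lo = self.start + info.id * per_worker
                hi = min(lo + per_worker, self.end)
            return iter(range(lo, hi))

    # in the main process get_worker_info() returns None, so we see everything
    print(list(RangeDataset(0, 6)))  # [0, 1, 2, 3, 4, 5]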
Platform-specific behaviors
"""""""""""""""""""""""""""

Since workers rely on Python :py:mod:`multiprocessing`, worker launch behavior is
different on Windows compared to Unix.

* On Unix, :func:`fork()` is the default :py:mod:`multiprocessing` start method.
  Using :func:`fork`, child workers typically can access the :attr:`dataset` and
  Python argument functions directly through the cloned address space.

* On Windows or macOS, :func:`spawn()` is the default :py:mod:`multiprocessing` start method.
  Using :func:`spawn()`, another interpreter is launched which runs your main script,
  followed by the internal worker function that receives the :attr:`dataset`,
  :attr:`collate_fn` and other arguments through :py:mod:`pickle` serialization.

This separate serialization means that you should take two steps to ensure you
are compatible with Windows while using multi-process data loading:

- Wrap most of your main script's code within an ``if __name__ == '__main__':`` block,
  to make sure it doesn't run again (most likely generating an error) when each worker
  process is launched. You can place your dataset and :class:`~torch.utils.data.DataLoader`
  instance creation logic here, as it doesn't need to be re-executed in workers.

- Make sure that any custom :attr:`collate_fn`, :attr:`worker_init_fn`
  or :attr:`dataset` code is declared as a top-level definition, outside of the
  ``__main__`` check. This ensures that they are available in worker processes.
  (This is needed since functions are pickled as references only, not as bytecode.)

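A Windows-compatible script therefore follows this shape (a sketch; the helper
and dataset are invented, and ``num_workers=0`` is used only so the sketch runs
anywhere, since the guard matters exactly when it is positive)::

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # top-level definition: pickled by reference, so visible to spawned workers
    def collate_as_list(batch):
        return [sample for sample in batch]

    def main():
        dataset = TensorDataset(torch.arange(8))
        loader = DataLoader(dataset, batch_size=4, num_workers=0,
                            collate_fn=collate_as_list)
        return [len(batch) for batch in loader]

    if __name__ == '__main__':
        print(main())  # [4, 4]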
.. _data-loading-randomness:

Randomness in multi-process data loading
""""""""""""""""""""""""""""""""""""""""""

By default, each worker will have its PyTorch seed set to ``base_seed + worker_id``,
where ``base_seed`` is a long generated by the main process using its RNG (thereby
consuming an RNG state) or a specified :attr:`generator`. However, seeds for other
libraries may be duplicated upon initializing workers, causing each worker to return
identical random numbers. (See :ref:`this section <dataloader-workers-random-seed>` in the FAQ.)

In :attr:`worker_init_fn`, you may access the PyTorch seed set for each worker
with either :func:`torch.utils.data.get_worker_info().seed <torch.utils.data.get_worker_info>`
or :func:`torch.initial_seed()`, and use it to seed other libraries before data
loading.

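A typical :attr:`worker_init_fn` along these lines (a sketch assuming NumPy is
the other library to be seeded)::

    import numpy as np
    import torch

    def worker_init_fn(worker_id):
        # derive a NumPy seed from the per-worker PyTorch seed;
        # NumPy seeds must fit in 32 bits, hence the modulo
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)

    # passed to the loader so every worker reseeds NumPy before loading data:
    # DataLoader(dataset, num_workers=2, worker_init_fn=worker_init_fn)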
Memory Pinning
--------------

Host-to-GPU copies are much faster when they originate from pinned (page-locked)
memory. See :ref:`cuda-memory-pinning` for more details on when and how to use
pinned memory generally.

For data loading, passing :attr:`pin_memory=True` to a
:class:`~torch.utils.data.DataLoader` will automatically put the fetched data
Tensors in pinned memory, and thus enable faster data transfer to CUDA-enabled
GPUs.

The default memory pinning logic only recognizes Tensors and maps and iterables
containing Tensors. By default, if the pinning logic sees a batch that is a
custom type (which will occur if you have a :attr:`collate_fn` that returns a
custom batch type), or if each element of your batch is a custom type, the
pinning logic will not recognize them, and it will return that batch (or those
elements) without pinning the memory. To enable memory pinning for a custom
batch or data type(s), define a :meth:`pin_memory` method on your custom
type(s).

See the example below.

Example::

    class SimpleCustomBatch:
        def __init__(self, data):
            transposed_data = list(zip(*data))
            self.inp = torch.stack(transposed_data[0], 0)
            self.tgt = torch.stack(transposed_data[1], 0)

        # custom memory pinning method on custom type
        def pin_memory(self):
            self.inp = self.inp.pin_memory()
            self.tgt = self.tgt.pin_memory()
            return self

    def collate_wrapper(batch):
        return SimpleCustomBatch(batch)

    inps = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
    tgts = torch.arange(10 * 5, dtype=torch.float32).view(10, 5)
    dataset = TensorDataset(inps, tgts)

    loader = DataLoader(dataset, batch_size=2, collate_fn=collate_wrapper,
                        pin_memory=True)

    for batch_ndx, sample in enumerate(loader):
        print(sample.inp.is_pinned())
        print(sample.tgt.is_pinned())

.. autoclass:: DataLoader
.. autoclass:: Dataset
.. autoclass:: IterableDataset
.. autoclass:: TensorDataset
.. autoclass:: StackDataset
.. autoclass:: ConcatDataset
.. autoclass:: ChainDataset
.. autoclass:: Subset
.. autofunction:: torch.utils.data._utils.collate.collate
.. autofunction:: torch.utils.data.default_collate
.. autofunction:: torch.utils.data.default_convert
.. autofunction:: torch.utils.data.get_worker_info
.. autofunction:: torch.utils.data.random_split
.. autoclass:: torch.utils.data.Sampler
.. autoclass:: torch.utils.data.SequentialSampler
.. autoclass:: torch.utils.data.RandomSampler
.. autoclass:: torch.utils.data.SubsetRandomSampler
.. autoclass:: torch.utils.data.WeightedRandomSampler
.. autoclass:: torch.utils.data.BatchSampler
.. autoclass:: torch.utils.data.distributed.DistributedSampler

.. These modules are documented as part of torch/data listing them here for
.. now until we have a clearer fix
.. py:module:: torch.utils.data.datapipes
.. py:module:: torch.utils.data.datapipes.dataframe
.. py:module:: torch.utils.data.datapipes.iter
.. py:module:: torch.utils.data.datapipes.map
.. py:module:: torch.utils.data.datapipes.utils