torch.utils.bottleneck
======================

.. automodule:: torch.utils.bottleneck
.. currentmodule:: torch.utils.bottleneck

`torch.utils.bottleneck` is a tool that can be used as an initial step for
debugging bottlenecks in your program. It summarizes runs of your script with
the Python profiler and PyTorch's autograd profiler.

Run it on the command line with

::

    python -m torch.utils.bottleneck /path/to/source/script.py [args]

where ``[args]`` are any number of arguments to `script.py`, or run
``python -m torch.utils.bottleneck -h`` for more usage instructions.
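Under the hood, the cProfile half of ``bottleneck`` is the standard-library
Python profiler, and the same statistics can be collected programmatically.
The snippet below is only an illustrative sketch (``slow_sum`` is an invented
hotspot for the example), not the output format of ``bottleneck`` itself::

    import cProfile
    import io
    import pstats

    def slow_sum(n):
        # Deliberately unvectorized loop so it shows up as a hotspot.
        total = 0
        for i in range(n):
            total += i * i
        return total

    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(100_000)
    profiler.disable()

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(5)  # print the 5 costliest entries
    print(stream.getvalue())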

.. warning::
    Because your script will be profiled, please ensure that it exits in a
    finite amount of time.

.. warning::
    Due to the asynchronous nature of CUDA kernels, when running against
    CUDA code, the cProfile output and CPU-mode autograd profilers may
    not show correct timings: the reported CPU time covers only the time
    used to launch the kernels and does not include the time the kernels
    spent executing on the GPU, unless the operation does a synchronize.
    Ops that do synchronize appear to be extremely expensive under regular
    CPU-mode profilers.
    In cases where timings are incorrect, the CUDA-mode autograd profiler
    may be helpful.
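The launch-versus-execution gap described above can be mimicked in pure
Python, with a background thread standing in for a CUDA kernel. This is a
loose analogy for illustration only, not real CUDA code::

    import threading
    import time

    def fake_kernel(duration):
        # Stand-in for GPU work: sleeps instead of computing.
        time.sleep(duration)

    start = time.perf_counter()
    t = threading.Thread(target=fake_kernel, args=(0.2,))
    t.start()                    # "launch" returns almost immediately
    launch_time = time.perf_counter() - start

    t.join()                     # "synchronize" waits for the work to finish
    total_time = time.perf_counter() - start

    print(f"launch: {launch_time:.3f}s, after sync: {total_time:.3f}s")

A CPU-mode profiler attributes only something close to ``launch_time`` to the
launching call unless a synchronization point forces the wait to be observed.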

.. note::
    To decide which (CPU-only-mode or CUDA-mode) autograd profiler output to
    look at, you should first check whether your script is CPU-bound
    ("CPU total time is much greater than CUDA total time").
    If it is CPU-bound, looking at the results of the CPU-mode autograd
    profiler will help. If, on the other hand, your script spends most of its
    time executing on the GPU, then it makes sense to start
    looking for responsible CUDA operators in the output of the CUDA-mode
    autograd profiler.

Of course, reality is more complicated, and your script might not sit at
either of those two extremes, depending on the part of the model you're
evaluating. If the profiler outputs don't help, you could try looking at
the result of :func:`torch.autograd.profiler.emit_nvtx()` with ``nvprof``.
However, please take into account that the NVTX overhead is very high and
often gives a heavily skewed timeline. Similarly, ``Intel® VTune™ Profiler``
helps to analyze performance on Intel platforms further with
:func:`torch.autograd.profiler.emit_itt()`.

.. warning::
    If you are profiling CUDA code, the first profiler that ``bottleneck`` runs
    (cProfile) will include the CUDA startup time (CUDA buffer allocation cost)
    in its time reporting. This should not matter if your bottlenecks result
    in code much slower than the CUDA startup time.

For more complicated uses of the profilers (like in a multi-GPU case),
please see https://docs.python.org/3/library/profile.html
or :func:`torch.autograd.profiler.profile()` for more information.