.. currentmodule:: torch.cuda._sanitizer

CUDA Stream Sanitizer
=====================

.. note::
    This is a prototype feature, which means it is at an early stage
    for feedback and testing, and its components are subject to change.

Overview
--------

.. automodule:: torch.cuda._sanitizer

Usage
-----

Here is an example of a simple synchronization error in PyTorch:

::

    import torch

    a = torch.rand(10000, device="cuda")

    with torch.cuda.stream(torch.cuda.Stream()):
        torch.mul(a, 5, out=a)

The ``a`` tensor is initialized on the default stream and, without any synchronization
methods, modified on a new stream. The two kernels will run concurrently on the same tensor,
which might cause the second kernel to read uninitialized data before the first one was able
to write it, or the first kernel might overwrite part of the result of the second.
When this script is run on the command line with:

::

    TORCH_CUDA_SANITIZER=1 python example_error.py

the following output is printed by CSAN:

::

    ============================
    CSAN detected a possible data race on tensor with data pointer 139719969079296
    Access by stream 94646435460352 during kernel:
    aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
    writing to argument(s) self, out, and to the output
    With stack trace:
      File "example_error.py", line 6, in <module>
        torch.mul(a, 5, out=a)
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
        stack_trace = traceback.StackSummary.extract(

    Previous access by stream 0 during kernel:
    aten::rand(int[] size, *, int? dtype=None, Device? device=None) -> Tensor
    writing to the output
    With stack trace:
      File "example_error.py", line 3, in <module>
        a = torch.rand(10000, device="cuda")
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 364, in _handle_kernel_launch
        stack_trace = traceback.StackSummary.extract(

    Tensor was allocated with stack trace:
      File "example_error.py", line 3, in <module>
        a = torch.rand(10000, device="cuda")
      ...
      File "pytorch/torch/cuda/_sanitizer.py", line 420, in _handle_memory_allocation
        traceback.StackSummary.extract(

This gives extensive insight into the origin of the error:

- A tensor was incorrectly accessed from streams with ids: 0 (default stream) and 94646435460352 (new stream)
- The tensor was allocated by invoking ``a = torch.rand(10000, device="cuda")``
- The faulty accesses were caused by the following operators:

  - ``a = torch.rand(10000, device="cuda")`` on stream 0
  - ``torch.mul(a, 5, out=a)`` on stream 94646435460352

- The error message also displays the schemas of the invoked operators, along with a note
  showing which arguments of the operators correspond to the affected tensor.

  - In the example, it can be seen that tensor ``a`` corresponds to the arguments ``self``
    and ``out``, as well as to the output value, of the invoked operator ``torch.mul``
    (in the schema, the ``(a!)`` annotation on ``out`` marks an argument that the operator
    mutates in place).

.. seealso::
    The list of supported torch operators and their schemas can be viewed
    :doc:`here <torch>`.
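
For a quick interactive look at one of these schemas, the schema object of an operator
overload can also be printed from Python. This is a minimal sketch; ``_schema`` is an
internal, underscore-prefixed attribute and may change between releases:

::

    import torch

    # Prints: aten::mul.out(Tensor self, Tensor other, *, Tensor(a!) out) -> Tensor(a!)
    print(torch.ops.aten.mul.out._schema)
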
The bug can be fixed by forcing the new stream to wait for the default stream:

::

    with torch.cuda.stream(torch.cuda.Stream()):
        torch.cuda.current_stream().wait_stream(torch.cuda.default_stream())
        torch.mul(a, 5, out=a)

When the script is run again, there are no errors reported.
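
The same dependency can also be expressed with a CUDA event. This is a minimal, equivalent
sketch using the public ``torch.cuda.Event`` API; the variable name ``ready`` is only
illustrative:

::

    import torch

    a = torch.rand(10000, device="cuda")

    # Record an event on the default stream once the kernel writing ``a`` is enqueued.
    ready = torch.cuda.Event()
    ready.record(torch.cuda.default_stream())

    with torch.cuda.stream(torch.cuda.Stream()):
        # Make the new stream wait for the event before launching the multiply.
        torch.cuda.current_stream().wait_event(ready)
        torch.mul(a, 5, out=a)
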
API Reference
-------------

.. autofunction:: enable_cuda_sanitizer
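
For example, instead of setting the ``TORCH_CUDA_SANITIZER=1`` environment variable, the
sanitizer can be enabled for the rest of a program's execution by calling this function
before any CUDA work is launched:

::

    import torch.cuda._sanitizer as csan

    # Equivalent in effect to launching the script with TORCH_CUDA_SANITIZER=1.
    csan.enable_cuda_sanitizer()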