PyTorch torch.cuda_AI使用导航教程

原文： PyTorch torch.cuda

该软件包增加了对 CUDA 张量类型的支持，该类型实现与 CPU 张量相同的功能，但是它们利用 GPU 进行计算。

它是延迟初始化的，因此您始终可以导入它，并使用 is_available() 确定您的系统是否支持 CUDA。

CUDA 语义具有有关使用 CUDA 的更多详细信息。

torch.cuda.current_blas_handle()¶

返回 cublasHandle_t 指向当前 cuBLAS 句柄的指针

torch.cuda.current_device()¶

返回当前所选设备的索引。

torch.cuda.current_stream(device=None)¶

返回给定设备的当前选择的 Stream 。

参数

设备 (torch设备 或 python：int ，可选 )–所选设备。如果 device 为None(默认值），则返回由 current_device() 给定的当前设备当前选择的 Stream 。

torch.cuda.default_stream(device=None)¶

返回给定设备的默认 Stream 。

Parameters

设备 (torch设备 或 python：int ，可选 )–所选设备。如果 device 为None(默认值），则返回由 current_device() 给定的当前设备的默认 Stream 。

class torch.cuda.device(device)¶

更改所选设备的上下文管理器。

Parameters

设备 (torch设备 或 python：int )–选择的设备索引。如果此参数为负整数或None，则为空。

torch.cuda.device_count()¶

返回可用的 GPU 数量。

class torch.cuda.device_of(obj)¶

将当前设备更改为给定对象的上下文管理器。

您可以将张量和存储都用作参数。如果未在 GPU 上分配给定对象，则为空操作。

Parameters

obj (tensor 或存储）–在所选设备上分配的对象。

torch.cuda.get_device_capability(device=None)¶

获取设备的 CUDA 功能。

Parameters

设备 (torch设备 或 python：int ，可选 )–要为其返回设备功能的设备。如果此参数为负整数，则此函数为空操作。如果 device 为None，则使用 current_device() 给定的当前设备。

退货

设备的主要和次要 CUDA 功能

返回类型

元组(int，int）

torch.cuda.get_device_name(device=None)¶

获取设备的名称。

Parameters

设备 (torch设备 或 python：int ，可选 )–要为其返回名称的设备。如果此参数为负整数，则此函数为空操作。如果 device 为None，则使用 current_device() 给定的当前设备。

torch.cuda.init()¶

初始化 PyTorch 的 CUDA 状态。如果您通过 PyTorch 的 C API 与 PyTorch 进行交互，则可能需要显式调用此方法，因为在进行初始化之前，CUDA 功能的 Python 绑定才可以。普通用户不需要此，因为所有 PyTorch 的 CUDA 方法都会自动按需初始化 CUDA 状态。

如果 CUDA 状态已经初始化，则不执行任何操作。

torch.cuda.ipc_collect()¶

CUDA IPC 释放后，Force 将收集 GPU 内存。

注意

检查是否可以从内存中清除任何已发送的 CUDA 张量。如果没有活动计数器，则强制关闭用于引用计数的共享内存文件。当生产者进程停止主动发送张量并希望释放未使用的内存时，此选项很有用。

torch.cuda.is_available()¶

返回一个布尔值，指示 CUDA 当前是否可用。

torch.cuda.is_initialized()¶

返回 PyTorch 的 CUDA 状态是否已初始化。

torch.cuda.set_device(device)¶

设置当前设备。

不推荐使用此功能，而推荐使用 device 。在大多数情况下，最好使用CUDA_VISIBLE_DEVICES环境变量。

Parameters

设备 (torch设备 或 python：int )–选定的设备。如果此参数为负，则此函数为空操作。

torch.cuda.stream(stream)¶

选择给定流的上下文管理器。

在其上下文中排队的所有 CUDA 内核都将排队在选定的流上。

Parameters

流 (流)–选择的流。如果经理是None，则为空手。

Note

流是按设备的。如果所选的流不在当前设备上，则此功能还将更改当前设备以匹配该流。

torch.cuda.synchronize(device=None)¶

等待 CUDA 设备上所有流中的所有内核完成。

Parameters

设备 (torch设备 或 python：int ，可选 )–要同步的设备。如果 device 为None，则使用 current_device() 给定的当前设备。

随机数发生器

torch.cuda.get_rng_state(device='cuda')¶

以 ByteTensor 的形式返回指定 GPU 的随机数生成器状态。

Parameters

设备 (torch设备 或 python：int ，可选 )–返回 RNG 状态的设备。默认值：'cuda'(即，当前 CUDA 设备torch.device('cuda')）。

警告

该函数会急切地初始化 CUDA。

torch.cuda.get_rng_state_all()¶

返回表示所有设备的随机数状态的 ByteTensor 元组。

torch.cuda.set_rng_state(new_state, device='cuda')¶

设置指定 GPU 的随机数生成器状态。

Parameters

new_state (torch.ByteTensor )–所需状态
设备 (torch设备 或 python：int ，可选 )–设置 RNG 状态的设备。默认值：'cuda'(即，当前 CUDA 设备torch.device('cuda')）。

torch.cuda.set_rng_state_all(new_states)¶

设置所有设备的随机数生成器状态。

Parameters

new_state (Torch.ByteTensor 的元组）–每个设备的所需状态

torch.cuda.manual_seed(seed)¶

设置种子以为当前 GPU 生成随机数。如果没有 CUDA，则可以安全地调用此函数；在这种情况下，它会被静默忽略。

Parameters

种子 (python：int )–所需的种子。

Warning

如果您使用的是多 GPU 模型，则此功能不足以获得确定性。要播种所有 GPU，请使用 manual_seed_all() 。

torch.cuda.manual_seed_all(seed)¶

设置用于在所有 GPU 上生成随机数的种子。如果没有 CUDA，则可以安全地调用此函数；在这种情况下，它会被静默忽略。

Parameters

seed (python:int) – The desired seed.

torch.cuda.seed()¶

将用于生成随机数的种子设置为当前 GPU 的随机数。如果没有 CUDA，则可以安全地调用此函数；在这种情况下，它会被静默忽略。

Warning

如果您使用的是多 GPU 模型，则此功能将仅在一个 GPU 上初始化种子。要初始化所有 GPU，请使用 seed_all() 。

torch.cuda.seed_all()¶

将在所有 GPU 上生成随机数的种子设置为随机数。如果没有 CUDA，则可以安全地调用此函数；在这种情况下，它会被静默忽略。

torch.cuda.initial_seed()¶

返回当前 GPU 的当前随机种子。

Warning

This function eagerly initializes CUDA.

传播集体

torch.cuda.comm.broadcast(tensor, devices)¶

向多个 GPU 广播张量。

Parameters

张量 (tensor)–张量要广播。
设备(可迭代）–可以在其中广播的设备的可迭代方式。请注意，它应该像(src，dst1，dst2，…），其第一个元素是要从中广播的源设备。

Returns

包含tensor副本的元组，放置在与devices的索引相对应的设备上。

torch.cuda.comm.broadcast_coalesced(tensors, devices, buffer_size=10485760)¶

将序列张量广播到指定的 GPU。首先将小张量合并到缓冲区中以减少同步次数。

Parameters

张量(序列）–要广播的张量。
devices (Iterable) – an iterable of devices among which to broadcast. Note that it should be like (src, dst1, dst2, …), the first element of which is the source device to broadcast from.
buffer_size (python：int )–用于合并的缓冲区的最大大小

Returns

A tuple containing copies of the tensor, placed on devices corresponding to indices from devices.

torch.cuda.comm.reduce_add(inputs, destination=None)¶

来自多个 GPU 的张量求和。

所有输入应具有匹配的形状。

Parameters

输入(可迭代 [ tensor ] )–可累加的张量。
目标 (python：int ，可选）–将放置输出的设备(默认值：当前设备）。

Returns

包含所有输入的元素和的张量，放置在destination设备上。

torch.cuda.comm.scatter(tensor, devices, chunk_sizes=None, dim=0, streams=None)¶

在多个 GPU 上分散张量。

Parameters

张量 (tensor)–张量散布。
设备(可迭代 [ python：int ] )–可迭代的 int，指定张量在哪个设备中应该分散。
chunk_sizes (可迭代 [ python：int ] ，可选））–每个设备上要放置的块的大小。它的长度应与devices相匹配，并且总和应等于tensor.size(dim)。如果未指定，则张量将分为相等的块。
暗淡的 (python：int ，可选）–沿张量分块的尺寸。

Returns

包含tensor块的元组，分布在给定的devices中。

torch.cuda.comm.gather(tensors, dim=0, destination=None)¶

收集来自多个 GPU 的张量。

与dim不同的所有维度中的张量大小必须匹配。

Parameters

张量(可迭代 [ tensor ] )–张量的可迭代集合。
暗淡的 (python：int )–将张量连接在一起的尺寸。
目标 (python：int ，可选）–输出设备(-1 表示 CPU，默认值：当前设备）

Returns

位于destination设备上的张量，这是tensors与dim并置的结果。

流和事件

class torch.cuda.Stream¶

CUDA 流周围的包装器。

CUDA 流是属于特定设备的线性执行序列，独立于其他流。有关详细信息，请参见 CUDA 语义。

Parameters

设备 (torch设备 或 python：int ，可选 )–在其上分配流的设备。如果 device 为None(默认值）或负整数，则将使用当前设备。
优先级 (python：int ，可选）–流的优先级。较低的数字表示较高的优先级。

query()¶

检查所有提交的工作是否已完成。

Returns

一个布尔值，指示该流中的所有内核是否已完成。

record_event(event=None)¶

记录事件。

Parameters

事件 (事件，可选）–记录事件。如果未给出，将分配一个新的。

Returns

记录的事件。

synchronize()¶

等待此流中的所有内核完成。

Note

这是对cudaStreamSynchronize()的包装：有关更多信息，请参见 CUDA 流文档。

wait_event(event)¶

使所有将来提交到流的工作都等待事件。

Parameters

事件 (事件)–等待的事件。

Note

这是对cudaStreamWaitEvent()的包装：有关更多信息，请参见 CUDA 流文档。

该函数无需等待event就返回：仅影响以后的操作。

wait_stream(stream)¶

与另一个流同步。

提交给该流的所有将来的工作将等到调用完成时提交给给定流的所有内核。

Parameters

流 (流)–要同步的流。

Note

该函数返回而无需等待 stream 中当前排队的内核：仅影响将来的操作。

class torch.cuda.Event¶

CUDA 事件的包装器。

CUDA 事件是同步标记，可用于监视设备的进度，准确测量时序并同步 CUDA 流。

当第一次记录该事件或将其导出到另一个进程时，基础 CUDA 事件将被延迟初始化。创建后，只有同一设备上的流才能记录该事件。但是，任何设备上的流都可以等待事件。

Parameters

enable_timing (bool ，可选）–指示事件是否应该测量时间(默认值：False）
阻止 (bool ，可选）–如果True， wait() 将被阻止(默认：False）
进程间 (bool )–如果True，则事件可以在进程之间共享(默认值：False）

elapsed_time(end_event)¶

返回记录事件之后到记录 end_event 之前经过的时间(以毫秒为单位）。

classmethod from_ipc_handle(device, handle)¶

从给定设备上的 IPC 句柄重构事件。

ipc_handle()¶

返回此事件的 IPC 句柄。如果尚未录制，则该事件将使用当前设备。

query()¶

检查事件当前捕获的所有工作是否已完成。

Returns

一个布尔值，指示当前由事件捕获的所有工作是否已完成。

record(stream=None)¶

在给定的流中记录事件。

如果未指定流，则使用torch.cuda.current_stream()。流的设备必须与活动的设备匹配。

synchronize()¶

等待事件完成。

等待直到此事件中当前捕获的所有工作完成。这样可以防止 CPU 线程在事件完成之前继续执行。

Note

这是cudaEventSynchronize()的包装：有关更多信息，请参见 CUDA 事件文档。

wait(stream=None)¶

使所有将来提交给定流的工作都等待此事件。

如果未指定流，则使用torch.cuda.current_stream()。

内存管理

torch.cuda.empty_cache()¶

释放当前由缓存分配器保留的所有未占用的缓存内存，以便这些内存可在其他 GPU 应用程序中使用，并在 <cite>nvidia-smi</cite> 中可见。

Note

empty_cache() 不会增加 PyTorch 可用的 GPU 内存量。但是，在某些情况下，它可能有助于减少 GPU 内存的碎片。有关 GPU 内存管理的更多详细信息，请参见内存管理。

torch.cuda.memory_stats(device=None)¶

返回给定设备的 CUDA 内存分配器统计信息的字典。

此函数的返回值是统计字典，每个字典都是非负整数。

核心统计数据：

"allocated.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：内存分配器接收到的分配请求数。
"allocated_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：分配的内存量。
"segment.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：来自cudaMalloc()的保留段数。
"reserved_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：保留的内存量。
"active.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：活动存储块的数量。
"active_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：活动内存量。
"inactive_split.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：非活动，不可释放的存储块的数量。
"inactive_split_bytes.{all,large_pool,small_pool}.{current,peak,allocated,freed}"：非活动，不可释放的内存量。

对于这些核心统计信息，值细分如下。

泳池类型：

all：所有内存池的组合统计信息。
large_pool：大型分配池的统计信息(截至 2019 年 10 月，>大小= 1MB 分配）。
small_pool：小型分配池的统计信息(截至 2019 年 10 月，<大小为 1MB 分配）。

指标类型：

current：此度量标准的当前值。
peak：此指标的最大值。
allocated：此指标的历史总数增长。
freed：此指标的历史总数下降。

除了核心统计信息之外，我们还提供了一些简单的事件计数器：

"num_alloc_retries"：导致高速缓存刷新并重试的cudaMalloc调用失败的次数。
"num_ooms"：抛出的内存不足错误数。

Parameters

设备 (torch设备 或 python：int ，可选 )–所选设备。如果 device 为None(默认值），则返回由 current_device() 给定的当前设备的统计信息。

Note

有关 GPU 内存管理的更多详细信息，请参见内存管理。

torch.cuda.memory_summary(device=None, abbreviated=False)¶

返回给定设备的当前内存分配器统计信息的可读记录。

这对于在训练期间或处理内存不足异常时定期显示很有用。

Parameters

设备 (torch设备 或 python：int ，可选 )–所选设备。如果 device 为None(默认值），则返回由 current_device() 给定的当前设备的打印输出。
缩写为 (bool ，可选）–是否返回缩写摘要(默认值：False）。

Note

See Memory management for more details about GPU memory management.

torch.cuda.memory_snapshot()¶

返回所有设备上 CUDA 内存分配器状态的快照。

解释此函数的输出需要熟悉内存分配器内部。

Note

See Memory management for more details about GPU memory management.

torch.cuda.memory_allocated(device=None)¶

返回给定设备的张量占用的当前 GPU 内存(以字节为单位）。

Parameters

设备 (torch设备 或 python：int ，可选 )–所选设备。如果 device 为None(默认值），则返回由 current_device() 给定的当前设备的统计信息。

Note

这可能少于 <cite>nvidia-smi</cite> 中显示的数量，因为某些未使用的内存可以由缓存分配器保存，并且某些上下文需要在 GPU 上创建。有关 GPU 内存管理的更多详细信息，请参见内存管理。

torch.cuda.max_memory_allocated(device=None)¶

返回给定设备的张量占用的最大 GPU 内存(以字节为单位）。

默认情况下，这将返回自此程序开始以来的峰值分配内存。 reset_peak_stats()可用于重置跟踪该指标的起点。例如，这两个功能可以测量训练循环中每个迭代的峰值分配内存使用量。

Parameters

device (torch.device or python:int__, optional) – selected device. Returns statistic for the current device, given by current_device(), if device is None (default).

Note

See Memory management for more details about GPU memory management.

torch.cuda.reset_max_memory_allocated(device=None)¶

重置用于跟踪给定设备的张量占用的最大 GPU 内存的起点。

有关详细信息，请参见 max_memory_allocated() 。

Parameters

device (torch.device or python:int__, optional) – selected device. Returns statistic for the current device, given by current_device(), if device is None (default).

Warning

现在，此函数调用reset_peak_memory_stats()，它将重置/ all /峰值内存状态。

Note

See Memory management for more details about GPU memory management.

torch.cuda.memory_reserved(device=None)¶

返回给定设备由缓存分配器管理的当前 GPU 内存，以字节为单位。

Parameters

device (torch.device or python:int__, optional) – selected device. Returns statistic for the current device, given by current_device(), if device is None (default).

Note

See Memory management for more details about GPU memory management.

torch.cuda.max_memory_reserved(device=None)¶

返回给定设备的缓存分配器管理的最大 GPU 内存(以字节为单位）。

默认情况下，这将返回自此程序开始以来的峰值缓存内存。 reset_peak_stats()可用于重置跟踪该指标的起点。例如，这两个功能可以测量训练循环中每次迭代的峰值缓存内存量。

Parameters

device (torch.device or python:int__, optional) – selected device. Returns statistic for the current device, given by current_device(), if device is None (default).

Note

See Memory management for more details about GPU memory management.

torch.cuda.memory_cached(device=None)¶

不推荐使用；参见 memory_reserved() 。

torch.cuda.max_memory_cached(device=None)¶

不推荐使用；参见 max_memory_reserved() 。

torch.cuda.reset_max_memory_cached(device=None)¶

重置跟踪由给定设备的缓存分配器管理的最大 GPU 内存的起点。

Parameters

device (torch.device or python:int__, optional) – selected device. Returns statistic for the current device, given by current_device(), if device is None (default).

Warning

This function now calls reset_peak_memory_stats(), which resets /all/ peak memory stats.

Note

See Memory management for more details about GPU memory management.

NVIDIA 工具扩展(NVTX）

torch.cuda.nvtx.mark(msg)¶

描述在某个时刻发生的瞬时事件。

Parameters

msg (字符串）–与事件关联的 ASCII 消息。

torch.cuda.nvtx.range_push(msg)¶

将范围推入嵌套范围跨度的堆栈中。返回从零开始的范围的深度。

Parameters

msg (字符串）–与范围关联的 ASCII 消息

torch.cuda.nvtx.range_pop()¶

从嵌套范围跨度堆栈中弹出范围。返回结束范围的从零开始的深度。