Performance Optimization
Anonymous contributor
Published Feb 5, 2025
In PyTorch, the `torch.cuda` module is used to set up and run CUDA operations. One of its classes is `Stream`, which represents a linear sequence of CUDA operations that execute in order on the GPU, asynchronously with respect to the host. Running independent work on separate streams allows it to overlap.
Syntax
```pseudo
torch.cuda.Stream(device=None, priority=0)
```
- `device` (optional): The device on which to allocate the stream. If this parameter is `None` (the default) or a negative integer, the current device is used.
- `priority` (optional): The priority of the stream, given as a negative, zero, or positive integer. The lower the number, the higher the priority; the default is `0`. If the value falls outside the allowed priority range, it is automatically mapped to the nearest valid priority (highest for large negative numbers, lowest for large positive numbers).
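As a minimal sketch of the clamping behavior described above (the `make_priority_stream` helper is invented for illustration and is not part of PyTorch), streams with different priorities can be created like this; on a CPU-only machine there is no stream to construct, so the helper returns `None`:

```python
import torch

# Hypothetical helper (not a PyTorch API): build a stream with the
# requested priority, or return None when no CUDA device is present.
def make_priority_stream(priority=0):
    if not torch.cuda.is_available():
        return None
    # Out-of-range values are clamped by PyTorch to the nearest
    # supported priority, so extreme inputs do not raise.
    return torch.cuda.Stream(priority=priority)

high = make_priority_stream(priority=-1)  # higher priority
low = make_priority_stream(priority=1)    # lower (or default) priority
```

On machines with a CUDA device, both calls return `torch.cuda.Stream` objects that can be passed to `torch.cuda.stream()` as shown in the example below.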
Example
The following example demonstrates a heavy calculation with and without `Stream()`:
```py
import torch
import time

# Verify GPU selection else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Heavy calculation function
def heavy_computation(tensor):
    return tensor**2 + (tensor**3) * (tensor.sin() * tensor.cos()) + tensor.tan()

# Sample size (reduced to prevent OOM issues)
size = 10**7  # Instead of 10**9
data = torch.randn(size, device=device, dtype=torch.float16)  # Reduced dtype to float16

# Create 2 streams
stream1 = torch.cuda.Stream(device=device)
stream2 = torch.cuda.Stream(device=device)

# Synchronize all kernels before time tracking
torch.cuda.synchronize()

# Start time tracking for stream computation
start_time = time.time()

# Asynchronous execution with streams
with torch.cuda.stream(stream1):
    result1 = heavy_computation(data[:size // 2])
with torch.cuda.stream(stream2):
    result2 = heavy_computation(data[size // 2:])

# Synchronize all kernels before measuring time
torch.cuda.synchronize()
end_time = time.time()
print(f"Time taken with streams: {end_time - start_time:.3f} seconds")

# Sequential computation
torch.cuda.synchronize()
start_time = time.time()
result1_seq = heavy_computation(data[:size // 2])
result2_seq = heavy_computation(data[size // 2:])
torch.cuda.synchronize()
end_time = time.time()
print(f"Time taken without streams: {end_time - start_time:.3f} seconds")
```
See the `time.time()` documentation for more on how elapsed time is measured here.
Note: The benefit of streams depends on the complexity of the operations being launched. If the computation is simple, the overhead of managing streams may make execution slower instead.
The following will be the output for the sample `size = 10**7`:
```shell
Time taken with streams: 0.035 seconds
Time taken without streams: 0.003 seconds
```
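For GPU work specifically, CUDA events are often more accurate than `time.time()`, since they are recorded on the device rather than the host. The sketch below shows the pattern; the `time_on_gpu` helper is an assumption for illustration, not a PyTorch API, and it returns `None` when no GPU is available:

```python
import torch

# Hypothetical helper: time a callable with CUDA events,
# returning elapsed milliseconds, or None on a CPU-only machine.
def time_on_gpu(fn, *args):
    if not torch.cuda.is_available():
        return None  # CUDA events require a GPU
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()          # mark the start on the current stream
    result = fn(*args)
    end.record()            # mark the end on the current stream
    torch.cuda.synchronize()  # wait so elapsed_time() is valid
    return start.elapsed_time(end)  # milliseconds
```

This avoids the host-side synchronization races that make wall-clock timing of asynchronous kernels misleading.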