A while ago, for work reasons, I revisited several of the important function call stacks inside torch.compile. This post shares them for anyone interested in understanding the behavior of torch.compile through its source code.

This post uses the official PyTorch docker image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel, i.e. PyTorch 2.4.0.

The example code used throughout this post:
```python
import torch
import torch.nn as nn
from collections import OrderedDict


# Allow the batch_linear post-grad fusion pass to run without fbgemm.
@torch._inductor.config.patch(
    post_grad_fusion_options={"batch_linear_post_grad": {"require_fbgemm": False}}
)
def test():
    n, h = 32, 128
    repeats = 3
    layers = OrderedDict()
    for i in range(repeats):
        layers[f"fc_{i}"] = nn.Linear(h, h)
        layers[f"ln_{i}"] = nn.LayerNorm(h)
        layers[f"silu_{i}"] = nn.SiLU()
    model = nn.Sequential(layers).cuda().half()

    x = torch.randn((n, h), device="cuda", dtype=torch.float16, requires_grad=True)
    dy = torch.randn_like(x)

    # reduce-overhead mode enables CUDA Graphs (see the CUDA Graph section).
    compiled = torch.compile(model, mode="reduce-overhead")
    for _ in range(4):
        y = compiled(x)
        y.backward(dy)


if __name__ == "__main__":
    test()
```
## TorchDynamo
Tracing the function with TorchDynamo (a custom-backend sketch follows the stack below):
- [P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
- [P021] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P020] > torch/_dynamo/output_graph.py#L957:compile_subgraph [New]
- [P019] > torch/_dynamo/symbolic_convert.py#L2613:_return [New]
- [P018] > torch/_dynamo/symbolic_convert.py#L2641:RETURN_VALUE [New]
- [P017] > torch/_dynamo/symbolic_convert.py#L777:step
- [P016] > torch/_dynamo/symbolic_convert.py#L889:run [New]
- [P015] > torch/_dynamo/symbolic_convert.py#L2450:run [New]
- [P014] > torch/_dynamo/convert_frame.py#L559:transform [New]
- [P013] > torch/_dynamo/convert_frame.py#L157:_fn [New]
- [P012] > torch/_dynamo/bytecode_transformation.py#L1177:transform_code_object [New]
- [P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
- [P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
- [P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
- [P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
- [P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
- [P006] > torch/_utils_internal.py#L80:wrapper_function [New]
- [P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
- [P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
- [P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
- [P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
- [P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
- [P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
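To see what Dynamo hands off at the compile_and_call_fx_graph boundary in this stack, you can pass a custom backend to torch.compile. A minimal sketch (my own illustration, not code from the PyTorch tree; `inspect_backend` is a hypothetical name):

```python
import torch

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # gm is the FX graph that Dynamo captured from Python bytecode.
    gm.graph.print_tabular()
    return gm.forward  # fall back to eager execution of the captured graph

fn = torch.compile(torch.nn.Linear(4, 4), backend=inspect_backend)
fn(torch.randn(2, 4))  # triggers tracing, then calls inspect_backend
```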
Before AOTAutograd runs, pre_grad_passes rewrites operators in the captured graph (a toy pass in the same spirit follows the stack):
- [P032] > torch/_inductor/fx_passes/pre_grad.py#L108:pre_grad_passes [New]
- [P031] > torch/_inductor/compile_fx.py#L243:_recursive_pre_grad_passes [New]
- [P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
- [P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
- [P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P026] > torch/__init__.py#L1948:__call__ [New]
- [P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
- [P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
- [P023] > torch/_dynamo/utils.py#L219:time_wrapper
- [P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
- [P021] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P020] > torch/_dynamo/output_graph.py#L957:compile_subgraph [New]
- [P019] > torch/_dynamo/symbolic_convert.py#L2613:_return [New]
- [P018] > torch/_dynamo/symbolic_convert.py#L2641:RETURN_VALUE [New]
- [P017] > torch/_dynamo/symbolic_convert.py#L777:step
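pre_grad_passes is, at heart, a pipeline of FX graph rewrites over the Dynamo-captured module. As a toy illustration of the pattern (not Inductor's actual pass; `swap_add_for_sub` is hypothetical):

```python
import operator
import torch
import torch.fx as fx

def swap_add_for_sub(gm: fx.GraphModule) -> fx.GraphModule:
    # Walk the graph and rewrite one call_function target into another,
    # the same node-surgery style that pre_grad_passes uses.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is operator.add:
            node.target = operator.sub
    gm.recompile()
    return gm

gm = fx.symbolic_trace(lambda a, b: a + b)
print(swap_add_for_sub(gm)(torch.ones(2), torch.ones(2)))  # tensor([0., 0.])
```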
## AOTAutograd & PrimTorch
AOTAutograd traces out a joint graph covering both the forward and the backward pass:
- [P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
- [P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
- [P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
- [P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
- [P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P026] > torch/__init__.py#L1948:__call__ [New]
- [P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
- [P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
- [P023] > torch/_dynamo/utils.py#L219:time_wrapper
- [P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
functorch converts the function being compiled into a purely functional one (a functionalize sketch follows the stack):
- [P037] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L343:create_functionalized_fn [New]
- [P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
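The same functionalization is available through the public torch.func.functionalize API. A minimal sketch of the effect:

```python
import torch
from torch.func import functionalize
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    y = x.clone()
    y.add_(1)  # in-place mutation
    return y

# Tracing the functionalized version yields a graph with no mutation:
# add_ shows up as an out-of-place aten.add in the printed code.
gm = make_fx(functionalize(f))(torch.randn(3))
print(gm.code)
```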
PrimTorch decomposes operators (a decomposition sketch follows the stack):
- [P076] > torch/_refs/__init__.py#L3148:native_layer_norm
- [P075] > torch/_prims_common/wrappers.py#L256:_fn
- [P074] > torch/_decomp/__init__.py#L78:_fn
- [P073] > torch/fx/experimental/proxy_tensor.py#L1438:maybe_handle_decomp
- [P072] > torch/fx/experimental/proxy_tensor.py#L306:proxy_call
- [P071] > torch/fx/experimental/proxy_tensor.py#L783:inner_torch_dispatch
- [P070] > torch/fx/experimental/proxy_tensor.py#L752:__torch_dispatch__
- [P069] > torch/utils/_stats.py#L16:wrapper
- [P068] > torch/_subclasses/functional_tensor.py#L316:__torch_dispatch__
- [P067] > torch/nn/functional.py#L2561:layer_norm
- [P066] > torch/fx/experimental/proxy_tensor.py#L701:__torch_function__
- [P065] > torch/overrides.py#L1583:handle_torch_function [New]
- [P064] > torch/nn/functional.py#L2561:layer_norm
- [P063] > torch/nn/modules/normalization.py#L201:forward
- [P062] > torch/nn/modules/module.py#L1555:_call_impl
- [P061] > torch/nn/modules/module.py#L1549:_wrapped_call_impl
- [P060] > torch/fx/_symbolic_trace.py#L792:forward
- [P059] > torch/fx/experimental/proxy_tensor.py#L566:call_module
- [P058] > torch/fx/_symbolic_trace.py#L790:module_call_wrapper
- [P057] > torch/fx/interpreter.py#L299:call_module
- [P056] > torch/fx/interpreter.py#L185:run_node
- [P055] > torch/fx/experimental/symbolic_shapes.py#L5455:run_node
- [P054] > torch/fx/interpreter.py#L107:run
- [P053] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L733:functional_call
- [P052] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L103:inner_fn [New]
- [P051] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L187:inner_fn [New]
- [P050] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L252:inner_fn_with_anomaly [New]
- [P049] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L351:_functionalized_f_helper [New]
- [P048] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L598:joint_helper [New]
- [P047] > torch/fx/experimental/proxy_tensor.py#L652:wrapped [New]
- [P046] > torch/fx/_symbolic_trace.py#L673:flatten_fn [New]
- [P045] > torch/fx/_symbolic_trace.py#L686:trace [New]
- [P044] > torch/_dynamo/eval_frame.py#L596:_fn
- [P043] > torch/fx/experimental/proxy_tensor.py#L636:dispatch_trace [New]
- [P042] > torch/_dynamo/eval_frame.py#L596:_fn [New]
- [P041] > torch/_compile.py#L21:inner [New]
- [P040] > torch/fx/experimental/proxy_tensor.py#L1299:_trace_inner [New]
- [P039] > torch/fx/experimental/proxy_tensor.py#L1365:trace [New]
- [P038] > torch/fx/experimental/proxy_tensor.py#L1419:wrapped [New]
- [P037] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L40:_create_graph [New]
- [P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
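This decomposition step can be reproduced outside the compiler: get_decompositions pulls the registered decomposition for native_layer_norm, and make_fx applies it while tracing, through the same maybe_handle_decomp frame seen above. A sketch (shapes chosen to match the example model):

```python
import torch
from torch._decomp import get_decompositions
from torch.fx.experimental.proxy_tensor import make_fx

decomp = get_decompositions([torch.ops.aten.native_layer_norm.default])

def f(x, w, b):
    return torch.nn.functional.layer_norm(x, (128,), w, b)

gm = make_fx(f, decomposition_table=decomp)(
    torch.randn(32, 128), torch.randn(128), torch.randn(128)
)
print(gm.code)  # layer_norm expanded into mean/var/mul/add primitives
```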
AOTAutograd partitions the joint graph into a forward graph and a backward graph (a partitioner sketch follows the stack):
- [P037] > torch/_functorch/partitioners.py#L1638:min_cut_rematerialization_partition [New]
- [P036] > torch/_inductor/compile_fx.py#L1436:partition_fn [New]
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
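The partitioner can also be driven directly via the private aot_function API. A sketch that prints the two graphs produced by min_cut_rematerialization_partition (`print_graph` is a hypothetical helper):

```python
import torch
from torch._functorch.aot_autograd import aot_function, make_boxed_func
from torch._functorch.partitioners import min_cut_rematerialization_partition

def print_graph(name):
    def compiler(gm, example_inputs):
        print(f"=== {name} graph ===")
        gm.graph.print_tabular()
        return make_boxed_func(gm.forward)
    return compiler

f = aot_function(
    lambda x: torch.nn.functional.silu(x).sum(),
    fw_compiler=print_graph("forward"),
    bw_compiler=print_graph("backward"),
    partition_fn=min_cut_rematerialization_partition,
)
f(torch.randn(8, requires_grad=True)).backward()
```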
## TorchInductor
post_grad_passes runs to optimize the graph; the post_grad_fusion_options patch in the example code enables the batch_linear_post_grad pass exercised below:
- [P048] > torch/_inductor/fx_passes/post_grad.py#L68:post_grad_passes [New]
- [P047] > torch/_inductor/compile_fx.py#L259:_recursive_post_grad_passes [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
- [P045] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]
- [P043] > torch/_dynamo/utils.py#L219:time_wrapper
- [P042] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P041] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P040] > torch/_inductor/debug.py#L301:inner [New]
- [P039] > torch/_dynamo/repro/after_aot.py#L67:debug_wrapper [New]
- [P038] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P037] > torch/_inductor/compile_fx.py#L1344:fw_compiler_base [New]
- [P036] > torch/_dynamo/utils.py#L219:time_wrapper
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
- [P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
- [P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
- [P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
- [P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P026] > torch/__init__.py#L1948:__call__ [New]
- [P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
- [P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
- [P023] > torch/_dynamo/utils.py#L219:time_wrapper
- [P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
get_fusion_candidates searches the input nodes in BFS order, checking whether operators can be fused (a conceptual sketch of the batched result follows the stack):
- [P053] > torch/_inductor/fx_passes/group_batch_fusion.py#L181:_addmm_node_can_be_fused [New]
- [P052] > torch/_inductor/fx_passes/group_batch_fusion.py#L195:match
- [P051] > torch/_inductor/fx_passes/group_batch_fusion.py#L1137:get_fusion_candidates
- [P050] > torch/_inductor/fx_passes/group_batch_fusion.py#L1180:apply_group_batch_fusion
- [P049] > torch/_inductor/fx_passes/group_batch_fusion.py#L1218:group_batch_fusion_passes
- [P048] > torch/_inductor/fx_passes/post_grad.py#L68:post_grad_passes [New]
- [P047] > torch/_inductor/compile_fx.py#L259:_recursive_post_grad_passes [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
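Numerically, what batch_linear fusion buys is that several independent addmm nodes collapse into one batched baddbmm. A conceptual sketch of the equivalence, not Inductor's actual rewrite:

```python
import torch

n, h, repeats = 32, 128, 3
x = torch.randn(repeats, n, h)   # one input per branch
w = torch.randn(repeats, h, h)   # one weight per branch
b = torch.randn(repeats, 1, h)   # one bias per branch

# Three separate linear layers...
separate = torch.stack([b[i] + x[i] @ w[i] for i in range(repeats)])
# ...versus the single batched kernel the fusion would emit.
batched = torch.baddbmm(b, x, w)
print(torch.allclose(separate, batched, atol=1e-5))  # True
```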
Graph lowering converts the FX graph into TorchInductor's IR (a note on dumping this IR follows the stack):
- [P050] > torch/_inductor/graph.py#L1171:run_node [New]
- [P049] > torch/fx/interpreter.py#L107:run
- [P048] > torch/_inductor/graph.py#L728:run [New]
- [P047] > torch/_dynamo/utils.py#L219:time_wrapper
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
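To inspect the IR this lowering produces, the documented TORCH_COMPILE_DEBUG switch dumps it to disk; a usage note:

```python
# Must be set before torch is imported. When the compiled function runs,
# artifacts such as fx_graph_readable.py, the pre-/post-fusion Inductor IR,
# and the generated output_code.py land under ./torch_compile_debug/.
import os
os.environ["TORCH_COMPILE_DEBUG"] = "1"
import torch  # noqa: E402
```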
Analyzing the dependencies between nodes in the graph:
- [P053] > torch/_inductor/scheduler.py#L1498:compute_dependencies [New]
- [P052] > torch/_inductor/scheduler.py#L1350:__init__ [New]
- [P051] > torch/_dynamo/utils.py#L219:time_wrapper
- [P050] > torch/_inductor/graph.py#L1629:codegen [New]
- [P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
- [P048] > torch/_dynamo/utils.py#L219:time_wrapper
- [P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
TorchInductor performs vertical & horizontal fusion in fuse_nodes (vertical fusion merges producers with their consumers; horizontal fusion merges independent nodes that can share a kernel). It first gathers candidate pairs via get_possible_fusions, the prerequisite being that the two nodes touch a common buffer, and then checks each pair with can_fuse.
- [P058] > torch/_inductor/codegen/cuda_combined_scheduling.py#L42:can_fuse_horizontal [New]
- [P057] > torch/_inductor/scheduler.py#L2262:can_fuse
- [P056] > torch/_inductor/scheduler.py#L2097:check_all_pairs
- [P055] > torch/_inductor/scheduler.py#L2090:get_possible_fusions [New]
- [P054] > torch/_inductor/scheduler.py#L2052:fuse_nodes_once [New]
- [P053] > torch/_inductor/scheduler.py#L1814:fuse_nodes [New]
- [P052] > torch/_inductor/scheduler.py#L1350:__init__ [New]
- [P051] > torch/_dynamo/utils.py#L219:time_wrapper
- [P050] > torch/_inductor/graph.py#L1629:codegen [New]
- [P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
- [P048] > torch/_dynamo/utils.py#L219:time_wrapper
- [P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
The backend generates calls to ATen (extern) kernels:
- [P055] > torch/_inductor/codegen/wrapper.py#L667:generate_extern_kernel_out [New]
- [P054] > torch/_inductor/ir.py#L4545:codegen [New]
- [P053] > torch/_inductor/scheduler.py#L2621:codegen_extern_call [New]
- [P052] > torch/_inductor/scheduler.py#L2684:codegen [New]
- [P051] > torch/_dynamo/utils.py#L219:time_wrapper
- [P050] > torch/_inductor/graph.py#L1629:codegen [New]
- [P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
- [P048] > torch/_dynamo/utils.py#L219:time_wrapper
- [P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
Compiling with Triton (a note on printing the generated kernels follows the stack):
- [P055] > torch/_inductor/codegen/simd.py#L1285:codegen_node_schedule [New]
- [P054] > torch/_inductor/codegen/simd.py#L1129:codegen_node [New]
- [P053] > torch/_inductor/codegen/cuda_combined_scheduling.py#L68:codegen_node [New]
- [P052] > torch/_inductor/scheduler.py#L2684:codegen [New]
- [P051] > torch/_dynamo/utils.py#L219:time_wrapper
- [P050] > torch/_inductor/graph.py#L1629:codegen [New]
- [P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
- [P048] > torch/_dynamo/utils.py#L219:time_wrapper
- [P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
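The generated Triton kernels can also be printed straight to the log via the artifact logging system; a usage note:

```python
# Equivalent to running with TORCH_LOGS=output_code: prints each compiled
# graph's generated code (the Triton kernels plus the Python wrapper).
import torch._logging
torch._logging.set_logs(output_code=True)
```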
Generating the wrapper code that calls the compiled subgraphs:
- [P052] > torch/_inductor/codegen/wrapper.py#L729:generate [New]
- [P051] > torch/_dynamo/utils.py#L219:time_wrapper
- [P050] > torch/_inductor/graph.py#L1629:codegen [New]
- [P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
- [P048] > torch/_dynamo/utils.py#L219:time_wrapper
- [P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
- [P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
## Guard
Generating guards for the compiled subgraph:
- [P015] > torch/fx/experimental/symbolic_shapes.py#L3606:produce_guards [New]
- [P014] > torch/_dynamo/guards.py#L1673:SHAPE_ENV [New]
- [P013] > torch/_guards.py#L258:create
- [P012] > torch/_dynamo/guards.py#L2076:__init__ [New]
- [P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
- [P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
- [P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
- [P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
- [P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
- [P006] > torch/_utils_internal.py#L80:wrapper_function [New]
- [P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
- [P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
- [P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
- [P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
- [P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
- [P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
Generating check_fn, which verifies at run time that a compiled subgraph is still usable and triggers recompilation otherwise (a demo follows the stack):
- [P013] > torch/_dynamo/guards.py#L2164:compile_check_fn [New]
- [P012] > torch/_dynamo/guards.py#L2076:__init__ [New]
- [P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
- [P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
- [P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
- [P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
- [P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
- [P006] > torch/_utils_internal.py#L80:wrapper_function [New]
- [P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
- [P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
- [P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
- [P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
- [P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
- [P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
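A small demo of the guards and check_fn in action, assuming dynamic=False so input shapes are guarded statically; running with TORCH_LOGS=recompiles (or TORCH_LOGS=guards) shows the guard failure:

```python
import torch

@torch.compile(dynamic=False)
def f(x):
    return x * 2

f(torch.randn(4))  # first call: compile, install guards for shape (4,)
f(torch.randn(4))  # check_fn passes -> cached compiled code is reused
f(torch.randn(8))  # shape guard fails -> recompile for shape (8,)
```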
## CUDA Graph
In reduce-overhead mode, torch.compile automatically applies CUDA Graphs to cut runtime overhead; this happens inside TorchInductor. It first checks whether the subgraph contains any operators incompatible with CUDA Graphs:
- [P046] > torch/_inductor/utils.py#L658:get_first_incompatible_cudagraph_node [New]
- [P045] > torch/_inductor/utils.py#L700:has_incompatible_cudagraph_ops [New]
- [P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]
- [P043] > torch/_dynamo/utils.py#L219:time_wrapper
- [P042] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P041] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P040] > torch/_inductor/debug.py#L301:inner [New]
- [P039] > torch/_dynamo/repro/after_aot.py#L67:debug_wrapper [New]
- [P038] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P037] > torch/_inductor/compile_fx.py#L1344:fw_compiler_base [New]
- [P036] > torch/_dynamo/utils.py#L219:time_wrapper
- [P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
- [P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
- [P033] > torch/_dynamo/utils.py#L219:time_wrapper
- [P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
- [P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
- [P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
- [P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
- [P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
Converting the subgraph to run as a CUDA Graph:
- [P046] > torch/_inductor/compile_fx.py#L949:cudagraphify [New]
- [P045] > torch/_dynamo/utils.py#L219:time_wrapper
- [P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]
The actual capture of the subgraph into a CUDA Graph is deferred to run time, as the deferred_cudagraphify frame below shows:
- [P020] > torch/cuda/graphs.py#L55:capture_begin [New]
- [P019] > torch/cuda/graphs.py#L170:__enter__ [New]
- [P018] > torch/_inductor/cudagraph_trees.py#L1745:__init__ [New]
- [P017] > torch/_inductor/cudagraph_trees.py#L269:get_tree_manager [New]
- [P016] > torch/_inductor/cudagraph_trees.py#L382:cudagraphify [New]
- [P015] > torch/_inductor/cudagraph_trees.py#L356:deferred_cudagraphify [New]
- [P014] > torch/_inductor/compile_fx.py#L988:run [New]
- [P013] > torch/_inductor/codecache.py#L1129:__call__ [New]
- [P012] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L437:wrapper [New]
- [P011] > torch/_functorch/_aot_autograd/utils.py#L110:call_func_at_runtime_with_args
- [P010] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L1431:forward [New]
- [P009] > torch/autograd/function.py#L558:apply
- [P008] > torch/_functorch/_aot_autograd/utils.py#L93:g [New]
- [P007] > torch/_functorch/_aot_autograd/utils.py#L110:call_func_at_runtime_with_args [New]
- [P006] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L185:runtime_wrapper [New]
- [P005] > torch/_functorch/aot_autograd.py#L983:forward [New]
- [P004] > torch/_dynamo/eval_frame.py#L596:_fn
- [P003] > torch/_dynamo/external_utils.py#L36:inner [New]
- [P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
- [P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
- [P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
Replaying the CUDA Graph at run time (a manual capture/replay sketch using the public API follows the stack):
- [P022] > torch/cuda/graphs.py#L85:replay
- [P021] > torch/_inductor/cudagraph_trees.py#L1112:run_graph [New]
- [P020] > torch/_inductor/cudagraph_trees.py#L998:run [New]
- [P019] > torch/_inductor/cudagraph_trees.py#L2023:execute_node [New]
- [P018] > torch/_inductor/cudagraph_trees.py#L1891:_run
- [P017] > torch/_inductor/cudagraph_trees.py#L1839:run
- [P016] > torch/_inductor/compile_fx.py#L942:run
- [P015] > torch/_inductor/cudagraph_trees.py#L356:deferred_cudagraphify
- [P014] > torch/_inductor/compile_fx.py#L988:run
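These stacks bottom out in the public torch.cuda.CUDAGraph API: the __enter__ of torch.cuda.graph calls capture_begin, and run_graph calls graph.replay(). A minimal manual capture/replay sketch following the pattern from the PyTorch docs:

```python
import torch

model = torch.nn.Linear(128, 128).cuda()
static_x = torch.randn(32, 128, device="cuda")

# Warm up on a side stream before capture, as CUDA Graphs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_x)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):        # __enter__ -> capture_begin
    static_y = model(static_x)

# Replay with new data: overwrite the static input buffer, then replay.
static_x.copy_(torch.randn(32, 128, device="cuda"))
g.replay()                       # re-launches the captured kernels
print(static_y.sum())
```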