Call Stacks of the Key Steps in torch.compile

September 1, 2024 4-minute read

Fei Kong

PyTorch • torch.compile • Source Code • TorchDynamo • AOTAutograd • TorchInductor

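I recently had to revisit several important call stacks inside torch.compile for work, and I am sharing them here for anyone who wants to understand torch.compile's behavior by reading its source code.

This post uses the official PyTorch docker image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel, which ships PyTorch 2.4.0.

The example code used in this post: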

import torch
import torch.nn as nn
from collections import OrderedDict

@torch._inductor.config.patch(
    post_grad_fusion_options={"batch_linear_post_grad": {"require_fbgemm": False}}
)
def test():
    n, h = 32, 128
    repeats = 3
    layers = OrderedDict()
    for i in range(repeats):
        layers[f"fc_{i}"] = nn.Linear(h, h)
        layers[f"ln_{i}"] = nn.LayerNorm(h)
        layers[f"silu_{i}"] = nn.SiLU()
    model = nn.Sequential(layers).cuda().half()
    x = torch.randn((n, h), device="cuda", dtype=torch.float16, requires_grad=True)
    dy = torch.randn_like(x)

    compiled = torch.compile(model, mode="reduce-overhead")
    for _ in range(4):
        y = compiled(x)
        y.backward(dy)

if __name__ == "__main__":
    test()

TorchDynamo

TorchDynamo traces the function:

[P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
[P021] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P020] > torch/_dynamo/output_graph.py#L957:compile_subgraph [New]
[P019] > torch/_dynamo/symbolic_convert.py#L2613:_return [New]
[P018] > torch/_dynamo/symbolic_convert.py#L2641:RETURN_VALUE [New]
[P017] > torch/_dynamo/symbolic_convert.py#L777:step
[P016] > torch/_dynamo/symbolic_convert.py#L889:run [New]
[P015] > torch/_dynamo/symbolic_convert.py#L2450:run [New]
[P014] > torch/_dynamo/convert_frame.py#L559:transform [New]
[P013] > torch/_dynamo/convert_frame.py#L157:_fn [New]
[P012] > torch/_dynamo/bytecode_transformation.py#L1177:transform_code_object [New]
[P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
[P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
[P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
[P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
[P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
[P006] > torch/_utils_internal.py#L80:wrapper_function [New]
[P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
[P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
[P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
[P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
[P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
[P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
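The step, RETURN_VALUE, _return and compile_subgraph frames above suggest that TorchDynamo walks the frame's Python bytecode one instruction at a time and compiles the captured subgraph once it reaches the return. The standard-library sketch below only lists the bytecode of a toy function to show the kind of instruction stream such an interpreter steps through. `toy_forward` and `opnames` are hypothetical names introduced for illustration; nothing here reproduces Dynamo's symbolic execution.

```python
import dis


def toy_forward(x, w, b):
    # Hypothetical stand-in for a model's forward; works on plain numbers.
    return x * w + b


def opnames(fn):
    """Return the opcode names of fn in the order they appear."""
    return [ins.opname for ins in dis.get_instructions(fn)]


ops = opnames(toy_forward)
for name in ops:
    print(name)
```

The exact opcode names depend on the Python version, but the stream for this function ends with a return instruction, which corresponds to the point where the stack above enters compile_subgraph.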

Before AOTAutograd runs, pre_grad_passes replaces operators in the captured graph:

[P032] > torch/_inductor/fx_passes/pre_grad.py#L108:pre_grad_passes [New]
[P031] > torch/_inductor/compile_fx.py#L243:_recursive_pre_grad_passes [New]
[P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
[P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
[P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P026] > torch/__init__.py#L1948:__call__ [New]
[P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
[P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
[P023] > torch/_dynamo/utils.py#L219:time_wrapper
[P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]
[P021] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P020] > torch/_dynamo/output_graph.py#L957:compile_subgraph [New]
[P019] > torch/_dynamo/symbolic_convert.py#L2613:_return [New]
[P018] > torch/_dynamo/symbolic_convert.py#L2641:RETURN_VALUE [New]
[P017] > torch/_dynamo/symbolic_convert.py#L777:step

AOTAutograd & PrimTorch

AOTAutograd traces the forward and backward passes into a joint graph:

[P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
[P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
[P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
[P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
[P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P026] > torch/__init__.py#L1948:__call__ [New]
[P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
[P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
[P023] > torch/_dynamo/utils.py#L219:time_wrapper
[P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]

functorch converts the function being compiled into one that follows functional-programming principles:

[P037] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L343:create_functionalized_fn [New]
[P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
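As a rough illustration of what "functional" means in this step, the sketch below contrasts an in-place function with a pure one and then wraps the pure one so that callers still observe the mutation. It uses plain Python lists and hypothetical function names; it is a toy analogy under those assumptions, not the logic of create_functionalized_fn.

```python
def scale_inplace(buf, factor):
    # Imperative version: mutates its input buffer.
    for i in range(len(buf)):
        buf[i] *= factor
    return buf


def scale_functional(buf, factor):
    # Functional version: no mutation, returns a fresh list.
    return [v * factor for v in buf]


def functionalized_call(buf, factor):
    # Run the pure version, then copy the result back into the input
    # so the caller sees the same side effect as the in-place version.
    new_buf = scale_functional(buf, factor)
    buf[:] = new_buf
    return buf


a = [1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0]
scale_inplace(a, 2.0)
functionalized_call(b, 2.0)
print(a, b)
```

Keeping the traced body free of mutation makes the graph easier to reorder and fuse, while the copy-back at the boundary preserves the original semantics.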

PrimTorch decomposes operators:

[P076] > torch/_refs/__init__.py#L3148:native_layer_norm
[P075] > torch/_prims_common/wrappers.py#L256:_fn
[P074] > torch/_decomp/__init__.py#L78:_fn
[P073] > torch/fx/experimental/proxy_tensor.py#L1438:maybe_handle_decomp
[P072] > torch/fx/experimental/proxy_tensor.py#L306:proxy_call
[P071] > torch/fx/experimental/proxy_tensor.py#L783:inner_torch_dispatch
[P070] > torch/fx/experimental/proxy_tensor.py#L752:__torch_dispatch__
[P069] > torch/utils/_stats.py#L16:wrapper
[P068] > torch/_subclasses/functional_tensor.py#L316:__torch_dispatch__
[P067] > torch/nn/functional.py#L2561:layer_norm
[P066] > torch/fx/experimental/proxy_tensor.py#L701:__torch_function__
[P065] > torch/overrides.py#L1583:handle_torch_function [New]
[P064] > torch/nn/functional.py#L2561:layer_norm
[P063] > torch/nn/modules/normalization.py#L201:forward
[P062] > torch/nn/modules/module.py#L1555:_call_impl
[P061] > torch/nn/modules/module.py#L1549:_wrapped_call_impl
[P060] > torch/fx/_symbolic_trace.py#L792:forward
[P059] > torch/fx/experimental/proxy_tensor.py#L566:call_module
[P058] > torch/fx/_symbolic_trace.py#L790:module_call_wrapper
[P057] > torch/fx/interpreter.py#L299:call_module
[P056] > torch/fx/interpreter.py#L185:run_node
[P055] > torch/fx/experimental/symbolic_shapes.py#L5455:run_node
[P054] > torch/fx/interpreter.py#L107:run
[P053] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L733:functional_call
[P052] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L103:inner_fn [New]
[P051] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L187:inner_fn [New]
[P050] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L252:inner_fn_with_anomaly [New]
[P049] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L351:_functionalized_f_helper [New]
[P048] > torch/_functorch/_aot_autograd/traced_function_transforms.py#L598:joint_helper [New]
[P047] > torch/fx/experimental/proxy_tensor.py#L652:wrapped [New]
[P046] > torch/fx/_symbolic_trace.py#L673:flatten_fn [New]
[P045] > torch/fx/_symbolic_trace.py#L686:trace [New]
[P044] > torch/_dynamo/eval_frame.py#L596:_fn
[P043] > torch/fx/experimental/proxy_tensor.py#L636:dispatch_trace [New]
[P042] > torch/_dynamo/eval_frame.py#L596:_fn [New]
[P041] > torch/_compile.py#L21:inner [New]
[P040] > torch/fx/experimental/proxy_tensor.py#L1299:_trace_inner [New]
[P039] > torch/fx/experimental/proxy_tensor.py#L1365:trace [New]
[P038] > torch/fx/experimental/proxy_tensor.py#L1419:wrapped [New]
[P037] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L40:_create_graph [New]
[P036] > torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py#L218:aot_dispatch_autograd_graph [New]
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
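The stack above ends in native_layer_norm, that is, layer_norm being rewritten in terms of simpler operations. The pure-Python sketch below spells out the arithmetic such a decomposition boils down to for a single 1-D row. `layer_norm_decomposed` is a hypothetical helper written for this post; it is not a copy of the code in torch/_refs.

```python
import math


def layer_norm_decomposed(x, weight, bias, eps=1e-5):
    """Layer norm over a 1-D list, written as a chain of simple steps."""
    n = len(x)
    mean = sum(x) / n                               # mean
    var = sum((v - mean) ** 2 for v in x) / n       # biased variance
    rstd = 1.0 / math.sqrt(var + eps)               # rsqrt(var + eps)
    normalized = [(v - mean) * rstd for v in x]     # sub, mul
    # Elementwise scale and shift.
    return [nv * w + b for nv, w, b in zip(normalized, weight, bias)]


x = [1.0, 2.0, 3.0, 4.0]
out = layer_norm_decomposed(x, weight=[1.0] * 4, bias=[0.0] * 4)
print(out)
```

Once an operator is expressed as reductions and pointwise steps like these, a backend only needs to handle the smaller set of primitive operations.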

AOTAutograd partitions the joint graph into a forward graph and a backward graph:

[P037] > torch/_functorch/partitioners.py#L1638:min_cut_rematerialization_partition [New]
[P036] > torch/_inductor/compile_fx.py#L1436:partition_fn [New]
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
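To make the partitioning step concrete, here is a deliberately naive sketch: everything the forward output depends on goes into the forward graph, the rest goes into the backward graph, and any forward value read by the backward graph must be saved. Note that this is a simple reachability split, not the min-cut rematerialization algorithm named in the stack; the graph encoding and the `partition` helper are assumptions made for illustration.

```python
def partition(joint, fwd_outputs):
    """joint: dict name -> list of input names, in topological order.

    Returns (forward_nodes, backward_nodes, saved_values).
    """
    # Forward = everything the forward outputs transitively depend on.
    forward, stack = set(), list(fwd_outputs)
    while stack:
        name = stack.pop()
        if name in forward:
            continue
        forward.add(name)
        stack.extend(joint[name])
    fwd = [n for n in joint if n in forward]
    bwd = [n for n in joint if n not in forward]
    # Saved values = forward results that some backward node reads.
    saved = sorted({i for n in bwd for i in joint[n] if i in forward})
    return fwd, bwd, saved


joint = {
    "x": [], "w": [], "grad_y": [],
    "h": ["x", "w"],             # h = x @ w
    "y": ["h"],                  # y = silu(h)
    "grad_h": ["grad_y", "h"],   # backward of silu reads h
    "grad_w": ["grad_h", "x"],   # backward of matmul reads x
}
fwd, bwd, saved = partition(joint, fwd_outputs=["y"])
print(fwd, bwd, saved)
```

A rematerializing partitioner can additionally choose to recompute cheap forward values in the backward graph instead of saving them, which this sketch does not attempt.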

TorchInductor

post_grad_passes runs to optimize the graph:

[P048] > torch/_inductor/fx_passes/post_grad.py#L68:post_grad_passes [New]
[P047] > torch/_inductor/compile_fx.py#L259:_recursive_post_grad_passes [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
[P045] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]
[P043] > torch/_dynamo/utils.py#L219:time_wrapper
[P042] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P041] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P040] > torch/_inductor/debug.py#L301:inner [New]
[P039] > torch/_dynamo/repro/after_aot.py#L67:debug_wrapper [New]
[P038] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P037] > torch/_inductor/compile_fx.py#L1344:fw_compiler_base [New]
[P036] > torch/_dynamo/utils.py#L219:time_wrapper
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
[P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
[P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
[P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]
[P027] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P026] > torch/__init__.py#L1948:__call__ [New]
[P025] > torch/_dynamo/repro/after_dynamo.py#L72:__call__ [New]
[P024] > torch/_dynamo/output_graph.py#L1361:call_user_compiler [New]
[P023] > torch/_dynamo/utils.py#L219:time_wrapper
[P022] > torch/_dynamo/output_graph.py#L1247:compile_and_call_fx_graph [New]

get_fusion_candidates walks the input nodes in BFS order and checks whether operators can be fused:

[P053] > torch/_inductor/fx_passes/group_batch_fusion.py#L181:_addmm_node_can_be_fused [New]
[P052] > torch/_inductor/fx_passes/group_batch_fusion.py#L195:match
[P051] > torch/_inductor/fx_passes/group_batch_fusion.py#L1137:get_fusion_candidates
[P050] > torch/_inductor/fx_passes/group_batch_fusion.py#L1180:apply_group_batch_fusion
[P049] > torch/_inductor/fx_passes/group_batch_fusion.py#L1218:group_batch_fusion_passes
[P048] > torch/_inductor/fx_passes/post_grad.py#L68:post_grad_passes [New]
[P047] > torch/_inductor/compile_fx.py#L259:_recursive_post_grad_passes [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
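The BFS described above can be sketched on a toy graph as follows. The graph encoding, the `match` callback that returns a grouping key, and the depth limit are all assumptions made for this illustration; the sketch is loosely modeled on the idea of the search and is not the code in group_batch_fusion.py.

```python
from collections import defaultdict, deque


def get_candidates(root, inputs_of, match, max_depth=3):
    """BFS from root over input edges; group matching nodes by key."""
    groups = defaultdict(list)
    first = list(dict.fromkeys(inputs_of[root]))  # dedupe, keep order
    visited = set(first)
    queue = deque((1, n) for n in first)
    while queue:
        depth, node = queue.popleft()
        key = match(node)
        if key is not None:
            groups[key].append(node)   # candidate found: stop at this node
        elif depth < max_depth:
            for nxt in inputs_of[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append((depth + 1, nxt))
    return dict(groups)


inputs_of = {
    "cat": ["addmm_a", "addmm_b", "relu_c"],
    "relu_c": ["addmm_c"],
    "addmm_a": ["x"], "addmm_b": ["x"], "addmm_c": ["x"],
    "x": [],
}
shapes = {"addmm_a": (32, 128), "addmm_b": (32, 128), "addmm_c": (32, 64)}


def match(node):
    # Toy rule: addmm nodes with the same output shape share a key.
    return ("addmm", shapes[node]) if node.startswith("addmm") else None


candidates = get_candidates("cat", inputs_of, match)
print(candidates)
```

Nodes that end up under the same key are the ones a batch fusion could consider merging; a group with a single member has nothing to fuse with.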

Graph lowering converts the graph into TorchInductor's IR:

[P050] > torch/_inductor/graph.py#L1171:run_node [New]
[P049] > torch/fx/interpreter.py#L107:run
[P048] > torch/_inductor/graph.py#L728:run [New]
[P047] > torch/_dynamo/utils.py#L219:time_wrapper
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]

The scheduler analyzes the dependencies in the graph:

[P053] > torch/_inductor/scheduler.py#L1498:compute_dependencies [New]
[P052] > torch/_inductor/scheduler.py#L1350:__init__ [New]
[P051] > torch/_dynamo/utils.py#L219:time_wrapper
[P050] > torch/_inductor/graph.py#L1629:codegen [New]
[P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
[P048] > torch/_dynamo/utils.py#L219:time_wrapper
[P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
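One simple way to picture dependency analysis is to track, for every buffer, which node wrote it last, and make each reader depend on that writer. The sketch below does exactly that for read-after-write dependencies only. The node tuples and the `compute_dependencies` helper are hypothetical and far simpler than what the scheduler actually tracks.

```python
def compute_dependencies(nodes):
    """nodes: list of (name, reads, writes) in program order.

    Returns a dict: name -> set of earlier node names it depends on.
    """
    last_writer = {}
    deps = {}
    for name, reads, writes in nodes:
        # A node depends on whichever node last wrote each buffer it reads.
        deps[name] = {last_writer[b] for b in reads if b in last_writer}
        for b in writes:
            last_writer[b] = name
    return deps


nodes = [
    ("mm0", ["arg_x", "arg_w0"], ["buf0"]),
    ("ln0", ["buf0"], ["buf1"]),
    ("silu0", ["buf1"], ["buf2"]),
    ("mm1", ["buf2", "arg_w1"], ["buf3"]),
]
deps = compute_dependencies(nodes)
print(deps)
```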

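TorchInductor performs vertical and horizontal fusion in fuse_nodes. It first calls get_possible_fusions to collect candidate operator pairs, where the precondition is that the operators use a common buffer, and then calls can_fuse to check whether each pair can actually be fused.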

[P058] > torch/_inductor/codegen/cuda_combined_scheduling.py#L42:can_fuse_horizontal [New]
[P057] > torch/_inductor/scheduler.py#L2262:can_fuse
[P056] > torch/_inductor/scheduler.py#L2097:check_all_pairs
[P055] > torch/_inductor/scheduler.py#L2090:get_possible_fusions [New]
[P054] > torch/_inductor/scheduler.py#L2052:fuse_nodes_once [New]
[P053] > torch/_inductor/scheduler.py#L1814:fuse_nodes [New]
[P052] > torch/_inductor/scheduler.py#L1350:__init__ [New]
[P051] > torch/_dynamo/utils.py#L219:time_wrapper
[P050] > torch/_inductor/graph.py#L1629:codegen [New]
[P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
[P048] > torch/_dynamo/utils.py#L219:time_wrapper
[P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
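The two-stage idea (group nodes by the buffers they touch, then filter pairs with a fusibility check) can be sketched as below. The data layout, the node names, and the toy `can_fuse` rule are assumptions for illustration only; the real scheduler applies many more conditions.

```python
from collections import defaultdict
from itertools import combinations


def get_possible_fusions(nodes, can_fuse):
    """nodes: dict name -> set of buffer names the node reads or writes."""
    # Stage 1: only nodes that share a buffer are considered at all.
    by_buffer = defaultdict(list)
    for name, buffers in nodes.items():
        for buf in buffers:
            by_buffer[buf].append(name)
    # Stage 2: check each pair within a group.
    pairs = set()
    for group in by_buffer.values():
        for a, b in combinations(group, 2):
            if can_fuse(a, b):
                pairs.add((a, b))
    return sorted(pairs)


nodes = {
    "ln0": {"buf0", "buf1"},
    "silu0": {"buf1", "buf2"},
    "extern_mm1": {"buf2", "buf3"},
    "ln1": {"buf3", "buf4"},
}


def can_fuse(a, b):
    # Toy rule (an assumption): extern kernels are never fused.
    return not (a.startswith("extern") or b.startswith("extern"))


fusions = get_possible_fusions(nodes, can_fuse)
print(fusions)
```

Grouping by buffer first avoids checking every pair of nodes in the graph, since nodes with no buffer in common would gain nothing from being fused.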

The backend generates ATen calls:

[P055] > torch/_inductor/codegen/wrapper.py#L667:generate_extern_kernel_out [New]
[P054] > torch/_inductor/ir.py#L4545:codegen [New]
[P053] > torch/_inductor/scheduler.py#L2621:codegen_extern_call [New]
[P052] > torch/_inductor/scheduler.py#L2684:codegen [New]
[P051] > torch/_dynamo/utils.py#L219:time_wrapper
[P050] > torch/_inductor/graph.py#L1629:codegen [New]
[P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
[P048] > torch/_dynamo/utils.py#L219:time_wrapper
[P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]

Compiling with Triton:

[P055] > torch/_inductor/codegen/simd.py#L1285:codegen_node_schedule [New]
[P054] > torch/_inductor/codegen/simd.py#L1129:codegen_node [New]
[P053] > torch/_inductor/codegen/cuda_combined_scheduling.py#L68:codegen_node [New]
[P052] > torch/_inductor/scheduler.py#L2684:codegen [New]
[P051] > torch/_dynamo/utils.py#L219:time_wrapper
[P050] > torch/_inductor/graph.py#L1629:codegen [New]
[P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
[P048] > torch/_dynamo/utils.py#L219:time_wrapper
[P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]

Generate the wrapper code that calls the compiled subgraphs:

[P052] > torch/_inductor/codegen/wrapper.py#L729:generate [New]
[P051] > torch/_dynamo/utils.py#L219:time_wrapper
[P050] > torch/_inductor/graph.py#L1629:codegen [New]
[P049] > torch/_inductor/graph.py#L1673:compile_to_module [New]
[P048] > torch/_dynamo/utils.py#L219:time_wrapper
[P047] > torch/_inductor/graph.py#L1722:compile_to_fn [New]
[P046] > torch/_inductor/compile_fx.py#L678:fx_codegen_and_compile [New]
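Wrapper generation amounts to emitting source code for a function that invokes the generated kernels in schedule order. The sketch below builds such a function as a string and executes it. The schedule format, the `kernels` dictionary, and the buffer names are hypothetical choices for this toy; the wrapper that TorchInductor emits is considerably richer.

```python
def generate_wrapper(schedule):
    """schedule: list of (out_name, kernel_name, input_names)."""
    lines = ["def call(args, kernels):", "    arg0 = args[0]"]
    for out, kernel, inputs in schedule:
        arg_list = ", ".join(inputs)
        lines.append(f"    {out} = kernels[{kernel!r}]({arg_list})")
    lines.append(f"    return {schedule[-1][0]}")
    return "\n".join(lines)


# Stand-ins for compiled kernels.
kernels = {
    "double": lambda x: [v * 2 for v in x],
    "add_one": lambda x: [v + 1 for v in x],
}
source = generate_wrapper(
    [("buf0", "double", ["arg0"]), ("buf1", "add_one", ["buf0"])]
)
print(source)

namespace = {}
exec(source, namespace)                       # materialize the wrapper
result = namespace["call"]([[1, 2, 3]], kernels)
print(result)
```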

Guard

Generate guards for the compiled subgraph:

[P015] > torch/fx/experimental/symbolic_shapes.py#L3606:produce_guards [New]
[P014] > torch/_dynamo/guards.py#L1673:SHAPE_ENV [New]
[P013] > torch/_guards.py#L258:create
[P012] > torch/_dynamo/guards.py#L2076:__init__ [New]
[P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
[P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
[P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
[P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
[P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
[P006] > torch/_utils_internal.py#L80:wrapper_function [New]
[P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
[P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
[P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
[P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
[P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
[P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]

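Generate check_fn, which verifies that a compiled subgraph is still valid for the current inputs; if it is not, the frame is recompiled: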

[P013] > torch/_dynamo/guards.py#L2164:compile_check_fn [New]
[P012] > torch/_dynamo/guards.py#L2076:__init__ [New]
[P011] > torch/_dynamo/convert_frame.py#L606:compile_inner [New]
[P010] > torch/_dynamo/utils.py#L219:time_wrapper [New]
[P009] > torch/_dynamo/convert_frame.py#L522:_compile [New]
[P008] > /opt/conda/lib/python3.11/contextlib.py#L78:inner [New]
[P007] > torch/_strobelight/compile_time_profiler.py#L124:profile_compile_time [New]
[P006] > torch/_utils_internal.py#L80:wrapper_function [New]
[P005] > torch/_dynamo/convert_frame.py#L383:__call__ [New]
[P004] > torch/_dynamo/convert_frame.py#L938:__call__ [New]
[P003] > torch/_dynamo/convert_frame.py#L1062:__call__ [New]
[P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
[P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
[P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]
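The guard-then-recompile behavior can be mimicked with a small cache: each compiled artifact is stored next to a predicate built from its guards, and a call falls back to compilation only when no predicate accepts the input. Everything below (the guard strings, `toy_compile`, the cache layout) is a hypothetical stand-in chosen for illustration, not Dynamo's guard machinery.

```python
def compile_check_fn(guards):
    """Build a single predicate from a list of guard expression strings."""
    source = "lambda x: " + " and ".join(f"({g})" for g in guards)
    return eval(source)


cache = []  # list of (check_fn, compiled_fn)


def toy_compile(x):
    # Hypothetical "compiler": specialize on the observed type and length.
    guards = [
        f"type(x).__name__ == {type(x).__name__!r}",
        f"len(x) == {len(x)}",
    ]
    n = len(x)
    compiled = lambda x: sum(x) / n   # mean, specialized on n
    cache.append((compile_check_fn(guards), compiled))
    return compiled


def run(x):
    for check_fn, compiled in cache:
        if check_fn(x):               # guards hold: reuse compiled code
            return compiled(x)
    return toy_compile(x)(x)          # guards failed: recompile


run([1.0, 2.0])        # first call compiles
run([3.0, 5.0])        # same type and length: cache hit
run([1.0, 2.0, 3.0])   # length guard fails: compiles again
print(len(cache))
```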

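CUDA Graph

In reduce-overhead mode, torch.compile automatically uses CUDA Graphs to reduce runtime overhead; this happens inside TorchInductor. It first checks whether the subgraph contains operators that are incompatible with CUDA Graphs: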

[P046] > torch/_inductor/utils.py#L658:get_first_incompatible_cudagraph_node [New]
[P045] > torch/_inductor/utils.py#L700:has_incompatible_cudagraph_ops [New]
[P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]
[P043] > torch/_dynamo/utils.py#L219:time_wrapper
[P042] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P041] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P040] > torch/_inductor/debug.py#L301:inner [New]
[P039] > torch/_dynamo/repro/after_aot.py#L67:debug_wrapper [New]
[P038] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P037] > torch/_inductor/compile_fx.py#L1344:fw_compiler_base [New]
[P036] > torch/_dynamo/utils.py#L219:time_wrapper
[P035] > torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py#L221:aot_dispatch_autograd [New]
[P034] > torch/_functorch/aot_autograd.py#L420:create_aot_dispatcher_function [New]
[P033] > torch/_dynamo/utils.py#L219:time_wrapper
[P032] > torch/_functorch/aot_autograd.py#L856:aot_module_simplified [New]
[P031] > torch/_dynamo/backends/common.py#L22:__call__ [New]
[P030] > torch/_inductor/compile_fx.py#L1250:compile_fx
[P029] > /opt/conda/lib/python3.11/contextlib.py#L78:inner
[P028] > torch/_inductor/compile_fx.py#L1250:compile_fx [New]

Convert the subgraph into a CUDA Graph:

[P046] > torch/_inductor/compile_fx.py#L949:cudagraphify [New]
[P045] > torch/_dynamo/utils.py#L219:time_wrapper
[P044] > torch/_inductor/compile_fx.py#L419:compile_fx_inner [New]

The subgraph is captured as a CUDA Graph here:

[P020] > torch/cuda/graphs.py#L55:capture_begin [New]
[P019] > torch/cuda/graphs.py#L170:__enter__ [New]
[P018] > torch/_inductor/cudagraph_trees.py#L1745:__init__ [New]
[P017] > torch/_inductor/cudagraph_trees.py#L269:get_tree_manager [New]
[P016] > torch/_inductor/cudagraph_trees.py#L382:cudagraphify [New]
[P015] > torch/_inductor/cudagraph_trees.py#L356:deferred_cudagraphify [New]
[P014] > torch/_inductor/compile_fx.py#L988:run [New]
[P013] > torch/_inductor/codecache.py#L1129:__call__ [New]
[P012] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L437:wrapper [New]
[P011] > torch/_functorch/_aot_autograd/utils.py#L110:call_func_at_runtime_with_args
[P010] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L1431:forward [New]
[P009] > torch/autograd/function.py#L558:apply
[P008] > torch/_functorch/_aot_autograd/utils.py#L93:g [New]
[P007] > torch/_functorch/_aot_autograd/utils.py#L110:call_func_at_runtime_with_args [New]
[P006] > torch/_functorch/_aot_autograd/runtime_wrappers.py#L185:runtime_wrapper [New]
[P005] > torch/_functorch/aot_autograd.py#L983:forward [New]
[P004] > torch/_dynamo/eval_frame.py#L596:_fn
[P003] > torch/_dynamo/external_utils.py#L36:inner [New]
[P002] > torch/_dynamo/eval_frame.py#L399:_fn [New]
[P001] > torch/nn/modules/module.py#L1555:_call_impl [New]
[P000] > torch/nn/modules/module.py#L1549:_wrapped_call_impl [New]

Replay the CUDA Graph at runtime:

[P022] > torch/cuda/graphs.py#L85:replay
[P021] > torch/_inductor/cudagraph_trees.py#L1112:run_graph [New]
[P020] > torch/_inductor/cudagraph_trees.py#L998:run [New]
[P019] > torch/_inductor/cudagraph_trees.py#L2023:execute_node [New]
[P018] > torch/_inductor/cudagraph_trees.py#L1891:_run
[P017] > torch/_inductor/cudagraph_trees.py#L1839:run
[P016] > torch/_inductor/compile_fx.py#L942:run
[P015] > torch/_inductor/cudagraph_trees.py#L356:deferred_cudagraphify
[P014] > torch/_inductor/compile_fx.py#L988:run
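The capture-once, replay-many pattern can be illustrated without a GPU: record the sequence of operations together with the fixed buffers they read and write, then on each call copy the new input into the fixed input buffer and re-run the recording. `ToyGraph` and the buffer names below are hypothetical; this is a CPU analogy of the idea, not how torch.cuda.CUDAGraph is implemented.

```python
class ToyGraph:
    """Records operations on static buffers once, then replays them."""

    def __init__(self):
        self.ops = []

    def capture(self, fn, static_in, static_out):
        # Record the op together with the buffers it reads and writes.
        self.ops.append((fn, static_in, static_out))

    def replay(self):
        # Re-run the recorded ops on the same buffers, in order.
        for fn, src, dst in self.ops:
            dst[:] = fn(src)


static_x = [0.0, 0.0]
static_h = [0.0, 0.0]
static_y = [0.0, 0.0]

graph = ToyGraph()
graph.capture(lambda v: [e * 2 for e in v], static_x, static_h)
graph.capture(lambda v: [e + 1 for e in v], static_h, static_y)


def run(x):
    static_x[:] = x       # copy the new input into the captured buffer
    graph.replay()        # re-launch the recorded work
    return list(static_y)


print(run([1.0, 2.0]))
print(run([10.0, 20.0]))
```

Because the replayed work always targets the same buffers, inputs have to be copied into them before each replay, and outputs read from the static output buffer are overwritten by the next replay unless they are copied out.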