About
I’m a Senior Engineer at NVIDIA, chasing the Speed of Light (SOL): the theoretical peak performance that every GPU workload aspires to reach. There I’ve learned that true optimization isn’t about clever tricks; it’s about relentlessly measuring, understanding, and eliminating every wasted cycle until you’re as close to SOL as physics allows.
My research interests span high-performance computing, artificial intelligence, and computer architecture. I work on pushing state-of-the-art deep learning models to industry-leading performance across domains including speech recognition, machine translation, image classification & detection, and generative AI.
This blog is where I document my learnings: the insights, techniques, and hard-won lessons from the pursuit of peak efficiency. In my spare time, I build developer tools in Python, CUDA, and PyTorch to make both everyday workflows and deep learning research faster and more productive.
🚀 Deep Learning Models
Selected training optimizations I’ve contributed to:
- MLPerf Flux.1 (2025): MLPerf Training Benchmark Suite, round v5.1
- MLPerf Stable Diffusion (2023–2025): MLPerf Training Benchmark Suite, rounds v3.1–v5.0
- SE(3)-Transformer (2022): DGLPyTorch/DrugDiscovery/SE3Transformer
- EfficientNet & EfficientDet (2020–2021): TensorFlow2/Classification/ConvNets, PyTorch/Detection/Efficientdet
- MLPerf GNMT (2018–2020): MLPerf Training Benchmark Suite, rounds v0.5–v0.7
🔧 Open Source Contributions
Key deep learning building blocks I’ve developed:
- Focal Loss (2021): apex/contrib/focal_loss (see the sketch after this list)
- Distributed Fused Adam (2019): DistributedFusedAdam
- Softmax Cross Entropy & Label Smoothing (2019): apex/contrib/xentropy
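The apex versions linked above are fused CUDA kernels tuned for throughput, but the math behind them is compact. As a rough illustration, here is a minimal, unfused PyTorch sketch of the binary focal loss from Lin et al. (2017); the function name, shapes, and defaults are my own for the example and are not the apex API:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    # targets are float labels in {0.0, 1.0} with the same shape as logits.
    # Unreduced cross entropy, so each element can be reweighted below.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class for each element.
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha_t balances positives vs. negatives; (1 - p_t)**gamma down-weights
    # easy, well-classified examples so training focuses on the hard ones.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Example: scores for four anchors, one of them positive.
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
print(focal_loss(logits, labels))
```

A fused kernel computes the same quantity in a single pass over the logits instead of materializing the intermediate tensors above, which is where the real speedup comes from.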
💬 Contact
Feel free to reach out via Zhihu or LinkedIn, or to leave a comment on any post.