Xia Li

About Me

I will join the School of Biomedical Engineering, Shanghai Jiao Tong University, as a tenure-track associate professor in spring 2026. I am currently a postdoctoral researcher at the Institute for Biomedical Engineering, ETH Zurich, Switzerland, working with Prof. Marco Stampanoni. I obtained my Ph.D. degree from the Department of Computer Science, ETH Zurich, Switzerland, in Dec. 2024, under the supervision of Prof. Joachim M. Buhmann, Prof. Antony J. Lomax and Dr. Ye Zhang. Before that, I received my master’s degree from Peking University, China, in Jul. 2017, advised by Prof. Hong Liu and Prof. Zhouchen Lin. I received my bachelor’s degree from the School of Computer Science, Beijing University of Posts and Telecommunications, China, in Jul. 2016. I was an intern at ByteDance and Google. I serve as an Area Chair for ICML and as a reviewer for top-tier journals and conferences in machine learning, computer vision and medical physics.

I am serving as an Area Chair for ICML 2026

From ICML: We invite you to serve as an area chair (AC) for ICML 2026 based on your expertise in important sub-areas of machine learning and your prior experience in the peer review process. We sincerely hope you will accept this invitation and help us make ICML 2026 a success.

Nov 20, 2025

3 papers accepted to phiRO in 2025.

They are: (1) Contour-informed inter-patient deformable registration for more reliable voxel-based analysis of Head-and-Neck Cancer patients; (2) Gaussian primitives for deformable image registration; (3) A proof-of-concept study of direct magnetic resonance imaging-based proton dose calculation for brain tumors via neural networks with Monte Carlo-comparable accuracy.

Nov 18, 2025

2 papers accepted to Medical Physics in 2025.

They are: (1) Continuous sPatial-Temporal Deformable Image Registration and 4D Frame Interpolation; (2) Diffusion Schrödinger bridge models for high-quality MR-to-CT synthesis for proton treatment planning.

Nov 18, 2025

Enhancing Brain MRI Super-Resolution Through Multi-Slice Aware Matching and Fusion

A Multi-Slice Aware Matching and Fusion (MSAMF) network for brain MRI super-resolution that utilizes multi-slice reference images through multi-scale matching and fusion mechanisms to generate high-quality super-resolution images.

Oct 1, 2025

Gaussian Representation for Deformable Image Registration

This study presented an optimization-based DIR approach that employed an explicit Gaussian representation to achieve efficient DVF estimation, strong generalization, and high interpretability.

Jul 9, 2025

A proof-of-concept study of direct magnetic resonance imaging-based proton dose calculation for brain tumors via neural networks with Monte Carlo-comparable accuracy

This study demonstrated the feasibility of MC-quality proton dose calculations directly from MR images for brain tumor patients, achieving comparable accuracy with faster computation and simplified implementation.

Jul 5, 2025

Exploring the effect of training set size and number of categories on ice crystal classification through a contrastive semi-supervised learning algorithm

A contrastive semi-supervised learning (CSSL) algorithm for ice crystal classification that reduces manual labeling effort by 90% (154 hours saved) while maintaining high accuracy, achieving 89.6% accuracy with only 25% of labeled data.

Jun 27, 2025

Uncertainty-Aware Testing-Time Optimization for 3D Human Pose Estimation

In this paper, we propose an Uncertainty-Aware testing-time Optimization (UAO) framework for 3D human pose estimation. For training, we propose the GUMLP to estimate 3D results and uncertainty values for each joint. For test-time optimization, our UAO framework freezes the pre-trained network parameters and optimizes a latent state initialized by the input 2D pose. To constrain the optimization direction in both 2D and 3D spaces, projection and uncertainty constraints are applied. Extensive experiments show that our approach achieves state-of-the-art performance on two popular datasets.

Jun 15, 2025

MiLNet: Multiplex Interactive Learning Network for RGB-T Semantic Segmentation

A novel module-free Multiplex Interactive Learning Network (MiLNet) for RGB-T semantic segmentation that integrates multi-model, multi-modal, and multi-level feature learning through asymmetric simulated learning and inverse hierarchical fusion strategies.

Mar 3, 2025

1 paper accepted to TIP in 2025.

It is: MiLNet: Multiplex Interactive Learning Network for RGB-T Semantic Segmentation

Feb 9, 2025

Diffusion Schrödinger bridge models for high-quality MR-to-CT synthesis for proton treatment planning

A diffusion Schrödinger bridge model for high-quality MR-to-CT synthesis for proton treatment planning.

Jan 1, 2025

A Prior Causality‐Guided Multi‐View Diffusion Network for Brain Disorder Classification

A Prior Causality-Guided Multi-View Diffusion Network (PCMDN) for brain disorder classification that leverages prior causality knowledge and multi-view diffusion processes to improve classification accuracy.

Jan 1, 2025

HYRE: Hybrid Regressor for 3D Human Pose and Shape Estimation

A novel Hybrid Regressor (HYRE) that combines parametric and non-parametric paradigms for 3D human pose and shape estimation, bridging the gap between physically plausible and pixel-aligned results through joint learning.

Dec 25, 2024

IceDetectNet: a rotated object detection algorithm for classifying components of aggregated ice crystals with a multi-label classification scheme

A rotated object detection algorithm (IceDetectNet) with a multi-label classification scheme for classifying components of aggregated ice crystals, achieving 92% accuracy for aggregate/non-aggregate detection and 86% for basic habit classification.

Dec 19, 2024

High Performance Computing Framework for Variable Selection on Genome-wide Association Studies

A high-performance computing framework for variable selection in GWAS that integrates state-of-the-art methods and employs novel optimization strategies for efficient processing of high-dimensional data with sparse characteristics.

Dec 3, 2024

UVMap-ID: A Controllable and Personalized UV Map Generative Model

A controllable and personalized UV Map generative model that fine-tunes a pre-trained text-to-image diffusion model with a face fusion module for ID-driven customized generation, addressing the challenges of personalized texture map generation and quality evaluation.

Oct 28, 2024

Generating Synthetic Computed Tomography for Radiotherapy: SynthRAD2023 Challenge Report

SynthRAD2023 challenge report comparing synthetic CT generation methods for radiotherapy using multi-center data, evaluating both image similarity and dose-based metrics for MRI-to-CT and CBCT-to-CT tasks.

Oct 1, 2024

A Unified Generation-Registration Framework for Improved MR-based CT Synthesis in Proton Therapy

This study conclusively demonstrates that a holistic MR-based CT synthesis approach, integrating both image-to-image translation and deformable registration, significantly improves the precision and quality of sCT generation, particularly for the challenging body area with varied anatomic changes between corresponding MR and CT.

Aug 13, 2024

ICCR 2024 Rising Star Competition

Rising Star Competition by WIMP

Jul 10, 2024

Neural Graphics Primitives-based Deformable Image Registration for On-the-fly Motion Extraction

In this study, we have successfully integrated NGP into DIR, a novel contribution that significantly enhances the accuracy and efficiency of medical image alignment as demonstrated on the DIR-lab dataset. The NGPDIR framework exhibits robust performance across various metrics, particularly in landmark alignment precision and the accommodation of anatomical sliding boundaries. This advancement not only propels the DIR field forward but also opens new avenues for real-time clinical applications, potentially transforming patient care with its rapid, reliable imaging capabilities.

Jul 8, 2024

ModelNet-O: A large-scale synthetic dataset for occlusion-aware point cloud classification

A large-scale synthetic dataset ModelNet-O for occlusion-aware point cloud classification, featuring diverse occlusion patterns and complex object arrangements to evaluate model robustness under occlusion conditions.

Jun 19, 2024

Neural Clustering based Visual Representation Learning

A neural clustering framework (FEC) for visual representation learning that views feature extraction as selecting representatives from data, automatically capturing data distribution and providing interpretable cluster assignments.

Jun 17, 2024

VG4D: Vision-Language Model Goes 4D Video Recognition

A Vision-Language Model Goes 4D (VG4D) framework that transfers VLM knowledge to 4D point cloud networks for improved video recognition, achieving state-of-the-art performance on action recognition datasets.

May 13, 2024

ESTRO 2024 Proffered Paper

Proffered paper presentation for continuous deformable image registration

May 4, 2024

Continuous sPatial-Temporal Deformable Image Registration (CPT-DIR) for motion modelling in radiotherapy: beyond classic voxel-based methods

In summary, the innovative CPT-DIR approach, integrating principles of INR and LDDMM, represents a significant departure from traditional voxel-based methods in DIR. By adopting a paradigm of continuous motion modelling, we transcend the limitations inherent in voxel-based representations, offering a more robust, automatic and versatile solution. Leveraging spatial continuity, we effectively handle the intricacies of sliding organ boundaries, while temporal continuity alleviates the complexities associated with significant anatomical changes over time. The tangible benefits are evident in its superior performance compared to classic B-splines methods. CPT-DIR consistently achieves better performance across all evaluation metrics. Additionally, the efficiency gains are substantial, with registration times cut by more than half.

May 1, 2024

Towards robust referring image segmentation

We propose a novel ranking loss function, named Bi-directional Exponential Angular Triplet Loss, to help learn an angularly separable common feature space by explicitly constraining the included angles between embedding vectors.

Mar 5, 2024

Towards open vocabulary learning: A survey

This survey offers a detailed examination of the latest developments in open vocabulary learning in computer vision, which appears to be a first of its kind. We provide an overview of the necessary background knowledge, which includes fundamental concepts and introductory knowledge of detection, segmentation, and vision language pre-training. Following that, we summarize more than 50 different models used for various scene understanding tasks. For each task, we categorize the methods based on their technical viewpoint. Additionally, we provide information regarding several closely related domains. In the experiment section, we provide a detailed description of the settings and compare results fairly. Finally, we summarize several challenges and also point out several future research directions for open vocabulary learning.

Feb 5, 2024

Uncertainty-aware MR-based CT synthesis for robust proton therapy planning of brain tumour

The enhanced framework incorporates 3D uncertainty prediction and generates high-quality sCTs from MR images. The framework also facilitates conditioned robust optimisation, bolstering proton plan robustness against network prediction errors. The innovative feature of uncertainty visualisation and robust analyses contribute to evaluating sCT clinical utility for individual patients.

Feb 1, 2024

Audio–visual keyword transformer for unconstrained sentence‐level keyword spotting

An Audio–Visual Keyword Transformer (AVKT) network for keyword spotting in unconstrained video clips, using transformer classifier with learnable CLS tokens and decision fusion to achieve high accuracy in both clean and noisy conditions.

Feb 1, 2024

FedLPA: One-shot Federated Learning with Layer-Wise Posterior Aggregation

A novel one-shot federated learning method that performs layer-wise posterior aggregation, achieving superior performance by aggregating posterior distributions instead of point estimates.

Jan 1, 2024

Our abstract is selected as a Proffered Paper by ESTRO 2024!

Beyond Voxel-Based Methods: Continuous Motion Modeling for Enhanced Deformable Image Registration

Dec 18, 2023

Skeleton-in-context: Unified skeleton sequence modeling with in-context learning

In this work, we propose Skeleton-in-Context, designed to process multiple skeleton-based tasks simultaneously after a single training run. Specifically, we build a novel skeleton-based in-context benchmark covering various tasks. In particular, we propose skeleton prompts composed of TGP and TUP, which solve the overfitting problem of skeleton sequence data trained under the training framework commonly applied in previous 2D and 3D in-context models. Besides, we demonstrate that our model can generalize to different datasets and new tasks, such as motion completion. We hope our research is a first step in the exploration of in-context learning for skeleton-based sequences and paves the way for further research in this area.

Dec 15, 2023

Explore In-Context Learning for 3D Point Cloud Understanding

We propose Point-In-Context (PIC), the first framework adopting the in-context learning paradigm for 3D point cloud understanding. Specifically, we set up an extensive dataset of point cloud pairs with four fundamental tasks to achieve in-context ability. We propose effective designs that facilitate the training and solve the inherited information leakage problem. PIC shows its excellent learning capacity, achieves comparable results with single-task models, and outperforms multitask models on all four tasks. Besides, it shows good generalization ability to out-of-distribution samples and unseen tasks and has great potential via selecting higher-quality prompts. We hope it paves the way for further exploration of in-context learning in the 3D modalities.

Dec 10, 2023

Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video

This paper proposes the Pose and Mesh Co-Evolution network (PMCE), a new two-stage pose-to-mesh framework for recovering 3D human mesh from a monocular video. PMCE first estimates 3D human pose motion in terms of spatial and temporal domains, then performs image-guided pose and mesh interactions by our proposed AdaLN that injects body shape information while preserving their spatial structure. Extensive experiments on popular datasets show that PMCE outperforms state-of-the-art methods in both per-frame accuracy and temporal consistency. We hope that our approach will spark further research in 3D human motion estimation considering both pose and shape consistency.

Oct 2, 2023

Betrayed by captions: Joint caption grounding and generation for open vocabulary instance segmentation

This paper presents a joint Caption Grounding and Generation (CGG) framework for instance-level open vocabulary segmentation. The main contributions are: (1) using fine-grained object nouns in captions to improve grounding with object queries; (2) using captions as supervision signals to extract rich information from other words, which helps identify novel categories. To our knowledge, this paper is the first to unify segmentation and caption generation for open vocabulary learning. The proposed framework significantly improves OVIS and OSPS and achieves comparable results on OVOD without pre-training on large-scale datasets.

Oct 2, 2023

Interweaved Graph and Attention Network for 3D Human Pose Estimation

An Interweaved Graph and Attention Network (IGANet) for 3D human pose estimation that enables bidirectional communication between GCNs and attentions, capturing both global and local correlations in human skeleton representations.

Jun 4, 2023

Gator: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose

A Graph-Aware Transformer (GATOR) framework for 3D human mesh recovery from 2D pose, combining Graph-Aware Transformer encoder and Motion-Disentangled Regression decoder to capture joint-joint, joint-vertex, and vertex-vertex relations.

Jun 4, 2023

ESTRO 2023 Mini-Oral

Mini-oral presentation for uncertainty-conditioned MR-guided proton therapy

May 13, 2023

Our abstract is selected as a Mini-Oral presentation by ESTRO 2023!

Uncertainty-aware MR-based CT synthesis for robust proton planning of skull-based tumour

Dec 20, 2022

Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models

In this paper, we decompose the feed-forward DNN and find a more reasonable basic unit layer, which shows a close relationship with the proximal operator. Based on it, we propose new equilibrium models, OptEqs, and explore their underlying optimization problems thoroughly. We provide two strategies to introduce customized regularizations to the equilibrium points, and achieve significant performance improvement in experiments. We highlight that by modifying the underlying optimization problems, we can create more effective network architectures. Our work may inspire more interpretable equilibrium models from the optimization perspective.

Jun 10, 2022

PCLoss: Fashion landmark estimation with position constraint loss

In this paper, we design a Position Constraint Loss (PCLoss) for fashion landmark estimation, which incorporates the position correlation into landmark estimation models. Specifically, the PCLoss adds a regularization term for each landmark to regularize their relative positions. Compared with other alternatives, our PCLoss effectively mitigates the outlier and duplicate detection problems without modifying existing CNN architectures. In addition, our skeleton-like optimization method further strengthens the position constraints between landmarks. The proposed method can be applied to both regression and heatmap-based methods, and it provides a novel perspective on position relation learning in key point estimation tasks. Extensive experimental results on three challenging datasets, DeepFashion, FLD and FashionAI, demonstrate that our method outperforms other state-of-the-art methods. The experiment on COCO 2017 shows the potential applications of PCLoss for other key point estimation tasks, which can be explored further in future work.

Oct 1, 2021

Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation

We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation.

Sep 12, 2021

Towards efficient scene understanding via squeeze reasoning

Jul 30, 2021

Quasi-dense similarity learning for multiple object tracking

We present Quasi-Dense Similarity Learning, which densely samples hundreds of region proposals on a pair of images for contrastive learning.

Feb 20, 2021

PointFlow: Flowing Semantics Through Points for Aerial Image Segmentation

Feb 13, 2021

Bi-directional Exponential Angular Triplet Loss for RGB-Infrared Person Re-Identification

We propose a novel ranking loss function, named Bi-directional Exponential Angular Triplet Loss, to help learn an angularly separable common feature space by explicitly constraining the included angles between embedding vectors.

Dec 12, 2020

Is Attention Better Than Matrix Decomposition?

Self-attention is not better than the matrix decomposition (MD) model developed 20 years ago regarding the performance and computational cost for encoding the long-distance dependencies.

Sep 28, 2020

Improving Semantic Segmentation via Decoupled Body and Edge Supervision

Jul 3, 2020

Position Constraint Loss For Fashion Landmark Estimation

A Position Constraint Loss (PCLoss) method for fashion landmark estimation that constrains erroneous landmark locations by utilizing position relationships, applicable to both regression and heatmap-based methods without modifying network structure.

May 4, 2020

Spatial Pyramid Based Graph Reasoning for Semantic Segmentation

Feb 24, 2020