Computer Vision and Pattern Recognition 101
★ Whole-Body Conditioned Egocentric Video Prediction
We train models to Predict Ego-centric Video from human Actions (PEVA), given
the past video and an action represented by the relative 3D body pose. By
conditioning on kinematic pose trajectories, structured by the joint hierarchy
of the body, our model learns to simulate how physical human actions shape the
environment from a first-person point of view. We train an auto-regressive
conditional diffusion transformer on Nymeria, a large-scale dataset of
real-world egocentric video and body pose capture. We further design a
hierarchical evaluation protocol with increasingly challenging tasks, enabling
a comprehensive analysis of the model's embodied prediction and control
abilities. Our work represents an initial attempt to tackle the challenges of
modeling complex real-world environments and embodied agent behaviors with
video prediction from the perspective of a human.
comment: Project Page: https://dannytran123.github.io/PEVA
☆ SiM3D: Single-instance Multiview Multimodal and Multisetup 3D Anomaly Detection Benchmark
Alex Costanzino, Pierluigi Zama Ramirez, Luigi Lella, Matteo Ragaglia, Alessandro Oliva, Giuseppe Lisanti, Luigi Di Stefano
We propose SiM3D, the first benchmark considering the integration of
multiview and multimodal information for comprehensive 3D anomaly detection and
segmentation (ADS), where the task is to produce a voxel-based Anomaly Volume.
Moreover, SiM3D focuses on a scenario of high interest in manufacturing:
single-instance anomaly detection, where only one object, either real or
synthetic, is available for training. In this respect, SiM3D stands out as the
first ADS benchmark that addresses the challenge of generalising from synthetic
training data to real test data. SiM3D includes a novel multimodal multiview
dataset acquired using top-tier industrial sensors and robots. The dataset
features multiview high-resolution images (12 Mpx) and point clouds (7M points)
for 333 instances of eight types of objects, alongside a CAD model for each
type. We also provide manually annotated 3D segmentation GTs for anomalous test
samples. To establish reference baselines for the proposed multiview 3D ADS
task, we adapt prominent singleview methods and assess their performance using
novel metrics that operate on Anomaly Volumes.
☆ SAM4D: Segment Anything in Camera and LiDAR Streams ICCV2025
We present SAM4D, a multi-modal and temporal foundation model designed for
promptable segmentation across camera and LiDAR streams. Unified Multi-modal
Positional Encoding (UMPE) is introduced to align camera and LiDAR features in
a shared 3D space, enabling seamless cross-modal prompting and interaction.
Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA),
which leverages ego-motion compensation to enhance temporal consistency and
long-horizon feature retrieval, ensuring robust segmentation across dynamically
changing autonomous driving scenes. To avoid annotation bottlenecks, we develop
a multi-modal automated data engine that synergizes VFM-driven video masklets,
spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This
framework generates camera-LiDAR aligned pseudo-labels at a speed orders of
magnitude faster than human annotation while preserving VFM-derived semantic
fidelity in point cloud representations. We conduct extensive experiments on
the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal
segmentation ability of the proposed SAM4D and its great potential for data annotation.
comment: Accepted by ICCV2025, Project Page: https://SAM4D-Project.github.io
☆ HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Recent progress in vision-language segmentation has significantly advanced
grounded visual understanding. However, these models often exhibit
hallucinations by producing segmentation masks for objects not grounded in the
image content or by incorrectly labeling irrelevant regions. Existing
evaluation protocols for segmentation hallucination primarily focus on label or
textual hallucinations without manipulating the visual context, limiting their
capacity to diagnose critical failures. In response, we introduce
HalluSegBench, the first benchmark specifically designed to evaluate
hallucinations in visual grounding through the lens of counterfactual visual
reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual
instance pairs spanning 281 unique object classes, and a set of newly
introduced metrics that quantify hallucination sensitivity under visually
coherent scene edits. Experiments on HalluSegBench with state-of-the-art
vision-language segmentation models reveal that vision-driven hallucinations
are significantly more prevalent than label-driven ones, with models often
persisting in false segmentation, highlighting the need for counterfactual
reasoning to diagnose grounding fidelity.
comment: Project webpage: https://plan-lab.github.io/hallusegbench/
☆ DeOcc-1-to-3: 3D De-Occlusion from a Single Image via Self-Supervised Multi-View Diffusion
Reconstructing 3D objects from a single image is a long-standing challenge,
especially under real-world occlusions. While recent diffusion-based view
synthesis models can generate consistent novel views from a single RGB image,
they generally assume fully visible inputs and fail when parts of the object
are occluded. This leads to inconsistent views and degraded 3D reconstruction
quality. To overcome this limitation, we propose an end-to-end framework for
occlusion-aware multi-view generation. Our method directly synthesizes six
structurally consistent novel views from a single partially occluded image,
enabling downstream 3D reconstruction without requiring prior inpainting or
manual annotations. We construct a self-supervised training pipeline using the
Pix2Gestalt dataset, leveraging occluded-unoccluded image pairs and
pseudo-ground-truth views to teach the model structure-aware completion and
view consistency. Without modifying the original architecture, we fully
fine-tune the view synthesis model to jointly learn completion and multi-view
generation. Additionally, we introduce the first benchmark for occlusion-aware
reconstruction, encompassing diverse occlusion levels, object categories, and
mask patterns. This benchmark provides a standardized protocol for evaluating
future methods under partial occlusions. Our code is available at
https://github.com/Quyans/DeOcc123.
☆ StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning ICCV 2025
Recently, Mamba-based methods have demonstrated impressive performance in
point cloud representation learning by leveraging State Space Models (SSMs),
which offer efficient context modeling and linear complexity. However, these
methods still face two key issues that limit the potential of SSMs: destroying
the adjacency of 3D points during SSM processing and failing to retain
long-sequence memory as the input length increases in downstream tasks. To
address these issues, we propose StruMamba3D, a novel paradigm for
self-supervised point cloud representation learning. It enjoys several merits.
First, we design spatial states and use them as proxies to preserve spatial
dependencies among points. Second, we enhance the SSM with a state-wise update
strategy and incorporate a lightweight convolution to facilitate interactions
between spatial states for efficient structure modeling. Third, our method
reduces the sensitivity of pre-trained Mamba-based models to varying input
lengths by introducing a sequence length-adaptive strategy. Experimental
results across four downstream tasks showcase the superior performance of our
method. In addition, our method attains the SOTA 95.1% accuracy on ModelNet40
and 92.75% accuracy on the most challenging split of ScanObjectNN without
a voting strategy.
comment: Accepted by ICCV 2025
☆ Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval ACL 2025
Cross-modal image-text retrieval is challenging because of the diverse
possible associations between content from different modalities. Traditional
methods learn a single-vector embedding to represent semantics of each sample,
but struggle to capture nuanced and diverse relationships that can exist across
modalities. Set-based approaches, which represent each sample with multiple
embeddings, offer a promising alternative, as they can capture richer and more
diverse relationships. In this paper, we show that, despite their promise,
these set-based representations continue to face issues including sparse
supervision and set collapse, which limit their effectiveness. To address
these challenges, we propose Maximal Pair Assignment Similarity to optimize
one-to-one matching between embedding sets, which preserves semantic diversity
within each set. We also introduce two loss functions to further enhance the
representations: Global Discriminative Loss to enhance distinction among
embeddings, and Intra-Set Divergence Loss to prevent collapse within each set.
Our method achieves state-of-the-art performance on MS-COCO and Flickr30k
without relying on external data.
comment: Accepted at the 63rd Annual Meeting of the Association for
Computational Linguistics (ACL 2025 Main)
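The one-to-one matching idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of Maximal Pair Assignment Similarity, so the code below is only a generic optimal-assignment score under stated assumptions: the function names are hypothetical, similarity is taken to be cosine similarity, both sets are assumed to have the same size and contain no zero vectors, and the best pairing is found by brute force over permutations (practical only for small sets).

```python
from itertools import permutations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def one_to_one_set_similarity(set_a, set_b):
    """Best mean cosine similarity over all one-to-one pairings.

    Hypothetical helper: each embedding in set_a is matched to exactly one
    embedding in set_b, and the pairing with the highest total is kept.
    """
    if len(set_a) != len(set_b):
        raise ValueError("this sketch assumes equally sized sets")
    sims = [[cosine(a, b) for b in set_b] for a in set_a]
    k = len(set_a)
    best = max(
        sum(sims[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return best / k


if __name__ == "__main__":
    diverse = [[1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0]]
    target = [[0.0, 1.0], [1.0, 0.0]]
    print(one_to_one_set_similarity(diverse, target))
    print(one_to_one_set_similarity(collapsed, target))
```

Because every embedding can be used only once, a set whose embeddings have collapsed onto the same vector cannot match all elements of a diverse target set, which is one plausible reason a one-to-one score would discourage set collapse.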
☆ ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers ICCV
Research in quantum machine learning has recently proliferated due to the
potential of quantum computing to accelerate machine learning. An area of
machine learning that has not yet been explored is neural ordinary differential
equation (neural ODE) based residual neural networks (ResNets), which aim to
improve the effectiveness of neural networks using the principles of ordinary
differential equations. In this work, we present our insights about why analog
Rydberg atom quantum computers are especially well-suited for ResNets. We also
introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom
quantum computers to solve classification problems in machine learning using
analog quantum neural ODEs.
comment: ResQ will appear in the Proceedings of the IEEE International
Conference on Computer Vision (ICCV), 2025
☆ Exploring the Design Space of 3D MLLMs for CT Report Generation
Multimodal Large Language Models (MLLMs) have emerged as a promising way to
automate Radiology Report Generation (RRG). In this work, we systematically
investigate the design space of 3D MLLMs, including visual input
representation, projectors, Large Language Models (LLMs), and fine-tuning
techniques for 3D CT report generation. We also introduce two knowledge-based
report augmentation methods that improve performance on the GREEN score by up
to 10%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our
results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely
independent of the size of LLM under the same training protocol. We also show
that larger volume size does not always improve performance if the original ViT
was pre-trained on a smaller volume size. Lastly, we show that using a
segmentation mask along with the CT volume improves performance. The code is
publicly available at https://github.com/bowang-lab/AMOS-MM-Solution
☆ WAFT: Warping-Alone Field Transforms for Optical Flow
We introduce Warping-Alone Field Transforms (WAFT), a simple and effective
method for optical flow. WAFT is similar to RAFT but replaces cost volume with
high-resolution warping, achieving better accuracy with lower memory cost. This
design challenges the conventional wisdom that constructing cost volumes is
necessary for strong performance. WAFT is a simple and flexible
meta-architecture with minimal inductive biases and reliance on custom designs.
Compared with existing methods, WAFT ranks 1st on the Spring and KITTI benchmarks
and achieves the best zero-shot generalization on KITTI, while being up to 4.1x
faster than methods with similar performance. Code and model weights are
available at https://github.com/princeton-vl/WAFT.
☆ MADrive: Memory-Augmented Driving Scene Modeling
Polina Karpikova, Daniil Selikhanovych, Kirill Struminsky, Ruslan Musaev, Maria Golitsyna, Dmitry Baranchuk
Recent advances in scene reconstruction have pushed toward highly realistic
modeling of autonomous driving (AD) environments using 3D Gaussian splatting.
However, the resulting reconstructions remain closely tied to the original
observations and struggle to support photorealistic synthesis of significantly
altered or novel driving scenarios. This work introduces MADrive, a
memory-augmented reconstruction framework designed to extend the capabilities
of existing scene reconstruction methods by replacing observed vehicles with
visually similar 3D assets retrieved from a large-scale external memory bank.
Specifically, we release MAD-Cars, a curated dataset of ~70K 360°
car videos captured in the wild, and present a retrieval module that finds the
most similar car instances in the memory bank, reconstructs the corresponding
3D assets from video, and integrates them into the target scene through
orientation alignment and relighting. The resulting replacements provide
complete multi-view representations of vehicles in the scene, enabling
photorealistic synthesis of substantially altered configurations, as
demonstrated in our experiments. Project page:
https://yandex-research.github.io/madrive/
☆ G$^{2}$D: Boosting Multimodal Learning with Gradient-Guided Distillation ICCV 2025
Multimodal learning aims to leverage information from diverse data modalities
to achieve more comprehensive performance. However, conventional multimodal
models often suffer from modality imbalance, where one or a few modalities
dominate model optimization, leading to suboptimal feature representation and
underutilization of weak modalities. To address this challenge, we introduce
Gradient-Guided Distillation (G$^{2}$D), a knowledge distillation framework
that optimizes the multimodal model with a custom-built loss function that
fuses both unimodal and multimodal objectives. G$^{2}$D further incorporates a
dynamic sequential modality prioritization (SMP) technique in the learning
process to ensure each modality leads the learning process, avoiding the
pitfall of stronger modalities overshadowing weaker ones. We validate G$^{2}$D
on multiple real-world datasets and show that G$^{2}$D amplifies the
significance of weak modalities while training and outperforms state-of-the-art
methods in classification and regression tasks. Our code is available at
https://github.com/rAIson-Lab/G2D.
comment: Accepted at ICCV 2025
☆ GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation ICCV 2025
Wentao Hu, Shunkai Li, Ziqiao Peng, Haoxian Zhang, Fan Shi, Xiaoqiang Liu, Pengfei Wan, Di Zhang, Hui Tian
Creating high-quality, generalizable speech-driven 3D talking heads remains a
persistent challenge. Previous methods achieve satisfactory results for fixed
viewpoints and small-scale audio variations, but they struggle with large head
rotations and out-of-distribution (OOD) audio. Moreover, they are constrained
by the need for time-consuming, identity-specific training. We believe the core
issue lies in the lack of sufficient 3D priors, which limits the extrapolation
capabilities of synthesized talking heads. To address this, we propose
GGTalker, which synthesizes talking heads through a combination of
generalizable priors and identity-specific adaptation. We introduce a two-stage
Prior-Adaptation training strategy to learn Gaussian head priors and adapt to
individual characteristics. We train Audio-Expression and Expression-Visual
priors to capture the universal patterns of lip movements and the general
distribution of head textures. During the Customized Adaptation, individual
speaking styles and texture details are precisely modeled. Additionally, we
introduce a color MLP to generate fine-grained, motion-aligned textures and a
Body Inpainter to blend rendered results with the background, producing
indistinguishable, photorealistic video frames. Comprehensive experiments show
that GGTalker achieves state-of-the-art performance in rendering quality, 3D
consistency, lip-sync accuracy, and training efficiency.
comment: ICCV 2025, Project page: https://vincenthu19.github.io/GGTalker/
☆ Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration
Jiahe Chen, Jiaying He, Qian Shao, Qiyuan Chen, Jiahe Ying, Hongxia Xu, Jintai Chen, Jianwei Zheng, Jian Wu
Large Vision-Language Models (LVLMs) have demonstrated significant
advancements in multimodal understanding, yet they are frequently hampered by
hallucination, i.e., the generation of text that contradicts visual input. Existing
training-free decoding strategies exhibit critical limitations, including the
use of static constraints that do not adapt to semantic drift during
generation, inefficiency stemming from the need for multiple forward passes,
and degradation of detail due to overly rigid intervention rules. To overcome
these challenges, this paper introduces Dynamic Logits Calibration (DLC), a
novel training-free decoding framework designed to dynamically align text
generation with visual evidence at inference time. At the decoding phase, DLC
step-wise employs CLIP to assess the semantic alignment between the input image
and the generated text sequence. Then, the Relative Visual Advantage (RVA) of
candidate tokens is evaluated against a dynamically updated contextual
baseline, adaptively adjusting output logits to favor tokens that are visually
grounded. Furthermore, an adaptive weighting mechanism, informed by a real-time
context alignment score, carefully balances the visual guidance while ensuring
the overall quality of the textual output. Extensive experiments conducted
across diverse benchmarks and various LVLM architectures (such as LLaVA,
InstructBLIP, and MiniGPT-4) demonstrate that DLC significantly reduces
hallucinations, outperforming current methods while maintaining high inference
efficiency by avoiding multiple forward passes. Overall, we present an
effective and efficient decoding-time solution to mitigate hallucinations,
thereby enhancing the reliability of LVLMs in practical applications. Code will be
released on GitHub.
☆ Lightweight Physics-Informed Zero-Shot Ultrasound Plane Wave Denoising
Ultrasound Coherent Plane Wave Compounding (CPWC) enhances image contrast by
combining echoes from multiple steered transmissions. While increasing the
number of angles generally improves image quality, it drastically reduces the
frame rate and can introduce blurring artifacts in fast-moving targets.
Moreover, compounded images remain susceptible to noise, particularly when
acquired with a limited number of transmissions. We propose a zero-shot
denoising framework tailored for low-angle CPWC acquisitions, which enhances
contrast without relying on a separate training dataset. The method divides the
available transmission angles into two disjoint subsets, each used to form
compound images that include higher noise levels. The new compounded images are
then used to train a deep model via a self-supervised residual learning scheme,
enabling it to suppress incoherent noise while preserving anatomical
structures. Because angle-dependent artifacts vary between the subsets while
the underlying tissue response is similar, this physics-informed pairing allows
the network to learn to disentangle the inconsistent artifacts from the
consistent tissue signal. Unlike supervised methods, our model requires no
domain-specific fine-tuning or paired data, making it adaptable across
anatomical regions and acquisition setups. The entire pipeline supports
efficient training with low computational cost due to the use of a lightweight
architecture, which comprises only two convolutional layers. Evaluations on
simulation, phantom, and in vivo data demonstrate superior contrast enhancement
and structure preservation compared to both classical and deep learning-based
denoising methods.
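The subset-splitting step described in this abstract can be sketched as follows. This is an illustration rather than the authors' pipeline: the function names are hypothetical, frames are represented as plain nested lists, compounding is taken to be a pixel-wise mean, and the interleaved (even/odd index) split is an assumption, since the abstract only says the subsets are disjoint.

```python
def compound(frames):
    """Pixel-wise mean of equally sized 2D frames (lists of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[r][c] for f in frames) / n for c in range(cols)]
        for r in range(rows)
    ]


def split_and_compound(angle_frames):
    """Split per-angle frames into two disjoint subsets and compound each.

    Assumption: an interleaved split (even vs. odd angle indices). The two
    outputs share the underlying tissue signal but are built from different
    transmissions, so they can serve as a self-supervised training pair.
    """
    if len(angle_frames) < 2:
        raise ValueError("need at least two transmission angles")
    subset_a = angle_frames[0::2]  # angles 0, 2, 4, ...
    subset_b = angle_frames[1::2]  # angles 1, 3, 5, ...
    return compound(subset_a), compound(subset_b)


if __name__ == "__main__":
    # Four toy 1x2 "frames", one per steering angle.
    frames = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
    image_a, image_b = split_and_compound(frames)
    print(image_a, image_b)
```

Each compounded image uses only half of the available angles, which is consistent with the abstract's remark that the new compound images carry higher noise levels than the full compound.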
☆ Towards Reliable Detection of Empty Space: Conditional Marked Point Processes for Object Detection
Deep neural networks have set the state of the art in computer vision tasks
such as bounding box detection and semantic segmentation. Object detectors and
segmentation models assign confidence scores to predictions, reflecting the
model's uncertainty in object detection or pixel-wise classification. However,
these confidence estimates are often miscalibrated, as the models' architectures
and loss functions are tailored to task performance rather than to a
probabilistic foundation. Even with well-calibrated predictions, object detectors fail to
quantify uncertainty outside detected bounding boxes, i.e., the model does not
make a probability assessment of whether an area without detected objects is
truly free of obstacles. This poses a safety risk in applications such as
automated driving, where uncertainty in empty areas remains unexplored. In this
work, we propose an object detection model grounded in spatial statistics.
Bounding box data matches realizations of a marked point process, commonly used
to describe the probabilistic occurrence of spatial point events identified as
bounding box centers, where marks are used to describe the spatial extension of
bounding boxes and classes. Our statistical framework enables
likelihood-based training and provides well-defined confidence estimates for
whether a region is drivable, i.e., free of objects. We demonstrate the
effectiveness of our method through calibration assessments and evaluation of
performance.
comment: 15 pages, 4 figures, 3 tables
☆ TITAN: Query-Token based Domain Adaptive Adversarial Learning ICCV 2025
We focus on the source-free domain adaptive object detection (SF-DAOD)
problem, in which source data is unavailable during adaptation and the model must
adapt to an unlabeled target domain. Most approaches to this problem
employ a self-supervised approach using a student-teacher (ST) framework where
pseudo-labels are generated via a source-pretrained model for further
fine-tuning. We observe that the performance of the student model often degrades
drastically due to the collapse of the teacher model, which is primarily caused by
highly noisy pseudo-labels resulting from domain bias, discrepancies, and
significant domain shift. To obtain reliable pseudo-labels, we
propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which
separates the target images into two subsets: those similar to the source
(easy) and those dissimilar (hard). We propose a strategy to estimate variance
to partition the target domain. This approach leverages the insight that higher
detection variances correspond to higher recall and greater similarity to the
source domain. Also, we incorporate query-token-based adversarial modules into
a student-teacher baseline framework to reduce the domain gaps between two
feature representations. Experiments conducted on four natural imaging datasets
and two challenging medical datasets have substantiated the superior
performance of TITAN compared to existing state-of-the-art (SOTA)
methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7
percent over the current SOTA on C2F, C2B, S2C, and K2C benchmarks,
respectively.
comment: ICCV 2025
☆ Global and Local Entailment Learning for Natural World Imagery ICCV 2025
Learning the hierarchical structure of data in vision-language models is a
significant challenge. Previous works have attempted to address this challenge
by employing entailment learning. However, these approaches fail to model the
transitive nature of entailment explicitly, which establishes the relationship
between order and semantics within a representation space. In this work, we
introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the
explicit modeling of transitivity-enforced entailment. Our proposed framework
optimizes for the partial order of concepts within vision-language models. By
leveraging our framework, we develop a hierarchical vision-language foundation
model capable of representing the hierarchy in the Tree of Life. Our
experiments on hierarchical species classification and hierarchical retrieval
tasks demonstrate the enhanced performance of our models compared to the
existing state-of-the-art models. Our code and models are open-sourced at
https://vishu26.github.io/RCME/index.html.
comment: Accepted at ICCV 2025
☆ Logios: An Open-Source Greek Polytonic Optical Character Recognition System
In this paper, we present an Optical Character Recognition (OCR) system
specifically designed for the accurate recognition and digitization of Greek
polytonic texts. By leveraging the combined strengths of convolutional layers
for feature extraction and recurrent layers for sequence learning, our system
addresses the unique challenges posed by Greek polytonic scripts. This approach
aims to overcome the limitations of traditional OCR methods, offering
significant improvements in accuracy and efficiency. We release the underlying
model as an open-source library and make our OCR platform available for
academic use.
☆ Evaluation of Traffic Signals for Daily Traffic Pattern
Turning movement count (TMC) data is crucial for traffic signal design,
intersection geometry planning, and traffic flow and congestion analysis. This
work proposes three configuration methods, called dynamic, static, and hybrid,
for TMC-based traffic signals. A vision-based tracking system is developed to
estimate the TMC of six intersections in Las Vegas using traffic cameras. The
intersection design, route (e.g. vehicle movement directions), and signal
configuration files with compatible formats are synthesized and imported into
Simulation of Urban MObility (SUMO) for signal evaluation with realistic data. The
initial experimental results, based on estimated waiting times, indicate that
cycle times of 90 and 120 seconds work best for all intersections. In addition,
four intersections show better performance for dynamic signal timing
configuration, and the other two with lower performance have a lower ratio of
total vehicle count to total lanes of the intersection leg. Since daily traffic
flow often exhibits a bimodal pattern, we propose a hybrid signal method that
switches between dynamic and static methods, adapting to peak and off-peak
traffic conditions for improved flow management. Accordingly, a built-in traffic
generator module creates vehicle routes for 4 hours, including peak hours, and
a signal design module produces signal schedule cycles according to static,
dynamic, and hybrid methods. Vehicle count distributions are weighted
differently for each zone (i.e., West, North, East, South) to generate diverse
traffic patterns. The extended experimental results for 6 intersections with 4
hours of simulation time imply that zone-based traffic pattern distributions
affect signal design selection. Although the static method works very well for
evenly distributed zone-based traffic, the hybrid method works well for highly
weighted traffic at intersection pairs of the West-East and North-South zones.
☆ Spatial Mental Modeling from Limited Views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
Can Vision Language Models (VLMs) imagine the full scene from just a few
views, like humans do? Humans form spatial mental models, internal
representations of unseen space, to reason about layout, perspective, and
motion. Our new MindCube benchmark with 21,154 questions across 3,268 images
exposes this critical gap, where existing VLMs exhibit near-random performance.
Using MindCube, we systematically evaluate how well VLMs build robust spatial
mental models through representing positions (cognitive mapping), orientations
(perspective-taking), and dynamics (mental simulation for "what-if" movements).
We then explore three approaches to help VLMs approximate spatial mental
models, including unseen intermediate views, natural language reasoning chains,
and cognitive maps. The most significant improvement comes from a synergistic
approach, "map-then-reason", that jointly trains the model to first generate a
cognitive map and then reason upon it. By training models to reason over these
internal maps, we boosted accuracy from 37.8% to 60.8% (+23.0%). Adding
reinforcement learning pushed performance even further to 70.7% (+32.9%). Our
key insight is that such scaffolding of spatial mental models, actively
constructing and utilizing internal structured spatial representations with
flexible reasoning processes, significantly improves understanding of
unobservable space.
comment: Preprint version
☆ Rethinking Oversaturation in Classifier-Free Guidance via Low Frequency
Classifier-free guidance (CFG) succeeds in conditional diffusion models that
use a guidance scale to balance the influence of conditional and unconditional
terms. A high guidance scale is used to enhance the performance of the
conditional term. However, the high guidance scale often results in
oversaturation and unrealistic artifacts. In this paper, we introduce a new
perspective based on low-frequency signals, identifying the accumulation of
redundant information in these signals as the key factor behind oversaturation
and unrealistic artifacts. Building on this insight, we propose low-frequency
improved classifier-free guidance (LF-CFG) to mitigate these issues.
Specifically, we introduce an adaptive threshold-based measurement to pinpoint
the locations of redundant information. We determine a reasonable threshold by
analyzing the change rate of low-frequency information between prior and
current steps. We then apply a down-weight strategy to reduce the impact of
redundant information in the low-frequency signals. Experimental results
demonstrate that LF-CFG effectively alleviates oversaturation and unrealistic
artifacts across various diffusion models, including Stable Diffusion-XL,
Stable Diffusion 2.1, 3.0, 3.5, and SiT-XL.
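For readers unfamiliar with the guidance scale mentioned in this abstract, the standard CFG combination of conditional and unconditional predictions can be sketched as below. This shows only the baseline rule that LF-CFG sets out to improve, not the low-frequency down-weighting itself, whose details the abstract does not give. The function name is hypothetical and predictions are represented as flat lists of floats for simplicity.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance.

    guided = uncond + scale * (cond - uncond)

    scale = 1 returns the conditional prediction unchanged; larger values
    push the result further along the (cond - uncond) direction.
    """
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


if __name__ == "__main__":
    eps_uncond = [0.0, 1.0]
    eps_cond = [1.0, 3.0]
    print(cfg_combine(eps_uncond, eps_cond, 1.0))  # [1.0, 3.0]
    print(cfg_combine(eps_uncond, eps_cond, 2.0))  # [2.0, 5.0]
```

With a scale above 1, the difference between the two predictions is amplified at every sampling step, which gives some intuition for why high guidance scales can accumulate into the oversaturation the abstract describes.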
☆ A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario
Underground mining operations face significant safety challenges that make
emergency response capabilities crucial. While robots have shown promise in
assisting with search and rescue operations, their effectiveness depends on
reliable miner detection capabilities. Deep learning algorithms offer potential
solutions for automated miner detection, but require comprehensive training
datasets, which are currently lacking for underground mining environments. This
paper presents a novel thermal imaging dataset specifically designed to enable
the development and validation of miner detection systems for potential
emergency applications. We systematically captured thermal imagery of various
mining activities and scenarios to create a robust foundation for detection
algorithms. To establish baseline performance metrics, we evaluated several
state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11,
and RT-DETR on our dataset. While not exhaustive of all possible emergency
situations, this dataset serves as a crucial first step toward developing
reliable thermal-based miner detection systems that could eventually be
deployed in real emergency scenarios. This work demonstrates the feasibility of
using thermal imaging for miner detection and establishes a foundation for
future research in this critical safety application.
☆ ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
While end-to-end video-to-audio generation has greatly improved, producing
high-fidelity audio that authentically captures the nuances of visual content
remains challenging. As in professional work in the creative industries, such
generation requires sophisticated reasoning about factors such as visual
dynamics, acoustic environments, and temporal relationships. We present
ThinkSound, a novel framework that leverages Chain-of-Thought (CoT)
reasoning to enable stepwise, interactive audio generation and editing for
videos. Our approach decomposes the process into three complementary stages:
foundational foley generation that creates semantically coherent soundscapes,
interactive object-centric refinement through precise user interactions, and
targeted editing guided by natural language instructions. At each stage, a
multimodal large language model generates contextually aligned CoT reasoning
that guides a unified audio foundation model. Furthermore, we introduce
AudioCoT, a comprehensive dataset with structured reasoning
annotations that establishes connections between visual content, textual
descriptions, and sound synthesis. Experiments demonstrate that ThinkSound
achieves state-of-the-art performance in video-to-audio generation across both
audio metrics and CoT metrics and excels on the out-of-distribution Movie Gen
Audio benchmark. The demo page is available at https://ThinkSound-Demo.github.io.
☆ Controllable 3D Placement of Objects with Scene-Aware Diffusion Models
Image editing approaches have become more powerful and flexible with the
advent of strong text-conditioned generative models. However, placing objects
in an environment with a precise location and orientation remains a
challenge, as this typically requires carefully crafted inpainting masks or
prompts. In this work, we show that a carefully designed visual map, combined
with coarse object masks, is sufficient for high quality object placement. We
design a conditioning signal that resolves ambiguities, while being flexible
enough to allow for changing of shapes or object orientations. By building on
an inpainting model, we leave the background intact by design, in contrast to
methods that model objects and background jointly. We demonstrate the
effectiveness of our method in the automotive setting, where we compare
different conditioning signals in novel object placement tasks. These tasks are
designed to measure edit quality not only in terms of appearance, but also in
terms of pose and location accuracy, including cases that require non-trivial
shape changes. Lastly, we show that fine location control can be combined with
appearance control to place existing objects in precise locations in a scene.
☆ Benchmarking Deep Learning and Vision Foundation Models for Atypical vs. Normal Mitosis Classification with Cross-Dataset Evaluation
Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Rutger A. Fick, Thomas Conrad, Jonas Ammeling, Nils Porsche, Robert Klopfleisch, Christopher Kaltenecker, Katharina Breininger, Marc Aubreville, Christof A. Bertram
Atypical mitoses mark a deviation in the cell division process that can be an
independent prognostically relevant marker for tumor malignancy. However, their
identification remains challenging due to low prevalence, at times subtle
morphological differences from normal mitoses, low inter-rater agreement among
pathologists, and class imbalance in datasets. Building on the Atypical Mitosis
dataset for Breast Cancer (AMi-Br), this study presents a comprehensive
benchmark comparing deep learning approaches for automated atypical mitotic
figure (AMF) classification, including baseline models, foundation models with
linear probing, and foundation models fine-tuned with low-rank adaptation
(LoRA). For rigorous evaluation, we further introduce two new hold-out AMF
datasets: AtNorM-Br, a dataset of mitoses from the TCGA breast cancer
cohort, and AtNorM-MD, a multi-domain dataset of mitoses from the MIDOG++
training set. We found average balanced accuracy values of up to 0.8135,
0.7696, and 0.7705 on the in-domain AMi-Br and the out-of-domain AtNorM-Br and
AtNorM-MD datasets, respectively, with the results being particularly good for
LoRA-based adaptation of the Virchow-line of foundation models. Our work shows
that atypical mitosis classification, while being a challenging problem, can be
effectively addressed through the use of recent advances in transfer learning
and model fine-tuning techniques. We make available all code and data used in
this paper in this GitHub repository:
https://github.com/DeepMicroscopy/AMi-Br_Benchmark.
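For readers unfamiliar with the metric reported above: balanced accuracy is the
mean of per-class recall, so a rare class such as atypical mitoses counts as
much as the majority class. The following is a minimal sketch in plain Python;
the function name and the toy labels are illustrative assumptions and are not
taken from the paper's code.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy imbalanced example (hypothetical data): 8 normal (0), 2 atypical (1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one atypical figure is missed
score = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case plain accuracy is 0.9 while balanced accuracy is 0.75, which
shows why the latter is the more informative number under class imbalance.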
☆ HyperSORT: Self-Organising Robust Training with hyper-networks MICCAI 2025
Medical imaging datasets often contain heterogeneous biases ranging from
erroneous labels to inconsistent labeling styles. Such biases can negatively
impact the performance of deep segmentation networks. Yet, the identification and
characterization of such biases is a particularly tedious and challenging task.
In this paper, we introduce HyperSORT, a framework using a hyper-network
predicting UNets' parameters from latent vectors representing both the image
and annotation variability. The hyper-network parameters and the latent vector
collection corresponding to each data sample from the training set are jointly
learned. Hence, instead of optimizing a single neural network to fit a dataset,
HyperSORT learns a complex distribution of UNet parameters where low density
areas can capture noise-specific patterns while larger modes robustly segment
organs in differentiated but meaningful manners. We validate our method on two
3D abdominal CT public datasets: a synthetically perturbed version of the
AMOS dataset, and TotalSegmentator, a large-scale dataset containing real
unknown biases and errors. Our experiments show that HyperSORT creates a
structured mapping of the dataset allowing the identification of relevant
systematic biases and erroneous samples. Latent space clusters yield UNet
parameters performing the segmentation task in accordance with the underlying
learned systematic bias. The code and our analysis of the TotalSegmentator
dataset are made available: https://github.com/ImFusionGmbH/HyperSORT
comment: Accepted at MICCAI 2025
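The hyper-network idea in the abstract above can be illustrated with a
deliberately tiny sketch: a latent vector is mapped to the parameters of a
primary model, so different latents yield different predictors. Everything
here is an assumption made for illustration (a linear hyper-network, a
two-parameter affine "primary network" standing in for a UNet, and the names
`hyper_network` and `primary_network`); it is not the HyperSORT implementation.

```python
def hyper_network(z, H, b):
    """Map a latent vector z to a flat parameter vector theta = H z + b."""
    return [sum(h * zj for h, zj in zip(row, z)) + bi for row, bi in zip(H, b)]


def primary_network(theta, x):
    """Stand-in for the segmentation network: an affine map y = theta[0]*x + theta[1]."""
    return theta[0] * x + theta[1]


# Hypothetical hyper-network weights; in training these would be learned
# jointly with one latent vector per training sample.
H = [[1.0, 2.0], [0.0, -1.0]]
b = [0.0, 0.5]

z_a = [1.0, 0.0]  # latent for one annotation style
z_b = [0.0, 1.0]  # latent for another annotation style

theta_a = hyper_network(z_a, H, b)
theta_b = hyper_network(z_b, H, b)
y_a = primary_network(theta_a, 2.0)
y_b = primary_network(theta_b, 2.0)
```

The same input produces different outputs under the two latents, which is the
property that lets clusters in latent space correspond to distinct labeling
behaviors.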
☆ EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting
Efficient three-dimensional reconstruction and real-time visualization are
critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian
Splatting (3DGS) has demonstrated remarkable performance in efficient 3D
reconstruction and rendering. Most 3DGS-based Simultaneous Localization and
Mapping (SLAM) methods only rely on the appearance constraints for optimizing
both 3DGS and camera poses. However, in endoscopic scenarios, photometric
inconsistencies caused by non-Lambertian surfaces and dynamic motion from
breathing degrade the performance of SLAM systems. To
address these issues, we additionally introduce optical flow loss as a
geometric constraint, which effectively constrains both the 3D structure of the
scene and the camera motion. Furthermore, we propose a depth regularisation
strategy to mitigate the problem of photometric inconsistencies and ensure the
validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve
scene representation in the SLAM system, we improve the 3DGS refinement
strategy by focusing on viewpoints corresponding to keyframes with suboptimal
rendering quality, achieving better rendering results. Extensive
experiments on the C3VD static dataset and the StereoMIS dynamic dataset
demonstrate that our method outperforms existing state-of-the-art methods in
novel view synthesis and pose estimation, exhibiting high performance in both
static and dynamic surgical scenes. The source code will be publicly available
upon paper acceptance.
☆ XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation
Achieving fine-grained control over subject identity and semantic attributes
(pose, style, lighting) in text-to-image generation, particularly for multiple
subjects, often undermines the editability and coherence of Diffusion
Transformers (DiTs). Many approaches introduce artifacts or suffer from
attribute entanglement. To overcome these challenges, we propose a novel
multi-subject controlled generation model XVerse. By transforming reference
images into offsets for token-specific text-stream modulation, XVerse allows
for precise and independent control of specific subjects without disrupting
image latents or features. Consequently, XVerse offers high-fidelity, editable
multi-subject image synthesis with robust control over individual subject
characteristics and semantic attributes. This advancement significantly
improves personalized and complex scene generation capabilities.
comment: Project Page: https://bytedance.github.io/XVerse GitHub Link:
https://github.com/bytedance/XVerse
☆ Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction ICCV 2025
This paper presents an end-to-end framework for reconstructing 3D parametric
curves directly from multi-view edge maps. In contrast to existing two-stage
methods that follow a sequential "edge point cloud reconstruction and
parametric curve fitting" pipeline, our one-stage approach optimizes 3D
parametric curves directly from 2D edge maps, eliminating error accumulation
caused by the inherent optimization gap between disconnected stages. However,
parametric curves inherently lack suitability for rendering-based multi-view
optimization, necessitating a complementary representation that preserves their
geometric properties while enabling differentiable rendering. We propose a
novel bi-directional coupling mechanism between parametric curves and
edge-oriented Gaussian components. This tight correspondence formulates a
curve-aware Gaussian representation, CurveGaussian, that enables
differentiable rendering of 3D curves, allowing direct optimization guided by
multi-view evidence. Furthermore, we introduce a dynamically adaptive topology
optimization framework during training to refine curve structures through
linearization, merging, splitting, and pruning operations. Comprehensive
evaluations on the ABC dataset and real-world benchmarks demonstrate our
one-stage method's superiority over two-stage alternatives, particularly in
producing cleaner and more robust reconstructions. Additionally, by directly
optimizing parametric curves, our method significantly reduces the parameter
count during training, achieving both higher efficiency and superior
performance compared to existing approaches.
comment: Code: https://github.com/zhirui-gao/Curve-Gaussian Accepted by ICCV
2025
☆ FastRef: Fast Prototype Refinement for Few-Shot Industrial Anomaly Detection
Few-shot industrial anomaly detection (FS-IAD) presents a critical challenge
for practical automated inspection systems operating in data-scarce
environments. While existing approaches predominantly focus on deriving
prototypes from limited normal samples, they typically neglect to
systematically incorporate query image statistics to enhance prototype
representativeness. To address this issue, we propose FastRef, a novel and
efficient prototype refinement framework for FS-IAD. Our method operates
through an iterative two-stage process: (1) characteristic transfer from query
features to prototypes via an optimizable transformation matrix, and (2)
anomaly suppression through prototype alignment. The characteristic transfer is
achieved through linear reconstruction of query features from prototypes, while
the anomaly suppression addresses a key observation in FS-IAD that unlike
conventional IAD with abundant normal prototypes, the limited-sample setting
makes anomaly reconstruction more probable. Therefore, we employ optimal
transport (OT) for non-Gaussian sampled features to measure and minimize the
gap between prototypes and their refined counterparts for anomaly suppression.
For comprehensive evaluation, we integrate FastRef with four competitive
prototype-based FS-IAD methods: PatchCore, FastRecon, WinCLIP, and AnomalyDINO.
Extensive experiments across four benchmark datasets (MVTec, ViSA, MPDD, and
RealIAD) demonstrate both the effectiveness and computational efficiency of our
approach in the 1-, 2-, and 4-shot settings.
comment: 18 pages, 7 figures, 6 tables
☆ GenFlow: Interactive Modular System for Image Generation
Generative art unlocks boundless creative possibilities, yet its full
potential remains untapped due to the technical expertise required for advanced
architectural concepts and computational workflows. To bridge this gap, we
present GenFlow, a novel modular framework that empowers users of all skill
levels to generate images with precision and ease. Featuring a node-based
editor for seamless customization and an intelligent assistant powered by
natural language processing, GenFlow transforms the complexity of workflow
creation into an intuitive and accessible experience. By automating deployment
processes and minimizing technical barriers, our framework makes cutting-edge
generative art tools available to everyone. A user study demonstrated GenFlow's
ability to optimize workflows, reduce task completion times, and enhance user
understanding through its intuitive interface and adaptive features. These
results position GenFlow as a groundbreaking solution that redefines
accessibility and efficiency in the realm of generative art.
☆ CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection ICCV 2025
Zhixin Cheng, Jiacheng Deng, Xinjun Li, Xiaotian Yin, Bohao Liao, Baoqun Yin, Wenfei Yang, Tianzhu Zhang
Detection-free methods typically follow a coarse-to-fine pipeline, extracting
image and point cloud features for patch-level matching and refining dense
pixel-to-point correspondences. However, differences in feature channel
attention between images and point clouds may lead to degraded matching
results, ultimately impairing registration accuracy. Furthermore, similar
structures in the scene could lead to redundant correspondences in cross-modal
matching. To address these issues, we propose Channel Adaptive Adjustment
Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances
intra-modal features and suppresses cross-modal sensitivity, while GOS replaces
local selection with global optimization. Experiments on RGB-D Scenes V2 and
7-Scenes demonstrate the superiority of our method, achieving state-of-the-art
performance in image-to-point cloud registration.
comment: ICCV 2025 accepted
☆ ToosiCubix: Monocular 3D Cuboid Labeling via Vehicle Part Annotations
Many existing methods for 3D cuboid annotation of vehicles rely on expensive
and carefully calibrated camera-LiDAR or stereo setups, limiting their
accessibility for large-scale data collection. We introduce ToosiCubix, a
simple yet powerful approach for annotating ground-truth cuboids using only
monocular images and intrinsic camera parameters. Our method requires only
about 10 user clicks per vehicle, making it highly practical for adding 3D
annotations to existing datasets originally collected without specialized
equipment. By annotating specific features (e.g., wheels, car badge,
symmetries) across different vehicle parts, we accurately estimate each
vehicle's position, orientation, and dimensions up to a scale ambiguity (8
DoF). The geometric constraints are formulated as an optimization problem,
which we solve using a coordinate descent strategy, alternating between
Perspective-n-Point (PnP) and least-squares subproblems. To handle common
ambiguities such as scale and unobserved dimensions, we incorporate
probabilistic size priors, enabling 9 DoF cuboid placements. We validate our
annotations against the KITTI and Cityscapes3D datasets, demonstrating that our
method offers a cost-effective and scalable solution for high-quality 3D cuboid
annotation.
☆ CoPa-SG: Dense Scene Graphs with Parametric and Proto-Relations
Julian Lorenz, Mrunmai Phatak, Robin Schön, Katja Ludwig, Nico Hörmann, Annemarie Friedrich, Rainer Lienhart
2D scene graphs provide a structural and explainable framework for scene
understanding. However, current work still struggles with the lack of accurate
scene graph data. To overcome this data bottleneck, we present CoPa-SG, a
synthetic scene graph dataset with highly precise ground truth and exhaustive
relation annotations between all objects. Moreover, we introduce parametric and
proto-relations, two new fundamental concepts for scene graphs. The former
provides a much more fine-grained representation than its traditional
counterpart by enriching relations with additional parameters such as angles or
distances. The latter encodes hypothetical relations in a scene graph and
describes how relations would form if new objects are placed in the scene.
Using CoPa-SG, we compare the performance of various scene graph generation
models. We demonstrate how our new relation types can be integrated in
downstream applications to enhance planning and reasoning capabilities.
☆ ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Cinematography, the fundamental visual language of film, is essential for
conveying narrative, emotion, and aesthetic quality. While recent
Vision-Language Models (VLMs) demonstrate strong general visual understanding,
their proficiency in comprehending the nuanced cinematic grammar embedded
within individual shots remains largely unexplored and lacks robust evaluation.
This critical gap limits both fine-grained visual comprehension and the
precision of AI-assisted video generation. To address this, we introduce
ShotBench, a comprehensive benchmark specifically designed for
cinematic language understanding. It features over 3.5k expert-annotated QA
pairs from images and video clips, meticulously curated from over 200 acclaimed
(predominantly Oscar-nominated) films and spanning eight key cinematography
dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their
substantial limitations: even the top-performing model achieves less than 60%
average accuracy, particularly struggling with fine-grained visual cues and
complex spatial reasoning. To catalyze advancement in this domain, we construct
ShotQA, a large-scale multimodal dataset comprising approximately 70k
cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through
supervised fine-tuning and Group Relative Policy Optimization. ShotVL
significantly outperforms all existing open-source and proprietary models on
ShotBench, establishing new state-of-the-art performance. We
open-source our models, data, and code to foster rapid progress in this crucial
area of AI-driven cinematic understanding and generation.
☆ Generalizable Neural Electromagnetic Inverse Scattering
Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in
applications such as medical imaging, where the goal is to reconstruct the
relative permittivity from the scattered electromagnetic field. This inverse
process is inherently ill-posed and highly nonlinear, making it particularly
challenging. A recent machine learning-based approach, Img-Interiors, shows
promising results by leveraging continuous implicit functions. However, it
requires case-specific optimization, lacks generalization to unseen data, and
fails under sparse transmitter setups (e.g., with only one transmitter). To
address these limitations, we revisit EISP from a physics-informed perspective,
reformulating it as a two-stage inverse transmission-scattering process. This
formulation reveals the induced current as a generalizable intermediate
representation, effectively decoupling the nonlinear scattering process from
the ill-posed inverse problem. Built on this insight, we propose the first
generalizable physics-driven framework for EISP, comprising a current estimator
and a permittivity solver, working in an end-to-end manner. The current
estimator explicitly learns the induced current as a physical bridge between
the incident and scattered fields, while the permittivity solver computes the
relative permittivity directly from the estimated induced current. This design
enables data-driven training and generalizable feed-forward prediction of
relative permittivity on unseen data while maintaining strong robustness to
transmitter sparsity. Extensive experiments show that our method outperforms
state-of-the-art approaches in reconstruction accuracy, generalization, and
robustness. This work offers a fundamentally new perspective on electromagnetic
inverse scattering and represents a major step toward cost-effective practical
solutions for electromagnetic imaging.
☆ PanSt3R: Multi-view Consistent Panoptic Segmentation ICCV 2025
Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, Gabriela Csurka
Panoptic segmentation of 3D scenes, involving the segmentation and
classification of object instances in a dense 3D reconstruction of a scene, is
a challenging problem, especially when relying solely on unposed 2D images.
Existing approaches typically leverage off-the-shelf models to extract
per-frame 2D panoptic segmentations, before optimizing an implicit geometric
representation (often based on NeRF) to integrate and fuse the 2D predictions.
We argue that relying on 2D panoptic segmentation for a problem inherently 3D
and multi-view is likely suboptimal as it fails to leverage the full potential
of spatial relationships across views. In addition to requiring camera
parameters, these approaches also necessitate computationally expensive
test-time optimization for each scene. Instead, in this work, we propose a
unified and integrated approach PanSt3R, which eliminates the need for
test-time optimization by jointly predicting 3D geometry and multi-view
panoptic segmentation in a single forward pass. Our approach builds upon recent
advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view
version of DUSt3R, and enhances it with semantic awareness and multi-view
panoptic segmentation capabilities. We additionally revisit the standard
post-processing mask merging procedure and introduce a more principled approach
for multi-view segmentation. We also introduce a simple method for generating
novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS.
Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable,
and achieves state-of-the-art performance on several benchmarks, while being
orders of magnitude faster than existing methods.
comment: Accepted at ICCV 2025
☆ Automatic Reviewers Assignment to a Research Paper Based on Allied References and Publications Weight
Every day, a vast stream of research documents is submitted to conferences,
anthologies, journals, newsletters, annual reports, daily papers, and various
periodicals. Many such publications use independent external specialists to
review submissions. This process is called peer review, and the reviewers are
called referees. However, it is not always possible to pick the best referee
for reviewing. Moreover, new research fields are emerging in every sector, and
the number of research papers is increasing dramatically. To review all these
papers, every journal assigns a small team of referees who may not be experts
in all areas. For example, a research paper in communication technology should
be reviewed by an expert from the same field. Thus, efficiently selecting the
best reviewer or referee for a research paper is a big challenge.
In this research, we propose and implement a program that uses a new strategy
to automatically select the best reviewers for a research paper. Every research
paper contains references at the end, usually from the same area. First, we
collect the references and count authors who have at least one paper in the
references. Then, we automatically browse the web to extract research topic
keywords. Next, we search for top researchers in the specific topic and count
their h-index, i10-index, and citations for the first n authors. Afterward, we
rank the top n authors based on a score and automatically browse their
homepages to retrieve email addresses. We also check their co-authors and
colleagues online and discard them from the list. The remaining top n authors,
generally professors, are likely the best referees for reviewing the research
paper.
comment: IEEE Conference Proceedings (5 Pages)
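The selection steps described in the abstract above (count authors in the
reference list, score them by bibliometric indicators, drop co-authors and
colleagues, keep the top n) can be sketched as follows. This is only an
illustration under stated assumptions: the abstract does not give the scoring
formula, so the weighted sum and its weights are hypothetical, the web
browsing steps are replaced by in-memory data, and the names `rank_referees`,
`metrics`, and `excluded` are invented for this example.

```python
from collections import Counter


def rank_referees(reference_authors, metrics, excluded, n,
                  weights=(1.0, 1.0, 0.001)):
    """Rank candidate referees found in a paper's reference list.

    reference_authors: list of author lists, one per cited paper.
    metrics: author -> (h_index, i10_index, citations); assumed already fetched.
    excluded: co-authors and colleagues of the submitting authors.
    weights: hypothetical weights for the three indicators.
    """
    # Count how many cited papers each author appears on.
    counts = Counter(a for authors in reference_authors for a in authors)
    w_h, w_i10, w_cit = weights
    scored = []
    for author, count in counts.items():
        if author in excluded or author not in metrics:
            continue
        h, i10, citations = metrics[author]
        # Assumed score: reference count plus a weighted sum of indicators.
        score = count + w_h * h + w_i10 * i10 + w_cit * citations
        scored.append((score, author))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [author for _, author in scored[:n]]


# Hypothetical toy data.
references = [["A. Rahman", "B. Lee"], ["B. Lee", "C. Ortiz"], ["D. Kim"]]
metrics = {
    "A. Rahman": (10, 12, 500),
    "B. Lee": (30, 40, 4000),
    "C. Ortiz": (20, 25, 2000),
    "D. Kim": (50, 80, 9000),
}
excluded = {"D. Kim"}  # e.g. a co-author of the submitting authors
top = rank_referees(references, metrics, excluded, n=2)
```

Note that the highest-scoring candidate is removed here because of the
conflict-of-interest filter, which is applied before ranking.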
☆ Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models
Haoyang Wu, Tsun-Hsuan Wang, Mathias Lechner, Ramin Hasani, Jennifer A. Eckhoff, Paul Pak, Ozanan R. Meireles, Guy Rosman, Yutong Ban, Daniela Rus
Surgical workflow analysis is essential in robot-assisted surgeries, yet the
long duration of such procedures poses significant challenges for comprehensive
video analysis. Recent approaches have predominantly relied on transformer
models; however, their quadratic attention mechanism restricts efficient
processing of lengthy surgical videos. In this paper, we propose a novel
hierarchical input-dependent state space model that leverages the linear
scaling property of state space models to enable decision making on full-length
videos while capturing both local and global dynamics. Our framework
incorporates a temporally consistent visual feature extractor, which appends a
state space model head to a visual feature extractor to propagate temporal
information. The proposed model consists of two key modules: a
local-aggregation state space model block that effectively captures intricate
local dynamics, and a global-relation state space model block that models
temporal dependencies across the entire video. The model is trained using a
hybrid discrete-continuous supervision strategy, where signals from both discrete
phase labels and continuous phase progress are propagated through the
network. Experiments have shown that our method outperforms the current
state-of-the-art methods by a large margin (+2.8% on Cholec80, +4.3% on
MICCAI2016, and +12.9% on Heichole datasets). Code will be publicly available
after paper acceptance.
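To make the "linear scaling" and "input-dependent" properties mentioned in the
abstract above concrete, here is a minimal scalar sketch of an input-dependent
state space recurrence. It visits each time step once, so cost grows linearly
with video length, in contrast to the quadratic cost of full attention. The
sigmoid gate, the scalar state, and the name `selective_scan` are assumptions
chosen for illustration and do not reproduce the paper's architecture.

```python
import math


def selective_scan(xs, w_gate=1.0):
    """Run h_t = a_t * h_{t-1} + (1 - a_t) * x_t with an input-dependent a_t.

    a_t = sigmoid(w_gate * x_t) lies in (0, 1), so each input decides how much
    of the past state is kept. One pass over xs gives O(len(xs)) cost.
    """
    h = 0.0
    outputs = []
    for x in xs:
        a = 1.0 / (1.0 + math.exp(-w_gate * x))  # input-dependent decay
        h = a * h + (1.0 - a) * x
        outputs.append(h)
    return outputs


# Constant input: the state rises monotonically toward the input value.
states = selective_scan([1.0] * 5)
```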
☆ Multimodal LLMs for Visualization Reconstruction and Understanding
Visualizations are crucial for data communication, yet understanding them
requires comprehension of both visual elements and their underlying data
relationships. Current multimodal large models, while effective in natural
image understanding, struggle with visualization due to their inability to
decode the data-to-visual mapping rules and extract structured information. To
address these challenges, we present a novel dataset and train multimodal
visualization LLMs specifically designed for understanding. Our approach
combines chart images with their corresponding vectorized representations,
encoding schemes, and data features. The proposed vector format enables compact
and accurate reconstruction of visualization content. Experimental results
demonstrate significant improvements in both data extraction accuracy and chart
reconstruction quality.
☆ LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning
Current vision-language models (VLMs) are well-adapted for general visual
understanding tasks. However, they perform inadequately when handling complex
visual tasks related to human poses and actions due to the lack of specialized
vision-language instruction-following data. We introduce a method for
generating such data by integrating human keypoints with traditional visual
features such as captions and bounding boxes, enabling more precise
understanding of human-centric scenes. Our approach constructs a dataset
comprising 200,328 samples tailored to fine-tune models for human-centric
tasks, focusing on three areas: conversation, detailed description, and complex
reasoning. We establish an Extended Human Pose and Action Understanding
Benchmark (E-HPAUB) to assess model performance on human pose and action
understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and
evaluate our resulting LLaVA-Pose model on the benchmark, achieving significant
improvements. Experimental results show an overall improvement of 33.2%
compared to the original LLaVA-1.5-7B model. These findings highlight the
effectiveness of keypoint-integrated data in enhancing multimodal models for
human-centric visual understanding. Code is available at
https://github.com/Ody-trek/LLaVA-Pose.
comment: arXiv admin note: substantial text overlap with arXiv:2409.09306
☆ DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images
Visual grounding in text-rich document images is a critical yet underexplored
challenge for document intelligence and visual question answering (VQA)
systems. We present DrishtiKon, a multi-granular visual grounding framework
designed to enhance interpretability and trust in VQA for complex, multilingual
documents. Our approach integrates robust multi-lingual OCR, large language
models, and a novel region matching algorithm to accurately localize answer
spans at block, line, word, and point levels. We curate a new benchmark from
the CircularsVQA test set, providing fine-grained, human-verified annotations
across multiple granularities. Extensive experiments demonstrate that our
method achieves state-of-the-art grounding accuracy, with line-level
granularity offering the best trade-off between precision and recall. Ablation
studies further highlight the benefits of multi-block and multi-line reasoning.
Comparative evaluations with leading vision-language models reveal the
limitations of current VLMs in precise localization, underscoring the
effectiveness of our structured, alignment-based approach. Our findings pave
the way for more robust and interpretable document understanding systems in
real-world, text-centric scenarios. Code and dataset have been made available at
https://github.com/kasuba-badri-vishal/DhrishtiKon.
comment: Work in progress
☆ Continual Self-Supervised Learning with Masked Autoencoders in Remote Sensing
The development of continual learning (CL) methods, which aim to learn new
tasks in a sequential manner from the training data acquired continuously, has
gained great attention in remote sensing (RS). The existing CL methods in RS,
while learning new tasks, enhance robustness towards catastrophic forgetting.
This is achieved by using a large number of labeled training samples, which is
costly and not always feasible to gather in RS. To address this problem, we
propose a novel continual self-supervised learning method in the context of
masked autoencoders (denoted as CoSMAE). The proposed CoSMAE consists of two
components: i) data mixup; and ii) model mixup knowledge distillation. Data
mixup is associated with retaining information on previous data distributions
by interpolating images from the current task with those from the previous
tasks. Model mixup knowledge distillation is associated with distilling
knowledge from past models and the current model simultaneously by
interpolating their model weights to form a teacher for the knowledge
distillation. The two components complement each other to regularize the MAE at
the data and model levels to facilitate better generalization across tasks and
reduce the risk of catastrophic forgetting. Experimental results show that
CoSMAE achieves significant improvements of up to 4.94% over state-of-the-art
CL methods applied to MAE. Our code is publicly available at:
https://git.tu-berlin.de/rsim/CoSMAE.
comment: Accepted to IEEE Geoscience and Remote Sensing Letters. Our code is
available at https://git.tu-berlin.de/rsim/CoSMAE
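The two mixup components described in the abstract above both reduce to linear
interpolation, once over images and once over model weights. The sketch below
shows only that core operation in plain Python. The coefficient names `lam`
and `alpha`, the list-based "images" and weight dictionaries, and the function
names are assumptions for illustration; how CoSMAE chooses the coefficients
and uses the teacher in distillation is not specified here.

```python
def data_mixup(x_current, x_previous, lam):
    """Interpolate a current-task image with a previous-task image."""
    return [lam * c + (1.0 - lam) * p for c, p in zip(x_current, x_previous)]


def model_mixup(w_current, w_past, alpha):
    """Interpolate two sets of model weights, layer by layer, to form a teacher."""
    return {
        name: [alpha * c + (1.0 - alpha) * p
               for c, p in zip(w_current[name], w_past[name])]
        for name in w_current
    }


# Hypothetical toy inputs: two-pixel "images" and one-layer "models".
mixed_image = data_mixup([1.0, 0.0], [0.0, 1.0], lam=0.5)
teacher = model_mixup({"enc": [2.0, 4.0]}, {"enc": [0.0, 0.0]}, alpha=0.25)
```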
☆ HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation MICCAI 2025
Surgical Video Synthesis has emerged as a promising research direction
following the success of diffusion models in general-domain video generation.
Although existing approaches achieve high-quality video generation, most are
unconditional and fail to maintain consistency with surgical actions and
phases, lacking the surgical understanding and fine-grained guidance necessary
for factual simulation. We address these challenges by proposing HieraSurg, a
hierarchy-aware surgical video generation framework consisting of two
specialized diffusion models. Given a surgical phase and an initial frame,
HieraSurg first predicts future coarse-grained semantic changes through a
segmentation prediction model. The final video is then generated by a
second-stage model that augments these temporal segmentation maps with
fine-grained visual features, leading to effective texture rendering and
integration of semantic information in the video space. Our approach leverages
surgical information at multiple levels of abstraction, including surgical
phase, action triplets, and panoptic segmentation maps. The experimental
results on Cholecystectomy Surgical Video Generation demonstrate that the model
significantly outperforms prior work both quantitatively and qualitatively,
showing strong generalization capabilities and the ability to generate higher
frame-rate videos. The model exhibits particularly fine-grained adherence when
provided with existing segmentation maps, suggesting its potential for
practical surgical applications.
comment: Accepted at MICCAI 2025
☆ HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxing Zhao, Boyuan Sun, Bowen Yin, Xihan Wei, Jingren Zhou
With the rapid evolution of multimodal large language models, the capacity to
deeply understand and interpret human intentions has emerged as a critical
capability, which demands detailed and thoughtful reasoning. In recent studies,
Reinforcement Learning (RL) has demonstrated potential in enhancing the
reasoning capabilities of Large Language Models (LLMs). Nonetheless, the
challenges associated with adapting RL to multimodal data and formats remain
largely unaddressed. In this paper, we identify two issues in existing
multimodal reasoning models: insufficient global context understanding and
shortcut problems. Insufficient context understanding can happen when a model
misinterprets multimodal context, resulting in incorrect answers. The shortcut
problem occurs when the model overlooks crucial clues in multimodal inputs,
directly addressing the query without considering the multimodal information.
To tackle these issues, we emphasize the necessity for the model to reason with
a clear understanding of the global context within multimodal inputs. This
global context understanding can effectively prevent the model from overlooking
key multimodal cues and ensure a thorough reasoning process. To ensure the
accurate interpretation of multimodal context information, we implement a
context reward judged by a large language model, alongside format and accuracy
rewards. Additionally, to improve complex reasoning capability, we employ the
LLM to assess the logical reward, determining whether the reasoning process
successfully integrates multimodal information with logical methods. We also
introduce a reasoning omni-modal benchmark, IntentBench, aimed at evaluating
models in understanding complex human intentions and emotions. Our proposed
method demonstrates advanced performance across multiple omni-modal benchmarks
compared to other open-source omni-modal models.
☆ WordCon: Word-level Typography Control in Scene Text Rendering
Achieving precise word-level typography control within generated images
remains a persistent challenge. To address it, we construct a new word-level
controlled scene text dataset and introduce the Text-Image Alignment (TIA)
framework. This framework leverages cross-modal correspondence between text and
local image regions provided by grounding models to enhance the Text-to-Image
(T2I) model training. Furthermore, we propose WordCon, a hybrid
parameter-efficient fine-tuning (PEFT) method. WordCon reparameterizes
selective key parameters, improving both efficiency and portability. This
allows seamless integration into diverse pipelines, including artistic text
rendering, text editing, and image-conditioned text rendering. To further
enhance controllability, the masked loss at the latent level is applied to
guide the model to concentrate on learning the text region in the image, and
the joint-attention loss provides feature-level supervision to promote
disentanglement between different words. Both qualitative and quantitative
results demonstrate the superiority of our method over the state of the art. The
datasets and source code will be available for academic use.
☆ FairyGen: Storied Cartoon Video from a Single Child-Drawn Character
We propose FairyGen, an automatic system for generating story-driven cartoon
videos from a single child's drawing, while faithfully preserving its unique
artistic style. Unlike previous storytelling methods that primarily focus on
character consistency and basic motion, FairyGen explicitly disentangles
character modeling from stylized background generation and incorporates
cinematic shot design to support expressive and coherent storytelling. Given a
single character sketch, we first employ an MLLM to generate a structured
storyboard with shot-level descriptions that specify environment settings,
character actions, and camera perspectives. To ensure visual consistency, we
introduce a style propagation adapter that captures the character's visual
style and applies it to the background, faithfully retaining the character's
full visual identity while synthesizing style-consistent scenes. A shot design
module further enhances visual diversity and cinematic quality through frame
cropping and multi-view synthesis based on the storyboard. To animate the
story, we reconstruct a 3D proxy of the character to derive physically
plausible motion sequences, which are then used to fine-tune an MMDiT-based
image-to-video diffusion model. We further propose a two-stage motion
customization adapter: the first stage learns appearance features from
temporally unordered frames, disentangling identity from motion; the second
stage models temporal dynamics using a timestep-shift strategy with frozen
identity weights. Once trained, FairyGen directly renders diverse and coherent
video scenes aligned with the storyboard. Extensive experiments demonstrate
that our system produces animations that are stylistically faithful and
narratively structured, with natural motion, highlighting its potential for
personalized and engaging story animation. The code will be available at
https://github.com/GVCLab/FairyGen
comment: Project Page: https://jayleejia.github.io/FairyGen/ ; Code:
https://github.com/GVCLab/FairyGen
☆ Video Virtual Try-on with Conditional Diffusion Transformer Inpainter
Video virtual try-on aims to naturally fit a garment to a target person in
consecutive video frames. It is a challenging task: on the one hand, the output
video should have good spatial-temporal consistency; on the other hand, the
details of the given garment need to be preserved well in all the frames.
Naively applying image-based try-on methods frame by frame yields poor results
due to severe inconsistency. The few recent diffusion-based video try-on
methods converge on a similar solution: inserting temporal attention into an
image-based try-on model to adapt it to the video try-on task. This has shown
improvements, but inconsistency problems remain. In this paper, we propose
ViTI (Video Try-on Inpainter), which, unlike previous methods, formulates and
implements video virtual try-on as a conditional video inpainting task. In
this way, we start with a video generation problem
instead of an image-based try-on problem, which from the beginning has a better
spatial-temporal consistency. Specifically, at first we build a video
inpainting framework based on Diffusion Transformer with full 3D
spatial-temporal attention, and then we progressively adapt it for video
garment inpainting, with a collection of masking strategies and multi-stage
training. After these steps, the model can inpaint the masked garment area with
appropriate garment pixels according to the prompt with good spatial-temporal
consistency. Finally, as in other try-on methods, a garment condition is added to
the model to make sure the inpainted garment appearance and details are as
expected. Both quantitative and qualitative experimental results show that ViTI
is superior to previous works.
comment: 10 pages, 6 figures
☆ DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic ICCV 2025
Real-world object detection systems, such as those in autonomous driving and
surveillance, must continuously learn new object categories and simultaneously
adapt to changing environmental conditions. Existing approaches, Class
Incremental Object Detection (CIOD) and Domain Incremental Object Detection
(DIOD), only address one aspect of this challenge. CIOD struggles in unseen
domains, while DIOD suffers from catastrophic forgetting when learning new
classes, limiting their real-world applicability. To overcome these
limitations, we introduce Dual Incremental Object Detection (DuIOD), a more
practical setting that simultaneously handles class and domain shifts in an
exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging
framework that enables stable incremental learning while mitigating sign
conflicts through a novel Directional Consistency Loss. Unlike prior methods,
DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function
as real-time incremental object detectors. To comprehensively evaluate both
retention and adaptation, we introduce the Retention-Adaptability Index (RAI),
which combines the Average Retention Index (Avg RI) for catastrophic forgetting
and the Average Generalization Index for domain adaptability into a common
ground. Extensive experiments on the Pascal Series and Diverse Weather Series
demonstrate DuET's effectiveness, achieving a +13.12% RAI improvement while
preserving 89.3% Avg RI on the Pascal Series (4 tasks), as well as a +11.39%
RAI improvement with 88.57% Avg RI on the Diverse Weather Series (3 tasks),
outperforming existing methods.
comment: Accepted at ICCV 2025
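The task arithmetic and sign conflicts mentioned in the DuET abstract can be illustrated with a generic sketch. It shows standard task-vector merging only (fine-tuned weights minus base weights, summed and added back) and does not implement DuET's Directional Consistency Loss; the function names, the `scale` parameter, and the toy weights are assumptions.

```python
def task_vector(finetuned, base):
    """Task vector: element-wise difference between fine-tuned and base weights."""
    return [f - b for f, b in zip(finetuned, base)]


def sign_conflicts(vectors):
    """Indices where the task vectors disagree in sign (zeros are ignored)."""
    conflicts = []
    for i in range(len(vectors[0])):
        signs = {(v[i] > 0) - (v[i] < 0) for v in vectors} - {0}
        if len(signs) > 1:
            conflicts.append(i)
    return conflicts


def merge(base, vectors, scale=1.0):
    """Add the scaled sum of all task vectors to the base weights."""
    summed = [sum(v[i] for v in vectors) for i in range(len(base))]
    return [b + scale * s for b, s in zip(base, summed)]


# Hypothetical toy weights for a base model and two task-specific models.
base = [1.0, 1.0, 1.0]
tau_a = task_vector([1.5, 0.5, 1.0], base)   # [0.5, -0.5, 0.0]
tau_b = task_vector([2.0, 1.5, 1.0], base)   # [1.0, 0.5, 0.0]
conflicts = sign_conflicts([tau_a, tau_b])
merged = merge(base, [tau_a, tau_b])
```

In this toy case the second parameter is pulled in opposite directions by the two tasks, so the updates cancel after merging. This is the kind of interference that the abstract says the loss is designed to mitigate.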
☆ Temporal Rate Reduction Clustering for Human Motion Segmentation ICCV 2025
Human Motion Segmentation (HMS), which aims to partition videos into
non-overlapping human motions, has attracted increasing research attention
recently. Existing approaches for HMS are mainly dominated by subspace
clustering methods, which are grounded on the assumption that high-dimensional
temporal data align with a Union-of-Subspaces (UoS) distribution. However, the
frames in video capturing complex human motions with cluttered backgrounds may
not align well with the UoS distribution. In this paper, we propose a novel
approach for HMS, named Temporal Rate Reduction Clustering
($\text{TR}^2\text{C}$), which jointly learns structured representations and
affinity to segment the frame sequences in video. Specifically, the structured
representations learned by $\text{TR}^2\text{C}$ remain temporally consistent
and align well with a UoS structure, which is favorable for the HMS task. We
conduct extensive experiments on five benchmark HMS datasets and achieve
state-of-the-art performances with different feature extractors.
comment: The paper is accepted by ICCV 2025. The first two authors contributed
equally
☆ GANet-Seg: Adversarial Learning for Brain Tumor Segmentation with Hybrid Generative Models
This work introduces a novel framework for brain tumor segmentation
leveraging pre-trained GANs and Unet architectures. By combining a global
anomaly detection module with a refined mask generation network, the proposed
model accurately identifies tumor-sensitive regions and iteratively enhances
segmentation precision using adversarial loss constraints. Multi-modal MRI data
and synthetic image augmentation are employed to improve robustness and address
the challenge of limited annotated datasets. Experimental results on the BraTS
dataset demonstrate the effectiveness of the approach, achieving higher
sensitivity and accuracy than the baseline in both lesion-wise Dice and HD95
metrics. This scalable method minimizes the dependency on fully annotated
data, paving the way for practical real-world applications in clinical
settings.
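For readers unfamiliar with the Dice metric named in the GANet-Seg abstract, a minimal sketch of the plain Dice coefficient on binary masks follows. The lesion-wise variant used on BraTS scores each lesion separately and HD95 is a distance-based metric; neither is shown here. Returning 1.0 when both masks are empty is a common convention assumed for this sketch.

```python
def dice(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 4-pixel masks: one overlapping pixel, three foreground pixels in total.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```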
☆ DiMPLe -- Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel
approach to disentangle invariant and spurious features across vision and
language modalities in multi-modal learning. Spurious correlations in visual
data often hinder out-of-distribution (OOD) performance. Unlike prior methods
focusing solely on image features, DiMPLe disentangles features within and
across modalities while maintaining consistent alignment, enabling better
generalization to novel classes and robustness to distribution shifts. Our
method combines three key objectives: (1) mutual information minimization
between invariant and spurious features, (2) spurious feature regularization,
and (3) contrastive learning on invariant features. Extensive experiments
demonstrate that DiMPLe outperforms CoOp-OOD when
averaged across 11 diverse datasets, achieving absolute gains of 15.27 in
base class accuracy and 44.31 in novel class accuracy.
☆ Real-Time ESFP: Estimating, Smoothing, Filtering, and Pose-Mapping
This paper presents ESFP, an end-to-end pipeline that converts monocular RGB
video into executable joint trajectories for a low-cost 4-DoF desktop arm. ESFP
comprises four sequential modules. (1) Estimating: ROMP lifts each frame to a
24-joint 3-D skeleton. (2) Smoothing: the proposed HPSTM, a sequence-to-sequence
Transformer with self-attention, combines long-range temporal context with a
differentiable forward-kinematics decoder, enforcing constant bone lengths and
anatomical plausibility while jointly predicting joint means and full
covariances. (3) Filtering: root-normalized trajectories are variance-weighted
according to HPSTM's uncertainty estimates, suppressing residual noise. (4)
Pose-Mapping: a geometric retargeting layer transforms shoulder-elbow-wrist
triples into the uArm's polar workspace, preserving wrist orientation.
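The ESFP abstract does not specify the exact form of its variance-weighted filtering step. One plausible reading is inverse-variance weighting over a sliding window, sketched below for a single joint coordinate. The function name, the window size, and the weighting rule are assumptions made for illustration.

```python
def variance_weighted_filter(values, variances, half_window=1):
    """Smooth a 1-D trajectory with inverse-variance weights in a sliding window.

    Frames with high predicted variance (low confidence) contribute less.
    All variances must be strictly positive.
    """
    n = len(values)
    smoothed = []
    for t in range(n):
        lo, hi = max(0, t - half_window), min(n, t + half_window + 1)
        weights = [1.0 / variances[k] for k in range(lo, hi)]
        weighted_sum = sum(w * values[k] for w, k in zip(weights, range(lo, hi)))
        smoothed.append(weighted_sum / sum(weights))
    return smoothed


# Hypothetical trajectory: the middle frame is an outlier with high uncertainty.
result = variance_weighted_filter([0.0, 10.0, 0.0], [1.0, 100.0, 1.0])
```

In this toy example the uncertain middle sample is pulled strongly toward its confident neighbours.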
☆ ReME: A Data-Centric Framework for Training-Free Open-Vocabulary Segmentation ICCV 2025
Training-free open-vocabulary semantic segmentation (OVS) aims to segment
images given a set of arbitrary textual categories without costly model
fine-tuning. Existing solutions often explore attention mechanisms of
pre-trained models, such as CLIP, or generate synthetic data and design complex
retrieval processes to perform OVS. However, their performance is limited by
the capability of reliant models or the suboptimal quality of reference sets.
In this work, we investigate the largely overlooked data quality problem for
this challenging dense scene understanding task, and identify that a
high-quality reference set can significantly benefit training-free OVS. With
this observation, we introduce a data-quality-oriented framework, comprising a
data pipeline to construct a reference set with well-paired segment-text
embeddings and a simple similarity-based retrieval to unveil the essential
effect of data. Remarkably, extensive evaluations on ten benchmark datasets
demonstrate that our method outperforms all existing training-free OVS
approaches, highlighting the importance of data-centric design for advancing
OVS without training. Our code is available at https://github.com/xiweix/ReME .
comment: Accepted to ICCV 2025
☆ BitMark for Infinity: Watermarking Bitwise Autoregressive Image Generative Models
State-of-the-art text-to-image models like Infinity generate photorealistic
images at an unprecedented speed. These models operate in a bitwise
autoregressive manner over a discrete set of tokens that is practically
infinite in size. However, their impressive generative power comes with a
growing risk: as their outputs increasingly populate the Internet, they are
likely to be scraped and reused as training data, potentially by the very same
models. This phenomenon has been shown to lead to model collapse, where
repeated training on generated content, especially from the models' own
previous versions, causes a gradual degradation in performance. A promising
mitigation strategy is watermarking, which embeds human-imperceptible yet
detectable signals into generated images, enabling the identification of
generated content. In this work, we introduce BitMark, a robust bitwise
watermarking framework for Infinity. Our method embeds a watermark directly at
the bit level of the token stream across multiple scales (also referred to as
resolutions) during Infinity's image generation process. Our bitwise watermark
subtly influences the bits to preserve visual fidelity and generation speed
while remaining robust against a spectrum of removal techniques. Furthermore,
it exhibits high radioactivity, i.e., when watermarked generated images are
used to train another image generative model, this second model's outputs will
also carry the watermark. The radioactive traces remain detectable even when
only fine-tuning diffusion or image autoregressive models on images watermarked
with our BitMark. Overall, our approach provides a principled step toward
preventing model collapse in image generative models by enabling reliable
detection of generated outputs.
☆ MedPrompt: LLM-CNN Fusion with Weight Routing for Medical Image Segmentation and Classification
Current medical image analysis systems are typically task-specific, requiring
separate models for classification and segmentation, and lack the flexibility
to support user-defined workflows. To address these challenges, we introduce
MedPrompt, a unified framework that combines a few-shot prompted Large Language
Model (Llama-4-17B) for high-level task planning with a modular Convolutional
Neural Network (DeepFusionLab) for low-level image processing. The LLM
interprets user instructions and generates structured output to dynamically
route task-specific pretrained weights. This weight routing approach avoids
retraining the entire framework when adding new tasks: only task-specific
weights are required, enhancing scalability and deployment. We evaluated
MedPrompt across 19 public datasets, covering 12 tasks spanning 5 imaging
modalities. The system achieves a 97% end-to-end correctness in interpreting
and executing prompt-driven instructions, with an average inference latency of
2.5 seconds, making it suitable for near real-time applications. DeepFusionLab
achieves competitive segmentation accuracy (e.g., Dice 0.9856 on lungs) and
strong classification performance (F1 0.9744 on tuberculosis). Overall,
MedPrompt enables scalable, prompt-driven medical imaging by combining the
interpretability of LLMs with the efficiency of modular CNNs.
comment: 40 pages, 8 Tables, 9 Figures
☆ Unlocking Constraints: Source-Free Occlusion-Aware Seamless Segmentation ICCV 2025
Panoramic image processing is essential for omni-context perception, yet
faces constraints like distortions, perspective occlusions, and limited
annotations. Previous unsupervised domain adaptation methods transfer knowledge
from labeled pinhole data to unlabeled panoramic images, but they require
access to source pinhole data. To address these constraints, we introduce a more practical
task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and
propose its first solution, called UNconstrained Learning Omni-Context
Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni
Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting
without relying on source data or target labels, this framework enhances models
to achieve segmentation with 360° viewpoint coverage and occlusion-aware
reasoning. Furthermore, we benchmark the proposed SFOASS task through both
real-to-real and synthetic-to-real adaptation settings. Experimental results
show that our source-free method achieves performance comparable to
source-dependent methods, yielding state-of-the-art scores of 10.9 in mAAP and
11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the
source-only method. All data and code will be made publicly available at
https://github.com/yihong-97/UNLOCK.
comment: Accepted to ICCV 2025. All data and code will be made publicly
available at https://github.com/yihong-97/UNLOCK
☆ GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding
Sequential grounding in 3D point clouds (SG3D) refers to locating sequences
of objects by following text instructions for a daily activity with detailed
steps. Current 3D visual grounding (3DVG) methods treat text instructions with
multiple steps as a whole, without extracting useful temporal information from
each step. However, the instructions in SG3D often contain pronouns such as
"it", "here" and "the same" to make language expressions concise. This requires
grounding methods to understand the context and retrieve relevant information
from previous steps to correctly locate object sequences. Due to the lack of an
effective module for collecting related historical information,
state-of-the-art 3DVG methods face significant challenges in adapting to the
SG3D task. To fill this gap, we propose GroundFlow -- a plug-in module for
temporal reasoning on 3D point cloud sequential grounding. Firstly, we
demonstrate that integrating GroundFlow improves the task accuracy of 3DVG
baseline methods by a large margin (+7.5% and +10.2%) in the SG3D benchmark,
even outperforming a 3D large language model pre-trained on various datasets.
Furthermore, we selectively extract both short-term and long-term step
information based on its relevance to the current instruction, enabling
GroundFlow to take a comprehensive view of historical information and maintain
its temporal understanding advantage as step counts increase. Overall, our work
introduces temporal reasoning capabilities to existing 3DVG models and achieves
state-of-the-art performance in the SG3D benchmark across five datasets.
☆ Out-of-Distribution Semantic Occupancy Prediction
Yuheng Zhang, Mengfei Duan, Kunyu Peng, Yuhang Wang, Ruiping Liu, Fei Teng, Kai Luo, Zhiyong Li, Kailun Yang
3D Semantic Occupancy Prediction is crucial for autonomous driving, providing
a dense, semantically rich environmental representation. However, existing
methods focus on in-distribution scenes, making them susceptible to
Out-of-Distribution (OoD) objects and long-tail distributions, which increases
the risk of undetected anomalies and misinterpretations, posing safety hazards.
To address these challenges, we introduce Out-of-Distribution Semantic
Occupancy Prediction, targeting OoD detection in 3D voxel space. To fill the
gaps in the dataset, we propose a Synthetic Anomaly Integration Pipeline that
injects synthetic anomalies while preserving realistic spatial and occlusion
patterns, enabling the creation of two datasets: VAA-KITTI and VAA-KITTI-360.
We introduce OccOoD, a novel framework integrating OoD detection into 3D
semantic occupancy prediction, with Voxel-BEV Progressive Fusion (VBPF)
leveraging an RWKV-based branch to enhance OoD detection via geometry-semantic
fusion. Experimental results demonstrate that OccOoD achieves state-of-the-art
OoD detection with an AuROC of 67.34% and an AuPRCr of 29.21% within a 1.2m
region, while maintaining competitive occupancy prediction performance. The
established datasets and source code will be made publicly available at
https://github.com/7uHeng/OccOoD.
comment: The established datasets and source code will be made publicly
available at https://github.com/7uHeng/OccOoD
☆ Task-Aware KV Compression For Cost-Effective Long Video Understanding
Minghao Qin, Yan Shu, Peitian Zhang, Kun Lun, Huaying Yuan, Juenjie Zhou, Shitao Xiao, Bo Zhao, Zheng Liu
Long-video understanding (LVU) remains a severe challenge for existing
multimodal large language models (MLLMs), primarily due to the prohibitive
computational cost. Recent approaches have explored KV compression to mitigate
this issue, but they often suffer from significant information loss at high
compression ratios. In this paper, we introduce Video-X^2L, which flexibly
preserves critical video information for each LVU task. Video-X^2L involves two
key operations. The first one is called bi-level KV compression. During the
MLLM's pre-filling stage, Video-X^2L generates two types of compressed KVs:
low-compression KVs (L-KVs) to capture fine-grained video details and
high-compression KVs (H-KVs) to offer compact video representations. The second
one is called selective KV re-loading. During the MLLM's decoding stage,
Video-X^2L selectively re-loads L-KVs for the most critical video chunks while
using H-KVs for other less important ones. This allows the MLLM to fully
utilize task-specific information while maintaining the overall compactness.
Video-X^2L is simple yet effective: it is free from additional training and
directly compatible with existing KV-compressible MLLMs. We evaluate Video-X^2L
with a variety of popular LVU benchmarks, including VideoMME, MLVU,
LongVideoBench, and VNBench. Our experimental results show that Video-X^2L
outperforms existing KV-compression methods by a large margin while
substantially saving the computation cost.
comment: 14 pages, 3 figures, 6 tables
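The selective KV re-loading step of Video-X^2L can be sketched as a per-chunk choice between two caches. The sketch assumes each chunk already has a relevance score for the current task; how the paper computes that score is not stated in the abstract, so `relevance`, `top_k`, and the string placeholders for the caches are hypothetical.

```python
def select_kvs(low_kvs, high_kvs, relevance, top_k):
    """Pick low-compression KVs for the top_k most relevant chunks, else high-compression."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    critical = set(ranked[:top_k])
    return [low_kvs[i] if i in critical else high_kvs[i]
            for i in range(len(relevance))]


# Placeholder caches for four video chunks; real KVs would be tensors.
low = ["L0", "L1", "L2", "L3"]    # fine-grained, larger
high = ["H0", "H1", "H2", "H3"]   # compact, lossy
chosen = select_kvs(low, high, relevance=[0.1, 0.9, 0.3, 0.7], top_k=2)
```

Only the two highest-scoring chunks are served from the detailed cache, so the overall context stays compact.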
☆ Uncover Treasures in DCT: Advancing JPEG Quality Enhancement by Exploiting Latent Correlations
Joint Photographic Experts Group (JPEG) achieves data compression by
quantizing Discrete Cosine Transform (DCT) coefficients, which inevitably
introduces compression artifacts. Most existing JPEG quality enhancement
methods operate in the pixel domain, suffering from the high computational
costs of decoding. Consequently, direct enhancement of JPEG images in the DCT
domain has gained increasing attention. However, current DCT-domain methods
often exhibit limited performance. To address this challenge, we identify two
critical types of correlations within the DCT coefficients of JPEG images.
Building on this insight, we propose an Advanced DCT-domain JPEG Quality
Enhancement (AJQE) method that fully exploits these correlations. The AJQE
method enables the adaptation of numerous well-established pixel-domain models
to the DCT domain, achieving superior performance with reduced computational
complexity. Compared to the pixel-domain counterparts, the DCT-domain models
derived by our method demonstrate a 0.35 dB improvement in PSNR and a 60.5%
increase in enhancement throughput on average.
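The JPEG abstract notes that artifacts come from quantizing DCT coefficients. The sketch below shows this on a 1-D signal with an orthonormal DCT-II and a single uniform quantization step. Real JPEG applies a 2-D DCT to 8x8 blocks with a per-frequency quantization table, so the 1-D transform, the step size, and the sample values are simplifying assumptions.

```python
import math


def _scale(k, n):
    """Orthonormal DCT scaling factor for frequency index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_1d(signal):
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(signal)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal))
            for k in range(n)]


def idct_1d(coeffs):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def quantize(coeffs, step):
    """Round each coefficient to the nearest multiple of step (the lossy part)."""
    return [round(c / step) * step for c in coeffs]


# Hypothetical row of 8 pixel intensities.
signal = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
lossless = idct_1d(dct_1d(signal))
lossy = idct_1d(quantize(dct_1d(signal), step=10.0))
```

Without quantization the round trip recovers the signal up to floating-point error; with it, the reconstruction deviates from the input. That deviation is the compression artifact that quality-enhancement methods try to undo.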
☆ Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition
Learning semantic representations from point sets of 3D object shapes is
often challenged by significant geometric variations, primarily due to
differences in data acquisition methods. Typically, training data is generated
using point simulators, while testing data is collected with distinct 3D
sensors, leading to a simulation-to-reality (Sim2Real) domain gap that limits
the generalization ability of point classifiers. Current unsupervised domain
adaptation (UDA) techniques struggle with this gap, as they often lack robust,
domain-insensitive descriptors capable of capturing global topological
information, resulting in overfitting to the limited semantic patterns of the
source domain. To address this issue, we introduce a novel Topology-Aware
Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach
mitigates the domain gap by leveraging global spatial topology, characterized
by low-level, high-frequency 3D structures, and by modeling the topological
relations of local geometric features through a novel self-supervised learning
task. Additionally, we propose an advanced self-training strategy that combines
cross-domain contrastive learning with self-training, effectively reducing the
impact of noisy pseudo-labels and enhancing the robustness of the adaptation
process. Experimental results on three public Sim2Real benchmarks validate the
effectiveness of our TAM framework, showing consistent improvements over
state-of-the-art methods across all evaluated tasks. The source code of this
work will be available at https://github.com/zou-longkun/TAG.git.
☆ Geometry and Perception Guided Gaussians for Multiview-consistent 3D Generation from a Single Image
Generating realistic 3D objects from single-view images requires natural
appearance, 3D consistency, and the ability to capture multiple plausible
interpretations of unseen regions. Existing approaches often rely on
fine-tuning pretrained 2D diffusion models or directly generating 3D
information through fast network inference or 3D Gaussian Splatting, but their
results generally suffer from poor multiview consistency and lack geometric
detail. To tackle these issues, we present a novel method that seamlessly
integrates geometry and perception priors without requiring additional model
training to reconstruct detailed 3D objects from a single image. Specifically,
we train three different Gaussian branches initialized from the geometry prior,
perception prior and Gaussian noise, respectively. The geometry prior captures
the rough 3D shapes, while the perception prior utilizes the 2D pretrained
diffusion model to enhance multiview information. Subsequently, we refine 3D
Gaussian branches through mutual interaction between geometry and perception
priors, further enhanced by a reprojection-based strategy that enforces depth
consistency. Experiments show that our method yields higher-fidelity
reconstructions and outperforms existing methods on novel view synthesis and 3D
reconstruction, demonstrating robust and consistent 3D object generation.
comment: 10 pages, 5 figures
☆ Robust Deep Learning for Myocardial Scar Segmentation in Cardiac MRI with Noisy Labels MICCAI 2025
Aida Moafi, Danial Moafi, Evgeny M. Mirkes, Gerry P. McCann, Abbas S. Alatrany, Jayanth R. Arnold, Mostafa Mehdipour Ghazi
The accurate segmentation of myocardial scars from cardiac MRI is essential
for clinical assessment and treatment planning. In this study, we propose a
robust deep-learning pipeline for fully automated myocardial scar detection and
segmentation by fine-tuning state-of-the-art models. The method explicitly
addresses challenges of label noise from semi-automatic annotations, data
heterogeneity, and class imbalance through the use of Kullback-Leibler loss and
extensive data augmentation. We evaluate the model's performance on both acute
and chronic cases and demonstrate its ability to produce accurate and smooth
segmentations despite noisy labels. In particular, our approach outperforms
state-of-the-art models like nnU-Net and shows strong generalizability in an
out-of-distribution test set, highlighting its robustness across various
imaging conditions and clinical tasks. These results establish a reliable
foundation for automated myocardial scar quantification and support the broader
clinical adoption of deep learning in cardiac imaging.
comment: MICCAI 2025
☆ Tree-based Semantic Losses: Application to Sparsely-supervised Large Multi-class Hyperspectral Segmentation
Hyperspectral imaging (HSI) shows great promise for surgical applications,
offering detailed insights into biological tissue differences beyond what the
naked eye can perceive. Refined labelling efforts are underway to train vision
systems to distinguish large numbers of subtly varying classes. However,
commonly used learning methods for biomedical segmentation tasks penalise all
errors equivalently and thus fail to exploit any inter-class semantics in the
label space. In this work, we introduce two tree-based semantic loss functions
which take advantage of a hierarchical organisation of the labels. We further
incorporate our losses in a recently proposed approach for training with
sparse, background-free annotations. Extensive experiments demonstrate that our
proposed method reaches state-of-the-art performance on a sparsely annotated
HSI dataset comprising $107$ classes organised in a clinically-defined semantic
tree structure. Furthermore, our method enables effective detection of
out-of-distribution (OOD) pixels without compromising segmentation performance
on in-distribution (ID) pixels.
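To illustrate how a loss can exploit a label hierarchy instead of penalising all errors equally, here is a minimal sketch. It is not the authors' loss: the two-level tree, the class names, and the `parent_weight` parameter are all hypothetical. The loss adds a cross-entropy term at the leaf level and another at the parent level, where a parent's probability is the sum of its leaves' probabilities.

```python
import math

# Hypothetical two-level label tree: leaf class -> parent group.
PARENT = {
    "artery": "vessel",
    "vein": "vessel",
    "fat": "connective",
    "fascia": "connective",
}

def hierarchical_nll(leaf_probs, true_leaf, parent=PARENT, parent_weight=1.0):
    """Negative log-likelihood summed over the leaf level and the parent level."""
    leaf_term = -math.log(leaf_probs[true_leaf])
    true_parent = parent[true_leaf]
    # Probability of a parent node = total probability of the leaves below it.
    parent_prob = sum(p for leaf, p in leaf_probs.items() if parent[leaf] == true_parent)
    parent_term = -math.log(parent_prob)
    return leaf_term + parent_weight * parent_term

# Same probability on the true leaf, but the error mass lands in different groups.
near_miss = {"artery": 0.1, "vein": 0.7, "fat": 0.1, "fascia": 0.1}
far_miss = {"artery": 0.1, "vein": 0.1, "fat": 0.7, "fascia": 0.1}
```

With the true class `"artery"`, `near_miss` (confused with a sibling) receives a smaller loss than `far_miss` (confused with a class in another group), even though both assign the same probability to the true leaf.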
♻ ☆ Towards Scalable and Generalizable Earth Observation Data Mining via Foundation Model Composition
Foundation models are rapidly transforming Earth Observation data mining by
enabling generalizable and scalable solutions for key tasks such as scene
classification and semantic segmentation. While most efforts in the geospatial
domain have focused on developing large models trained from scratch using
massive Earth Observation datasets, an alternative strategy that remains
underexplored is the reuse and combination of existing pretrained models. In
this study, we investigate whether foundation models pretrained on remote
sensing and general vision datasets can be effectively combined to improve
performance across a diverse set of key Earth Observation tasks. Using the
GEO-Bench benchmark, we evaluate several prominent models, including Prithvi,
Hiera, and DOFA, on eleven datasets covering a range of spatial resolutions,
sensor modalities, and task types. The results show that feature-level
ensembling of smaller pretrained models can match or exceed the performance of
much larger models, while requiring less training time and computational
resources. Moreover, the study highlights the potential of applying knowledge
distillation to transfer the strengths of ensembles into more compact models,
offering a practical path for deploying foundation models in real-world Earth
Observation applications.
♻ ☆ Learning to Be a Transformer to Pinpoint Anomalies
To efficiently deploy strong, often pre-trained feature extractors, recent
Industrial Anomaly Detection and Segmentation (IADS) methods process
low-resolution images, e.g., 224x224 pixels, obtained by downsampling the
original input images. However, while numerous industrial applications demand
the identification of both large and small defects, downsampling the input
image to a low resolution may hinder a method's ability to pinpoint tiny
anomalies. We propose a novel Teacher--Student paradigm to leverage strong
pre-trained features while processing high-resolution input images very
efficiently. The core idea is to train two shallow MLPs (the Students) on
nominal images so as to mimic the mappings between the patch embeddings induced
by the self-attention layers of a frozen vision Transformer (the Teacher).
Indeed, learning these mappings sets forth a challenging pretext task that
small-capacity models are unlikely to accomplish on out-of-distribution data
such as anomalous images. Our method can spot anomalies from high-resolution
images and runs much faster than competitors, achieving state-of-the-art
performance on MVTec AD and the best segmentation results on VisA. We also
propose novel evaluation metrics to capture robustness to defect size, i.e.,
the ability to preserve good localisation from large anomalies to tiny ones.
Evaluating our method with these metrics as well reveals its clearly superior
performance.
comment: Accepted at IEEE Access
♻ ☆ CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences ICCV2025
We introduce Canonical Consolidation Fields (CanFields). This novel method
interpolates arbitrary-length sequences of independently sampled 3D point
clouds into a unified, continuous, and coherent deforming shape. Unlike prior
methods that oversmooth geometry or produce topological and geometric
artifacts, CanFields optimizes fine-detailed geometry and deformation jointly
in an unsupervised fitting with two novel bespoke modules. First, we introduce
a dynamic consolidator module that adjusts the input and assigns confidence
scores, balancing the optimization of the canonical shape and its motion.
Second, we represent the motion as a diffeomorphic flow parameterized by a
smooth velocity field. We validate the robustness and accuracy of CanFields on
more than 50 diverse sequences, demonstrating its superior performance even with
missing regions, noisy raw scans, and sparse data. Our project page is at:
https://wangmiaowei.github.io/CanFields.github.io/.
comment: ICCV2025 Accepted
♻ ☆ SimWorld: A Unified Benchmark for Simulator-Conditioned Scene Generation via World Model
With the rapid advancement of autonomous driving technology, a lack of data
has become a major obstacle to enhancing perception model accuracy. Researchers
are now exploring controllable data generation using world models to diversify
datasets. However, previous work has been limited to studying image generation
quality on specific public datasets. There is still relatively little research
on how to build data generation engines for real-world application scenes to
achieve large-scale data generation for challenging scenes. In this paper, we
propose a simulator-conditioned scene generation engine based on a world model.
By constructing a simulation system consistent with real-world scenes, we can
collect simulation data and labels for any scene, which serve as the conditions
for data generation in the world model. The result is a novel data generation
pipeline that combines the powerful scene simulation capabilities of the
simulation engine with the robust data generation capabilities of the world
model. In addition, we provide a benchmark with proportionally constructed
virtual and real data for exploring the capabilities of world models in
real-world scenes. Quantitative results show that the generated images
significantly improve the performance of downstream perception models. Finally,
we explore the generative performance of the world model in urban autonomous
driving scenarios. All the data and code will be available at
https://github.com/Li-Zn-H/SimWorld.
comment: 8 pages, 4 figures
♻ ☆ Chain-of-Sketch: Enabling Global Visual Reasoning
Modern vision models have achieved remarkable success in benchmarks where
local features provide critical information about the target. There is now a
growing interest in tackling tasks requiring more global reasoning, where local
features do not provide significant information. Minsky and Papert put forward
such tasks in 1969 with their connectivity study, exposing the limitations of
the perceptron model. In this paper, we introduce an expanded set of global
visual datasets involving graphs, strings, mazes, and image grids. We show that
large vision models still struggle to learn these tasks efficiently. Similarly,
state-of-the-art multi-modal LLMs perform poorly on these datasets. We explain
this learning inefficiency by means of the 'globality degree' measure. To
mitigate this, we propose a method called chain-of-sketch (CoS). Similar to the
chain-of-thought and scratchpad techniques used in language models, CoS breaks
the original task into intermediate visual steps to help learn a complex task.
In addition, we show that not all CoS strategies perform equally well. Our key
insight is to impose a Markovian structure on the CoS frames. This leads to the
introduction of 'inductive CoS' which achieves better out-of-distribution
generalization and performs well even with smaller models compared to
non-inductive variants.
comment: additional experiments added, title changed from "Visual Scratchpads:
Enabling Global Reasoning in Vision" to "Chain-of-Sketch: Enabling Global
Visual Reasoning"
♻ ☆ QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning ICCV 2025
The practical deployment of diffusion models is still hindered by the high
memory and computational overhead. Although quantization paves a way for model
compression and acceleration, existing methods face challenges in achieving
low-bit quantization efficiently. In this paper, we identify imbalanced
activation distributions as a primary source of quantization difficulty, and
propose to adjust these distributions through weight finetuning to be more
quantization-friendly. We provide both theoretical and empirical evidence
supporting finetuning as a practical and reliable solution. Building on this
approach, we further distinguish two critical types of quantized layers: those
responsible for retaining essential temporal information and those particularly
sensitive to bit-width reduction. By selectively finetuning these layers under
both local and global supervision, we mitigate performance degradation while
enhancing quantization efficiency. Our method demonstrates its efficacy across
three high-resolution image generation tasks, obtaining state-of-the-art
performance across multiple bit-width settings.
comment: ICCV 2025. Code is available at
https://github.com/hatchetProject/QuEST
♻ ☆ AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration ICCV 2025
We present AnyCalib, a method for calibrating the intrinsic parameters of a
camera from a single in-the-wild image, that is agnostic to the camera model.
Current methods are predominantly tailored to specific camera models and/or
require extrinsic cues, such as the direction of gravity, to be visible in the
image. In contrast, we argue that the perspective and distortion cues inherent
in images are sufficient for model-agnostic camera calibration. To demonstrate
this, we frame the calibration process as the regression of the rays
corresponding to each pixel. We show, for the first time, that this
intermediate representation allows for a closed-form recovery of the intrinsics
for a wide range of camera models, including but not limited to: pinhole,
Brown-Conrady and Kannala-Brandt. Our approach also applies to edited --
cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib
consistently outperforms alternative methods, including 3D foundation models,
despite being trained on orders of magnitude less data. Code is available at
https://github.com/javrtg/AnyCalib.
comment: Accepted to ICCV 2025
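As a worked illustration of why per-pixel rays permit closed-form recovery of intrinsics, consider only the pinhole case. A ray `(x, y, z)` through pixel `(u, v)` satisfies `u = fx * x/z + cx` and `v = fy * y/z + cy`, so each axis is an ordinary least-squares line fit. The sketch below assumes the rays are already known (in the abstract they are regressed by a network); the function names are hypothetical and this is not the authors' solver.

```python
def fit_line(xs, ys):
    """Least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def pinhole_from_rays(pixels, rays):
    """Recover (fx, fy, cx, cy) from pixel coordinates and their ray directions.

    pixels: list of (u, v); rays: list of (x, y, z) with z > 0, any scale.
    """
    x_over_z = [r[0] / r[2] for r in rays]
    y_over_z = [r[1] / r[2] for r in rays]
    fx, cx = fit_line(x_over_z, [p[0] for p in pixels])
    fy, cy = fit_line(y_over_z, [p[1] for p in pixels])
    return fx, fy, cx, cy
```

Dividing by `z` makes the fit independent of each ray's length, so unit-norm and unnormalised rays give the same result. Distortion models such as Brown-Conrady or Kannala-Brandt need additional terms that this sketch does not cover.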
♻ ☆ EgoM2P: Egocentric Multimodal Multitask Pretraining ICCV 2025
Understanding multimodal signals in egocentric vision, such as RGB video,
depth, camera poses, and gaze, is essential for applications in augmented
reality, robotics, and human-computer interaction, enabling systems to better
interpret the camera wearer's actions, intentions, and surrounding environment.
However, building large-scale egocentric multimodal and multitask models
presents unique challenges. Egocentric data are inherently heterogeneous, with
large variations in modality coverage across devices and settings. Generating
pseudo-labels for missing modalities, such as gaze or head-mounted camera
trajectories, is often infeasible, making standard supervised learning
approaches difficult to scale. Furthermore, dynamic camera motion and the
complex temporal and spatial structure of first-person video pose additional
challenges for the direct application of existing multimodal foundation models.
To address these challenges, we introduce a set of efficient temporal
tokenizers and propose EgoM2P, a masked modeling framework that learns from
temporally-aware multimodal tokens to train a large, general-purpose model for
egocentric 4D understanding. This unified design supports multitasking across
diverse egocentric perception and synthesis tasks, including gaze prediction,
egocentric camera tracking, and monocular depth estimation from egocentric
video, and also serves as a generative model for conditional egocentric video
synthesis. Across these tasks, EgoM2P matches or outperforms specialist models
while being an order of magnitude faster. We will fully open-source EgoM2P to
support the community and advance egocentric vision research. Project page:
https://egom2p.github.io/.
comment: Accepted by ICCV 2025
♻ ☆ Fake it till You Make it: Reward Modeling as Discriminative Prediction
An effective reward model plays a pivotal role in reinforcement learning for
post-training enhancement of visual generative models. However, current
approaches to reward modeling suffer from implementation complexity due to
their reliance on extensive human-annotated preference data or meticulously
engineered quality dimensions that are often incomplete and
engineering-intensive. Inspired by adversarial training in generative
adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward
modeling framework that eliminates manual preference annotation and explicit
quality dimension engineering. Our method trains the reward model through
discrimination between a small set of representative, unpaired target
samples (denoted as Preference Proxy Data) and model-generated ordinary outputs,
requiring only a few hundred target samples. Comprehensive experiments
demonstrate the effectiveness of GAN-RM across multiple key applications,
including test-time scaling implemented as Best-of-N sample filtering and
post-training approaches such as Supervised Fine-Tuning (SFT) and Direct
Preference Optimization (DPO). Code and data will be released at
https://github.com/Visualignment/GAN-RM.
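Best-of-N sample filtering, as named in the abstract above, can be sketched in a few lines: generate N candidates, score each with a reward model, and keep the highest-scoring one. In this sketch `reward_fn` is a hypothetical stand-in for a trained reward model, and `toy_reward` with its `"score"` field is an assumption made purely for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Return the candidate with the highest reward (the first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=reward_fn)

def toy_reward(sample: dict) -> float:
    # Hypothetical reward: reads a precomputed score instead of running a model.
    return sample["score"]

samples = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.5},
]
chosen = best_of_n(samples, toy_reward)
```

The generator itself is untouched; all extra cost goes into sampling N candidates and scoring them, which is why this counts as test-time scaling.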
♻ ☆ Materialist: Physically Based Editing Using Single-Image Inverse Rendering
Lezhong Wang, Duc Minh Tran, Ruiqi Cui, Thomson TG, Anders Bjorholm Dahl, Siavash Arjomand Bigdeli, Jeppe Revall Frisvad, Manmohan Chandraker
Achieving physically consistent image editing remains a significant challenge
in computer vision. Existing image editing methods typically rely on neural
networks, which struggle to accurately handle shadows and refractions.
Conversely, physics-based inverse rendering often requires multi-view
optimization, limiting its practicality in single-image scenarios. In this
paper, we propose Materialist, a method combining a learning-based approach
with physically based progressive differentiable rendering. Given an image, our
method leverages neural networks to predict initial material properties.
Progressive differentiable rendering is then used to optimize the environment
map and refine the material properties with the goal of closely matching the
rendered result to the input image. Our approach enables a range of
applications, including material editing, object insertion, and relighting,
while also introducing an effective method for editing material transparency
without requiring full scene geometry. Furthermore, our envmap estimation
method achieves state-of-the-art performance, further enhancing the accuracy of
image editing tasks. Experiments demonstrate strong performance
across synthetic and real-world datasets, excelling even on challenging
out-of-domain images. Project website:
https://lez-s.github.io/materialist_project/
comment: Add acknowledgements, more authors and more results. Project website:
https://lez-s.github.io/materialist_project/
♻ ☆ DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection ICCV 2025
Francisco Caetano, Christiaan Viviers, Luis A. Zavala-Mondragón, Peter H. N. de With, Fons van der Sommen
Out-of-distribution (OOD) detection holds significant importance across many
applications. While semantic and domain-shift OOD problems are well-studied,
this work focuses on covariate shifts - subtle variations in the data
distribution that can degrade machine learning performance. We hypothesize that
detecting these subtle shifts can improve our understanding of in-distribution
boundaries, ultimately improving OOD detection. In adversarial discriminators
trained with Batch Normalization (BN), real and adversarial samples form
distinct domains with unique batch statistics - a property we exploit for OOD
detection. We introduce DisCoPatch, an unsupervised Adversarial Variational
Autoencoder (VAE) framework that harnesses this mechanism. During inference,
batches consist of patches from the same image, ensuring a consistent data
distribution that allows the model to rely on batch statistics. DisCoPatch uses
the VAE's suboptimal outputs (generated and reconstructed) as negative samples
to train the discriminator, thereby improving its ability to delineate the
boundary between in-distribution samples and covariate shifts. By tightening
this boundary, DisCoPatch achieves state-of-the-art results in public OOD
detection benchmarks. The proposed model not only excels in detecting covariate
shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior
methods on public Near-OOD (95.0%) benchmarks. With a compact model size of
25MB, it achieves high OOD detection performance at notably lower latency than
existing methods, making it an efficient and practical solution for real-world
OOD detection applications. The code is publicly available.
comment: ICCV 2025
♻ ☆ Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling ICCV 2025
Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Wenjing Yang, Jing Zhang
Masked Image Modeling (MIM) has become an essential method for building
foundational visual models in remote sensing (RS). However, the limitations in
size and diversity of existing RS datasets restrict the ability of MIM methods
to learn generalizable representations. Additionally, conventional MIM
techniques, which require reconstructing all tokens, introduce unnecessary
computational overhead. To address these issues, we present a new pre-training
pipeline for RS models, featuring the creation of a large-scale RS dataset and
an efficient MIM approach. We curated a high-quality dataset named
OpticalRS-13M by collecting publicly available RS datasets and
processing them through exclusion, slicing, and deduplication. OpticalRS-13M
comprises 13 million optical images covering various RS tasks, such as object
detection and pixel segmentation. To enhance efficiency, we propose
SelectiveMAE, a pre-training method that dynamically encodes and
reconstructs semantically rich patch tokens, thereby reducing the
inefficiencies of traditional MIM models caused by redundant background pixels
in RS images. Extensive experiments show that OpticalRS-13M significantly
improves classification, detection, and segmentation performance, while
SelectiveMAE increases training efficiency by more than 2$\times$. This
highlights the effectiveness and scalability of our pipeline in developing RS
foundational models. The dataset, source code, and trained models will be
released at https://github.com/MiliLab/SelectiveMAE.
comment: ICCV 2025
♻ ☆ OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation
Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, Hai-Bao Chen
Text-to-image (T2I) models have garnered significant attention for generating
high-quality images aligned with text prompts. However, rapid T2I model
advancements reveal limitations in early benchmarks, which lack comprehensive
evaluation of aspects such as reasoning, text rendering, and style. Notably,
recent state-of-the-art models, with their rich knowledge
modeling capabilities, show promising results on the image generation problems
requiring strong reasoning ability, yet existing evaluation systems have not
adequately addressed this frontier. To systematically address these gaps, we
introduce OneIG-Bench, a meticulously designed comprehensive benchmark
framework for fine-grained evaluation of T2I models across multiple dimensions,
including prompt-image alignment, text rendering precision, reasoning-generated
content, stylization, and diversity. By structuring the evaluation, this
benchmark enables in-depth analysis of model performance, helping researchers
and practitioners pinpoint strengths and bottlenecks in the full pipeline of
image generation. Specifically, OneIG-Bench enables flexible evaluation by
allowing users to focus on a particular evaluation subset. Instead of
generating images for the entire set of prompts, users can generate images only
for the prompts associated with the selected dimension and complete the
corresponding evaluation accordingly. Our codebase and dataset are now publicly
available to facilitate reproducible evaluation studies and cross-model
comparisons within the T2I research community.
♻ ☆ Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation
We introduce a diffusion-based framework that performs aligned novel view
image and geometry generation via a warping-and-inpainting methodology. Unlike
prior methods that require dense posed images or pose-embedded generative
models limited to in-domain views, our method leverages off-the-shelf geometry
predictors to predict partial geometries viewed from reference images, and
formulates novel-view synthesis as an inpainting task for both image and
geometry. To ensure accurate alignment between generated images and geometry,
we propose cross-modal attention instillation, where attention maps from the
image diffusion branch are injected into a parallel geometry diffusion branch
during both training and inference. This multi-task approach achieves
synergistic effects, facilitating geometrically robust image synthesis as well
as well-defined geometry prediction. We further introduce proximity-based mesh
conditioning to integrate depth and normal cues, interpolating between point
cloud and preventing erroneously predicted geometry from influencing the
generation process. Empirically, our method achieves high-fidelity
extrapolative view synthesis on both image and geometry across a range of
unseen scenes, delivers competitive reconstruction quality under interpolation
settings, and produces geometrically aligned colored point clouds for
comprehensive 3D completion. Project page is available at
https://cvlab-kaist.github.io/MoAI.
comment: Project page at https://cvlab-kaist.github.io/MoAI
♻ ☆ STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution
for Embodied AI and Autonomous Driving has become a prevailing trend. While
MLLMs have been extensively studied for visual semantic understanding tasks,
their ability to perform precise and quantitative spatial-temporal
understanding in real-world applications remains largely unexamined, leading to
uncertain prospects. To evaluate models' Spatial-Temporal Intelligence, we
introduce STI-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal
understanding through challenging tasks such as estimating and predicting the
appearance, pose, displacement, and motion of objects. Our benchmark
encompasses a wide range of robot and vehicle operations across desktop,
indoor, and outdoor scenarios. Extensive experiments reveal that
state-of-the-art MLLMs still struggle in real-world spatial-temporal
understanding, especially in tasks requiring precise distance estimation and
motion analysis.
♻ ☆ Consensus-Driven Uncertainty for Robotic Grasping based on RGB Perception IROS 2025
Deep object pose estimators are notoriously overconfident. A grasping agent
that both estimates the 6-DoF pose of a target object and predicts the
uncertainty of its own estimate could avoid task failure by choosing not to act
under high uncertainty. Even though object pose estimation improves and
uncertainty quantification research continues to make strides, few studies have
connected them to the downstream task of robotic grasping. We propose a method
for training lightweight, deep networks to predict whether a grasp guided by an
image-based pose estimate will succeed before that grasp is attempted. We
generate training data for our networks via object pose estimation on real
images and simulated grasping. We also find that, despite high object
variability in grasping trials, networks benefit from training on all objects
jointly, suggesting that a diverse variety of objects can nevertheless
contribute to the same goal.
comment: Accepted to IROS 2025
♻ ☆ Tackling fluffy clouds: robust field boundary delineation across global agricultural landscapes with Sentinel-1 and Sentinel-2 Time Series
Foivos I. Diakogiannis, Zheng-Shu Zhou, Jeff Wang, Gonzalo Mata, Dave Henry, Roger Lawes, Amy Parker, Peter Caccetta, Rodrigo Ibata, Ondrej Hlinka, Jonathan Richetti, Kathryn Batchelor, Chris Herrmann, Andrew Toovey, John Taylor
Accurate delineation of agricultural field boundaries is essential for
effective crop monitoring and resource management. However, competing
methodologies often face significant challenges, particularly in their reliance
on extensive manual efforts for cloud-free data curation and limited
adaptability to diverse global conditions. In this paper, we introduce
PTAViT3D, a deep learning architecture specifically designed for processing
three-dimensional time series of satellite imagery from either Sentinel-1 (S1)
or Sentinel-2 (S2). Additionally, we present PTAViT3D-CA, an extension of the
PTAViT3D model incorporating cross-attention mechanisms to fuse S1 and S2
datasets, enhancing robustness in cloud-contaminated scenarios. The proposed
methods leverage spatio-temporal correlations through a memory-efficient 3D
Vision Transformer architecture, facilitating accurate boundary delineation
directly from raw, cloud-contaminated imagery. We comprehensively validate our
models through extensive testing on various datasets, including Australia's
ePaddocks - CSIRO's national agricultural field boundary product - alongside
public benchmarks Fields-of-the-World, PASTIS, and AI4SmallFarms. Our results
consistently demonstrate state-of-the-art performance, highlighting excellent
global transferability and robustness. Crucially, our approach significantly
simplifies data preparation workflows by reliably processing cloud-affected
imagery, thereby offering strong adaptability across diverse agricultural
environments. Our code and models are publicly available at
https://github.com/feevos/tfcl.
comment: revision 1, under review
♻ ☆ Mr. DETR++: Instructive Multi-Route Training for Detection Transformers with Mixture-of-Experts CVPR 2025
Existing methods enhance the training of detection transformers by
incorporating an auxiliary one-to-many assignment. In this work, we treat the
model as a multi-task framework, simultaneously performing one-to-one and
one-to-many predictions. We investigate the roles of each component in the
transformer decoder across these two training targets, including
self-attention, cross-attention, and feed-forward network. Our empirical
results demonstrate that any independent component in the decoder can
effectively learn both targets simultaneously, even when other components are
shared. This finding leads us to propose a multi-route training mechanism,
featuring a primary route for one-to-one prediction and two auxiliary training
routes for one-to-many prediction. We propose a novel instructive
self-attention mechanism, integrated into the first auxiliary route, which
dynamically and flexibly guides object queries for one-to-many prediction. For
the second auxiliary route, we introduce a route-aware Mixture-of-Experts (MoE)
to facilitate knowledge sharing while mitigating potential conflicts between
routes. Additionally, we apply an MoE to low-scale features in the encoder,
optimizing the balance between efficiency and effectiveness. The auxiliary
routes are discarded during inference. We conduct extensive experiments across
various object detection baselines, achieving consistent improvements as
demonstrated in Fig. 1. Our method is highly flexible and can be readily
adapted to other tasks. To demonstrate its versatility, we conduct experiments
on both instance segmentation and panoptic segmentation, further validating its
effectiveness. Project page: https://visual-ai.github.io/mrdetr/
comment: Under review. Extended version of our CVPR 2025 paper, see
arXiv:2412.10028v3
♻ ☆ PuriDefense: Randomized Local Implicit Adversarial Purification for Defending Black-box Query-based Attacks
Black-box query-based attacks constitute significant threats to Machine
Learning as a Service (MLaaS) systems since they can generate adversarial
examples without accessing the target model's architecture and parameters.
Traditional defense mechanisms, such as adversarial training, gradient masking,
and input transformations, either impose substantial computational costs or
compromise the test accuracy of non-adversarial inputs. To address these
challenges, we propose an efficient defense mechanism, PuriDefense, that
employs random patch-wise purifications with an ensemble of lightweight
purification models at low inference cost. These models leverage the
local implicit function and rebuild the natural image manifold. Our theoretical
analysis suggests that this approach slows down the convergence of query-based
attacks by incorporating randomness into purifications. Extensive experiments
on CIFAR-10 and ImageNet validate the effectiveness of our proposed
purifier-based defense mechanism, demonstrating significant improvements in
robustness against query-based attacks.
♻ ☆ Rethinking Detecting Salient and Camouflaged Objects in Unconstrained Scenes
While the human visual system employs distinct mechanisms to perceive salient
and camouflaged objects, existing models struggle to disentangle these tasks.
Specifically, salient object detection (SOD) models frequently misclassify
camouflaged objects as salient, while camouflaged object detection (COD) models
conversely misinterpret salient objects as camouflaged. We hypothesize that
this can be attributed to two factors: (i) the specific annotation paradigm of
current SOD and COD datasets, and (ii) the lack of explicit attribute
relationship modeling in current models. Prevalent SOD/COD datasets enforce a
mutual exclusivity constraint, assuming scenes contain either salient or
camouflaged objects, which poorly aligns with the real world. Furthermore,
current SOD/COD methods are primarily designed for these highly constrained
datasets and lack explicit modeling of the relationship between salient and
camouflaged objects. In this paper, to promote the development of unconstrained
salient and camouflaged object detection, we construct a large-scale dataset,
USC12K, which features comprehensive labels and four different scenes that
cover all possible logical existence scenarios of both salient and camouflaged
objects. To explicitly model the relationship between salient and camouflaged
objects, we propose a model called USCNet, which introduces two distinct prompt
query mechanisms for modeling inter-sample and intra-sample attribute
relationships. Additionally, to assess the model's ability to distinguish
between salient and camouflaged objects, we design an evaluation metric called
CSCS. The proposed method achieves state-of-the-art performance across all
scenes in various metrics. The code and dataset will be available at
https://github.com/ssecv/USCNet.
comment: 18 pages, 11 figures
♻ ☆ Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework
Open-set Domain Adaptation (OSDA) aims to adapt a model from a labeled source
domain to an unlabeled target domain, where novel classes - also referred to as
target-private unknown classes - are present. Source-free Open-set Domain
Adaptation (SF-OSDA) methods address OSDA without accessing labeled source
data, making them particularly relevant under privacy constraints. However,
SF-OSDA presents significant challenges due to distribution shifts and the
introduction of novel classes. Existing SF-OSDA methods typically rely on
thresholding the prediction entropy of a sample to identify it as either a
known or unknown class, but fail to explicitly learn discriminative features
for the target-private unknown classes. We propose Recall and Refine (RRDA), a
novel SF-OSDA framework designed to address these limitations by explicitly
learning features for target-private unknown classes. RRDA employs a two-stage
process. First, we enhance the model's capacity to recognize unknown classes by
training a target classifier with an additional decision boundary, guided by
synthetic samples generated from target domain features. This enables the
classifier to effectively separate known and unknown classes. Second, we adapt
the entire model to the target domain, addressing both domain shifts and
the distinguishability of unknown classes. Any off-the-shelf source-free domain
adaptation method (e.g. SHOT, AaD) can be seamlessly integrated into our
framework at this stage. Extensive experiments on three benchmark datasets
demonstrate that RRDA significantly outperforms existing SF-OSDA and OSDA
methods.
comment: Accepted at TMLR 2025
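The entropy-thresholding baseline that the abstract above attributes to existing SF-OSDA methods can be sketched as follows. This is a minimal illustration and not the RRDA method: the function names, the normalization by log(K), and the threshold value of 0.5 are assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy of a distribution, divided by log(K) so it lies in [0, 1].

    Requires at least two classes (K >= 2).
    """
    k = len(probs)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def is_unknown(logits, threshold=0.5):
    """Flag a target sample as 'unknown' when its prediction entropy is high.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return normalized_entropy(softmax(logits)) > threshold
```

A confident prediction such as `is_unknown([10.0, 0.0, 0.0])` is treated as a known class, while a flat prediction such as `is_unknown([1.0, 1.0, 1.0])` is flagged as unknown. A rule of this kind only separates samples by confidence and does not learn features for the unknown classes, which is the limitation the abstract points out.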
♻ ☆ Do It Yourself: Learning Semantic Correspondence from Pseudo-Labels SC
Finding correspondences between semantically similar points across images and
object instances is one of the everlasting challenges in computer vision. While
large pre-trained vision models have recently been demonstrated as effective
priors for semantic matching, they still suffer from ambiguities for symmetric
objects or repeated object parts. We propose to improve semantic correspondence
estimation via 3D-aware pseudo-labeling. Specifically, we train an adapter to
refine off-the-shelf features using pseudo-labels obtained via 3D-aware
chaining, filtering wrong labels through relaxed cyclic consistency, and 3D
spherical prototype mapping constraints. While reducing the need for
dataset-specific annotations compared to prior work, we set a new state of the art
on SPair-71k with an absolute gain of over 4%, and of over 7% against methods with similar
supervision requirements. The generality of our proposed approach simplifies
extension of training to other data sources, which we demonstrate in our
experiments.
comment: Project page: https://genintel.github.io/DIY-SC
♻ ☆ Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance
Understanding medical ultrasound imaging remains a long-standing challenge
due to significant visual variability caused by differences in imaging and
acquisition parameters. Recent advancements in large language models (LLMs)
have been used to automatically generate terminology-rich summaries orientated
to clinicians with sufficient physiological knowledge. Nevertheless, the
increasing demand for improved ultrasound interpretability and basic scanning
guidance among non-expert users, e.g., in point-of-care settings, has not yet
been explored. In this study, we first introduce the scene graph (SG) for
ultrasound images to explain image content to ordinary users and provide guidance for
ultrasound scanning. The ultrasound SG is first computed using a
transformer-based one-stage method, eliminating the need for explicit object
detection. To generate a graspable image explanation for ordinary users, the user
query is then used to further refine the abstract SG representation through
LLMs. Additionally, the predicted SG is explored for its potential in guiding
ultrasound scanning toward missing anatomies within the current imaging view,
assisting ordinary users in achieving more standardized and complete anatomical
exploration. The effectiveness of this SG-based image explanation and scanning
guidance has been validated on images from the left and right neck regions,
including the carotid and thyroid, across five volunteers. The results
demonstrate the potential of the method to maximally democratize ultrasound by
enhancing its interpretability and usability for ordinary users.
♻ ☆ Enhancing Dynamic CT Image Reconstruction with Neural Fields and Optical Flow
In this paper, we investigate image reconstruction for dynamic Computed
Tomography. The motion of the target with respect to the measurement
acquisition rate leads to measurements that are highly resolved in time but
highly undersampled in space. Such problems pose a major challenge: not accounting for
the dynamics of the process leads to a poor reconstruction with non-realistic
motion. Variational approaches that penalize time evolution have been proposed
to relate subsequent frames and improve image quality based on classical
grid-based discretizations. Neural fields have emerged as a novel way to
parameterize the quantity of interest using a neural network with a
low-dimensional input, benefiting from being lightweight, continuous, and
biased towards smooth representations. The latter property has been exploited
when solving dynamic inverse problems with neural fields by minimizing a
data-fidelity term only. We investigate and show the benefits of introducing
explicit motion regularizers for dynamic inverse problems based on partial
differential equations, namely, the optical flow equation, for the optimization
of neural fields. We compare it against its unregularized counterpart and show
the improvements in the reconstruction. We also compare neural fields against a
grid-based solver and show that the former outperforms the latter in terms of
PSNR in this task.
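For reference, the optical flow equation named in the abstract above is the standard brightness-constancy transport equation. The block below states it, followed by one possible form of a regularized objective. The second formula is only a sketch: the symbols A_t (forward operator), f_t (measurements), \lambda (weight), \theta and \phi (network parameters) and the choice of norm are assumptions made for illustration, not the paper's exact functional.

```latex
% Optical flow (brightness constancy) equation for a time-dependent image u(x, t)
% transported by a velocity field v(x, t):
\partial_t u(x, t) + v(x, t) \cdot \nabla_x u(x, t) = 0

% Hypothetical regularized objective for a neural field u_\theta and a motion field v_\phi:
\min_{\theta, \phi} \; \sum_t \big\| A_t u_\theta(\cdot, t) - f_t \big\|_2^2
  + \lambda \int_0^T \!\! \int_\Omega
    \big| \partial_t u_\theta + v_\phi \cdot \nabla_x u_\theta \big| \, dx \, dt
```

Setting \lambda = 0 recovers the unregularized, data-fidelity-only counterpart that the abstract uses as a comparison.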
♻ ☆ TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography
Music-driven dance generation has garnered significant attention due to its
wide range of industrial applications, particularly in the creation of group
choreography. During the group dance generation process, however, most existing
methods still face three primary issues: multi-dancer collisions, single-dancer
foot sliding and abrupt swapping in the generation of long group dance. In this
paper, we propose TCDiff++, a music-driven end-to-end framework designed to
generate harmonious group dance. Specifically, to mitigate multi-dancer
collisions, we utilize a dancer positioning embedding to better maintain the
relative positioning among dancers. Additionally, we incorporate a
distance-consistency loss to ensure that inter-dancer distances remain within
plausible ranges. To address the issue of single-dancer foot sliding, we
introduce a swap mode embedding to indicate dancer swapping patterns and design
a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For
long group dance generation, we present a long group diffusion sampling
strategy that reduces abrupt position shifts by injecting positional
information into the noisy input. Furthermore, we integrate a Sequence Decoder
layer to enhance the model's ability to selectively process long sequences.
Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art
performance, particularly in long-duration scenarios, ensuring high-quality and
coherent group dance generation.
♻ ☆ 3D Hierarchical Panoptic Segmentation in Real Orchard Environments Across Different Sensors IROS 2025
Matteo Sodano, Federico Magistri, Elias Marks, Fares Hosn, Aibek Zurbayev, Rodrigo Marcuzzi, Meher V. R. Malladi, Jens Behley, Cyrill Stachniss
Crop yield estimation is a relevant problem in agriculture, because an
accurate yield estimate can support farmers' decisions on harvesting or
precision intervention. Robots can help to automate this process. To do so,
they need to be able to perceive the surrounding environment to identify target
objects such as trees and plants. In this paper, we introduce a novel approach
to address the problem of hierarchical panoptic segmentation of apple orchards
on 3D data from different sensors. Our approach is able to simultaneously
provide semantic segmentation, instance segmentation of trunks and fruits, and
instance segmentation of trees (a trunk with its fruits). This allows us to
identify relevant information such as individual plants, fruits, and trunks,
and capture the relationships among them, for example to precisely estimate the number
of fruits associated with each tree in an orchard. To efficiently evaluate our
approach for hierarchical panoptic segmentation, we provide a dataset designed
specifically for this task. Our dataset is recorded in Bonn, Germany, in a real
apple orchard with a variety of sensors, spanning from a terrestrial laser
scanner to an RGB-D camera mounted on different robot platforms. The
experiments show that our approach surpasses state-of-the-art approaches in 3D
panoptic segmentation in the agricultural domain, while also providing full
hierarchical panoptic segmentation. Our dataset is publicly available at
https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our
approach is available at https://github.com/PRBonn/hapt3D.
comment: Accepted to IROS 2025
♻ ☆ Cell Tracking according to Biological Needs -- Strong Mitosis-aware Multi-Hypothesis Tracker with Aleatoric Uncertainty
Cell tracking and segmentation assist biologists in extracting insights from
large-scale microscopy time-lapse data. Driven by local accuracy metrics,
current tracking approaches often suffer from a lack of long-term consistency
and the ability to reconstruct lineage trees correctly. To address this issue,
we introduce an uncertainty estimation technique for motion estimation
frameworks and extend the multi-hypothesis tracking framework. Our uncertainty
estimation lifts motion representations into probabilistic spatial densities
using problem-specific test-time augmentations. Moreover, we introduce a novel
mitosis-aware assignment problem formulation that allows multi-hypothesis
trackers to model cell splits and to resolve false associations and mitosis
detections based on long-term conflicts. In our framework, explicit biological
knowledge is modeled in assignment costs. We evaluate our approach on nine
competitive datasets and demonstrate that we substantially outperform the
current state of the art on biologically inspired metrics, achieving
improvements by a factor of approximately 6. We also uncover new insights into
the behavior of motion estimation uncertainty.
comment: 13 pages, 4 figures, 4 tables. This work has been accepted to the
IEEE for publication
♻ ☆ SA-Person: Text-Based Person Retrieval with Scene-aware Re-ranking
Text-based person retrieval aims to identify a target individual from a
gallery of images based on a natural language description. It presents a
significant challenge due to the complexity of real-world scenes and the
ambiguity of appearance-related descriptions. Existing methods primarily
emphasize appearance-based cross-modal retrieval, often neglecting the
contextual information embedded within the scene, which can offer valuable
complementary insights for retrieval. To address this, we introduce
SCENEPERSON-13W, a large-scale dataset featuring over 100,000 scenes with rich
annotations covering both pedestrian appearance and environmental cues. Based
on this, we propose SA-Person, a two-stage retrieval framework. In the first
stage, it performs discriminative appearance grounding by aligning textual cues
with pedestrian-specific regions. In the second stage, it introduces
SceneRanker, a training-free, scene-aware re-ranking method leveraging
multimodal large language models to jointly reason over pedestrian appearance
and the global scene context. Experiments on SCENEPERSON-13W validate the
effectiveness of our framework in challenging scene-level retrieval scenarios.
The code and dataset will be made publicly available.
comment: 22 pages, 7 figures. Under review
♻ ☆ Variational Supervised Contrastive Learning
Contrastive learning has proven to be highly efficient and adaptable in
shaping representation spaces across diverse modalities by pulling similar
samples together and pushing dissimilar ones apart. However, two key
limitations persist: (1) Without explicit regulation of the embedding
distribution, semantically related instances can inadvertently be pushed apart
unless complementary signals guide pair selection, and (2) excessive reliance
on large in-batch negatives and tailored augmentations hinders generalization.
To address these limitations, we propose Variational Supervised Contrastive
Learning (VarCon), which reformulates supervised contrastive learning as
variational inference over latent class variables and maximizes a
posterior-weighted evidence lower bound (ELBO) that replaces exhaustive
pair-wise comparisons for efficient class-aware matching and grants
fine-grained control over intra-class dispersion in the embedding space.
Trained exclusively on image data, our experiments on CIFAR-10, CIFAR-100,
ImageNet-100, and ImageNet-1K show that VarCon (1) achieves state-of-the-art
performance for contrastive learning frameworks, reaching 79.36% Top-1 accuracy
on ImageNet-1K and 78.29% on CIFAR-100 with a ResNet-50 encoder while
converging in just 200 epochs; (2) yields substantially clearer decision
boundaries and semantic organization in the embedding space, as evidenced by
KNN classification, hierarchical clustering results, and transfer-learning
assessments; and (3) demonstrates superior few-shot learning performance
compared with the supervised baseline and superior robustness across various
augmentation strategies.
♻ ☆ Structure-Preserving Patch Decoding for Efficient Neural Video Representation
Implicit neural representations (INRs) are the subject of extensive research,
particularly in their application to modeling complex signals by mapping
spatial and temporal coordinates to corresponding values. When handling videos,
mapping compact inputs to entire frames or spatially partitioned patch images
is an effective approach. This strategy better preserves spatial relationships,
reduces computational overhead, and improves reconstruction quality compared to
coordinate-based mapping. However, predicting entire frames often limits the
reconstruction of high-frequency visual details. Additionally, conventional
patch-based approaches based on uniform spatial partitioning tend to introduce
boundary discontinuities that degrade spatial coherence. We propose a neural
video representation method based on Structure-Preserving Patches (SPPs) to
address such limitations. Our method separates each video frame into patch
images of spatially aligned frames through a deterministic pixel-based
splitting similar to PixelUnshuffle. This operation preserves the global
spatial structure while allowing patch-level decoding. We train the decoder to
reconstruct these structured patches, enabling a global-to-local decoding
strategy that captures the global layout first and refines local details. This
effectively reduces boundary artifacts and mitigates distortions from naive
upsampling. Experiments on standard video datasets demonstrate that our method
achieves higher reconstruction quality and better compression performance than
existing INR-based baselines.
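The deterministic pixel-based splitting mentioned in the abstract above is described only as "similar to PixelUnshuffle". A minimal pure-Python sketch of a PixelUnshuffle-style split and its inverse is given below. The function names and the factor `r` are illustrative, and the paper's actual splitting may differ in detail.

```python
def split_into_structured_patches(frame, r):
    """Split an H x W frame (list of rows) into r*r sub-images of size (H/r) x (W/r).

    Sub-image (i, j) takes every r-th pixel starting at row i and column j, so each
    patch is a downsampled view of the whole frame rather than a local tile.
    """
    h, w = len(frame), len(frame[0])
    if h % r or w % r:
        raise ValueError("frame height and width must be divisible by r")
    return [[row[j::r] for row in frame[i::r]]
            for i in range(r) for j in range(r)]


def merge_structured_patches(patches, r):
    """Inverse of the split: interleave the r*r sub-images back into one frame."""
    ph, pw = len(patches[0]), len(patches[0][0])
    frame = [[None] * (pw * r) for _ in range(ph * r)]
    for idx, patch in enumerate(patches):
        i, j = divmod(idx, r)  # row and column offset of this sub-image
        for y, row in enumerate(patch):
            for x, value in enumerate(row):
                frame[y * r + i][x * r + j] = value
    return frame
```

Because every patch spans the full frame at lower resolution, there are no tile borders between patches, which is consistent with the abstract's claim that the global spatial structure is preserved while decoding remains patch-level.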
♻ ☆ StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
World models have recently become promising tools for predicting realistic
visuals based on actions in complex environments. However, their reliance on
only a few recent observations leads them to lose track of the long-term
context. Consequently, in just a few steps the generated scenes drift from what
was previously observed, undermining the temporal coherence of the sequence.
This limitation of the state-of-the-art world models, most of which rely on
diffusion, comes from their lack of a lasting environment state. To address
this problem, we introduce StateSpaceDiffuser, where a diffusion model is
enabled to perform long-context tasks by integrating features from a
state-space model, representing the entire interaction history. This design
restores long-term memory while preserving the high-fidelity synthesis of
diffusion models. To rigorously measure temporal consistency, we develop an
evaluation protocol that probes a model's ability to reinstantiate seen content
in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser
significantly outperforms a strong diffusion-only baseline, maintaining a
coherent visual context for an order of magnitude more steps. It delivers
consistent views in both a 2D maze navigation and a complex 3D environment.
These results establish that bringing state-space representations into
diffusion models is highly effective in demonstrating both visual details and
long-term memory.
♻ ☆ Moderating the Generalization of Score-based Generative Model
Score-based Generative Models (SGMs) have demonstrated remarkable
generalization abilities, e.g. generating unseen, but natural data. However,
the greater the generalization power, the more likely the unintended
generalization, and the more dangerous the abuse. Research on moderated
generalization in SGMs remains limited. To fill this gap, we first examine the
current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model
after removing the undesirable training data, and find it does not work in
SGMs. Further analysis of score functions reveals that the MU 'gold standard'
does not alter the original score function, which explains its ineffectiveness.
Based on this insight, we propose the first Moderated Score-based Generative
Model (MSGM), which introduces a novel score adjustment strategy that redirects
the score function away from undesirable data during the continuous-time
stochastic differential equation process. Extensive experimental results
demonstrate that MSGM significantly reduces the likelihood of generating
undesirable content while preserving high visual quality for normal image
generation. Albeit designed for SGMs, MSGM is a general and flexible MU
framework that is compatible with diverse diffusion architectures (SGM and
DDPM) and training strategies (re-training and fine-tuning), and enables
zero-shot transfer of the pre-trained models to downstream tasks, e.g. image
inpainting and reconstruction. The code will be shared upon acceptance.
♻ ☆ Metis-RISE: RL Incentivizes and SFT Enhances Multimodal Reasoning Model Learning
Recent advancements in large language models (LLMs) have witnessed a surge in
the development of advanced reasoning paradigms, which are now being integrated
into multimodal large language models (MLLMs). However, existing approaches
often fall short: methods solely employing reinforcement learning (RL) can
struggle with sample inefficiency and activating entirely absent reasoning
capabilities, while conventional pipelines that initiate with a cold-start
supervised fine-tuning (SFT) phase before RL may restrict the model's
exploratory capacity and face suboptimal convergence. In this work, we
introduce Metis-RISE (RL Incentivizes and
SFT Enhances) for multimodal reasoning model learning. Unlike
conventional approaches, Metis-RISE distinctively omits an initial SFT stage,
beginning instead with an RL phase (e.g., using a Group Relative Policy
Optimization variant) to incentivize and activate the model's latent reasoning
capacity. Subsequently, the targeted SFT stage addresses two key challenges
identified during RL: (1) inefficient trajectory sampling for tasks
where the model possesses but inconsistently applies correct reasoning, which
we tackle using self-distilled reasoning trajectories from the RL model itself;
and (2) fundamental capability absence, which we address by injecting
expert-augmented knowledge for prompts where the model entirely fails. This
strategic application of RL for incentivization followed by SFT for enhancement
forms the core of Metis-RISE, leading to two versions of our MLLMs (7B and 72B
parameters). Evaluations on the OpenCompass Multimodal Reasoning Leaderboard
demonstrate that both models achieve state-of-the-art performance among
similar-sized models, with the 72B version ranking fourth overall. Please refer
to our project page for open-source information.
comment: Project Page: https://github.com/MM-Thinking/Metis-RISE
♻ ☆ Self-Regulated Neurogenesis for Online Data-Incremental Learning
Neural networks often struggle with catastrophic forgetting when learning
sequences of tasks or data streams, unlike humans who can continuously learn
and consolidate new concepts even in the absence of explicit cues. Online
data-incremental learning seeks to emulate this capability by processing each
sample only once, without having access to task or stream cues at any point in
time, since this is more realistic than offline setups, where all data
from novel class(es) is assumed to be readily available. However, existing
methods typically rely on storing subsets of data in memory or expanding
the initial model architecture, resulting in significant computational
overhead. Drawing inspiration from 'self-regulated neurogenesis', the brain's
mechanism for creating specialized regions or circuits for distinct
functions, we propose a novel approach, SERENA, which encodes each concept in a
specialized network path called a 'concept cell', integrated into a single
over-parameterized network. Once a concept is learned, its corresponding
concept cell is frozen, effectively preventing the forgetting of previously
acquired information. Furthermore, we introduce two new continual learning
scenarios that more closely reflect real-world conditions, characterized by
gradually changing sample sizes. Experimental results show that our method not
only establishes new state-of-the-art results across ten benchmarks but also
remarkably surpasses offline supervised batch learning performance. The code is
available at https://github.com/muratonuryildirim/serena.
comment: Published at Conference on Lifelong Learning Agents (CoLLAs) 2025
♻ ☆ Referring Expression Instance Retrieval and A Strong End-to-End Baseline
Using natural language to query visual information is a fundamental need in
real-world applications. Text-Image Retrieval (TIR) retrieves a target image
from a gallery based on an image-level description, while Referring Expression
Comprehension (REC) localizes a target object within a given image using an
instance-level description. However, real-world applications often present more
complex demands. Users typically query an instance-level description across a
large gallery and expect to receive both the relevant image and the corresponding
instance location. In such scenarios, TIR struggles with fine-grained
descriptions and object-level localization, while REC is limited in its ability
to efficiently search large galleries and lacks an effective ranking mechanism.
In this paper, we introduce a new task called Referring Expression
Instance Retrieval (REIR), which supports both instance-level retrieval and
localization based on fine-grained referring expressions. First, we propose a
large-scale benchmark for REIR, named REIRCOCO, constructed by prompting
advanced vision-language models to generate high-quality referring expressions
for instances in the MSCOCO and RefCOCO datasets. Second, we present a baseline
method, Contrastive Language-Instance Alignment with Relation Experts (CLARE),
which employs a dual-stream architecture to address REIR in an end-to-end
manner. Given a referring expression, the textual branch encodes it into a
query embedding. The visual branch detects candidate objects and extracts their
instance-level visual features. The most similar candidate to the query is
selected for bounding box prediction. CLARE is first trained on object
detection and REC datasets to establish initial grounding capabilities, then
optimized via Contrastive Language-Instance Alignment (CLIA) for improved
retrieval across images. We will release our code and benchmark publicly.
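The selection step described in the abstract above (encode the expression into a query embedding, extract instance-level features for candidate objects, pick the most similar candidate) can be sketched as below. This is an illustration under stated assumptions and not the CLARE implementation: the abstract does not name the similarity measure, so cosine similarity is assumed, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_candidate(query_embedding, candidate_features):
    """Return (index, score) of the candidate instance most similar to the query.

    candidate_features is a list of instance-level feature vectors, which could be
    pooled from every image in a gallery to rank instances across images.
    """
    scores = [cosine_similarity(query_embedding, feat) for feat in candidate_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For example, `select_best_candidate([1.0, 0.0], [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]])` picks index 1, the candidate whose feature points in nearly the same direction as the query.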