Computer Vision and Pattern Recognition 104
β HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis ICCV 2025
Timo Teufel, Pulkit Gera, Xilong Zhou, Umar Iqbal, Pramod Rao, Jan Kautz, Vladislav Golyanik, Christian Theobalt
Simultaneous relighting and novel-view rendering of digital human
representations is an important yet challenging task with numerous
applications. Progress in this area has been significantly limited due to the
lack of publicly available, high-quality datasets, especially for full-body
human captures. To address this critical gap, we introduce the HumanOLAT
dataset, the first publicly accessible large-scale dataset of multi-view
One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes
HDR RGB frames under various illuminations, such as white light, environment
maps, color gradients and fine-grained OLAT illuminations. Our evaluations of
state-of-the-art relighting and novel-view synthesis methods underscore both
the dataset's value and the significant challenges still present in modeling
complex human-centric appearance and lighting interactions. We believe
HumanOLAT will significantly facilitate future research, enabling rigorous
benchmarking and advancements in both general and human-specific relighting and
rendering techniques.
comment: TT and PG contributed equally; accepted at ICCV 2025; project page:
https://vcai.mpi-inf.mpg.de/projects/HumanOLAT/
β Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
There is a growing demand for deploying large generative AI models on mobile
devices. For recent popular video generative models, however, the Variational
AutoEncoder (VAE) represents one of the major computational bottlenecks. Both
large parameter sizes and mismatched kernels cause out-of-memory errors or
extremely slow inference on mobile devices. To address this, we propose a
low-cost solution that efficiently transfers widely used video VAEs to mobile
devices. (1) We analyze redundancy in existing VAE architectures and derive
empirical design insights. By integrating 3D depthwise separable convolutions
into our model, we significantly reduce the number of parameters. (2) We
observe that the upsampling techniques in mainstream video VAEs are poorly
suited to mobile hardware and form the main bottleneck. In response, we propose
a decoupled 3D pixel shuffle scheme that slashes end-to-end delay. Building
upon these, we develop a universal mobile-oriented VAE decoder, Turbo-VAED. (3)
We propose an efficient VAE decoder training method. Since only the decoder is
used during deployment, we distill it to Turbo-VAED instead of retraining the
full VAE, enabling fast mobile adaptation with minimal performance loss. To our
knowledge, our method enables real-time 720p video VAE decoding on mobile
devices for the first time. This approach is widely applicable to most video
VAEs. When integrated into four representative models, with training cost as
low as $95, it accelerates original VAEs by up to 84.5x at 720p resolution on
GPUs, uses as little as 17.5% of the original parameter count, and retains 96.9% of
the original reconstruction quality. Compared to mobile-optimized VAEs,
Turbo-VAED achieves a 2.9x speedup in FPS and better reconstruction quality on
the iPhone 16 Pro. The code and models will soon be available at
https://github.com/hustvl/Turbo-VAED.
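To illustrate why 3D depthwise separable convolutions cut parameter counts, the sketch below compares a standard 3D convolution with a depthwise-plus-pointwise pair. The channel width (128) and kernel size (3) are assumed values chosen for illustration, not figures from the paper, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, k):
    """Weights plus biases of a standard 3D convolution with a k x k x k kernel."""
    return c_in * c_out * k ** 3 + c_out


def depthwise_separable_conv3d_params(c_in, c_out, k):
    """Depthwise k x k x k convolution (one filter per input channel)
    followed by a 1 x 1 x 1 pointwise convolution that mixes channels."""
    depthwise = c_in * k ** 3 + c_in
    pointwise = c_in * c_out + c_out
    return depthwise + pointwise


# Assumed layer shape, for illustration only.
standard = conv3d_params(128, 128, 3)
separable = depthwise_separable_conv3d_params(128, 128, 3)
print(standard, separable, separable / standard)
```

The saving grows with kernel volume and channel width, which is why the substitution matters most in wide decoder layers.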
β Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer
Zixin Yin, Xili Dai, Ling-Hao Chen, Deyu Zhou, Jianan Wang, Duomin Wang, Gang Yu, Lionel M. Ni, Heung-Yeung Shum
Text-guided color editing in images and videos is a fundamental yet unsolved
problem, requiring fine-grained manipulation of color attributes, including
albedo, light source color, and ambient lighting, while preserving physical
consistency in geometry, material properties, and light-matter interactions.
Existing training-free methods offer broad applicability across editing tasks
but struggle with precise color control and often introduce visual
inconsistency in both edited and non-edited regions. In this work, we present
ColorCtrl, a training-free color editing method that leverages the attention
mechanisms of modern Multi-Modal Diffusion Transformers (MM-DiT). By
disentangling structure and color through targeted manipulation of attention
maps and value tokens, our method enables accurate and consistent color
editing, along with word-level control of attribute intensity. Our method
modifies only the intended regions specified by the prompt, leaving unrelated
areas untouched. Extensive experiments on both SD3 and FLUX.1-dev demonstrate
that ColorCtrl outperforms existing training-free approaches and achieves
state-of-the-art performance in both edit quality and consistency.
Furthermore, our method surpasses strong commercial models such as FLUX.1
Kontext Max and GPT-4o Image Generation in terms of consistency. When extended
to video models like CogVideoX, our approach exhibits greater advantages,
particularly in maintaining temporal coherence and editing stability. Finally,
our method also generalizes to instruction-based editing diffusion models such
as Step1X-Edit and FLUX.1 Kontext dev, further demonstrating its versatility.
β OpenCUA: Open Foundations for Computer-Use Agents
Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, Zhennan Shen, Zhuokai Li, Ryan Li, Xiaochuan Li, Junda Chen, Boyuan Zheng, Peihang Li, Fangyu Lei, Ruisheng Cao, Yeqiao Fu, Dongchan Shin, Martin Shin, Jiarui Hu, Yuyan Wang, Jixuan Chen, Yuxiao Ye, Danyang Zhang, Dikang Du, Hao Hu, Huarong Chen, Zaida Zhou, Yipu Wang, Heng Wang, Diyi Yang, Victor Zhong, Flood Sung, Y. Charles, Zhilin Yang, Tao Yu
Vision-language models have demonstrated impressive capabilities as
computer-use agents (CUAs) capable of automating diverse computer tasks. As
their commercial potential grows, critical details of the most capable CUA
systems remain closed. As these agents will increasingly mediate digital
interactions and execute consequential decisions on our behalf, the research
community needs access to open CUA frameworks to study their capabilities,
limitations, and risks. To bridge this gap, we propose OpenCUA, a comprehensive
open-source framework for scaling CUA data and foundation models. Our framework
consists of: (1) an annotation infrastructure that seamlessly captures human
computer-use demonstrations; (2) AgentNet, the first large-scale computer-use
task dataset spanning 3 operating systems and 200+ applications and websites;
(3) a scalable pipeline that transforms demonstrations into state-action pairs
with reflective long Chain-of-Thought reasoning that sustains robust performance
gains as data scales. Our end-to-end agent models demonstrate strong
performance across CUA benchmarks. In particular, OpenCUA-32B achieves an
average success rate of 34.8% on OSWorld-Verified, establishing a new
state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA
(GPT-4o). Further analysis confirms that our approach generalizes well across
domains and benefits significantly from increased test-time computation. We
release our annotation tool, datasets, code, and models to build open
foundations for further CUA research.
β Deep Learning Models for Robust Facial Liveness Detection
Oleksandr Kuznetsov, Emanuele Frontoni, Luca Romeo, Riccardo Rosati, Andrea Maranesi, Alessandro Muscatello
In the rapidly evolving landscape of digital security, biometric
authentication systems, particularly facial recognition, have emerged as
integral components of various security protocols. However, the reliability of
these systems is compromised by sophisticated spoofing attacks, where imposters
gain unauthorized access by falsifying biometric traits. Current literature
reveals a concerning gap: existing liveness detection methodologies - designed
to counteract these breaches - fall short against advanced spoofing tactics
employing deepfakes and other artificial intelligence-driven manipulations.
This study introduces a robust solution through novel deep learning models
addressing the deficiencies in contemporary anti-spoofing techniques. By
innovatively integrating texture analysis and reflective properties associated
with genuine human traits, our models distinguish authentic presence from
replicas with remarkable precision. Extensive evaluations were conducted across
five diverse datasets, encompassing a wide range of attack vectors and
environmental conditions. Results demonstrate substantial advancement over
existing systems, with our best model (AttackNet V2.2) achieving 99.9% average
accuracy when trained on combined data. Moreover, our research unveils critical
insights into the behavioral patterns of impostor attacks, contributing to a
more nuanced understanding of their evolving nature. The implications are
profound: our models do not merely fortify the authentication processes but
also instill confidence in biometric systems across various sectors reliant on
secure access.
β Addressing Bias in VLMs for Glaucoma Detection Without Protected Attribute Supervision MICCAI-2025
Ahsan Habib Akash, Greg Murray, Annahita Amireskandari, Joel Palko, Carol Laxson, Binod Bhattarai, Prashnna Gyawali
Vision-Language Models (VLMs) have achieved remarkable success on multimodal
tasks such as image-text retrieval and zero-shot classification, yet they can
exhibit demographic biases even when explicit protected attributes are absent
during training. In this work, we focus on automated glaucoma screening from
retinal fundus images, a critical application given that glaucoma is a leading
cause of irreversible blindness and disproportionately affects underserved
populations. Building on a reweighting-based contrastive learning framework, we
introduce an attribute-agnostic debiasing method that (i) infers proxy
subgroups via unsupervised clustering of image-image embeddings, (ii) computes
gradient-similarity weights between the CLIP-style multimodal loss and a
SimCLR-style image-pair contrastive loss, and (iii) applies these weights in a
joint, top-$k$ weighted objective to upweight underperforming clusters. This
label-free approach adaptively targets the hardest examples, thereby reducing
subgroup disparities. We evaluate our method on the Harvard FairVLMed glaucoma
subset, reporting Equalized Odds Distance (EOD), Equalized Subgroup AUC (ES
AUC), and Groupwise AUC to demonstrate equitable performance across inferred
demographic subgroups.
comment: 3rd Workshop in Data Engineering in Medical Imaging (DEMI),
MICCAI-2025 Workshop
β Efficient motion-based metrics for video frame interpolation SP
Video frame interpolation (VFI) offers a way to generate intermediate frames
between consecutive frames of a video sequence. Although the development of
advanced frame interpolation algorithms has received increased attention in
recent years, assessing the perceptual quality of interpolated content remains
an ongoing area of research. In this paper, we investigate simple ways to
process motion fields, with the purpose of using them as video quality metrics
for evaluating frame interpolation algorithms. We evaluate these quality
metrics using the BVI-VFI dataset which contains perceptual scores measured for
interpolated sequences. From our investigation we propose a motion metric based
on measuring the divergence of motion fields. This metric correlates reasonably
with these perceptual scores (PLCC=0.51) and is more computationally efficient
(2.7x speedup) compared to FloLPIPS (a well-known motion-based metric). We then
use our newly proposed metrics to evaluate a range of state-of-the-art frame
interpolation methods and find our metrics tend to favour more perceptually
pleasing interpolated frames that may not score highly in terms of PSNR or
SSIM.
comment: SPIE2025 - Applications of Digital Image Processing XLVIII accepted
manuscript
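A minimal sketch of a divergence-based motion statistic follows. The abstract does not say how the divergence is discretised or pooled, so the central differences and the mean-absolute pooling used here are assumptions, and the function names are hypothetical.

```python
def divergence(u, v):
    """Divergence du/dx + dv/dy of a 2D motion field at interior pixels.

    u[y][x] is the horizontal and v[y][x] the vertical motion component.
    Central differences are used, so border pixels are skipped.
    """
    height, width = len(u), len(u[0])
    div = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            du_dx = (u[y][x + 1] - u[y][x - 1]) / 2.0
            dv_dy = (v[y + 1][x] - v[y - 1][x]) / 2.0
            row.append(du_dx + dv_dy)
        div.append(row)
    return div


def mean_abs_divergence(u, v):
    """Pool the divergence map into one score (assumed pooling choice)."""
    values = [abs(d) for row in divergence(u, v) for d in row]
    return sum(values) / len(values)


# A pure translation has zero divergence; an expanding field does not.
translation_u = [[1.0] * 5 for _ in range(5)]
translation_v = [[0.5] * 5 for _ in range(5)]
expanding_u = [[float(x) for x in range(5)] for _ in range(5)]
expanding_v = [[float(y)] * 5 for y in range(5)]
print(mean_abs_divergence(translation_u, translation_v))
print(mean_abs_divergence(expanding_u, expanding_v))
```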
β Scaling Learned Image Compression Models up to 1 Billion
Recent advances in large language models (LLMs) highlight a strong connection
between intelligence and compression. Learned image compression, a fundamental
task in modern data compression, has made significant progress in recent years.
However, current models remain limited in scale, restricting their
representation capacity, and how scaling model size influences compression
performance remains unexplored. In this work, we present a pioneering study on
scaling up learned image compression models and revealing the performance
trends through scaling laws. Using the recent state-of-the-art HPCM model as
the baseline, we scale model parameters from 68.5 million to 1 billion and fit
power-law relations between test loss and key scaling variables, including
model size and optimal training compute. The results reveal a scaling trend,
enabling extrapolation to larger scale models. Experimental results demonstrate
that the scaled-up HPCM-1B model achieves state-of-the-art rate-distortion
performance. We hope this work inspires future exploration of large-scale
compression models and deeper investigations into the connection between
compression and intelligence.
comment: 11 pages, technical report
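Fitting a power law between test loss and model size can be sketched as a straight-line fit in log-log space. The functional form L = a * N^(-b) (with no irreducible-loss term), the intermediate model sizes, and the loss values below are all assumptions for illustration; only the 68.5 million and 1 billion endpoints come from the abstract.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) via linear regression
    on (log size, log loss). Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, size):
    """Extrapolate the fitted law to another model size."""
    return a * size ** (-b)


# Synthetic losses generated from a known law, not measured results.
sizes = [68.5e6, 1.5e8, 4.0e8, 1.0e9]
losses = [2.0 * n ** (-0.1) for n in sizes]
a, b = fit_power_law(sizes, losses)
print(a, b, predict_loss(a, b, 3.0e9))
```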
β A new dataset and comparison for multi-camera frame synthesis SP
Many methods exist for frame synthesis in image sequences but can be broadly
categorised into frame interpolation and view synthesis techniques.
Fundamentally, both frame interpolation and view synthesis tackle the same
task, interpolating a frame given surrounding frames in time or space. However,
most frame interpolation datasets focus on temporal aspects with single cameras
moving through time and space, while view synthesis datasets are typically
biased toward stereoscopic depth estimation use cases. This makes direct
comparison between view synthesis and frame interpolation methods challenging.
In this paper, we develop a novel multi-camera dataset using a custom-built
dense linear camera array to enable fair comparison between these approaches.
We evaluate classical and deep learning frame interpolators against a view
synthesis method (3D Gaussian Splatting) for the task of view in-betweening.
Our results reveal that deep learning methods do not significantly outperform
classical methods on real image data, with 3D Gaussian Splatting actually
underperforming frame interpolators by as much as 3.5 dB PSNR. However, in
synthetic scenes, the situation reverses -- 3D Gaussian Splatting outperforms
frame interpolation algorithms by almost 5 dB PSNR at a 95% confidence level.
comment: SPIE2025 - Applications of Digital Image Processing XLVIII accepted
manuscript
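For readers comparing the dB figures above, here is a short sketch of PSNR and of what a PSNR gap means in terms of mean squared error. It assumes 8-bit images stored as nested lists; the helper names are hypothetical, and the dB-to-MSE conversion holds for a single image pair rather than for averages over a dataset.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    flat_ref = [p for row in reference for p in row]
    flat_test = [p for row in test for p in row]
    if len(flat_ref) != len(flat_test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_test)) / len(flat_ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_db_gap(db_gap):
    """Factor by which MSE differs when PSNR differs by db_gap dB."""
    return 10.0 ** (db_gap / 10.0)


reference = [[10, 20], [30, 40]]
shifted = [[11, 21], [31, 41]]
print(psnr(reference, shifted))
print(mse_ratio_for_db_gap(3.5), mse_ratio_for_db_gap(5.0))
```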
β VertexRegen: Mesh Generation with Continuous Level of Detail ICCV 2025
We introduce VertexRegen, a novel mesh generation framework that enables
generation at a continuous level of detail. Existing autoregressive methods
generate meshes in a partial-to-complete manner and thus intermediate steps of
generation represent incomplete structures. VertexRegen takes inspiration from
progressive meshes and reformulates the process as the reversal of edge
collapse, i.e. vertex split, learned through a generative model. Experimental
results demonstrate that VertexRegen produces meshes of comparable quality to
state-of-the-art methods while uniquely offering anytime generation with the
flexibility to halt at any step to yield valid meshes with varying levels of
detail.
comment: ICCV 2025. Project Page: https://vertexregen.github.io/
β VLM-3D: End-to-End Vision-Language Models for Open-World 3D Perception
Open-set perception in complex traffic environments poses a critical
challenge for autonomous driving systems, particularly in identifying
previously unseen object categories, which is vital for ensuring safety. Visual
Language Models (VLMs), with their rich world knowledge and strong semantic
reasoning capabilities, offer new possibilities for addressing this task.
However, existing approaches typically leverage VLMs to extract visual features
and couple them with traditional object detectors, resulting in multi-stage
error propagation that hinders perception accuracy. To overcome this
limitation, we propose VLM-3D, the first end-to-end framework that enables VLMs
to perform 3D geometric perception in autonomous driving scenarios. VLM-3D
incorporates Low-Rank Adaptation (LoRA) to efficiently adapt VLMs to driving
tasks with minimal computational overhead, and introduces a joint
semantic-geometric loss design: token-level semantic loss is applied during
early training to ensure stable convergence, while 3D IoU loss is introduced in
later stages to refine the accuracy of 3D bounding box predictions. Evaluations
on the nuScenes dataset demonstrate that the proposed joint semantic-geometric
loss in VLM-3D leads to a 12.8% improvement in perception accuracy, fully
validating the effectiveness and advancement of our method.
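The staged loss described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: boxes are treated as axis-aligned (driving boxes are usually oriented), the IoU term is added to the token loss after a hypothetical `switch_step`, and `iou_weight` is an invented parameter.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x0, y0, z0, x1, y1, z1); 0 if degenerate."""
    x0, y0, z0, x1, y1, z1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)


def iou_3d_axis_aligned(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection_box = (
        max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]),
        min(a[3], b[3]), min(a[4], b[4]), min(a[5], b[5]),
    )
    intersection = box_volume(intersection_box)
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def joint_loss(step, switch_step, token_loss, pred_box, gt_box, iou_weight=1.0):
    """Token-level loss only in early training; add (1 - IoU) afterwards."""
    if step < switch_step:
        return token_loss
    return token_loss + iou_weight * (1.0 - iou_3d_axis_aligned(pred_box, gt_box))


predicted = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
ground_truth = (1.0, 0.0, 0.0, 3.0, 2.0, 2.0)
print(iou_3d_axis_aligned(predicted, ground_truth))
print(joint_loss(10, 100, 0.7, predicted, ground_truth))
print(joint_loss(200, 100, 0.7, predicted, ground_truth))
```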
β ALFred: An Active Learning Framework for Real-world Semi-supervised Anomaly Detection with Adaptive Thresholds
Video Anomaly Detection (VAD) can play a key role in spotting unusual
activities in video footage. VAD is difficult to use in real-world settings due
to the dynamic nature of human actions, environmental variations, and domain
shifts. Traditional evaluation metrics often prove inadequate for such
scenarios, as they rely on static assumptions and fall short of identifying a
threshold that distinguishes normal from anomalous behavior in dynamic
settings. To address this, we introduce an active learning framework tailored
for VAD, designed for adapting to the ever-changing real-world conditions. Our
approach leverages active learning to continuously select the most informative
data points for labeling, thereby enhancing model adaptability. A critical
innovation is the incorporation of a human-in-the-loop mechanism, which enables
the identification of actual normal and anomalous instances from
pseudo-labeling results generated by AI. This collected data allows the
framework to define an adaptive threshold tailored to different environments,
ensuring that the system remains effective as the definition of 'normal' shifts
across various settings. Implemented within a lab-based framework that
simulates real-world conditions, our approach allows rigorous testing and
refinement of VAD algorithms with a new metric. Experimental results show that
our method achieves an EBI (Error Balance Index) of 68.91 for Q3 in real-world
simulated scenarios, demonstrating its practical effectiveness and
significantly enhancing the applicability of VAD in dynamic environments.
β Per-Query Visual Concept Learning
Visual concept learning, also known as text-to-image personalization, is the
process of teaching new concepts to a pretrained model. This has numerous
applications from product placement to entertainment and personalized design.
Here we show that many existing methods can be substantially augmented by
adding a personalization step that (1) is specific to the prompt and noise
seed, and (2) uses two loss terms based on self- and cross-attention,
capturing the identity of the personalized concept. Specifically, we leverage
PDM features - previously designed to capture identity - and show how they can
be used to improve personalized semantic similarity. We evaluate the benefit
that our method gains on top of six different personalization methods, and
several base text-to-image models (both UNet- and DiT-based). We find
significant improvements even over previous per-query personalization methods.
comment: Project page is at
https://per-query-visual-concept-learning.github.io/
β Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
Vision-Language-Action models have demonstrated remarkable capabilities in
predicting agent movements within virtual environments and real-world scenarios
based on visual observations and textual instructions. Although recent research
has focused on enhancing spatial and temporal understanding independently, this
paper presents a novel approach that integrates both aspects through visual
prompting. We introduce a method that projects visual traces of key points from
observations onto depth maps, enabling models to capture both spatial and
temporal information simultaneously. The experiments in SimplerEnv show that
the mean number of tasks successfully solved increased by 4% compared to
SpatialVLA and 19% compared to TraceVLA. Furthermore, we show that this
enhancement can be achieved with minimal training data, making it particularly
valuable for real-world applications where data collection is challenging. The
project page is available at https://ampiromax.github.io/ST-VLA.
β When Deepfakes Look Real: Detecting AI-Generated Faces with Unlabeled Data due to Annotation Challenges
Existing deepfake detection methods heavily depend on labeled training data.
However, as AI-generated content becomes increasingly realistic, even
human annotators struggle to distinguish between deepfakes and
authentic images. This makes the labeling process both time-consuming and less
reliable. Specifically, there is a growing demand for approaches that can
effectively utilize large-scale unlabeled data from online social networks.
Unlike typical unsupervised learning tasks, where categories are distinct,
AI-generated faces closely mimic real image distributions and share strong
similarities, causing a performance drop in conventional strategies. In this
paper, we introduce the Dual-Path Guidance Network (DPGNet) to tackle two key
challenges: (1) bridging the domain gap between faces from different generation
models, and (2) utilizing unlabeled image samples. The method features two core
modules: text-guided cross-domain alignment, which uses learnable prompts to
unify visual and textual embeddings into a domain-invariant feature space, and
curriculum-driven pseudo-label generation, which dynamically exploits more
informative unlabeled samples. To prevent catastrophic forgetting, we also
facilitate bridging between domains via cross-domain knowledge distillation.
Extensive experiments on 11 popular datasets show that DPGNet
outperforms SoTA approaches by 6.3%, highlighting its effectiveness
in leveraging unlabeled data to address the annotation challenges posed by the
increasing realism of deepfakes.
comment: 10 pages, 5 figures
β Uncertainty-aware Cross-training for Semi-supervised Medical Image Segmentation
Semi-supervised learning has gained considerable popularity in medical image
segmentation tasks due to its capability to reduce reliance on expert-examined
annotations. Several mean-teacher (MT) based semi-supervised methods utilize
consistency regularization to effectively leverage valuable information from
unlabeled data. However, these methods often heavily rely on the student model
and overlook the potential impact of cognitive biases within the model.
Furthermore, some methods employ co-training using pseudo-labels derived from
different inputs, yet generating high-confidence pseudo-labels from perturbed
inputs during training remains a significant challenge. In this paper, we
propose an Uncertainty-aware Cross-training framework for semi-supervised
medical image Segmentation (UC-Seg). Our UC-Seg framework incorporates two
distinct subnets to effectively explore and leverage the correlation between
them, thereby mitigating cognitive biases within the model. Specifically, we
present a Cross-subnet Consistency Preservation (CCP) strategy to enhance
feature representation capability and ensure feature consistency across the two
subnets. This strategy enables each subnet to correct its own biases and learn
shared semantics from both labeled and unlabeled data. Additionally, we propose
an Uncertainty-aware Pseudo-label Generation (UPG) component that leverages
segmentation results and corresponding uncertainty maps from both subnets to
generate high-confidence pseudo-labels. We extensively evaluate the proposed
UC-Seg on various medical image segmentation tasks involving images of different
modalities, such as MRI, CT, ultrasound, and colonoscopy. The results
demonstrate that our method achieves superior segmentation accuracy and
generalization performance compared to other state-of-the-art semi-supervised
methods. Our code will be released at https://github.com/taozh2017/UCSeg.
comment: 14 pages, 10 figures
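One plausible way to combine two subnets' outputs into high-confidence pseudo-labels is sketched below. The abstract does not specify the fusion rule, so everything here is an assumption: binary segmentation, entropy as the uncertainty measure, picking the less uncertain subnet per pixel, and the `max_entropy` cut-off. It is not the UPG component itself.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a foreground probability; higher means less certain."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def fuse_pseudo_labels(probs_a, probs_b, max_entropy=0.5):
    """Per pixel, trust the subnet with lower entropy.

    Returns 1 or 0 for confident pixels and None for pixels whose best
    prediction is still too uncertain to use as a pseudo-label.
    """
    labels = []
    for pa, pb in zip(probs_a, probs_b):
        ha, hb = binary_entropy(pa), binary_entropy(pb)
        p, h = (pa, ha) if ha <= hb else (pb, hb)
        if h <= max_entropy:
            labels.append(1 if p >= 0.5 else 0)
        else:
            labels.append(None)
    return labels


# Flattened foreground probabilities from two hypothetical subnets.
subnet_a = [0.95, 0.60, 0.10]
subnet_b = [0.70, 0.55, 0.02]
print(fuse_pseudo_labels(subnet_a, subnet_b))
```

Pixels marked None would simply be excluded from the unsupervised loss in this sketch.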
β Towards Perfection: Building Inter-component Mutual Correction for Retinex-based Low-light Image Enhancement
In low-light image enhancement, Retinex-based deep learning methods have
garnered significant attention due to their exceptional interpretability. These
methods decompose images into mutually independent illumination and reflectance
components, allowing each component to be enhanced separately. In practice, achieving
perfect decomposition of illumination and reflectance components proves to be
quite challenging, with some residuals still existing after decomposition. In
this paper, we formally name these residuals inter-component residuals
(ICR), which have been largely underestimated by previous methods. In our
investigation, ICR not only affects the accuracy of the decomposition but also
causes enhanced components to deviate from the ideal outcome, ultimately
reducing the final synthesized image quality. To address this issue, we propose
a novel Inter-correction Retinex model (IRetinex) to alleviate ICR during the
decomposition and enhancement stages. In the decomposition stage, we leverage
an inter-component residual reduction module to reduce the feature similarity
between illumination and reflectance components. In the enhancement stage, we
utilize the feature similarity between the two components to detect and
mitigate the impact of ICR within each enhancement unit. Extensive experiments
on three low-light benchmark datasets demonstrated that by reducing ICR, our
method outperforms state-of-the-art approaches both qualitatively and
quantitatively.
comment: This article has been accepted by ACMMM 2025
β UniConvNet: Expanding Effective Receptive Field while Maintaining Asymptotically Gaussian Distribution for ConvNets of Any Scale ICCV 2025
Convolutional neural networks (ConvNets) with a large effective receptive field
(ERF), still in their early stages, have demonstrated promising effectiveness
but are constrained by high parameter and FLOPs costs and by a disrupted
asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an
alternative paradigm: rather than merely employing extremely large ERF, it is
more effective and efficient to expand the ERF while maintaining AGD of ERF by
proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$,
$11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator
and designs a Layer Operator as the fundamental operator from the perspective
of receptive field. The ERF can be expanded to the level of existing
large-kernel ConvNets through the stack of proposed modules while maintaining
AGD of ERF. Using these designs, we propose a universal model for ConvNet of
any scale, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017,
and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and
ViTs across various vision recognition tasks for both lightweight and
large-scale models with comparable throughput. Surprisingly, UniConvNet-T
achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$
FLOPs. UniConvNet-XL also shows competitive scalability to big data and large
models, acquiring $88.4\%$ top-1 accuracy on ImageNet. Code and models are
publicly available at https://github.com/ai-paperwithcode/UniConvNet.
comment: ICCV 2025
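The arithmetic behind combining smaller kernels can be sketched as follows. It assumes the kernels are stacked sequentially with stride 1 and no dilation, which the abstract does not confirm, and it computes the theoretical receptive field, an upper bound on the ERF rather than the ERF itself. The helper names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Theoretical receptive field of stride-1, undilated convolutions
    applied one after another: each layer adds (k - 1) pixels."""
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


def weights_per_channel_pair(kernel_sizes):
    """Spatial weights per (input, output) channel pair, summed over layers."""
    return sum(k * k for k in kernel_sizes)


stack = [7, 9, 11]
field = stacked_receptive_field(stack)
print(field)
print(weights_per_channel_pair(stack), weights_per_channel_pair([field]))
```

Under these assumptions the stack covers the same extent as one much larger kernel while using fewer spatial weights.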
β Spatial-Temporal Multi-Scale Quantization for Flexible Motion Generation
Despite significant advancements in human motion generation, current motion
representations, typically formulated as discrete frame sequences, still face
two critical limitations: (i) they fail to capture motion from a multi-scale
perspective, limiting their capability to model complex patterns; (ii) they
lack compositional flexibility, which is crucial for a model's generalization in
diverse generation tasks. To address these challenges, we introduce MSQ, a
novel quantization method that compresses the motion sequence into multi-scale
discrete tokens across spatial and temporal dimensions. MSQ employs distinct
encoders to capture body parts at varying spatial granularities and temporally
interpolates the encoded features into multiple scales before quantizing them
into discrete tokens. Building on this representation, we establish a
generative mask modeling model to effectively support motion editing, motion
control, and conditional motion generation. Through quantitative and
qualitative analysis, we show that our quantization method enables the seamless
composition of motion tokens without requiring specialized design or
re-training. Furthermore, extensive evaluations demonstrate that our approach
outperforms existing baseline methods on various benchmarks.
comment: 18 pages
β KFFocus: Highlighting Keyframes for Enhanced Video Understanding
Recently, with the emergence of large language models, multimodal LLMs have
demonstrated exceptional capabilities in image and video modalities. Despite
advancements in video comprehension, the substantial computational demands of
long video sequences lead current video LLMs (Vid-LLMs) to employ compression
strategies at both the inter-frame level (e.g., uniform sampling of video
frames) and intra-frame level (e.g., condensing all visual tokens of each frame
into a limited number). However, this approach often neglects the uneven
temporal distribution of critical information across frames, risking the
omission of keyframes that contain essential temporal and semantic details. To
tackle these challenges, we propose KFFocus, a method designed to efficiently
compress video tokens and emphasize the informative context present within
video frames. We substitute uniform sampling with a refined approach inspired
by classic video compression principles to identify and capture keyframes based
on their temporal redundancy. By assigning varying condensation ratios to
frames based on their contextual relevance, KFFocus efficiently reduces token
redundancy while preserving informative content details. Additionally, we
introduce a spatiotemporal modeling module that encodes both the temporal
relationships between video frames and the spatial structure within each frame,
thus providing Vid-LLMs with a nuanced understanding of spatial-temporal
dynamics. Extensive experiments on widely recognized video understanding
benchmarks, especially long video scenarios, demonstrate that KFFocus
significantly outperforms existing methods, achieving substantial computational
efficiency and enhanced accuracy.
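Keyframe selection based on temporal redundancy can be illustrated with a simple frame-difference rule, loosely analogous to how video codecs place reference frames. The abstract does not give the actual criterion, so the mean absolute difference, the fixed `threshold`, and the function names are assumptions.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two flattened frames."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def select_keyframes(frames, threshold):
    """Keep the first frame, then every frame that differs from the most
    recent keyframe by more than `threshold` (i.e. is not redundant)."""
    if not frames:
        return []
    keyframes = [0]
    for index in range(1, len(frames)):
        last_key = frames[keyframes[-1]]
        if mean_abs_diff(frames[index], last_key) > threshold:
            keyframes.append(index)
    return keyframes


# Toy two-pixel frames: two scene changes, at index 2 and index 4.
frames = [[0, 0], [1, 0], [10, 10], [10, 11], [0, 0]]
print(select_keyframes(frames, threshold=5))
```

Frames between keyframes could then be given a higher condensation ratio, since most of their content is already covered by the preceding keyframe.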
β ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation ICDAR2025
Colors play a crucial role in the design of vector graphic documents by
enhancing visual appeal, facilitating communication, improving usability, and
ensuring accessibility. In this context, color recommendation involves
suggesting appropriate colors to complete or refine a design when one or more
colors are missing or require alteration. Traditional methods often struggled
with these challenges due to the complex nature of color design and the limited
data availability. In this study, we explored the use of pretrained Large
Language Models (LLMs) and their commonsense reasoning capabilities for color
recommendation, raising the question: Can pretrained LLMs serve as superior
designers for color recommendation tasks? To investigate this, we developed a
robust, rigorously validated pipeline, ColorGPT, that was built by
systematically testing multiple color representations and applying effective
prompt engineering techniques. Our approach primarily targeted color palette
completion by recommending colors based on a set of given colors and
accompanying context. Moreover, our method can be extended to full palette
generation, producing an entire color palette corresponding to a provided
textual description. Experimental results demonstrated that our LLM-based
pipeline outperformed existing methods in terms of color suggestion accuracy
and the distribution of colors in the color palette completion task. For the
full palette generation task, our approach also yielded improvements in color
diversity and similarity compared to current techniques.
comment: Accepted to ICDAR2025
β TaoCache: Structure-Maintained Video Generation Acceleration
Existing cache-based acceleration methods for video diffusion models
primarily skip early or mid denoising steps, which often leads to structural
discrepancies relative to full-timestep generation and can hinder instruction
following and character consistency. We present TaoCache, a training-free,
plug-and-play caching strategy that, instead of residual-based caching, adopts
a fixed-point perspective to predict the model's noise output and is
specifically effective in late denoising stages. By calibrating cosine
similarities and norm ratios of consecutive noise deltas, TaoCache preserves
high-resolution structure while enabling aggressive skipping. The approach is
orthogonal to complementary accelerations such as Pyramid Attention Broadcast
(PAB) and TeaCache, and it integrates seamlessly into DiT-based frameworks.
Across Latte-1, OpenSora-Plan v110, and Wan2.1, TaoCache attains substantially
higher visual quality (LPIPS, SSIM, PSNR) than prior caching methods under the
same speedups.
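The two calibration statistics named in the TaoCache abstract, the cosine similarity and the norm ratio of consecutive noise deltas, can be illustrated with a minimal sketch. The function names and toy vectors below are assumptions made for illustration; the abstract does not say how TaoCache turns these statistics into a skipping decision, so that part is not shown.

```python
import math


def delta(a, b):
    """Element-wise difference b - a between two noise predictions."""
    return [y - x for x, y in zip(a, b)]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))


def delta_statistics(eps_prev2, eps_prev1, eps_curr):
    """Cosine similarity and norm ratio of two consecutive noise deltas."""
    d_prev = delta(eps_prev2, eps_prev1)
    d_curr = delta(eps_prev1, eps_curr)
    return cosine_similarity(d_prev, d_curr), norm(d_curr) / norm(d_prev)


# Toy noise predictions at three consecutive denoising steps (hypothetical).
cos_sim, ratio = delta_statistics([1.0, 0.0], [2.0, 0.0], [2.5, 0.0])
print(cos_sim, ratio)
```

In this toy case the deltas point in the same direction and shrink by half, which is the kind of regular behaviour that would make a later output predictable.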
β Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
The Earth's surface is constantly changing, and detecting these changes
provides valuable insights that benefit various aspects of human society. While
traditional change detection methods have been employed to detect changes from
bi-temporal images, these approaches typically require expert knowledge for
accurate interpretation. To enable broader and more flexible access to change
information by non-expert users, the task of Change Detection Visual Question
Answering (CDVQA) has been introduced. However, existing CDVQA methods have
been developed under the assumption that training and testing datasets share
similar distributions. This assumption does not hold in real-world
applications, where domain shifts often occur. In this paper, the CDVQA task is
revisited with a focus on addressing domain shift. To this end, a new
multi-modal and multi-domain dataset, BrightVQA, is introduced to facilitate
domain generalization research in CDVQA. Furthermore, a novel state space
model, termed Text-Conditioned State Space Model (TCSSM), is proposed. The
TCSSM framework is designed to leverage both bi-temporal imagery and
geo-disaster-related textual information in a unified manner to extract
domain-invariant features across domains. Input-dependent parameters existing
in TCSSM are dynamically predicted by using both bi-temporal images and
geo-disaster-related descriptions, thereby facilitating the alignment between
bi-temporal visual data and the associated textual descriptions. Extensive
experiments are conducted to evaluate the proposed method against
state-of-the-art models, and superior performance is consistently demonstrated.
The code and dataset will be made publicly available upon acceptance at
https://github.com/Elman295/TCSSM.
β Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation ICCV 2025
Storytelling tasks involving generating consistent subjects have gained
significant attention recently. However, existing methods, whether
training-free or training-based, continue to face challenges in maintaining
subject consistency due to the lack of fine-grained guidance and inter-frame
interaction. Additionally, the scarcity of high-quality data in this field
makes it difficult to precisely control storytelling tasks, including the
subject's position, appearance, clothing, expression, and posture, thereby
hindering further advancements. In this paper, we demonstrate that layout
conditions, such as the subject's position and detailed attributes, effectively
facilitate fine-grained interactions between frames. This not only strengthens
the consistency of the generated frame sequence but also allows for precise
control over the subject's position, appearance, and other key details.
Building on this, we introduce an advanced storytelling task: Layout-Togglable
Storytelling, which enables precise subject control by incorporating layout
conditions. To address the lack of high-quality datasets with layout
annotations for this task, we develop Lay2Story-1M, which contains over 1
million 720p and higher-resolution images, processed from approximately 11,300
hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a
benchmark with 3,000 prompts designed to evaluate the performance of different
methods on this task. Furthermore, we propose Lay2Story, a robust framework
based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable
Storytelling tasks. Through both qualitative and quantitative experiments, we
find that our method outperforms the previous state-of-the-art (SOTA)
techniques, achieving the best results in terms of consistency, semantic
correlation, and aesthetic quality.
comment: Accepted by ICCV 2025
β UniSTFormer: Unified Spatio-Temporal Lightweight Transformer for Efficient Skeleton-Based Action Recognition
Skeleton-based action recognition (SAR) has achieved impressive progress with
transformer architectures. However, existing methods often rely on complex
module compositions and heavy designs, leading to increased parameter counts,
high computational costs, and limited scalability. In this paper, we propose a
unified spatio-temporal lightweight transformer framework that integrates
spatial and temporal modeling within a single attention module, eliminating the
need for separate temporal modeling blocks. This approach reduces redundant
computations while preserving temporal awareness within the spatial modeling
process. Furthermore, we introduce a simplified multi-scale pooling fusion
module that combines local and global pooling pathways to enhance the model's
ability to capture fine-grained local movements and overarching global motion
patterns. Extensive experiments on benchmark datasets demonstrate that our
lightweight model achieves a superior balance between accuracy and efficiency,
reducing parameter complexity by over 58% and lowering computational cost by
over 60% compared to state-of-the-art transformer-based baselines, while
maintaining competitive recognition performance.
β MADPromptS: Unlocking Zero-Shot Morphing Attack Detection with Multiple Prompt Aggregation
Face Morphing Attack Detection (MAD) is a critical challenge in face
recognition security, where attackers can fool systems by interpolating the
identity information of two or more individuals into a single face image,
resulting in samples that can be verified as belonging to multiple identities
by face recognition systems. While multimodal foundation models (FMs) like CLIP
offer strong zero-shot capabilities by jointly modeling images and text, most
prior works on FMs for biometric recognition have relied on fine-tuning for
specific downstream tasks, neglecting their potential for direct, generalizable
deployment. This work explores a pure zero-shot approach to MAD by leveraging
CLIP without any additional training or fine-tuning, focusing instead on the
design and aggregation of multiple textual prompts per class. By aggregating
the embeddings of diverse prompts, we better align the model's internal
representations with the MAD task, capturing richer and more varied cues
indicative of bona-fide or attack samples. Our results show that prompt
aggregation substantially improves zero-shot detection performance,
demonstrating the effectiveness of exploiting foundation models' built-in
multimodal knowledge through efficient prompt engineering.
comment: Accepted at ACM Multimedia Workshops
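A minimal sketch of zero-shot classification with prompt aggregation, as described in the MADPromptS abstract. It assumes that aggregation means averaging L2-normalized prompt embeddings per class, which the abstract does not confirm. The three-dimensional vectors are toy stand-ins for CLIP text and image embeddings, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def aggregate_prompts(prompt_embeddings):
    """Mean of L2-normalized prompt embeddings, re-normalized (assumed scheme)."""
    normalized = [l2_normalize(e) for e in prompt_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def zero_shot_classify(image_embedding, class_prompt_embeddings):
    """Pick the class whose aggregated prompt embedding is most similar."""
    image = l2_normalize(image_embedding)
    scores = {}
    for label, prompts in class_prompt_embeddings.items():
        prototype = aggregate_prompts(prompts)
        scores[label] = sum(a * b for a, b in zip(image, prototype))
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for encoder outputs (hypothetical values).
toy_prompts = {
    "bona fide": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "morphing attack": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
label, scores = zero_shot_classify([0.2, 0.8, 0.0], toy_prompts)
print(label)
```

In practice the embeddings would come from a pretrained text and image encoder; only the aggregation and comparison steps are sketched here.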
β Accelerated Volumetric Compression without Hierarchies: A Fourier Feature Based Implicit Neural Representation Approach
Volumetric data compression is critical in fields like medical imaging,
scientific simulation, and entertainment. We introduce a structure-free neural
compression method combining Fourier feature encoding with selective voxel
sampling, yielding compact volumetric representations and faster convergence.
Our dynamic voxel selection uses morphological dilation to prioritize active
regions, reducing redundant computation without any hierarchical metadata. In
the experiment, sparse training reduced training time by 63.7% (from 30 to 11
minutes) with only minor quality loss: PSNR dropped 0.59 dB (from 32.60 to
32.01) and SSIM by 0.008 (from 0.948 to 0.940). The resulting neural
representation, stored solely as network weights, achieves a compression rate
of 14 and eliminates traditional data-loading overhead. This connects
coordinate-based neural representation with efficient volumetric compression,
offering a scalable, structure-free solution for practical applications.
comment: 2 pages, accepted for the VIS IEEE 2025 poster
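To illustrate the Fourier feature encoding mentioned in the abstract above, here is a minimal sketch that maps a voxel coordinate to sine and cosine features before it would be fed to a coordinate network. The power-of-two frequency schedule, the function name, and the default of four frequencies are assumptions; the abstract does not state which frequencies the authors use.

```python
import math


def fourier_features(coord, num_frequencies=4):
    """Encode a coordinate in [0, 1]^d as sin/cos pairs at frequencies 2^k * pi.

    The frequency schedule is an assumed, commonly used choice.
    """
    features = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi
        for x in coord:
            features.append(math.sin(freq * x))
            features.append(math.cos(freq * x))
    return features


# A normalized 3D voxel coordinate (hypothetical).
encoded = fourier_features((0.5, 0.25, 0.0))
print(len(encoded))
```

Each coordinate contributes two features per frequency, so a 3D coordinate with four frequencies yields 24 inputs to the network.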
β Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions
Ultrasound (US) imaging is increasingly used in spinal procedures due to its
real-time, radiation-free capabilities; however, its effectiveness is hindered
by shadowing artifacts that obscure deeper tissue structures. Traditional
approaches, such as CT-to-US registration, incorporate anatomical information
from preoperative CT scans to guide interventions, but they are limited by
complex registration requirements, differences in spine curvature, and the need
for recent CT imaging. Recent shape completion methods can offer an alternative
by reconstructing spinal structures in US data, while being pretrained on a large
set of publicly available CT scans. However, these approaches are typically
offline and have limited reproducibility. In this work, we introduce a novel
integrated system that combines robotic ultrasound with real-time shape
completion to enhance spinal visualization. Our robotic platform autonomously
acquires US sweeps of the lumbar spine, extracts vertebral surfaces from
ultrasound, and reconstructs the complete anatomy using a deep learning-based
shape completion network. This framework provides interactive, real-time
visualization with the capability to autonomously repeat scans and can enable
navigation to target locations. This can contribute to better consistency,
reproducibility, and understanding of the underlying anatomy. We validate our
approach through quantitative experiments assessing shape completion accuracy
and evaluations of multiple spine acquisition protocols on a phantom setup.
Additionally, we present qualitative results of the visualization on a
volunteer scan.
β A Pseudo Global Fusion Paradigm-Based Cross-View Network for LiDAR-Based Place Recognition
LiDAR-based Place Recognition (LPR) remains a critical task in Embodied
Artificial Intelligence (AI) and Autonomous Driving, primarily addressing
localization challenges in GPS-denied environments and supporting loop closure
detection. Existing approaches reduce place recognition to a Euclidean
distance-based metric learning task, neglecting the feature space's intrinsic
structures and intra-class variances. Such a Euclidean-centric formulation
inherently limits the model's capacity to capture nonlinear data distributions,
leading to suboptimal performance in complex environments and temporally varying
scenarios. To address these challenges, we propose a novel cross-view network
based on an innovative fusion paradigm. Our framework introduces a
pseudo-global information guidance mechanism that coordinates multi-modal
branches to perform feature learning within a unified semantic space.
Concurrently, we propose a Manifold Adaptation and Pairwise Variance-Locality
Learning Metric that constructs a Symmetric Positive Definite (SPD) matrix to
compute Mahalanobis distance, superseding traditional Euclidean distance
metrics. This geometric formulation enables the model to accurately
characterize intrinsic data distributions and capture complex inter-class
dependencies within the feature space. Experimental results demonstrate that
the proposed algorithm achieves competitive performance, particularly excelling
in complex environmental conditions.
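The distance computation described above can be sketched as follows: build a Symmetric Positive Definite matrix M and use it in place of the identity in the squared-distance form. Constructing M as L L^T + eps * I is a standard way to guarantee positive definiteness and is an assumption here; the abstract does not say how the authors parameterize the matrix. All names are hypothetical.

```python
import math


def spd_from_factor(L, eps=1e-6):
    """Return M = L L^T + eps * I, which is SPD for any real L and eps > 0."""
    n = len(L)
    cols = len(L[0])
    M = [[sum(L[i][k] * L[j][k] for k in range(cols)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        M[i][i] += eps
    return M


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a symmetric positive definite M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * mi for di, mi in zip(d, Md)))


# With M = diag(4, 1) the first feature dimension is weighted more heavily.
M = spd_from_factor([[2.0, 0.0], [0.0, 1.0]], eps=0.0)
dist = mahalanobis([3.0, 4.0], [0.0, 0.0], M)
print(dist)
```

Setting M to the identity recovers the Euclidean distance, so the learned matrix can be read as a reweighting and rotation of the feature space.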
β Automatic and standardized surgical reporting for central nervous system tumors
David Bouget, Mathilde Gajda Faanes, Asgeir Store Jakola, Frederik Barkhof, Hilko Ardon, Lorenzo Bello, Mitchel S. Berger, Shawn L. Hervey-Jumper, Julia Furtner, Albert J. S. Idema, Barbara Kiesel, Georg Widhalm, Rishi Nandoe Tewarie, Emmanuel Mandonnet, Pierre A. Robe, Michiel Wagemakers, Timothy R. Smith, Philip C. De Witt Hamer, Ole Solheim, Ingerid Reinertsen
Magnetic resonance (MR) imaging is essential for evaluating central nervous
system (CNS) tumors, guiding surgical planning, treatment decisions, and
assessing postoperative outcomes and complication risks. While recent work has
advanced automated tumor segmentation and report generation, most efforts have
focused on preoperative data, with limited attention to postoperative imaging
analysis. This study introduces a comprehensive pipeline for standardized
postsurgical reporting in CNS tumors. Using the Attention U-Net architecture,
segmentation models were trained for the preoperative (non-enhancing) tumor
core, postoperative contrast-enhancing residual tumor, and resection cavity.
Additionally, MR sequence classification and tumor type identification for
contrast-enhancing lesions were explored using the DenseNet architecture. The
models were integrated into a reporting pipeline, following the RANO 2.0
guidelines. Training was conducted on multicentric datasets comprising 2000 to
7000 patients, using a 5-fold cross-validation. Evaluation included patient-,
voxel-, and object-wise metrics, with benchmarking against the latest BraTS
challenge results. The segmentation models achieved average voxel-wise Dice
scores of 87%, 66%, 70%, and 77% for the tumor core, non-enhancing tumor core,
contrast-enhancing residual tumor, and resection cavity, respectively.
Classification models reached 99.5% balanced accuracy in MR sequence
classification and 80% in tumor type classification. The pipeline presented in
this study enables robust, automated segmentation, MR sequence classification,
and standardized report generation aligned with RANO 2.0 guidelines, enhancing
postoperative evaluation and clinical decision-making. The proposed models and
methods were integrated into Raidionics, an open-source software platform for CNS
tumor analysis, now including a dedicated module for postsurgical analysis.
comment: 16 pages, 6 figures, 9 tables
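For readers unfamiliar with the voxel-wise Dice score reported above, a minimal sketch of the metric on flattened binary masks follows. The function name and toy masks are illustrative, and returning 1.0 when both masks are empty is an assumed convention, not necessarily the one used in the study.

```python
def dice_score(pred, target):
    """Dice = 2 |P intersect T| / (|P| + |T|) for flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy flattened masks: 1 marks a tumor voxel (hypothetical data).
predicted = [1, 1, 0, 0, 1]
reference = [1, 0, 0, 1, 1]
score = dice_score(predicted, reference)
print(round(score, 4))
```

Here two of the three predicted voxels overlap the three reference voxels, giving 2 * 2 / 6, or about 0.667.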
β Masked Clustering Prediction for Unsupervised Point Cloud Pre-training
Vision transformers (ViTs) have recently been widely applied to 3D point
cloud understanding, with masked autoencoding as the predominant pre-training
paradigm. However, the challenge of learning dense and informative semantic
features from point clouds via standard ViTs remains underexplored. We propose
MaskClu, a novel unsupervised pre-training method for ViTs on 3D point clouds
that integrates masked point modeling with clustering-based learning. MaskClu
is designed to reconstruct both cluster assignments and cluster centers from
masked point clouds, thus encouraging the model to capture dense semantic
information. Additionally, we introduce a global contrastive learning mechanism
that enhances instance-level feature learning by contrasting different masked
views of the same point cloud. By jointly optimizing these complementary
objectives, i.e., dense semantic reconstruction and instance-level contrastive
learning, MaskClu enables ViTs to learn richer and more semantically meaningful
representations from 3D point clouds. We validate the effectiveness of our
method via multiple 3D tasks, including part segmentation, semantic
segmentation, object detection, and classification, where MaskClu sets new
competitive results. The code and models will be released
at: https://github.com/Amazingren/maskclu.
comment: 3D point cloud pretraining method. 8 pages in the main manuscript
β A Robust Epipolar-Domain Regularization Algorithm for Light Field Depth Estimation
Robust depth estimation in light field imaging remains a critical challenge
for pattern recognition applications such as augmented reality, biomedical
imaging, and scene reconstruction. While existing approaches often rely heavily
on deep convolutional neural networks, they tend to incur high computational
costs and struggle in noisy real-world environments. This paper proposes a
novel lightweight depth estimation pipeline that integrates light field-based
disparity information with a directed random walk refinement algorithm. Unlike
traditional CNN-based methods, our approach enhances depth map consistency
without requiring extensive training or large-scale datasets. The proposed
method was evaluated on the 4D Light Field Benchmark dataset and a diverse set
of real-world images. Experimental results indicate that while performance
slightly declines under uncontrolled conditions, the algorithm consistently
maintains low computational complexity and competitive accuracy compared to
state-of-the-art deep learning models. These findings highlight the potential
of our method as a robust and efficient alternative for depth estimation and
segmentation in light field imaging. The work provides insights into practical
algorithm design for light field-based pattern recognition and opens new
directions for integrating probabilistic graph models with depth sensing
frameworks.
β Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos ICCV 2025
Creating realistic, fully animatable whole-body avatars from a single
portrait is challenging due to limitations in capturing subtle expressions,
body movements, and dynamic backgrounds. Current evaluation datasets and
metrics fall short in addressing these complexities. To bridge this gap, we
introduce the Whole-Body Benchmark Dataset (WB-DH), an open-source, multi-modal
benchmark designed for evaluating whole-body animatable avatar generation. Key
features include: (1) detailed multi-modal annotations for fine-grained
guidance, (2) a versatile evaluation framework, and (3) public access to the
dataset and tools at https://github.com/deepreasonings/WholeBodyBenchmark.
comment: This paper has been accepted by ICCV 2025 Workshop MMFM4
β GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments ICCV 2025
Novel view synthesis with neural models has advanced rapidly in recent years,
yet adapting these models to scene changes remains an open problem. Existing
methods are either labor-intensive, requiring extensive model retraining, or
fail to capture detailed types of changes over time. In this paper, we present
GaussianUpdate, a novel approach that combines 3D Gaussian representation with
continual learning to address these challenges. Our method effectively updates
the Gaussian radiance fields with current data while preserving information
from past scenes. Unlike existing methods, GaussianUpdate explicitly models
different types of changes through a novel multi-stage update strategy.
Additionally, we introduce a visibility-aware continual learning approach with
generative replay, enabling self-aware updating without the need to store
images. The experiments on the benchmark dataset demonstrate our method
achieves superior and real-time rendering with the capability of visualizing
changes over different times.
comment: Accepted to ICCV 2025
β Frequency-Assisted Adaptive Sharpening Scheme Considering Bitrate and Quality Tradeoff
Sharpening is a widely adopted technique to improve video quality, which can
effectively emphasize textures and alleviate blurring. However, increasing the
sharpening level comes with a higher video bitrate, resulting in degraded
Quality of Service (QoS). Furthermore, the video quality does not necessarily
improve with increasing sharpening levels, leading to issues such as
over-sharpening. Clearly, it is essential to figure out how to boost video
quality with a proper sharpening level while also controlling bandwidth costs
effectively. This paper thus proposes a novel Frequency-assisted Sharpening
level Prediction model (FreqSP). We first label each video with the sharpening
level correlating to the optimal bitrate and quality tradeoff as ground truth.
Then taking uncompressed source videos as inputs, the proposed FreqSP leverages
intricate CNN features and high-frequency components to estimate the optimal
sharpening level. Extensive experiments demonstrate the effectiveness of our
method.
β Adaptive High-Frequency Preprocessing for Video Coding
High-frequency components are crucial for maintaining video clarity and
realism, but they also significantly impact coding bitrate, resulting in
increased bandwidth and storage costs. This paper presents an end-to-end
learning-based framework for adaptive high-frequency preprocessing to enhance
subjective quality and save bitrate in video coding. The framework employs the
Frequency-attentive Feature pyramid Prediction Network (FFPN) to predict the
optimal high-frequency preprocessing strategy, guiding subsequent filtering
operators to achieve the optimal tradeoff between bitrate and quality after
compression. For training FFPN, we pseudo-label each training video with the
optimal strategy, determined by comparing the rate-distortion (RD) performance
across different preprocessing types and strengths. Distortion is measured
using the latest quality assessment metric. Comprehensive evaluations on
multiple datasets demonstrate the visually appealing enhancement capabilities
and bitrate savings achieved by our framework.
β DiffPhysCam: Differentiable Physics-Based Camera Simulation for Inverse Rendering and Embodied AI
We introduce DiffPhysCam, a differentiable camera simulator designed to
support robotics and embodied AI applications by enabling gradient-based
optimization in visual perception pipelines. Generating synthetic images that
closely mimic those from real cameras is essential for training visual models
and enabling end-to-end visuomotor learning. Moreover, differentiable rendering
allows inverse reconstruction of real-world scenes as digital twins,
facilitating simulation-based robotics training. However, existing virtual
cameras offer limited control over intrinsic settings, poorly capture optical
artifacts, and lack tunable calibration parameters -- hindering sim-to-real
transfer. DiffPhysCam addresses these limitations through a multi-stage
pipeline that provides fine-grained control over camera settings, models key
optical effects such as defocus blur, and supports calibration with real-world
data. It enables both forward rendering for image synthesis and inverse
rendering for 3D scene reconstruction, including mesh and material texture
optimization. We show that DiffPhysCam enhances robotic perception performance
in synthetic image tasks. As an illustrative example, we create a digital twin
of a real-world scene using inverse rendering, simulate it in a multi-physics
environment, and demonstrate navigation of an autonomous ground vehicle using
images generated by DiffPhysCam.
comment: 19 pages, 17 figures, and 4 tables
β Silicon Minds versus Human Hearts: The Wisdom of Crowds Beats the Wisdom of AI in Emotion Recognition
The ability to discern subtle emotional cues is fundamental to human social
intelligence. As artificial intelligence (AI) becomes increasingly common, AI's
ability to recognize and respond to human emotions is crucial for effective
human-AI interactions. In particular, whether such systems can match or surpass
human experts remains to be seen. However, the emotional intelligence of AI,
particularly multimodal large language models (MLLMs), remains largely
unexplored. This study evaluates the emotion recognition abilities of MLLMs
using the Reading the Mind in the Eyes Test (RMET) and its multiracial
counterpart (MRMET), and compares their performance against human participants.
Results show that, on average, MLLMs outperform humans in accurately
identifying emotions across both tests. This trend persists even when comparing
performance across low, medium, and expert-level performing groups. Yet when we
aggregate independent human decisions to simulate collective intelligence,
human groups significantly surpass the performance of aggregated MLLM
predictions, highlighting the wisdom of the crowd. Moreover, a collaborative
approach (augmented intelligence) that combines human and MLLM predictions
achieves greater accuracy than either humans or MLLMs alone. These results
suggest that while MLLMs exhibit strong emotion recognition at the individual
level, the collective intelligence of humans and the synergistic potential of
human-AI collaboration offer the most promising path toward effective emotional
AI. We discuss the implications of these findings for the development of
emotionally intelligent AI systems and future research directions.
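The aggregation effect described above can be illustrated with a small sketch. It assumes that independent decisions are combined by plurality vote per test item, which the abstract does not specify. The raters, answers, and function names are made up for illustration and are not data from the study.

```python
from collections import Counter


def plurality_vote(answers):
    """Most frequent answer; ties resolve to the earliest-seen answer."""
    return Counter(answers).most_common(1)[0][0]


def accuracy(predictions, truth):
    """Fraction of items answered correctly."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Hypothetical ground truth and three raters who each miss a different item.
truth = ["playful", "upset", "desire"]
raters = [
    ["playful", "upset", "worried"],
    ["playful", "bored", "desire"],
    ["irritated", "upset", "desire"],
]

# Aggregate item by item across raters.
crowd = [plurality_vote(item_answers) for item_answers in zip(*raters)]
individual = [accuracy(r, truth) for r in raters]
crowd_accuracy = accuracy(crowd, truth)
print(individual, crowd_accuracy)
```

Because the three raters make their errors on different items, the vote corrects every individual mistake, which is the mechanism behind the wisdom-of-crowds effect when errors are independent.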
β A Parametric Bi-Directional Curvature-Based Framework for Image Artifact Classification and Quantification
This work presents a novel framework for No-Reference Image Quality
Assessment (NR-IQA) founded on the analysis of directional image curvature.
Within this framework, we define a measure of Anisotropic Texture Richness
(ATR), which is computed at the pixel level using two tunable thresholds -- one
permissive and one restrictive -- that quantify orthogonal texture suppression.
When its parameters are optimized for a specific artifact, the resulting ATR
score serves as a high-performance quality metric, achieving Spearman
correlations with human perception of approximately -0.93 for Gaussian blur and
-0.95 for white noise on the LIVE dataset. The primary contribution is a
two-stage system that leverages the differential response of ATR to various
distortions. First, the system utilizes the signature from two specialist ATR
configurations to classify the primary artifact type (blur vs. noise) with over
97% accuracy. Second, following classification, it employs a dedicated
regression model mapping the relevant ATR score to a quality rating to quantify
the degradation. On a combined dataset, the complete system predicts human
scores with a coefficient of determination (R2) of 0.892 and a Root Mean Square
Error (RMSE) of 5.17 DMOS points. This error corresponds to just 7.4% of the
dataset's total quality range, demonstrating high predictive accuracy. This
establishes our framework as a robust, dual-purpose tool for the classification
and subsequent quantification of image degradation.
β 3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs
Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong
capabilities in learning joint representations from text and images. However,
their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel
framework that enables the generation of 3D object prototypes directly from
MLLMs, including geometry and part labels. Our pipeline is agentic, comprising
a designer, coder, and visual inspector operating in a refinement loop.
Notably, our approach requires no additional training data or detailed user
instructions. Building on prior work in 2D generation, we demonstrate that
rendered images produced by our framework can be effectively used for image
classification pretraining tasks and outperform previous methods by 15%. As a
compelling real-world use case, we show that the generated prototypes can be
leveraged to improve fine-grained vision-language models by using the rendered,
part-labeled prototypes to fine-tune CLIP for part segmentation and achieving a
55% accuracy improvement without relying on any additional human-labeled data.
β TARA: Token-Aware LoRA for Composable Personalization in Diffusion Models
Personalized text-to-image generation aims to synthesize novel images of a
specific subject or style using only a few reference images. Recent methods
based on Low-Rank Adaptation (LoRA) enable efficient single-concept
customization by injecting lightweight, concept-specific adapters into
pre-trained diffusion models. However, combining multiple LoRA modules for
multi-concept generation often leads to identity missing and visual feature
leakage. In this work, we identify two key issues behind these failures: (1)
token-wise interference among different LoRA modules, and (2) spatial
misalignment between the attention map of a rare token and its corresponding
concept-specific region. To address these issues, we propose Token-Aware LoRA
(TARA), which introduces a token mask to explicitly constrain each module to
focus on its associated rare token to avoid interference, and a training
objective that encourages the spatial attention of a rare token to align with
its concept region. Our method enables training-free multi-concept composition
by directly injecting multiple independently trained TARA modules at inference
time. Experimental results demonstrate that TARA enables efficient
multi-concept inference and effectively preserves the visual identity of each
concept by avoiding mutual interference between LoRA modules. The code and
models are available at https://github.com/YuqiPeng77/TARA.
β Revisiting Efficient Semantic Segmentation: Learning Offsets for Better Spatial and Class Feature Alignment ICCV 2025
Semantic segmentation is fundamental to vision systems requiring pixel-level
scene understanding, yet deploying it on resource-constrained devices demands
efficient architectures. Although existing methods achieve real-time inference
through lightweight designs, we reveal their inherent limitation: misalignment
between class representations and image features caused by a per-pixel
classification paradigm. With experimental analysis, we find that this paradigm
results in a highly challenging assumption for efficient scenarios: Image pixel
features should not vary for the same category in different images. To address
this dilemma, we propose a coupled dual-branch offset learning paradigm that
explicitly learns feature and class offsets to dynamically refine both class
representations and spatial image features. Based on the proposed paradigm, we
construct an efficient semantic segmentation network, OffSeg. Notably, the
offset learning paradigm can be applied to existing methods with no additional
architectural changes. Extensive experiments on four datasets, including
ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent
improvements with negligible parameters. For instance, on the ADE20K dataset,
our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and
Mask2Former-Tiny by 2.7%, 1.9%, and 2.6% mIoU, respectively, with only 0.1-0.2M
additional parameters required.
comment: Accepted at ICCV 2025. Project page:
https://github.com/HVision-NKU/OffSeg
β Identity-Preserving Aging and De-Aging of Faces in the StyleGAN Latent Space
Face aging or de-aging with generative AI has gained significant attention
for its applications in fields such as forensics, security, and media.
However, most state-of-the-art methods rely on conditional Generative
Adversarial Networks (GANs), Diffusion-based models, or Visual Language Models
(VLMs) to age or de-age faces based on predefined age categories and
conditioning via loss functions, fine-tuning, or text prompts. The reliance on
such conditioning leads to complex training requirements, increased data needs,
and challenges in generating consistent results. Additionally, identity
preservation is rarely taken into account or is evaluated on a single face
recognition system without any control or guarantees on whether identity would
be preserved in a generated aged/de-aged face. In this paper, we propose to
synthesize aged and de-aged faces by editing the latent space of StyleGAN2 using a
simple support vector modeling of aging/de-aging direction and several feature
selection approaches. By using two state-of-the-art face recognition systems,
we empirically find the identity preserving subspace within the StyleGAN2
latent space, so that the apparent age of a given face can be changed while
preserving the identity. We then propose a simple yet practical formula for
estimating the limits on aging/de-aging parameters that ensures identity
preservation for a given input face. Using our method and estimated parameters
we have generated a public dataset of synthetic faces at different ages that
can be used for benchmarking cross-age face recognition, age assurance systems,
or systems for detection of synthetic images. Our code and dataset are
available at the project page https://www.idiap.ch/paper/agesynth/
comment: Accepted for publication in IEEE International Joint Conference on
Biometrics (IJCB), 2025
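The latent-space edit described in this abstract can be illustrated with a minimal sketch. It assumes the aging direction is the normal of a separating hyperplane fitted on latent codes, and that per-face limits on the step size are already known. The function names, the toy vectors, and the limit values are hypothetical; the abstract does not give the authors' formula for the limits, so it is not reproduced here.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def edit_latent(w, normal, alpha, alpha_min, alpha_max):
    """Move latent code w along the unit hyperplane normal by a clamped step.

    alpha > 0 is taken to mean aging and alpha < 0 de-aging (an assumption).
    The clamp stands in for the identity-preserving limits.
    """
    a = max(alpha_min, min(alpha, alpha_max))
    n = unit(normal)
    return [wi + a * ni for wi, ni in zip(w, n)]

# Toy 3-D example; real StyleGAN2 latent codes have many more dimensions.
w = [0.0, 1.0, 2.0]
normal = [3.0, 0.0, 4.0]  # hypothetical hyperplane normal
aged = edit_latent(w, normal, alpha=5.0, alpha_min=-2.0, alpha_max=2.0)
```

The requested step of 5.0 is clamped to 2.0 here, so the edited code stays within the assumed identity-preserving range.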
β MonoPartNeRF: Human Reconstruction from Monocular Video via Part-Based Neural Radiance Fields
In recent years, Neural Radiance Fields (NeRF) have achieved remarkable
progress in dynamic human reconstruction and rendering. Part-based rendering
paradigms, guided by human segmentation, allow for flexible parameter
allocation based on structural complexity, thereby enhancing representational
efficiency. However, existing methods still struggle with complex pose
variations, often producing unnatural transitions at part boundaries and
failing to reconstruct occluded regions accurately in monocular settings. We
propose MonoPartNeRF, a novel framework for monocular dynamic human rendering
that ensures smooth transitions and robust occlusion recovery. First, we build
a bidirectional deformation model that combines rigid and non-rigid
transformations to establish a continuous, reversible mapping between
observation and canonical spaces. Sampling points are projected into a
parameterized surface-time space (u, v, t) to better capture non-rigid motion.
A consistency loss further suppresses deformation-induced artifacts and
discontinuities. We introduce a part-based pose embedding mechanism that
decomposes global pose vectors into local joint embeddings based on body
regions. This is combined with keyframe pose retrieval and interpolation, along
three orthogonal directions, to guide pose-aware feature sampling. A learnable
appearance code is integrated via attention to model dynamic texture changes
effectively. Experiments on the ZJU-MoCap and MonoCap datasets demonstrate that
our method significantly outperforms prior approaches under complex pose and
occlusion conditions, achieving superior joint alignment, texture fidelity, and
structural continuity.
β Region-Adaptive Video Sharpening via Rate-Perception Optimization
Sharpening is a widely adopted video enhancement technique. However, uniform
sharpening intensity ignores texture variations, degrading video quality.
Sharpening also increases bitrate, and there's a lack of techniques to
optimally allocate these additional bits across diverse regions. Thus, this
paper proposes RPO-AdaSharp, an end-to-end region-adaptive video sharpening
model for both perceptual enhancement and bitrate savings. We use the coding
tree unit (CTU) partition mask as prior information to guide and constrain the
allocation of increased bits. Experiments on benchmarks demonstrate the
effectiveness of the proposed model qualitatively and quantitatively.
β DiffPose-Animal: A Language-Conditioned Diffusion Framework for Animal Pose Estimation
Animal pose estimation is a fundamental task in computer vision, with growing
importance in ecological monitoring, behavioral analysis, and intelligent
livestock management. Compared to human pose estimation, animal pose estimation
is more challenging due to high interspecies morphological diversity, complex
body structures, and limited annotated data. In this work, we introduce
DiffPose-Animal, a novel diffusion-based framework for top-down animal pose
estimation. Unlike traditional heatmap regression methods, DiffPose-Animal
reformulates pose estimation as a denoising process under the generative
framework of diffusion models. To enhance semantic guidance during keypoint
generation, we leverage large language models (LLMs) to extract both global
anatomical priors and local keypoint-wise semantics based on species-specific
prompts. These textual priors are encoded and fused with image features via
cross-attention modules to provide biologically meaningful constraints
throughout the denoising process. Additionally, a diffusion-based keypoint
decoder is designed to progressively refine pose predictions, improving
robustness to occlusion and annotation sparsity. Extensive experiments on
public animal pose datasets demonstrate the effectiveness and generalization
capability of our method, especially under challenging scenarios with diverse
species, cluttered backgrounds, and incomplete keypoints.
comment: 13 pages, 2 figures
β SHREC 2025: Retrieval of Optimal Objects for Multi-modal Enhanced Language and Spatial Assistance (ROOMELSA)
Trong-Thuan Nguyen, Viet-Tham Huynh, Quang-Thuc Nguyen, Hoang-Phuc Nguyen, Long Le Bao, Thai Hoang Minh, Minh Nguyen Anh, Thang Nguyen Tien, Phat Nguyen Thuan, Huy Nguyen Phong, Bao Huynh Thai, Vinh-Tiep Nguyen, Duc-Vu Nguyen, Phu-Hoa Pham, Minh-Huy Le-Hoang, Nguyen-Khang Le, Minh-Chinh Nguyen, Minh-Quan Ho, Ngoc-Long Tran, Hien-Long Le-Hoang, Man-Khoi Tran, Anh-Duong Tran, Kim Nguyen, Quan Nguyen Hung, Dat Phan Thanh, Hoang Tran Van, Tien Huynh Viet, Nhan Nguyen Viet Thien, Dinh-Khoi Vo, Van-Loc Nguyen, Trung-Nghia Le, Tam V. Nguyen, Minh-Triet Tran
Recent 3D retrieval systems are typically designed for simple, controlled
scenarios, such as identifying an object from a cropped image or a brief
description. However, real-world scenarios are more complex, often requiring
the recognition of an object in a cluttered scene based on a vague, free-form
description. To this end, we present ROOMELSA, a new benchmark designed to
evaluate a system's ability to interpret natural language. Specifically,
ROOMELSA attends to a specific region within a panoramic room image and
accurately retrieves the corresponding 3D model from a large database. In
addition, ROOMELSA includes over 1,600 apartment scenes, nearly 5,200 rooms,
and more than 44,000 targeted queries. Empirically, while coarse object
retrieval is largely solved, only one top-performing model consistently ranked
the correct match first across nearly all test cases. Notably, a lightweight
CLIP-based model also performed well, although it struggled with subtle
variations in materials, part structures, and contextual cues, resulting in
occasional errors. These findings highlight the importance of tightly
integrating visual and language understanding. By bridging the gap between
scene-level grounding and fine-grained 3D retrieval, ROOMELSA establishes a new
benchmark for advancing robust, real-world 3D recognition systems.
β Bridging the Gap: A Framework for Real-World Video Deepfake Detection via Social Network Compression Emulation
The growing presence of AI-generated videos on social networks poses new
challenges for deepfake detection, as detectors trained under controlled
conditions often fail to generalize to real-world scenarios. A key factor
behind this gap is the aggressive, proprietary compression applied by platforms
like YouTube and Facebook, which launder low-level forensic cues. However,
replicating these transformations at scale is difficult due to API limitations
and data-sharing constraints. For these reasons, we propose a first framework
that emulates the video sharing pipelines of social networks by estimating
compression and resizing parameters from a small set of uploaded videos. These
parameters enable a local emulator capable of reproducing platform-specific
artifacts on large datasets without direct API access. Experiments on
FaceForensics++ videos shared via social networks demonstrate that our emulated
data closely matches the degradation patterns of real uploads. Furthermore,
detectors fine-tuned on emulated videos achieve comparable performance to those
trained on actual shared media. Our approach offers a scalable and practical
solution for bridging the gap between lab-based training and real-world
deployment of deepfake detectors, particularly in the underexplored domain of
compressed video content.
β Exploring Palette based Color Guidance in Diffusion Models ACM MM 2025
With the advent of diffusion models, Text-to-Image (T2I) generation has seen
substantial advancements. Current T2I models allow users to specify object
colors using linguistic color names, and some methods aim to personalize
color-object association through prompt learning. However, existing models
struggle to provide comprehensive control over the color schemes of an entire
image, especially for background elements and less prominent objects not
explicitly mentioned in prompts. This paper proposes a novel approach to
enhance color scheme control by integrating color palettes as a separate
guidance mechanism alongside prompt instructions. We investigate the
effectiveness of palette guidance by exploring various palette representation
methods within a diffusion-based image colorization framework. To facilitate
this exploration, we construct specialized palette-text-image datasets and
conduct extensive quantitative and qualitative analyses. Our results
demonstrate that incorporating palette guidance significantly improves the
model's ability to generate images with desired color schemes, enabling a more
controlled and refined colorization process.
comment: Accepted to ACM MM 2025
β Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT
Zunjie Xiao, Xiao Wu, Tianhang Liu, Lingxi Hu, Yinling Zhang, Xiaoqing Zhang, Risa Higashita, Jiang Liu
Precise lens structure segmentation is essential for the design of
intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation
networks typically weight all pixels equally under cross-entropy (CE) loss,
overlooking the fact that sub-regions of lens structures are inhomogeneous
(e.g., some regions perform better than others) and that boundary regions often
suffer from poor segmentation calibration at the pixel level. Clinically,
experts annotate different sub-regions of lens structures with varying
confidence levels, considering factors such as sub-region proportions,
ambiguous boundaries, and lens structure shapes. Motivated by this observation,
we propose an Adaptive Confidence-Wise (ACW) loss to group each lens structure
sub-region into different confidence sub-regions via a confidence threshold
from the unique region aspect, aiming to exploit the potential of expert
annotation confidence prior. Specifically, ACW clusters each target region into
low-confidence and high-confidence groups and then applies a region-weighted
loss to reweigh each confidence group. Moreover, we design an adaptive
confidence threshold optimization algorithm to adjust the confidence threshold
of ACW dynamically. Additionally, to better quantify the miscalibration errors
in boundary region segmentation, we propose a new metric, termed Boundary
Expected Calibration Error (BECE). Extensive experiments on a clinical lens
structure AS-OCT dataset and other multi-structure datasets demonstrate that
our ACW significantly outperforms competitive segmentation loss methods across
different deep segmentation networks (e.g., MedSAM). Notably, our method
surpasses CE with 6.13% IoU gain, 4.33% DSC increase, and 4.79% BECE reduction
in lens structure segmentation under U-Net. The code of this paper is available
at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
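The grouping-and-reweighting idea in this abstract can be sketched roughly as follows. The sketch assumes a fixed threshold `tau` and fixed group weights `w_low` and `w_high`. All three are hypothetical, and the adaptive threshold optimization the abstract mentions is not modelled. It illustrates the general idea and is not the authors' loss.

```python
import math

def acw_loss(probs_true, labels, tau=0.5, w_low=2.0, w_high=1.0):
    """Confidence-wise weighted cross-entropy (illustrative only).

    probs_true[i]: predicted probability of pixel i's ground-truth class.
    labels[i]:     ground-truth region id of pixel i.
    Pixels of each region are split into low/high-confidence groups by tau;
    each group's mean cross-entropy is scaled by its group weight.
    """
    groups = {}
    for p, y in zip(probs_true, labels):
        key = (y, "low" if p < tau else "high")
        groups.setdefault(key, []).append(-math.log(max(p, 1e-12)))
    total = 0.0
    for (_, conf), losses in groups.items():
        weight = w_low if conf == "low" else w_high
        total += weight * sum(losses) / len(losses)
    return total / len(groups)

# Three pixels of one region: two confident, one not.
loss = acw_loss([0.9, 0.9, 0.2], [1, 1, 1])
```

With `w_low` greater than `w_high`, the single low-confidence pixel contributes more to the loss than it would under a uniform weighting.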
β SafeFix: Targeted Model Repair via Controlled Image Generation
Deep learning models for visual recognition often exhibit systematic errors
due to underrepresented semantic subpopulations. Although existing debugging
frameworks can pinpoint these failures by identifying key failure attributes,
repairing the model effectively remains difficult. Current solutions often rely
on manually designed prompts to generate synthetic training images -- an
approach prone to distribution shift and semantic errors. To overcome these
challenges, we introduce a model repair module that builds on an interpretable
failure attribution pipeline. Our approach uses a conditional text-to-image
model to generate semantically faithful and targeted images for failure cases.
To preserve the quality and relevance of the generated samples, we further
employ a large vision-language model (LVLM) to filter the outputs, enforcing
alignment with the original data distribution and maintaining semantic
consistency. By retraining vision models with this rare-case-augmented
synthetic dataset, we significantly reduce errors associated with rare cases.
Our experiments demonstrate that this targeted repair strategy improves model
robustness without introducing new bugs. Code is available at
https://github.com/oxu2/SafeFix
β Subjective and Objective Quality Assessment of Banding Artifacts on Compressed Videos
Qi Zheng, Li-Heng Chen, Chenlong He, Neil Berkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik, Yibo Fan, Zhengzhong Tu
Although there have been notable advancements in video compression
technologies in recent years, banding artifacts remain a serious issue
affecting the quality of compressed videos, particularly on smooth regions of
high-definition videos. Noticeable banding artifacts can severely impact the
perceptual quality of videos viewed on a high-end HDTV or high-resolution
screen. Hence, there is a pressing need for a systematic investigation of the
banding video quality assessment problem for advanced video codecs. Given that
the existing publicly available datasets for studying banding artifacts are
limited to still picture data only, which cannot account for temporal banding
dynamics, we have created a first-of-a-kind open video dataset, dubbed
LIVE-YT-Banding, which consists of 160 videos generated by four different
compression parameters using the AV1 video codec. A total of 7,200 subjective
opinions are collected from a cohort of 45 human subjects. To demonstrate the
value of this new resource, we tested and compared a variety of models that
detect banding occurrences, and measure their impact on perceived quality.
Among these, we introduce an effective and efficient new no-reference (NR)
video quality evaluator which we call CBAND. CBAND leverages the properties of
the learned statistics of natural images expressed in the embeddings of deep
neural networks. Our experimental results show that the perceptual banding
prediction performance of CBAND significantly exceeds that of previous
state-of-the-art models, and is also orders of magnitude faster. Moreover,
CBAND can be employed as a differentiable loss function to optimize video
debanding models. The LIVE-YT-Banding database, code, and pre-trained model are
all publicly available at https://github.com/uniqzheng/CBAND.
β ROD: RGB-Only Fast and Efficient Off-road Freespace Detection
Off-road freespace detection is more challenging than on-road scenarios
because of the blurred boundaries of traversable areas. Previous
state-of-the-art (SOTA) methods employ multi-modal fusion of RGB images and
LiDAR data. However, due to the significant increase in inference time when
calculating surface normal maps from LiDAR data, multi-modal methods are not
suitable for real-time applications, particularly in real-world scenarios where
higher FPS is required compared to slow navigation. This paper presents a novel
RGB-only approach for off-road freespace detection, named ROD, eliminating the
reliance on LiDAR data and its computational demands. Specifically, we utilize
a pre-trained Vision Transformer (ViT) to extract rich features from RGB
images. Additionally, we design a lightweight yet efficient decoder; together,
these improve both precision and inference speed. ROD establishes a new SOTA
on ORFD and RELLIS-3D datasets, as well as an inference speed of 50 FPS,
significantly outperforming prior models.
β STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision
Vision-language models (VLMs) have made significant strides in reasoning, yet
they often struggle with complex multimodal tasks and tend to generate overly
verbose outputs. A key limitation is their reliance on chain-of-thought (CoT)
reasoning, despite many tasks benefiting from alternative topologies like trees
or graphs. To address this, we introduce STELAR-Vision, a training framework
for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline
that enriches training with diverse topological structures. Using supervised
fine-tuning and reinforcement learning, we post-train Qwen2VL models with both
accuracy and efficiency in mind. Additionally, we propose Frugal Learning,
which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H,
STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the
larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it
outperforms Phi-4-Multimodal-Instruct by up to 28.4% and
LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong
generalization. Compared to Chain-Only training, our approach achieves 4.3%
higher overall accuracy on in-distribution datasets and consistently
outperforms across all OOD benchmarks. We have released datasets, and code will
be available.
β PADReg: Physics-Aware Deformable Registration Guided by Contact Force for Ultrasound Sequences
Ultrasound deformable registration estimates spatial transformations between
pairs of deformed ultrasound images, which is crucial for capturing
biomechanical properties and enhancing diagnostic accuracy in diseases such as
thyroid nodules and breast cancer. However, ultrasound deformable registration
remains highly challenging, especially under large deformation. The inherently
low contrast, heavy noise and ambiguous tissue boundaries in ultrasound images
severely hinder reliable feature extraction and correspondence matching.
Existing methods often suffer from poor anatomical alignment and lack physical
interpretability. To address the problem, we propose PADReg, a physics-aware
deformable registration framework guided by contact force. PADReg leverages
synchronized contact force measured by robotic ultrasound systems as a physical
prior to constrain the registration. Specifically, instead of directly
predicting deformation fields, we first construct a pixel-wise stiffness map
utilizing the multi-modal information from contact force and ultrasound images.
The stiffness map is then combined with force data to estimate a dense
deformation field, through a lightweight physics-aware module inspired by
Hooke's law. This design enables PADReg to achieve physically plausible
registration with better anatomical alignment than previous methods relying
solely on image similarity. Experiments on in-vivo datasets demonstrate that it
attains an HD95 of 12.90, which is 21.34% better than state-of-the-art methods.
The source code is available at https://github.com/evelynskip/PADReg.
comment: This work has been submitted to the IEEE for possible publication
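As a rough illustration of the Hooke's-law step named in this abstract, the sketch below divides a scalar contact force by a pixel-wise stiffness map to get a displacement map (u = F / k). It assumes a single scalar force and one-dimensional compression along the probe axis. The function name and the `eps` guard are hypothetical, and the learned physics-aware module in the paper is likely more involved.

```python
def displacement_field(stiffness, force, eps=1e-6):
    """Per-pixel displacement u = F / k from Hooke's law F = k * u.

    stiffness: 2-D list of per-pixel stiffness values k.
    force:     scalar contact force F (assumed uniform over the image).
    eps:       lower bound on k to avoid division by zero.
    """
    return [[force / max(k, eps) for k in row] for row in stiffness]

stiffness = [[1.0, 2.0],
             [4.0, 0.5]]
u = displacement_field(stiffness, force=2.0)
```

Stiffer pixels (larger k) move less under the same force, which is the physical prior the abstract refers to.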
β Per-Query Visual Concept Learning
Visual concept learning, also known as text-to-image personalization, is the
process of teaching new concepts to a pretrained model. This has numerous
applications from product placement to entertainment and personalized design.
Here we show that many existing methods can be substantially augmented by
adding a personalization step that (1) is specific to the prompt and noise
seed, and (2) uses two loss terms based on self- and cross-attention,
capturing the identity of the personalized concept. Specifically, we leverage
PDM features -- previously designed to capture identity -- and show how they
can be used to improve personalized semantic similarity. We evaluate the
benefit that our method gains on top of six different personalization methods,
and several base text-to-image models (both UNet- and DiT-based). We find
significant improvements even over previous per-query personalization methods.
comment: Project page is at
https://per-query-visual-concept-learning.github.io/
β» β Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control
Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma
While recent flow-based image editing models demonstrate general-purpose
capabilities across diverse tasks, they often struggle to specialize in
challenging scenarios -- particularly those involving large-scale shape
transformations. When performing such structural edits, these methods either
fail to achieve the intended shape change or inadvertently alter non-target
regions, resulting in degraded background quality. We propose
Follow-Your-Shape, a training-free and mask-free framework that supports
precise and controllable editing of object shapes while strictly preserving
non-target content. Motivated by the divergence between inversion and editing
trajectories, we compute a Trajectory Divergence Map (TDM) by comparing
token-wise velocity differences between the inversion and denoising paths. The
TDM enables precise localization of editable regions and guides a Scheduled KV
Injection mechanism that ensures stable and faithful editing. To facilitate a
rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120
new images and enriched prompt pairs specifically curated for shape-aware
editing. Experiments demonstrate that our method achieves superior editability
and visual fidelity, particularly in tasks requiring large-scale shape
replacement.
comment: Project webpage is available at https://follow-your-shape.github.io/
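A minimal sketch of a divergence map of this kind, assuming a single timestep, an L2 distance per token, min-max normalization, and a fixed threshold. All of these choices are assumptions made for illustration, and the paper's exact definition may differ.

```python
import math

def trajectory_divergence_map(v_inv, v_edit):
    """Per-token L2 distance between two velocity fields, scaled to [0, 1].

    v_inv, v_edit: lists of per-token velocity vectors from the inversion
    and denoising (editing) paths at the same timestep.
    """
    dist = [math.sqrt(sum((a - b) ** 2 for a, b in zip(ti, te)))
            for ti, te in zip(v_inv, v_edit)]
    lo, hi = min(dist), max(dist)
    span = hi - lo
    return [(d - lo) / span if span > 0 else 0.0 for d in dist]

# Three tokens with 2-D velocities (toy values).
v_inv = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
v_edit = [[0.0, 0.0], [1.0, 3.0], [6.0, 3.0]]
tdm = trajectory_divergence_map(v_inv, v_edit)
editable = [d > 0.3 for d in tdm]  # hypothetical threshold
```

Tokens whose velocities agree between the two paths get a value near zero and are treated as background to preserve. Tokens with large divergence are marked editable.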
β» β Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
Visual effects (VFX) are essential visual enhancements fundamental to modern
cinematic production. Although video generation models offer cost-efficient
solutions for VFX production, current methods are constrained by per-effect
LoRA training, which limits generation to single effects. This fundamental
limitation impedes applications that require spatially controllable composite
effects, i.e., the concurrent generation of multiple effects at designated
locations. However, integrating diverse effects into a unified framework faces
major challenges: interference from effect variations and spatial
uncontrollability during multi-VFX joint training. To tackle these challenges,
we propose Omni-Effects, a first unified framework capable of generating
prompt-guided effects and spatially controllable composite effects. The core of
our framework comprises two key innovations: (1) LoRA-based Mixture of Experts
(LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects
within a unified model while effectively mitigating cross-task interference.
(2) Spatial-Aware Prompt (SAP) incorporates spatial mask information into the
text token, enabling precise spatial control. Furthermore, we introduce an
Independent-Information Flow (IIF) module integrated within the SAP, isolating
the control signals corresponding to individual effects to prevent any unwanted
blending. To facilitate this research, we construct a comprehensive VFX dataset
Omni-VFX via a novel data collection pipeline combining image editing and
First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX
evaluation framework for validating model performance. Extensive experiments
demonstrate that Omni-Effects achieves precise spatial control and diverse
effect generation, enabling users to specify both the category and location of
desired effects.
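The LoRA-MoE idea can be sketched as a frozen base weight plus a gated sum of low-rank expert updates. The softmax router, the shapes, and all names below are assumptions for illustration, since the abstract does not specify the routing scheme.

```python
import math

def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def lora_moe_forward(x, W, experts, router):
    """y = W x + sum_e g_e * B_e (A_e x), with gates g = softmax(router x).

    W:       frozen base weight (d_out x d_in).
    experts: list of (A, B) pairs, A is (r x d_in), B is (d_out x r).
    router:  (num_experts x d_in) matrix producing gate logits (assumed).
    """
    gates = softmax(matvec(router, x))
    y = matvec(W, x)
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))
        y = [yi + g * di for yi, di in zip(y, delta)]
    return y, gates

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # rank-1 expert acting on dimension 0
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # rank-1 expert acting on dimension 1
]
router = [[0.0, 0.0], [0.0, 0.0]]    # zero logits give uniform gates
y, gates = lora_moe_forward(x, W, experts, router)
```

Because each expert contributes only through its gate, a router that assigns a near-zero gate to an unrelated effect limits that expert's influence on the output.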
β» β Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation
Generating high-fidelity human videos that match user-specified identities is
important yet challenging in the field of generative AI. Existing methods often
rely on an excessive number of training parameters and lack compatibility with
other AIGC tools. In this paper, we propose Stand-In, a lightweight and
plug-and-play framework for identity preservation in video generation.
Specifically, we introduce a conditional image branch into the pre-trained
video generation model. Identity control is achieved through restricted
self-attentions with conditional position mapping, and can be learned quickly
with only 2000 pairs. Despite incorporating and training just ~1%
additional parameters, our framework achieves excellent results in video
quality and identity preservation, outperforming other full-parameter training
methods. Moreover, our framework can be seamlessly integrated for other tasks,
such as subject-driven video generation, pose-referenced video generation,
stylization, and face swapping.
β» β ViStoryBench: Comprehensive Benchmark Suite for Story Visualization
Cailin Zhuang, Ailin Huang, Wei Cheng, Jingwei Wu, Yaoqi Hu, Jiaqi Liao, Hongyuan Wang, Xinyao Liao, Weiwei Cai, Hengyuan Xu, Xuanyang Zhang, Xianfang Zeng, Zhewei Huang, Gang Yu, Chi Zhang
Story visualization aims to generate coherent image sequences that faithfully
depict a narrative and align with character references. Despite progress in
generative models, existing benchmarks are narrow in scope, often limited to
short prompts, no character reference, or single-image cases, and fall short of
real-world storytelling complexity. This hinders a nuanced understanding of
model capabilities and limitations. We present ViStoryBench, a comprehensive
benchmark designed to evaluate story visualization models across diverse
narrative structures, visual styles, and character settings. The benchmark
features richly annotated multi-shot scripts derived from curated stories
spanning literature, film, and folklore. Large language models assist in story
summarization and script generation, with all outputs verified by humans to
ensure coherence and fidelity. Character references are carefully curated to
maintain intra-story consistency across varying artistic styles. To enable
thorough evaluation, ViStoryBench introduces a set of automated metrics that
assess character consistency, style similarity, prompt adherence, aesthetic
quality, and generation artifacts such as copy-paste behavior. These metrics
are validated through human studies, and used to benchmark a broad range of
open-source and commercial models. ViStoryBench offers a high-fidelity,
multi-dimensional evaluation suite that facilitates systematic analysis and
fosters future progress in visual storytelling.
comment: 33 Pages, Project Page: https://vistorybench.github.io/, Code:
https://github.com/vistorybench/vistorybench
β» β CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak, Mehar Bhatia, Xiaofeng Zhang, Verena Rieser, Lisa Anne Hendricks, Sjoerd van Steenkiste, Yash Goyal, Karolina Stańczak, Aishwarya Agrawal
The increasing ubiquity of text-to-image (T2I) models as tools for visual
content generation raises concerns about their ability to accurately represent
diverse cultural contexts -- where missed cues can stereotype communities and
undermine usability. In this work, we present the first study to systematically
quantify the alignment of T2I models and evaluation metrics with respect to
both explicit (stated) as well as implicit (unstated, implied by the prompt's
cultural context) cultural expectations. To this end, we introduce
CulturalFrames, a novel benchmark designed for rigorous human evaluation of
cultural representation in visual generations. Spanning 10 countries and 5
socio-cultural domains, CulturalFrames comprises 983 prompts, 3637
corresponding images generated by 4 state-of-the-art T2I models, and over 10k
detailed human annotations. We find that across models and countries, cultural
expectations are missed an average of 44% of the time. Among these failures,
explicit expectations are missed at a surprisingly high average rate of 68%,
while implicit expectation failures are also significant, averaging 49%.
Furthermore, we show that existing T2I evaluation metrics correlate poorly with
human judgments of cultural alignment, irrespective of their internal
reasoning. Collectively, our findings expose critical gaps, provide a concrete
testbed, and outline actionable directions for developing culturally informed
T2I models and metrics that improve global usability.
β» β Euclid Quick Data Release (Q1). Active galactic nuclei identification using diffusion-based inpainting of Euclid VIS images
Euclid Collaboration, G. Stevens, S. Fotopoulou, M. N. Bremer, T. Matamoro Zatarain, K. Jahnke, B. Margalef-Bentabol, M. Huertas-Company, M. J. Smith, M. Walmsley, M. Salvato, M. Mezcua, A. Paulino-Afonso, M. Siudek, M. Talia, F. Ricci, W. Roster, N. Aghanim, B. Altieri, S. Andreon, H. Aussel, C. Baccigalupi, M. Baldi, S. Bardelli, P. Battaglia, A. Biviano, A. Bonchi, E. Branchini, M. Brescia, J. Brinchmann, S. Camera, G. Cañas-Herrera, V. Capobianco, C. Carbone, J. Carretero, M. Castellano, G. Castignani, S. Cavuoti, K. C. Chambers, A. Cimatti, C. Colodro-Conde, G. Congedo, C. J. Conselice, L. Conversi, Y. Copin, A. Costille, F. Courbin, H. M. Courtois, M. Cropper, A. Da Silva, H. Degaudenzi, G. De Lucia, C. Dolding, H. Dole, M. Douspis, F. Dubath, X. Dupac, S. Dusini, S. Escoffier, M. Farina, S. Ferriol, K. George, C. Giocoli, B. R. Granett, A. Grazian, F. Grupp, S. V. H. Haugan, I. M. Hook, F. Hormuth, A. Hornstrup, P. Hudelot, M. Jhabvala, E. Keihänen, S. Kermiche, A. Kiessling, M. Kilbinger, B. Kubik, M. Kümmel, H. Kurki-Suonio, Q. Le Boulc'h, A. M. C. Le Brun, D. Le Mignant, P. B. Lilje, V. Lindholm, I. Lloro, G. Mainetti, D. Maino, E. Maiorano, O. Marggraf, M. Martinelli, N. Martinet, F. Marulli, R. Massey, S. Maurogordato, H. J. McCracken, E. Medinaceli, S. Mei, M. Melchior, M. Meneghetti, E. Merlin, G. Meylan, A. Mora, M. Moresco, L. Moscardini, R. Nakajima, C. Neissner, S. -M. Niemi, C. Padilla, S. Paltani, F. Pasian, K. Pedersen, W. J. Percival, V. Pettorino, G. Polenta, M. Poncet, L. A. Popa, L. Pozzetti, F. Raison, R. Rebolo, A. Renzi, J. Rhodes, G. Riccio, E. Romelli, M. Roncarelli, R. Saglia, A. G. Sánchez, D. Sapone, J. A. Schewtschenko, M. Schirmer, P. Schneider, T. Schrabback, A. Secroun, S. Serrano, P. Simon, C. Sirignano, G. Sirri, J. Skottfelt, L. Stanco, J. Steinwagner, P. Tallada-Crespí, A. N. Taylor, I. Tereno, S. Toft, R. Toledo-Moreo, F. Torradeflot, I. Tutusaus, L. Valenziano, J. Valiviita, T. Vassallo, G. Verdoes Kleijn, A. Veropalumbo, Y. Wang, J. Weller, A. Zacchei, G. Zamorani, F. M. Zerbi, I. A. Zinchenko, E. Zucca, V. Allevato, M. Ballardini, M. Bolzonella, E. Bozzo, C. Burigana, R. Cabanac, A. Cappi, J. A. Escartin Vigo, L. Gabarra, W. G. Hartley, J. Martín-Fleitas, S. Matthew, R. B. Metcalf, A. Pezzotta, M. Pöntinen, I. Risso, V. Scottez, M. Sereno, M. Tenti, M. Wiesmann, Y. Akrami, S. Alvi, I. T. Andika, S. Anselmi, M. Archidiacono, F. Atrio-Barandela, D. Bertacca, M. Bethermin, L. Bisigello, A. Blanchard, L. Blot, S. Borgani, M. L. Brown, S. Bruton, A. Calabro, F. Caro, T. Castro, F. Cogato, S. Davini, G. Desprez, A. Díaz-Sánchez, J. J. Diaz, S. Di Domizio, J. M. Diego, P. -A. Duc, A. Enia, Y. Fang, A. G. Ferrari, A. Finoguenov, A. Fontana, A. Franco, J. García-Bellido, T. Gasparetto, V. Gautard, E. Gaztanaga, F. Giacomini, F. Gianotti, M. Guidi, C. M. Gutierrez, A. Hall, S. Hemmati, H. Hildebrandt, J. Hjorth, J. J. E. Kajava, Y. Kang, V. Kansal, D. Karagiannis, C. C. Kirkpatrick, S. Kruk, L. Legrand, M. Lembo, F. Lepori, G. Leroy, J. Lesgourgues, L. Leuzzi, T. I. Liaudat, J. Macias-Perez, M. Magliocchetti, F. Mannucci, R. Maoli, C. J. A. P. Martins, L. Maurin, M. Miluzio, P. Monaco, G. Morgante, K. Naidoo, A. Navarro-Alsina, F. Passalacqua, K. Paterson, L. Patrizii, A. Pisani, D. Potter, S. Quai, M. Radovich, P. -F. Rocci, G. Rodighiero, S. Sacquegna, M. Sahlén, D. B. Sanders, E. Sarpa, A. Schneider, M. Schultheis, D. Sciotti, E. Sellentin, F. Shankar, L. C. Smith, K. Tanidis, G. Testera, R. Teyssier, S. Tosi, A. Troja, M. Tucci, C. Valieri, D. Vergani, G. Verza, N. A. Walton
Light emission from galaxies exhibits diverse brightness profiles, influenced
by factors such as galaxy type, structural features and interactions with other
galaxies. Elliptical galaxies feature more uniform light distributions, while
spiral and irregular galaxies have complex, varied light profiles due to their
structural heterogeneity and star-forming activity. In addition, galaxies with
an active galactic nucleus (AGN) feature intense, concentrated emission from
gas accretion around supermassive black holes, superimposed on regular galactic
light, while quasi-stellar objects (QSO) are the extreme case of the AGN
emission dominating the galaxy. The challenge of identifying AGN and QSO has
been discussed many times in the literature, often requiring multi-wavelength
observations. This paper introduces a novel approach to identify AGN and QSO
from a single image. Diffusion models have been recently developed in the
machine-learning literature to generate realistic-looking images of everyday
objects. Utilising the spatial resolving power of the Euclid VIS images, we
created a diffusion model trained on one million sources, without using any
source pre-selection or labels. The model learns to reconstruct light
distributions of normal galaxies, since the population is dominated by them. We
condition the prediction of the central light distribution by masking the
central few pixels of each source and reconstruct the light according to the
diffusion model. We further use this prediction to identify sources that
deviate from this profile by examining the reconstruction error of the few
central pixels regenerated in each source's core. Our approach, solely using
VIS imaging, features high completeness compared to traditional methods of AGN
and QSO selection, including optical, near-infrared, mid-infrared, and X-rays.
comment: Paper submitted as part of the A&A Special Issue `Euclid Quick Data
Release (Q1)', 34 pages, 26 figures
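The mask-and-reconstruct idea in the abstract above can be illustrated with a toy anomaly score. This is only a sketch under stated assumptions: `mean_fill_inpaint` is a hypothetical stand-in for the trained diffusion model, the cutouts are tiny nested lists rather than VIS images, and the mean squared error over the masked core is one plausible choice of reconstruction error, not necessarily the one used in the paper.

```python
from typing import Callable, List

Image = List[List[float]]  # square cutout, row-major pixel fluxes
Mask = List[List[bool]]    # True where a pixel is hidden from the model


def central_mask(size: int, half_width: int) -> Mask:
    """Mark the central (2 * half_width + 1)^2 pixels of a square cutout."""
    centre = size // 2
    return [
        [abs(r - centre) <= half_width and abs(c - centre) <= half_width
         for c in range(size)]
        for r in range(size)
    ]


def mean_fill_inpaint(masked: Image, mask: Mask) -> Image:
    """Hypothetical stand-in for the diffusion model (assumption):
    fill the hidden core with the mean of the visible pixels."""
    size = len(masked)
    visible = [masked[r][c] for r in range(size) for c in range(size)
               if not mask[r][c]]
    fill = sum(visible) / len(visible)
    return [[fill if mask[r][c] else masked[r][c] for c in range(size)]
            for r in range(size)]


def core_reconstruction_error(
    image: Image,
    inpaint: Callable[[Image, Mask], Image],
    half_width: int = 1,
) -> float:
    """Mean squared error between the true and the regenerated core pixels."""
    size = len(image)
    mask = central_mask(size, half_width)
    # Hide the core before handing the cutout to the model.
    masked = [[0.0 if mask[r][c] else image[r][c] for c in range(size)]
              for r in range(size)]
    predicted = inpaint(masked, mask)
    errors = [(image[r][c] - predicted[r][c]) ** 2
              for r in range(size) for c in range(size) if mask[r][c]]
    return sum(errors) / len(errors)


# Toy cutouts: a flat "normal" source and one with a bright unresolved core.
flat_galaxy = [[1.0] * 5 for _ in range(5)]
point_source = [row[:] for row in flat_galaxy]
point_source[2][2] = 10.0

score_flat = core_reconstruction_error(flat_galaxy, mean_fill_inpaint, half_width=0)
score_agn = core_reconstruction_error(point_source, mean_fill_inpaint, half_width=0)
```

In this toy setting the source with excess central emission receives the larger score, which is the property the selection relies on; any threshold on the score would have to be calibrated on real data.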
♻ ☆ Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions
While current general-purpose 3D human models (e.g., SMPL-X) efficiently
represent accurate human shape and pose, they lack the ability to physically
interact with the environment due to their kinematic nature. As a result,
kinematic-based interaction models often suffer from issues such as
interpenetration and unrealistic object dynamics. To address this limitation,
we introduce a novel approach that embeds SMPL-X into a tangible entity capable
of dynamic physical interactions with its surroundings. Specifically, we
propose a "half-physics" mechanism that transforms 3D kinematic motion into a
physics simulation. Our approach maintains kinematic control over inherent
SMPL-X poses while ensuring physically plausible interactions with scenes and
objects, effectively eliminating penetration and unrealistic object dynamics.
Unlike reinforcement learning-based methods, which demand extensive and complex
training, our half-physics method is learning-free and generalizes to any body
shape and motion; meanwhile, it operates in real time. Moreover, it preserves
the fidelity of the original kinematic motion while seamlessly integrating
physical interactions.
♻ ☆ Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
Yang Yao, Lingyu Li, Jiaxin Song, Chiyu Chen, Zhenqi He, Yixu Wang, Xin Wang, Tianle Gu, Jie Li, Yan Teng, Yingchun Wang
As Multimodal Large Language Models (MLLMs) continue to evolve, their
cognitive and reasoning capabilities have seen remarkable progress. However,
challenges in visual fine-grained perception and commonsense causal inference
persist. This paper introduces Argus Inspection, a multimodal benchmark with
two levels of difficulty, emphasizing detailed visual recognition while
incorporating real-world commonsense understanding to evaluate causal reasoning
abilities. Expanding on it, we present the Eye of Panoptes framework, which
integrates a binary parametric Sigmoid metric with an indicator function,
enabling a more holistic evaluation of MLLMs' responses in opinion-based
reasoning tasks. Experiments conducted on 26 mainstream MLLMs reveal that the
highest performance in visual fine-grained reasoning reaches only 0.46,
highlighting considerable potential for enhancement. Our research offers
valuable perspectives for the continued refinement of MLLMs.
♻ ☆ MUG: Pseudo Labeling Augmented Audio-Visual Mamba Network for Audio-Visual Video Parsing ICCV 2025
Weakly-supervised audio-visual video parsing (AVVP) aims to predict all
modality-specific events and locate their temporal boundaries. Despite
significant progress, due to the limitations of weak supervision and the
deficiencies of the model architecture, existing methods struggle to
simultaneously improve both the segment-level prediction and the event-level
prediction. In this work, we propose an audio-visual Mamba network with pseudo
labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and
excluding the noise interference from the alternate modalities. Specifically,
we annotate some of the pseudo-labels based on previous work. Using unimodal
pseudo-labels, we perform cross-modal random combinations to generate new data,
which can enhance the model's ability to parse various segment-level event
combinations. For feature processing and interaction, we employ an audio-visual
Mamba network. The AV-Mamba enhances the ability to perceive different segments
and excludes additional modal noise while sharing similar modal information.
Our extensive experiments demonstrate that MUG improves state-of-the-art
results on the LLP dataset in all metrics (e.g., gains of 2.1% and 1.2% in terms of
visual segment-level and audio segment-level metrics). Our code is available at
https://github.com/WangLY136/MUG.
comment: Accepted by ICCV 2025
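The cross-modal random combination step described in the abstract above could look roughly like the following. This is a sketch under assumptions, not the authors' pipeline: the sample layout (`audio`, `visual`, per-segment label sets) and the helper names are hypothetical, and the video-level label is simply taken as the union of all segment-level pseudo-labels.

```python
import random
from typing import Dict, List


def combine_cross_modal(audio_src: Dict, visual_src: Dict) -> Dict:
    """Pair the audio track of one video with the visual track of another."""
    audio_labels = audio_src["audio_labels"]     # one set of events per segment
    visual_labels = visual_src["visual_labels"]  # one set of events per segment
    if len(audio_labels) != len(visual_labels):
        raise ValueError("both videos must have the same number of segments")
    return {
        "audio": audio_src["audio"],
        "visual": visual_src["visual"],
        "audio_labels": audio_labels,
        "visual_labels": visual_labels,
        # Video-level label: every event seen in either modality (assumption).
        "video_labels": set().union(*audio_labels, *visual_labels),
    }


def random_combinations(samples: List[Dict], count: int, rng: random.Random) -> List[Dict]:
    """Draw `count` new samples by mixing modalities of two distinct videos."""
    new_samples = []
    for _ in range(count):
        audio_src, visual_src = rng.sample(samples, 2)
        new_samples.append(combine_cross_modal(audio_src, visual_src))
    return new_samples


video_a = {
    "audio": "a.wav", "visual": "a.mp4",
    "audio_labels": [{"speech"}, set()],
    "visual_labels": [{"dog"}, {"dog"}],
}
video_b = {
    "audio": "b.wav", "visual": "b.mp4",
    "audio_labels": [{"music"}, {"music"}],
    "visual_labels": [set(), {"car"}],
}

mixed = combine_cross_modal(video_a, video_b)
augmented = random_combinations([video_a, video_b], count=4, rng=random.Random(0))
```

Because the pseudo-labels are unimodal, the mixed sample keeps exact segment-level supervision for each modality even though the audio and visual streams never co-occurred.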
♻ ☆ LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition
In human-centered environments such as restaurants, homes, and warehouses,
robots often face challenges in accurately recognizing 3D objects. These
challenges stem from the complexity and variability of these environments,
including diverse object shapes. In this paper, we propose a novel Lightweight
Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to
enhance 3D object recognition in robotic applications. Our approach leverages
the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate
multi-views efficiently. The LM-MCVT architecture incorporates pre- and
mid-level convolutional encoders and local and global transformers to enhance
feature extraction and recognition accuracy. We evaluate our method on the
synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using
a four-view setup, surpassing existing state-of-the-art methods. To further
validate its effectiveness, we conduct 5-fold cross-validation on the
real-world OmniObject3D dataset using the same configuration. Results
consistently show superior performance, demonstrating the method's robustness
in 3D object recognition across synthetic and real-world 3D data.
♻ ☆ Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning
Class-incremental learning (CIL) enables models to learn new classes
progressively while preserving knowledge of previously learned ones. Recent
advances in this field have shifted towards parameter-efficient fine-tuning
techniques, with many approaches building upon the framework that maintains a
pool of learnable prompts. Although effective, these methods introduce
substantial computational overhead, primarily due to prompt pool querying and
increased input sequence lengths from prompt concatenation. In this work, we
present a novel prompt-based approach that addresses this limitation. Our
method trains a single set of shared prompts across all tasks and, rather than
concatenating prompts to the input, directly modifies the CLS token's attention
computation by adding the prompts to it. This simple and lightweight design not
only significantly reduces computational complexity (both in terms of inference
costs and the number of trainable parameters) but also eliminates the need to
optimize prompt lengths for different downstream tasks, offering a more
efficient yet powerful solution for rehearsal-free class-incremental learning.
Extensive experiments across a diverse range of CIL benchmarks demonstrate the
effectiveness of our approach, highlighting its potential to establish a new
prompt-based CIL paradigm. Furthermore, experiments on general recognition
benchmarks beyond the CIL setting also show strong performance, positioning our
method as a promising candidate for a general parameter-efficient fine-tuning
approach.
♻ ☆ OE3DIS: Open-Ended 3D Point Cloud Instance Segmentation ICCV
Open-Vocab 3D Instance Segmentation methods (OV-3DIS) have recently
demonstrated their ability to generalize to unseen objects. However, these
methods still depend on predefined class names during testing, restricting the
autonomy of agents. To mitigate this constraint, we propose a novel problem
termed Open-Ended 3D Instance Segmentation (OE-3DIS), which eliminates the
necessity for predefined class names during testing. Moreover, we contribute a
comprehensive set of strong baselines, derived from OV-3DIS approaches and
leveraging 2D Multimodal Large Language Models. To assess the performance of
our OE-3DIS system, we introduce a novel Open-Ended score, evaluating both the
semantic and geometric quality of predicted masks and their associated class
names, alongside the standard AP score. Our approach demonstrates significant
performance improvements over the baselines on the ScanNet200 and ScanNet++
datasets. Remarkably, our method surpasses the performance of Open3DIS, the
current state-of-the-art method in OV-3DIS, even in the absence of ground-truth
object class names.
comment: Accepted at ICCVW'25 - OpenSUN3D: 5th Workshop on Open-World 3D Scene
Understanding with Foundation Models
♻ ☆ Un-EVIMO: Unsupervised Event-Based Independent Motion Segmentation
Event cameras are a novel type of biologically inspired vision sensor known
for their high temporal resolution, high dynamic range, and low power
consumption. Because of these properties, they are well-suited for processing
fast motions that require rapid reactions. Although event cameras have recently
shown competitive performance in unsupervised optical flow estimation,
performance in detecting independently moving objects (IMOs) is lagging behind,
even though event-based methods are well suited to this task given their low
latency and HDR properties. Previous approaches to event-based IMO segmentation
have been heavily dependent on labeled data. However, biological vision systems
have developed the ability to avoid moving objects through daily tasks without
being given explicit labels. In this work, we propose the first event framework
that generates IMO pseudo-labels using geometric constraints. Due to its
unsupervised nature, our method can handle an arbitrary, not predetermined
number of objects and is easily scalable to datasets where expensive IMO
labels are not readily available. We evaluate our approach on the EVIMO dataset
and show that it performs competitively with supervised methods, both
quantitatively and qualitatively.
♻ ☆ 3D Human Mesh Estimation from Single View RGBD
Despite significant progress in 3D human mesh estimation from RGB images,
RGBD cameras, offering additional depth data, remain underutilized. In this
paper, we present a method for accurate 3D human mesh estimation from a single
RGBD view, leveraging the affordability and widespread adoption of RGBD cameras
for real-world applications. A fully supervised approach to this problem
requires a dataset with RGBD image and 3D mesh label pairs. However, collecting
such a dataset is costly and challenging; hence, existing datasets are small
and limited in pose and shape diversity. To overcome this data scarcity, we
leverage existing Motion Capture (MoCap) datasets. We first obtain complete 3D
meshes from the body models found in MoCap datasets, and create partial,
single-view versions of them by projection to a virtual camera. This simulates
the depth data provided by an RGBD camera from a single viewpoint. Then, we
train a masked autoencoder to complete the partial, single-view mesh. During
inference, our method, which we name M$^3$ for ``Masked Mesh Modeling'',
matches the depth values coming from the sensor to vertices of a template human
mesh, which creates a partial, single-view mesh. We effectively recover parts
of the 3D human body mesh model that are not visible, resulting in a full body
mesh. M$^3$ achieves 16.8 mm and 22.0 mm per-vertex error (PVE) on the SURREAL
and CAPE datasets, respectively, outperforming existing methods that use
full-body point clouds as input. We obtain a competitive 70.9 mm PVE on the BEHAVE
dataset, outperforming a recently published RGB-based method by 18.4 mm,
highlighting the usefulness of depth data. Code will be released.
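Two ingredients of the abstract above lend themselves to a small illustration: creating a partial, single-view version of a full mesh by projecting it to a virtual camera, and the per-vertex error (PVE) metric. The sketch below is a simplification under assumptions: it runs a depth test on vertices only (a real pipeline would rasterize triangles), uses a pinhole camera at the origin looking along +z, and all function and parameter names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vertex = Tuple[float, float, float]


def visible_vertices(
    vertices: Sequence[Vertex],
    focal: float = 1.0,
    resolution: int = 64,
    half_extent: float = 1.0,
) -> List[int]:
    """Indices of vertices that win the depth test in their pixel."""
    zbuffer: Dict[Tuple[int, int], Tuple[float, int]] = {}
    for idx, (x, y, z) in enumerate(vertices):
        if z <= 0.0:
            continue  # behind the virtual camera
        u, v = focal * x / z, focal * y / z  # pinhole projection
        if abs(u) >= half_extent or abs(v) >= half_extent:
            continue  # outside the image plane
        px = int((u + half_extent) / (2.0 * half_extent) * resolution)
        py = int((v + half_extent) / (2.0 * half_extent) * resolution)
        # Keep only the closest vertex per pixel.
        if (px, py) not in zbuffer or z < zbuffer[(px, py)][0]:
            zbuffer[(px, py)] = (z, idx)
    return sorted(idx for _, idx in zbuffer.values())


def per_vertex_error(pred: Sequence[Vertex], gt: Sequence[Vertex]) -> float:
    """Mean Euclidean distance between corresponding vertices."""
    dists = [math.sqrt(sum((p - g) ** 2 for p, g in zip(pv, gv)))
             for pv, gv in zip(pred, gt)]
    return sum(dists) / len(dists)


# Vertex 1 sits directly behind vertex 0, so it is occluded.
mesh = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (0.5, 0.0, 1.0)]
partial = visible_vertices(mesh)

pve = per_vertex_error([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
                       [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)])
```

The occluded vertices are exactly the ones a completion model such as a masked autoencoder would be asked to recover; PVE inherits the unit of the vertex coordinates (millimetres in the reported results).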
♻ ☆ From Lab to Field: Real-World Evaluation of an AI-Driven Smart Video Solution to Enhance Community Safety
Shanle Yao, Babak Rahimi Ardabili, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Christopher Neff, Lauren Bourque, Hamed Tabkhi
This article adopts and evaluates an AI-enabled Smart Video Solution (SVS)
designed to enhance safety in the real world. The system integrates with
existing infrastructure camera networks, leveraging recent advancements in AI
for easy adoption. Prioritizing privacy and ethical standards, pose-based data
is used for downstream AI tasks such as anomaly detection. A cloud-based
infrastructure and a mobile app are deployed, enabling real-time alerts within
communities. The SVS employs innovative data representation and visualization
techniques, such as the Occupancy Indicator, Statistical Anomaly Detection,
Bird's Eye View, and Heatmaps, to understand pedestrian behaviors and enhance
public safety. Evaluation of the SVS demonstrates its capacity to convert
complex computer vision outputs into actionable insights for stakeholders,
community partners, law enforcement, urban planners, and social scientists.
This article presents a comprehensive real-world deployment and evaluation of
the SVS, implemented in a community college environment across 16 cameras. The
system integrates AI-driven visual processing, supported by statistical
analysis, database management, cloud communication, and user notifications.
Additionally, the article evaluates the end-to-end latency from the moment an
AI algorithm detects anomalous behavior in real-time at the camera level to the
time stakeholders receive a notification. The results demonstrate the system's
robustness, effectively managing 16 CCTV cameras with a consistent throughput
of 16.5 frames per second (FPS) over a 21-hour period and an average end-to-end
latency of 26.76 seconds between anomaly detection and alert issuance.
♻ ☆ TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Fu-Yun Wang, Yuchi Wang, Renrui Zhang, Peng Gao, Hongsheng Li
Diffusion Transformers (DiTs) are a powerful yet underexplored class of
generative models compared to U-Net-based diffusion architectures. We propose
TIDE (Temporal-aware sparse autoencoders for Interpretable Diffusion
transformErs), a framework designed to extract sparse, interpretable activation
features across timesteps in DiTs. TIDE effectively captures temporally-varying
representations and reveals that DiTs naturally learn hierarchical semantics
(e.g., 3D structure, object class, and fine-grained concepts) during
large-scale pretraining. Experiments show that TIDE enhances interpretability
and controllability while maintaining reasonable generation quality, enabling
applications such as safe image editing and style transfer.
♻ ☆ SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
Repurposing large vision-language models (LVLMs) as computer use agents
(CUAs) has led to substantial breakthroughs, primarily driven by human-labeled
data. However, these models often struggle with novel and specialized software,
particularly in scenarios lacking human annotations. To address this challenge,
we propose SEAgent, an agentic self-evolving framework enabling CUAs to
autonomously evolve through interactions with unfamiliar software.
Specifically, SEAgent empowers computer-use agents to autonomously master novel
software environments via experiential learning, where agents explore new
software, learn through iterative trial-and-error, and progressively tackle
auto-generated tasks organized from simple to complex. To achieve this goal, we
design a World State Model for step-wise trajectory assessment, along with a
Curriculum Generator that generates increasingly diverse and challenging tasks.
The agent's policy is updated through experiential learning, comprised of
adversarial imitation of failure actions and Group Relative Policy Optimization
(GRPO) on successful ones. Furthermore, we introduce a specialist-to-generalist
training strategy that integrates individual experiential insights from
specialist agents, facilitating the development of a stronger generalist CUA
capable of continuous autonomous evolution. This unified agent ultimately
achieves performance surpassing ensembles of individual specialist agents on
their specialized software. We validate the effectiveness of SEAgent across
five novel software environments within OS-World. Our approach achieves a
significant improvement of 23.2 percentage points in success rate, from 11.3% to 34.5%, over a
competitive open-source CUA, i.e., UI-TARS.
comment: Code at https://github.com/SunzeY/SEAgent
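For readers unfamiliar with GRPO, mentioned in the abstract above, its core step is to score each rollout relative to the other rollouts sampled for the same task, instead of using a learned value baseline. The sketch below shows only that generic normalization; it is not SEAgent's training code, the function name is hypothetical, and the use of the population standard deviation with a small `eps` is an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task: two succeed (reward 1.0), two fail (reward 0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages; a group in which all rollouts score the same contributes no learning signal.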
♻ ☆ HiMat: DiT-based Ultra-High Resolution SVBRDF Generation
Creating highly detailed SVBRDFs is essential for 3D content creation. The
rise of high-resolution text-to-image generative models, based on diffusion
transformers (DiT), suggests an opportunity to finetune them for this task.
However, retargeting the models to produce multiple aligned SVBRDF maps instead
of just RGB images, while achieving high efficiency and ensuring consistency
across different maps, remains a challenge. In this paper, we introduce HiMat:
a memory- and computation-efficient diffusion-based framework capable of
generating native 4K-resolution SVBRDFs. A key challenge we address is
maintaining consistency across different maps in a lightweight manner, without
relying on training new VAEs or significantly altering the DiT backbone (which
would damage its prior capabilities). To tackle this, we introduce the
CrossStitch module, a lightweight convolutional module that captures inter-map
dependencies through localized operations. Its weights are initialized such
that the DiT backbone operation is unchanged before finetuning starts. HiMat
enables generation with strong structural coherence and high-frequency details.
Results with a large set of text prompts demonstrate the effectiveness of our
approach for 4K SVBRDF generation. Further experiments suggest generalization
to tasks such as intrinsic decomposition.
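One common way to add a module while leaving a pretrained backbone unchanged at the start of finetuning is a zero-initialized residual branch. The abstract above does not spell out how CrossStitch achieves this, so the following is a hypothetical 1D illustration of that general idea, with made-up names: each map receives a convolutional update computed from all maps, and with all-zero kernels the update vanishes.

```python
from typing import List, Sequence

Row = List[float]


def conv1d_same(signal: Sequence[float], kernel: Sequence[float]) -> Row:
    """Zero-padded 1D correlation with an odd-length kernel, same output length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]


def inter_map_residual_mix(maps: List[Row], kernels: List[List[Row]]) -> List[Row]:
    """Add to each map a local, convolutional mix of all maps.

    kernels[i][j] is the kernel applied to map j when updating map i.
    """
    mixed = []
    for i, target in enumerate(maps):
        update = [0.0] * len(target)
        for j, source in enumerate(maps):
            contribution = conv1d_same(source, kernels[i][j])
            update = [u + c for u, c in zip(update, contribution)]
        mixed.append([t + u for t, u in zip(target, update)])
    return mixed


maps = [[1.0, 2.0], [3.0, 4.0]]  # two toy "material maps"
zero_kernels = [[[0.0, 0.0, 0.0] for _ in maps] for _ in maps]
at_init = inter_map_residual_mix(maps, zero_kernels)  # identity at initialization

# After training, kernels become non-zero and maps start to exchange information.
trained_kernels = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                   [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
after_training = inter_map_residual_mix(maps, trained_kernels)
```

With zero kernels the output equals the input, so the pretrained behaviour is preserved until gradients move the kernels away from zero.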
♻ ☆ GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Diffusion-based models are redefining the state-of-the-art in end-to-end
autonomous driving, yet their performance is increasingly hampered by a
reliance on transformer-based fusion. These architectures face fundamental
limitations: quadratic computational complexity restricts the use of
high-resolution features, and a lack of spatial priors prevents them from
effectively modeling the inherent structure of Bird's Eye View (BEV)
representations. This paper introduces GMF-Drive (Gated Mamba Fusion for
Driving), an end-to-end framework that overcomes these challenges through two
principled innovations. First, we supersede the information-limited
histogram-based LiDAR representation with a geometrically-augmented pillar
format encoding shape descriptors and statistical features, preserving critical
3D geometric details. Second, we propose a novel hierarchical gated mamba
fusion (GM-Fusion) architecture that substitutes an expensive transformer with
a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM
leverages directional sequencing and adaptive fusion mechanisms to capture
long-range dependencies with linear complexity, while explicitly respecting the
unique spatial properties of the driving scene. Extensive experiments on the
challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new
state-of-the-art performance, significantly outperforming DiffusionDrive.
Comprehensive ablation studies validate the efficacy of each component,
demonstrating that task-specific SSMs can surpass a general-purpose transformer
in both performance and efficiency for autonomous driving.
comment: 7 pages, 4 figures
♻ ☆ Understanding Dynamic Scenes in Ego Centric 4D Point Clouds
Understanding dynamic 4D scenes from an egocentric perspective (modeling
changes in 3D spatial structure over time) is crucial for human-machine
interaction, autonomous navigation, and embodied intelligence. While existing
egocentric datasets contain dynamic scenes, they lack unified 4D annotations
and task-driven evaluation protocols for fine-grained spatio-temporal
reasoning, especially on the motion of objects and humans, together with their
interactions. To address this gap, we introduce EgoDynamic4D, a novel QA
benchmark on highly dynamic scenes, comprising RGB-D video, camera poses,
globally unique instance masks, and 4D bounding boxes. We construct 927K QA
pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable,
step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering
agent motion, human-object interaction, trajectory prediction, relation
understanding, and temporal-causal reasoning, with fine-grained,
multidimensional metrics. To tackle these tasks, we propose an end-to-end
spatio-temporal reasoning framework that unifies dynamic and static scene
information, using instance-aware feature encoding, time and camera encoding,
and spatially adaptive down-sampling to compress large 4D scenes into token
sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method
consistently outperforms baselines, validating the effectiveness of multimodal
temporal modeling for egocentric dynamic scene understanding.
♻ ☆ Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation CVPR2025
Generating 3D meshes from a single image is an important but ill-posed task.
Existing methods mainly adopt 2D multiview diffusion models to generate
intermediate multiview images, and use the Large Reconstruction Model (LRM) to
create the final meshes. However, the multiview images exhibit local
inconsistencies, and the meshes often lack fidelity to the input image or look
blurry. We propose Fancy123, featuring two enhancement modules and an
unprojection operation to address the above three issues, respectively. The
appearance enhancement module deforms the 2D multiview images to realign
misaligned pixels for better multiview consistency. The fidelity enhancement
module deforms the 3D mesh to match the input image. The unprojection of the
input image and deformed multiview images onto LRM's generated mesh ensures
high clarity, discarding LRM's predicted blurry-looking mesh colors. Extensive
qualitative and quantitative experiments verify Fancy123's SoTA performance
with significant improvement. Also, the two enhancement modules are
plug-and-play and work at inference time, allowing seamless integration into
various existing single-image-to-3D methods. Code at:
https://github.com/YuQiao0303/Fancy123
comment: CVPR2025
♻ ☆ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, informing future research directions for developing resilient
and adaptable robotic systems. Project page is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
♻ ☆ LayLens: Improving Deepfake Understanding through Simplified Explanations
This demonstration paper presents $\mathbf{LayLens}$, a tool that aims to make
deepfake understanding easier for users of all educational backgrounds. While
prior works often rely on outputs containing technical jargon, LayLens bridges
the gap between model reasoning and human understanding through a three-stage
pipeline: (1) explainable deepfake detection using a state-of-the-art forgery
localization model, (2) natural language simplification of technical
explanations using a vision-language model, and (3) visual reconstruction of a
plausible original image via guided image editing. The interface presents both
technical and layperson-friendly explanations in addition to a side-by-side
comparison of the uploaded and reconstructed images. A user study with 15
participants shows that simplified explanations significantly improve clarity
and reduce cognitive load, with most users expressing increased confidence in
identifying deepfakes. LayLens offers a step toward transparent, trustworthy,
and user-centric deepfake forensics.
comment: Accepted to ACM ICMI 2025 Demos
♻ ☆ SCB-Dataset: A Dataset for Detecting Student and Teacher Classroom Behavior
Using deep learning methods to detect the classroom behaviors of both
students and teachers is an effective way to automatically analyze classroom
performance and enhance teaching effectiveness. However, there is still a scarcity
of publicly available high-quality datasets on student-teacher behaviors. We
constructed SCB-Dataset, a comprehensive dataset of student and teacher
classroom behaviors covering 19 classes. SCB-Dataset is divided into two types:
Object Detection and Image Classification. The Object Detection part includes
13,330 images and 122,977 labels, and the Image Classification part includes
21,019 images. We conducted benchmark tests on SCB-Dataset using YOLO series
algorithms and a large vision-language model. We believe that SCB-Dataset can
provide a solid foundation for future applications of artificial intelligence
in education. Code: https://github.com/Whiffe/SCB-dataset
♻ ☆ 3DFacePolicy: Audio-Driven 3D Facial Animation Based on Action Control
Audio-driven 3D facial animation has achieved significant progress in both
research and applications. While recent baselines struggle to generate natural
and continuous facial movements due to their frame-by-frame vertex generation
approach, we propose 3DFacePolicy, a pioneering work that introduces a novel
definition of vertex trajectory changes across consecutive frames through the
concept of "action". By predicting action sequences for each vertex that encode
frame-to-frame movements, we reformulate the vertex generation approach into an
action-based control paradigm. Specifically, we leverage a robotic control
mechanism, diffusion policy, to predict action sequences conditioned on both
audio and vertex states. Extensive experiments on VOCASET and BIWI datasets
demonstrate that our approach significantly outperforms state-of-the-art
methods and is particularly adept at producing dynamic, expressive and naturally smooth
facial animations.
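The "action" notion in the abstract above, i.e. frame-to-frame vertex movement, can be made concrete with a short sketch. It assumes that an action is simply the per-vertex displacement between consecutive frames and that animation is recovered by accumulating predicted actions from an initial mesh; the function names are hypothetical and the diffusion policy that would predict the actions from audio is not modeled.

```python
from typing import List

Frame = List[List[float]]  # one [x, y, z] position per vertex


def frames_to_actions(frames: List[Frame]) -> List[Frame]:
    """Per-vertex displacement between each pair of consecutive frames."""
    return [
        [[b - a for a, b in zip(v_prev, v_next)]
         for v_prev, v_next in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def rollout(initial: Frame, actions: List[Frame]) -> List[Frame]:
    """Rebuild the animation by applying actions one after another."""
    frames = [initial]
    for action in actions:
        frames.append([[c + d for c, d in zip(vertex, delta)]
                       for vertex, delta in zip(frames[-1], action)])
    return frames


# A single vertex moving over three frames.
animation = [
    [[0.0, 0.0, 0.0]],
    [[0.5, 0.0, 0.25]],
    [[1.0, 0.5, 0.25]],
]
actions = frames_to_actions(animation)
rebuilt = rollout(animation[0], actions)
```

Because each frame is obtained from the previous one plus a displacement, consecutive frames are tied together by construction, which is the continuity argument made for action-based control over independent per-frame vertex prediction.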
♻ ☆ Learning to Harmonize Cross-vendor X-ray Images by Non-linear Image Dynamics Correction
In this paper, we explore how conventional image enhancement can improve
model robustness in medical image analysis. By applying commonly used
normalization methods to images from various vendors and studying their
influence on model generalization in transfer learning, we show that the
nonlinear characteristics of domain-specific image dynamics cannot be addressed
by simple linear transforms. To tackle this issue, we reformulate the image
harmonization task as an exposure correction problem and propose a method
termed Global Deep Curve Estimation (GDCE) to reduce domain-specific exposure
mismatch. GDCE performs enhancement via a pre-defined polynomial function and
is trained with a "domain discriminator", aiming to improve model transparency
in downstream tasks compared to existing black-box methods.
♻ ☆ How Does Bilateral Ear Symmetry Affect Deep Ear Features?
Ear recognition has gained attention as a reliable biometric technique due to
the distinctive characteristics of human ears. With the increasing availability
of large-scale datasets, convolutional neural networks (CNNs) have been widely
adopted to learn features directly from raw ear images, outperforming
traditional hand-crafted methods. However, the effect of bilateral ear symmetry
on the features learned by CNNs has received little attention in recent
studies. In this paper, we investigate how bilateral ear symmetry influences
the effectiveness of CNN-based ear recognition. To this end, we first develop
an ear side classifier to automatically categorize ear images as either left or
right. We then explore the impact of incorporating this side information during
both training and testing. Cross-dataset evaluations are conducted on five
datasets. Our results suggest that treating left and right ears separately
during training and testing can lead to notable performance improvements.
Furthermore, our ablation studies on alignment strategies, input sizes, and
various hyperparameter settings provide practical insights into training
CNN-based ear recognition systems on large-scale datasets to achieve higher
verification rates.
♻ ☆ See the Forest and the Trees: A Synergistic Reasoning Framework for Knowledge-Based Visual Question Answering
Multimodal Large Language Models (MLLMs) have pushed the frontiers of
Knowledge-Based Visual Question Answering (KBVQA), yet their reasoning is
fundamentally bottlenecked by a reliance on uni-dimensional evidence. This
"seeing only the trees, but not the forest" approach prevents robust,
multi-faceted understanding. Inspired by the principle of seeing both the
forest and trees, we propose Synergos-VQA, a novel synergistic reasoning
framework. At its core, Synergos-VQA concurrently generates and fuses three
complementary evidence streams at inference time: (1) Holistic Evidence to
perceive the entire scene (the "forest"), (2) Structural Evidence from a
prototype-driven module to identify key objects (the "trees"), and (3) Causal
Evidence from a counterfactual probe to ensure the reasoning is robustly
grounded. By synergistically fusing this multi-faceted evidence, our framework
achieves a more comprehensive and reliable reasoning process. Extensive
experiments show that Synergos-VQA decisively establishes a new
state-of-the-art on three challenging benchmarks, including OK-VQA and A-OKVQA.
Furthermore, our approach demonstrates strong plug-and-play capabilities,
significantly boosting various open-source MLLMs and proving that superior
methodological design can outperform sheer model scale.
comment: Paper withdrawn by authors. A critical bug in our data processing
script (process_data.py, line 152) caused an incorrect indexing operation,
leading to systematic data omission. This error invalidates the performance
benchmarks in Table 2 and the conclusions, leaving the paper's central claim
unsupported. We apologize to the research community for this error
♻ ☆ Multiple Stochastic Prompt Tuning for Few-shot Adaptation under Extreme Domain Shift
Foundation Vision-Language Models (VLMs) like CLIP exhibit strong
generalization capabilities due to large-scale pretraining on diverse
image-text pairs. However, their performance often degrades when applied to
target datasets with significant distribution shifts in both visual appearance
and class semantics. Recent few-shot learning approaches adapt CLIP to
downstream tasks using limited labeled data via adapter or prompt tuning, but
are not specifically designed to handle such extreme domain shifts. Conversely,
some works addressing cross-domain few-shot learning consider such
domain-shifted scenarios but operate in an episodic setting with only a few
classes per episode, limiting their applicability to real-world deployment,
where all classes must be handled simultaneously. To address this gap, we
propose a novel framework, MIST (Multiple Stochastic Prompt Tuning), for
efficiently adapting CLIP to datasets with extreme distribution shifts using
only a few labeled examples, in scenarios involving all classes at once.
Specifically, we introduce multiple learnable prompts per class to effectively
capture diverse modes in visual representations arising from distribution
shifts. To further enhance generalization, these prompts are modeled as
learnable Gaussian distributions, enabling efficient exploration of the prompt
parameter space and reducing overfitting caused by limited supervision.
Extensive experiments and comparisons with state-of-the-art methods demonstrate
the effectiveness of the proposed framework.
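The idea of modeling each prompt as a learnable Gaussian can be illustrated with a minimal sketch of the reparameterization trick. Everything concrete here is an assumption made for illustration, not the MIST implementation: plain vectors stand in for encoded prompts, the names `sample_prompt` and `class_score` are hypothetical, and taking the maximum cosine similarity over a class's prompts is just one plausible aggregation rule.

```python
import math
import random


def sample_prompt(mean, log_std, rng):
    """Reparameterized draw: mean + exp(log_std) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def class_score(image_feat, prompt_means, prompt_log_stds, rng):
    """Score one class from several stochastic prompts.

    Assumed rule: draw one sample per prompt distribution and keep the
    best cosine similarity, so each prompt can cover a different visual mode.
    """
    samples = [sample_prompt(m, s, rng) for m, s in zip(prompt_means, prompt_log_stds)]
    return max(cosine(image_feat, p) for p in samples)
```

In a trainable version, `mean` and `log_std` would be the learnable parameters; sampling through `mean + std * eps` keeps the draw differentiable with respect to both.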
♻ ☆ PointDreamer: Zero-shot 3D Textured Mesh Reconstruction from Colored Point Cloud
Faithfully reconstructing textured meshes is crucial for many applications.
Compared to text or image modalities, leveraging 3D colored point clouds as
input (colored-PC-to-mesh) offers inherent advantages in comprehensively and
precisely replicating the target object's 360° characteristics. While most
existing colored-PC-to-mesh methods suffer from blurry textures or require
hard-to-acquire 3D training data, we propose PointDreamer, a novel framework
that harnesses 2D diffusion prior for superior texture quality. Crucially,
unlike prior 2D-diffusion-for-3D works driven by text or image inputs,
PointDreamer successfully adapts 2D diffusion models to 3D point cloud data by
a novel project-inpaint-unproject pipeline. Specifically, it first projects the
point cloud into sparse 2D images and then performs diffusion-based inpainting.
After that, unlike most existing 3D reconstruction or generation approaches,
which predict texture in 3D/UV space and thus often yield blurry textures,
PointDreamer achieves high-quality texture by directly unprojecting
the inpainted 2D images to the 3D mesh. Furthermore, we identify for the first
time a typical kind of unprojection artifact appearing in occlusion borders,
which is common in other multiview-image-to-3D pipelines but less-explored. To
address this, we propose a novel solution named the Non-Border-First (NBF)
unprojection strategy. Extensive qualitative and quantitative experiments on
various synthetic and real-scanned datasets demonstrate that PointDreamer,
though zero-shot, exhibits SoTA performance (30% improvement on LPIPS score
from 0.118 to 0.068), and is robust to noisy, sparse, or even incomplete input
data. Code at: https://github.com/YuQiao0303/PointDreamer.
♻ ☆ Cut2Next: Generating Next Shot via In-Context Tuning
Effective multi-shot generation demands purposeful, film-like transitions and
strict cinematic continuity. Current methods, however, often prioritize basic
visual consistency, neglecting crucial editing patterns (e.g., shot/reverse
shot, cutaways) that drive narrative flow for compelling storytelling. This
yields outputs that may be visually coherent but lack narrative sophistication
and true cinematic integrity. To bridge this, we introduce Next Shot Generation
(NSG): synthesizing a subsequent, high-quality shot that critically conforms to
professional editing patterns while upholding rigorous cinematic continuity.
Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs
in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This
strategy uses Relational Prompts to define overall context and inter-shot
editing styles. Individual Prompts then specify per-shot content and
cinematographic attributes. Together, these guide Cut2Next to generate
cinematically appropriate next shots. Architectural innovations, Context-Aware
Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further
integrate these diverse signals without introducing new parameters. We
construct RawCuts (large-scale) and CuratedCuts (refined) datasets, both with
hierarchical prompts, and introduce CutBench for evaluation. Experiments show
Cut2Next excels in visual consistency and text fidelity. Crucially, user
studies reveal a strong preference for Cut2Next, particularly for its adherence
to intended editing patterns and overall cinematic continuity, validating its
ability to generate high-quality, narratively expressive, and cinematically
coherent subsequent shots.
♻ ☆ UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI ICCV 2025
We introduce UnrealZoo, a collection of over 100 photo-realistic 3D virtual
worlds built on Unreal Engine, designed to reflect the complexity and
variability of open-world environments. We also provide a rich variety of
playable entities, including humans, animals, robots, and vehicles for embodied
AI research. We extend UnrealCV with optimized APIs and tools for data
collection, environment augmentation, distributed training, and benchmarking.
These enhancements significantly improve the efficiency of
rendering and communication, enabling advanced applications such as multi-agent
interactions. Our experimental evaluation across visual navigation and tracking
tasks reveals two key insights: 1) environmental diversity provides substantial
benefits for developing generalizable reinforcement learning (RL) agents, and
2) current embodied agents face persistent challenges in open-world scenarios,
including navigation in unstructured terrain, adaptation to unseen
morphologies, and managing latency in closed-loop control systems for
interacting with highly dynamic objects. UnrealZoo thus serves as both a
comprehensive testing ground and a pathway toward developing more capable
embodied AI systems for real-world deployment.
comment: ICCV 2025 (Highlight), Project page: http://unrealzoo.site/
♻ ☆ PAD-F: Prior-Aware Debiasing Framework for Long-Tailed X-ray Prohibited Item Detection
Detecting prohibited items in X-ray security imagery is a challenging yet
crucial task. With the rapid advancement of deep learning, object detection
algorithms have been widely applied in this area. However, the distribution of
object classes in real-world prohibited item detection scenarios often exhibits
a distinct long-tailed distribution. Due to the unique principles of X-ray
imaging, conventional methods for long-tailed object detection are often
ineffective in this domain. To tackle these challenges, we introduce the
Prior-Aware Debiasing Framework (PAD-F), a novel approach that employs a
two-pronged strategy leveraging both material and co-occurrence priors. At the
data level, our Explicit Material-Aware Augmentation (EMAA) component generates
numerous challenging training samples for tail classes. It achieves this
through a placement strategy guided by material-specific absorption rates and a
gradient-based Poisson blending technique. At the feature level, the Implicit
Co-occurrence Aggregator (ICA) acts as a plug-in module that enhances features
for ambiguous objects by implicitly learning and aggregating statistical
co-occurrence relationships within the image. Extensive experiments on the
HiXray and PIDray datasets demonstrate that PAD-F significantly boosts the
performance of multiple popular detectors. It achieves an absolute improvement
of up to +17.2% in AP50 for tail classes and comprehensively outperforms
existing state-of-the-art methods. Our work provides an effective and versatile
solution to the critical problem of long-tailed detection in X-ray security.
comment: 9 pages, 5 figures
♻ ☆ Mem4D: Decoupling Static and Dynamic Memory for Dynamic Scene Reconstruction
Xudong Cai, Shuo Wang, Peng Wang, Yongcai Wang, Zhaoxin Fan, Wanting Li, Tianbao Zhang, Jianrong Tao, Yeying Jin, Deying Li
Reconstructing dense geometry for dynamic scenes from a monocular video is a
critical yet challenging task. Recent memory-based methods enable efficient
online reconstruction, but they fundamentally suffer from a Memory Demand
Dilemma: The memory representation faces an inherent conflict between the
long-term stability required for static structures and the rapid, high-fidelity
detail retention needed for dynamic motion. This conflict forces existing
methods into a compromise, leading to either geometric drift in static
structures or blurred, inaccurate reconstructions of dynamic objects. To
address this dilemma, we propose Mem4D, a novel framework that decouples the
modeling of static geometry and dynamic motion. Guided by this insight, we
design a dual-memory architecture: 1) The Transient Dynamics Memory (TDM)
focuses on capturing high-frequency motion details from recent frames, enabling
accurate and fine-grained modeling of dynamic content; 2) The Persistent
Structure Memory (PSM) compresses and preserves long-term spatial information,
ensuring global consistency and drift-free reconstruction for static elements.
By alternating queries to these specialized memories, Mem4D simultaneously
maintains static geometry with global consistency and reconstructs dynamic
elements with high fidelity. Experiments on challenging benchmarks demonstrate
that our method achieves state-of-the-art or competitive performance while
maintaining high efficiency. Code will be publicly available.
♻ ☆ Efficient Annotation of Medieval Charters
Diplomatics, the analysis of medieval charters, is a major field of research
in which paleography is applied. Annotating data, if performed by laymen, needs
validation and correction by experts. In this paper, we propose an effective
and efficient annotation approach for charter segmentation, essentially
reducing it to object detection. This approach allows for a much more efficient
use of the paleographer's time and produces results that can compete with and
even outperform pixel-level segmentation in some use cases. Further experiments shed
light on how to design a class ontology in order to make the best use of
annotators' time and effort. Exploiting the presence of calibration cards in
the image, we further annotate the data with the physical length in pixels and
train regression neural networks to predict it from image patches.
♻ ☆ Masked Autoencoder Self Pre-Training for Defect Detection in Microelectronics
While transformers have surpassed convolutional neural networks (CNNs) in
various computer vision tasks, microelectronics defect detection still largely
relies on CNNs. We hypothesize that this gap is due to the fact that a)
transformers have an increased need for data and b) (labelled) image generation
procedures for microelectronics are costly, and data is therefore sparse.
Whereas in other domains, pre-training on large natural image datasets can
mitigate this problem, in microelectronics transfer learning is hindered due to
the dissimilarity of domain data and natural images. We address this challenge
through self pre-training, where models are pre-trained directly on the target
dataset, rather than another dataset. We propose a resource-efficient vision
transformer (ViT) pre-training framework for defect detection in
microelectronics based on masked autoencoders (MAE). We perform pre-training
and defect detection using a dataset of fewer than 10,000 scanning acoustic
microscopy (SAM) images. Our experimental results show that our approach leads
to substantial performance gains compared to a) supervised ViT, b) ViT
pre-trained on natural image datasets, and c) state-of-the-art CNN-based defect
detection models used in microelectronics. Additionally, interpretability
analysis reveals that our self pre-trained models attend to defect-relevant
features such as cracks in the solder material, while baseline models often
attend to spurious patterns. This shows that our approach yields
defect-specific feature representations, resulting in more interpretable and
generalizable transformer models for this data-sparse domain.
comment: 16 pages, 5 figures
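The masked-autoencoder pre-training mentioned in this abstract relies on hiding a random subset of image patches and reconstructing them from the visible ones. A minimal sketch of that masking step follows; the function name `random_patch_mask` is hypothetical, and the default ratio of 0.75 is the value commonly used in the MAE literature, assumed here because the abstract does not state one.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Split patch indices into visible and masked sets.

    Only the visible patches would be fed to the encoder; the masked
    indices are the reconstruction targets for the decoder.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    visible = [i for i in range(num_patches) if i not in masked]
    return visible, sorted(masked)
```

For a 224x224 image cut into 16x16 patches there are 196 patches, so a ratio of 0.75 leaves 49 visible patches for the encoder, which is where much of the efficiency of this style of pre-training comes from.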
♻ ☆ HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
When applying Visual Place Recognition (VPR) to real-world mobile robots and
similar applications, perspective-to-equirectangular (P2E) formulation
naturally emerges as a suitable approach to accommodate diverse query images
captured from various viewpoints. In this paper, we introduce HypeVPR, a novel
hierarchical embedding framework in hyperbolic space, designed to address the
unique challenges of P2E VPR. The key idea behind HypeVPR is that visual
environments captured by panoramic views exhibit inherent hierarchical
structures. To leverage this property, we employ hyperbolic space to represent
hierarchical feature relationships and preserve distance properties within the
feature space. To achieve this, we propose a hierarchical feature aggregation
mechanism that organizes local-to-global feature representations within
hyperbolic space. Additionally, HypeVPR adopts an efficient coarse-to-fine
search strategy to enable flexible control over accuracy-efficiency trade-offs
and ensure robust matching even between descriptors from different image types.
This approach allows HypeVPR to outperform existing methods while significantly
accelerating retrieval and reducing database storage requirements. The code and
models will be released at https://github.com/suhan-woo/HypeVPR.git.
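To make the notion of distance in hyperbolic space concrete, the sketch below computes the geodesic distance in the Poincaré ball model. The abstract does not say which model of hyperbolic space HypeVPR uses, so the Poincaré ball (with curvature -1) and the function name `poincare_distance` are assumptions chosen for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))
```

Distances grow rapidly as points approach the boundary of the ball, which is why such spaces can embed tree-like hierarchies (for example, coarse panorama-level descriptors near the origin and finer local ones further out) with low distortion.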
♻ ☆ Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Autonomous driving systems must operate reliably in safety-critical
scenarios, particularly those involving unusual or complex behavior by
Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets
is essential for robust evaluation and generalization, but retrieving such rare
human behavior scenarios within the long tail of large-scale datasets is
challenging. To support targeted evaluation of autonomous driving systems in
diverse, human-centered scenarios, we propose a novel context-aware motion
retrieval framework. Our method combines Skinned Multi-Person Linear
(SMPL)-based motion sequences and corresponding video frames before encoding
them into a shared multimodal embedding space aligned with natural language.
Our approach enables the scalable retrieval of human behavior and its context
through text queries. This work also introduces our dataset WayMoCo, an
extension of the Waymo Open Dataset. It contains automatically labeled motion
and scene context descriptions derived from generated pseudo-ground-truth SMPL
sequences and corresponding image data. Our approach outperforms
state-of-the-art models by up to 27.5% accuracy in motion-context retrieval,
when evaluated on the WayMoCo dataset.
comment: Project page: https://iv.ee.hm.edu/contextmotionclip/; This work has
been submitted to the IEEE for possible publication
♻ ☆ Style transfer between Microscopy and Magnetic Resonance Imaging via Generative Adversarial Network in small sample size settings ICIP
Cross-modal augmentation of Magnetic Resonance Imaging (MRI) and microscopic
imaging based on the same tissue samples is promising because it can allow
histopathological analysis in the absence of an underlying invasive biopsy
procedure. Here, we tested a method for generating microscopic histological
images from MRI scans of the human corpus callosum using a conditional
generative adversarial network (cGAN) architecture. To our knowledge, this is
the first multimodal translation of brain MRI to a histological volumetric
representation of the same sample. The technique was assessed by training
paired image translation models taking sets of images from MRI scans and
microscopy. The use of cGAN for this purpose is challenging because microscopy
images are large in size and typically have low sample availability. The
current work demonstrates that the framework reliably synthesizes histology
images from MRI scans of corpus callosum, emphasizing the network's ability to
train on high resolution histologies paired with relatively lower-resolution
MRI scans. With the ultimate goal of avoiding biopsies, the proposed tool can
be used for educational purposes.
comment: 2023 IEEE International Conference on Image Processing (ICIP)
♻ ☆ When Imitation Learning Outperforms Reinforcement Learning in Surgical Action Planning MICCAI2025
Maxence Boels, Harry Robertshaw, Thomas C Booth, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin
Surgical action planning requires predicting future instrument-verb-target
triplets for real-time assistance. While teleoperated robotic surgery provides
natural expert demonstrations for imitation learning (IL), reinforcement
learning (RL) could potentially discover superior strategies through
exploration. We present the first comprehensive comparison of IL versus RL for
surgical action planning on CholecT50. Our Dual-task Autoregressive Imitation
Learning (DARIL) baseline achieves 34.6% action triplet recognition mAP and
33.6% next frame prediction mAP with smooth planning degradation to 29.2% at
10-second horizons. We evaluated three RL variants: world model-based RL,
direct video RL, and inverse RL enhancement. Surprisingly, all RL approaches
underperformed DARIL: world model RL dropped to 3.1% mAP at 10s, while
direct video RL achieved only 15.9%. Our analysis reveals that distribution
matching on expert-annotated test sets systematically favors IL over
potentially valid RL policies that differ from training demonstrations. This
challenges assumptions about RL superiority in sequential decision making and
provides crucial insights for surgical AI development.
comment: Paper accepted at the MICCAI2025 workshop proceedings on
COLlaborative Intelligence and Autonomy in Image-guided Surgery (COLAS)
♻ ☆ Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Fan Yang, Dongsheng Luo, Wenrui Chen, Jiacheng Lin, Junjie Cai, Kailun Yang, Zhiyong Li, Yaonan Wang
Functional dexterous grasping requires precise hand-object interaction, going
beyond simple gripping. Existing affordance-based methods primarily predict
coarse interaction regions and cannot directly constrain the grasping posture,
leading to a disconnection between visual perception and manipulation. To
address this issue, we propose a multi-keypoint affordance representation for
functional dexterous grasping, which directly encodes task-driven grasp
configurations by localizing functional contact points. Our method introduces
Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping
experience images for weak supervision combined with Large Vision Models for
fine affordance feature extraction, achieving generalization while avoiding
manual keypoint annotations. Additionally, we present a Keypoint-based Grasp
matrix Transformation (KGT) method, ensuring spatial consistency between hand
keypoints and object contact points, thus providing a direct link between
visual perception and dexterous grasping actions. Experiments on public
real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks
demonstrate that our method significantly improves affordance localization
accuracy, grasp consistency, and generalization to unseen tools and tasks,
bridging the gap between visual affordance learning and dexterous robotic
manipulation. The source code and demo videos are publicly available at
https://github.com/PopeyePxx/MKA.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). The source
code and demo videos are publicly available at
https://github.com/PopeyePxx/MKA
♻ ☆ SSPFusion: A Semantic Structure-Preserving Approach for Infrared and Visible Image Fusion
Most existing learning-based multi-modality image fusion (MMIF) methods
suffer from significant structure inconsistency due to their inappropriate
usage of structural features at the semantic level. To alleviate these issues,
we propose a semantic structure-preserving fusion approach for MMIF, namely
SSPFusion. First, we design a structural feature extractor (SFE) to extract
the prominent structural features from multiple input images. Concurrently, we
introduce a transformation function with the Sobel operator to generate
self-supervised structural signals in these extracted features. Subsequently,
we design a multi-scale structure-preserving fusion (SPF) module, guided by the
generated structural signals, to merge the structural features of input images.
This process ensures the preservation of semantic structure consistency between
the resultant fusion image and the input images. Through the synergy of these
two robust modules of SFE and SPF, our method can generate high-quality fusion
images and demonstrate good generalization ability. Experimental results, on
both infrared-visible image fusion and medical image fusion tasks, demonstrate
that our method outperforms nine state-of-the-art methods in terms of both
qualitative and quantitative evaluations. The code is publicly available at
https://github.com/QiaoYang-CV/SSPFUSION.
comment: Accepted by Expert Systems with Applications (ESWA)
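The Sobel operator named in this abstract is a standard pair of 3x3 filters that estimate horizontal and vertical intensity gradients. The sketch below computes the gradient magnitude on a small 2D array; how SSPFusion turns such responses into self-supervised structural signals is not described in the abstract, so this shows only the operator itself, with a hypothetical function name and an assumed zero border.

```python
import math

# Standard Sobel kernels for horizontal (KX) and vertical (KY) gradients.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Return the Sobel gradient magnitude of a 2D list of floats.

    Border pixels are left at 0.0 because the 3x3 window does not fit there.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(KX[j][i] * img[y + j - 1][x + i - 1] for j in range(3) for i in range(3))
            gy = sum(KY[j][i] * img[y + j - 1][x + i - 1] for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out
```

Flat regions produce zero response while edges produce large values, which is what makes the output usable as a structure map.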
♻ ☆ Adversarial Video Promotion Against Text-to-Video Retrieval
Thanks to the development of cross-modal models, text-to-video retrieval
(T2VR) is advancing rapidly, but its robustness remains largely unexamined.
Existing attacks against T2VR are designed to push videos away from queries,
i.e., suppressing the ranks of videos, while the attacks that pull videos
towards selected queries, i.e., promoting the ranks of videos, remain largely
unexplored. These attacks can be more impactful as attackers may gain more
views/clicks for financial benefits and widespread (mis)information. To this
end, we pioneer the first attack against T2VR to promote videos adversarially,
dubbed the Video Promotion attack (ViPro). We further propose Modal Refinement
(MoRe) to capture the finer-grained, intricate interaction between visual and
textual modalities to enhance black-box transferability. Comprehensive
experiments cover 2 existing baselines, 3 leading T2VR models, 3 prevailing
datasets with over 10k videos, evaluated under 3 scenarios. All experiments are
conducted in a multi-target setting to reflect realistic scenarios where
attackers seek to promote the video regarding multiple queries simultaneously.
We also evaluate our attacks against defences and for imperceptibility. Overall, ViPro
surpasses other baselines by over $30/10/4\%$ for white/grey/black-box settings
on average. Our work highlights an overlooked vulnerability, provides a
qualitative analysis on the upper/lower bound of our attacks, and offers
insights into potential counterplays. Code will be publicly available at
https://github.com/michaeltian108/ViPro.
♻ ☆ Unsupervised Document and Template Clustering using Multimodal Embeddings
This paper investigates a novel approach to unsupervised document clustering
by leveraging multimodal embeddings as input to clustering algorithms such as
$k$-Means, DBSCAN, a combination of HDBSCAN and $k$-NN, and BIRCH. Our method
aims to achieve a finer-grained document understanding by not only grouping
documents at the type level (e.g., invoices, purchase orders), but also
distinguishing between different templates within the same document category.
This is achieved by using embeddings that capture textual content, layout
information, and visual features of documents. We evaluated the effectiveness
of this approach using embeddings generated by several state-of-the-art
pre-trained multimodal models, including SBERT, LayoutLMv1, LayoutLMv3, DiT,
Donut, ColPali, Gemma3, and InternVL3. Our findings demonstrate the potential
of multimodal embeddings to significantly enhance document clustering, offering
benefits for various applications in intelligent document processing, document
layout analysis, and unsupervised document classification. This work provides
valuable insight into the advantages and limitations of different multimodal
models for this task and opens new avenues for future research to understand
and organize document collections.
comment: 22 pages, 12 figures
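Clustering documents by their embeddings, as described in this abstract, can be illustrated with a small pure-Python version of Lloyd's algorithm for k-means. This is a sketch under stated assumptions rather than the paper's pipeline: the 2D points stand in for high-dimensional multimodal embeddings, and the function names and the random initialization from data points are illustrative choices.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

With document embeddings, a coarse `k` would correspond to document types, while a larger `k` (or re-clustering within each type) would target individual templates.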
♻ ☆ From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang, Xinyi Zeng, Yutao Zhu, Xiao Yang, Zhengwei Fang, Jingyuan Zhang, Jiawei Chen, Zinan Liu, Yu Tian
Hallucinations in large vision-language models (LVLMs), i.e., generating
objects that are not present in the visual input, are a significant challenge
that impairs their reliability. Recent studies often attribute hallucinations
to a lack of understanding of visual input, yet ignore a more fundamental
issue: the model's inability to effectively extract or decouple visual
features. In this paper, we revisit the hallucinations in LVLMs from an
architectural perspective, investigating whether the primary cause lies in the
visual encoder (feature extraction) or the modal alignment module (feature
decoupling). Motivated by our findings from the preliminary investigation, we
propose a novel tuning strategy, PATCH, to mitigate hallucinations in LVLMs.
This plug-and-play method can be integrated into various LVLMs, utilizing
adaptive virtual tokens to extract object features from bounding boxes, thereby
addressing hallucinations caused by insufficient decoupling of visual features.
PATCH achieves state-of-the-art performance on multiple multi-modal
hallucination datasets. We hope this approach provides researchers with deeper
insights into the underlying causes of hallucinations in LVLMs, fostering
further advancements and innovation in this field.
♻ ☆ PC-SRGAN: Physically Consistent Super-Resolution Generative Adversarial Network for General Transient Simulations
Machine Learning, particularly Generative Adversarial Networks (GANs), has
revolutionised Super-Resolution (SR). However, generated images often lack
physical meaningfulness, which is essential for scientific applications. Our
approach, PC-SRGAN, enhances image resolution while ensuring physical
consistency for interpretable simulations. PC-SRGAN significantly improves both
the Peak Signal-to-Noise Ratio and the Structural Similarity Index Measure
compared to conventional SR methods, even with limited training data (e.g.,
only 13% of training data is required to achieve performance similar to SRGAN).
Beyond SR, PC-SRGAN augments physically meaningful machine learning,
incorporating numerically justified time integrators and advanced quality
metrics. These advancements promise reliable and causal machine-learning models
in scientific domains. A significant advantage of PC-SRGAN over conventional SR
techniques is its physical consistency, which makes it a viable surrogate model
for time-dependent problems. PC-SRGAN advances scientific machine learning by
improving accuracy and efficiency, enhancing process understanding, and
broadening applications to scientific research. We publicly release the
complete source code of PC-SRGAN and all experiments at
https://github.com/hasan-rakibul/PC-SRGAN.
comment: 11 pages, combining the main content and the appendices, unlike
having them separated in the published version at IEEE Xplore
(https://doi.org/10.1109/TPAMI.2025.3596647)
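The Peak Signal-to-Noise Ratio cited in this abstract is defined as 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between a reference and an estimate. A minimal sketch on flattened pixel lists follows; the function name and the default peak value of 1.0 (normalized data) are assumptions for illustration.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak Signal-to-Noise Ratio in decibels for two equal-length sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For example, a uniform error of 0.1 on data with peak value 1.0 gives an MSE of 0.01 and hence a PSNR of 20 dB; higher values mean the estimate is closer to the reference.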
♻ ☆ Zero-shot Emotion Annotation in Facial Images Using Large Multimodal Models: Benchmarking and Prospects for Multi-Class, Multi-Frame Approaches
This study investigates the feasibility and performance of using large
multimodal models (LMMs) to automatically annotate human emotions in everyday
scenarios. We conducted experiments on the DailyLife subset of the publicly
available FERV39k dataset, employing the GPT-4o-mini model for rapid, zero-shot
labeling of key frames extracted from video segments. Under a seven-class
emotion taxonomy ("Angry," "Disgust," "Fear," "Happy," "Neutral," "Sad,"
"Surprise"), the LMM achieved an average precision of approximately 50%. In
contrast, when limited to ternary emotion classification
(negative/neutral/positive), the average precision increased to approximately
64%. Additionally, we explored a strategy that integrates multiple frames
within 1-2 second video clips to enhance labeling performance and reduce costs.
The results indicate that this approach can slightly improve annotation
accuracy. Overall, our preliminary findings highlight the potential application
of zero-shot LMMs in human facial emotion annotation tasks, offering new
avenues for reducing labeling costs and broadening the applicability of LMMs in
complex multimodal environments.
comment: 10 pages, accepted to MRAC'25: 3rd International Workshop on
Multimodal and Responsible Affective Computing (ACM-MM 2025)
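The gap between the seven-class and ternary results can be illustrated by collapsing fine-grained labels before scoring. The sketch below is built on assumptions: the abstract does not give the class-to-polarity mapping, so the `TERNARY` table (in particular placing "Surprise" under positive) is hypothetical, and "average precision" is interpreted here as precision macro-averaged over the predicted classes.

```python
from collections import Counter

# Hypothetical mapping; the abstract does not specify how classes are grouped.
TERNARY = {
    "Angry": "negative", "Disgust": "negative", "Fear": "negative", "Sad": "negative",
    "Neutral": "neutral",
    "Happy": "positive", "Surprise": "positive",
}


def to_ternary(label):
    """Collapse a seven-class emotion label into negative/neutral/positive."""
    return TERNARY[label]


def macro_precision(y_true, y_pred):
    """Mean of per-class precision over the classes that were predicted."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    per_class = [correct[c] / n for c, n in predicted.items()]
    return sum(per_class) / len(per_class)
```

Confusing "Angry" with "Disgust" counts as an error under the seven-class taxonomy but not after collapsing, since both map to negative; this is one reason coarser taxonomies tend to score higher.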
♻ ☆ Mjölnir: A Deep Learning Parametrization Framework for Global Lightning Flash Density
Recent advances in AI-based weather forecasting models, such as FourCastNet,
Pangu-Weather, and GraphCast, have demonstrated the remarkable ability of deep
learning to emulate complex atmospheric dynamics. Building on this momentum, we
propose Mj\"olnir, a novel deep learning-based framework for global lightning
flash density parameterization. Trained on ERA5 atmospheric predictors and
World Wide Lightning Location Network (WWLLN) observations at a daily temporal
resolution and 1 degree spatial resolution, Mj\"olnir captures the nonlinear
mapping between large-scale environmental conditions and lightning activity.
The model architecture is based on the InceptionNeXt backbone with SENet, and a
multi-task learning strategy to simultaneously predict lightning occurrence and
magnitude. Extensive evaluations show that Mj\"olnir accurately reproduces the
global distribution, seasonal variability, and regional characteristics of
lightning activity, achieving a global Pearson correlation coefficient of 0.96
for annual mean fields. These results suggest that Mj\"olnir serves not only as
an effective data-driven global lightning parameterization but also as a
promising AI-based scheme for next-generation Earth system models (AI-ESMs).
comment: After an internal review, we found that the current version does not
meet our intended academic standards due to incomplete descriptions and
insufficient detail in key sections. No revised manuscript can be prepared in
the near future. To ensure academic quality, we withdraw this version and
plan to resubmit when the work is substantially improved
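The Pearson correlation coefficient reported in this abstract measures linear agreement between two fields. A minimal sketch of the computation follows, assuming both fields have been flattened to equal-length lists and neither is constant; the function name is illustrative, and any area weighting of grid cells that a global evaluation might apply is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 indicates that predicted and observed fields rise and fall together, although it says nothing about bias in absolute magnitude, since the coefficient is invariant to positive scaling and offsets.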