ProText: A Benchmark Dataset for Measuring (Mis)gendering in Long-Form Texts
We introduce ProText, a dataset for measuring gendering and misgendering in stylistically diverse long-form English texts. ProText spans three dimensions: Theme nouns (names, occupations, titles, kinship terms), Theme category (stereotypically male, stereotypically female, gender-neutral/non-gendere...
Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting
Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Tex...
Drop-In Perceptual Optimization for 3D Gaussian Splatting
Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a di...
SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation
This paper was accepted at the Principled Design for Trustworthy AI — Interpretability, Robustness, and Safety across Modalities Workshop at ICLR 2026.
What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subt...
AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval
We present AMES (Approximate Multimodal Enterprise Search), a unified multimodal late interaction retrieval architecture that is backend-agnostic. AMES demonstrates that fine-grained multimodal late interaction retrieval can be deployed within a production-grade enterprise search engine without arc...
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
Dense image captioning is critical for cross-modal alignment in vision-language pretraining and text-to-image generation, but scaling expert-quality annotations is prohibitively expensive. While synthetic captioning via strong vision-language models (VLMs) is a practical alternative, supervised dist...
EMBridge: Enhancing Gesture Generalization from EMG Signals through Cross-Modal Representation Learning
Hand gesture classification using high-quality structured data such as videos, images, and hand skeletons is a well-explored problem in computer vision. Alternatively, leveraging low-power, cost-effective bio-signals, e.g., surface electromyography (sEMG), allows for continuous gesture predict...
depyf: Open the Opaque Box of PyTorch Compiler for Machine Learning Researchers
PyTorch 2.x introduces a compiler designed to accelerate deep learning programs. However, for machine learning researchers, adapting to the PyTorch compiler and exploiting its full potential can be challenging. The compiler operates at the Python bytecode level, making it appear as an opaque box. To addres...
Reasoning and planning are the bedrock of intelligent AI systems, enabling them to plan, interact, adapt, and ultimately, operate independently. At Apple, understanding and advancing reasoning capabilities in AI systems has long been an area of active research, and has resulted in numerous publicat...
Faster Rates For Federated Variational Inequalities
In this paper, we study federated optimization for solving stochastic variational inequalities (VIs), a problem that has attracted growing attention in recent years. Despite substantial progress, a significant gap remains between existing convergence rates and the state-of-the-art bounds known for f...
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Video-conditioned sound and speech generation, encompassing video-to-sound (V2S) and visual text-to-speech (VisualTTS) tasks, are conventionally addressed as separate tasks, with limited exploration to unify them within a single framework. Recent attempts to unify V2S and VisualTTS face challenges i...
How PARTs Assemble into Wholes: Learning the Relative Composition of Images
The composition of objects and their parts, along with object-object positional relationships, provides a rich source of information for representation learning. Hence, spatial-aware pretext tasks have been actively explored in self-supervised learning. Existing works commonly start from a grid stru...
Inferring Optical Tissue Properties from Photoplethysmography using Hybrid Amortized Inference
Smart wearables enable continuous tracking of established biomarkers such as heart rate, heart rate variability, and blood oxygen saturation via photoplethysmography (PPG). Beyond these metrics, PPG waveforms contain richer physiological information, as recent deep learning (DL) studies demonstrate....
Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RMs). However, these approaches can be computationally expensive, especially when applied uniformly across prompt...