ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology

Washington University in St. Louis
CVPR, 2026


We propose a masked embedding reconstruction-based training recipe to enhance representations learned from pretrained encoders. The resulting representations improve performance by 5-10% across a wide range of downstream tasks.

Abstract

We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model.
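The abstract's mixed retrieval idea can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: `mixed_retrieval`, the blending weight `alpha`, and the use of a reconstructed (modality-inverted) query embedding for the intra-modal term are all assumptions for exposition.

```python
import numpy as np

def l2norm(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalization so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mixed_retrieval(query_a: np.ndarray,
                    query_a_to_b: np.ndarray,
                    gallery_b: np.ndarray,
                    alpha: float = 0.5) -> np.ndarray:
    """Blend inter-modal and intra-modal similarities for cross-modal retrieval.

    query_a:      (Q, d) query embeddings in modality A
    query_a_to_b: (Q, d) queries inverted into modality B's embedding space
                  (hypothetical output of an embedding-space reconstruction model)
    gallery_b:    (G, d) gallery embeddings in modality B
    alpha:        illustrative mixing weight, not a value from the paper
    """
    inter = l2norm(query_a) @ l2norm(gallery_b).T        # A-vs-B similarity
    intra = l2norm(query_a_to_b) @ l2norm(gallery_b).T   # B-vs-B similarity
    return alpha * intra + (1.0 - alpha) * inter         # (Q, G) score matrix
```

Ranking the gallery by each row of the returned matrix yields the retrieval results; `alpha` trades off the two similarity sources.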

💡 Motivation


A reconstruction objective in pixel/token space is not optimal for cross-view correspondences (e.g., satellite images ↔ ground-level images), where there are no pixel- or token-level correspondences.
Lack of one-to-one correspondence between multimodal observations (e.g., images ↔ audio, images ↔ text).
Lack of massive-scale, fully paired multimodal datasets, especially for more than two modalities.

🎯 Masked Multimodal Embedding Reconstruction


🧪 Train ImageBind-style multimodal encoders using massive-scale image-paired datasets (left).
🧪 Train ProM3E as a masked multimodal variational autoencoder (right).


ProM3E first trains modality-specific encoders to project all modalities into a shared embedding space using supervised contrastive alignment, after which the encoders are frozen. The aligned embeddings are fed into a masked multimodal variational autoencoder that randomly masks subsets of modalities and learns to reconstruct them by modeling a joint Gaussian latent distribution, optimized with a contrastive reconstruction loss combined with a variational information bottleneck objective. At inference, the model performs any-to-any modality reconstruction by sampling from the predicted latent distribution, while the learned variance provides an estimate of uncertainty and cross-modal informativeness.
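The masked-reconstruction step described above can be sketched as below. This is a minimal illustration, not the paper's architecture: the transformer backbone, layer counts, embedding dimension, and the learned mask token are all assumptions; the frozen modality-specific encoders are presumed to have already produced the per-modality embeddings.

```python
import torch
import torch.nn as nn

class MaskedMultimodalVAE(nn.Module):
    """Sketch of masked multimodal embedding reconstruction with a Gaussian latent.

    Takes aligned per-modality embeddings, replaces a masked subset with a
    learned mask token, and predicts a mean and log-variance per slot, from
    which latents are sampled via the reparameterization trick.
    """

    def __init__(self, n_modalities: int, d: int = 512):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d))
        # Learned modality-type embeddings so the model knows which slot is which.
        self.modality_emb = nn.Parameter(torch.randn(n_modalities, d) * 0.02)
        enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.to_mu = nn.Linear(d, d)
        self.to_logvar = nn.Linear(d, d)

    def forward(self, embs: torch.Tensor, mask: torch.Tensor):
        # embs: (B, M, d) frozen per-modality embeddings
        # mask: (B, M) bool, True where the modality is hidden
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(embs), embs)
        x = x + self.modality_emb
        h = self.backbone(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization: sample z ~ N(mu, sigma^2) differentiably.
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return z, mu, logvar
```

At inference, sampling `z` for the masked slots realizes any-to-any reconstruction, and the predicted variance serves as the uncertainty estimate mentioned above.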

⭐️ Results


Our training strategy effectively mitigates the modality gap between the encoders.
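One common way to quantify such a modality gap is the distance between the centroids of normalized embeddings from each modality (in the style of Liang et al.'s modality-gap analysis). The metric below is an illustrative choice, not necessarily the measurement used on the project page.

```python
import numpy as np

def modality_gap(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean distance between the centroids of L2-normalized embeddings.

    emb_a, emb_b: (N, d) and (M, d) embedding matrices from two modalities.
    A smaller value indicates the two modalities occupy closer regions of
    the shared embedding space.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))
```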



We show that our model can be trained with far less all-paired data while performance remains consistent across dataset sizes and tasks. For instance, training with 10% of the dataset (7,913 samples) results in an average performance drop of only ∼3%.

BibTeX

@inproceedings{sastry2025prom3e,
  title={ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology},
  author={Sastry, Srikumar and Khanal, Subash and Dhakal, Aayush and Lin, Jiayu and Cher, Dan and Jarosz, Phoenix and Jacobs, Nathan},
  booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}