Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches do not explicitly model the transitive nature of entailment, the property that links order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables explicit modeling of transitivity-enforced entailment. Our framework optimizes for the partial order of concepts within vision-language models. Building on it, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks demonstrate that our models outperform existing state-of-the-art models.
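For intuition only, below is a minimal PyTorch sketch of how an entailment objective can encode a transitive partial order over embeddings, in the spirit of order embeddings (Vendrov et al.); the names order_violation, entailment_loss, parent_emb, child_emb, and neg_child_emb are hypothetical, and this is not the exact RCME objective from the paper.

import torch

def order_violation(general, specific):
    # Penalty is zero exactly when general <= specific holds componentwise
    # (reversed product order), i.e. when "general entails specific".
    # Because the componentwise order is transitive, chains such as
    # kingdom -> family -> species remain mutually consistent.
    return torch.clamp(general - specific, min=0).pow(2).sum(dim=-1)

def entailment_loss(parent_emb, child_emb, neg_child_emb, margin=1.0):
    # Hinge-style loss: pull true (parent, child) pairs into the partial
    # order and push mismatched pairs at least `margin` away from it.
    pos = order_violation(parent_emb, child_emb)
    neg = torch.clamp(margin - order_violation(parent_emb, neg_child_emb), min=0)
    return (pos + neg).mean()

# Usage with random embeddings of batch size 8 and dimension 512:
# loss = entailment_loss(torch.rand(8, 512), torch.rand(8, 512), torch.rand(8, 512))

In practice such a penalty is applied between image and text embeddings across levels of a taxonomy; RCME differs in how it enforces transitivity, so treat this only as a baseline illustration of entailment learning.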
@inproceedings{sastry2025global,
  title={Global and Local Entailment Learning for Natural World Imagery},
  author={Sastry, Srikumar and Dhakal, Aayush and Xing, Eric and Khanal, Subash and Jacobs, Nathan},
  booktitle={International Conference on Computer Vision},
  year={2025},
  organization={IEEE/CVF}
}