TaxaBind

Abstract

We present TaxaBind, a unified embedding space for characterizing any species of interest. TaxaBind is a multimodal embedding space across six modalities: ground-level images of species, geographic location, satellite images, text, audio, and environmental features, useful for solving ecological problems. To learn this joint embedding space, we leverage ground-level images of species as a binding modality. We propose multimodal patching, a technique for effectively distilling the knowledge from various modalities into the binding modality. We construct two large datasets for pretraining: iSatNat with species images and satellite images, and iSoundNat with species images and audio. Additionally, we introduce TaxaBench-8k, a diverse multimodal dataset with six paired modalities for evaluating deep learning models on ecological tasks. Experiments with TaxaBind demonstrate its strong zero-shot and emergent capabilities on a range of tasks including species classification, cross-modal retrieval, and audio classification.

🎯 Multimodal Patching

To distill the information unique to each modality, we patch the encoders using zero-shot classification with text. The network f is shared across all modalities and is patched using techniques such as sequential patching or parallel patching.

We evaluate the zero-shot classification accuracy of the ground-level image encoder on iNat-2021 for different values of α. We observe performance improvements in all cases.
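
The sketch below illustrates one plausible reading of sequential and parallel patching. It assumes that patching means interpolating between the current weights and fine-tuned weights, and that α is the mixing coefficient; neither assumption is confirmed by this page. The `finetune` callable, the toy `SHIFT` table, and the use of plain floats in place of tensors are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]  # parameter name -> value; floats stand in for tensors
Finetune = Callable[[Weights, str], Weights]  # hypothetical fine-tuning routine


def interpolate(base: Weights, tuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * base + alpha * tuned for every parameter."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {name: (1.0 - alpha) * base[name] + alpha * tuned[name] for name in base}


def sequential_patch(base: Weights, modalities: List[str],
                     finetune: Finetune, alpha: float) -> Weights:
    """Patch one modality at a time, each step starting from the last result."""
    current = dict(base)
    for modality in modalities:
        tuned = finetune(current, modality)
        current = interpolate(current, tuned, alpha)
    return current


def parallel_patch(base: Weights, modalities: List[str],
                   finetune: Finetune, alpha: float) -> Weights:
    """Fine-tune every modality from the same base, average, then interpolate."""
    tuned_all = [finetune(base, modality) for modality in modalities]
    mean = {name: sum(t[name] for t in tuned_all) / len(tuned_all) for name in base}
    return interpolate(base, mean, alpha)


# Toy stand-in for fine-tuning: shift every weight by a modality-specific amount.
SHIFT = {"audio": 2.0, "satellite": 4.0}


def toy_finetune(weights: Weights, modality: str) -> Weights:
    return {name: value + SHIFT[modality] for name, value in weights.items()}


base_weights = {"w": 0.0}
seq = sequential_patch(base_weights, ["audio", "satellite"], toy_finetune, alpha=0.5)
par = parallel_patch(base_weights, ["audio", "satellite"], toy_finetune, alpha=0.5)
print(seq, par)
```

Under this reading, α = 0 leaves the encoder unchanged and α = 1 replaces it with the fine-tuned weights, so sweeping α trades off retained zero-shot ability against knowledge gained from the new modality.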

🌏 Inference

All our models can be accessed through rshf and Hugging Face.

from transformers import PretrainedConfig
from rshf.taxabind import TaxaBind

# Fetch the TaxaBind configuration from the Hugging Face Hub
config = PretrainedConfig.from_pretrained("MVRL/taxabind-config")
taxabind = TaxaBind(config)

# Load the open_clip-style image-text model with its tokenizer and image processor
model = taxabind.get_image_text_encoder()
tokenizer = taxabind.get_tokenizer()
processor = taxabind.get_image_processor()

For more information on how to load the other encoders, please refer to the GitHub repository.

BibTeX

@inproceedings{sastry2025taxabind,
  title={TaxaBind: A Unified Embedding Space for Ecological Applications},
  author={Sastry, Srikumar and Khanal, Subash and Dhakal, Aayush and Ahmad, Adeel and Jacobs, Nathan},
  booktitle={Winter Conference on Applications of Computer Vision},
  year={2025},
  organization={IEEE/CVF}
}

TaxaBind: A Unified Embedding Space for Ecological Applications

TaxaBind is a suite of multimodal models for downstream ecological tasks. It covers six modalities: ground-level images, geographic location, satellite images, text, audio, and environmental features.

🧪 Results

Zero-shot classification performance on various fine-grained species classification datasets using the taxonomic description of species.
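
As a rough illustration of how zero-shot classification with text can work in a shared embedding space, the sketch below picks the class whose text embedding is most similar to an image embedding. It is a minimal sketch, not the evaluation code behind these results: the two-dimensional vectors are made up, and in practice the embeddings would come from the image and text encoders applied to a photo and to each species' taxonomic description.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: Sequence[float],
                       text_embs: Dict[str, Sequence[float]]) -> str:
    """Return the label whose text embedding best matches the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))


# Hypothetical embeddings: one per class label, plus one query image.
class_text_embs = {
    "Cardinalis cardinalis": [1.0, 0.0],
    "Danaus plexippus": [0.0, 1.0],
}
query_image_emb = [0.9, 0.1]

predicted = zero_shot_classify(query_image_emb, class_text_embs)
print(predicted)
```

No classifier is trained here; adding a new species only requires embedding its text description.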

🤗 HuggingFace Demo

Demo of species-image-to-satellite-image retrieval using TaxaBind. The demo runs on CPU, so it may take a while!
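
For readers curious about what such a retrieval step might look like, here is a minimal sketch that ranks a gallery of satellite-image embeddings by similarity to one species-image embedding. It is not the demo's actual code: the tile identifiers and vectors are hypothetical, and real embeddings would be produced by the corresponding encoders.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def retrieve_top_k(query: Sequence[float],
                   gallery: Dict[str, Sequence[float]],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Return the k gallery items most similar to the query, best first."""
    q = l2_normalize(query)
    scored = []
    for item_id, emb in gallery.items():
        g = l2_normalize(emb)
        scored.append((item_id, sum(a * b for a, b in zip(q, g))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings for one species photo and three satellite tiles.
species_image_emb = [1.0, 0.0, 0.0]
satellite_gallery = {
    "tile_a": [0.9, 0.1, 0.0],
    "tile_b": [0.0, 1.0, 0.0],
    "tile_c": [0.5, 0.5, 0.0],
}

top_matches = retrieve_top_k(species_image_emb, satellite_gallery, k=2)
print([item_id for item_id, _ in top_matches])
```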
