NEW: Check out TerraDiT-Ω (accepted to ECCV 2026), our new and improved follow-up with unified spatial control for any geospatial primitive.

TerraDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis

Preprint 2026
Washington University in St. Louis
*Equal contribution
Paper | arXiv | Code | Weights (coming soon)
Point prompts, learned spatial prior, and generated satellite images

TerraDiT synthesizes high-fidelity satellite imagery from point prompts alone—no dense pixel-level maps required. From the prompts (left), a learned spatial prior (center) localizes each point's influence to drive the generated images (right).

Abstract

TL;DR — TerraDiT is a point-conditioned diffusion transformer that generates satellite imagery from text and point locations alone—no dense pixel-level maps—via an adaptive local attention mechanism for annotation-efficient, spatially precise control.

We propose TerraDiT, a diffusion transformer for text-to-satellite image generation with point-based control, enabling semantically rich and spatially precise generation using only point locations and textual descriptions instead of dense pixel-level maps. TerraDiT introduces an adaptive local attention mechanism to incorporate point queries effectively, achieving state-of-the-art performance while providing a flexible, annotation-efficient, and computationally simple framework for controllable satellite image synthesis.

Method

Point-prompted satellite image generation examples

Point prompting. Given only a handful of labeled points (leftmost column)—e.g. industrial building, parking lot, forest; water; or canal, building, farm—TerraDiT generates diverse, spatially consistent satellite scenes that honor each point's semantic label and location.

Key Contribution

Adaptive Local Attention (ALA)

At the core of TerraDiT is Adaptive Local Attention (ALA), a conditioning mechanism that injects point queries into the diffusion transformer. Each point (Sin-Cos encoded) and its text caption (from a frozen LongCLIP encoder) are fused and passed through a MetaRBF module that predicts a per-point spatial spread (σx, σy). The resulting Gaussian spatial prior modulates the cross-attention between the latent image features (queries) and the conditioning tokens (keys/values), so each point mainly influences the image tokens in its neighborhood. This spatial inductive bias yields semantically rich, spatially precise generation from sparse point supervision alone, keeping the framework annotation-efficient and computationally simple while achieving state-of-the-art fidelity.

Adaptive Local Attention (ALA) block diagram

The Adaptive Local Attention (ALA) block. A point and its caption are encoded and fused, then the MetaRBF module predicts (σx, σy) to form a spatial prior that modulates cross-attention over the latent image features.
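As a rough illustration of the spatial prior, the sketch below evaluates an anisotropic Gaussian centered on one point over a square grid of image tokens. It is not the authors' implementation. The function name `gaussian_prior`, the normalized [0, 1] coordinates, and the grid layout are assumptions made for this example. In TerraDiT the spread (σx, σy) is predicted by the MetaRBF module, whereas here it is passed in by hand.

```python
import math

def gaussian_prior(point, sigma, grid_size):
    """Anisotropic Gaussian weight of one point over a grid of image tokens.

    point: (x, y) in [0, 1] (assumed normalization).
    sigma: (sigma_x, sigma_y) in the same units, both > 0.
    grid_size: number of tokens per side (assumed square latent grid).
    Returns a list of grid_size rows, each with grid_size weights in (0, 1].
    """
    px, py = point
    sx, sy = sigma
    prior = []
    for row in range(grid_size):
        cy = (row + 0.5) / grid_size  # normalized center of this token row
        weights = []
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            # Squared distance, scaled separately along each axis.
            d2 = ((cx - px) / sx) ** 2 + ((cy - py) / sy) ** 2
            weights.append(math.exp(-0.5 * d2))
        prior.append(weights)
    return prior

# Hypothetical usage: a point at the image center that spreads more along y.
prior = gaussian_prior(point=(0.5, 0.5), sigma=(0.1, 0.2), grid_size=8)
```

A larger σ lets a point act on a wider region (for example a forest), while a smaller σ keeps its effect tight (for example a single building).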

1. Point Prompting

The scene is specified by a few labeled points—each a location paired with a short text label (e.g. industrial building, water, canal). Points are Sin-Cos encoded and labels embedded via frozen LongCLIP, together specifying what to generate and where without any dense pixel-level maps.
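The Sin-Cos encoding of a point can be sketched as follows. The page does not state the embedding dimension, the frequency schedule, or the coordinate normalization, so `dim`, `max_period`, and the geometric frequency spacing below are assumptions borrowed from common transformer positional encodings, and `sincos_encode` is a hypothetical name.

```python
import math

def sincos_encode(x, y, dim=8, max_period=10000.0):
    """Encode a 2D point as sin/cos features of x followed by those of y.

    dim must be divisible by 4: half of the features go to each coordinate,
    and each half is split between sines and cosines.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    n_freq = dim // 4
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = [max_period ** (-i / n_freq) for i in range(n_freq)]
    enc = []
    for coord in (x, y):
        enc += [math.sin(coord * f) for f in freqs]
        enc += [math.cos(coord * f) for f in freqs]
    return enc

# Hypothetical usage: a point given in pixel coordinates.
embedding = sincos_encode(120.0, 45.0, dim=16)
```

The resulting vector would then be fused with the point's LongCLIP label embedding. How that fusion is done is not described on this page.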

2. Conditioning (ALA)

Adaptive Local Attention injects the point queries into the diffusion transformer: each point acts locally on the image tokens around its location, while text conditions the model through cross-attention, so point cues are incorporated precisely and efficiently.
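One plausible way to let a spatial prior modulate cross-attention is to add its logarithm to the attention logits before the softmax. The sketch below does this with image tokens as queries and point tokens as keys/values. The page only says the prior "modulates" attention, so the additive log-bias form, the function name `ala_attention`, and the `eps` constant are assumptions of this example, not the authors' stated formulation.

```python
import math

def ala_attention(queries, keys, values, prior, eps=1e-9):
    """Cross-attention whose logits are biased by log(prior).

    queries: one vector per image token.
    keys, values: one vector per point token.
    prior[i][j]: spatial weight of point j at image token i, in [0, 1].
    Returns one output vector per image token.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q, g in zip(queries, prior):
        # Scaled dot-product score plus the log of the spatial weight.
        logits = [
            scale * sum(a * b for a, b in zip(q, k)) + math.log(w + eps)
            for k, w in zip(keys, g)
        ]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        attn = [e / total for e in exps]
        out.append([
            sum(a * v[d] for a, v in zip(attn, values))
            for d in range(len(values[0]))
        ])
    return out

# Hypothetical usage: one image token that lies close to point 0 only.
out = ala_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    prior=[[1.0, 1e-6]],
)
```

Because the softmax renormalizes over points, an image token far from every point still receives a full-weight mixture in this sketch. A real implementation may handle that case differently, for example with a gate or a residual path.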

3. Generation

The conditioned tokens drive a scalable diffusion transformer denoising process to produce the final satellite image.
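The denoising process can be pictured with a generic sampling loop. The page does not say which sampler or noise schedule TerraDiT uses, so the deterministic DDIM-style update below is an assumption, and `ddim_sample`, `denoiser`, and `alpha_bars` are hypothetical names. The `denoiser` argument stands in for the conditioned diffusion transformer and is assumed to predict the noise in its input.

```python
import math
import random

def ddim_sample(denoiser, cond, n_tokens, alpha_bars, seed=0):
    """Deterministic DDIM-style sampling over a flat list of latent values.

    denoiser(x, t, cond) -> predicted noise, same length as x.
    alpha_bars[t]: cumulative signal level at step t, in (0, 1), decreasing in t.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]  # start from pure noise
    for t in reversed(range(len(alpha_bars))):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = denoiser(x, t, cond)
        # Estimate the clean latent from the current sample and predicted noise.
        x0 = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t)
              for xi, ei in zip(x, eps)]
        # Move to the previous (less noisy) level using the same noise estimate.
        x = [math.sqrt(a_prev) * x0i + math.sqrt(1 - a_prev) * ei
             for x0i, ei in zip(x0, eps)]
    return x
```

In a full pipeline, `cond` would carry the point and text tokens, and the returned latent would be decoded into the final satellite image.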

BibTeX

@article{sastry2026terradit,
  title   = {TerraDiT: Point-Conditioned Diffusion Transformer for Satellite Image Synthesis},
  author  = {Sastry, Srikumar and Cher, Dan and Wei, Brian and Dhakal, Aayush and
             Khanal, Subash and Gupta, Dev and Jacobs, Nathan},
  journal = {arXiv preprint arXiv:2603.02172},
  year    = {2026}
}