TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang, Jiangqi Liu, Yifei Huang, Shiqi Jiang, Jianshen Shi, Changbo Wang, Chenhui Li*
College of Computer Science and Technology, East China Normal University, Shanghai, China
Shanghai Institute of AI Education, Shanghai, China
Shanghai Artificial Intelligence Laboratory, Shanghai, China

Equal Contribution, *Corresponding Author
ICML 2025

TextCenGen is a training-free method for generating text-friendly images. Given a simple text prompt and a planned blank region as inputs, TextCenGen creates images that satisfy the prompt while reserving sufficient blank space in the target region. For example, it lets users customize text-friendly wallpapers for mobile devices with any T2I model, avoiding the visual confusion caused by main objects overlapping with UI components.

Abstract

Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free approach that dynamically adapts the background within a blank region for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation, followed by attention-excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while balancing semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods, achieving 23% lower saliency overlap in text regions while maintaining 98% of the original semantic fidelity as measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).

Demos

Logo with Adaptive Natural Background

Mobile Device Wallpaper

More Results

Force-Directed Cross-Attention Guidance

In our approach, the model receives a blank region (R), denoted by the red-dotted area, and a text prompt as its inputs. The prompt is used concurrently in a Text-to-Image (T2I) model to generate both an original image and a result image. During each step of the diffusion model's denoising process, the cross-attention map from the U-Net associated with the original image directs the denoising of the result image through a loss function. Throughout this procedure, a conflict detector identifies objects that could potentially conflict with R. To mitigate such conflicts, a force-directed graph method spatially repels these objects, ensuring that the area reserved for text remains unoccupied. To further enhance the smoothness of the attention, a spatial excluding cross-attention constraint is integrated into the cross-attention map.
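
To make the guidance step concrete, the snippet below is a minimal sketch of a single attention-guided update under simplifying assumptions: the cross-attention is modeled by a stand-in differentiable function, and the names blank_region_loss, guided_step, dummy_attn, and the guidance scale are illustrative placeholders rather than the released implementation.

import torch

H, W, T = 64, 64, 8                       # attention resolution and prompt-token count
R = torch.zeros(H, W); R[:, 40:] = 1.0    # planned blank region (right-hand strip)
conflict_tokens = [2]                     # token indices flagged by the conflict detector

def blank_region_loss(attn, R, conflict_tokens):
    # attn: (H, W, T) softmaxed cross-attention; R: (H, W) binary blank-region mask.
    # Returns the attention mass that conflicting object tokens place inside R.
    return (attn[..., conflict_tokens] * R.unsqueeze(-1)).sum()

def guided_step(latents, attn_fn, R, conflict_tokens, scale=0.1):
    # One denoising-time update that nudges the latents so conflicting objects leave R.
    latents = latents.detach().requires_grad_(True)
    loss = blank_region_loss(attn_fn(latents), R, conflict_tokens)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - scale * grad).detach(), float(loss)

def dummy_attn(latents):
    # Stand-in for the U-Net cross-attention; a real pipeline recomputes this each step.
    return torch.softmax(latents.view(H, W, T), dim=-1)

latents = torch.randn(H * W * T)
latents, loss = guided_step(latents, dummy_attn, R, conflict_tokens)

In a real pipeline the attention map would be read from the U-Net at every denoising step, and the conflict detector would supply conflict_tokens from the original image's attention maps.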

Illustration of the four set relationships and their associated forces. The Repulsive Force separates object and text centroids when the object intersects the text region (a1) or lies inside it (a2). The Margin Force (b) and Warping Force (c) prevent objects from overstepping the canvas boundary. Text enclosed within an object region (a4) requires the force and the attention constraint to act together. Already-separated objects (a3) require no processing.
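
As a rough illustration of how such forces could be combined, the sketch below applies an inverse-square repulsion between centroids plus a margin pull-back near the canvas boundary. The force laws, constants, and function names are assumptions for exposition (the Warping Force of case (c) is omitted), not the exact formulation used in the paper.

import numpy as np

def repulsive_force(obj_c, text_c, strength=0.005):
    # Push the object centroid away from the text-region centroid (cases a1 and a2).
    d = obj_c - text_c
    dist = np.linalg.norm(d) + 1e-6
    return strength * d / dist**2           # inverse-square falloff, purely illustrative

def margin_force(obj_c, canvas=(1.0, 1.0), margin=0.05, strength=1.0):
    # Pull the centroid back when it gets too close to the canvas boundary (case b).
    push_in  = strength * np.maximum(margin - obj_c, 0.0)                         # left / top
    push_out = strength * np.maximum(obj_c - (np.asarray(canvas) - margin), 0.0)  # right / bottom
    return push_in - push_out

# Example: an object centroid inside the planned text region gets pushed out of it.
obj_c  = np.array([0.55, 0.50])   # conflicting object centroid (normalized coordinates)
text_c = np.array([0.60, 0.50])   # centroid of the planned blank region
target = obj_c + repulsive_force(obj_c, text_c) + margin_force(obj_c)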

Comparison with Previous Works

We compared TextCenGen with several representative models to evaluate its effectiveness. The baseline models included: Native Stable Diffusion (Rombach et al., 2022), Dall-E 3 (Ramesh et al., 2022), AnyText (Tuo et al., 2023), and Desigen (Weng et al., 2024). Dall-E 3 used the prompt "text-friendly in the {position}" to specify the region R. Similar to AnyText, we randomly generated several masks in a fixed pattern across the image to simulate the regions that need to be edited.
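
For readers reproducing this setup, the sketch below shows one way such fixed-pattern masks might be sampled; the 3x3 grid, the number of masks, and the function name are assumptions for illustration, not the benchmark's exact protocol.

import numpy as np

def fixed_pattern_masks(h=512, w=512, grid=(3, 3), n=3, seed=0):
    # Sample n distinct grid cells as candidate regions R to keep blank (illustrative).
    rng = np.random.default_rng(seed)
    gh, gw = h // grid[0], w // grid[1]
    cells = rng.choice(grid[0] * grid[1], size=n, replace=False)
    masks = []
    for c in cells:
        gy, gx = divmod(int(c), grid[1])
        m = np.zeros((h, w), dtype=np.uint8)
        m[gy * gh:(gy + 1) * gh, gx * gw:(gx + 1) * gw] = 1
        masks.append(m)
    return masks

masks = fixed_pattern_masks()     # three binary masks, each covering one grid cell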

BibTeX

@inproceedings{liang2025textcengen,
  title     = {TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation},
  author    = {Liang, Tianyi and Liu, Jiangqi and Huang, Yifei and Jiang, Shiqi and Shi, Jianshen and Wang, Changbo and Li, Chenhui},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}