TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang, Jiangqi Liu, Yifei Huang, Shiqi Jiang, Jianshen Shi, Changbo Wang, Chenhui Li*
College of Computer Science and Technology, East China Normal University, Shanghai, China
Shanghai Institute of AI Education, Shanghai, China
Shanghai Artificial Intelligence Laboratory, Shanghai, China

Equal Contribution, *Corresponding Author
ICML 2025

TextCenGen is a training-free method for generating text-friendly images. Given a simple text prompt and a planned blank region as inputs, TextCenGen creates images that satisfy the prompt while reserving sufficient blank space in the target region. For example, it lets users customize text-friendly wallpapers for mobile devices with any T2I model, avoiding the visual confusion caused by main objects overlapping with UI components.

Abstract

Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where a clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free approach that dynamically adapts the background within a blank region for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation, followed by attention-excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while balancing semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods, achieving 23% lower saliency overlap in text regions while maintaining 98% of the original semantic fidelity as measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).

Demos

Logo with Adaptive Natural Background

Mobile Device Wallpaper

More Results

Force-Directed Cross-Attention Guidance

In our approach, the model receives a blank region (R), denoted by the red-dotted area, and a text prompt as its inputs. The prompt is used concurrently in a Text-to-Image (T2I) model to generate both an original image and a result image. During each step of the diffusion model's denoising process, the cross-attention map from the U-Net associated with the original image directs the denoising of the result image through a loss function. Throughout this procedure, a conflict detector identifies objects that could potentially conflict with R. To mitigate such conflicts, a force-directed graph method spatially repels these objects, ensuring that the area reserved for text remains unoccupied. To further enhance the smoothness of the attention, a spatial excluding cross-attention constraint is integrated into the cross-attention map.
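
To make the guidance step concrete, the snippet below is a minimal sketch of a single attention-guided update under simplifying assumptions: the cross-attention is modeled by a stand-in differentiable function, and the names blank_region_loss, guided_step, dummy_attn, and the guidance scale are illustrative placeholders rather than the released implementation.

import torch

H, W, T = 64, 64, 8                       # attention resolution and prompt-token count
R = torch.zeros(H, W); R[:, 40:] = 1.0    # planned blank region (right-hand strip)
conflict_tokens = [2]                     # token indices flagged by the conflict detector

def blank_region_loss(attn, R, conflict_tokens):
    # attn: (H, W, T) softmaxed cross-attention; R: (H, W) binary blank-region mask.
    # Returns the attention mass that conflicting object tokens place inside R.
    return (attn[..., conflict_tokens] * R.unsqueeze(-1)).sum()

def guided_step(latents, attn_fn, R, conflict_tokens, scale=0.1):
    # One denoising-time update that nudges the latents so conflicting objects leave R.
    latents = latents.detach().requires_grad_(True)
    loss = blank_region_loss(attn_fn(latents), R, conflict_tokens)
    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - scale * grad).detach(), float(loss)

def dummy_attn(latents):
    # Stand-in for the U-Net cross-attention; a real pipeline recomputes this each step.
    return torch.softmax(latents.view(H, W, T), dim=-1)

latents = torch.randn(H * W * T)
latents, loss = guided_step(latents, dummy_attn, R, conflict_tokens)

In a real pipeline the attention map would be read from the U-Net at every denoising step, and the conflict detector would supply conflict_tokens from the original image's attention maps.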

Illustration of the four set relationships and their associated forces. The Repulsive Force separates object and text centroids when the object intersects the text region (a1) or lies inside it (a2). The Margin Force (b) and Warping Force (c) prevent objects from overstepping the canvas boundary. Text enclosed within an object region (a4) requires the force and the attention constraint to act together. Already-separated objects (a3) require no processing.
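
As a rough illustration of how such forces could be combined, the sketch below applies an inverse-square repulsion between centroids plus a margin pull-back near the canvas boundary. The force laws, constants, and function names are assumptions for exposition (the Warping Force of case (c) is omitted), not the exact formulation used in the paper.

import numpy as np

def repulsive_force(obj_c, text_c, strength=0.005):
    # Push the object centroid away from the text-region centroid (cases a1 and a2).
    d = obj_c - text_c
    dist = np.linalg.norm(d) + 1e-6
    return strength * d / dist**2           # inverse-square falloff, purely illustrative

def margin_force(obj_c, canvas=(1.0, 1.0), margin=0.05, strength=1.0):
    # Pull the centroid back when it gets too close to the canvas boundary (case b).
    push_in  = strength * np.maximum(margin - obj_c, 0.0)                         # left / top
    push_out = strength * np.maximum(obj_c - (np.asarray(canvas) - margin), 0.0)  # right / bottom
    return push_in - push_out

# Example: an object centroid inside the planned text region gets pushed out of it.
obj_c  = np.array([0.55, 0.50])   # conflicting object centroid (normalized coordinates)
text_c = np.array([0.60, 0.50])   # centroid of the planned blank region
target = obj_c + repulsive_force(obj_c, text_c) + margin_force(obj_c)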

Comparison with Previous Works

We compared TextCenGen with several representative models to evaluate its effectiveness. The baseline models included: Native Stable Diffusion (Rombach et al., 2022), Dall-E 3 (Ramesh et al., 2022), AnyText (Tuo et al., 2023), and Desigen (Weng et al., 2024). Dall-E 3 used the prompt "text-friendly in the {position}" to specify the region R. Similar to AnyText, we randomly generated several masks in a fixed pattern across the image to simulate the regions that need to be edited.
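
For readers reproducing this setup, the sketch below shows one way such fixed-pattern masks might be sampled; the 3x3 grid, the number of masks, and the function name are assumptions for illustration, not the benchmark's exact protocol.

import numpy as np

def fixed_pattern_masks(h=512, w=512, grid=(3, 3), n=3, seed=0):
    # Sample n distinct grid cells as candidate regions R to keep blank (illustrative).
    rng = np.random.default_rng(seed)
    gh, gw = h // grid[0], w // grid[1]
    cells = rng.choice(grid[0] * grid[1], size=n, replace=False)
    masks = []
    for c in cells:
        gy, gx = divmod(int(c), grid[1])
        m = np.zeros((h, w), dtype=np.uint8)
        m[gy * gh:(gy + 1) * gh, gx * gw:(gx + 1) * gw] = 1
        masks.append(m)
    return masks

masks = fixed_pattern_masks()     # three binary masks, each covering one grid cell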

BibTeX

@inproceedings{liang2025textcengen,
  title     = {TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation},
  author    = {Liang, Tianyi and Liu, Jiangqi and Huang, Yifei and Jiang, Shiqi and Shi, Jianshen and Wang, Changbo and Li, Chenhui},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2025}
}