Utility-Aware Multimodal Contrastive Learning for Product Image Generation
Xiaohang Feng, Yiling Xie
- Published
- May 27, 2026 — 16:54 UTC
Problem
This paper addresses a critical gap in the generative AI literature concerning product image generation for online marketplaces. Existing models primarily focus on semantic alignment between images and text prompts but do not optimize for marketplace performance, which is essential for driving consumer decisions. The authors propose a novel approach that integrates consumer demand into the image generation process, thereby enhancing the utility of generated images in commercial contexts. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors introduce a utility-aware multimodal contrastive learning framework that employs a novel Utility-Aware InfoNCE loss function. This loss function is designed to optimize the generation of images that are not only semantically coherent but also aligned with consumer demand. The framework shifts the learned image-text representation space towards demand-driven visual cues, which is theoretically validated through bounds on the proposed objective. The training process involves multimodal data, although specific details regarding the dataset size, training compute, and architecture are not disclosed. The method is designed to be adaptable, allowing for integration into various generative models as they evolve.
Results
The proposed method demonstrates significant improvements over state-of-the-art baselines in downstream applications on platforms like Amazon and Airbnb. The generated images not only enhance consumer demand but also maintain fidelity and text-image consistency. The authors report that their utility-aware framework effectively preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, leading to improved demand-based performance. While specific numerical results are not provided in the abstract, the qualitative improvements suggest a substantial effect size compared to existing models.
Limitations
The authors acknowledge that their approach may not generalize across all product categories or marketplaces, as consumer demand can vary significantly based on context. Additionally, the reliance on multimodal data may introduce biases if the training data is not representative of the broader market. The paper does not discuss potential computational costs associated with implementing the utility-aware framework in real-time applications, which could be a barrier for some users.
Why it matters
This work has significant implications for the field of generative AI, particularly in commercial applications where the alignment of generated content with consumer preferences is crucial. By embedding a utility-aware component into generative models, future research can focus on enhancing the commercial viability of AI-generated content. This approach not only advances the state of multimodal learning but also provides a pathway for integrating economic considerations into AI systems, potentially transforming how product images are generated and utilized in online marketplaces.
Authors: Xiaohang Feng, Yiling Xie
Source: arXiv:2605.28733
URL: https://arxiv.org/abs/2605.28733v1
By Turing Wire editorial staff · May 27, 2026 · Editorial standards →
Source: arXiv cs.AI