Gen-n-Val: Agentic Image Data Generation and Validation

*Equal contribution

National Yang Ming Chiao Tung University
Research Center for Information Technology Innovation, Academia Sinica
CVPR 2026 Findings Track
Synthetic instances generated by Gen-n-Val with segmentation masks
LVIS category distribution before and after Gen-n-Val balancing

Gen-n-Val generates high-quality synthetic data to address long-tailed distributions in instance segmentation.

Abstract

Data scarcity, label noise, and long-tailed category imbalance remain important unresolved challenges in computer vision tasks such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues.

Gen-n-Val is an agentic data generation framework that leverages Layer Diffusion (LD), a large language model (LLM), and a vision large language model (VLLM) to produce high-quality and diverse instance masks and images. It consists of two agents: an LD prompt agent that optimizes prompts so that Layer Diffusion generates high-quality, single-object foreground images with masks, and a data validation agent that filters out low-quality synthetic instances. TextGrad is used to optimize the system prompts of both agents.

Compared to state-of-the-art synthetic data approaches like MosaicFusion, Gen-n-Val reduces invalid synthetic data from 50% to 7%, improves LVIS rare-class instance segmentation by 7.6% mAP with Mask R-CNN, improves COCO rare-class instance segmentation by 3.6% mAP with YOLOv9c and YOLO11m, and improves open-vocabulary object detection by 7.1% mAP over YOLO-Worldv2-M with YOLO11m.

Prompting and Validation

Synthetic data only helps long-tail recognition when each generated instance is usable: it should contain exactly one target object, the category should be correct, and the mask should tightly match the object. Existing diffusion-based pipelines often break this contract, producing incomplete masks, multiple objects under one mask, or even the wrong category. Gen-n-Val is motivated by closing this quality gap before synthetic data is added to training.

Common MosaicFusion synthetic data failures
Prior diffusion-based synthetic data can include incomplete masks, multiple objects under one mask, and wrong categories.

Why Prompting Matters

A standard prompt such as "a photo of a single object" is too ambiguous for reliable data generation. It can produce repetitive samples, cluttered foregrounds, or images with multiple unintended objects, which weakens the supervision signal for rare categories.

Figure: six orange samples generated by Layer Diffusion from the standard prompt.

Why Validation Matters

Even after prompt optimization, generated instances can still be invalid. The VLLM validation agent filters failure cases before composition, so downstream detectors are trained on cleaner instances instead of noisy synthetic labels.

Example failure cases:

  • No object
  • Multiple objects
  • Wrong category
  • Incomplete object
  • Mixed conditions
  • Mixed categories

Method Overview

Gen-n-Val replaces prompt-only generation and hand-crafted filtering with an agentic loop: optimize prompts, generate transparent foreground instances with Layer Diffusion, validate them with a VLLM, then paste validated instances into training images.
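
The control flow of such a loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `make_prompt`, `generate`, and `is_valid` are hypothetical callables standing in for the LD prompt agent, Layer Diffusion, and the VLLM validation agent, and the `max_attempts` cap is an assumption added so the loop always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    category: str
    prompt: str
    image: object  # placeholder for a transparent (RGBA) foreground image


def generate_validated_instances(
    category: str,
    n_wanted: int,
    make_prompt: Callable[[str], str],      # hypothetical: LD prompt agent
    generate: Callable[[str], object],      # hypothetical: Layer Diffusion call
    is_valid: Callable[[str, object], bool],  # hypothetical: validation agent
    max_attempts: int = 100,                # assumption: cap on generation attempts
) -> List[Instance]:
    """Generate instances until n_wanted pass validation or attempts run out."""
    kept: List[Instance] = []
    attempts = 0
    while len(kept) < n_wanted and attempts < max_attempts:
        attempts += 1
        prompt = make_prompt(category)
        image = generate(prompt)
        # Only validated instances are kept for later composition.
        if is_valid(category, image):
            kept.append(Instance(category, prompt, image))
    return kept
```

Passing the three stages in as callables keeps the sketch independent of any particular model or API; rejected instances are simply discarded and generation is retried.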

Previous generative-based augmentation methods
Pipeline of previous generative-based augmentation methods
Gen-n-Val
Pipeline of the Gen-n-Val method

1. Open-Vocabulary Prompt Generation
An LLM prompt agent uses TextGrad-refined system prompts to create detailed prompts for Layer Diffusion.

2. Foreground Image Generation
Layer Diffusion generates transparent single-instance images whose alpha channel directly provides masks.

3. Image Filtering
A VLLM validation agent rejects missing objects, multiple objects, wrong categories, incomplete instances, and noisy outputs.

4. Data Composition
Validated instances are composited into diverse scenes for downstream detection and segmentation training.
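
To make the role of the alpha channel concrete, here is a small library-free sketch of two steps described above: reading a binary mask from the alpha channel and alpha-blending a foreground onto a background. It is illustrative only. Images are plain nested lists of pixel tuples, the function names are hypothetical, and the alpha threshold of 128 is an assumption rather than a value from the paper; a real pipeline would more likely use an image library.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def alpha_to_mask(fg: List[List[RGBA]], threshold: int = 128) -> List[List[int]]:
    """Return a binary mask: 1 where alpha >= threshold, else 0."""
    return [[1 if px[3] >= threshold else 0 for px in row] for row in fg]


def paste(bg: List[List[RGB]], fg: List[List[RGBA]], top: int, left: int) -> List[List[RGB]]:
    """Alpha-blend fg onto a copy of bg, with fg's top-left corner at (top, left)."""
    out = [list(row) for row in bg]
    for y, row in enumerate(fg):
        for x, (r, g, b, a) in enumerate(row):
            ty, tx = top + y, left + x
            if not (0 <= ty < len(out) and 0 <= tx < len(out[ty])):
                continue  # clip foreground pixels that fall outside the scene
            w = a / 255.0  # foreground weight taken from the alpha channel
            br, bgreen, bb = out[ty][tx]
            out[ty][tx] = (
                round(w * r + (1 - w) * br),
                round(w * g + (1 - w) * bgreen),
                round(w * b + (1 - w) * bb),
            )
    return out
```

Because the mask comes from the same alpha channel used for blending, the pasted pixels and the segmentation label stay aligned by construction; shifting the mask by the same `(top, left)` offset gives the instance annotation in scene coordinates.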

Results

  • +7.6: rare mask mAP gain on LVIS with Mask R-CNN
  • 727K: instances added to balance 1,203 LVIS classes to at least 1,000 images
  • 7%: invalid synthetic data after Gen-n-Val validation, down from about 50%
  • +7.1: open-vocabulary box mAP gain over YOLO-Worldv2-M
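
The balancing target can be read as a per-class deficit computation. The sketch below assumes a simplified accounting in which each synthetic instance raises its class's image count by one; the source does not spell out how instances map to images, and the class counts shown are made-up illustrations, not LVIS statistics.

```python
from typing import Dict


def instances_needed(image_counts: Dict[str, int], target: int = 1000) -> Dict[str, int]:
    """For each class, how many images are missing to reach the target count."""
    return {name: max(0, target - count) for name, count in image_counts.items()}


# Hypothetical counts for illustration only.
counts = {"orange": 1500, "pancake": 12, "birthday_card": 1}
deficits = instances_needed(counts)
total_to_generate = sum(deficits.values())
```

Classes already above the target receive no synthetic instances, so the generation budget concentrates on the rare classes in the long tail.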

LVIS Instance Segmentation and Detection

Method               Box mAP   Rare Box   Mask mAP   Rare Mask
Mask R-CNN baseline  22.5      9.1        21.7       9.6
MosaicFusion         24.0      14.8       23.1       15.2
Gen2Det              24.4      15.4       23.6       15.3
Gen-n-Val            26.8      16.1       25.6       17.2
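
As a quick check on the headline number, the gains over the baseline can be recomputed from the LVIS rows above. The snippet only restates the table values; the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (box mAP, rare box mAP, mask mAP, rare mask mAP), copied from the LVIS table.
LVIS: Dict[str, Tuple[float, float, float, float]] = {
    "Mask R-CNN baseline": (22.5, 9.1, 21.7, 9.6),
    "MosaicFusion": (24.0, 14.8, 23.1, 15.2),
    "Gen2Det": (24.4, 15.4, 23.6, 15.3),
    "Gen-n-Val": (26.8, 16.1, 25.6, 17.2),
}


def gains(method: str, reference: str = "Mask R-CNN baseline") -> Tuple[float, ...]:
    """Per-column difference between a method and a reference row, in mAP points."""
    return tuple(round(m - r, 1) for m, r in zip(LVIS[method], LVIS[reference]))


gen_n_val_vs_baseline = gains("Gen-n-Val")
gen_n_val_vs_mosaic = gains("Gen-n-Val", "MosaicFusion")
```

The last entry of `gen_n_val_vs_baseline` is the rare mask gain of 7.6 points quoted in the abstract, and the comparison against MosaicFusion shows a 2.0-point rare mask margin over the strongest prior synthetic-data baseline in that column family.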

COCO Instance Segmentation

Method            Box mAP   Rare Box   Mask mAP   Rare Mask
YOLO11m baseline  49.6      51.9       39.8       45.4
Copy-Paste        50.0      52.1       41.5       47.1
MosaicFusion      50.6      54.3       42.0       48.1
Gen-n-Val         51.7      55.4       42.9       49.0
YOLO model-scale performance comparison
Gen-n-Val improves instance segmentation across YOLOv9 and YOLO11 model scales.

Qualitative Results

Segmentation results: Ground Truth, Baseline, Gen-n-Val.

The baseline misses objects in challenging scenes, while the model trained with Gen-n-Val synthetic data recovers additional instances and masks.

Agent Examples

The LD prompt agent turns short category-level prompts into detailed Layer Diffusion prompts, while the data validation agent explains why generated foreground instances should be kept or filtered before augmentation.

Initial vs. Optimized System Prompt

Standard Prompt vs. Optimized LD Prompt

Airplane
Standard prompt: An image of a single airplane, an aircraft that has a fixed wing and is powered by propellers or jets.
Optimized prompt from LD prompt agent: High-resolution digital rendering of a single, sleek, silver, Boeing 747-400F airplane, alone, in a majestic, photorealistic style, evoking feelings of serenity and power, with soft golden sunlight casting long shadows across its metallic surface, from a low, 45-degree angle, showcasing riveted aluminum details.

Orange
Standard prompt: An image of a single orange (fruit), orange (FRUIT of an orange tree).
Optimized prompt from LD prompt agent: High-resolution digital rendering of a single orange fruit alone, with vibrant orange hues ranging from deep burnt orange to bright juicy orange, ultra-realistic, under soft warm natural light with subtle shadows, viewed from a 45-degree angle with a slight macro perspective, featuring glossy skin with subtle ridges and a slightly dimpled texture.

Car
Standard prompt: An image of a single car, a motor vehicle with four wheels.
Optimized prompt from LD prompt agent: High-resolution digital rendering of a single, sleek, high-performance, 2023 Lamborghini Aventador alone, with a glossy matte black and metallic silver body, ultra-realistic, conveying speed and power, with dramatic golden-hour lighting from a low, eye-level, 45-degree angle, showcasing hand-stitched leather interior and smooth steering-wheel texture.

Person
Standard prompt: An image of a single person, a human being.
Optimized prompt from LD prompt agent: High-resolution digital rendering of a single person alone, wearing a bright yellow sundress with golden accents and a subtle floral pattern, rendered in an ultra-realistic style under soft warm sunlight, viewed from a low-angle perspective, with a smooth and silky dress texture and a subtle sheen to the hair.
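
The standard prompts in these examples all follow one pattern: the category name followed by its definition. A minimal sketch of that template is shown below; the function name is hypothetical, and the template is inferred from the four examples rather than taken from released code.

```python
def standard_prompt(name: str, definition: str) -> str:
    """Build a standard Layer Diffusion prompt from a category name and definition."""
    return f"An image of a single {name}, {definition}."


car_prompt = standard_prompt("car", "a motor vehicle with four wheels")
person_prompt = standard_prompt("person", "a human being")
```

Because the template has no free attributes, every call for a given category yields the same text, which is consistent with the repetitive samples that standard prompts tend to produce.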

Diversity Effect of Optimized LD Prompt

Person foreground instances generated from the standard LD prompt and optimized LD prompts show the diversity effect. The optimized prompts produce broader variation in clothing, pose, color, texture, lighting, and viewpoint.

Person foreground instances generated using the standard LD prompt.
Person foreground instances generated using optimized LD prompts.

Data Validation Agent's Response

Validation example with multiple oranges on a tree
Orange: multiple target objects

Image Description: The image depicts a potted tree with green leaves and multiple oranges hanging from its branches.

  • Single orange: Fail
  • Single view: Meet
  • Intact orange: Meet
  • Plain background: Meet

Result: Filter Out

Validation example with an alarm clock and extra background objects
Alarm clock: non-plain background

Image Description: The image shows one digital clock, but it is accompanied by a potted plant and a lamp.

  • Single clock: Meet
  • Single view: Meet
  • Intact clock: Meet
  • Plain background: Fail

Result: Filter Out

Validation example where the generated object is a candle instead of a birthday card
Birthday card: wrong category

Image Description: The image depicts a large blue candle with two lit wicks against a solid black background.

  • Single birthday card: Fail
  • Single view: Meet
  • Intact birthday card: N/A
  • Plain background: Meet

Result: Filter Out

Validation example with no visible pancake
Pancake: no visible object

Image Description: The image is a solid black square with no visible objects or features.

  • Single pancake: Fail
  • Single view: N/A
  • Intact pancake: N/A
  • Plain background: Meet

Result: Filter Out
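
The four responses above share one shape: a checklist whose entries end in Meet, Fail, or N/A, followed by a final decision. The sketch below parses such a checklist and applies one plausible decision rule. The rule (keep only when every criterion is Meet, treating N/A as not met) is an assumption consistent with these four examples, not a rule stated by the authors, and the function names are hypothetical.

```python
import re
from typing import Dict, Iterable

_CHECK = re.compile(r"^(.+?)\s*:?\s*(Meet|Fail|N/A)$")


def parse_checks(lines: Iterable[str]) -> Dict[str, str]:
    """Map each criterion name to its status (Meet, Fail, or N/A)."""
    checks: Dict[str, str] = {}
    for line in lines:
        text = line.strip().lstrip("•").strip()
        match = _CHECK.match(text)
        if match is None:
            raise ValueError(f"unrecognized checklist line: {line!r}")
        checks[match.group(1)] = match.group(2)
    return checks


def decide(checks: Dict[str, str]) -> str:
    """Assumed rule: keep only if there is at least one check and all are Meet."""
    if checks and all(status == "Meet" for status in checks.values()):
        return "Keep"
    return "Filter Out"


orange = parse_checks([
    "• Single orange: Fail",
    "• Single view: Meet",
    "• Intact orange: Meet",
    "• Plain background: Meet",
])
orange_result = decide(orange)
```

Requiring every criterion to pass is deliberately conservative: a single failed check, such as extra objects in the background, is enough to discard the instance before it reaches training.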

Synthetic Data Examples

Synthetic instances: truck, orange, laptop, umbrella, person, tennis racket.
Original LVIS image
LVIS image augmented by Gen-n-Val

BibTeX

@inproceedings{huang2026gennval,
  title     = {Gen-n-Val: Agentic Image Data Generation and Validation},
  author    = {Huang, Jing-En and Fang, I-Sheng and Huang, Tzuhsuan and Liu, Yu-Lun and Wang, Chih-Yu and Chen, Jun-Cheng},
  booktitle = {CVPR Findings},
  year      = {2026}
}