Gen-n-Val: Agentic Image Data Generation and Validation

Abstract

Data scarcity, label noise, and long-tailed category imbalance remain important unresolved challenges in computer vision tasks such as object detection and instance segmentation, especially on large-vocabulary benchmarks like LVIS, where most categories appear in only a few images. Current synthetic data generation methods still suffer from multiple objects per mask, inaccurate segmentation, incorrect category labels, and other issues.

Gen-n-Val is an agentic data generation framework that leverages Layer Diffusion, an LLM, and a VLLM to produce high-quality and diverse instance masks and images. It consists of two agents: an LD prompt agent that optimizes prompts for high-quality foreground single-object images with masks, and a data validation agent that filters low-quality synthetic instances. TextGrad is used to optimize the system prompts for both agents.

Compared to state-of-the-art synthetic data approaches such as MosaicFusion, Gen-n-Val reduces invalid synthetic data from 50% to 7%. It improves LVIS rare-class instance segmentation by 7.6 mAP points with Mask R-CNN, COCO rare-class instance segmentation by 3.6 mAP points with YOLOv9c and YOLO11m, and open-vocabulary object detection by 7.1 mAP points over YOLO-Worldv2-M with YOLO11m.

Prompting and Validation

Synthetic data only helps long-tail recognition when each generated instance is usable: it should contain exactly one target object, the category should be correct, and the mask should tightly match the object. Existing diffusion-based pipelines often break this contract, producing incomplete masks, multiple objects under one mask, or even the wrong category. Gen-n-Val is motivated by closing this quality gap before synthetic data is added to training.

Common MosaicFusion synthetic data failures — Prior diffusion-based synthetic data can include incomplete masks, multiple objects under one mask, and wrong categories.

Why Prompting Matters

A standard prompt such as "a photo of a single object" is too ambiguous for reliable data generation. It can produce repetitive samples, cluttered foregrounds, or images with multiple unintended objects, which weakens the supervision signal for rare categories.

Figure: six Layer Diffusion samples generated for "orange" with the standard prompt.

Why Validation Matters

Even after prompt optimization, generated instances can still be invalid. The VLLM validation agent filters failure cases before composition, so downstream detectors are trained on cleaner instances instead of noisy synthetic labels.

Figure: failure cases caught by the validation agent (multiple objects, wrong category, incomplete object, mixed conditions, mixed categories).

Method Overview

Gen-n-Val replaces prompt-only generation and hand-crafted filtering with an agentic loop: optimize prompts, generate transparent foreground instances with Layer Diffusion, validate them with a VLLM, then paste validated instances into training images.
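The loop described above can be sketched as a small driver function. This is only an illustration under stated assumptions: the three callables (`prompt_agent`, `generator`, `validator`), the `per_category` quota, and the `max_attempts` cap are hypothetical stand-ins for the LLM prompt agent, Layer Diffusion, and the VLLM validation agent, not the authors' implementation. The final composition step is left out.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interfaces (assumptions, not from the source):
#   prompt_agent(category)        -> detailed text prompt
#   generator(prompt)             -> foreground instance (any dict-like record)
#   validator(instance, category) -> True to keep, False to filter out
PromptAgent = Callable[[str], str]
Generator = Callable[[str], dict]
Validator = Callable[[dict, str], bool]


def generate_validated_instances(
    categories: Iterable[str],
    prompt_agent: PromptAgent,
    generator: Generator,
    validator: Validator,
    per_category: int,
    max_attempts: int = 20,
) -> List[Tuple[str, dict]]:
    """Collect up to `per_category` validated instances for each category."""
    kept: List[Tuple[str, dict]] = []
    for category in categories:
        accepted = 0
        for _ in range(max_attempts):
            if accepted >= per_category:
                break
            prompt = prompt_agent(category)      # step 1: prompt generation
            instance = generator(prompt)         # step 2: foreground generation
            if validator(instance, category):    # step 3: filtering
                kept.append((category, instance))
                accepted += 1
    return kept
```

Capping the number of attempts keeps the loop from running forever on categories where the validator rejects nearly everything.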

Figure: pipeline of previous generative-based augmentation methods.

Figure: pipeline of the Gen-n-Val method.

1. Open-Vocabulary Prompt Generation

An LLM prompt agent uses TextGrad-refined system prompts to create detailed prompts for Layer Diffusion.

2. Foreground Image Generation

Layer Diffusion generates transparent single-instance images whose alpha channel directly provides masks.
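To illustrate how an alpha channel can provide a mask directly, here is a minimal sketch in plain Python. The threshold of 128 and the nested-list image representation are assumptions made for the example; the source does not state how the alpha channel is binarized, and a real pipeline would likely use an array library.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int, int]  # (R, G, B, A), each in 0..255


def alpha_to_mask(rgba: List[List[Pixel]], threshold: int = 128) -> List[List[int]]:
    """Binary mask: 1 where alpha >= threshold, else 0 (threshold is an assumption)."""
    return [[1 if px[3] >= threshold else 0 for px in row] for row in rgba]


def mask_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Tight inclusive box (x_min, y_min, x_max, y_max), or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

An empty mask (no pixel above the threshold) corresponds to the "no visible object" failure case, which is one reason a box helper that returns `None` is convenient.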

3. Image Filtering

A VLLM validation agent rejects missing objects, multiple objects, wrong categories, incomplete instances, and noisy outputs.

4. Data Composition

Validated instances are composited into diverse scenes for downstream detection and segmentation training.
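One common way to paste a transparent foreground into a scene is standard alpha blending (the "over" operator). The sketch below assumes that approach, a nested-list image layout, and a hypothetical `mask_threshold`; the source does not specify the blending method or placement strategy Gen-n-Val uses.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def paste_instance(
    background: List[List[RGB]],
    foreground: List[List[RGBA]],
    top: int,
    left: int,
    mask_threshold: int = 128,  # assumption: alpha cut-off for the pasted mask
) -> Tuple[List[List[RGB]], List[List[int]]]:
    """Alpha-blend `foreground` onto a copy of `background` at (top, left).

    Returns the blended image and a scene-sized binary mask of the pasted
    instance. Foreground pixels that fall outside the scene are clipped.
    """
    out = [list(row) for row in background]
    mask = [[0] * len(row) for row in background]
    for dy, fg_row in enumerate(foreground):
        for dx, (r, g, b, a) in enumerate(fg_row):
            y, x = top + dy, left + dx
            if not (0 <= y < len(out) and 0 <= x < len(out[y])):
                continue  # outside the scene
            alpha = a / 255.0
            bg_r, bg_g, bg_b = out[y][x]
            out[y][x] = (
                round(r * alpha + bg_r * (1 - alpha)),
                round(g * alpha + bg_g * (1 - alpha)),
                round(b * alpha + bg_b * (1 - alpha)),
            )
            if a >= mask_threshold:
                mask[y][x] = 1
    return out, mask
```

Returning the mask alongside the image means the pasted instance comes with its segmentation label at no extra annotation cost.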

Results

+7.6 Rare mask mAP on LVIS with Mask R-CNN

727K Instances added to balance 1,203 LVIS classes to at least 1,000 images

7% Invalid synthetic data after Gen-n-Val validation, down from about 50%

+7.1 Open-vocabulary box mAP over YOLO-Worldv2-M
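The balancing statistic above (raising every class to at least 1,000 images) implies a simple per-class deficit computation. The sketch below is an assumption about how such a budget could be derived: it treats each synthetic instance as contributing one image to its class, and the class counts in the usage example are made up for illustration, not LVIS statistics.

```python
from typing import Dict


def instances_needed(image_counts: Dict[str, int], target: int = 1000) -> Dict[str, int]:
    """Synthetic instances required per class to reach `target` images.

    Classes already at or above the target need none.
    """
    return {name: max(0, target - count) for name, count in image_counts.items()}


# Toy counts, invented for the example.
toy_counts = {"orange": 1200, "pancake": 40, "birthday_card": 3}
toy_budget = instances_needed(toy_counts)
toy_total = sum(toy_budget.values())
```

Rare classes dominate the budget, which is why most of the added instances go to the long tail.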

LVIS Instance Segmentation and Detection

Method	Box mAP	Rare Box	Mask mAP	Rare Mask
Mask R-CNN baseline	22.5	9.1	21.7	9.6
MosaicFusion	24.0	14.8	23.1	15.2
Gen2Det	24.4	15.4	23.6	15.3
Gen-n-Val	26.8	16.1	25.6	17.2
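The gains quoted in the summary can be recomputed from the table. The snippet below copies the four rows verbatim and subtracts one row from another; the function name `gain` is just a helper for this example.

```python
# Columns: Box mAP, Rare Box, Mask mAP, Rare Mask (values copied from the table).
LVIS_RESULTS = {
    "Mask R-CNN baseline": (22.5, 9.1, 21.7, 9.6),
    "MosaicFusion": (24.0, 14.8, 23.1, 15.2),
    "Gen2Det": (24.4, 15.4, 23.6, 15.3),
    "Gen-n-Val": (26.8, 16.1, 25.6, 17.2),
}


def gain(method: str, reference: str, results=LVIS_RESULTS):
    """Per-column difference in mAP points, rounded to one decimal place."""
    return tuple(round(m - r, 1) for m, r in zip(results[method], results[reference]))


over_baseline = gain("Gen-n-Val", "Mask R-CNN baseline")
over_mosaicfusion = gain("Gen-n-Val", "MosaicFusion")
```

The last entry of `over_baseline` is the +7.6 rare mask mAP headline number; against MosaicFusion the rare mask gain is 2.0 points.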

COCO Instance Segmentation

Method	Box mAP	Rare Box	Mask mAP	Rare Mask
YOLO11m baseline	49.6	51.9	39.8	45.4
Copy-Paste	50.0	52.1	41.5	47.1
MosaicFusion	50.6	54.3	42.0	48.1
Gen-n-Val	51.7	55.4	42.9	49.0

Figure: YOLO model-scale performance comparison. Gen-n-Val improves instance segmentation across YOLOv9 and YOLO11 model scales.

Qualitative Results

Ground-truth segmentation result — Ground Truth

Gen-n-Val segmentation result — Gen-n-Val

The baseline misses objects in challenging scenes, while the model trained with Gen-n-Val synthetic data recovers additional instances and masks.

Agent Examples

The LD prompt agent turns short category-level prompts into detailed Layer Diffusion prompts, while the data validation agent explains why generated foreground instances should be kept or filtered before augmentation.

Initial vs. Optimized System Prompt

LD Prompt Agent

Initial system prompt

Generate detailed positive prompts for the Stable Diffusion Juggernaut-XL-v6 model to create images focusing solely on the main subject. Each prompt must be specific and cover aspects such as the subject's status, color, style, mood/atmosphere, lighting, perspective/viewpoint, textures/material, time period, and medium. Prompts should emphasize the use of trigger words like "high-resolution" and "highly realistic" to ensure quality. Prompts should be concise, limited to under 75 tokens, and must not include disallowed or sensitive content. Background descriptions should be absent, avoiding the inclusion of additional objects.

Optimized system prompt

You are an AI assistant designed to generate detailed and realistic prompts for the Stable Diffusion XL model, focusing only on a single subject. The background and environment should be omitted in the prompts. Your prompts should be specific, descriptive, diverse, and follow the provided guidelines to ensure high-quality image generation.

Guidelines for Prompt Creation:

Subject: The only single object in the image. Ensure a wide variety of subjects, ranging from everyday items to unique or uncommon objects.
Status: The current state or condition of the subject.
Color: Dominant colors of the subject. Include specific shades and variations to enhance visual detail.
Style: Artistic style or rendering method. Incorporate a range of styles (e.g., photorealistic, hyper-realistic) to promote diversity.
Mood/Atmosphere: Emotional quality related to the subject. Convey realistic emotions or states that align with the subject.
Lighting: Specific lighting on the subject. Describe natural or artificial lighting conditions that highlight the subject's features.
Perspective/Viewpoint: Angle or perspective of the subject. Use varied viewpoints (e.g., top-down, eye-level, close-up) to add depth.
Texture/Material: Textures or materials of the subject. Detail the tactile qualities to enhance realism.
Time Period: Specific era. When relevant, specify a realistic time period to provide context.
Medium: Artistic medium or level of detail.

Key Trigger Words: Include terms like "high-resolution", "highly realistic".
Length: Keep the prompt under 75 tokens.
Avoid: Do not include any additional subjects in the prompt. Do not include any descriptions about the background.
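The mechanical parts of these guidelines (trigger words, length limit, no background description) can be checked automatically before a prompt is sent to the generator. The sketch below is a rough heuristic, not part of Gen-n-Val as described: it approximates tokens by whitespace-separated words, which undercounts real tokenizer tokens, and it flags any occurrence of the word "background".

```python
from typing import List

TRIGGER_WORDS = ("high-resolution", "highly realistic")
MAX_TOKENS = 75  # limit from the guidelines; counted here as words (assumption)


def check_ld_prompt(prompt: str) -> List[str]:
    """Return guideline violations; an empty list means the prompt passes."""
    issues: List[str] = []
    lowered = prompt.lower()
    if not any(word in lowered for word in TRIGGER_WORDS):
        issues.append("missing trigger word")
    # Whitespace word count is only a rough proxy for model tokens.
    if len(prompt.split()) >= MAX_TOKENS:
        issues.append("too long")
    if "background" in lowered:
        issues.append("mentions background")
    return issues
```

A check like this cannot judge whether a prompt describes exactly one subject; that part still depends on the LLM following the system prompt and on the validation agent afterwards.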

Data Validation Agent

Initial system prompt

As an AI assistant, your role is to analyze images to determine their suitability based on specific criteria. First, provide a detailed description of the image. Second, evaluate the image against four criteria: 1. it should contain only one subject; 2. the subject should be shown from a single angle or perspective, without multiple views or angles within the same image; 3. the subject should be intact and fully visible; and 4. the background should be empty or plain, without distracting elements. Third, based on this evaluation, decide whether to filter out the image if it violates any of the criteria or keep it if it meets all of them. At last, conclude with a result stating "Keep" if the image meets all criteria or "Filter Out" if it violates any. Present your analysis in the specified output format, including the image description, detailed evaluations with explanations and results for each criterion, a conclusion, and the final result.

Output Format:

Image Description:

Evaluation Criteria:

Single [Category Name]:
- Explanation
- Result: Meet or Fail
Single View:
- Explanation
- Result: Meet or Fail
Intact [Category Name]:
- Explanation
- Result: Meet or Fail
Plain Background:
- Explanation
- Result: Meet or Fail

Conclusion:

Result: Keep or Filter Out

Optimized system prompt

You are an AI assistant that analyzes images to determine their suitability based on specific criteria.

Instructions:

Describe the image in detail.
Evaluate the image against the following criteria:
- Criteria 1 - Single subject: The image should contain only one subject.
- Criteria 2 - Single View: The subject should be shown from a single angle or perspective.
- Criteria 3 - Intact subject: The subject should be intact and fully visible.
- Criteria 4 - Plain Background: The background should be empty or plain, without distracting elements.
Decide whether to filter out the image based on these criteria.
Conclude with Result: Keep if the image meets all criteria or Result: Filter Out if it violates any criteria.

Output Format:

Image Description:

[Your detailed description here]

Evaluation Criteria:

Single [Category Name]:
- [Explanation]
- Result: [Meet/Fail]
Single View:
- [Explanation]
- Result: [Meet/Fail]
Intact [Category Name]:
- [Explanation]
- Result: [Meet/Fail]
Plain Background:
- [Explanation]
- Result: [Meet/Fail]

Conclusion:

[Your conclusion here]

Result: [Keep/Filter Out]
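Because the output format is fixed, the agent's response can be parsed into per-criterion results and a final decision. The parser below is a minimal sketch that assumes the response follows the format exactly (criterion names on their own line ending with a colon, `- Result:` lines for criteria, and an undashed `Result:` line for the final verdict); the function names are hypothetical, and real model output may need more tolerant handling.

```python
import re
from typing import Dict, Optional, Tuple

SECTION_HEADERS = {"Image Description", "Evaluation Criteria", "Conclusion"}
CRITERION_RESULT = re.compile(r"^-\s*Result:\s*(Meet|Fail)\s*$")
FINAL_RESULT = re.compile(r"^Result:\s*(Keep|Filter Out)\s*$")


def parse_validation(response: str) -> Tuple[Dict[str, str], Optional[str]]:
    """Extract {criterion: 'Meet' | 'Fail'} and the final 'Keep' / 'Filter Out'."""
    criteria: Dict[str, str] = {}
    current: Optional[str] = None
    final: Optional[str] = None
    for raw in response.splitlines():
        line = raw.strip()
        is_heading = line.endswith(":") and not line.startswith("-")
        if is_heading and line[:-1] not in SECTION_HEADERS:
            current = line[:-1]  # e.g. "Single Orange"
            continue
        match = CRITERION_RESULT.match(line)
        if match and current is not None:
            criteria[current] = match.group(1)
            continue
        match = FINAL_RESULT.match(line)
        if match:
            final = match.group(1)
    return criteria, final


def should_keep(criteria: Dict[str, str], final: Optional[str]) -> bool:
    """Keep only when the verdict is 'Keep' and every parsed criterion is 'Meet'."""
    return final == "Keep" and bool(criteria) and all(v == "Meet" for v in criteria.values())
```

Requiring both the final verdict and all criteria to agree is a conservative choice: an inconsistent or unparseable response is treated as "Filter Out".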

Standard Prompt vs. Optimized LD Prompt

Car

Car generated by a standard Layer Diffusion prompt.
Standard prompt: An image of a single car, a motor vehicle with four wheels.

Car generated by an optimized Layer Diffusion prompt.
Optimized prompt from LD prompt agent: High-resolution digital rendering of a single, sleek, high-performance, 2023 Lamborghini Aventador alone, with a glossy matte black and metallic silver body, ultra-realistic, conveying speed and power, with dramatic golden-hour lighting from a low, eye-level, 45-degree angle, showcasing hand-stitched leather interior and smooth steering-wheel texture.

Diversity Effect of Optimized LD Prompt

Person foreground instances generated from the standard LD prompt and optimized LD prompts show the diversity effect. The optimized prompts produce broader variation in clothing, pose, color, texture, lighting, and viewpoint.

Person foreground instances generated using the standard LD prompt.

Person foreground instances generated using optimized LD prompts.

Data Validation Agent's Response

Validation example with multiple oranges on a tree

Orange: multiple target objects

Image Description: The image depicts a potted tree with green leaves and multiple oranges hanging from its branches.

Single orange: Fail
Single view: Meet
Intact orange: Meet
Plain background: Meet

Result: Filter Out

Validation example with an alarm clock and extra background objects

Alarm clock: non-plain background

Image Description: The image shows one digital clock, but it is accompanied by a potted plant and a lamp.

Single clock: Meet
Single view: Meet
Intact clock: Meet
Plain background: Fail

Result: Filter Out

Validation example where the generated object is a candle instead of a birthday card

Birthday card: wrong category

Image Description: The image depicts a large blue candle with two lit wicks against a solid black background.

Single birthday card: Fail
Single view: Meet
Intact birthday card: N/A
Plain background: Meet

Result: Filter Out

Validation example with no visible pancake

Pancake: no visible object

Image Description: The image is a solid black square with no visible objects or features.

Single pancake: Fail
Single view: N/A
Intact pancake: N/A
Plain background: Meet

Result: Filter Out

Synthetic Data Examples

Figure: LVIS image augmented by Gen-n-Val.

BibTeX

@inproceedings{huang2026gennval,
  title     = {Gen-n-Val: Agentic Image Data Generation and Validation},
  author    = {Huang, Jing-En and Fang, I-Sheng and Huang, Tzuhsuan and Liu, Yu-Lun and Wang, Chih-Yu and Chen, Jun-Cheng},
  booktitle = {CVPR Findings},
  year      = {2026}
}
