Start with a single agent that writes a precise image description for a reference image. Feed the reference image + the agent’s description into OpenAI (vision + image generation). Generate image sets per emotion and per angle, respecting fixed specs (puppet mechanics, mouth/tongue, fur/clothing consistency). Auto-detect extras (glasses/hat etc.), merge into identifiers, then QC against the spec. Deliver for review, then loop fixes.
0) Constants you can lock now
Angles (v1): front, left, right, back, 3q-left, 3q-right.
Emotions (v1): neutral, happy, sad, angry, surprised, disgust, fear, smirk.
Mouth states: closed, open-small, open-wide, tongue-out, teeth-showing.
Lighting: soft, even, studio-style, no harsh shadows.
Background: plain light gray (or transparent if you’ll composite later).
Output size: 1024×1024 (change if you need).
Style lock: “same character proportions, color palette, and materials across all shots.”
1) End-to-end pipeline (single agent orchestration)
Ingest
Inputs: reference_image, optional existing_identifiers.json.
Agent creates a Detailed Caption (objective description with anatomy, colors, materials, clothing, accessories, scars, markings).
Vision pass auto-detects extras (e.g., hat, glasses) and oddities (asymmetries, damage).
Build/merge Identifiers
Merge detected features into the Identifiers object (see schema below).
Freeze “fixed” specs (anatomy, proportions, fur pattern, palette, logos).
Shot list expansion
Cartesian expansion: angles × emotions × mouth states (configure which combos you actually need).
Generate a Shot Spec for each output image.
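For illustration only, a minimal sketch of that Cartesian expansion with itertools.product; the full make_plan() helper in section 7 adds the lighting, background, and framing fields.
# Sketch of the shot-list expansion; angles/emotions/mouths here are example subsets.
from itertools import product

angles = ["front", "3q-left"]
emotions = ["neutral", "happy", "angry"]
mouths = ["closed"]

shot_specs = [
    {"angle": a, "emotion": e, "mouth_state": m}
    for a, e, m in product(angles, emotions, mouths)
]
print(len(shot_specs))  # 2 x 3 x 1 = 6 shots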
Prompting for generation
For each shot: feed the reference_image + identifiers + shot spec to the image model.
Enforce: locked palette, proportions, accessories; background; lighting; camera angle; emotion.
QC (automatic)
Vision pass compares output vs. Identifiers + Shot Spec.
Score per rule (see QC rules). Tag any drift (palette off, accessories missing, wrong mouth state, angle misaligned).
Triage
If all rules pass: mark “Ready for Review”.
If minor drift: one automatic correction attempt (tighten prompt; add “do not change” reminders).
If still failing: route to user review with a compact diff report.
Delivery
Save images with deterministic names.
Emit qc_report.json and identifiers.final.json.
Produce a short review sheet.
Corrections loop
User comments → regenerate only failed shots with updated constraints.
2) Data contracts (copy-paste JSON Schemas)
2.1 Character Identifiers (authoritative)
{
  "character_id": "string",
  "name": "string",
  "source_reference": "path-or-url",
  "anatomy": {
    "species_or_type": "e.g., puppet gremlin",
    "height_relative": "e.g., small",
    "proportions_notes": "e.g., large head, short limbs",
    "silhouette_keywords": ["rounded ears", "tapered snout"]
  },
  "colors_materials": {
    "primary_palette": ["#A1B2C3", "#334455"],
    "secondary_palette": ["#..."],
    "materials": ["felt", "faux fur", "plastic eyes", "stitched mouth"]
  },
  "surface_features": {
    "fur_pattern": "describe zones and direction",
    "scars_markings": "none or details",
    "eye_details": {"iris_color": "hex", "pupil_shape": "round"},
    "mouth_teeth_tongue": {"teeth": "flat white", "tongue": "pink #F29CB2"}
  },
  "costume_baseline": {
    "garment": "yellow raincoat",
    "footwear": "none",
    "logo_text": null
  },
  "accessories": ["round glasses"],
  "mechanics": {
    "mouth_states_allowed": ["closed", "open-small", "open-wide", "tongue-out", "teeth-showing"],
    "jaw_hinge_visibility": "hidden",
    "ear_flex": "none",
    "eye_gaze_rules": "camera unless specified"
  },
  "forbidden_changes": [
    "do not change eye color",
    "no new scars",
    "preserve fur pattern zones"
  ],
  "notes": "any extra lock-ins"
}
2.2 Shot Spec (one per image to generate)
{
  "character_id": "string",
  "angle": "front | left | right | back | 3q-left | 3q-right",
  "emotion": "neutral | happy | sad | angry | surprised | disgust | fear | smirk",
  "mouth_state": "closed | open-small | open-wide | tongue-out | teeth-showing",
  "lighting": "soft even studio",
  "background": "plain light gray",
  "framing": "waist-up | full-body | bust",
  "camera_height": "eye-level",
  "notes": "any per-shot nuance"
}
2.3 QC Report (aggregated)
{
  "batch_id": "string",
  "pass_rate": 0.0,
  "items": [
    {
      "filename": "string",
      "shot_spec": { "angle": "front", "emotion": "happy", "mouth_state": "open-small" },
      "scores": {
        "palette_lock": 0.97,
        "proportions_lock": 0.94,
        "accessories_present": 1.0,
        "angle_match": 0.92,
        "emotion_match": 0.88,
        "mouth_state_match": 0.95,
        "artifact_check": 0.90,
        "background_lock": 1.0
      },
      "status": "pass | auto-retry | fail",
      "notes": "what drifted",
      "retry_prompt_delta": "extra constraints if auto-retry"
    }
  ]
}
3) Identifier checklist (what to capture up front)
Anatomy & silhouette: head/torso ratio, limb lengths, ear shape, tail, overall silhouette cues.
Color/palette: hex values for primary/secondary; eye/tongue/teeth color; fur zones.
Materials: fur/felt/fabric types; reflectivity; stitching lines; plastic parts.
Surface features: markings, freckles, scars, seams; fur direction; wear/tear.
Face specifics: eye shape, eyelid line, lash presence; muzzle/snout form.
Mouth/tongue/teeth: allowed states, tongue shape/length, tooth style.
Clothing baseline: exact garments, fasteners, logos, patterns.
Accessories: glasses, hats, jewelry; whether removable or always-on.
Mechanics locks: what must never change (jaw hinge visibility, ear flexibility).
Camera/lighting locks: even lighting, neutral background, camera height.
Forbidden list: any “never change” items (colors, logos, scars).
Oddities detector: hats/glasses/new stickers; damage; symmetry issues.
4) QC rules (pass/fail thresholds)
Palette lock ≥ 0.95 cosine similarity in HSV/histogram space.
Proportions lock: measured landmarks within ±3% of baseline ratios.
Accessories present: binary presence, else fail.
Angle match: head yaw/pitch/roll within target ±10° (3D estimate).
Emotion match: classifier confidence ≥ 0.80 (or model self-report with rationale).
Mouth state: classifier detects state requested; if mismatch → fail.
Background lock: uniformity > 0.98; no props.
Artifacts: no double pupils, extra fingers, melting edges; if found → auto-retry once.
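A minimal sketch of the palette-lock rule, assuming Pillow and NumPy: it compares normalized HSV histograms of the reference and a generated image via cosine similarity. The other rules (proportions, angle) would need a landmark or pose estimator, which this PoC leaves to the vision model.
import numpy as np
from PIL import Image

def hsv_histogram(path, bins=32):
    # Flattened, L2-normalized HSV histogram of an image.
    hsv = np.asarray(Image.open(path).convert("HSV"), dtype=np.float32)
    hist, _ = np.histogramdd(hsv.reshape(-1, 3), bins=(bins, bins, bins), range=((0, 255),) * 3)
    hist = hist.ravel()
    return hist / (np.linalg.norm(hist) + 1e-9)

def palette_lock_score(reference_path, generated_path):
    # Cosine similarity in [0, 1]; compare against the 0.95 threshold above.
    return float(np.dot(hsv_histogram(reference_path), hsv_histogram(generated_path)))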
5) File layout (simple and visual)
C:\tools\Character-Pipeline\
  01_input\
    reference\              (drop your reference image here)
    identifiers.seed.json   (optional)
  02_captions\
    character.caption.json
  03_specs\
    identifiers.final.json
    shots.plan.json
  04_generations\
    <character_id>\angle=<X>\emotion=<Y>\mouth=<Z>\*.png
  05_qc\
    qc_report.json
    diffs\*.png
  06_delivery\
    review_sheet.md
6) Proof-of-concept (tiny batch)
Character: 1
Angles: front, 3q-left
Emotions: neutral, happy, angry
Mouth: closed only
Total: 6 images
Success = QC pass rate ≥ 80% on first pass; no palette drift.
7) Copy-paste Python (baseline; uses OpenAI Vision + Image Gen)
This is a starter: it does not do pixel-level landmarking and relies on the model for description and soft QC. Treat it as a PoC, and verify model names and API parameters against the current docs before running.
import os, json, uuid, base64, pathlib
from datetime import datetime
from PIL import Image
from openai import OpenAI
BASE = r"C:\tools\Character-Pipeline"
os.makedirs(BASE, exist_ok=True)
client = OpenAI() # requires OPENAI_API_KEY in env
def save_json(data, path):
    pathlib.Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

def load_json(path, default=None):
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    return default
def image_data_url(path):
    # Chat Completions cannot read local file:// paths; embed the image as a base64 data URL.
    ext = pathlib.Path(path).suffix.lstrip(".").lower() or "png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/{ext};base64,{b64}"

# -------- 1) Detailed Caption via Vision --------
def describe_reference(ref_path):
    prompt = (
        "You are a meticulous character describer. "
        "Describe the character in the image for reproduction with image generation. "
        "Capture anatomy/silhouette, exact color palette (return hex swatches), materials, "
        "fur/skin patterns, eye details, mouth/teeth/tongue, clothing, accessories, seams/stitching, "
        "forbidden changes, and any oddities (hat, glasses, damage). "
        "Respond with plain prose followed by a 'palette' list of hex values."
    )
    msg = [
        {"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_url(ref_path)}},
        ]}
    ]
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # fast vision-capable; adjust per pricing/perf
        messages=msg,
        temperature=0.2,
    )
    text = resp.choices[0].message.content
    return {"raw": text}
# -------- 2) Build Identifiers --------
def build_identifiers(caption_json, seed_identifiers=None):
    sys = (
        "Merge the caption with the optional seed identifiers into the final Identifiers JSON "
        "using the schema given below. Fill hex colors from the caption palette. Preserve any 'forbidden_changes'. "
        "Return ONLY the JSON object, no commentary."
    )
    schema = load_json(os.path.join(BASE, "schema_identifiers.json")) or {}
    messages = [
        {"role": "system", "content": sys + "\nSCHEMA:\n" + json.dumps(schema)},
        {"role": "user", "content": json.dumps({
            "caption": caption_json,
            "seed": seed_identifiers or {},
        })},
    ]
    resp = client.chat.completions.create(
        model="o3-mini",  # reasoning model; does not accept a temperature parameter
        messages=messages,
    )
    text = resp.choices[0].message.content.strip()
    # Strip optional markdown code fences before parsing.
    if text.startswith("```"):
        text = text.split("```")[1]
        if text.lower().startswith("json"):
            text = text[4:]
    return json.loads(text)
# -------- 3) Expand Shots --------
ANGLES = ["front","3q-left","3q-right","left","right","back"]
EMOTIONS = ["neutral","happy","sad","angry","surprised","disgust","fear","smirk"]
MOUTHS = ["closed","open-small","open-wide","tongue-out","teeth-showing"]
def make_plan(char_id, angles, emotions, mouths):
    shots = []
    for a in angles:
        for e in emotions:
            for m in mouths:
                shots.append({
                    "character_id": char_id,
                    "angle": a, "emotion": e, "mouth_state": m,
                    "lighting": "soft even studio",
                    "background": "plain light gray",
                    "framing": "bust",
                    "camera_height": "eye-level",
                    "notes": "lock palette and proportions; no background props",
                })
    return {"shots": shots}
# -------- 4) Generate One Shot --------
def generate_shot(ref_path, identifiers, shot, out_path):
    # Compose an instruction that locks the character
    guardrails = [
        "Preserve exact colors and materials.",
        "Do not change eye color, fur pattern zones, or garment.",
        "Maintain proportions and silhouette.",
        "Plain light-gray background only.",
    ]
    prompt = f"""
Generate a clean studio image of the SAME character.
Angle: {shot['angle']}. Emotion: {shot['emotion']}. Mouth: {shot['mouth_state']}.
Framing: {shot['framing']}. Camera height: {shot['camera_height']}.
Lighting: {shot['lighting']}. Background: {shot['background']}.
Guardrails: {', '.join(guardrails)}.
Character Identifiers (authoritative): {json.dumps(identifiers, ensure_ascii=False)}
"""
    # Image generation conditioned on the reference: images.edit takes the reference
    # file plus the prompt. Verify current model/parameter names in the docs.
    with open(ref_path, "rb") as ref_file:
        img = client.images.edit(
            model="gpt-image-1",
            image=ref_file,
            prompt=prompt,
            size="1024x1024",
        )
    b64 = img.data[0].b64_json
    pathlib.Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64))
# -------- 5) QC (model-aided) --------
def qc_image(image_path, identifiers, shot):
    q = {
        "task": "Evaluate if the image matches identifiers and shot spec. "
                "Score each category 0..1 and explain any drift briefly.",
        "identifiers": identifiers, "shot": shot,
    }
    msg = [{"role": "user", "content": [
        {"type": "text", "text": json.dumps(q)},
        {"type": "image_url", "image_url": {"url": image_data_url(image_path)}},
    ]}]
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=msg,
        temperature=0,
    )
    analysis = resp.choices[0].message.content
    # naive parse; in practice instruct strict JSON and json.loads
    return {"analysis": analysis}
def main():
    ref = os.path.join(BASE, "01_input", "reference", "ref.png")
    caption = describe_reference(ref)
    save_json(caption, os.path.join(BASE, "02_captions", "character.caption.json"))
    seed = load_json(os.path.join(BASE, "01_input", "identifiers.seed.json"), {})
    # save schema locally for the merger step (optional)
    schema_path = os.path.join(BASE, "schema_identifiers.json")
    if not os.path.exists(schema_path):
        save_json({
            "fields": "(this file is only to provide structure cues; optional in PoC)"
        }, schema_path)
    identifiers = build_identifiers(caption, seed)
    save_json(identifiers, os.path.join(BASE, "03_specs", "identifiers.final.json"))
    char_id = identifiers.get("character_id", "char-" + uuid.uuid4().hex[:8])
    plan = make_plan(char_id, angles=["front", "3q-left"], emotions=["neutral", "happy", "angry"], mouths=["closed"])
    save_json(plan, os.path.join(BASE, "03_specs", "shots.plan.json"))
    qc_items = []
    for shot in plan["shots"]:
        out = os.path.join(BASE, "04_generations", char_id,
                           f"angle={shot['angle']}", f"emotion={shot['emotion']}", f"mouth={shot['mouth_state']}",
                           f"{char_id}_{shot['angle']}_{shot['emotion']}_{shot['mouth_state']}.png")
        generate_shot(ref, identifiers, shot, out)
        qc = qc_image(out, identifiers, shot)
        qc_items.append({"filename": out,
                         "shot_spec": {"angle": shot["angle"], "emotion": shot["emotion"], "mouth_state": shot["mouth_state"]},
                         "scores": {}, "status": "review", "notes": qc["analysis"]})
    save_json({"batch_id": uuid.uuid4().hex, "pass_rate": 0.0, "items": qc_items},
              os.path.join(BASE, "05_qc", "qc_report.json"))
    delivery_dir = os.path.join(BASE, "06_delivery")
    os.makedirs(delivery_dir, exist_ok=True)
    with open(os.path.join(delivery_dir, "review_sheet.md"), "w", encoding="utf-8") as f:
        f.write("# Review\n\nSee qc_report.json and images.")

if __name__ == "__main__":
    main()
Notes:
Replace models as needed. Model names and features evolve; confirm the current names in the docs before running.
Image generation with text + image inputs is documented under OpenAI Images/vision guides.
8) n8n outline (if you prefer visual automation)
Trigger: “New reference image dropped in 01_input/reference”
Node 1 (OpenAI Chat): Vision caption → character.caption.json
Node 2 (OpenAI Chat): Merge to identifiers.final.json (system prompt enforces schema)
Node 3 (Function): Build shots.plan.json
Node 4 (Loop): For each shot → Image Generate (OpenAI Images)
Node 5 (OpenAI Chat Vision): QC each image; write qc_report.json
Node 6 (If): Failures? → Auto-retry once with tightened prompt; else continue
Node 7 (Write Binary Files): Save images; Node 8 (Markdown): review_sheet.md
Node 9 (Notify): Send delivery folder link to you (Telegram/Email)
9) Limitations & failure points (and how to handle)
Consistency drift (palette/proportions): Use stronger “do not change” guardrails; include the reference image on every call; generate in small batches; keep background neutral.
Angle accuracy: The model may approximate. Add explicit yaw/pitch descriptors (“head turned ~30° to camera-left”).
Emotion clarity: Add describers (“raised cheeks, crow’s feet for happy; brows down for angry”).
Mouth mechanics: Explicitly name the tongue/teeth state; reiterate per prompt.
Accessories vanishing: Re-assert “accessories present: X” early and late in prompt; fail if missing.
Model updates: Model names/features change; confirm current names in the docs before running.
10) Quick start checklist (10 minutes)
Create folders as in section 5.
Put your reference image at 01_input/reference/ref.png.
(Optional) Draft identifiers.seed.json with any known locks.
Install openai, Pillow.
Set OPENAI_API_KEY (you have a master .env—ensure it’s loaded).
Run the script. Inspect 02_captions/character.caption.json.
Review 03_specs/identifiers.final.json for sanity.
Check 04_generations images.
Open 05_qc/qc_report.json—skim the notes.
If happy, use 06_delivery/review_sheet.md to approve or request fixes.
If you want, I can adapt this to your exact toolchain (your C:\tools orchestration, character DB, and naming scheme) and produce an n8n export JSON so you can import it in one go.
You said:
Don't we want a multi-agent system so that the agents can work on several flows modularly?
ChatGPT said:
Short answer: yes. A multi-agent system is the right move—modular, parallel, and easier to scale and debug. Below is a lean, visual, step-by-step plan you can implement now without overhauling everything.
Overview (what changes)
One orchestrator assigns work.
Small, single-purpose workers do one job well.
A queue sits between every stage so tasks run in parallel and can retry safely.
Artifacts (images, JSON, QC reports) are saved to disk with stable names so agents don’t block each other.
Minimal agent roster (start here)
Orchestrator
Splits the job into shots (angle × emotion × mouth). Pushes tasks to queues. Tracks status.
Describer (Vision)
Input: reference image. Output: 02_captions/character.caption.json + detected accessories/oddities.
Identifier Builder
Merges caption + seed into 03_specs/identifiers.final.json (the “law”).
Shot Planner
Builds 03_specs/shots.plan.json from angles/emotions/mouth states.
Generator (N parallel workers)
Input: one shot spec + identifiers + reference. Output: image file.
QC Inspector (M parallel workers)
Input: generated image + identifiers + shot spec. Output: QC scores + pass/fail + retry hints.
Auto-Fixer
Tightens prompt/constraints for failed items and re-queues once.
Publisher
Packages images, qc_report.json, and a review sheet; notifies you.
Visual flow (text diagram)
[Orchestrator]
  ├─> Q_caption ----> [Describer] -------------┐
  │                                            │
  ├─> Q_ident   ----> [Identifier Builder] ----┼--> identifiers.final.json
  │                                            │
  ├─> Q_plan    ----> [Shot Planner] ----------┘--> shots.plan.json
  │
  └─> Q_gen (per shot) ---> [Generator x N] ----> images
                                 └─> Q_qc ----> [QC x M] ---> pass/fail
                                                   ├─ pass → Q_publish → [Publisher]
                                                   └─ fail → Q_autofix → [Auto-Fixer] → Q_gen
Why this suits ADHD + scale
Parallel: Generators and QC run at the same time.
Modular: Each agent is focused; easier to debug.
Observable: Queues show where work is stuck.
Fail-safe: Crashes don’t kill the whole run; failed items are isolated and retried.
Data contracts (unchanged, just split per agent)
Identifiers JSON (authoritative spec).
Shot Spec JSON (one per image task).
QC Report JSON (scored, pass/fail, retry reason).
Folder layout (shared, agent-safe)
C:\tools\Character-Pipeline\
  01_input\reference\ref.png
  02_captions\character.caption.json
  03_specs\identifiers.final.json
  03_specs\shots.plan.json
  04_generations\<character_id>\angle=...\emotion=...\mouth=...\*.png
  05_qc\qc_report.json
  06_delivery\review_sheet.md
  .runtime\queues\   (lightweight queue state if not using Redis/Rabbit)
  .runtime\logs\
Queueing options (pick one)
Fastest to set up: File-based queues (JSON lines in .runtime\queues\*). Works on Windows, zero infra (see the claim-by-rename sketch after this list).
Better for scale: Redis streams (n8n supports Redis; easy to monitor).
n8n-only: Use n8n’s built-in “Execute Workflow” with concurrency limits per node.
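A minimal sketch of the file-queue option, assuming one JSON file per task (a slightly simpler variant of JSON lines that makes claiming trivial): a worker claims a task by renaming it from pending\ into processing\, and os.replace loses cleanly if another worker got there first.
import json, os, pathlib

QUEUE_ROOT = pathlib.Path(r"C:\tools\Character-Pipeline\.runtime\queues\Q_gen")

def enqueue(task: dict) -> None:
    # One JSON file per task; the filename doubles as the task id.
    pending = QUEUE_ROOT / "pending"
    pending.mkdir(parents=True, exist_ok=True)
    name = f"{task['character_id']}_{task['shot']['angle']}_{task['shot']['emotion']}_{task['shot']['mouth_state']}.json"
    (pending / name).write_text(json.dumps(task, indent=2), encoding="utf-8")

def claim_next():
    # Move the oldest pending task into processing\; whoever wins the rename owns it.
    pending = QUEUE_ROOT / "pending"
    processing = QUEUE_ROOT / "processing"
    processing.mkdir(parents=True, exist_ok=True)
    for item in sorted(pending.glob("*.json")):
        claimed = processing / item.name
        try:
            os.replace(item, claimed)            # atomic on the same volume
        except (FileNotFoundError, PermissionError):
            continue                             # another worker got it first
        return claimed, json.loads(claimed.read_text(encoding="utf-8"))
    return None, None
The task dict shape matches the Q_gen item schema further below.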
n8n blueprint (concise)
Workflow A – Prep
Trigger: “file added” to 01_input\reference\
Node: Describer (OpenAI Vision) → write 02_captions\...
Node: Identifier Builder (OpenAI Chat) → write 03_specs\identifiers.final.json
Node: Shot Planner (Function) → write 03_specs\shots.plan.json
Node: For-Each (shots) → push shot specs to Q_gen (could be a simple Google Sheet row, Redis list, or local file queue)
Workflow B – Generator Worker (concurrency: 3–6)
Trigger: “new item in Q_gen”
Node: Generate Image (OpenAI Images/your model) → save to 04_generations\...
Node: Push to Q_qc
Workflow C – QC Worker (concurrency: 2–4)
Trigger: “new item in Q_qc”
Node: QC (OpenAI Vision/Text) → append results to 05_qc\qc_report.json
IF pass → Q_publish; IF fail and retry_count < 1 → Q_autofix; else → Q_publish (marked fail)
Workflow D – Auto-Fixer
Tighten prompts with explicit deltas (“re-enforce glasses, keep iris color #xxxxxx”)
Requeue to Q_gen with retry_count=1
Workflow E – Publisher
Gather passed images; collate qc_report.json; render review_sheet.md
Notify via Telegram/Email with folder link
Config file (copy-paste YAML)
character_pipeline:
  angles: [front, 3q-left, 3q-right, left, right, back]
  emotions: [neutral, happy, sad, angry, surprised, disgust, fear, smirk]
  mouth_states: [closed, open-small, open-wide, tongue-out, teeth-showing]
  lighting: "soft even studio"
  background: "plain light gray"
  framing: "bust"
  camera_height: "eye-level"
  generator:
    concurrency: 4
    model: "gpt-image-1"   # replace if you switch vendors
    size: "1024x1024"
  qc:
    concurrency: 3
    thresholds:
      palette_lock: 0.95
      proportions_lock: 0.97
      emotion_match: 0.80
      angle_tolerance_deg: 10
    max_retries: 1
  storage_root: "C:\\tools\\Character-Pipeline"
  queues:
    type: "files"   # files|redis
    path: ".runtime\\queues"
Message schemas between agents
Q_gen item
{
  "character_id": "char-xxxx",
  "shot": {"angle":"3q-left","emotion":"happy","mouth_state":"closed"},
  "identifiers_path": "03_specs/identifiers.final.json",
  "reference_path": "01_input/reference/ref.png",
  "retry_count": 0
}
Q_qc item
{
  "image_path": "04_generations/char-xxxx/angle=3q-left/emotion=happy/mouth=closed/char-xxxx_3q-left_happy_closed.png",
  "shot": {...},
  "identifiers_path": "03_specs/identifiers.final.json"
}
Rollout plan (low risk)
PoC multi-agent: Keep your existing single-script logic but split it into 3 workers: Generator x2 + QC x1, using file queues.
Add Orchestrator: Build shots.plan.json and push items to Q_gen; watch concurrency.
Add Auto-Fixer: Only when you see common failures (accessories missing, mouth mismatch).
Swap queues to Redis if/when file queues feel limiting.
Failure points to expect (and fixes)
Race conditions writing qc_report.json: Append via temp file + atomic rename, or store one report per image and merge at the end (see the sketch after this list).
Duplicate work: Workers must claim a task by moving the JSON from pending to processing folder (or use Redis XCLAIM).
Prompt drift under parallel load: Keep reference image + identifiers in every generation call; include “do not change” list twice (top and end).
Accessory dropouts: Add explicit presence checks in QC; Auto-Fixer adds: “glasses MUST be present and circular, black rim.”
Angle ambiguity: Encode head yaw/pitch in degrees in the shot spec (“target yaw +30°, pitch 0°, roll 0°”). Teach QC to estimate roughly.
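A minimal sketch of the "one report per image, merge at the end" approach from the first bullet above, using a temp file plus os.replace so a reader never sees a half-written qc_report.json.
import json, os, pathlib

QC_DIR = pathlib.Path(r"C:\tools\Character-Pipeline\05_qc")

def write_item_report(image_path: str, result: dict) -> None:
    # Each QC worker writes its own small file; no shared file, no lock needed.
    items_dir = QC_DIR / "items"
    items_dir.mkdir(parents=True, exist_ok=True)
    out = items_dir / (pathlib.Path(image_path).stem + ".qc.json")
    out.write_text(json.dumps(result, indent=2), encoding="utf-8")

def merge_reports(batch_id: str) -> None:
    # Publisher step: collect per-image reports, then swap the merged file in atomically.
    items = [json.loads(p.read_text(encoding="utf-8"))
             for p in sorted((QC_DIR / "items").glob("*.qc.json"))]
    passed = sum(1 for i in items if i.get("status") == "pass")
    report = {"batch_id": batch_id,
              "pass_rate": passed / len(items) if items else 0.0,
              "items": items}
    tmp = QC_DIR / "qc_report.json.tmp"
    tmp.write_text(json.dumps(report, indent=2), encoding="utf-8")
    os.replace(tmp, QC_DIR / "qc_report.json")   # atomic rename on the same volume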
Quick start (today)
Keep your earlier single-agent script for Describer/Identifiers/Planner.
Add two small Python scripts:
worker_generate.py → polls Q_gen, saves image, enqueues to Q_qc (sketched below).
worker_qc.py → polls Q_qc, writes per-image QC JSON, enqueues pass/fail.
Start 3 consoles: worker_generate.py (run twice), worker_qc.py (once). You’ve got a multi-agent system.
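A minimal sketch of worker_generate.py under stated assumptions: the filequeue and pipeline_core module names are placeholders for the file-queue helpers sketched in the queueing section and for generate_shot/load_json from the single-agent script; adjust them to your layout.
# worker_generate.py -- polls Q_gen, generates one image per task, hands it to Q_qc.
import time, pathlib

from filequeue import claim_next, enqueue_qc            # hypothetical module names
from pipeline_core import generate_shot, load_json      # hypothetical module names

BASE = pathlib.Path(r"C:\tools\Character-Pipeline")

# Defaults for Shot Spec fields that the compact Q_gen item does not carry.
SHOT_DEFAULTS = {"lighting": "soft even studio", "background": "plain light gray",
                 "framing": "bust", "camera_height": "eye-level",
                 "notes": "lock palette and proportions; no background props"}

def run_forever(poll_seconds=2.0):
    while True:
        claimed_path, task = claim_next()
        if task is None:
            time.sleep(poll_seconds)                     # queue empty; back off briefly
            continue
        shot = {**SHOT_DEFAULTS, **task["shot"], "character_id": task["character_id"]}
        identifiers = load_json(str(BASE / task["identifiers_path"]))
        out = (BASE / "04_generations" / task["character_id"]
               / f"angle={shot['angle']}" / f"emotion={shot['emotion']}" / f"mouth={shot['mouth_state']}"
               / f"{task['character_id']}_{shot['angle']}_{shot['emotion']}_{shot['mouth_state']}.png")
        generate_shot(str(BASE / task["reference_path"]), identifiers, shot, str(out))
        enqueue_qc({"image_path": str(out), "shot": shot,
                    "identifiers_path": task["identifiers_path"]})
        claimed_path.unlink()                            # done; drop the claimed task file

if __name__ == "__main__":
    run_forever()
worker_qc.py follows the same polling pattern, calling qc_image and write_item_report instead.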