Goal (plain English)
Start with a single agent that writes a precise image description for a reference image. Feed the reference image + the agent’s description into OpenAI (vision + image generation). Generate image sets per emotion and per angle, respecting fixed specs (puppet mechanics, mouth/tongue, fur/clothing consistency). Auto-detect extras (glasses/hat etc.), merge into identifiers, then QC against the spec. Deliver for review, then loop fixes.
0) Constants you can lock now
Angles (v1): front, left, right, back, 3q-left, 3q-right.
Emotions (v1): neutral, happy, sad, angry, surprised, disgust, fear, smirk.
Mouth states: closed, open-small, open-wide, tongue-out, teeth-showing.
Lighting: soft, even, studio-style, no harsh shadows.
Background: plain light gray (or transparent if you’ll composite later).
Output size: 1024×1024 (change if you need).
Style lock: “same character proportions, color palette, and materials across all shots.”
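If you want these locks machine-readable from the start, a single config object that every stage reads is enough; a minimal sketch (the SPEC name and layout are illustrative, mirroring the lists above):
# Locked pipeline constants (v1); names and structure are illustrative.
SPEC = {
    "angles": ["front", "left", "right", "back", "3q-left", "3q-right"],
    "emotions": ["neutral", "happy", "sad", "angry", "surprised", "disgust", "fear", "smirk"],
    "mouth_states": ["closed", "open-small", "open-wide", "tongue-out", "teeth-showing"],
    "lighting": "soft even studio",
    "background": "plain light gray",
    "output_size": "1024x1024",
    "style_lock": "same character proportions, color palette, and materials across all shots",
}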
1) End-to-end pipeline (single agent orchestration)
Ingest
Inputs: reference_image, optional existing_identifiers.json.
Agent creates a Detailed Caption (objective description with anatomy, colors, materials, clothing, accessories, scars, markings).
Vision pass auto-detects extras (e.g., hat, glasses) and oddities (asymmetries, damage).
Build/merge Identifiers
Merge detected features into the Identifiers object (see schema below).
Freeze “fixed” specs (anatomy, proportions, fur pattern, palette, logos).
Shot list expansion
Cartesian expansion: angles × emotions × mouth states (configure which combos you actually need). With the full v1 lists that is 6 × 8 × 5 = 240 images per character, so trim combinations early.
Generate a Shot Spec for each output image.
Prompting for generation
For each shot: feed the reference_image + identifiers + shot spec to the image model.
Enforce: locked palette, proportions, accessories; background; lighting; camera angle; emotion.
QC (automatic)
Vision pass compares output vs. Identifiers + Shot Spec.
Score per rule (see QC rules). Tag any drift (palette off, accessories missing, wrong mouth state, angle misaligned).
Triage
If all rules pass: mark “Ready for Review”.
If minor drift: one automatic correction attempt (tighten prompt; add “do not change” reminders).
If still failing: route to user review with a compact diff report.
Delivery
Save images with deterministic names (see the naming sketch at the end of this section).
Emit qc_report.json and identifiers.final.json.
Produce a short review sheet.
Corrections loop
User comments → regenerate only failed shots with updated constraints.
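A minimal sketch of the deterministic naming mentioned in the Delivery step (the helper name is illustrative; the pattern matches the folder layout in section 5):
def shot_filename(character_id, angle, emotion, mouth_state, ext="png"):
    # Deterministic, filesystem-safe name: the same shot spec always maps to the same file.
    return f"{character_id}_{angle}_{emotion}_{mouth_state}.{ext}"

# e.g. shot_filename("char-ab12", "3q-left", "happy", "closed")
# -> "char-ab12_3q-left_happy_closed.png"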
2) Data contracts (copy-paste JSON Schemas)
2.1 Character Identifiers (authoritative)
{
"character_id": "string",
"name": "string",
"source_reference": "path-or-url",
"anatomy": {
"species_or_type": "e.g., puppet gremlin",
"height_relative": "e.g., small",
"proportions_notes": "e.g., large head, short limbs",
"silhouette_keywords": ["rounded ears","tapered snout"]
},
"colors_materials": {
"primary_palette": ["#A1B2C3","#334455"],
"secondary_palette": ["#..."],
"materials": ["felt","faux fur","plastic eyes","stitched mouth"]
},
"surface_features": {
"fur_pattern": "describe zones and direction",
"scars_markings": "none or details",
"eye_details": {"iris_color":"hex","pupil_shape":"round"},
"mouth_teeth_tongue": {"teeth":"flat white","tongue":"pink #F29CB2"}
},
"costume_baseline": {
"garment": "yellow raincoat",
"footwear": "none",
"logo_text": null
},
"accessories": ["round glasses"],
"mechanics": {
"mouth_states_allowed": ["closed","open-small","open-wide","tongue-out","teeth-showing"],
"jaw_hinge_visibility": "hidden",
"ear_flex": "none",
"eye_gaze_rules": "camera unless specified"
},
"forbidden_changes": [
"do not change eye color",
"no new scars",
"preserve fur pattern zones"
],
"notes": "any extra lock-ins"
}
2.2 Shot Spec (one per image to generate)
{
"character_id": "string",
"angle": "front | left | right | back | 3q-left | 3q-right",
"emotion": "neutral | happy | sad | angry | surprised | disgust | fear | smirk",
"mouth_state": "closed | open-small | open-wide | tongue-out | teeth-showing",
"lighting": "soft even studio",
"background": "plain light gray",
"framing": "waist-up | full-body | bust",
"camera_height": "eye-level",
"notes": "any per-shot nuance"
}
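Before generation, each shot spec can be checked against the allowed values above; a minimal validation sketch (names are illustrative):
ALLOWED = {
    "angle": {"front", "left", "right", "back", "3q-left", "3q-right"},
    "emotion": {"neutral", "happy", "sad", "angry", "surprised", "disgust", "fear", "smirk"},
    "mouth_state": {"closed", "open-small", "open-wide", "tongue-out", "teeth-showing"},
}

def validate_shot(shot):
    # Return a list of problems; an empty list means the spec is usable.
    return [f"{k}: {shot.get(k)!r} not in {sorted(v)}"
            for k, v in ALLOWED.items() if shot.get(k) not in v]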
2.3 QC Report (aggregated)
{
"batch_id": "string",
"pass_rate": 0.0,
"items": [
{
"filename": "string",
"shot_spec": { "angle":"front","emotion":"happy","mouth_state":"open-small" },
"scores": {
"palette_lock": 0.97,
"proportions_lock": 0.94,
"accessories_present": 1.0,
"angle_match": 0.92,
"emotion_match": 0.88,
"mouth_state_match": 0.95,
"artifact_check": 0.90,
"background_lock": 1.0
},
"status": "pass | auto-retry | fail",
"notes": "what drifted",
"retry_prompt_delta": "extra constraints if auto-retry"
}
]
}
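pass_rate is simply the fraction of items whose status is "pass"; a minimal aggregation sketch (function name is illustrative):
def aggregate_qc(batch_id, items):
    # Fraction of shots that passed QC; guard against an empty batch.
    passes = sum(1 for it in items if it["status"] == "pass")
    return {"batch_id": batch_id, "pass_rate": round(passes / max(len(items), 1), 3), "items": items}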
3) Identifier checklist (what to capture up front)
Anatomy & silhouette: head/torso ratio, limb lengths, ear shape, tail, overall silhouette cues.
Color/palette: hex values for primary/secondary; eye/tongue/teeth color; fur zones.
Materials: fur/felt/fabric types; reflectivity; stitching lines; plastic parts.
Surface features: markings, freckles, scars, seams; fur direction; wear/tear.
Face specifics: eye shape, eyelid line, lash presence; muzzle/snout form.
Mouth/tongue/teeth: allowed states, tongue shape/length, tooth style.
Clothing baseline: exact garments, fasteners, logos, patterns.
Accessories: glasses, hats, jewelry; whether removable or always-on.
Mechanics locks: what must never change (jaw hinge visibility, ear flexibility).
Camera/lighting locks: even lighting, neutral background, camera height.
Forbidden list: any “never change” items (colors, logos, scars).
Oddities detector: hats/glasses/new stickers; damage; symmetry issues.
4) QC rules (pass/fail thresholds)
Palette lock: cosine similarity ≥ 0.95 between HSV color histograms of the output and the reference (see the sketch after this list).
Proportions lock: measured landmarks within ±3% of baseline ratios.
Accessories present: binary presence, else fail.
Angle match: head yaw/pitch/roll within target ±10° (3D estimate).
Emotion match: classifier confidence ≥ 0.80 (or model self-report with rationale).
Mouth state: classifier detects state requested; if mismatch → fail.
Background lock: uniformity > 0.98; no props.
Artifacts: no double pupils, extra fingers, melting edges; if found → auto-retry once.
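The palette-lock rule can be approximated without any landmarking; a minimal sketch using Pillow and NumPy (assumes both are installed; the function name and bin count are illustrative):
import numpy as np
from PIL import Image

def palette_similarity(ref_path, out_path, bins=32):
    # Cosine similarity between HSV histograms of the reference and the output image.
    def hsv_hist(path):
        hsv = np.asarray(Image.open(path).convert("HSV"), dtype=np.float32)
        hist, _ = np.histogramdd(hsv.reshape(-1, 3), bins=(bins, bins, bins),
                                 range=((0, 255), (0, 255), (0, 255)))
        v = hist.flatten()
        return v / (np.linalg.norm(v) + 1e-9)
    return float(np.dot(hsv_hist(ref_path), hsv_hist(out_path)))

# Pass the palette-lock rule when palette_similarity(ref, out) >= 0.95.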
5) File layout (simple and visual)
C:\tools\Character-Pipeline\
01_input\
reference\ (drop your reference image here)
identifiers.seed.json (optional)
02_captions\
character.caption.json
03_specs\
identifiers.final.json
shots.plan.json
04_generations\
<character_id>\angle=<X>\emotion=<Y>\mouth=<Z>\*.png
05_qc\
qc_report.json
diffs\*.png
06_delivery\
review_sheet.md
6) Proof-of-concept (tiny batch)
Character: 1
Angles: front, 3q-left
Emotions: neutral, happy, angry
Mouth: closed only
Total: 6 images
Success = QC pass rate ≥ 80% on first pass; no palette drift.
7) Copy-paste Python (baseline; uses OpenAI Vision + Image Gen)
This is a starter. It won’t do pixel-level landmarking; it relies on the model for description and soft QC. Treat it as a proof of concept and verify API details (model names, parameters) before relying on it.
import os, json, uuid, base64, pathlib
from datetime import datetime
from PIL import Image
from openai import OpenAI
BASE = r"C:\tools\Character-Pipeline"
os.makedirs(BASE, exist_ok=True)
client = OpenAI() # requires OPENAI_API_KEY in env
def save_json(data, path):
pathlib.Path(path).parent.mkdir(parents=True, exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
json.dump(data, f, indent=2, ensure_ascii=False)
def load_json(path, default=None):
if os.path.exists(path):
return json.load(open(path, "r", encoding="utf-8"))
return default
# -------- 1) Detailed Caption via Vision --------
def describe_reference(ref_path):
prompt = (
"You are a meticulous character describer. "
"Describe the character in the image for reproduction with image generation. "
"Capture anatomy/silhouette, exact color palette (return hex swatches), materials, "
"fur/skin patterns, eye details, mouth/teeth/tongue, clothing, accessories, seams/stitching, "
"forbidden changes, and any oddities (hat, glasses, damage). "
"Respond as strict JSON in the 'raw' field containing plain prose plus a 'palette' field with hexes."
)
    # Chat Completions vision input needs a fetchable URL or a base64 data URL;
    # a local file:// path is not fetched, so embed the image as base64.
    with open(ref_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    msg = [
        {"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ]}
    ]
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # fast vision-capable; adjust per pricing/perf
        messages=msg,
        temperature=0.2
    )
text = resp.choices[0].message.content
return {"raw": text}
# -------- 2) Build Identifiers --------
def build_identifiers(caption_json, seed_identifiers=None):
sys = (
"Merge the caption with the optional seed identifiers into the final Identifiers JSON "
"using the schema given below. Fill hex colors from caption palette. Preserve any 'forbidden_changes'. "
"Return ONLY the JSON object, no commentary."
)
schema = load_json(os.path.join(BASE,"schema_identifiers.json")) or {}
messages = [
{"role":"system","content":sys + "\nSCHEMA:\n" + json.dumps(schema)},
{"role":"user","content":json.dumps({
"caption": caption_json,
"seed": seed_identifiers or {}
})}
]
    resp = client.chat.completions.create(
        model="o3-mini",  # reasoning model; it does not accept a temperature parameter
        messages=messages
    )
    content = resp.choices[0].message.content
    # Tolerate markdown fences around the JSON: parse from the first '{' to the last '}'
    return json.loads(content[content.find("{"):content.rfind("}") + 1])
# -------- 3) Expand Shots --------
ANGLES = ["front","3q-left","3q-right","left","right","back"]
EMOTIONS = ["neutral","happy","sad","angry","surprised","disgust","fear","smirk"]
MOUTHS = ["closed","open-small","open-wide","tongue-out","teeth-showing"]
def make_plan(char_id, angles, emotions, mouths):
shots = []
for a in angles:
for e in emotions:
for m in mouths:
shots.append({
"character_id": char_id,
"angle": a, "emotion": e, "mouth_state": m,
"lighting":"soft even studio",
"background":"plain light gray",
"framing":"bust",
"camera_height":"eye-level",
"notes":"lock palette and proportions; no background props"
})
return {"shots": shots}
# -------- 4) Generate One Shot --------
def generate_shot(ref_path, identifiers, shot, out_path):
# Compose an instruction that locks the character
guardrails = [
"Preserve exact colors and materials.",
"Do not change eye color, fur pattern zones, or garment.",
"Maintain proportions and silhouette.",
"Plain light-gray background only."
]
prompt = f"""
Generate a clean studio image of the SAME character.
Angle: {shot['angle']}. Emotion: {shot['emotion']}. Mouth: {shot['mouth_state']}.
Framing: {shot['framing']}. Camera height: {shot['camera_height']}.
Lighting: {shot['lighting']}. Background: {shot['background']}.
Guardrails: {', '.join(guardrails)}.
Character Identifiers (authoritative): {json.dumps(identifiers, ensure_ascii=False)}
"""
    # Image generation conditioned on the reference: images.generate is text-only,
    # so use images.edit, which accepts an input image plus a prompt (gpt-image-1).
    with open(ref_path, "rb") as ref_file:
        img = client.images.edit(
            model="gpt-image-1",
            image=ref_file,
            prompt=prompt,
            size="1024x1024"
        )
    b64 = img.data[0].b64_json
    pathlib.Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(b64))
# -------- 5) QC (model-aided) --------
def qc_image(image_path, identifiers, shot):
q = {
"task":"Evaluate if the image matches identifiers and shot spec. "
"Score each category 0..1 and explain any drift briefly.",
"identifiers":identifiers, "shot":shot
}
msg = [{"role":"user","content":[
{"type":"input_text","text":json.dumps(q)},
{"type":"input_image","image_url":f"file://{image_path}"}
]}]
resp = client.chat.completions.create(
model="gpt-4.1-mini",
messages=msg,
temperature=0
)
analysis = resp.choices[0].message.content
# naive parse; in practice instruct strict JSON and json.loads
return {"analysis": analysis}
def main():
ref = os.path.join(BASE,"01_input","reference","ref.png")
caption = describe_reference(ref)
save_json(caption, os.path.join(BASE,"02_captions","character.caption.json"))
seed = load_json(os.path.join(BASE,"01_input","identifiers.seed.json"), {})
# save schema locally for the merger step (optional)
schema_path = os.path.join(BASE,"schema_identifiers.json")
if not os.path.exists(schema_path):
save_json({
"fields":"(this file is only to provide structure cues; optional in PoC)"
}, schema_path)
identifiers = build_identifiers(caption, seed)
save_json(identifiers, os.path.join(BASE,"03_specs","identifiers.final.json"))
char_id = identifiers.get("character_id","char-"+uuid.uuid4().hex[:8])
plan = make_plan(char_id, angles=["front","3q-left"], emotions=["neutral","happy","angry"], mouths=["closed"])
save_json(plan, os.path.join(BASE,"03_specs","shots.plan.json"))
qc_items = []
for shot in plan["shots"]:
out = os.path.join(BASE,"04_generations",char_id,
f"angle={shot['angle']}",f"emotion={shot['emotion']}",f"mouth={shot['mouth_state']}",
f"{char_id}_{shot['angle']}_{shot['emotion']}_{shot['mouth_state']}.png")
generate_shot(ref, identifiers, shot, out)
qc = qc_image(out, identifiers, shot)
qc_items.append({"filename": out, "shot_spec": {"angle":shot["angle"],"emotion":shot["emotion"],"mouth_state":shot["mouth_state"]},
"scores":{}, "status":"review", "notes": qc["analysis"]})
save_json({"batch_id": uuid.uuid4().hex, "pass_rate": 0.0, "items": qc_items},
os.path.join(BASE,"05_qc","qc_report.json"))
with open(os.path.join(BASE,"06_delivery","review_sheet.md"),"w",encoding="utf-8") as f:
f.write("# Review\n\nSee qc_report.json and images.")
if __name__ == "__main__":
main()
Notes:
Replace models as needed. Model names and features evolve; confirm the current names in the docs before running.
Image generation with text + image inputs is documented under OpenAI Images/vision guides.
8) n8n outline (if you prefer visual automation)
Trigger: “New reference image dropped in 01_input/reference”
Node 1 (OpenAI Chat): Vision caption → character.caption.json
Node 2 (OpenAI Chat): Merge to identifiers.final.json (system prompt enforces schema)
Node 3 (Function): Build shots.plan.json
Node 4 (Loop): For each shot → Image Generate (OpenAI Images)
Node 5 (OpenAI Chat Vision): QC each image; write qc_report.json
Node 6 (If): Failures? → Auto-retry once with tightened prompt; else continue
Node 7 (Write Binary Files): Save images; Node 8 (Markdown): review_sheet.md
Node 9 (Notify): Send delivery folder link to you (Telegram/Email)
9) Limitations & failure points (and how to handle)
Consistency drift (palette/proportions): Use stronger “do not change” guardrails; include the reference image on every call; generate in small batches; keep background neutral.
Angle accuracy: The model may approximate. Add explicit yaw/pitch descriptors (“head turned ~30° to camera-left”); see the mapping sketch after this list.
Emotion clarity: Add explicit descriptors (“raised cheeks and crow’s feet for happy; brows pulled down for angry”).
Mouth mechanics: Explicitly name the tongue/teeth state; reiterate per prompt.
Accessories vanishing: Re-assert “accessories present: X” early and late in prompt; fail if missing.
Model updates: Model names/features change; confirm current names in the docs before running.
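One way to add the yaw descriptors mentioned above is a fixed mapping from angle label to prompt text; a minimal sketch (values are illustrative):
ANGLE_DESCRIPTORS = {
    "front":    "facing the camera directly, yaw 0°",
    "3q-left":  "head turned about 30-45° to camera-left (three-quarter view)",
    "3q-right": "head turned about 30-45° to camera-right (three-quarter view)",
    "left":     "full left profile, about 90° to camera-left",
    "right":    "full right profile, about 90° to camera-right",
    "back":     "facing away from the camera, back of head visible",
}
# Append ANGLE_DESCRIPTORS[shot["angle"]] to the generation prompt for each shot.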
10) Quick start checklist (10 minutes)
Create folders as in section 5.
Put your reference image at 01_input/reference/ref.png.
(Optional) Draft identifiers.seed.json with any known locks.
Install openai, Pillow.
Set OPENAI_API_KEY (you have a master .env; make sure it is loaded, or see the sketch after this checklist).
Run the script. Inspect 02_captions/character.caption.json.
Review 03_specs/identifiers.final.json for sanity.
Check 04_generations images.
Open 05_qc/qc_report.json—skim the notes.
If happy, use 06_delivery/review_sheet.md to approve or request fixes.
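If the key lives in a master .env rather than the shell environment, a minimal loading sketch (assumes the python-dotenv package; the path is illustrative):
from dotenv import load_dotenv  # pip install python-dotenv

# Load OPENAI_API_KEY from the master .env before creating the OpenAI client.
load_dotenv(r"C:\tools\.env")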