Skip to main content
Glama
Pasted-Goal-plain-English-Start-with-a-single-agent-that-writes-a-precise-image-description-for-a-refere-1758801140163_1758801140164.txt16.9 kB
Goal (plain English) Start with a single agent that writes a precise image description for a reference image. Feed the reference image + the agent’s description into OpenAI (vision + image generation). Generate image sets per emotion and per angle, respecting fixed specs (puppet mechanics, mouth/tongue, fur/clothing consistency). Auto-detect extras (glasses/hat etc.), merge into identifiers, then QC against the spec. Deliver for review, then loop fixes. 0) Constants you can lock now Angles (v1): front, left, right, back, 3q-left, 3q-right. Emotions (v1): neutral, happy, sad, angry, surprised, disgust, fear, smirk. Mouth states: closed, open-small, open-wide, tongue-out, teeth-showing. Lighting: soft, even, studio-style, no harsh shadows. Background: plain light gray (or transparent if you’ll composite later). Output size: 1024×1024 (change if you need). Style lock: “same character proportions, color palette, and materials across all shots.” 1) End-to-end pipeline (single agent orchestration) Ingest Inputs: reference_image, optional existing_identifiers.json. Agent creates a Detailed Caption (objective description with anatomy, colors, materials, clothing, accessories, scars, markings). Vision pass auto-detects extras (e.g., hat, glasses) and oddities (asymmetries, damage). Build/merge Identifiers Merge detected features into the Identifiers object (see schema below). Freeze “fixed” specs (anatomy, proportions, fur pattern, palette, logos). Shot list expansion Cartesian expansion: angles × emotions × mouth states (configure which combos you actually need). Generate a Shot Spec for each output image. Prompting for generation For each shot: feed the reference_image + identifiers + shot spec to the image model. Enforce: locked palette, proportions, accessories; background; lighting; camera angle; emotion. QC (automatic) Vision pass compares output vs. Identifiers + Shot Spec. Score per rule (see QC rules). Tag any drift (palette off, accessories missing, wrong mouth state, angle misaligned). Triage If all rules pass: mark “Ready for Review”. If minor drift: one automatic correction attempt (tighten prompt; add “do not change” reminders). If still failing: route to user review with a compact diff report. Delivery Save images with deterministic names. Emit qc_report.json and identifiers.final.json. Produce a short review sheet. Corrections loop User comments → regenerate only failed shots with updated constraints. 2) Data contracts (copy-paste JSON Schemas) 2.1 Character Identifiers (authoritative) { "character_id": "string", "name": "string", "source_reference": "path-or-url", "anatomy": { "species_or_type": "e.g., puppet gremlin", "height_relative": "e.g., small", "proportions_notes": "e.g., large head, short limbs", "silhouette_keywords": ["rounded ears","tapered snout"] }, "colors_materials": { "primary_palette": ["#A1B2C3","#334455"], "secondary_palette": ["#..."], "materials": ["felt","faux fur","plastic eyes","stitched mouth"] }, "surface_features": { "fur_pattern": "describe zones and direction", "scars_markings": "none or details", "eye_details": {"iris_color":"hex","pupil_shape":"round"}, "mouth_teeth_tongue": {"teeth":"flat white","tongue":"pink #F29CB2"} }, "costume_baseline": { "garment": "yellow raincoat", "footwear": "none", "logo_text": null }, "accessories": ["round glasses"], "mechanics": { "mouth_states_allowed": ["closed","open-small","open-wide","tongue-out","teeth-showing"], "jaw_hinge_visibility": "hidden", "ear_flex": "none", "eye_gaze_rules": "camera unless specified" }, "forbidden_changes": [ "do not change eye color", "no new scars", "preserve fur pattern zones" ], "notes": "any extra lock-ins" } 2.2 Shot Spec (one per image to generate) { "character_id": "string", "angle": "front | left | right | back | 3q-left | 3q-right", "emotion": "neutral | happy | sad | angry | surprised | disgust | fear | smirk", "mouth_state": "closed | open-small | open-wide | tongue-out | teeth-showing", "lighting": "soft even studio", "background": "plain light gray", "framing": "waist-up | full-body | bust", "camera_height": "eye-level", "notes": "any per-shot nuance" } 2.3 QC Report (aggregated) { "batch_id": "string", "pass_rate": 0.0, "items": [ { "filename": "string", "shot_spec": { "angle":"front","emotion":"happy","mouth_state":"open-small" }, "scores": { "palette_lock": 0.97, "proportions_lock": 0.94, "accessories_present": 1.0, "angle_match": 0.92, "emotion_match": 0.88, "mouth_state_match": 0.95, "artifact_check": 0.90, "background_lock": 1.0 }, "status": "pass | auto-retry | fail", "notes": "what drifted", "retry_prompt_delta": "extra constraints if auto-retry" } ] } 3) Identifier checklist (what to capture up front) Anatomy & silhouette: head/torso ratio, limb lengths, ear shape, tail, overall silhouette cues. Color/palette: hex values for primary/secondary; eye/tongue/teeth color; fur zones. Materials: fur/felt/fabric types; reflectivity; stitching lines; plastic parts. Surface features: markings, freckles, scars, seams; fur direction; wear/tear. Face specifics: eye shape, eyelid line, lash presence; muzzle/snout form. Mouth/tongue/teeth: allowed states, tongue shape/length, tooth style. Clothing baseline: exact garments, fasteners, logos, patterns. Accessories: glasses, hats, jewelry; whether removable or always-on. Mechanics locks: what must never change (jaw hinge visibility, ear flexibility). Camera/lighting locks: even lighting, neutral background, camera height. Forbidden list: any “never change” items (colors, logos, scars). Oddities detector: hats/glasses/new stickers; damage; symmetry issues. 4) QC rules (pass/fail thresholds) Palette lock ≥ 0.95 cosine similarity in HSV/histogram space. Proportions lock: measured landmarks within ±3% of baseline ratios. Accessories present: binary presence, else fail. Angle match: head yaw/pitch/roll within target ±10° (3D estimate). Emotion match: classifier confidence ≥ 0.80 (or model self-report with rationale). Mouth state: classifier detects state requested; if mismatch → fail. Background lock: uniformity > 0.98; no props. Artifacts: no double pupils, extra fingers, melting edges; if found → auto-retry once. 5) File layout (simple and visual) C:\tools\Character-Pipeline\ 01_input\ reference\ (drop your reference image here) identifiers.seed.json (optional) 02_captions\ character.caption.json 03_specs\ identifiers.final.json shots.plan.json 04_generations\ <character_id>\angle=<X>\emotion=<Y>\mouth=<Z>\*.png 05_qc\ qc_report.json diffs\*.png 06_delivery\ review_sheet.md 6) Proof-of-concept (tiny batch) Character: 1 Angles: front, 3q-left Emotions: neutral, happy, angry Mouth: closed only Total: 6 images Success = QC pass rate ≥ 80% on first pass; no palette drift. 7) Copy-paste Python (baseline; uses OpenAI Vision + Image Gen) This is a starter. It won’t do pixel-level landmarking; it relies on the model for description and soft QC. Treat as a PoC. This needs verification. import os, json, uuid, pathlib from datetime import datetime from PIL import Image from openai import OpenAI BASE = r"C:\tools\Character-Pipeline" os.makedirs(BASE, exist_ok=True) client = OpenAI() # requires OPENAI_API_KEY in env def save_json(data, path): pathlib.Path(path).parent.mkdir(parents=True, exist_ok=True) with open(path, "w", encoding="utf-8") as f: json.dump(data, f, indent=2, ensure_ascii=False) def load_json(path, default=None): if os.path.exists(path): return json.load(open(path, "r", encoding="utf-8")) return default # -------- 1) Detailed Caption via Vision -------- def describe_reference(ref_path): prompt = ( "You are a meticulous character describer. " "Describe the character in the image for reproduction with image generation. " "Capture anatomy/silhouette, exact color palette (return hex swatches), materials, " "fur/skin patterns, eye details, mouth/teeth/tongue, clothing, accessories, seams/stitching, " "forbidden changes, and any oddities (hat, glasses, damage). " "Respond as strict JSON in the 'raw' field containing plain prose plus a 'palette' field with hexes." ) msg = [ {"role":"user","content":[ {"type":"input_text","text":prompt}, {"type":"input_image","image_url":f"file://{ref_path}"} ]} ] resp = client.chat.completions.create( model="gpt-4.1-mini", # fast vision-capable; adjust per pricing/perf messages=msg, temperature=0.2 ) text = resp.choices[0].message.content return {"raw": text} # -------- 2) Build Identifiers -------- def build_identifiers(caption_json, seed_identifiers=None): sys = ( "Merge the caption with the optional seed identifiers into the final Identifiers JSON " "using the schema given below. Fill hex colors from caption palette. Preserve any 'forbidden_changes'. " "Return ONLY the JSON object, no commentary." ) schema = load_json(os.path.join(BASE,"schema_identifiers.json")) or {} messages = [ {"role":"system","content":sys + "\nSCHEMA:\n" + json.dumps(schema)}, {"role":"user","content":json.dumps({ "caption": caption_json, "seed": seed_identifiers or {} })} ] resp = client.chat.completions.create( model="o3-mini", messages=messages, temperature=0 ) return json.loads(resp.choices[0].message.content) # -------- 3) Expand Shots -------- ANGLES = ["front","3q-left","3q-right","left","right","back"] EMOTIONS = ["neutral","happy","sad","angry","surprised","disgust","fear","smirk"] MOUTHS = ["closed","open-small","open-wide","tongue-out","teeth-showing"] def make_plan(char_id, angles, emotions, mouths): shots = [] for a in angles: for e in emotions: for m in mouths: shots.append({ "character_id": char_id, "angle": a, "emotion": e, "mouth_state": m, "lighting":"soft even studio", "background":"plain light gray", "framing":"bust", "camera_height":"eye-level", "notes":"lock palette and proportions; no background props" }) return {"shots": shots} # -------- 4) Generate One Shot -------- def generate_shot(ref_path, identifiers, shot, out_path): # Compose an instruction that locks the character guardrails = [ "Preserve exact colors and materials.", "Do not change eye color, fur pattern zones, or garment.", "Maintain proportions and silhouette.", "Plain light-gray background only." ] prompt = f""" Generate a clean studio image of the SAME character. Angle: {shot['angle']}. Emotion: {shot['emotion']}. Mouth: {shot['mouth_state']}. Framing: {shot['framing']}. Camera height: {shot['camera_height']}. Lighting: {shot['lighting']}. Background: {shot['background']}. Guardrails: {', '.join(guardrails)}. Character Identifiers (authoritative): {json.dumps(identifiers, ensure_ascii=False)} """ # Tool: Images generation (text + image input) img = client.images.generate( model="gpt-image-1", prompt=prompt, size="1024x1024", image[]=[{"image": f"file://{ref_path}"}] # image conditioning ) b64 = img.data[0].b64_json import base64 pathlib.Path(out_path).parent.mkdir(parents=True, exist_ok=True) with open(out_path, "wb") as f: f.write(base64.b64decode(b64)) # -------- 5) QC (model-aided) -------- def qc_image(image_path, identifiers, shot): q = { "task":"Evaluate if the image matches identifiers and shot spec. " "Score each category 0..1 and explain any drift briefly.", "identifiers":identifiers, "shot":shot } msg = [{"role":"user","content":[ {"type":"input_text","text":json.dumps(q)}, {"type":"input_image","image_url":f"file://{image_path}"} ]}] resp = client.chat.completions.create( model="gpt-4.1-mini", messages=msg, temperature=0 ) analysis = resp.choices[0].message.content # naive parse; in practice instruct strict JSON and json.loads return {"analysis": analysis} def main(): ref = os.path.join(BASE,"01_input","reference","ref.png") caption = describe_reference(ref) save_json(caption, os.path.join(BASE,"02_captions","character.caption.json")) seed = load_json(os.path.join(BASE,"01_input","identifiers.seed.json"), {}) # save schema locally for the merger step (optional) schema_path = os.path.join(BASE,"schema_identifiers.json") if not os.path.exists(schema_path): save_json({ "fields":"(this file is only to provide structure cues; optional in PoC)" }, schema_path) identifiers = build_identifiers(caption, seed) save_json(identifiers, os.path.join(BASE,"03_specs","identifiers.final.json")) char_id = identifiers.get("character_id","char-"+uuid.uuid4().hex[:8]) plan = make_plan(char_id, angles=["front","3q-left"], emotions=["neutral","happy","angry"], mouths=["closed"]) save_json(plan, os.path.join(BASE,"03_specs","shots.plan.json")) qc_items = [] for shot in plan["shots"]: out = os.path.join(BASE,"04_generations",char_id, f"angle={shot['angle']}",f"emotion={shot['emotion']}",f"mouth={shot['mouth_state']}", f"{char_id}_{shot['angle']}_{shot['emotion']}_{shot['mouth_state']}.png") generate_shot(ref, identifiers, shot, out) qc = qc_image(out, identifiers, shot) qc_items.append({"filename": out, "shot_spec": {"angle":shot["angle"],"emotion":shot["emotion"],"mouth_state":shot["mouth_state"]}, "scores":{}, "status":"review", "notes": qc["analysis"]}) save_json({"batch_id": uuid.uuid4().hex, "pass_rate": 0.0, "items": qc_items}, os.path.join(BASE,"05_qc","qc_report.json")) with open(os.path.join(BASE,"06_delivery","review_sheet.md"),"w",encoding="utf-8") as f: f.write("# Review\n\nSee qc_report.json and images.") if __name__ == "__main__": main() Notes: Replace models as needed. Model names and features evolve; confirm the current names in the docs before running. OpenAI Platform +1 Image generation with text + image inputs is documented under OpenAI Images/vision guides. OpenAI Platform +2 OpenAI Platform +2 8) n8n outline (if you prefer visual automation) Trigger: “New reference image dropped in 01_input/reference” Node 1 (OpenAI Chat): Vision caption → character.caption.json Node 2 (OpenAI Chat): Merge to identifiers.final.json (system prompt enforces schema) Node 3 (Function): Build shots.plan.json Node 4 (Loop): For each shot → Image Generate (OpenAI Images) Node 5 (OpenAI Chat Vision): QC each image; write qc_report.json Node 6 (If): Failures? → Auto-retry once with tightened prompt; else continue Node 7 (Write Binary Files): Save images; Node 8 (Markdown): review_sheet.md Node 9 (Notify): Send delivery folder link to you (Telegram/Email) 9) Limitations & failure points (and how to handle) Consistency drift (palette/proportions): Use stronger “do not change” guardrails; include the reference image on every call; generate in small batches; keep background neutral. Angle accuracy: The model may approximate. Add explicit yaw/pitch descriptors (“head turned ~30° to camera-left”). Emotion clarity: Add describers (“raised cheeks, crow’s feet for happy; brows down for angry”). Mouth mechanics: Explicitly name the tongue/teeth state; reiterate per prompt. Accessories vanishing: Re-assert “accessories present: X” early and late in prompt; fail if missing. Model updates: Model names/features change; confirm current names in the docs before running. OpenAI Platform 10) Quick start checklist (10 minutes) Create folders as in section 5. Put your reference image at 01_input/reference/ref.png. (Optional) Draft identifiers.seed.json with any known locks. Install openai, Pillow. Set OPENAI_API_KEY (you have a master .env—ensure it’s loaded). Run the script. Inspect 02_captions/character.caption.json. Review 03_specs/identifiers.final.json for sanity. Check 04_generations images. Open 05_qc/qc_report.json—skim the notes. If happy, use 06_delivery/review_sheet.md to approve or request fixes.

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/bermingham85/mcp-puppet-pipeline'

If you have feedback or need assistance with the MCP directory API, please join our Discord server