# Model Conformance Testing Plan for Python Plugins
> **Status:** Infrastructure Complete (P0–P2 done, P3 pending manual validation)
> **Date:** 2026-02-06 (updated)
> **Owner:** Python Genkit Team
> **Scope:** Phase 1 covers google-genai, anthropic, and compat-oai (OpenAI).
> All 10 plugins have entry points and specs (Phase 2 plugins included early).
---
## Problem Statement
The Genkit CLI provides a `genkit dev:test-model` command
([genkit-tools/cli/src/commands/dev-test-model.ts][dev-test-model]) that runs
standardized conformance tests against model providers. This command already
works cross-runtime (JS and Python) via the reflection API, but we have no
Python-side conformance test specs, entry points, or automation to exercise it.
We need to:
1. Verify that Python model provider plugins produce correct responses for the
same test cases used by JS plugins.
2. Establish a repeatable, per-plugin conformance testing workflow.
3. Identify and close feature parity gaps between Python and JS plugins.
[dev-test-model]: https://github.com/firebase/genkit/blob/main/genkit-tools/cli/src/commands/dev-test-model.ts
---
## Architecture
```
py/bin/test-model-conformance
|
v
genkit dev:test-model --from-file spec.yaml
|
discovers runtime
|
v
Reflection Server (:3100)
|
/api/runAction
|
v
Plugin: GoogleAI / Anthropic / etc.
^
|
conformance_entry.py
```
**How it works:**
1. A lightweight Python entry point (`conformance_entry.py`) initializes Genkit
with a single plugin and starts the reflection server.
2. The `genkit dev:test-model` JS CLI discovers the running Python runtime via
`.genkit/runtimes/*.json` discovery files.
3. The CLI sends standardized test requests to models through `POST /api/runAction`.
4. Responses are validated using built-in validators (tool calling, structured
output, multimodal, streaming, etc.).
---
## Cross-Runtime Feature Parity Analysis
### Plugins with JS Counterparts
| Plugin | JS Location | JS Models | Python Models | Parity | Gaps in Python | Python Extras |
|--------|-------------|-----------|---------------|--------|----------------|---------------|
| **google-genai** | In-repo `js/plugins/google-genai/` | 24 (Gemini, TTS, Gemini-Image, Gemma, Imagen, Veo) | 23+ (same families) | **Partial** | Imagen under `googleai/` prefix (only registered under `vertexai/`) | More legacy Gemini preview versions |
| **anthropic** | In-repo `js/plugins/anthropic/` | 8 (Claude 3-haiku through opus-4-5) | 8 (identical list and capabilities) | **Full** | None | None |
| **compat-oai** | In-repo `js/plugins/compat-oai/` | 49 (30 chat, 2 image gen, 3 TTS, 3 STT, 3 embed, 2 DeepSeek, 6 xAI) | 30+ (22+ chat, 2 image gen, 3 TTS, 3 STT, 3 embed) | **Full** | Vision (gpt-4-vision*), gpt-4-32k (older models) | DeepSeek/xAI split into dedicated plugins |
| **ollama** | In-repo `js/plugins/ollama/` | Dynamic discovery | Dynamic discovery | **Full** | Cosmetic: JS declares `media=true`, `toolChoice=true`; Python omits | Python declares `output=['text','json']` |
| **amazon-bedrock** | External [aws-bedrock-js-plugin][bedrock-js] | ~35 (Amazon, Claude 2-3.7, Cohere, Mistral, AI21, Llama) | 50+ (all JS models included) | **Python superset** | None | DeepSeek, Gemma, NVIDIA, Qwen, Writer, Moonshot, newer Claude 4.x |
| **microsoft-foundry** | External [azure-foundry-js-plugin][foundry-js] | ~32 chat + DALL-E + TTS + Whisper + embed | 30+ chat + embed + dynamic catalog | **Partial** | DALL-E image gen, TTS, Whisper STT | Claude, DeepSeek, Grok, Llama, Mistral; dynamic Azure catalog (11k+ models) |
| **deepseek** | JS: in `compat-oai` as `deepseek/` prefix | 2 (deepseek-chat, deepseek-reasoner) | 4 (+ deepseek-v3, deepseek-r1) | **Python superset** | None | 2 additional models |
| **xai** | JS: in `compat-oai` as `xai/` prefix | 6 (grok-3 family, grok-2-vision, grok-2-image) | 6 (grok-3 family, grok-4, grok-2-vision) | **Partial** | Image gen (grok-2-image-1212) | grok-4 (newer model) |
[bedrock-js]: https://github.com/genkit-ai/aws-bedrock-js-plugin
[foundry-js]: https://github.com/genkit-ai/azure-foundry-js-plugin
### Python-Only Plugins (no JS counterpart)
| Plugin | Models | Notes |
|--------|--------|-------|
| **mistral** | 30+ (Large 3, Medium 3.1, Small 3.2, Ministral 3, Magistral, Codestral, Devstral, Voxtral, Pixtral, Embed) | No JS plugin exists. PR #4485: embeddings + streaming fix. PR #4486: full capability update. |
| **huggingface** | 10+ popular models + any HF model ID | No JS plugin exists |
| **cloudflare-workers-ai** | 15+ (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek) | No JS plugin exists |
### Gaps Summary (Ordered by Priority)
| Priority | Plugin | Gap | Impact | Fix Effort |
|----------|--------|-----|--------|------------|
| **HIGH** | google-genai | Imagen under `googleai/` prefix | Blocks spec symlink for conformance tests | Low (~20 lines in `google.py`) |
| ~~MEDIUM~~ | compat-oai | ~~Image gen (dall-e-3, gpt-image-1)~~ | ✅ Done (PR #4477) | -- |
| ~~MEDIUM~~ | compat-oai | ~~TTS (tts-1, tts-1-hd, gpt-4o-mini-tts)~~ | ✅ Done (PR #4477) | -- |
| ~~MEDIUM~~ | compat-oai | ~~STT (whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe)~~ | ✅ Done (PR #4477) | -- |
| **MEDIUM** | microsoft-foundry | DALL-E, TTS, Whisper | Mirrors compat-oai gaps | Medium |
| **LOW** | xai | Image gen (grok-2-image-1212) | Single model missing | Medium (new handler) |
| **LOW** | compat-oai | Vision models (gpt-4-vision*), gpt-4-32k | Older models, multimodal works via gpt-4o | Low (add model defs) |
| **LOW** | ollama | `media`, `toolChoice` metadata | Cosmetic only, no functional impact | Trivial |
---
## Dependency Graph
All tasks for Phase 1 and their dependency relationships:
```
DEPENDENCY GRAPH
================
+-----------------+ +-----------------+
| fix-imagen-gap | | setup-dir |
| (P0) | | (P0) |
+----+-------+----+ +--+---------+--+-+
| | | | |
| +----+---------------+ | |
| | | | |
+----v--v-+ +----v-----------+ +-----v--+ +-----v--------+
| symlink | | entry- | | spec- | | spec- |
| gemini | | google-genai | | anthr. | | compat-oai |
| (P1) | | (P1) | | (P1) | | (P1) |
+----+----+ +-------+--------+ +---+----+ +-----+--------+
| | | |
+-------+-------+-------+-------+--------------+
|
+----v-----------+
| runner-script |
| (P2) |
+----+-----------+
|
+----v-----------+
| validate- |
| google-genai |
| (P3) |
+----------------+
```
**Edge list (A -> B means "A must complete before B can start"):**
- `fix-imagen-gap` -> `symlink-gemini-spec`
- `fix-imagen-gap` -> `entry-google-genai`
- `setup-dir` -> `symlink-gemini-spec`
- `setup-dir` -> `entry-google-genai`
- `setup-dir` -> `spec-anthropic`
- `setup-dir` -> `spec-compat-oai`
- `symlink-gemini-spec` -> `runner-script`
- `entry-google-genai` -> `runner-script`
- `spec-anthropic` -> `runner-script`
- `spec-compat-oai` -> `runner-script`
- `runner-script` -> `validate-google-genai`
---
## Phased Execution Plan (Reverse Topological Order)
Execute each phase to completion before starting the next. **All tasks within a
phase are independent and should run in parallel** for fastest completion.
**Critical path:** `fix-imagen-gap` -> `symlink-gemini-spec` -> `runner-script`
-> `validate-google-genai`
### Phase 0: Leaves ✅ COMPLETE
| Task | Description | File(s) | Effort | Status |
|------|-------------|---------|--------|--------|
| `fix-imagen-gap` | GoogleAI already registers Imagen under `googleai/` (verified in code) | `google.py` lines 378-380, 523-527, 596-601 | N/A | ✅ Already done |
| `setup-dir` | Created `py/tests/conformance/` with dirs for all 10 plugins | `py/tests/conformance/{google-genai,anthropic,compat-oai,...}/` | Trivial | ✅ Done |
**Parallelizable:** Yes, both tasks are independent.
### Phase 1: Specs + Entry Points ✅ COMPLETE
| Task | Description | Depends On | File(s) | Status |
|------|-------------|------------|---------|--------|
| `symlink-gemini-spec` | Symlinked JS spec into conformance dir | P0 | `google-genai/model-conformance.yaml` → JS spec | ✅ Done |
| `entry-google-genai` | Minimal google-genai entry point | P0 | `google-genai/conformance_entry.py` | ✅ Done |
| `spec-anthropic` | Anthropic entry point + YAML spec | P0 | `anthropic/{conformance_entry.py,model-conformance.yaml}` | ✅ Done |
| `spec-compat-oai` | compat-oai entry point + YAML spec (gpt-4o, gpt-4o-mini, dall-e-3, tts-1) | P0 | `compat-oai/{conformance_entry.py,model-conformance.yaml}` | ✅ Done (updated with multimodal, PR #4477) |
**Note:** All 10 plugins (including Phase 2 plugins) have entry points and specs.
### Phase 2: Orchestration ✅ COMPLETE
| Task | Description | Depends On | File(s) | Status |
|------|-------------|------------|---------|--------|
| `runner-script` | Shell script to orchestrate per-plugin conformance test runs | All Phase 1 tasks | `py/bin/test-model-conformance` | ✅ Done |
### Phase 3: Validation ⏳ PENDING
| Task | Description | Depends On | File(s) | Status |
|------|-------------|------------|---------|--------|
| `validate-google-genai` | Manual end-to-end validation with live API via `genkit dev:test-model` | `runner-script` | -- (manual run) | ⏳ Not yet run |
### Execution Timeline
```
TIME -->
==========================================================================
P0: [fix-imagen-gap ~~~~~~~~~~~~] [setup-dir ~~~]
(parallel) (parallel)
|
--- all P0 complete ----------------+--------
|
P1: [symlink-gemini-spec ~] [entry-google-genai ~]
[spec-anthropic ~~~~~~] [spec-compat-oai ~~~~]
(all 4 in parallel)
|
--- all P1 complete ---
|
P2: [runner-script ~~~~~~~~~~~~]
|
P3: [validate-google-genai ~~~~]
|
=== PHASE 1 SCOPE COMPLETE ===
```
---
## What To Build
### Prerequisite: Fix Imagen Gap in Python google-genai Plugin
The JS plugin supports Imagen under the `googleai/` prefix but the Python plugin
only registers it under `vertexai/`. The `ImagenModel` class is already
client-agnostic (uses `client.aio.models.generate_images()` which works for
both); only the registration code needs updating.
**File:** `py/plugins/google-genai/src/genkit/plugins/google_genai/google.py`
**Changes (~20 lines):**
1. **`GoogleAI.init()`** -- Add Imagen model loop after Gemini registration:
```python
for name in genai_models.imagen:
actions.append(self._resolve_model(googleai_name(name)))
```
2. **`GoogleAI._resolve_model()`** -- Add Imagen detection branch (mirror
VertexAI logic):
```python
if clean_name.lower().startswith('imagen'):
model_ref = vertexai_image_model_info(clean_name)
model = ImagenModel(clean_name, self._client)
IMAGE_SUPPORTED_MODELS[clean_name] = model_ref
config_schema = ImagenConfigSchema
# ... create and return Action
```
3. **`GoogleAI.list_actions()`** -- Include Imagen in discovered actions list:
```python
for name in genai_models.imagen:
actions_list.append(
model_action_metadata(
name=googleai_name(name),
info=vertexai_image_model_info(name).model_dump(by_alias=True),
config_schema=ImagenConfigSchema,
)
)
```
### Directory Layout
All conformance testing files live under `py/tests/conformance/` to avoid
disturbing other runtimes:
```
py/tests/conformance/
google-genai/
conformance_entry.py # minimal Genkit entry point
model-conformance.yaml -> symlink # -> js/plugins/google-genai/tests/model-tests-tts.yaml
anthropic/
conformance_entry.py
model-conformance.yaml # anthropic-specific spec
compat-oai/
conformance_entry.py
model-conformance.yaml # openai-specific spec
py/bin/
test-model-conformance # orchestrator shell script
```
### Entry Point Template
Each plugin gets a minimal Python script that initializes Genkit with just that
plugin. The reflection server starts automatically in dev mode (`GENKIT_ENV=dev`,
set by `genkit start`).
```python
"""Minimal entry point for model conformance testing via genkit dev:test-model."""
import asyncio
from genkit.ai import Genkit
from genkit.plugins.google_genai import GoogleAI # varies per plugin
ai = Genkit(plugins=[GoogleAI()])
async def main():
while True:
await asyncio.sleep(3600)
if __name__ == '__main__':
ai.run_main(main())
```
### Spec Files
**google-genai:** Symlink to the JS spec file so both runtimes test the same
models with the same expectations:
```bash
# From py/tests/conformance/google-genai/
ln -s "$(git rev-parse --show-toplevel)/js/plugins/google-genai/tests/model-tests-tts.yaml" model-conformance.yaml
```
The JS spec tests:
- `googleai/imagen-4.0-generate-001` (output-image)
- `googleai/gemini-2.5-flash-preview-tts` (custom TTS test)
- `googleai/gemini-2.5-pro` (tool-request, structured-output, multiturn, system-role, image-base64, image-url, video-youtube)
- `googleai/gemini-3-pro-preview` (same + reasoning, streaming, tool-response custom tests)
- `googleai/gemini-2.5-flash` (same as gemini-2.5-pro)
Env: `GEMINI_API_KEY`
**anthropic:** New spec. Models: `anthropic/claude-sonnet-4` and
`anthropic/claude-haiku-4-5`. Tests: tool-request, multiturn, system-role,
input-image-base64, input-image-url, streaming-multiturn, streaming-tool-request.
Haiku-4-5 adds structured-output and streaming-structured-output.
Env: `ANTHROPIC_API_KEY`
**compat-oai (OpenAI):** New spec. Models: `openai/gpt-4o` and
`openai/gpt-4o-mini`. Tests: tool-request, structured-output, multiturn,
system-role, input-image-base64, input-image-url, streaming-multiturn,
streaming-tool-request, streaming-structured-output.
Env: `OPENAI_API_KEY`
### Test Runner Script
**Location:** `py/bin/test-model-conformance`
```bash
#!/usr/bin/env bash
# Usage:
# py/bin/test-model-conformance google-genai # test one plugin
# py/bin/test-model-conformance --all # test all plugins
```
The script:
- Accepts a plugin name (or `--all`) as argument
- Validates the required env vars are set for that plugin
- Runs: `genkit dev:test-model --from-file <spec> -- uv run <entry_point>`
> **Note:** [`uv`](https://docs.astral.sh/uv/) is the project's standard Python
> package manager and task runner, already used throughout the repository (see
> `py/pyproject.toml` workspace configuration and `py/bin/` scripts). It is
> installed as part of the developer setup via `bin/setup`.
- `dev:test-model` handles process lifecycle (start, wait for runtime, run
tests, shut down)
- Reports aggregate pass/fail and exits non-zero on failure
### Built-in Test Capabilities
The following test types are available from `dev:test-model` (from
[dev-test-model.ts lines 254-476][dev-test-model]):
| Test | Description |
|------|-------------|
| `tool-request` | Tool/function calling conformance |
| `structured-output` | JSON schema output |
| `multiturn` | Multi-turn conversation |
| `streaming-multiturn` | Streaming + multiturn |
| `streaming-tool-request` | Streaming tool calls |
| `streaming-structured-output` | Streaming structured output |
| `system-role` | System message handling |
| `input-image-base64` | Base64 image input |
| `input-image-url` | URL image input |
| `input-video-youtube` | YouTube video input |
| `output-audio` | TTS/audio output |
| `output-image` | Image generation |
### Built-in Validators
`has-tool-request[:toolName]`, `valid-json`, `text-includes:expected`,
`text-starts-with:prefix`, `text-not-empty`, `valid-media:type`, `reasoning`,
plus streaming variants (`stream-text-includes`, `stream-has-tool-request`,
`stream-valid-json`).
---
## Phase 2 (Future -- after Phase 1 validated)
Add conformance specs for remaining plugins. The parity analysis above informs
which capabilities to test per plugin:
| Plugin | Test Capabilities | Notes |
|--------|-------------------|-------|
| **mistral** | tool-request, structured-output, multiturn, system-role, streaming-multiturn, input-image-base64, input-image-url | All Large 3/Medium 3.1/Small 3.2/Ministral 3/Magistral support vision. Voxtral adds audio input. |
| **deepseek** | tool-request, structured-output, multiturn, system-role, streaming-multiturn | |
| **xai** | tool-request, structured-output, multiturn, system-role, streaming-multiturn | grok-2-vision adds input-image |
| **ollama** | tool-request, structured-output, multiturn, system-role | Depends on locally installed model |
| **amazon-bedrock** | tool-request, structured-output, multiturn, system-role, streaming-multiturn, input-image-base64 | Model-dependent |
| **huggingface** | tool-request, structured-output, multiturn, system-role | Model-dependent |
| **microsoft-foundry** | tool-request, structured-output, multiturn, system-role, streaming-multiturn, input-image-base64 | Model-dependent |
| **cloudflare-workers-ai** | tool-request, structured-output, multiturn, system-role | Model-dependent |
---
## CI Integration Notes
- These are **live API tests** -- they call real model endpoints. Do NOT run in
standard CI.
- Gate behind manual trigger or CI label (e.g., `run-conformance-tests`).
- Each plugin requires its own API key/credentials.
- Consider a `--dry-run` mode in the runner script that validates spec files
parse correctly without making API calls.
---
## Effort Estimates
| Phase | Tasks | Effort | Parallelizable |
|-------|-------|--------|----------------|
| **P0** | 2 tasks (fix-imagen-gap, setup-dir) | ~1 hour | Yes |
| **P1** | 4 tasks (symlink, entry, 2 specs) | ~2 hours | Yes |
| **P2** | 1 task (runner script) | ~1 hour | No |
| **P3** | 1 task (E2E validation) | ~1 hour | No |
| **Total** | 8 tasks | ~3-5 hours (with parallelism) | |