batch_ingest_embeddings
Process content files for batch embedding generation by extracting text from formats like CSV, JSON, TXT, and MD, then formatting as validated JSONL with embedded task types.
Instructions
Embeddings content ingestion: specialized ingestion for embeddings batch processing.

Workflow:
1. Analyzes the content structure.
2. Extracts the text to embed.
3. Formats the text as JSONL with the proper embedContent structure, including task_type.
4. Validates the format.

Optimized for: text extraction from various formats (CSV columns, JSON fields, TXT lines, MD sections).

Returns: a JSONL file ready for batch_create_embeddings, with task_type embedded in each request.
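The extract, format, and validate steps above can be sketched roughly as follows for the CSV case. This is a minimal illustration, not the tool's actual implementation: the helper names (`rows_to_jsonl`, `validate_jsonl`) are hypothetical, and the per-line layout (`key`, `request`, `content.parts[].text`) is an assumption about what an embedContent-style request might look like. Only the presence of `task_type` in each request comes from the description above.

```python
import csv
import io
import json

# Task types listed in the input schema for taskType.
ALLOWED_TASK_TYPES = {
    "RETRIEVAL_DOCUMENT", "SEMANTIC_SIMILARITY", "CLASSIFICATION",
    "CLUSTERING", "RETRIEVAL_QUERY", "CODE_RETRIEVAL_QUERY",
    "QUESTION_ANSWERING", "FACT_VERIFICATION",
}


def rows_to_jsonl(csv_text: str, text_field: str, task_type: str) -> str:
    """Extract one CSV column and emit one JSON request per non-empty row."""
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type}")
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for index, row in enumerate(reader):
        text = (row.get(text_field) or "").strip()
        if not text:
            continue  # skip rows with nothing to embed
        record = {
            "key": f"row-{index}",  # assumed: a per-request identifier
            "request": {
                # assumed request layout; only task_type is confirmed
                "content": {"parts": [{"text": text}]},
                "task_type": task_type,
            },
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


def validate_jsonl(jsonl: str) -> int:
    """Check that every line parses and carries a known task_type."""
    count = 0
    for line in jsonl.splitlines():
        record = json.loads(line)  # raises ValueError on malformed JSON
        if record["request"]["task_type"] not in ALLOWED_TASK_TYPES:
            raise ValueError(f"bad task_type in {record.get('key')}")
        count += 1
    return count
```

Embedding the task type in every line, rather than once per file, keeps each request self-describing, which matches the description of the output as having task_type embedded in each request.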
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| inputFile | Yes | Path to content file | |
| outputFile | No | Optional output JSONL path | |
| textField | No | For CSV/JSON: field name containing text to embed (auto-detected if not provided) | |
| taskType | Yes | Embedding task type (RETRIEVAL_DOCUMENT, SEMANTIC_SIMILARITY, CLASSIFICATION, CLUSTERING, RETRIEVAL_QUERY, CODE_RETRIEVAL_QUERY, QUESTION_ANSWERING, FACT_VERIFICATION). Use batch_query_task_type if unsure. | |
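
As an illustration of the schema above, the following sketch assembles an arguments object and checks the two required fields before a call would be made. The `build_arguments` helper and the example file paths are hypothetical; only the parameter names and the list of task types come from the table.

```python
# Task types accepted by the taskType parameter, per the table above.
TASK_TYPES = (
    "RETRIEVAL_DOCUMENT", "SEMANTIC_SIMILARITY", "CLASSIFICATION",
    "CLUSTERING", "RETRIEVAL_QUERY", "CODE_RETRIEVAL_QUERY",
    "QUESTION_ANSWERING", "FACT_VERIFICATION",
)


def build_arguments(input_file, task_type, output_file=None, text_field=None):
    """Return an arguments dict, omitting optional fields that were not given."""
    if not input_file:
        raise ValueError("inputFile is required")
    if task_type not in TASK_TYPES:
        raise ValueError(f"taskType must be one of {', '.join(TASK_TYPES)}")
    arguments = {"inputFile": input_file, "taskType": task_type}
    if output_file is not None:
        arguments["outputFile"] = output_file
    if text_field is not None:
        arguments["textField"] = text_field  # otherwise left to auto-detection
    return arguments


# Example: embed the "abstract" column of a hypothetical CSV file.
example = build_arguments(
    "data/papers.csv", "RETRIEVAL_DOCUMENT", text_field="abstract"
)
```

Leaving `textField` out of the dict, rather than sending an empty value, lets the tool fall back to the auto-detection that the table describes.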