segment
Tokenize a list of marks in a given language using language-specific segmentation, returning tokens per mark for direct use in chapter creation.
Instructions
Tokenize a list of marks via cwseg (jieba/fugashi/kiwipiepy for CJK, regex for European languages). Always batch every mark in ONE call; per-mark calls fail intermittently under rate limits.
Returns {"tokens_per_mark": [["tok1","tok2",...], ...]}, parallel to marks. Feed this directly to cjk-glosser and back into create_chapter_from_marks(tokens_per_mark=...) so that cwbe and the agent see the same tokens.
Args:
language: Source language code (EN | FR | ES | DE | IT | PT | ZH | JA | KO).
marks: List of mark texts, all in the same language.
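The batching rule and the parallel-output contract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: build_segment_args, stand_in_segment, and check_parallel are hypothetical helpers, and stand_in_segment uses a naive regex split in place of the real tool, so it does not reproduce cwseg's actual segmentation.

```python
import re
from typing import Dict, List

# Language codes listed in the Args description above.
SUPPORTED = {"EN", "FR", "ES", "DE", "IT", "PT", "ZH", "JA", "KO"}


def build_segment_args(language: str, marks: List[str]) -> Dict[str, object]:
    """Build the arguments for ONE batched call (hypothetical helper)."""
    code = language.upper()
    if code not in SUPPORTED:
        raise ValueError(f"unsupported language code: {language}")
    # All marks go into a single request instead of one request per mark.
    return {"language": code, "marks": list(marks)}


def stand_in_segment(args: Dict[str, object]) -> Dict[str, List[List[str]]]:
    """Stand-in for the real tool: naive regex split, NOT cwseg's rules."""
    pattern = re.compile(r"\w+|[^\w\s]")
    return {"tokens_per_mark": [pattern.findall(m) for m in args["marks"]]}


def check_parallel(marks: List[str], result: Dict[str, List[List[str]]]) -> List[List[str]]:
    """Confirm the result has exactly one token list per input mark."""
    tokens = result["tokens_per_mark"]
    if len(tokens) != len(marks):
        raise ValueError("tokens_per_mark is not parallel to marks")
    return tokens


marks = ["Bonjour le monde", "Ça va ?"]
args = build_segment_args("fr", marks)
result = stand_in_segment(args)  # replace with the real segment tool call
tokens_per_mark = check_parallel(marks, result)
```

Checking the length before passing tokens_per_mark on catches a misaligned response early, before it reaches cjk-glosser or create_chapter_from_marks.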
Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
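| language | Yes | Source language code (EN, FR, ES, DE, IT, PT, ZH, JA, KO). | |
| marks | Yes | List of mark texts, all in the same language. | |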
Output Schema
| Name | Required | Description | Default |
|---|---|---|---|
| result | Yes | Object of the form {"tokens_per_mark": [["tok1","tok2",...], ...]}, parallel to marks. | |