Libragen

---
title: Building Libraries
description: Create optimized RAG libraries from your documentation
section: Guides
order: 3
---

This guide covers advanced options for building high-quality RAG libraries.

## Basic Build

```bash
libragen build ./docs --name my-docs
```

This processes all markdown, text, and HTML files, generating embeddings and a full-text search index.

## Supported File Types

| Extension | Format | Chunking |
|-----------|--------|----------|
| `.ts`, `.tsx`, `.mts`, `.cts` | TypeScript | AST-aware |
| `.js`, `.jsx`, `.mjs`, `.cjs` | JavaScript | AST-aware |
| `.py`, `.pyi` | Python | AST-aware |
| `.rs` | Rust | AST-aware |
| `.go` | Go | AST-aware |
| `.java` | Java | AST-aware |
| `.md`, `.mdx` | Markdown | Text-based |
| `.txt` | Plain text | Text-based |
| `.html` | HTML (text extracted) | Text-based |

Customize with `--include`:

```bash
libragen build ./docs --name my-docs --include "**/*.md" "**/*.rst" "**/*.txt"
```

## AST-Aware Code Chunking

For supported code files (TypeScript, JavaScript, Python, Rust, Go, Java), libragen uses AST-aware chunking by default.
This produces higher-quality embeddings by:

- **Respecting language boundaries** - Chunks align with functions, classes, and methods instead of arbitrary character positions
- **Including semantic context** - Each chunk includes scope chain, imports, and sibling entities
- **Contextualizing for embeddings** - A special contextualized text is generated that includes file path, scope, and signatures

### Context Modes

Control how much context is included with each chunk:

```bash
# Full context (default) - includes scope chain, imports, siblings, signatures
libragen build ./src --name my-code --context-mode full

# Minimal context - only scope chain and entity signatures
libragen build ./src --name my-code --context-mode minimal

# No context - raw code only
libragen build ./src --name my-code --context-mode none
```

### Disabling AST Chunking

If you prefer text-based chunking for code files:

```bash
libragen build ./src --name my-code --no-ast-chunking
```

Non-code files (Markdown, JSON, text, etc.) always use text-based chunking regardless of this setting.

## Text-Based Chunking

Documents are split into chunks for indexing. The chunking parameters affect search quality. The defaults work well for most cases, but you can tune them for specific content types.

### When to Adjust Chunk Settings

**Use defaults** (chunk-size: 1000, overlap: 100) when:

- You have typical documentation with mixed content types
- You're not sure what settings to use
- Your docs have clear headings and sections

**Use smaller chunks** (256-384) when:

- Content is dense with distinct concepts (API references, glossaries)
- Users ask very specific questions ("What does X parameter do?")
- Each paragraph covers a different topic

**Use larger chunks** (768-1024) when:

- Content is narrative or tutorial-style
- Context matters more than precision (architecture guides, explanations)
- You want results to include more surrounding context

### Chunk Size

Target size in characters.
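To make the size and overlap parameters concrete, here is a minimal sketch of a fixed-window character chunker. It is not libragen's implementation (the guide does not show it, and the real chunker may respect sentence or heading boundaries). The `chunkText` function and its parameter names are hypothetical and only illustrate how a target size and an overlap interact.

```typescript
// Illustrative sketch only: a naive fixed-window chunker, NOT libragen's chunker.
// Sizes are measured in JavaScript string units (UTF-16 code units).
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError('chunkSize must be positive');
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('chunkOverlap must be in the range [0, chunkSize)');
  }

  // Each new window starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

// A 2500-character document with the default-style settings (1000 / 100)
// yields three chunks of 1000, 1000, and 700 characters.
const sampleChunks = chunkText('a'.repeat(2500), 1000, 100);
console.log(sampleChunks.map((chunk) => chunk.length));
```

Under these assumptions, a smaller `chunkSize` produces more, shorter chunks from the same text, which is the trade-off the flags in this section control.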
```bash
# Smaller chunks = more precise, more results
libragen build ./docs --name my-docs --chunk-size 256

# Larger chunks = more context per result
libragen build ./docs --name my-docs --chunk-size 1024
```

**By content type:**

| Content Type | Chunk Size | Why |
|--------------|------------|-----|
| API reference | 256-384 | Each function/method is self-contained |
| Configuration docs | 384-512 | Options are usually independent |
| Tutorials | 512-768 | Steps need some context |
| Architecture guides | 768-1024 | Concepts span multiple paragraphs |
| FAQs | 256-384 | Each Q&A is standalone |

### Chunk Overlap

Characters shared between adjacent chunks. Overlap helps when:

- Important context spans chunk boundaries
- You have long sentences or paragraphs
- Search queries might match text split across chunks

```bash
libragen build ./docs --name my-docs --chunk-overlap 100
```

**Guidelines:**

- **10% overlap** (default): Good for well-structured docs with clear sections
- **15-20% overlap**: Better for prose-heavy content or long paragraphs
- **25%+ overlap**: Use when search quality suffers from split context (rare)

Higher overlap increases library size and build time, so increase it only if needed.

## Versioning Content

Track documentation versions to match your releases:

```bash
libragen build ./docs \
  --name my-api \
  --content-version 2.1.0
```

Query specific versions later:

```bash
libragen query -l my-api --content-version 2.1.0 "authentication"
```

## Excluding Files

Skip files matching glob patterns:

```bash
libragen build ./docs \
  --name my-docs \
  --exclude "**/node_modules/**" \
  --exclude "**/drafts/**" \
  --exclude "**/*.test.md"
```

## Adding Metadata

Provide a description for better discoverability:

```bash
libragen build ./docs \
  --name react-docs \
  --description "Official React documentation including hooks, components, and API reference"
```

The description appears in `libragen list` and helps AI tools understand what the library contains.
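Returning to the overlap guidelines in the Chunk Overlap section, the cost of a higher overlap can be estimated with a little arithmetic. The sketch below assumes a plain sliding window over the text, which may not match how libragen actually splits documents. `overlapChars` and `estimateChunkCount` are hypothetical helpers, not part of the libragen API.

```typescript
// Illustrative sketch only: assumes a fixed sliding window, which may differ
// from libragen's real text chunker. Both helpers are hypothetical.

// Convert an overlap percentage into a character count for --chunk-overlap.
function overlapChars(chunkSize: number, overlapPercent: number): number {
  return Math.round(chunkSize * (overlapPercent / 100));
}

// Estimate how many chunks a corpus of `totalChars` characters produces.
function estimateChunkCount(totalChars: number, chunkSize: number, chunkOverlap: number): number {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  if (totalChars <= 0) return 0;
  if (totalChars <= chunkSize) return 1;
  const step = chunkSize - chunkOverlap;
  // One full first chunk, plus one chunk per additional step (rounded up).
  return 1 + Math.ceil((totalChars - chunkSize) / step);
}

// Compare the 10%, 20%, and 25% guidelines for 1,000,000 characters
// at a chunk size of 1000: 1111, 1250, and 1333 chunks respectively.
for (const percent of [10, 20, 25]) {
  const overlap = overlapChars(1000, percent);
  console.log(`${percent}% overlap (${overlap} chars):`, estimateChunkCount(1_000_000, 1000, overlap));
}
```

Under this simplified model, moving from 10% to 25% overlap adds roughly 20% more chunks to embed and store, which is why the guidelines suggest raising overlap only when search quality calls for it.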
## License Tracking

When building from git repositories, licenses are automatically detected from LICENSE files. You can also specify licenses explicitly:

```bash
# Explicit license
libragen build ./docs --name my-docs --license MIT

# Multiple licenses (dual licensing)
libragen build ./docs --name my-docs --license MIT Apache-2.0
```

View license information with:

```bash
libragen inspect my-docs.libragen
```

**Supported licenses:** MIT, Apache-2.0, GPL-3.0, GPL-2.0, BSD-3-Clause, BSD-2-Clause, ISC, Unlicense, and more.

## CI/CD Integration

Automate library builds in your pipeline. See the [CI Integration guide](/docs/ci-integration) for complete examples with GitHub Actions, GitLab CI, CircleCI, and Azure Pipelines.

## Output Location

By default, libraries are saved to the current directory. Specify a different location:

```bash
libragen build ./docs \
  --name my-docs \
  --output ~/.libragen/libraries/
```

## Performance Tips

### Large Documentation Sets

For very large doc sets (>10,000 files):

1. Use larger chunk sizes to reduce total chunks
2. Exclude non-essential files (changelogs, drafts)
3. Build incrementally if possible

### Optimizing for Search Quality

1. **Structure your docs** - Use clear headings and sections
2. **Front-load important content** - Key information at the start of sections
3. **Use consistent terminology** - Same terms across related docs
4. **Include examples** - Code examples improve retrieval for technical queries

## Programmatic Building

Use the `Builder` class from `@libragen/core` to build libraries programmatically:

```typescript
import { Builder } from '@libragen/core';

const builder = new Builder();

const result = await builder.build('./docs', {
  name: 'my-docs',
  version: '1.0.0',
  description: 'My documentation',
  chunkSize: 1000,
  chunkOverlap: 100,
});

console.log(`Built: ${result.outputPath}`);
console.log(`Chunks: ${result.stats.chunkCount}`);
console.log(`Time: ${result.stats.embedDuration}s`);
```

Build from git repositories:

```typescript
const result = await builder.build('https://github.com/user/repo', {
  gitRef: 'v2.0.0',
  include: ['docs/**/*.md'],
});

if (result.git) {
  console.log(`Commit: ${result.git.commitHash}`);
  console.log(`License: ${result.git.detectedLicense?.identifier}`);
}
```

Track progress during builds:

```typescript
await builder.build('./docs', { name: 'my-docs' }, (progress) => {
  console.log(`[${progress.progress}%] ${progress.phase}: ${progress.message}`);
});
```

### Custom Embedders

Use a custom embedding provider by implementing the `IEmbedder` interface:

```typescript
import { Builder } from '@libragen/core';
import type { IEmbedder } from '@libragen/core';

class OpenAIEmbedder implements IEmbedder {
  dimensions = 1536;

  async initialize() { /* setup */ }
  async embed(text: string) { /* call OpenAI */ }
  async embedBatch(texts: string[]) { /* batch call */ }
  async dispose() { /* cleanup */ }
}

const builder = new Builder({ embedder: new OpenAIEmbedder() });
const result = await builder.build('./docs', { name: 'my-docs' });
```

See the [API Reference](/docs/api#iembedder-interface) for the full interface specification.

## Need Help?

See the [Troubleshooting guide](/docs/troubleshooting) for solutions to common build issues like slow builds, memory errors, and poor search results.
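As a companion to the Custom Embedders section, the sketch below fills in the stubbed method bodies with a deterministic toy embedder that needs no network access, which can be handy for trying out a pipeline offline. It is a sketch under assumptions: `EmbedderLike` is a local stand-in that mirrors only the members shown in this guide, and its return types (`number[]`) are guesses. Check the API Reference for the real `IEmbedder` signatures before adapting it. `hashVector` and `HashEmbedder` are hypothetical names, and the vectors carry no semantic meaning.

```typescript
// ASSUMPTION: a local stand-in mirroring the members shown in this guide.
// The real IEmbedder in @libragen/core may declare different return types.
interface EmbedderLike {
  dimensions: number;
  initialize(): Promise<void>;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
  dispose(): Promise<void>;
}

// Hypothetical helper: folds character codes into a fixed-size vector and
// normalizes it to unit length. Deterministic, but NOT semantically meaningful.
function hashVector(text: string, dimensions: number): number[] {
  const vector = new Array<number>(dimensions).fill(0);
  for (let i = 0; i < text.length; i++) {
    vector[i % dimensions] += text.charCodeAt(i);
  }
  const norm = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vector : vector.map((v) => v / norm);
}

class HashEmbedder implements EmbedderLike {
  dimensions = 8;
  private ready = false;

  async initialize(): Promise<void> {
    this.ready = true; // a real provider would create its client here
  }

  async embed(text: string): Promise<number[]> {
    if (!this.ready) throw new Error('call initialize() before embed()');
    return hashVector(text, this.dimensions);
  }

  async embedBatch(texts: string[]): Promise<number[][]> {
    return Promise.all(texts.map((text) => this.embed(text)));
  }

  async dispose(): Promise<void> {
    this.ready = false; // a real provider would release resources here
  }
}

async function demo(): Promise<void> {
  const embedder = new HashEmbedder();
  await embedder.initialize();
  const vectors = await embedder.embedBatch(['hello', 'world']);
  console.log(vectors.length, vectors[0].length);
  await embedder.dispose();
}

demo().catch((error) => console.error(error));
```

Keeping the vector math in a plain function makes it easy to unit-test separately from the asynchronous lifecycle methods.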
