# Trigger App Crawl

`trigger_crawl`: Trigger a browser-agent crawl of a web app to populate its knowledge graph. Use it after shipping significant features or when onboarding a project, to provide context for evaluations. Localhost URLs are supported automatically.
## Instructions
Trigger a browser-agent crawl of a web app to build the project's knowledge graph. The crawl systematically explores pages, UI states, and navigation flows, then populates the backend's knowledge graph so future evaluations and tests have context about the app.
LOCALHOST SUPPORT: Pass any localhost URL (e.g. http://localhost:3000) and it Just Works. A secure tunnel is automatically created so the remote browser can reach your local dev server.
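As an illustration only (this is not the tool's actual implementation), the kind of pre-flight reachability check that makes localhost support safe can be sketched as a plain TCP probe against the dev-server port:

```typescript
// Hypothetical sketch: check whether anything is listening on
// 127.0.0.1:<port> before bothering to provision a tunnel.
import net from 'node:net';

function probeLocalPort(
  port: number,
  timeoutMs = 1500,
): Promise<{ reachable: boolean; code: string }> {
  return new Promise((resolve) => {
    const socket = net.connect({ host: '127.0.0.1', port });
    const done = (reachable: boolean, code: string) => {
      socket.destroy();
      resolve({ reachable, code });
    };
    // No response within the budget counts as unreachable.
    socket.setTimeout(timeoutMs, () => done(false, 'TIMEOUT'));
    socket.once('connect', () => done(true, 'OK'));
    // e.g. ECONNREFUSED when no dev server is running on that port.
    socket.once('error', (err: NodeJS.ErrnoException) => done(false, err.code ?? 'ERROR'));
  });
}
```

If the probe fails, the tool can fail fast with a clear "start your dev server" message instead of provisioning a tunnel that will never carry traffic.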
WHEN TO USE: after a significant new feature, a new environment, or when onboarding a project. NOT for per-change verification — use check_app_in_browser for that.
SCOPE: one crawl per call against one URL. The crawl is long-running (minutes to tens of minutes depending on app size) and populates backend state asynchronously; the tool returns the execution status once the workflow completes. This does NOT return pass/fail — it returns executionId + status + outcome.
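For orientation, a completed result might look roughly like the object below. All values are invented; the field names follow the executionId/status/outcome shape described above, plus the crawl-summary and knowledge-graph blocks the handler attaches when the corresponding workflow nodes are present.

```typescript
// Illustrative response shape only — every value here is made up.
const exampleResult = {
  executionId: '4f8d2c1e-0000-0000-0000-000000000000',
  status: 'completed',
  targetUrl: 'http://localhost:3000',
  durationMs: 412_000,
  outcome: 'success',
  crawlSummary: {
    pagesDiscovered: 24,
    actionsExecuted: 118,
    stepsTaken: 131,
    transitionsRecorded: 97,
    knowledgeGraphStates: 42,
    success: true,
  },
  knowledgeGraph: {
    imported: true,
    skipped: false,
    edgesImported: 97,
    statesImported: 42,
  },
};
```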
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| url | Yes | URL to crawl. Can be any public URL or a localhost/local dev server URL. For localhost URLs, a secure tunnel is automatically created — just make sure your dev server is running on that port. | |
| projectUuid | No | UUID of the project whose knowledge graph the crawl should populate. Auto-detected from the current git repo if omitted. | |
| environmentId | No | UUID of a specific environment to use for the crawl. See available environments in the tool description above. | |
| credentialId | No | UUID of a specific credential for authenticated crawls. See available credentials in the tool description above. | |
| credentialRole | No | Pick a credential by role (e.g. 'admin', 'guest') from the resolved environment. | |
| username | No | A real, existing account email for the target app. Do NOT invent credentials — use one from the available credentials or ask the user. | |
| password | No | The real password for the username above. Do NOT guess. | |
| headless | No | Run the browser in headless mode. Defaults to backend configuration. | |
| timeoutSeconds | No | Maximum wall-time the crawl may run, in seconds (1..1800). Backend enforces this per workflow execution. | |
| repoName | No | GitHub repository name (e.g. 'my-org/my-repo'). Auto-detected from the current git repo — only provide this to run against a different project. | |
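As illustrative examples only (URLs and values invented), a minimal public crawl and an authenticated localhost crawl could be requested with inputs like:

```typescript
// Only `url` is required; everything else is optional.
const publicCrawl = { url: 'https://staging.example.com' };

// Localhost URLs are tunneled automatically — just have the dev server running.
const localAuthenticatedCrawl = {
  url: 'http://localhost:3000',
  credentialRole: 'admin', // pick a credential by role from the resolved environment
  timeoutSeconds: 900,     // cap the crawl at 15 minutes (max 1800)
  headless: true,
};
```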
## Implementation Reference
- handlers/triggerCrawlHandler.ts:53-346 (handler): Main handler function for the trigger_crawl tool. Implements the 4-step pattern: find template, provision tunnel if localhost, execute workflow, poll for result. Handles localhost URLs with automatic tunnel provisioning, crawl workflow execution via the DebuggAI API, polling with transient-error retry, and response formatting with crawl metrics and knowledge graph import results.

```typescript
export async function triggerCrawlHandler(
  input: TriggerCrawlInput,
  context: ToolContext,
  rawProgressCallback?: ProgressCallback,
): Promise<ToolResponse> {
  const startTime = Date.now();
  logger.toolStart('trigger_crawl', input);

  // Bead 0bq: progress circuit-breaker — see testPageChangesHandler for rationale.
  let progressDisabled = false;
  const progressCallback: ProgressCallback | undefined = rawProgressCallback
    ? async (update) => {
        if (progressDisabled) return;
        try {
          await rawProgressCallback(update);
        } catch (err) {
          progressDisabled = true;
          logger.warn('Progress emission failed; disabling further emissions for this request', {
            error: err instanceof Error ? err.message : String(err),
          });
        }
      }
    : undefined;

  const client = new DebuggAIServerClient(config.api.key);
  await client.init();

  const originalUrl = resolveTargetUrl(input);
  let ctx = buildContext(originalUrl);
  let keyId: string | undefined;

  const abortController = new AbortController();
  const onStdinClose = () => {
    abortController.abort();
    progressDisabled = true;
  };
  process.stdin.once('close', onStdinClose);

  try {
    // --- Tunnel: reuse existing or provision a fresh one ---
    if (ctx.isLocalhost) {
      // Bead 1om: pre-flight local port probe BEFORE provision/ngrok/backend.
      const localPort = extractLocalhostPort(ctx.originalUrl);
      if (typeof localPort === 'number') {
        const probe = await probeLocalPort(localPort);
        if (!probe.reachable) {
          const payload = {
            error: 'LocalServerUnreachable',
            message: `No server listening on 127.0.0.1:${localPort}. Start your dev server on that port before running trigger_crawl. Probe result: ${probe.code} (${probe.detail ?? 'no detail'}).`,
            detail: { port: localPort, probeCode: probe.code, probeDetail: probe.detail, elapsedMs: probe.elapsedMs },
          };
          logger.warn(`Pre-flight port probe failed for ${ctx.originalUrl}: ${probe.code} in ${probe.elapsedMs}ms`);
          return { content: [{ type: 'text', text: JSON.stringify(payload, null, 2) }], isError: true };
        }
      }

      if (progressCallback) {
        await progressCallback({ progress: 1, total: 4, message: 'Provisioning secure tunnel for localhost...' });
      }

      const reused = findExistingTunnel(ctx);
      if (reused) {
        ctx = reused;
      } else {
        let tunnel;
        try {
          tunnel = await client.tunnels!.provisionWithRetry();
        } catch (provisionError) {
          const msg = provisionError instanceof Error ? provisionError.message : String(provisionError);
          const diag = provisionError instanceof TunnelProvisionError ? ` ${provisionError.diagnosticSuffix()}` : '';
          throw new Error(
            `Failed to provision tunnel for ${ctx.originalUrl}. ` +
            `The remote browser needs a secure tunnel to reach your local dev server. ` +
            `(Detail: ${msg})${diag}`,
          );
        }
        keyId = tunnel.keyId;
        ctx = await ensureTunnel(
          ctx,
          tunnel.tunnelKey,
          tunnel.tunnelId,
          tunnel.keyId,
          () => client.revokeNgrokKey(tunnel.keyId),
        );
      }

      // Bead 1om: post-tunnel health check — verify traffic actually flows.
      if (ctx.targetUrl) {
        const health = await probeTunnelHealth(ctx.targetUrl);
        if (!health.healthy) {
          const payload = {
            error: 'TunnelTrafficBlocked',
            message: `Tunnel was established but traffic isn't reaching the dev server. ${health.detail ?? ''} Common causes: dev server binds to 0.0.0.0 or ::1 but not 127.0.0.1; dev server crashed; firewall.`,
            detail: {
              code: health.code,
              status: health.status,
              ngrokErrorCode: health.ngrokErrorCode,
              elapsedMs: health.elapsedMs,
            },
          };
          logger.warn(`Tunnel health probe failed for ${ctx.targetUrl}: ${health.code} ${health.ngrokErrorCode ?? ''} in ${health.elapsedMs}ms`);
          if (ctx.tunnelId) {
            tunnelManager.stopTunnel(ctx.tunnelId).catch((err) =>
              logger.warn(`Failed to stop broken tunnel ${ctx.tunnelId}: ${err}`),
            );
          }
          keyId = undefined;
          return { content: [{ type: 'text', text: JSON.stringify(payload, null, 2) }], isError: true };
        }
      }
    }

    // --- Find the crawl workflow template (cached across calls) ---
    if (progressCallback) {
      await progressCallback({ progress: 2, total: 4, message: 'Locating crawl workflow template...' });
    }
    const templateUuid = await getCachedTemplateUuid(TEMPLATE_KEYWORD, async (name) => {
      return client.workflows!.findTemplateByName(name);
    });
    if (!templateUuid) {
      throw new Error(
        `Raw Crawl Workflow Template not found. ` +
        `Ensure the backend has a template matching "${TEMPLATE_KEYWORD}" seeded and accessible.`,
      );
    }

    // --- Build contextData + env ---
    const contextData: Record<string, any> = {
      targetUrl: ctx.targetUrl ?? ctx.originalUrl,
    };
    if (input.projectUuid) contextData.projectId = input.projectUuid;
    if (typeof input.headless === 'boolean') contextData.headless = input.headless;
    if (typeof input.timeoutSeconds === 'number') contextData.timeoutSeconds = input.timeoutSeconds;

    const env: Record<string, any> = {};
    if (input.environmentId) env.environmentId = input.environmentId;
    if (input.credentialId) env.credentialId = input.credentialId;
    if (input.credentialRole) env.credentialRole = input.credentialRole;
    if (input.username) env.username = input.username;
    if (input.password) env.password = input.password;

    // --- Execute ---
    if (progressCallback) {
      await progressCallback({ progress: 3, total: 4, message: 'Queuing crawl execution...' });
    }

    // --- Execute + Poll (with bounded retry on transient errors, bead kbo9) ---
    const TERMINAL_STATUSES = new Set(['completed', 'failed', 'cancelled']);
    const MAX_RETRIES = getMaxTransientRetries();
    let executeResponse: import('../services/workflows.js').WorkflowExecuteResponse | undefined;
    let executionUuid = '';
    let finalExecution: import('../services/workflows.js').WorkflowExecution | undefined;
    let attempt = 0;

    while (true) {
      attempt++;
      if (attempt > 1) {
        Telemetry.capture(TelemetryEvents.WORKFLOW_TRANSIENT_RETRY, {
          tool: 'trigger_crawl',
          attempt,
          reason: transientReasonTag(finalExecution),
          previousExecutionId: executionUuid,
          previousErrorMessage: finalExecution?.errorMessage?.slice(0, 200),
          previousStateError: finalExecution?.state?.error?.slice(0, 200),
        });
        if (progressCallback) {
          await progressCallback({
            progress: 3,
            total: 4,
            message: `Transient backend error — retrying crawl (attempt ${attempt}/${MAX_RETRIES + 1})...`,
          });
        }
        await new Promise(r => setTimeout(r, 1000 * (attempt - 1)));
      }

      executeResponse = await client.workflows!.executeWorkflow(
        templateUuid,
        contextData,
        Object.keys(env).length > 0 ? env : undefined,
      );
      executionUuid = executeResponse.executionUuid;
      logger.info(`Crawl execution queued: ${executionUuid}${attempt > 1 ? ` (retry ${attempt - 1}/${MAX_RETRIES})` : ''}`);

      // --- Poll ---
      // Bead 0bq: emit the final progress (4/4 "Complete:...") INSIDE onUpdate
      // when terminal status detected, so there's no post-resolve emission that
      // could race the response and cause stale-progressToken transport tear-down.
      finalExecution = await client.workflows!.pollExecution(executionUuid, async (exec) => {
        if (ctx.tunnelId) touchTunnelById(ctx.tunnelId);
        if (!progressCallback) return;
        const nodeCount = (exec.nodeExecutions ?? []).length;
        if (TERMINAL_STATUSES.has(exec.status)) {
          await progressCallback({
            progress: 4,
            total: 4,
            message: `Crawl ${exec.status} (${nodeCount} nodes)`,
          });
          return;
        }
        await progressCallback({
          progress: 4,
          total: 4,
          message: `Crawl ${exec.status} (${nodeCount} nodes)`,
        });
      }, abortController.signal);

      if (attempt > MAX_RETRIES) break;
      if (!isTransientWorkflowError(finalExecution)) break;
      logger.warn(
        `Transient backend error detected on crawl (${transientReasonTag(finalExecution) ?? 'unknown'}) — ` +
        `retrying (attempt ${attempt + 1}/${MAX_RETRIES + 1})`,
      );
    }

    const duration = Date.now() - startTime;
    const nodes = finalExecution.nodeExecutions ?? [];

    // --- Format response ---
    const responsePayload: Record<string, any> = {
      executionId: executionUuid,
      status: finalExecution.status,
      targetUrl: ctx.originalUrl,
      durationMs: finalExecution.durationMs ?? duration,
    };
    const outcome = finalExecution.state?.outcome;
    if (outcome !== undefined && outcome !== null) responsePayload.outcome = outcome;
    if (finalExecution.errorMessage) responsePayload.errorMessage = finalExecution.errorMessage;
    if (finalExecution.errorInfo?.failedNodeId) responsePayload.failedNode = finalExecution.errorInfo.failedNodeId;
    if (executeResponse.resolvedEnvironmentId) responsePayload.resolvedEnvironmentId = executeResponse.resolvedEnvironmentId;
    if (executeResponse.resolvedCredentialId) responsePayload.resolvedCredentialId = executeResponse.resolvedCredentialId;

    // Backend release 2026-04-25: browser_session block on execution detail.
    // Crawl runs through the same browser pipeline, so the field is populated
    // here too. Pass through verbatim (presigned S3 URLs).
    if (finalExecution.browserSession) {
      responsePayload.browserSession = finalExecution.browserSession;
    }

    // Extract crawl metrics from surfer.crawl node (absent in older graph shapes)
    const crawlNode = nodes.find(n => n.nodeType === 'surfer.crawl');
    if (crawlNode?.outputData) {
      const d = crawlNode.outputData;
      responsePayload.crawlSummary = {
        pagesDiscovered: d.pagesDiscovered,
        actionsExecuted: d.actionsExecuted,
        stepsTaken: d.stepsTaken,
        transitionsRecorded: d.transitionsRecorded,
        knowledgeGraphStates: d.knowledgeGraphStates,
        success: d.success,
        ...(d.error ? { error: d.error } : {}),
      };
    }

    // Extract KG import result from knowledge_graph.import node (absent in older graph shapes)
    const kgNode = nodes.find(n => n.nodeType === 'knowledge_graph.import');
    if (kgNode?.outputData) {
      const d = kgNode.outputData;
      responsePayload.knowledgeGraph = {
        imported: !d.skipped,
        skipped: d.skipped ?? false,
        reason: d.reason ?? '',
        edgesImported: d.edgesImported ?? 0,
        statesImported: d.statesImported ?? 0,
        knowledgeGraphId: d.knowledgeGraphId ?? '',
        ...(Array.isArray(d.importErrors) && d.importErrors.length > 0 ? { importErrors: d.importErrors } : {}),
      };
    }

    logger.toolComplete('trigger_crawl', duration);

    // Bead 0bq: final progress is emitted INSIDE pollExecution's onUpdate when
    // terminal status is detected. Emitting it here would race the response
    // and could cause stale-progressToken transport tear-down.
    const sanitizedPayload = sanitizeResponseUrls(responsePayload, ctx);
    return {
      content: [{ type: 'text', text: JSON.stringify(sanitizedPayload, null, 2) }],
    };
  } catch (error) {
    const duration = Date.now() - startTime;
    logger.toolError('trigger_crawl', error as Error, duration);
    if (error instanceof Error && (error.message.includes('not found') || error.message.includes('401'))) {
      invalidateTemplateCache();
    }
    throw handleExternalServiceError(error, 'DebuggAI', 'crawl execution');
  } finally {
    process.stdin.removeListener('close', onStdinClose);
    // Tunnel intentionally NOT torn down (reuse path per bead vwd).
    // If tunnel creation failed after key provision, revoke the orphaned key.
    if (!ctx.tunnelId && keyId) {
      client.revokeNgrokKey(keyId).catch(err =>
        logger.warn(`Failed to revoke unused ngrok key ${keyId}: ${err}`),
      );
    }
  }
}
```

- types/index.ts:29-45 (schema): Zod validation schema and TypeScript type for TriggerCrawlInput. Validates url (required, auto-normalized), projectUuid, environmentId, credentialId, credentialRole, username, password, headless, timeoutSeconds (max 1800), and repoName. Uses .strict() to reject unknown fields.
```typescript
export const TriggerCrawlInputSchema = z.object({
  url: z.preprocess(
    normalizeUrl,
    z.string().url('Invalid URL. Pass a full URL like "http://localhost:3000" or "https://example.com". Localhost URLs are auto-tunneled to the remote browser.'),
  ),
  projectUuid: z.string().uuid().optional(),
  environmentId: z.string().uuid().optional(),
  credentialId: z.string().uuid().optional(),
  credentialRole: z.string().optional(),
  username: z.string().optional(),
  password: z.string().optional(),
  headless: z.boolean().optional(),
  timeoutSeconds: z.number().int().positive().max(1800, 'timeoutSeconds cannot exceed 1800 (30 min)').optional(),
  repoName: z.string().optional(),
}).strict();

export type TriggerCrawlInput = z.infer<typeof TriggerCrawlInputSchema>;
```

- tools/index.ts:24-58 (registration): Tool registration in initTools(). buildTriggerCrawlTool(ctx) registers the plain Tool for listTools, and buildValidatedTriggerCrawlTool(ctx) registers the validated tool with schema+handler in the toolRegistry map under the name 'trigger_crawl'.
```typescript
export function initTools(ctx: ProjectContext | null): void {
  const tools: Tool[] = [
    buildTestPageChangesTool(ctx),
    buildTriggerCrawlTool(ctx),
    buildProbePageTool(),
    buildSearchProjectsTool(),
    buildSearchEnvironmentsTool(),
    buildCreateEnvironmentTool(),
    buildUpdateEnvironmentTool(),
    buildDeleteEnvironmentTool(),
    buildUpdateProjectTool(),
    buildDeleteProjectTool(),
    buildSearchExecutionsTool(),
    buildCreateProjectTool(),
  ];

  const validated: ValidatedTool[] = [
    buildValidatedTestPageChangesTool(ctx),
    buildValidatedTriggerCrawlTool(ctx),
    buildValidatedProbePageTool(),
    buildValidatedSearchProjectsTool(),
    buildValidatedSearchEnvironmentsTool(),
    buildValidatedCreateEnvironmentTool(),
    buildValidatedUpdateEnvironmentTool(),
    buildValidatedDeleteEnvironmentTool(),
    buildValidatedUpdateProjectTool(),
    buildValidatedDeleteProjectTool(),
    buildValidatedSearchExecutionsTool(),
    buildValidatedCreateProjectTool(),
  ];

  _tools = tools;
  _validatedTools = validated;

  toolRegistry.clear();
  for (const v of validated) toolRegistry.set(v.name, v);
```

- tools/triggerCrawl.ts:45-98 (helper): Tool definition builder that constructs the Tool object with name 'trigger_crawl', title 'Trigger App Crawl', description (enriched with project environments/credentials at startup), and input schema with all parameter descriptions.
```typescript
export function buildTriggerCrawlTool(ctx: ProjectContext | null): Tool {
  return {
    name: 'trigger_crawl',
    title: 'Trigger App Crawl',
    description: buildTriggerCrawlDescription(ctx),
    inputSchema: {
      type: 'object',
      properties: {
        url: {
          type: 'string',
          description: 'URL to crawl. Can be any public URL or a localhost/local dev server URL. For localhost URLs, a secure tunnel is automatically created — just make sure your dev server is running on that port.',
        },
        projectUuid: {
          type: 'string',
          description: 'UUID of the project whose knowledge graph the crawl should populate. Auto-detected from the current git repo if omitted.',
        },
        environmentId: {
          type: 'string',
          description: 'UUID of a specific environment to use for the crawl. See available environments in the tool description above.',
        },
        credentialId: {
          type: 'string',
          description: 'UUID of a specific credential for authenticated crawls. See available credentials in the tool description above.',
        },
        credentialRole: {
          type: 'string',
          description: "Pick a credential by role (e.g. 'admin', 'guest') from the resolved environment.",
        },
        username: {
          type: 'string',
          description: 'A real, existing account email for the target app. Do NOT invent credentials — use one from the available credentials or ask the user.',
        },
        password: {
          type: 'string',
          description: 'The real password for the username above. Do NOT guess.',
        },
        headless: {
          type: 'boolean',
          description: 'Run the browser in headless mode. Defaults to backend configuration.',
        },
        timeoutSeconds: {
          type: 'number',
          description: 'Maximum wall-time the crawl may run, in seconds (1..1800). Backend enforces this per workflow execution.',
        },
        repoName: {
          type: 'string',
          description: "GitHub repository name (e.g. 'my-org/my-repo'). Auto-detected from the current git repo — only provide this to run against a different project.",
        },
      },
      required: ['url'],
      additionalProperties: false,
    },
  };
}
```
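The handler's bounded transient-retry loop (re-execute the workflow on transient backend errors, sleeping 1000 * (attempt - 1) ms before each retry, for at most MAX_RETRIES retries) can be distilled into a standalone sketch. `withTransientRetry` is a hypothetical helper written for illustration, not part of the codebase; the base delay is parameterized here only so the sketch is testable.

```typescript
// Sketch of the handler's retry policy: run an async attempt, retry while
// the result looks transient, with a linearly growing backoff. Mirrors the
// loop in triggerCrawlHandler (at most maxRetries + 1 attempts total).
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withTransientRetry<T>(
  run: () => Promise<T>,
  isTransient: (result: T) => boolean,
  maxRetries: number,
  baseDelayMs = 1000, // the handler uses a fixed 1000ms base
): Promise<T> {
  let attempt = 0;
  while (true) {
    attempt++;
    // Linear backoff before every retry: 0, base, 2*base, ...
    if (attempt > 1) await sleep(baseDelayMs * (attempt - 1));
    const result = await run();
    if (attempt > maxRetries) return result; // retry budget exhausted
    if (!isTransient(result)) return result; // terminal success or failure
  }
}
```

Keeping the terminal-status check separate from the retry budget means a genuinely failed (non-transient) execution is returned immediately rather than re-queued.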