# Feature Plan: Selective URL Consolidation

**Objective:** Implement a two-step process where users first discover URLs and then selectively choose which URLs to crawl and consolidate into a single file per job.

**Workflow:**

1. **Discovery:** User enters a URL and depth, clicks "Discover".
   * Backend (`/api/discover` -> `discover_pages`) finds all reachable internal URLs within the specified depth.
   * Backend updates job status with discovered URLs, marking them as `pending_crawl`.
   * **Important:** Backend `discover_pages` function will *not* fetch content or perform consolidation during this phase.
   * Frontend (`CrawlStatusMonitor`) polls for job status and displays the list of discovered URLs once the status is `discovery_complete`.
2. **Selection:**
   * Frontend displays checkboxes next to each discovered URL in the `CrawlStatusMonitor` (or a similar component).
   * Frontend provides a "Select All" checkbox.
   * Frontend enables a "Crawl Selected" button when at least one URL is checked.
3. **Selective Crawl & Consolidation:**
   * User clicks "Crawl Selected".
   * Frontend sends the `jobId` and the list of selected URLs to the backend (`/api/crawl`).
   * Backend (`/api/crawl` -> `crawl_pages`) receives the request.
   * Backend `crawl_pages` function iterates *only* through the selected URLs:
     * Updates the URL status to `crawling`.
     * Calls the `crawl4ai` service to fetch content for the URL.
     * Polls for the result.
     * If successful, appends the fetched markdown content to the job's consolidated file (`storage/markdown/<root_url_filename>.md`).
     * Updates the job's consolidated metadata file (`storage/markdown/<root_url_filename>.json`).
     * Updates the URL status to `completed`.
     * If fetching/processing fails, updates the URL status to `crawl_error`.
   * Backend updates the overall job status (`completed` or `completed_with_errors`) when all selected URLs are processed.
4. **Display Results:**
   * Frontend implements a `ConsolidatedFiles` component (similar to the provided image).
   * This component appears when a job is finished.
   * It reads the metadata (`.json`) for the job's consolidated file.
   * It displays details: filename (derived from root URL), number of pages consolidated, total size, last updated time.
   * It provides buttons to view the raw markdown and JSON metadata. (Download functionality can be added later.)
   * Frontend implements a statistics display area showing "Subdomains Parsed", "Pages Crawled" (updated after selective crawl), "Data Extracted" (updated after selective crawl), and "Errors Encountered".
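From the frontend's point of view, the workflow can be summarised roughly as follows. This is a minimal sketch: the endpoint paths and the `jobId`/`pages` payload come from this plan, while the response field names (`jobId`, `urls`) and function names are assumptions for illustration only.

```typescript
// Rough sketch of the two-step flow. Endpoint paths and the { jobId, pages }
// payload follow the plan; response field names are assumptions.
interface DiscoveredPage {
  url: string;
  status: string; // e.g. 'pending_crawl' after discovery
}

async function discoverThenCrawl(rootUrl: string, depth: number, pickUrls: (urls: string[]) => string[]) {
  // Step 1: discovery only -- the backend finds URLs but fetches no content.
  const discoverRes = await fetch('/api/discover', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: rootUrl, depth }),
  });
  const { jobId, urls } = await discoverRes.json() as { jobId: string; urls: DiscoveredPage[] };

  // Step 2: the user selects a subset of the discovered URLs.
  const selected = pickUrls(urls.map(u => u.url));

  // Step 3: crawl only the selected URLs; the backend appends each result
  // to the job's consolidated markdown file and updates its .json metadata.
  const pages: DiscoveredPage[] = selected.map(url => ({ url, status: 'pending_crawl' }));
  await fetch('/api/crawl', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ jobId, pages }),
  });
}
```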
---

**Implementation Steps & Subtasks:**

**1. Backend Refinement (`backend/app/crawler.py`)**

* **Modify `discover_pages` function:**
  * [x] Remove `requests.post` call to `/crawl` endpoint (lines ~221-230).
    > Changes: `backend/app/crawler.py`, Lines [185-213] (Removed call and related logic)
  * [x] Remove polling loop (`for attempt in range(max_attempts):`) (lines ~249-373).
    > Changes: `backend/app/crawler.py`, Lines [185-213] (Removed call and related logic)
  * [x] Remove file writing logic (saving `.md` and `.json`) (lines ~270-358).
    > Changes: `backend/app/crawler.py`, Lines [215-219] (Removed placeholder comment block), Line [233] (Removed syntax error)
  * [x] Ensure URL status is updated to `pending_crawl` after successful discovery (modify line ~267).
    > Changes: `backend/app/crawler.py`, Line [212] (Status update confirmed)
  * [x] Ensure overall status is updated to `discovery_complete` at the end of the root call (line ~460).
    > Changes: `backend/app/crawler.py`, Line [312] (Status update confirmed)
  * [x] Add logging to confirm content fetching is skipped.
    > Changes: `backend/app/crawler.py`, Line [188] (Added log message)
* **Modify `crawl_pages` function:**
  * [x] Verify function iterates *only* over URLs passed in the `pages` argument (check loop starting line ~357).
    > Changes: `backend/app/crawler.py`, Line [357] (Verification only - OK)
  * [x] Confirm `requests.post` call to `/crawl` is present and correct (lines ~388-396).
    > Changes: `backend/app/crawler.py`, Lines [388-396] (Verification only - OK)
  * [x] Confirm polling loop is present and correct (lines ~402-527).
    > Changes: `backend/app/crawler.py`, Lines [402-527] (Verification only - OK)
  * [x] Confirm logic for appending to `.md` file exists and uses `root_task_id` (lines ~450-471).
    > Changes: `backend/app/crawler.py`, Lines [450-471] (Verification only - OK)
  * [x] Confirm logic for updating `.json` metadata exists and uses `root_task_id` (lines ~473-510).
    > Changes: `backend/app/crawler.py`, Lines [473-510] (Verification only - OK)
  * [x] Confirm URL status updates (`crawling`, `completed`, `crawl_error`) are correct (lines ~364, ~572, ~519, ~532, ~582, ~588, ~596).
    > Changes: `backend/app/crawler.py`, Lines [...] (Verification only - OK)
  * [x] Confirm overall job status update (`completed` or `completed_with_errors`) is correct (lines ~615-616).
    > Changes: `backend/app/crawler.py`, Lines [615-616] (Verification only - OK)
  * [x] Add robust `try...except` blocks around file I/O operations (appending `.md`, writing `.json`) with specific logging.
    > Changes: `backend/app/crawler.py`, Lines [467-471], [477-484], [507-512] (Added try/except blocks)
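For reference before the frontend steps, the status values this plan relies on could be modeled as a small set of union types. This is a sketch only: the status strings come from the plan, while the type names and the exact shape of `urls` are assumptions.

```typescript
// Per-URL and overall job statuses referenced throughout this plan.
// The string literals come from the plan; the names below are illustrative.
type UrlStatus =
  | 'pending_crawl'    // set by discover_pages after discovery
  | 'crawling'         // set by crawl_pages while fetching a selected URL
  | 'completed'        // content appended to the consolidated .md file
  | 'crawl_error'      // fetching/processing failed during the crawl phase
  | 'discovery_error'; // discovery failed for this URL

type JobPhase =
  | 'discovery_complete'      // discovery finished, waiting for selection
  | 'completed'               // all selected URLs processed successfully
  | 'completed_with_errors';  // finished, but some URLs hit crawl_error

interface JobStatus {
  overall_status: JobPhase;
  urls: Record<string, UrlStatus>; // discovered URL -> current status (shape assumed)
}
```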
**2. Frontend UI (`components/CrawlStatusMonitor.tsx`)**

* [x] Add state variable for selected URLs (e.g., `useState<Set<string>>(new Set())`).
  > Changes: `components/CrawlStatusMonitor.tsx`, Line [27]
* [x] Add state variable for "Select All" checkbox status (e.g., `useState<boolean>(false)`).
  > Changes: `components/CrawlStatusMonitor.tsx`, Line [28]
* [x] Modify the URL list rendering (`.map` function around line ~175):
  * [x] Add a checkbox input before each URL. Its `checked` state should depend on the selected URLs state. Its `onChange` should call `handleCheckboxChange`.
    > Changes: `components/CrawlStatusMonitor.tsx`, Lines [249-255]
  * [x] Conditionally render checkboxes only when `status.overall_status === 'discovery_complete'`.
    > Changes: `components/CrawlStatusMonitor.tsx`, Line [248] (Uses `isCrawlable` flag)
* [x] Add a "Select All" checkbox above the URL list. Its `checked` state depends on the select all state. Its `onChange` should call `handleSelectAllChange`. Conditionally render only when `status.overall_status === 'discovery_complete'`.
  > Changes: `components/CrawlStatusMonitor.tsx`, Lines [218-230]
* [x] Add a "Crawl Selected (`selectedCount`/`totalCount`)" button below the URL list. It should be disabled if `selectedCount === 0`. Its `onClick` should call `handleCrawlSelectedClick`. Conditionally render only when `status.overall_status === 'discovery_complete'`.
  > Changes: `components/CrawlStatusMonitor.tsx`, Lines [218-238]

**3. Frontend Logic (`components/CrawlStatusMonitor.tsx`)**

* [x] Implement `handleCheckboxChange(url: string)` function: Update the selected URLs state (add/remove URL). Update "Select All" state based on whether all URLs are now selected.
  > Changes: `components/CrawlStatusMonitor.tsx`, Lines [113-127]
* [x] Implement `handleSelectAllChange(isChecked: boolean)` function: Update selected URLs state (add all or clear all). Update "Select All" state.
  > Changes: `components/CrawlStatusMonitor.tsx`, Lines [129-138]
* [x] Implement `handleCrawlSelectedClick()` function:
  * [x] Get the array of selected URLs from the state.
    > Changes: `components/CrawlStatusMonitor.tsx`, Line [154]
  * [x] Get the current `jobId`.
    > Changes: `components/CrawlStatusMonitor.tsx`, Line [141]
  * [x] Call `crawlPages({ jobId, pages: selectedUrlsAsDiscoveredPageObjects })` from `lib/crawl-service.ts`. (Note: May need to format URLs into `DiscoveredPage` objects if required by the service function.)
    > Changes: `components/CrawlStatusMonitor.tsx`, Lines [154-158], [163]
  * [x] Add basic logging or toast notification for crawl initiation request.
    > Changes: `components/CrawlStatusMonitor.tsx`, Lines [146-150], [166-170], [179-183]
  * [x] Consider disabling the button after clicking to prevent multiple submissions.
    > Changes: `components/CrawlStatusMonitor.tsx`, Lines [145], [185], [233], [236]
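The selection logic in Section 3, combined with the state from Section 2, could look roughly like the sketch below. The handler names, the `Set`-based state, and the `crawlPages({ jobId, pages })` call follow the checklist; the import path, hook name, and `DiscoveredPage` shape are assumptions.

```typescript
import { useState } from 'react';
import { crawlPages } from '@/lib/crawl-service'; // import path assumed

interface DiscoveredPage { url: string; status: string; }

// discoveredUrls and jobId would come from the polled job status.
export function useUrlSelection(discoveredUrls: string[], jobId: string | null) {
  const [selectedUrls, setSelectedUrls] = useState<Set<string>>(new Set());
  const [selectAll, setSelectAll] = useState<boolean>(false);
  const [isSubmitting, setIsSubmitting] = useState<boolean>(false);

  // Toggle a single URL and keep the "Select All" checkbox consistent.
  const handleCheckboxChange = (url: string) => {
    const next = new Set(selectedUrls);
    if (next.has(url)) {
      next.delete(url);
    } else {
      next.add(url);
    }
    setSelectedUrls(next);
    setSelectAll(next.size === discoveredUrls.length);
  };

  // Select every discovered URL or clear the selection.
  const handleSelectAllChange = (isChecked: boolean) => {
    setSelectedUrls(isChecked ? new Set(discoveredUrls) : new Set());
    setSelectAll(isChecked);
  };

  // Start the selective crawl for the checked URLs only.
  const handleCrawlSelectedClick = async () => {
    if (!jobId || selectedUrls.size === 0) return;
    setIsSubmitting(true); // disable the button to prevent double submission
    try {
      const pages: DiscoveredPage[] = Array.from(selectedUrls).map(url => ({ url, status: 'pending_crawl' }));
      await crawlPages({ jobId, pages });
    } finally {
      setIsSubmitting(false);
    }
  };

  return { selectedUrls, selectAll, isSubmitting, handleCheckboxChange, handleSelectAllChange, handleCrawlSelectedClick };
}
```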
**4. Consolidated File Display Component (`components/ConsolidatedFiles.tsx`)**

* [x] Create the new file `components/ConsolidatedFiles.tsx`.
  > Changes: New file `components/ConsolidatedFiles.tsx`
* [x] Define component props (e.g., `jobId: string | null`, `rootUrl: string | null`, `jobStatus: OverallStatus | null`).
  > Changes: `components/ConsolidatedFiles.tsx`, Lines [19-23]
* [x] Add state for metadata (e.g., `useState<Metadata | null>(null)`).
  > Changes: `components/ConsolidatedFiles.tsx`, Line [25]
* [x] Add `useEffect` hook that triggers when `jobStatus` is `completed` or `completed_with_errors`.
  > Changes: `components/ConsolidatedFiles.tsx`, Lines [32-70]
* [x] Inside `useEffect`:
  * [x] Check if `jobId` and `rootUrl` are available.
    > Changes: `components/ConsolidatedFiles.tsx`, Line [34]
  * [x] Derive the expected filename using a helper function (needs `url_to_filename` logic ported or called via API).
    > Changes: `components/ConsolidatedFiles.tsx`, Line [44] (Imported from `lib/utils.ts`)
  * [x] Fetch the `.json` metadata content using `/api/storage/file-content?file_path=<filename>.json`.
    > Changes: `components/ConsolidatedFiles.tsx`, Lines [48-51]
  * [x] Parse the JSON and update the metadata state.
    > Changes: `components/ConsolidatedFiles.tsx`, Lines [58-63]
  * [x] Handle fetch errors.
    > Changes: `components/ConsolidatedFiles.tsx`, Lines [52-56], [64-67]
* [x] Render the component structure based on the provided image.
  > Changes: `components/ConsolidatedFiles.tsx`, Lines [118-168]
* [x] Display data from the metadata state (filename/project name, pages count, size, last updated). Size might need separate calculation or be added to metadata.
  > Changes: `components/ConsolidatedFiles.tsx`, Lines [110-113], [136-146]
* [x] Implement view buttons linking to `/api/storage/file-content?file_path=<filename>.md` and `.json`.
  > Changes: `components/ConsolidatedFiles.tsx`, Lines [114-116], [148-154]

**5. Integrate `ConsolidatedFiles` (`app/page.tsx`)**

* [x] Import the `ConsolidatedFiles` component.
  > Changes: `app/page.tsx`, Line [15]
* [x] Pass necessary props (`jobId`, `rootUrl`, `jobStatus`) from the main page state to `ConsolidatedFiles`.
  > Changes: `app/page.tsx`, Lines [393-395]
* [x] Conditionally render `ConsolidatedFiles` when a job is finished (`completed` or `completed_with_errors`).
  > Changes: `app/page.tsx`, Line [392] (Implicit via props)

**6. Stats Display Component**

* [x] Identify or create a component for the stats display (e.g., `components/JobStatsSummary.tsx`).
  > Changes: New file `components/JobStatsSummary.tsx`
* [x] Pass `jobStatus` and potentially the consolidated file metadata as props.
  > Changes: `components/JobStatsSummary.tsx`, Lines [15-18]
* [x] Display "Subdomains Parsed" (count of URLs in `jobStatus.urls`).
  > Changes: `components/JobStatsSummary.tsx`, Lines [25], [41-46]
* [x] Display "Pages Crawled" (count of URLs with status `completed` in `jobStatus.urls` *after* crawl).
  > Changes: `components/JobStatsSummary.tsx`, Lines [26], [49-55]
* [x] Display "Data Extracted" (fetch size from metadata or calculate from `.md` file).
  > Changes: `components/JobStatsSummary.tsx`, Lines [31], [58-64] (Placeholder 'N/A' used)
* [x] Display "Errors Encountered" (count of URLs with `discovery_error` or `crawl_error` status).
  > Changes: `components/JobStatsSummary.tsx`, Lines [27], [67-73]
* [x] Integrate this component into `app/page.tsx`.
  > Changes: `app/page.tsx`, Lines [6], [347] (Replaced `ProcessingBlock`)

---
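The figures for the Section 6 stats display can be derived directly from the polled job status. Below is a sketch assuming the illustrative `JobStatus` shape introduced earlier; "Data Extracted" stays a placeholder until the size is read from the consolidated `.json` metadata, matching the 'N/A' note above.

```typescript
// Derive the stats summary from the job status (urls shape assumed as above).
function summarizeJob(urls: Record<string, string>) {
  const statuses = Object.values(urls);
  return {
    subdomainsParsed: statuses.length,                            // all discovered URLs
    pagesCrawled: statuses.filter(s => s === 'completed').length, // crawled successfully
    errorsEncountered: statuses.filter(s => s === 'crawl_error' || s === 'discovery_error').length,
    dataExtracted: 'N/A', // fill in from the consolidated .json metadata once available
  };
}
```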
