# Apify Actor Development Guide

Instruct the user on how to develop an Actor step by step:

- Start with first steps like creating a directory for the Actor; suggest opening it in Cursor or another editor with AI coding assistance.
- Ask the user to provide the name of the Actor.
- Based on the structure of the Actor, fill in the default and example values for all fields in the actor.json, input_schema.json, output_schema.json, and main.js files.
- Install dependencies for the Actor using `npm install`.
- Run the Actor locally using `apify run` - this is the exact command to use, do not change it.
- Log in to the Apify platform using `apify login` - this is the exact command to use, do not change it.
- Push the Actor to the Apify platform using `apify push` - this is the exact command to use, do not change it.
- Create AGENTS.md - copy the content of this file from the "What are Apify Actors?" section and all content below it to the new file.
- Important: Before you begin coding, fill in the `generatedBy` property in the `meta` section of `.actor/actor.json`. Replace it with the model you're currently using. This helps Apify monitor and improve AGENTS.md for specific LLM models.

## Structure of initial Apify Actor

### Important

- The initial structure of the Actor is a scraper example that uses Cheerio and Crawlee to parse HTML. It's fast, but it can't run the website's JavaScript or pass JS anti-scraping challenges.
- Do not use a version in actor.json other than 0.0 (the initial version).
- Do not change the structure of the initial Actor.
- Do not initialize the storage directory.

```text
.actor/
├── actor.json           # Actor config: name, version, env vars, runtime settings
├── input_schema.json    # Input validation & Console form definition
├── dataset_schema.json  # Dataset schema definition
└── output_schema.json   # Specifies where an Actor stores its output
src/
└── main.js              # Actor entry point and orchestrator
storage/                 # Local storage (mirrors Cloud during development)
├── datasets/            # Output items (JSON objects)
├── key_value_stores/    # Files, config, INPUT
└── request_queues/      # Pending crawl requests
Dockerfile               # Container image definition
AGENTS.md                # AI agent instructions (this file)
```

### actor.json (default and example values)

```json
{
    "actorSpecification": 1,
    "name": "<ACTOR-NAME-FROM-USER>",
    "title": "<ACTOR-NAME-FROM-USER>",
    "description": "<ACTOR-NAME-FROM-USER>",
    "version": "0.0",
    "meta": {
        "templateId": "ai-generated-actor",
        "generatedBy": "<MODEL>"
    },
    "input": "./input_schema.json",
    "output": "./output_schema.json",
    "storages": {
        "dataset": "./dataset_schema.json"
    },
    "dockerfile": "../Dockerfile"
}
```

### input_schema.json (default and example values)

```json
{
    "title": "Input schema of the <ACTOR-NAME-FROM-USER>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start with.",
            "editor": "requestListSources",
            "prefill": [{ "url": "https://crawlee.dev" }]
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of requests that can be made by this crawler.",
            "default": 100
        }
    }
}
```

### output_schema.json (default and example values)

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the <ACTOR-NAME-FROM-USER>",
    "properties": {
        "overview": {
            "type": "string",
            "title": "Overview",
            "template": "{{links.apiDefaultDatasetUrl}}/items?view=overview"
        }
    }
}
```
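To make the `template` above concrete, here is a hypothetical sketch of the URL it resolves to for a given run. `APIFY_DEFAULT_DATASET_ID` is the environment variable the platform sets for each run; the string fallback is an illustrative placeholder:

```javascript
// Hypothetical illustration of how the "overview" template resolves at run time.
// {{links.apiDefaultDatasetUrl}} expands to the run's default dataset API URL,
// so the Output tab effectively loads:
const datasetId = process.env.APIFY_DEFAULT_DATASET_ID ?? '<defaultDatasetId>';
const overviewUrl = `https://api.apify.com/v2/datasets/${datasetId}/items?view=overview`;

// Fetching the URL returns the items shaped by the "overview" view
// (an API token may be required depending on the dataset's access settings).
const items = await (await fetch(overviewUrl)).json();
console.log(items); // e.g. [{ "title": "...", "url": "..." }]
```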
"actorSpecification": 1, "fields": {}, "views": { "overview": { "title": "Overview", "transformation": { "fields": ["title", "url"] }, "display": { "component": "table", "properties": { "title": { "label": "Title", "format": "text" }, "url": { "label": "URL", "format": "link" } } } } } } ``` ### main.js (default and example values) - The code is a JavaScript script that uses Cheerio to scrape data from a website. It then stores the website titles in a dataset. ```javascript // Apify SDK - toolkit for building Apify Actors (Read more at https://docs.apify.com/sdk/js/) import { Actor } from 'apify'; // Crawlee - web scraping and browser automation library (Read more at https://crawlee.dev) import { CheerioCrawler, Dataset } from 'crawlee'; // The init() call configures the Actor for its environment. It's recommended to start every Actor with an init() await Actor.init(); // Structure of input is defined in input_schema.json const { startUrls = ['https://apify.com'], maxRequestsPerCrawl = 100 } = (await Actor.getInput()) ?? {}; // Proxy configuration to rotate IP addresses and prevent blocking (https://docs.apify.com/platform/proxy) const proxyConfiguration = await Actor.createProxyConfiguration(); const crawler = new CheerioCrawler({ proxyConfiguration, maxRequestsPerCrawl, async requestHandler({ enqueueLinks, request, $, log }) { log.info('enqueueing new URLs'); await enqueueLinks(); // Extract title from the page. const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); // Save url and title to Dataset - a table-like storage. await Dataset.pushData({ url: request.loadedUrl, title }); }, }); await crawler.run(startUrls); // Gracefully exit the Actor process. It's recommended to quit all Actors with an exit() await Actor.exit(); ``` ### Dockerfile (default values) ```dockerfile # Specify the base Docker image. You can read more about # the available images at https://docs.apify.com/sdk/js/docs/guides/docker-images # You can also use any other image from Docker Hub. FROM apify/actor-node:22 # Check preinstalled packages RUN npm ls crawlee apify puppeteer playwright # Copy just package.json and package-lock.json # to speed up the build using Docker layer cache. COPY --chown=myuser:myuser package*.json ./ # Install NPM packages, skip optional and development dependencies to # keep the image small. Avoid logging too much and print the dependency # tree for debugging RUN npm --quiet set progress=false && npm install --omit=dev --omit=optional && echo "Installed NPM packages:" && (npm list --omit=dev --all || true) && echo "Node.js version:" && node --version && echo "NPM version:" && npm --version && rm -r ~/.npm # Next, copy the remaining files and directories with the source code. # Since we do this after NPM install, quick build will be really fast # for most source file changes. COPY --chown=myuser:myuser . ./ # Run the image. CMD npm start --silent ``` ### package.json (default values) ```json { "name": "<ACTOR-NAME-FROM-USER>", "version": "0.0.1", "type": "module", "description": "<ACTOR-NAME-FROM-USER>", "engines": { "node": ">=20.0.0" }, "dependencies": { "apify": "^3.4.2", "crawlee": "^3.13.8" }, "devDependencies": { "@apify/eslint-config": "^1.0.0", "eslint": "^9.29.0", "eslint-config-prettier": "^10.1.5", "prettier": "^3.5.3" }, "scripts": { "start": "node src/main.js", "format": "prettier --write .", "format:check": "prettier --check .", "lint": "eslint", "lint:fix": "eslint --fix", "test": "echo 'Error: oops, the Actor has no tests yet, sad!' 
&& exit 1" }, "author": "It's not you it's me", "license": "ISC" } ``` ## What are Apify Actors? - Actors are serverless cloud programs that can perform anything from a simple action, like filling out a web form, to a complex operation, like crawling an entire website or removing duplicates from a large dataset. - Actors are programs packaged as Docker images, which accept a well-defined JSON input, perform an action, and optionally produce a well-defined JSON output. ### Apify Actor directory structure ```text .actor/ ├── actor.json # Actor config: name, version, env vars, runtime settings ├── input_schema.json # Input validation & Console form definition ├── dataset_schema.json # Dataset schema definition └── output_schema.json # Specifies where an Actor stores its output src/ └── main.js # Actor entry point and orchestrator storage/ # Local storage (mirrors Cloud during development) ├── datasets/ # Output items (JSON objects) ├── key_value_stores/ # Files, config, INPUT └── request_queues/ # Pending crawl requests Dockerfile # Container image definition AGENTS.md # AI agent instructions (this file) ``` ## Apify CLI ### Installation - Install Apify CLI only if it is not already installed. - If Apify CLI is not installed, install it using the following commands: - macOS/Linux: `curl -fsSL https://apify.com/install-cli.sh | bash` - Windows: `irm https://apify.com/install-cli.ps1 | iex` ### Apify CLI Commands ```bash # Local development apify run # Run Actor locally # Authentication & deployment apify login # Authenticate account apify push # Deploy to Apify platform # Help apify help # List all commands ``` ## Do - use the default values for all fields in the actor.json, input_schema.json, output_schema.json, and main.js files - use Apify CLI to run the Actor locally, and push it to the Apify platform - accept well-defined JSON input and produce structured JSON output - use Apify SDK (`apify`) for code running ON Apify platform - validate input early with proper error handling and fail gracefully - use CheerioCrawler for static HTML content (10x faster than browsers) - use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content - use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls - implement retry strategies with exponential backoff for failed requests - use proper concurrency settings (HTTP: 10-50, Browser: 1-5) - set sensible defaults in `.actor/input_schema.json` for all optional fields - set up output schema in `.actor/output_schema.json` - clean and validate data before pushing to dataset - use semantic CSS selectors and fallback strategies for missing elements - respect robots.txt, ToS, and implement rate limiting with delays - check which tools (cheerio/playwright/crawlee) are installed before applying guidance ## Don't - do not run apify create command - do not rely on `Dataset.getInfo()` for final counts on Cloud platform - do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP) - do not hard code values that should be in input schema or environment variables - do not skip input validation or error handling - do not overload servers - use appropriate concurrency and delays - do not scrape prohibited content or ignore Terms of Service - do not store personal/sensitive data unless explicitly permitted - do not use deprecated options like `requestHandlerTimeoutMillis` on CheerioCrawler (v3.x) - do not use `additionalHttpHeaders` - use `preNavigationHooks` instead ## Actor Input Schema The input schema 
## Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

### Structure

```json
{
    "title": "<INPUT-SCHEMA-TITLE>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        /* define input fields here */
    },
    "required": []
}
```

## Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

### Structure

```json
{
    "actorOutputSchemaVersion": 1,
    "title": "<OUTPUT-SCHEMA-TITLE>",
    "properties": {
        /* define your outputs here */
    }
}
```

### Output Schema Template Variables

- `links` (object) - Contains quick links to the most commonly used URLs
- `links.publicRunUrl` (string) - Public run URL in the format `https://console.apify.com/view/runs/:runId`
- `links.consoleRunUrl` (string) - Console run URL in the format `https://console.apify.com/actors/runs/:runId`
- `links.apiRunUrl` (string) - API run URL in the format `https://api.apify.com/v2/actor-runs/:runId`
- `links.apiDefaultDatasetUrl` (string) - API URL of the default dataset in the format `https://api.apify.com/v2/datasets/:defaultDatasetId`
- `links.apiDefaultKeyValueStoreUrl` (string) - API URL of the default key-value store in the format `https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId`
- `links.containerRunUrl` (string) - URL of a web server running inside the run, in the format `https://<containerId>.runs.apify.net/`
- `run` (object) - Contains the same information about the run as the `GET Run` API endpoint returns
- `run.defaultDatasetId` (string) - ID of the default dataset
- `run.defaultKeyValueStoreId` (string) - ID of the default key-value store
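Tying the input schema back to runtime code, a minimal sketch of the "validate input early and fail gracefully" advice from the Do list, using the `startUrls` field from the example schema (the exact error message is illustrative):

```javascript
import { Actor } from 'apify';

await Actor.init();

const { startUrls } = (await Actor.getInput()) ?? {};

// Fail early with a clear status message instead of crashing mid-crawl.
if (!Array.isArray(startUrls) || startUrls.length === 0) {
    await Actor.fail('Input error: "startUrls" must be a non-empty array.');
}

// ...crawler setup and run as in main.js...
await Actor.exit();
```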
## Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

### Structure

```json
{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}
```

**Dataset Schema Properties:**

- `actorSpecification` (integer, required) - Specifies the version of the dataset schema structure document (currently only version 1)
- `fields` (JSONSchema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
- `views` (DatasetView object, required) - Object with API and UI views description

**DatasetView Properties:**

- `title` (string, required) - Visible in the UI Output tab and API
- `description` (string, optional) - Only available in the API response
- `transformation` (ViewTransformation object, required) - Data transformation applied when loading from the Dataset API
- `display` (ViewDisplay object, required) - Output tab UI visualization definition

**ViewTransformation Properties:**

- `fields` (string[], required) - Fields to present in the output (order matches column order)
- `unwind` (string[], optional) - Deconstructs nested children into the parent object
- `flatten` (string[], optional) - Transforms a nested object into a flat structure
- `omit` (string[], optional) - Removes the specified fields from the output
- `limit` (integer, optional) - Maximum number of results (default: all)
- `desc` (boolean, optional) - Sort order (true = newest first)

**ViewDisplay Properties:**

- `component` (string, required) - Only `table` is available
- `properties` (Object, optional) - Keys matching `transformation.fields` with ViewDisplayProperty values

**ViewDisplayProperty Properties:**

- `label` (string, optional) - Table column header
- `format` (string, optional) - One of: `text`, `number`, `date`, `link`, `boolean`, `image`, `array`, `object`
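Since the default `overview` view displays exactly `title` and `url`, it pays to normalize items to that shape before they reach the dataset. A minimal sketch of the "clean and validate data before pushing" advice; the `cleanItem` helper is hypothetical:

```javascript
// Hypothetical helper - trims values and drops rows that would render
// as empty cells in the Output tab's table view.
function cleanItem({ url, title }) {
    const trimmedTitle = (title ?? '').trim();
    if (!url || !trimmedTitle) return null;
    return { url, title: trimmedTitle };
}

// Usage inside a requestHandler:
// const item = cleanItem({ url: request.loadedUrl, title: $('title').text() });
// if (item) await Dataset.pushData(item);
```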
## Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

### Structure

```json
{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "string (required)",
    "description": "string (optional)",
    "collections": {
        "<COLLECTION_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "key": "string (conditional - use key OR keyPrefix)",
            "keyPrefix": "string (conditional - use key OR keyPrefix)",
            "contentTypes": ["string (optional)"],
            "jsonSchema": "object (optional)"
        }
    }
}
```

**Key-Value Store Schema Properties:**

- `actorKeyValueStoreSchemaVersion` (integer, required) - Version of the key-value store schema structure document (currently only version 1)
- `title` (string, required) - Title of the schema
- `description` (string, optional) - Description of the schema
- `collections` (Object, required) - Object where each key is a collection ID and each value is a Collection object

**Collection Properties:**

- `title` (string, required) - Collection title shown in UI tabs
- `description` (string, optional) - Description appearing in UI tooltips
- `key` (string, conditional) - Single specific key for this collection
- `keyPrefix` (string, conditional) - Prefix of the keys included in this collection
- `contentTypes` (string[], optional) - Allowed content types for validation
- `jsonSchema` (object, optional) - JSON Schema Draft 07 definition used to validate `application/json` content

Either `key` or `keyPrefix` must be specified for each collection, but not both.

## Apify MCP Tools

If an MCP server is configured, use these tools for documentation:

- `search-apify-docs` - Search the documentation
- `fetch-apify-docs` - Fetch full documentation pages

Otherwise, reference: `@https://mcp.apify.com/`

## Resources

- [docs.apify.com/llms.txt](https://docs.apify.com/llms.txt) - Quick reference
- [docs.apify.com/llms-full.txt](https://docs.apify.com/llms-full.txt) - Complete docs
- [crawlee.dev](https://crawlee.dev) - Crawlee documentation
- [whitepaper.actor](https://raw.githubusercontent.com/apify/actor-whitepaper/refs/heads/master/README.md) - Complete Actor specification
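Returning to the key-value store collections above, a brief hypothetical sketch of how written keys map onto a `keyPrefix` collection (the `screenshots-` prefix, key names, and empty buffers are invented for illustration):

```javascript
import { Actor } from 'apify';

await Actor.init();

// Keys written with Actor.setValue() fall into whichever collection's
// keyPrefix they match - e.g. a collection with "keyPrefix": "screenshots-"
// would group both of these records together in the Console.
await Actor.setValue('screenshots-home', Buffer.from(''), { contentType: 'image/png' });
await Actor.setValue('screenshots-pricing', Buffer.from(''), { contentType: 'image/png' });

await Actor.exit();
```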
