# 🔭 EOSC Data Commons Search server
[Build workflow](https://github.com/EOSC-Data-Commons/data-commons-search/actions/workflows/build.yml)
[Container image](https://github.com/EOSC-Data-Commons/data-commons-search/pkgs/container/data-commons-search)
<!-- [PyPI package](https://pypi.org/project/data-commons-search/) -->
A server for the [EOSC Data Commons project](https://eosc.eu/horizon-europe-projects/eosc-data-commons/) MatchMaker service, providing natural language search over open-access datasets. It exposes an HTTP POST endpoint and supports the [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) to help users discover datasets and tools via a Large Language Model–assisted search.
## 🧩 Endpoints
The HTTP API comprises two main endpoints:
- `/mcp`: **MCP server** that searches for relevant data to answer a user question using the EOSC Data Commons OpenSearch service
  - Uses Streamable HTTP transport
  - Available tools:
    - [x] Search datasets
    - [x] Get metadata for the files in a dataset (name, description, type of files)
    - [ ] Search tools
    - [ ] Search citations related to datasets or tools
- `/chat`: **HTTP POST** endpoint (JSON) for chatting with the MCP server tools via an LLM provider (API keys are provided through environment variables at deployment)
  - Streams a Server-Sent Events (SSE) response complying with the [AG-UI protocol](https://ag-ui.com)
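
For illustration, the `/chat` endpoint can be consumed with any SSE-capable HTTP client. Below is a minimal sketch in Python, assuming the `httpx` package, a local deployment, and the request shape used in the `curl` example further down (the `Authorization` header is only needed when `SEARCH_API_KEY` is set):

```python
import httpx

payload = {
    "messages": [{"role": "user", "content": "Datasets about ocean temperature"}],
    "model": "einfracz/qwen3-coder",
}
# POST the chat request and stream the AG-UI Server-Sent Events as they arrive
with httpx.stream(
    "POST",
    "http://localhost:8000/chat",
    json=payload,
    headers={"Authorization": "SECRET_KEY"},  # only needed if SEARCH_API_KEY is set
    timeout=None,
) as response:
    for line in response.iter_lines():
        if line.startswith("data:"):
            print(line[len("data:"):].strip())  # one AG-UI event payload per data line
```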
> [!TIP]
>
> It can also be used purely as an MCP server through the pip package.
## 🔌 Connect client to MCP server
The system can be used directly as an MCP server, using either STDIO or Streamable HTTP transport.
> [!WARNING]
>
> You will need access to a pre-indexed OpenSearch instance for the MCP server to work.

Follow your client's instructions, and use the `/mcp` URL of your deployed server (e.g. http://localhost:8000/mcp).
To add a new MCP server to **VSCode GitHub Copilot**:
- Open the Command Palette (`ctrl+shift+p` or `cmd+shift+p`)
- Search for `MCP: Add Server...`
- Choose `HTTP`, and provide the MCP server URL http://localhost:8000/mcp
Your VSCode `mcp.json` should look like:
```json
{
  "servers": {
    "data-commons-search-http": {
      "url": "http://localhost:8000/mcp",
      "type": "http"
    }
  },
  "inputs": []
}
```
Or with STDIO transport:
```json
{
  "servers": {
    "data-commons-search": {
      "type": "stdio",
      "command": "uvx",
      "args": ["data-commons-search"],
      "env": {
        "OPENSEARCH_URL": "OPENSEARCH_URL"
      }
    }
  }
}
```
Or using a local folder for development:
```json
{
  "servers": {
    "data-commons-search": {
      "type": "stdio",
      "cwd": "~/dev/data-commons-search",
      "env": {
        "OPENSEARCH_URL": "OPENSEARCH_URL"
      },
      "command": "uv",
      "args": ["run", "data-commons-search"]
    }
  }
}
```
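
The same MCP endpoint can also be called programmatically. A minimal sketch using the official [`mcp` Python SDK](https://github.com/modelcontextprotocol/python-sdk) (an assumption; install it with `pip install mcp`) against a local deployment:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def main() -> None:
    # Open a Streamable HTTP connection to the deployed MCP endpoint
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])  # e.g. the dataset search tools


asyncio.run(main())
```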
## 🛠️ Development
> [!IMPORTANT]
>
> Requirements:
>
> - [x] [`uv`](https://docs.astral.sh/uv/getting-started/installation/), to easily handle scripts and virtual environments
> - [x] Docker, to deploy the OpenSearch service (or access to a running instance)
> - [x] An API key for an LLM provider: [e-infra CZ](https://chat.ai.e-infra.cz/), [Mistral.ai](https://console.mistral.ai/api-keys), or [OpenRouter](https://openrouter.ai/)
>
### 📥 Install dev dependencies
```sh
uv sync --extra agent
```
Install pre-commit hooks:
```sh
uv run pre-commit install
```
Create a `keys.env` file with your LLM provider API key(s):
```sh
EINFRACZ_API_KEY=YOUR_API_KEY
MISTRAL_API_KEY=YOUR_API_KEY
OPENROUTER_API_KEY=YOUR_API_KEY
```
### ⚡️ Start dev server
Start the dev server at http://localhost:8000, with the MCP endpoint at http://localhost:8000/mcp:
```sh
uv run uvicorn src.data_commons_search.main:app --log-config logging.yml --reload
```
> Default `OPENSEARCH_URL=http://localhost:9200`

Customize the server configuration through environment variables:
```sh
SERVER_PORT=8001 OPENSEARCH_URL=http://localhost:9200 uv run uvicorn src.data_commons_search.main:app --host 0.0.0.0 --port 8001 --log-config logging.yml --reload
```
> [!TIP]
>
> Example `curl` request:
>
> ```sh
> curl -X POST http://localhost:8000/chat \
> -H "Content-Type: application/json" -H "Authorization: SECRET_KEY" \
> -d '{"messages": [{"role": "user", "content": "Educational datasets from Switzerland covering student assessments, language competencies, and learning outcomes, including experimental or longitudinal studies on pupils or students."}], "model": "einfracz/qwen3-coder"}'
> ```
>
> Recommended model per supported provider:
>
> - `einfracz/qwen3-coder` or `einfracz/gpt-oss-120b` (smaller, faster)
> - `mistralai/mistral-medium-latest` (large is older, and not as good with tool calls)
> - `groq/moonshotai/kimi-k2-instruct`
> - `openai/gpt-4.1`

> [!IMPORTANT]
>
> To build the frontend web app and integrate it into the server, run from the [frontend folder](https://github.com/EOSC-Data-Commons/eoscdcpoc):
>
> ```sh
> npm run build && rm -rf ../data-commons-search/src/data_commons_search/webapp/ && cp -R dist/spa/ ../data-commons-search/src/data_commons_search/webapp/
> ```
>
### 📦 Build for production
Build the distribution packages in `dist/`:
```sh
uv build
```
### 🐳 Deploy with Docker
Create a `keys.env` file with the API keys:
```sh
EINFRACZ_API_KEY=YOUR_API_KEY
MISTRAL_API_KEY=YOUR_API_KEY
OPENROUTER_API_KEY=YOUR_API_KEY
SEARCH_API_KEY=SECRET_KEY_YOU_CAN_USE_IN_FRONTEND_TO_AVOID_SPAM
```
> [!TIP]
>
> `SEARCH_API_KEY` adds a layer of protection against bots that might spam the LLM. If it is not set, no API key is required to query the API.

You can use the prebuilt Docker image [`ghcr.io/eosc-data-commons/data-commons-search:main`](https://github.com/EOSC-Data-Commons/data-commons-search/pkgs/container/data-commons-search).
Example `compose.yml`:
```yaml
services:
  mcp:
    image: ghcr.io/eosc-data-commons/data-commons-search:main
    ports:
      - "127.0.0.1:8000:8000"
    environment:
      OPENSEARCH_URL: "http://opensearch:9200"
      EINFRACZ_API_KEY: "${EINFRACZ_API_KEY}"
```
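
If you do not already have a running OpenSearch instance, a single-node service can be added under the `services:` key of the same `compose.yml`. This is only a sketch for local testing, assuming the stock OpenSearch image (keep the security plugin enabled and use proper credentials in production):

```yaml
  opensearch:
    image: opensearchproject/opensearch:2.11.1
    environment:
      discovery.type: single-node
      DISABLE_SECURITY_PLUGIN: "true"  # local testing only; never in production
    ports:
      - "127.0.0.1:9200:9200"
```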
Deploy the service:
```sh
docker compose up
```
> [!IMPORTANT]
>
> Deployment to the staging server happens automatically through GitHub Actions on each push to the `main` branch.
>
> On each push, the workflow will:
>
> - Pull the `main` branch from the frontend repository
> - Build the frontend, and add it to `src/data_commons_search/webapp`
> - Build the docker image for the server
> - Publish the docker image as `main`/`latest`
> - The staging infrastructure then automatically pulls the `latest` image and deploys it.
### ✅ Run tests
> [!CAUTION]
>
> You first need to start the dev server on port 8001 (see the start dev server section above).
```bash
uv run pytest
```
To display all logs when debugging:
```bash
uv run pytest -s
```
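
For illustration, an integration test against the running dev server could look like this sketch (assuming `httpx` and the `/chat` request shape from the `curl` example above; this is not one of the repository's actual tests):

```python
import httpx


def test_chat_endpoint_responds() -> None:
    payload = {
        "messages": [{"role": "user", "content": "Datasets about air quality in Europe"}],
        "model": "einfracz/qwen3-coder",
    }
    # The dev server must already be running on port 8001 (see the section above)
    with httpx.stream("POST", "http://localhost:8001/chat", json=payload, timeout=120) as resp:
        assert resp.status_code == 200
```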
### 🧹 Format code and type check
```bash
uvx ruff format
uvx ruff check --fix
uv run mypy
```
### ♻️ Reset the environment
Upgrade `uv`:
```sh
uv self update
```
Clean `uv` cache:
```sh
uv cache clean
```
### 🏷️ Release process
> [!IMPORTANT]
>
> Get a PyPI API token at [pypi.org/manage/account](https://pypi.org/manage/account).

Run the release script, providing the version bump type (`fix`, `minor`, or `major`):
```sh
.github/release.sh fix
```
> [!TIP]
>
> Add your PyPI token to your environment, e.g. in `~/.zshrc` or `~/.bashrc`:
>
> ```sh
> export UV_PUBLISH_TOKEN=YOUR_TOKEN
> ```
## 🤝 Acknowledgments
The LLM provider `einfracz` is a service provided by e-INFRA CZ and operated by CERIT-SC, Masaryk University.
Computational resources were provided by the e-INFRA CZ project (ID:90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.