Skip to main content
Glama
AGENTS.md23.1 kB
# EFSA OpenFoodTox MCP Server - Project Instructions ## Project Overview Python MCP (Model Context Protocol) server providing LLM access to EFSA food safety toxicology data (~138,000 entries in 9 tables, 8K Chemical substance details). Enables natural language queries about food additives, pesticides, and chemical safety assessments. ## Dataset Structure ### Core Tables (SQLite) - **Opinion** (2,437) - EFSA published opinions/documents - **Synonym** (45,582) - Alternative names, E-numbers, trade names - **Component** (8,006) - Chemical substance details (CAS, formulas, IUPAC) - **Study** (54,621) - Fact table linking substances to studies - **Genotox** (246) - Genotoxicity study details - **Endpoint_study** (11,698) - Toxicity endpoints (NOAEL, LD50, etc.) - **Chem_assess** (11,357) - Risk assessments (ADI, TDI, safety factors) - **Question** (5,296) - Opinion-related questions - **Dictionary** (204) - Metadata/column descriptions ### Key Relationships ``` SYNONYM (45,582 names/codes) ↓ SUB_COM_ID COMPONENT (8,006 substances) ↓ SUB_COM_ID STUDY (54,621 study records) ↓ branches to three study types: ├─ GENOTOX_ID → GENOTOX (246 studies) ├─ TOX_ID → ENDPOINT_STUDY (11,698 studies) └─ HAZARD_ID → CHEM_ASSESS (11,357 assessments) ↓ OP_ID OPINION (2,437 opinions) ↓ OP_ID QUESTION (5,296 questions) ``` ### Important Fields - **TRX_ID**: Transaction ID (published transmission identifier) - **OP_ID**: Opinion identifier - **SUB_COM_ID**: Substance-component link - **COM_TYPE**: Chemical complexity (single/mixture/botanical/synthetic) - **SUB_OP_CLASS**: Usage category (food additive/pesticide/flavoring) - **REGULATION_CODE**: EU regulation (1333/2008=additives, 1107/2009=pesticides) ## Database Schema Reference <details> <summary>Detailed column metadata (197 columns across 9 tables)</summary> *Reference only - use when needed for column names, types, and nullability* ``` Dictionary table <class 'pandas.core.frame.DataFrame'> RangeIndex: 204 entries, 0 to 203 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Table_name 197 non-null object 1 Name 197 non-null object 2 Type 197 non-null object 3 Description 197 non-null object 4 isNullable 197 non-null object 5 isRecordUniqueIdentifier 197 non-null object 6 Catalogue Code 33 non-null object 7 Last update 190 non-null object dtypes: object(8) memory usage: 12.9+ KB None ---------- Synonym table <class 'pandas.core.frame.DataFrame'> RangeIndex: 45582 entries, 0 to 45581 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SYNONYM_ID 45582 non-null int64 1 SUB_COM_ID 45582 non-null int64 2 TRX_ID 45582 non-null int64 3 TYPE 45582 non-null object 4 DESCRIPTION 45582 non-null object dtypes: int64(3), object(2) memory usage: 1.7+ MB None ---------- Opinion table <class 'pandas.core.frame.DataFrame'> RangeIndex: 2437 entries, 0 to 2436 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 DOCUMENT_ID 2437 non-null int64 1 OP_ID 2437 non-null int64 2 TRX_ID 2437 non-null int64 3 DOCTYPE_ID 2437 non-null int64 4 DOCTYPE_CODE 2437 non-null object 5 DOCTYPE 2437 non-null object 6 AUTHOR 2436 non-null object 7 TITLE 2437 non-null object 8 ADOPTION_DATE 2437 non-null int64 9 ADOPTIONDATE 2437 non-null datetime64[ns] 10 PUBLICATION_DATE 2437 non-null int64 11 PUBLICATIONDATE 2437 non-null datetime64[ns] 12 PUBLICATIONYEAR 2437 non-null int64 13 DOI 2437 non-null object 14 URL 2437 non-null object 15 REGULATION_ID 2422 non-null float64 16 REGULATION_CODE 2422 non-null object 17 REGULATION 2422 non-null object 18 REGULATIONFULLTEXT 2422 non-null object 19 OWNER 2437 non-null object dtypes: datetime64[ns](2), float64(1), int64(7), object(10) memory usage: 380.9+ KB None ---------- Component table <class 'pandas.core.frame.DataFrame'> RangeIndex: 8006 entries, 0 to 8005 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SUBSTANCECOMPONENT_ID 8006 non-null int64 1 SUB_COM_ID 8006 non-null int64 2 SUB_ID 8006 non-null int64 3 COM_ID 8006 non-null int64 4 TRX_ID 8006 non-null int64 5 SUB_NAME 8006 non-null object 6 SUB_ECSUBINVENTENTRYREF 3111 non-null object 7 SUB_CASNUMBER 4263 non-null object 8 SUB_DESCRIPTION 4398 non-null object 9 SUBPARAM_ID 8006 non-null int64 10 SUBPARAM_CODE 8006 non-null object 11 SUBPARAMNAME 8006 non-null object 12 SUB_TYPE 8006 non-null object 13 QUALIFIER_ID 8006 non-null int64 14 QUALIFIER_CODE 8006 non-null object 15 QUALIFIER 8006 non-null object 16 COMP_VALUE 340 non-null float64 17 COM_NAME 8006 non-null object 18 COM_ECSUBINVENTENTRYREF 4705 non-null object 19 COM_CASNUMBER 6713 non-null object 20 IUPACNAME 6563 non-null object 21 COMPARAM_ID 8006 non-null int64 22 COMPARAM_CODE 8006 non-null object 23 COMPARAMNAME 8006 non-null object 24 MOLECULARFORMULA 6670 non-null object 25 SMILESNOTATION 6695 non-null object 26 INCHI 6743 non-null object 27 COM_TYPE 8006 non-null object 28 COM_STRUCTURESHOWN 0 non-null float64 29 SMILESNOTATIONSOURCE 0 non-null float64 30 INCHI_NOTATIONSOURCE 0 non-null float64 dtypes: float64(4), int64(8), object(19) memory usage: 1.9+ MB None ---------- Study table <class 'pandas.core.frame.DataFrame'> RangeIndex: 54621 entries, 0 to 54620 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FACTSTUDY_ID 54621 non-null int64 1 STUDY_ID 54621 non-null int64 2 SUB_COM_ID 54621 non-null int64 3 OP_ID 54621 non-null int64 4 GENOTOX_ID 1452 non-null float64 5 TOX_ID 30192 non-null float64 6 HAZARD_ID 42475 non-null float64 7 TRX_ID 54621 non-null int64 8 SUB_OP_CLASS 54621 non-null object 9 IS_MUTAGENIC 54621 non-null object 10 IS_GENOTOXIC 54621 non-null object 11 IS_CARCINOGENIC 54621 non-null object 12 REMARKS_STUDY 54590 non-null object 13 TOXREF_ID 84 non-null float64 dtypes: float64(4), int64(5), object(5) memory usage: 5.8+ MB None ---------- Chem_assess table <class 'pandas.core.frame.DataFrame'> RangeIndex: 11357 entries, 0 to 11356 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CHEMASSESS_ID 11357 non-null int64 1 HAZARD_ID 11357 non-null int64 2 TRX_ID 11357 non-null int64 3 ASSESSMENTTYPE_ID 11357 non-null int64 4 ASSESSMENTTYPE_CODE 11357 non-null object 5 ASSESSMENTTYPE 11357 non-null object 6 RISKQUALIFIER_ID 10492 non-null float64 7 RISKQUALIFIER_CODE 10492 non-null object 8 RISKQUALIFIER 10492 non-null object 9 RISKVALUE 10492 non-null float64 10 RISKUNIT_ID 10492 non-null float64 11 RISKUNIT_CODE 10492 non-null object 12 RISKUNIT 10486 non-null object 13 RISKUNITFULLTEXT 10492 non-null object 14 RISKVALUE_MILLI 10492 non-null float64 15 RISKUNIT_MILLI 10492 non-null object 16 SAFETY_FACTOR 4440 non-null float64 17 ID_POPULATION 11328 non-null object 18 POPULATIONTEXT 11328 non-null object 19 REMARKS 9075 non-null object 20 ASSESS 1305 non-null object 21 COM_GROUP_ID 340 non-null float64 22 GROUP_UNIT 340 non-null object 23 GROUP_REMARKS 16 non-null object dtypes: float64(6), int64(4), object(14) memory usage: 2.1+ MB None ---------- Question table <class 'pandas.core.frame.DataFrame'> RangeIndex: 5296 entries, 0 to 5295 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 QUESTION_ID 5296 non-null int64 1 OP_ID 5296 non-null int64 2 TRX_ID 5296 non-null int64 3 QUESTION 5296 non-null object dtypes: int64(3), object(1) memory usage: 165.6+ KB None ---------- Genotox table <class 'pandas.core.frame.DataFrame'> RangeIndex: 246 entries, 0 to 245 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 GENOTOXICITY_ID 246 non-null int64 1 GENOTOX_ID 246 non-null int64 2 TRX_ID 246 non-null int64 3 STUDY_CATEGORY 246 non-null object 4 GUIDELINE_QUALIFIER 246 non-null object 5 GENOTOXGUIDELINE_ID 31 non-null float64 6 GENOTOXGUIDELINE_CODE 31 non-null object 7 GENOTOXGUIDELINE 246 non-null object 8 GENOTOXGUIDELINEFULLTXT 246 non-null object 9 DEVIATION 20 non-null object 10 GLP_COMPL 32 non-null object 11 SPECIES_CODE_ID 246 non-null int64 12 SPECIES_CODE 246 non-null object 13 SPECIES 246 non-null object 14 STRAIN_ID 104 non-null float64 15 STRAIN_CODE 104 non-null object 16 STRAIN 104 non-null object 17 SEX 42 non-null object 18 MET_INDICATOR 75 non-null object 19 ROUTE_ID 61 non-null float64 20 ROUTE_CODE 61 non-null object 21 ROUTE 246 non-null object 22 EXP_PERIOD 35 non-null float64 23 EXPPERIODUNIT_ID 35 non-null float64 24 EXPPERIODUNIT_CODE 35 non-null object 25 EXPPERIODUNIT 35 non-null object 26 EXPPERIODUNITFULLTXT 35 non-null object 27 EXPPERIOD_DAY 35 non-null float64 28 NUMBER_INDIVIDUALS 13 non-null float64 29 CONTROL 40 non-null object 30 IS_GENOTOXIC 246 non-null object 31 REMARKS 231 non-null object dtypes: float64(7), int64(4), object(21) memory usage: 61.6+ KB None ---------- Endpoint_study table <class 'pandas.core.frame.DataFrame'> RangeIndex: 11698 entries, 0 to 11697 Data columns (total 60 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ENDPOINTSTUDY_ID 11698 non-null int64 1 TOX_ID 11698 non-null int64 2 TRX_ID 11698 non-null int64 3 STUDY_CATEGORY 11698 non-null object 4 TESTSUBSTANCE 1931 non-null object 5 TESTTYPE_ID 11698 non-null int64 6 TESTTYPE_CODE 11698 non-null object 7 TESTTYPE 11698 non-null object 8 LIMITTEST 1174 non-null object 9 GUIDELINE_QUALIFIER 11698 non-null object 10 GUIDELINE_ID 1362 non-null float64 11 GUIDELINE_CODE 1362 non-null object 12 GUIDELINE 11698 non-null object 13 GUIDELINEFULLTXT 11698 non-null object 14 DEVIATION 11698 non-null object 15 GLP_COMPL 11698 non-null object 16 SPECIES_ID 11646 non-null float64 17 SPECIES_CODE 11698 non-null object 18 SPECIES 11698 non-null object 19 STRAIN_ID 2 non-null float64 20 STRAIN_CODE 458 non-null object 21 STRAIN 458 non-null object 22 SEX 1042 non-null object 23 ROUTE_ID 4310 non-null float64 24 ROUTE_CODE 4310 non-null object 25 ROUTE 11698 non-null object 26 EXP_DURATION 6304 non-null float64 27 DURATIONUNIT_ID 6304 non-null float64 28 DURATIONUNIT_CODE 6304 non-null object 29 DURATIONUNIT 6304 non-null object 30 EXP_DURATION_DAYS 6304 non-null float64 31 NUMBER_INDIVIDUALS 647 non-null float64 32 CONTROL 731 non-null object 33 ENDPOINT_ID 11698 non-null int64 34 ENDPOINT_CODE 11698 non-null object 35 ENDPOINT 11698 non-null object 36 QUALIFIER_ID 11698 non-null int64 37 QUALIFIER_CODE 11698 non-null object 38 QUALIFIER 11698 non-null object 39 VALUE 11698 non-null float64 40 DOSEUNIT_ID 11698 non-null int64 41 DOSEUNIT_CODE 11698 non-null object 42 DOSEUNIT 11691 non-null object 43 DOSEUNITFULLTEXT 11698 non-null object 44 VALUE_MILLI 11698 non-null float64 45 UNIT_MILLI 11698 non-null object 46 BASIS_ID 11698 non-null int64 47 BASIS_CODE 11698 non-null object 48 BASIS 11698 non-null object 49 TOXICITY_ID 5253 non-null float64 50 TOXICITY_CODE 5253 non-null object 51 TOXICITY 5253 non-null object 52 TARGETTISSUE_ID 674 non-null float64 53 TARGETTISSUE_CODE 674 non-null object 54 TARGETTISSUE 657 non-null object 55 EFFECT_DESC 3567 non-null object 56 REMARKS 5363 non-null object 57 GROUP_UNIT 236 non-null object 58 COMGROUP_ID 236 non-null float64 59 GROUP_REMARKS 26 non-null object ``` *Add full schema export here when available* </details> ## MCP Tools Implemented ### 1. `search_substance` **Input:** `description_search: str` (substance name, E-number, or description) **Returns:** List of unique substances (by SUB_COM_ID) with aggregated study data (GENOTOX_ID, TOX_ID, HAZARD_ID, OP_ID arrays) **Method:** `query_search_substance()` - searches SYNONYM first (DESCRIPTION field), falls back to COMPONENT (SUB_NAME, COM_NAME) if no results **Joins:** SYNONYM → COMPONENT (by SUB_COM_ID) → STUDY (by SUB_COM_ID) **Note:** Returns aggregated arrays of study IDs, not full study details. Use other tools with these IDs for detailed data. ### 2. `get_substance_safety_assessment` **Input:** `sub_com_id: int` (use search_substance first to get SUB_COM_ID) **Returns:** Safety flags (IS_MUTAGENIC, IS_GENOTOXIC, IS_CARCINOGENIC) with opinion metadata, sorted chronologically **Method:** `query_safety_assessment()` - filters STUDY by SUB_COM_ID, joins OPINION for dates/metadata **Joins:** STUDY (filtered by SUB_COM_ID) → OPINION (LEFT JOIN on OP_ID) **Note:** Returns DataFrame as JSON, sorted by PUBLICATIONDATE ascending ### 3. `get_toxicity_endpoints` **Input:** `tox_id: Union[int, list[int]]` (use search_substance to get TOX_IDs from STUDY table) **Returns:** Toxicity endpoint study data (NOAEL, LD50, target organs, study conditions) **Method:** `query_by_id()` - direct query to ENDPOINT_STUDY table **Joins:** None (direct table query by TOX_ID) **Note:** Returns all columns from ENDPOINT_STUDY table as JSON ### 4. `get_risk_assessments` **Input:** `hazard_id: Union[int, list[int]]` (use search_substance to get HAZARD_IDs from STUDY table) **Returns:** Risk assessment data (ADI/TDI values, RISKVALUE_MILLI, SAFETY_FACTOR, POPULATIONTEXT) **Method:** `query_by_id()` - direct query to CHEM_ASSESS table **Joins:** None (direct table query by HAZARD_ID) **Note:** Returns all columns from CHEM_ASSESS table as JSON. Use this for safe intake limits (ADI/TDI). ### 5. `get_genotox_details` **Input:** `genotox_id: Union[int, list[int]]` (use search_substance to get GENOTOX_IDs from STUDY table) **Returns:** Genotoxicity study details (test guidelines, species, exposure conditions, IS_GENOTOXIC result) **Method:** `query_by_id()` - direct query to GENOTOX table **Joins:** None (direct table query by GENOTOX_ID) ### 6. `get_opinions` **Input:** `op_id: Union[int, list[int]]` (use search_substance to get OP_IDs from STUDY table) **Returns:** EFSA opinion documents (title, author, publication date, DOI, URL, regulation info) **Method:** `query_by_id()` - direct query to OPINION table **Joins:** None (direct table query by OP_ID) ### 7. `list_substances_by_class_and_safety` **Input:** Optional filters: `sub_class: str`, `is_mutagenic: str`, `is_genotoxic: str`, `is_carcinogenic: str`, `remarks_contains: str`, `limit: int` **Returns:** Filtered substance list with synonyms aggregated (SUB_COM_ID, COM_NAME, COM_TYPE, SUB_TYPE, DESCRIPTION) **Method:** `query_substances_by_class_and_safety()` - filters STUDY by safety criteria, joins to COMPONENT and SYNONYM **Joins:** STUDY (filtered by safety criteria) → COMPONENT (INNER JOIN on SUB_COM_ID) → SYNONYM (LEFT JOIN on SUB_COM_ID) **Note:** Returns unique substances (GROUP BY SUB_COM_ID) with synonyms aggregated as comma-separated list ### 8. `list_hazard_ids_by_assessment` **Input:** Optional filters: `population_text_contains: str`, `assessment_type: str`, `risk_value_milli_max: float`, `risk_value_milli_min: float`, `has_no_risk_value: bool`, `limit: int` **Returns:** List of HAZARD_IDs (as JSON array) matching assessment criteria **Method:** `query_hazard_ids_by_assessment()` - filters CHEM_ASSESS table directly **Joins:** None (direct query to CHEM_ASSESS table) **Note:** Building block tool - use returned HAZARD_IDs with `list_substances_by_study` (via `list_substances_by_assessment`) ### 9. `list_substances_by_assessment` **Input:** Optional filters: `population_text_contains: str`, `assessment_type: str`, `risk_value_milli_max: float`, `risk_value_milli_min: float`, `has_no_risk_value: bool`, `limit: int` **Returns:** Filtered substance list with synonyms aggregated (SUB_COM_ID, COM_NAME, COM_TYPE, SUB_TYPE, DESCRIPTION) **Method:** Two-step: `query_hazard_ids_by_assessment()` → `query_substances_by_study(study_type="hazard")` **Joins:** CHEM_ASSESS (filtered by assessment criteria) → STUDY (by HAZARD_ID) → COMPONENT (by SUB_COM_ID) → SYNONYM (by SUB_COM_ID) **Note:** Combines assessment filtering with substance retrieval in one tool ## Tools Not Yet Implemented ### `compare_substances` **Input:** `category: str, metric: str` **Returns:** Ranked comparison (e.g., sweeteners by ADI) **Joins:** COMPONENT → STUDY → CHEM_ASSESS (aggregated) **Status:** Not implemented - would require aggregation and ranking logic ## Search Strategy ### Multi-source Search Always search both SYNONYM and COMPONENT tables: - SYNONYM: E-numbers, common names, trade names (incomplete coverage) - COMPONENT: SUB_NAME, COM_NAME (authoritative chemical names) ### Handle Multiple Results - Same synonym → multiple SUB_COM_IDs (different sources/formulations) - Differentiate using: COM_TYPE, SUB_OP_CLASS, REGULATION_CODE, SUB_DESCRIPTION ### Name Variations Handle: "E951" vs "E 951" vs "aspartame" vs "L-aspartyl-L-phenylalanine methyl ester" ## Data Quirks 1. **Incomplete E-number coverage**: Not all E-numbers in SYNONYM (e.g., E962 missing) 2. **Duplicate synonyms**: Multiple SUB_COM_IDs share same synonym (e.g., "Provitamin A") 3. **Sparse nulls**: Many optional fields (GLP_COMPL, STRAIN, CONTROL, etc.) 4. **Regulation complexity**: 60+ regulations, key ones: 1333/2008 (additives), 1107/2009 (pesticides) ## Implementation Notes ### Query Architecture - **Hard-code SQL joins** in tools (deterministic, reliable) - LLM selects tool and extracts parameters - Avoid LLM-generated SQL queries ### Response Format Return structured JSON with: - Primary result (chemical details/safety data) - Context (opinion titles, study counts, regulation info) - Citations (OP_ID, TRX_ID, DOCUMENT_ID) ### Error Handling - Substance not found → suggest similar names - Multiple matches → return ranked list with differentiators - Missing data → explicitly state "no data available for X" ## Target Users 1. General public (E-number safety checks) 2. Food industry (ADI limits, specifications) 3. Regulators (genotoxicity flags, study gaps) 4. Researchers (comparative analysis, study protocols) 5. Health professionals (dietary advice, patient safety) ## Common Query Patterns - "What is [substance]?" → `identify_substance` - "Is [substance] safe?" → `get_safety_assessment` - "What are side effects of [substance]?" → `get_toxicity_effects` - "How much [substance] is safe daily?" → `get_safe_intake_limits` - "Compare all sweeteners" → `compare_substances` - "List natural food additives" → `list_substances_by_criteria` ## Development Priorities 1. Core search (SYNONYM + COMPONENT fuzzy matching) 2. Tool 1-4 (80% of user queries) 3. Result ranking/disambiguation 4. Tool 5-6 (advanced queries) 5. Citation formatting ## Completed Tasks - [x] Database setup and schema understanding - [x] Basic query function in main.py (SYNONYM → COMPONENT → STUDY joins) - [x] Project documentation and tool specifications - [x] MCP server structure and initialization - [x] Core search functionality (SYNONYM + COMPONENT fuzzy matching) - [x] Tool 1: `search_substance` - [SUB_COM_IDs] + basic info + study data aggregated into arrays - [x] Tool 2: `get_safety_assessment` - Safety flags (mutagenic/genotoxic/carcinogenic) - [x] Tool 3: `get_toxicity_effects` - Toxicity endpoints, target organs - [ ] Tool 4: `get_safe_intake_limits` - ADI/TDI values, safety factors - [ ] Tool 5: `compare_substances` - Ranked comparisons by category/metric - [x] Tool 6: `list_substances_by_criteria` - Filtered substance lists joining STUDY, COMPONENT and SYNONYM tables - [x] Tool 6: `list_substances_by_study` - Query by study IDs (GENOTOX_ID, TOX_ID, HAZARD_ID, or OP_ID) joining STUDY, COMPONENT and SYNONYM tables. meant for a building block that can combine a study filter eg CHEM_ASSESS filter -> list_substances_by_study -> component - [ ] Result ranking and disambiguation logic - [ ] Error handling and similar name suggestions - [ ] Citation formatting

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/spyrosze/mcp-openfoodtox'

If you have feedback or need assistance with the MCP directory API, please join our Discord server