# Create a compact yet comprehensive MIE file for the __DBNAME__ RDF database

## Philosophy: Essential over Exhaustive

Create documentation that is **compact, clear, and complete** - sufficient for researchers to effectively query the database without unnecessary bloat.

## 1. Discovery Phase (CRITICAL: Follow Systematically)

**⚠️ WARNING: Avoid Sampling Bias and Premature Conclusions**
- The first 50 results from a SPARQL query may NOT represent the entire database
- Always verify comprehensively using multiple query patterns before drawing conclusions
- Check ontology graphs for class definitions BEFORE sampling data
- Never assume timeouts mean "data doesn't exist"

### 1.1 Systematic Discovery Workflow

**Step 1: Check for Existing Documentation** (2 minutes)
- Use `get_sparql_endpoints()` to identify available endpoints
- Use `get_graph_list(dbname)` to find ALL named graphs (data + ontology graphs)
- Attempt `get_shex(dbname)` to retrieve existing shape expressions
- Attempt `get_MIE_file(dbname)` to retrieve existing MIE files
  - **If an existing MIE file is found**: Perform compliance check (see section 1.2 below)
  - **If compliant**: Update/improve the file as needed
  - **If non-compliant**: Create a new MIE file from scratch
- Attempt `get_sparql_example(dbname)` to retrieve an example SPARQL query

**Step 2: Discover Schema/Ontology Definitions** (5 minutes)

```sparql
# Prepend these prefixes to each query if the endpoint does not predefine them
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# Query 1: Get all RDF classes from ontology graphs
SELECT DISTINCT ?class
FROM <ontology_graph_uri>
WHERE {
  ?class a owl:Class .
}
LIMIT 100

# Query 2: Get all properties from ontology graphs
SELECT DISTINCT ?property ?type
FROM <ontology_graph_uri>
WHERE {
  ?property a ?type .
  FILTER(?type IN (owl:ObjectProperty, owl:DatatypeProperty, rdf:Property))
}
LIMIT 100

# Query 3: Sample property domains and ranges
SELECT ?property ?domain ?range
FROM <ontology_graph_uri>
WHERE {
  ?property rdfs:domain ?domain ;
            rdfs:range ?range .
}
LIMIT 100
```

**Why this matters**: Ontology graphs reveal what SHOULD exist, preventing you from missing entire entity types.

**Step 3: Explore URI Patterns** (5 minutes)

Test multiple URI namespace patterns to discover different entity types:

```sparql
# Pattern 1: identifiers.org/[namespace]
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER(STRSTARTS(STR(?s), "http://identifiers.org/"))
}
LIMIT 50

# Pattern 2: Database-specific namespace
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER(STRSTARTS(STR(?s), "http://[database-specific-uri]/"))
}
LIMIT 50

# Pattern 3: Sample different prefixes found in ontology
SELECT ?s ?type
WHERE {
  ?s a ?type .
}
LIMIT 100
```

**Why this matters**: Different URI patterns often indicate different data layers (e.g., identifiers vs full records vs features).

**Step 4: Systematic Class Instance Sampling** (10 minutes)

For EACH class discovered in Step 2, sample actual instances:

```sparql
# For each class found:
SELECT ?instance ?p ?o
WHERE {
  ?instance a <ClassURI> .
  ?instance ?p ?o .
}
LIMIT 50
```

**Why this matters**: Prevents assuming the database only contains what you see first. Some classes may have millions of instances, others only a few.

**Step 5: Verify and Cross-Check** (5 minutes)

If queries time out or return no results:
- ✅ Try with smaller LIMIT values
- ✅ Try without FROM clauses
- ✅ Try with different FILTER patterns
- ✅ Sample from different graph URIs
- ✅ Use `^^xsd:string` for string type restriction or `STR()` to make a plain string
- ❌ DON'T assume "no results = doesn't exist"

### 1.2 MIE File Compliance Check

When an existing MIE file is retrieved, verify it complies with these instructions:

**Structure & Format:**
- [ ] Properly formatted YAML
- [ ] Contains all required sections: schema_info, shape_expressions, sample_rdf_entries, sparql_query_examples, cross_references, architectural_notes, data_statistics, anti_patterns, common_errors
- [ ] Schema_info includes version/license/access metadata

**Sample RDF Entries:**
- [ ] Exactly 5 examples covering diverse categories
- [ ] Each has concise 1-2 sentence description

**SPARQL Query Examples:**
- [ ] Exactly 7 queries (2 basic, 3 intermediate, 2 advanced)
- [ ] Includes: keyword filtering + biological annotations
- [ ] No cross-reference queries (those belong in the cross_references section)
- [ ] All tested and working

**Shape Expressions:**
- [ ] Minimal comments (only non-obvious properties)
- [ ] Covers ALL major entity types

**Other Sections:**
- [ ] Cross-references organized by RDF pattern with all databases
- [ ] Architectural notes in YAML bullet format (not prose)
- [ ] Data statistics with coverage, cardinality, performance
- [ ] 2-3 anti-patterns with wrong/correct versions
- [ ] 2-3 common errors with solutions

**Decision:**
- If ≥90% pass: Update existing file
- If <90% pass: Create new from scratch

## 2. Schema Analysis (DO NOT SKIP)

**Critical First Step: Get Complete Class Inventory**

```sparql
# Count instances per class
SELECT ?class (COUNT(?instance) as ?count)
WHERE {
  ?instance a ?class .
}
GROUP BY ?class
ORDER BY DESC(?count)
```

**If the above times out, try sampling:**

```sparql
SELECT DISTINCT ?class
WHERE {
  ?s a ?class .
}
LIMIT 100
```

Then for each class:
- Query for sample instances
- Examine property patterns
- Identify required vs optional properties
- Check for hierarchical relationships

## 3. Deep Dive Investigation

For EACH major entity type discovered:

### 3.1 Property Analysis

```sparql
# Get all properties used by this entity type
SELECT DISTINCT ?property (COUNT(?value) as ?usage)
WHERE {
  ?entity a <EntityType> .
  ?entity ?property ?value .
}
GROUP BY ?property
ORDER BY DESC(?usage)
LIMIT 50
```

### 3.2 Relationship Mapping

```sparql
# Find relationships between entity types
SELECT ?type1 ?property ?type2 (COUNT(*) as ?count)
WHERE {
  ?entity1 a ?type1 .
  ?entity1 ?property ?entity2 .
  ?entity2 a ?type2 .
}
GROUP BY ?type1 ?property ?type2
ORDER BY DESC(?count)
LIMIT 50
```

### 3.3 Cross-Reference Discovery

```sparql
# Pattern 1: rdfs:seeAlso links
SELECT ?entity ?externalDB
WHERE {
  ?entity rdfs:seeAlso ?externalDB .
}
LIMIT 100

# Pattern 2: owl:sameAs links (or database-specific properties)
SELECT ?entity ?externalDB
WHERE {
  ?entity owl:sameAs ?externalDB .
}
LIMIT 100
```

### 3.4 Data Quality Assessment

```sparql
# Check property completeness (for coverage statistics)
# The output alias differs from ?withProperty, which is already bound inside the pattern
SELECT (COUNT(DISTINCT ?entity) as ?total)
       (COUNT(DISTINCT ?withProperty) as ?withPropertyCount)
WHERE {
  ?entity a <EntityType> .
  OPTIONAL {
    ?entity <PropertyToCheck> ?value .
    BIND(?entity as ?withProperty)
  }
}
```

**While testing queries, note patterns that fail (timeouts, errors, empty results) to document as anti-patterns.**

## 4. MIE File Construction

### Required Sections (in order):

1. **schema_info** - Database metadata + version/license/access info
2. **shape_expressions** - ShEx schemas for all entity types (minimal comments)
3. **sample_rdf_entries** - 5 diverse examples (core entity, related entity, sequence/molecular, cross-ref, geographic/temporal)
4. **sparql_query_examples** - 7 tested queries (2 basic, 3 intermediate, 2 advanced)
5. **cross_references** - Pattern-based organization with all external databases
6. **architectural_notes** - schema_design, performance, data_integration, data_quality (YAML bullets, not prose)
7. **data_statistics** - Counts, coverage, cardinality, performance_characteristics, data_quality_notes
8. **anti_patterns** - 2-3 common mistakes with wrong/correct versions
9. **common_errors** - 2-3 error scenarios with solutions

### Key Constraints:

- RDF examples: Exactly 5, each with a 1-2 sentence description
- SPARQL queries: Exactly 7, must include keyword filtering + biological annotations
- Anti-patterns: 2-3 examples showing wrong query → correct query
- Common errors: 2-3 scenarios with causes and solutions
- Keep everything concise - if it doesn't help query writing, omit it

## 5. Quality Assurance Checklist

Before finalizing, verify:

**Discovery:**
- [ ] Queried ontology graphs for all entity types
- [ ] Explored multiple URI patterns
- [ ] Documented ALL major entity types

**Structure:**
- [ ] Valid YAML with all 9 required sections
- [ ] Schema_info includes version/license/access
- [ ] ShEx has minimal comments, covers all types
- [ ] Exactly 5 diverse RDF examples
- [ ] Exactly 7 SPARQL queries (2/3/2 distribution)
- [ ] Required queries: keyword filtering + biological annotations
- [ ] Cross-references by pattern (not by individual DB)
- [ ] Architectural notes in YAML bullets

**Quality:**
- [ ] All SPARQL queries tested and working
- [ ] 2-3 anti-patterns with wrong/correct versions
- [ ] 2-3 common errors with solutions
- [ ] Statistics: counts, coverage, cardinality, performance
- [ ] Everything concise - no unnecessary content

## Common Pitfalls to Avoid

**❌ Sampling Bias**: First 50 results may not represent entire database
→ Check ontology graphs, explore multiple URI patterns

**❌ Premature Conclusions**: Query timeout ≠ "data doesn't exist"
→ Try smaller LIMITs, different patterns, alternative graphs

**❌ Incomplete Coverage**: Documenting only obvious entity types
→ Query ontology graphs first, create shapes for ALL types

**❌ Missing Error Guidance**: Not testing what fails
→ Note failing patterns during testing to document as anti-patterns

## Available Tools

- `get_sparql_endpoints()` - Get available SPARQL endpoints
- `get_graph_list(dbname)` - List named graphs in database
- `get_sparql_example(dbname)` - Get an example SPARQL query
- `run_sparql(dbname, sparql_query)` - Execute SPARQL queries
- `get_shex(dbname)` - Retrieve ShEx schema if available
- `get_MIE_file(dbname)` - Retrieve existing MIE file if available
- `save_MIE_file(dbname, mie_content)` - Save the final MIE file

## Using `bif:contains` for the Virtuoso backend

If the backend database is Virtuoso, **DO use `bif:contains` for string filtering whenever possible.**

```sparql
SELECT ?label
WHERE {
  ?s rdfs:label ?label .
  ?label bif:contains "('amyloid' AND NOT 'precursor') OR 'alzheimer'" option (score ?sc)
}
ORDER BY DESC (?sc)
LIMIT 50
```

You can sort the results by `?sc` (keyword relevance score). **DON'T use `?score` for the variable name.** That would result in an error.

## MIE File Structure Template

```yaml
schema_info:
  title: [DATABASE_NAME]
  description: |
    [3-5 sentences: what it contains, main entity types (ALL), use cases, key features]
  endpoint: https://rdfportal.org/example/sparql
  base_uri: http://example.org/
  graphs:
    - http://example.org/dataset
    - http://example.org/ontology
  # Metadata (integrated into schema_info)
  version:
    mie_version: "1.0"
    mie_created: "YYYY-MM-DD"
    data_version: "Release YYYY.MM"
    update_frequency: "Monthly"
  license:
    data_license: "License name"
    license_url: "https://..."
  access:
    rate_limiting: "100 queries/min"
    max_query_timeout: "60 seconds"

shape_expressions: |
  # Minimal comments - only for non-obvious properties
  # Cover ALL major entity types
  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  <EntityShape> {
    a [ schema:Type ] ;
    schema:property xsd:string ;
    schema:optional xsd:string ?  # Comment only if needed
  }

sample_rdf_entries:
  # Exactly 5: core entity, related entity, molecular, cross-ref, temporal/geo
  - title: [Descriptive title]
    description: [1-2 sentences]
    rdf: |
      # Real RDF from database

sparql_query_examples:
  # Exactly 7: 2 basic, 3 intermediate, 2 advanced
  # Must include: keyword filtering + biological annotations
  - title: [What it does]
    description: [Context]
    question: [Natural language]
    complexity: basic
    sparql: |
      # Tested working query

cross_references:
  - pattern: rdfs:seeAlso
    description: |
      [How external links work]
    databases:
      category:
        - Database: coverage
    sparql: |
      # Representative query

architectural_notes:
  schema_design:
    - [Bullet: entity relationships]
  performance:
    - [Bullet: optimization tips]
  data_integration:
    - [Bullet: cross-references]
  data_quality:
    - [Bullet: data quirks]

data_statistics:
  total_entity_type: count
  coverage:
    property_coverage: "~XX%"
  cardinality:
    avg_per_entity: X.X
  performance_characteristics:
    - "Tested observation"
  data_quality_notes:
    - "Data issue"

anti_patterns:
  # 2-3 examples
  - title: "Common mistake"
    problem: "Why wrong"
    wrong_sparql: |
      # Bad query
    correct_sparql: |
      # Fixed query
    explanation: "What changed"

common_errors:
  # 2-3 scenarios
  - error: "Error type"
    causes:
      - "Cause 1"
    solutions:
      - "Solution 1"
    example_fix: |
      # Before/after (optional, if helpful)
```

## Success Criteria

- Ontology graphs checked for complete class inventory
- Multiple URI patterns explored
- All SPARQL queries tested and working
- Shape expressions cover ALL major entity types with minimal comments
- Sample RDF: exactly 5, covering different types
- SPARQL queries: exactly 7 (2 basic, 3 intermediate, 2 advanced) including required ones
- Cross-references by RDF pattern, all databases listed
- Architectural notes in YAML bullets
- Statistics: counts, coverage, cardinality, performance
- 2-3 anti-patterns with wrong/correct versions
- 2-3 common errors with solutions
- Metadata in schema_info (version, license, access)
- File is valid YAML, compact yet complete

## Remember

**The goal: Compact, Complete, Clear, Correct, Actionable**

- Document ALL entity types, not just some
- 2-3 anti-patterns prevent common mistakes
- If it doesn't help query writing, omit it
- NEVER assume first results represent entire database
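
## Sketch: Automating the Structural Checks

Several items in the section 1.2 compliance check (required sections, the exact counts of 5 and 7, the 2/3/2 split, the 2-3 ranges) are mechanical and could be verified with a short script before the manual review. The following is a minimal Python sketch, not part of the toolset above. It assumes the MIE file has already been parsed into a plain dictionary by a YAML parser of your choice. The function names `compliance_checks` and `decide` are hypothetical, and computing the 90% threshold over only these seven mechanical checks is an assumption: the instructions do not say how the pass rate is counted.

```python
from collections import Counter

# Section names taken from the "Required Sections" list above.
REQUIRED_SECTIONS = [
    "schema_info", "shape_expressions", "sample_rdf_entries",
    "sparql_query_examples", "cross_references", "architectural_notes",
    "data_statistics", "anti_patterns", "common_errors",
]
# The 2 basic / 3 intermediate / 2 advanced split required for queries.
EXPECTED_COMPLEXITY = {"basic": 2, "intermediate": 3, "advanced": 2}


def compliance_checks(mie: dict) -> dict:
    """Return a check-name -> passed mapping for the mechanically testable items.

    Hypothetical helper: `mie` is assumed to be the already-parsed MIE document.
    """
    info = mie.get("schema_info") or {}
    samples = mie.get("sample_rdf_entries") or []
    queries = mie.get("sparql_query_examples") or []
    anti_patterns = mie.get("anti_patterns") or []
    common_errors = mie.get("common_errors") or []
    complexity = Counter(q.get("complexity") for q in queries)
    return {
        "all required sections present": all(s in mie for s in REQUIRED_SECTIONS),
        "schema_info has version/license/access": all(
            key in info for key in ("version", "license", "access")
        ),
        "exactly 5 sample RDF entries": len(samples) == 5,
        "exactly 7 SPARQL queries": len(queries) == 7,
        "2/3/2 complexity split": dict(complexity) == EXPECTED_COMPLEXITY,
        "2-3 anti-patterns": 2 <= len(anti_patterns) <= 3,
        "2-3 common errors": 2 <= len(common_errors) <= 3,
    }


def decide(checks: dict, threshold: float = 0.9) -> str:
    """Apply the >=90% rule: 'update' the existing file or 'recreate' it."""
    passed = sum(1 for ok in checks.values() if ok)
    return "update" if passed / len(checks) >= threshold else "recreate"
```

With only seven checks, a single failure drops the pass rate below 90%, so this sketch is stricter than the full checklist would be. Items such as "all tested and working" or "covers ALL major entity types" still need manual verification with `run_sparql`.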
