import json
from datetime import date
generate_sql_prompt = f"""
You are an expert SQL query generator for **MySQL and ClickHouse (analytical) databases**.
Your goal is to generate **portable, read-only SQL** that executes correctly on **both engines**.
Assume ClickHouse queries may run against **Views built on top of Materialized Views**.
========================
CORE RULES (MANDATORY)
========================
1. ONLY generate read-only queries (SELECT / WITH / SHOW / DESCRIBE / EXPLAIN)
2. NEVER generate INSERT, UPDATE, DELETE, CREATE, DROP, ALTER, TRUNCATE
3. Use ANSI-compatible SQL supported by MySQL 8.0+ and ClickHouse
4. Return ONLY raw SQL — no explanations, no markdown, no comments
5. Embed literal values directly — NO parameter placeholders (?, :param, etc.)
6. Use EXACT column names from the provided schema — NEVER guess
7. Use explicit JOIN syntax — never implicit joins
8. Use backticks for identifiers when needed
9. Always include WHERE clauses when filtering
10. Use LIMIT when appropriate (see LIMIT rules below)
========================
AGGREGATION & ANALYTICS
========================
11. Prefer COUNT(*) for row counts
12. Use SUM(), AVG(), MIN(), MAX() ONLY on explicitly numeric columns
13. NEVER use SUM(*)
14. Aggregations are expected to run efficiently on large datasets (10M+ rows)
========================
MOST / LEAST / TOP / BOTTOM RULES (CRITICAL)
========================
If the question asks for:
- most, least
- maximum, minimum
- highest, lowest
- top, bottom
You MUST:
- Handle ties correctly
- Return ALL entities that match the max/min value
- Use a subquery or CTE to compute the max/min first
- NEVER use LIMIT 1 for these cases
Correct pattern:
- Compute the aggregate value first
- Filter results using that value
- Return all matching rows
========================
WINDOW FUNCTIONS (PORTABLE SUBSET ONLY)
========================
You MAY use ONLY the following window functions supported by both MySQL 8+ and ClickHouse:
- ROW_NUMBER()
- RANK()
- DENSE_RANK()
- SUM() OVER (...)
- COUNT() OVER (...)
- AVG() OVER (...)
- MIN() OVER (...)
- MAX() OVER (...)
Use window functions for:
- Top N per group
- Bottom N per group
- Rankings within groups
- Running totals
- Cumulative aggregates
- Comparing rows within a group
Rules:
- Use PARTITION BY for grouping
- Use ORDER BY inside OVER()
- Use ROW_NUMBER() for sequential ranking
- Use RANK() or DENSE_RANK() when ties must be preserved
- Avoid engine-specific window extensions or framing clauses
========================
CTE (WITH) RULES (STRICT)
========================
15. Any multi-step query MUST use CTEs
16. CTEs MUST start with WITH — NEVER start with SELECT
17. Each CTE must be named
18. Syntax must be exactly:
WITH cte_name AS (
SELECT ...
),
another_cte AS (
SELECT ...
)
SELECT ...
FROM cte_name
JOIN another_cte ...
19. Parentheses must open immediately after AS
20. The main SELECT must come AFTER all CTEs
21. Multiple CTEs must be comma-separated
========================
DATE & TIME RULES (HYBRID SAFE)
========================
22. Prefer date RANGE filters over date functions:
WHERE created_at >= '2024-01-01'
AND created_at < '2024-02-01'
23. Avoid wrapping timestamp/date columns in functions inside WHERE clauses
24. Avoid engine-specific date/time functions, including but not limited to:
DATE(), DATE_FORMAT(), STR_TO_DATE(),
toDate(), toStartOfDay(), toStartOfMonth(),
NOW(), CURDATE(), addDays(), DATE_ADD()
25. When date extraction is unavoidable, use ONLY:
CAST(timestamp_column AS DATE)
26. When grouping by day/month/year:
- Prefer existing date columns if available
- Otherwise use CAST(timestamp_column AS DATE)
========================
LIMIT & ORDER RULES (HYBRID SAFETY)
========================
27. When using LIMIT for top/bottom queries, ALWAYS include ORDER BY
28. LIMIT without ORDER BY is allowed ONLY for exploratory sampling
29. NEVER use LIMIT in subqueries with IN / ALL / ANY
========================
EMPLOYEE-SPECIFIC RULES
========================
30. For employee-related questions:
- ALWAYS include the employee name in the SELECT
- Always include the employee identifier
31. Return DISTINCT employees only
32. Eliminate duplicate employee rows
========================
ENGINE TRANSPARENCY
========================
33. Do NOT reference Materialized Views explicitly
34. Do NOT reference storage engines
35. Do NOT use engine-specific hints
36. Assume Views may transparently map to pre-aggregated data
========================
CONTEXT
========================
- SQL Version: MySQL 8.0+ / ClickHouse compatible
- Today: {date.today().isoformat()}
- Database may contain very large tables
- Queries must be efficient, deterministic, and portable
========================
SCHEMA
========================
"""
def get_table_identification_prompt(user_query: str, all_tables: list[str]) -> str:
"""Generate prompt for identifying relevant tables for a query."""
return f"""Given this database question: "{user_query}"
Available tables: {', '.join(all_tables)}
Which tables are likely needed to answer this question?
Return ONLY the table names, comma-separated. If unsure, include all potentially relevant tables.
Relevant tables:"""
def get_validation_prompt(query: str) -> str:
"""Generate prompt for validating if a query is a valid database question."""
return f"""Is this a valid database query question?
Question: "{query}"
A valid database question should:
- Ask about data in the database (tables, records, relationships)
- Request information that can be retrieved via SQL
- Be clear and meaningful
Examples of VALID questions:
- "Show me all authors"
- "How many books are there?"
- "Find books by J.K. Rowling"
Examples of INVALID questions:
- "sns" (gibberish)
- "hello" (greeting, not a query)
- "test" (too vague)
- Random characters or single words
Respond with ONLY:
VALID: <yes or no>
REASON: <brief explanation>"""
def get_combined_confidence_prompt(
query: str, sql: str, query_error: str | None = None, query_results: list | None = None
) -> str:
"""Generate prompt for combined confidence scoring and analysis."""
if query_error:
return f"""Analyze this SQL query and provide both a confidence score (0-1) and brief analysis.
Question: {query}
SQL: {sql}
Query Error: {query_error}
Tasks:
1. Rate confidence (0-1) - consider error type and severity
2. Provide brief analysis (2-3 sentences) explaining the issue
Format your response as:
CONFIDENCE: <number between 0.0 and 1.0>
ANALYSIS: <2-3 sentence analysis>"""
elif query_results is not None:
results_str = json.dumps(query_results[:5], indent=2, default=str)
is_empty = len(query_results) == 0
empty_note = (
" NOTE: Query returned NO results (empty array). This may indicate a logical error in the query."
if is_empty
else ""
)
return f"""Analyze this SQL query and provide both a confidence score (0-1) and brief analysis.
Question: {query}
SQL: {sql}
Query Results (sample, up to 5 rows):
{results_str}{empty_note}
CRITICAL: When scoring confidence, you MUST consider:
1. Does the SQL query actually answer the question asked?
- If the question asks for specific information, does the query retrieve it?
- If the question has multiple parts (e.g., "find X, then find Y"), does the query handle ALL parts?
- Are the results relevant to what was asked?
2. If the question asks for "most/least/maximum/minimum", does the query handle ties correctly?
- Questions like "who has the most books" should return ALL entities with the maximum count, not just one
- Using LIMIT 1 when there are ties is INCORRECT - confidence should be low
- Should use HAVING COUNT = (SELECT MAX(...)) pattern to handle ties
3. If empty results, is this expected (no matching data) or a logical error?
4. Are GROUP BY, HAVING, JOINs, and subqueries used correctly?
5. For multi-part questions (with "and then"), does the query handle ALL parts?
IMPORTANT: If the query does NOT answer the question, confidence MUST be low (< 0.6).
If the query asks for "most/least" and uses LIMIT 1 (which doesn't handle ties), confidence MUST be low (< 0.6).
If the query answers the question correctly, confidence should be high (≥ 0.8).
Tasks:
1. Rate confidence (0-1) - MUST consider if results answer the question
2. Provide brief analysis (2-3 sentences) on query quality and whether it answers the question
Format your response as:
CONFIDENCE: <number between 0.0 and 1.0>
ANALYSIS: <2-3 sentence analysis>"""
else:
return f"""Analyze this SQL query and provide both a confidence score (0-1) and brief analysis.
Question: {query}
SQL: {sql}
CRITICAL: When scoring confidence, consider:
1. Does the SQL query structure suggest it will answer the question?
2. Are the correct tables and columns selected?
3. Are JOINs, WHERE clauses, and filters appropriate for the question?
IMPORTANT: If the query structure does NOT match the question, confidence MUST be low (< 0.6).
Tasks:
1. Rate confidence (0-1) - consider if query structure matches the question
2. Provide brief analysis (2-3 sentences) on potential issues
Format your response as:
CONFIDENCE: <number between 0.0 and 1.0>
ANALYSIS: <2-3 sentence analysis>"""
def get_refinement_prompt(
query: str,
sql: str,
analysis: str,
query_error: str | None = None,
query_results: list | None = None,
is_simple_column_error: bool = False,
wrong_column: str | None = None,
has_multi_part: bool = False,
) -> str:
"""Generate prompt for refining SQL queries based on errors or analysis."""
if is_simple_column_error:
refinement_prompt = f"""The SQL query below has a column name error. Fix ONLY the column name, keep everything else the same.
Original Question: {query}
Original SQL: {sql}
Query Error: {query_error}
CRITICAL INSTRUCTIONS:
1. Keep the EXACT same query structure - only change the column name
2. The column '{wrong_column}' does not exist - find the correct column name from the schema below
3. Replace '{wrong_column}' with the correct column name
4. Do NOT change the query logic, JOINs, subqueries, WHERE clauses, or any other part
5. The query structure is correct - only the column name needs to be fixed
"""
refinement_prompt += """Find the correct column name in the schema below and replace the wrong column name in the SQL.
Keep everything else exactly the same - only change the column name.
Return ONLY the corrected SQL query (no explanations, no markdown formatting):"""
return refinement_prompt
else:
refinement_prompt = f"""The SQL query below has issues identified in the analysis. Generate a corrected version.
Original Question: {query}
Original SQL: {sql}
Analysis of Issues: {analysis}
"""
if query_error:
refinement_prompt += f"Query Error: {query_error}\n\n"
elif query_results is not None:
results_str = json.dumps(query_results[:5], indent=2, default=str)
refinement_prompt += f"Query Results (sample):\n{results_str}\n\n"
if has_multi_part:
refinement_prompt += """Based on the analysis above, generate a corrected SQL query that:
1. Fixes the issues identified in the analysis
2. Uses EXACT column names from the schema
3. PRESERVES the original query structure - only fix the specific issues
4. Correctly answers the ORIGINAL QUESTION IN FULL - this question has multiple parts (e.g., "and then")
5. For multi-part questions like "find X and then find Y":
- Use a subquery or CTE to first find X
- Then use that result to find Y in the main query
- Pattern: "find entities with most related items, then find their items after a date":
* First: Identify ALL entities with max count (subquery) - use MAX() to find max count, then HAVING COUNT = (SELECT MAX(...)) to get ALL entities with that count (not LIMIT 1)
* Then: Find related items by those entities matching the second condition
* Use WHERE foreign_key IN (subquery) or JOIN with subquery
6. Has proper JOIN conditions if needed
7. Includes all necessary WHERE clauses and filters
8. Ensure GROUP BY is correct - don't group by columns that should be in SELECT for aggregation
9. For queries involving ranking/comparison/aggregation within groups, USE WINDOW FUNCTIONS when appropriate (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER, COUNT() OVER, etc.) - DO NOT use LIMIT in subqueries with IN/ALL/ANY (MySQL doesn't support this)
10. Handle ties correctly - when finding "entities with most X", find ALL entities who have the maximum count, not just one (use HAVING COUNT = (SELECT MAX(...)), not LIMIT 1)
CRITICAL: The question has multiple parts. Your SQL MUST:
- First solve the first part (e.g., "entities with most related items") - find ALL entities with max count, not just one
- Then use that result to solve the second part (e.g., "related items by those entities after a date")
- Use subqueries, CTEs, or IN clauses to connect the parts
- Handle ties: if multiple entities have the same max count, include ALL of them
- PRESERVE the query structure - only fix the specific issues mentioned
Return ONLY the corrected SQL query (no explanations, no markdown formatting):"""
else:
refinement_prompt += """Based on the analysis above, generate a corrected SQL query that:
1. Fixes the issues identified in the analysis
2. Uses EXACT column names from the schema
3. PRESERVES the original query structure - only fix the specific issues
4. Correctly answers the ORIGINAL QUESTION IN FULL
5. For "top N per group" queries (e.g., "top 2 per department"), USE WINDOW FUNCTIONS (ROW_NUMBER() OVER ... WHERE rn <= N) - DO NOT use LIMIT in subqueries with IN/ALL/ANY (MySQL doesn't support this)
6. If the question asks for entities with the "most", "least", "maximum", "minimum", "highest", "lowest" (single value per group):
- Handle ties correctly by returning ALL entities that match the maximum/minimum value
- Use a subquery to find the MAX or MIN value first
- Then use HAVING COUNT = (SELECT MAX(...)) or HAVING COUNT = (SELECT MIN(...)) to find ALL entities with that value
- DO NOT use LIMIT 1 - this would only return one entity even if there are ties
7. Has proper JOIN conditions if needed
8. Includes all necessary WHERE clauses and filters
9. Ensure GROUP BY is correct - don't group by columns that should be in SELECT for aggregation
Return ONLY the corrected SQL query (no explanations, no markdown formatting):"""
return refinement_prompt
def get_analysis_prompt(
query: str,
sql: str,
confidence: float,
query_error: str | None = None,
query_results: list | None = None,
schema_info: dict | None = None,
) -> str:
"""Generate prompt for analyzing SQL query results or errors."""
if query_error:
return f"""Analyze this SQL query and explain why it failed or succeeded.
Question: {query}
SQL: {sql}
Query Error: {query_error}
Confidence Score: {confidence:.2f}
Analyze the error and identify:
1. What specific issue caused the error?
2. Is it a column name problem, table name problem, or syntax issue?
3. What needs to be fixed in the query?
4. How does this error relate to the confidence score?
Provide a brief analysis (2-3 sentences) explaining the issue and how to fix it:"""
elif query_results is not None:
results_str = json.dumps(query_results[:5], indent=2, default=str)
return f"""Analyze this SQL query and explain why it might be correct or incorrect.
Question: {query}
SQL: {sql}
Query Results (sample, up to 5 rows):
{results_str}
Confidence Score: {confidence:.2f}
Analyze if the results make sense:
1. Do the results answer the question correctly?
2. Are the column names and data types correct?
3. Is the query logic correct (JOINs, filters, aggregations)?
4. Are there any issues with the data returned?
5. How does the confidence score relate to the actual results?
Provide a brief analysis (2-3 sentences) explaining the query quality and any potential issues:"""
else:
relevant_tables = schema_info.get('relevant_tables', []) if schema_info else []
return f"""Analyze this SQL query and explain why it might be correct or incorrect.
Question: {query}
SQL: {sql}
Confidence Score: {confidence:.2f}
Available schema information:
- Relevant tables: {', '.join(relevant_tables)}
- Column names available in schema
Identify specific issues:
1. Are there any column names that might be incorrect?
2. Are JOIN conditions correct?
3. Does the query structure match the question?
4. Are there any missing WHERE clauses or filters?
5. Are there any syntax issues?
6. How does the confidence score relate to the query structure?
Provide a brief analysis (2-3 sentences) explaining the potential issues:"""
def get_sql_generation_prompt(user_query: str, has_multi_part: bool = False) -> str:
"""Generate prompt for converting natural language to SQL."""
if has_multi_part:
return f"""Convert to SQL query. IMPORTANT: This question has multiple parts separated by "and then" or similar - you must answer ALL parts.
For multi-part questions like "find X and then find Y":
- Use a subquery or CTE to first find X
- CTE syntax: WITH cte_name AS (SELECT ...) SELECT ... FROM cte_name
- Multiple CTEs: WITH cte1 AS (...), cte2 AS (...) SELECT ...
- CRITICAL: Each CTE must use proper syntax: `cte_name AS (SELECT ...)` - no closing parenthesis before AS
- Then use that result to find Y
- Example pattern: "find entities with the most related items, then find their items after a date" should:
1. First identify ALL entities with max count (subquery/CTE) - use MAX() to find the max count, then find ALL entities with that count (not LIMIT 1)
2. Then find related items by those entities matching the second condition
3. Join or filter using the first result
IMPORTANT: When finding "entities with the most X" or similar maximum count queries:
- Find the MAX count first using a subquery
- Then find ALL entities who have that count (use HAVING COUNT = (SELECT MAX(...)), not LIMIT 1)
- This handles ties correctly (multiple entities with same max count)
IMPORTANT: For queries that need ranking, comparison, or aggregation within groups:
- USE WINDOW FUNCTIONS when appropriate (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER, etc.)
- Window functions are ideal for: "top N per group", "bottom N per group", rankings within groups, comparing rows to previous/next, running totals, percentiles, moving averages
- DO NOT use LIMIT in subqueries with IN/ALL/ANY - MySQL doesn't support this syntax
- Pattern: Use window function OVER (PARTITION BY group_column ORDER BY sort_column) then filter or use the result
Steps:
1. Break down the question into parts
2. Create subquery/CTE for the first part (handle ties - find ALL matches, not just one)
3. Use that result in the main query for the second part
4. Use EXACT column names from structures
5. Write complete SQL
Question: {user_query}
SQL (must answer ALL parts using subqueries/CTEs, and handle ties correctly):"""
else:
return f"""Convert to SQL query.
IMPORTANT: Analyze the question carefully and choose the right approach.
For queries involving ranking, comparison, or aggregation within groups:
- USE WINDOW FUNCTIONS when appropriate (ROW_NUMBER(), RANK(), DENSE_RANK(), LAG(), LEAD(), SUM() OVER, COUNT() OVER, etc.)
- Window functions are ideal for: "top N per group", "bottom N per group", rankings within groups, comparing current row to previous/next rows, running totals within groups, percentiles, moving averages
- DO NOT use LIMIT in subqueries with IN/ALL/ANY - MySQL doesn't support this syntax
- Use ROW_NUMBER() for distinct sequential rankings, RANK() for rankings with ties (gaps), DENSE_RANK() for rankings with ties (no gaps)
- Pattern: Window function OVER (PARTITION BY group_column ORDER BY sort_column) then filter or use result
For "most/least/maximum/minimum" queries (finding entities with single extreme value):
- You MUST handle ties correctly by returning ALL entities that match the maximum/minimum value
- Use a subquery to find the MAX or MIN value first
- Then use HAVING COUNT = (SELECT MAX(...)) or HAVING COUNT = (SELECT MIN(...)) to find ALL entities with that value
- DO NOT use LIMIT 1 - this would only return one entity even if there are ties
- Return ALL entities that match the maximum/minimum value
Steps:
1. Identify needed tables
2. Select columns (use EXACT names from structures)
3. Add JOINs (use relationships if available)
4. If question involves ranking/comparison/aggregation within groups, consider WINDOW FUNCTIONS
5. If question asks for most/least/maximum/minimum (single extreme value), use subquery + HAVING pattern to handle ties
6. Add filters/aggregations as needed
7. Write SQL
Question: {user_query}
SQL:"""