# sql_Analyze_Cluster_Stats
Analyze pre-computed SQL query cluster statistics to identify performance optimization opportunities by ranking clusters, assessing resource impact, detecting skew problems, and prioritizing optimization efforts.
## Instructions
**ANALYZE SQL QUERY CLUSTER PERFORMANCE STATISTICS**
This tool analyzes pre-computed cluster statistics to identify optimization opportunities without re-running the clustering pipeline, which makes it well suited to iterative analysis and to deciding which query clusters deserve optimization effort first.
**ANALYSIS CAPABILITIES:**
- **Performance Ranking**: Sort clusters by any performance metric to identify top resource consumers
- **Resource Impact Assessment**: Compare clusters by CPU usage, I/O volume, and execution complexity
- **Skew Problem Detection**: Identify clusters with CPU or I/O distribution issues
- **Workload Characterization**: Understand query patterns by user, application, and workload type
- **Optimization Prioritization**: Focus on clusters with highest impact potential
**AVAILABLE SORTING METRICS** (an example call using one of these metrics is sketched after the list):
- **avg_cpu**: Average CPU seconds per cluster (primary optimization target)
- **avg_io**: Average logical I/O operations (scan intensity indicator)
- **avg_cpuskw**: Average CPU skew (distribution problem indicator)
- **avg_ioskw**: Average I/O skew (hot spot indicator)
- **avg_pji**: Average Physical-to-Logical I/O ratio (compute intensity)
- **avg_uii**: Average Unit I/O Intensity (I/O efficiency)
- **avg_numsteps**: Average query plan complexity
- **queries**: Number of queries in cluster (frequency indicator)
- **cluster_silhouette_score**: Clustering quality measure
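A minimal sketch of a call that ranks clusters by one of these metrics, assuming an open database connection `conn` and that the handler shown in the Implementation Reference below is importable from the tool module:

```python
# Rank clusters by average CPU and keep the ten heaviest consumers.
# Assumes `conn` is an open DB-API connection and that
# handle_sql_Analyze_Cluster_Stats (see Implementation Reference) is importable.
response = handle_sql_Analyze_Cluster_Stats(
    conn,
    sort_by_metric="avg_cpu",  # any metric from the list above
    limit_results=10,          # omit to return every cluster
)
```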
**PERFORMANCE CATEGORIZATION:** Automatically categorizes clusters using configurable thresholds (from sql_opt_config.yml); the same rule is sketched in Python after the list:
- **HIGH_CPU_USAGE**: Average CPU > config.performance_thresholds.cpu.high
- **HIGH_IO_USAGE**: Average I/O > config.performance_thresholds.io.high
- **HIGH_CPU_SKEW**: CPU skew > config.performance_thresholds.skew.high
- **HIGH_IO_SKEW**: I/O skew > config.performance_thresholds.skew.high
- **NORMAL**: Clusters within configured normal performance ranges
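The implementation applies these rules as a SQL CASE expression, checking skew before raw usage, and falls back to built-in defaults when sql_opt_config.yml does not define a threshold. A Python sketch of the same rule, using those fallback values:

```python
# Sketch of the categorization rule applied to one cluster row.
# The fallback thresholds (3.0 skew, 100 CPU seconds, 1,000,000 logical I/Os)
# mirror the defaults used in the implementation when sql_opt_config.yml does
# not define performance_thresholds; skew is checked before raw usage, as in
# the SQL CASE expression.
def categorize(row: dict, cpu_high=100, io_high=1_000_000, skew_high=3.0) -> str:
    if row["avg_cpuskw"] > skew_high:
        return "HIGH_CPU_SKEW"
    if row["avg_ioskw"] > skew_high:
        return "HIGH_IO_SKEW"
    if row["avg_cpu"] > cpu_high:
        return "HIGH_CPU_USAGE"
    if row["avg_io"] > io_high:
        return "HIGH_IO_USAGE"
    return "NORMAL"
```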
**TYPICAL ANALYSIS WORKFLOW** (sketched in code below):
1. Sort by 'avg_cpu' or 'avg_io' to find the highest resource consumers
2. Sort by 'avg_cpuskw' or 'avg_ioskw' to find distribution problems
3. Use limit_results to focus on the top problematic clusters
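A sketch of that workflow as three successive calls to the handler; the variable names are illustrative only:

```python
# Step 1: find the heaviest resource consumers.
top_cpu = handle_sql_Analyze_Cluster_Stats(conn, sort_by_metric="avg_cpu")
# Step 2: find clusters with distribution (skew) problems.
top_skew = handle_sql_Analyze_Cluster_Stats(conn, sort_by_metric="avg_cpuskw")
# Step 3: narrow the analysis to the top few problematic clusters.
shortlist = handle_sql_Analyze_Cluster_Stats(conn, sort_by_metric="avg_cpu", limit_results=5)
```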
**OPTIMIZATION DECISION FRAMEWORK** (an illustrative triage helper follows the list):
- **High CPU + High Query Count**: Maximum impact optimization candidates
- **High Skew + Moderate CPU**: Distribution/statistics problems
- **High I/O + Low PJI**: Potential indexing opportunities
- **High NumSteps**: Complex query rewriting candidates
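As a rough illustration, the framework can be applied to the rows the tool returns with a small triage helper; the numeric cutoffs below are placeholders for this sketch, not values defined by the tool or its configuration:

```python
# Illustrative triage based on the decision framework above.
# All cutoffs are placeholders for this sketch, not values taken from the tool.
def triage(row: dict) -> str:
    high_cpu, high_io, high_skew = 100, 1_000_000, 3.0
    low_pji, high_steps, high_queries = 1.0, 20, 1_000
    if row["avg_cpu"] > high_cpu and row["queries"] > high_queries:
        return "maximum-impact optimization candidate"
    if max(row["avg_cpuskw"], row["avg_ioskw"]) > high_skew:
        return "distribution / statistics problem"
    if row["avg_io"] > high_io and row["avg_pji"] < low_pji:
        return "potential indexing opportunity"
    if row["avg_numsteps"] > high_steps:
        return "complex query rewriting candidate"
    return "no obvious action"
```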
**OUTPUT FORMAT:** Returns detailed cluster statistics with performance rankings, categories, and metadata for LLM analysis and optimization recommendations; the shape of the metadata block is sketched below.
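The metadata keys come directly from the implementation below; the values shown here are illustrative, and the outer envelope depends on what `create_response` produces, which is not shown in this reference:

```python
# Illustrative metadata block attached to every response (values are examples).
metadata = {
    "tool_name": "sql_Analyze_Cluster_Stats",
    "analysis_parameters": {
        "sort_by_metric": "avg_cpu",
        "limit_results": 10,
        "valid_metrics": ["avg_cpu", "avg_io", "avg_cpuskw", "avg_ioskw", "avg_pji",
                          "avg_uii", "avg_numsteps", "queries", "cluster_silhouette_score"],
    },
    "summary_statistics": {"total_clusters": 42, "system_avg_cpu": 57.3},  # example values
    "clusters_analyzed": 10,
    "table_source": "feature_db.query_cluster_stats",  # depends on configuration
    "description": "Cluster statistics analysis sorted by avg_cpu - ready for LLM optimization recommendations",
}
```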
## Input Schema
| Name | Required | Description | Default |
|---|---|---|---|
| sort_by_metric | No | Performance metric used to rank clusters; must be one of the sorting metrics listed above, otherwise it falls back to the default | avg_cpu |
| limit_results | No | Maximum number of top-ranked clusters to return (applied as TOP N); all clusters are returned when omitted | None |
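Both parameters are optional. As the implementation below shows, an unrecognized sort_by_metric silently falls back to avg_cpu rather than raising an error, and omitting limit_results returns every cluster:

```python
# Defaults in action (sketch): both calls are valid.
all_by_cpu = handle_sql_Analyze_Cluster_Stats(conn)  # sorted by avg_cpu, all clusters
fallback = handle_sql_Analyze_Cluster_Stats(conn, sort_by_metric="not_a_metric")
# The second call is also ranked by avg_cpu, because unknown metrics fall back
# to the default instead of raising an error.
```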
## Implementation Reference
- The main execution function for the sql_Analyze_Cluster_Stats tool: it connects to the database, reads cluster performance statistics from the pre-built query_cluster_stats table, sorts and limits the results according to the parameters, adds performance categorization and ranking, and returns the results as JSON with comprehensive metadata.

```python
def handle_sql_Analyze_Cluster_Stats(
    conn,
    sort_by_metric: str = "avg_cpu",
    limit_results: int = None,
    *args,
    **kwargs,
):
    """
    Analyze pre-computed SQL query cluster statistics and return ranked,
    categorized results as JSON. See the Instructions section above for the
    full description of capabilities, sorting metrics, categorization
    thresholds, workflow, and decision framework.
    """
    config = SQL_CLUSTERING_CONFIG
    logger.debug(f"handle_sql_Analyze_Cluster_Stats: sort_by={sort_by_metric}, limit={limit_results}")

    feature_db = config['databases']['feature_db']
    stats_table = config['tables']['query_cluster_stats']

    # Validate the sort metric against the whitelist of known stat columns
    valid_metrics = [
        'avg_cpu', 'avg_io', 'avg_cpuskw', 'avg_ioskw', 'avg_pji',
        'avg_uii', 'avg_numsteps', 'queries', 'cluster_silhouette_score'
    ]
    if sort_by_metric not in valid_metrics:
        sort_by_metric = 'avg_cpu'  # Default fallback

    with conn.cursor() as cur:
        # Build the query with an optional TOP N limit
        limit_clause = f"TOP {limit_results}" if limit_results else ""

        # Get thresholds from config, with built-in fallbacks
        thresholds = config.get('performance_thresholds', {})
        cpu_high = thresholds.get('cpu', {}).get('high', 100)
        skew_high = thresholds.get('skew', {}).get('high', 3.0)
        io_high = thresholds.get('io', {}).get('high', 1000000)

        stats_query = f"""
            SELECT {limit_clause}
                td_clusterid_kmeans,
                avg_numsteps, var_numsteps,
                avg_cpu, var_cpu,
                avg_io, var_io,
                avg_cpuskw, var_cpuskw,
                avg_ioskw, var_ioskw,
                avg_pji, var_pji,
                avg_uii, var_uii,
                top_username, top_wdname, top_appid,
                overall_silhouette_score, cluster_silhouette_score,
                queries,
                -- Additional analysis columns with configurable thresholds
                CASE
                    WHEN avg_cpuskw > {skew_high} THEN 'HIGH_CPU_SKEW'
                    WHEN avg_ioskw > {skew_high} THEN 'HIGH_IO_SKEW'
                    WHEN avg_cpu > {cpu_high} THEN 'HIGH_CPU_USAGE'
                    WHEN avg_io > {io_high} THEN 'HIGH_IO_USAGE'
                    ELSE 'NORMAL'
                END AS performance_category,
                RANK() OVER (ORDER BY {sort_by_metric} DESC) AS performance_rank
            FROM {feature_db}.{stats_table}
            ORDER BY {sort_by_metric} DESC
        """
        cur.execute(stats_query)
        data = rows_to_json(cur.description, cur.fetchall())

        # Get system-wide summary statistics
        cur.execute(f"""
            SELECT
                COUNT(*) AS total_clusters,
                AVG(avg_cpu) AS system_avg_cpu,
                AVG(avg_io) AS system_avg_io,
                AVG(queries) AS avg_queries_per_cluster,
                MAX(avg_cpu) AS max_cluster_cpu,
                MIN(cluster_silhouette_score) AS min_silhouette_score
            FROM {feature_db}.{stats_table}
        """)
        summary_stats = rows_to_json(cur.description, cur.fetchall())[0]

    logger.debug(f"Retrieved {len(data)} cluster statistics")

    # Return results with metadata
    metadata = {
        "tool_name": "sql_Analyze_Cluster_Stats",
        "analysis_parameters": {
            "sort_by_metric": sort_by_metric,
            "limit_results": limit_results,
            "valid_metrics": valid_metrics,
        },
        "summary_statistics": summary_stats,
        "clusters_analyzed": len(data),
        "table_source": f"{feature_db}.{stats_table}",
        "description": f"Cluster statistics analysis sorted by {sort_by_metric} - ready for LLM optimization recommendations",
    }
    return create_response(data, metadata)
```