sql_Execute_Full_Pipeline
Execute a complete SQL query clustering workflow to identify and analyze high CPU usage queries for optimization opportunities in Teradata databases.
Instructions
COMPLETE SQL QUERY CLUSTERING PIPELINE FOR HIGH-USAGE QUERY OPTIMIZATION
This tool executes the entire SQL query clustering workflow to identify and analyze high CPU usage queries for optimization opportunities. It's designed for database performance analysts and DBAs who need to systematically identify query optimization candidates.
FULL PIPELINE WORKFLOW:
Query Log Extraction: Extracts SQL queries from DBC.DBQLSqlTbl with comprehensive performance metrics
Performance Metrics Calculation: Computes CPU skew, I/O skew, PJI (Product Join Indicator), and UII (Unnecessary I/O Indicator)
Query Tokenization: Tokenizes SQL text with the tokenizer of the configured embedding model (model_id, default bge-small-en-v1.5) via ivsm.tokenizer_encode
Embedding Generation: Creates semantic embeddings using ivsm.IVSM_score with ONNX models
Vector Store Creation: Converts embeddings to vector columns via ivsm.vector_to_columns
K-Means Clustering: Groups similar queries using TD_KMeans with optimal K from configuration
Silhouette Analysis: Calculates clustering quality scores using TD_Silhouette
Statistics Generation: Creates comprehensive cluster statistics with performance aggregations
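To make the Silhouette Analysis step more concrete, here is a minimal sketch of the standard silhouette coefficient in plain Python. It is not TD_Silhouette itself and says nothing about how that function is implemented. The function name, the sample points, and the labels are made up for illustration. The same code works on vectors of any length, such as query embeddings.

```python
import math

def silhouette_scores(points, labels):
    """Return the silhouette coefficient s(i) = (b - a) / max(a, b) for each point."""
    scores = []
    for i, p in enumerate(points):
        # Group distances from point i to every other point by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            # Singleton cluster or a single cluster overall: treated as 0 here.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())    # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

# Hypothetical 2-D stand-ins for query embeddings: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)
mean_score = sum(scores) / len(scores)
print(round(mean_score, 3))
```

Scores close to 1 suggest that queries sit firmly inside their cluster, while scores near 0 or below suggest overlapping clusters and a value of K worth revisiting.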
PERFORMANCE METRICS EXPLAINED:
AMPCPUTIME: Total CPU seconds across all AMPs (primary optimization target)
CPUSKW/IOSKW: CPU/I/O skew ratios (>2.0 indicates distribution problems)
PJI: Product Join Indicator, the ratio of CPU time to logical I/O (higher = more CPU-intensive)
UII: Unnecessary I/O Indicator, the ratio of logical I/O to CPU time (higher = more I/O-intensive relative to CPU)
LogicalIO: Total logical I/O operations (indicates scan intensity)
NumSteps: Query plan complexity (higher = more complex plans)
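The skew and intensity metrics above can be sketched as follows. The formulas are assumptions based on commonly used DBQL conventions: skew is the busiest AMP's value divided by the per-AMP average, PJI is CPU milliseconds per logical I/O, and UII is its inverse. The SQL this tool actually runs may differ. The function name, parameter names, and sample values are hypothetical.

```python
def dbql_metrics(amp_cpu_time, max_amp_cpu_time, total_io, max_amp_io, num_amps):
    """Derive skew and intensity indicators from one query's DBQL counters."""
    avg_cpu = amp_cpu_time / num_amps   # average CPU seconds per AMP
    avg_io = total_io / num_amps        # average logical I/Os per AMP
    return {
        # Skew: how much more work the busiest AMP did than the average AMP.
        "CPUSKW": max_amp_cpu_time / avg_cpu if avg_cpu else 0.0,
        "IOSKW": max_amp_io / avg_io if avg_io else 0.0,
        # Assumed formulas: CPU milliseconds per I/O, and I/Os per CPU millisecond.
        "PJI": (amp_cpu_time * 1000) / total_io if total_io else 0.0,
        "UII": total_io / (amp_cpu_time * 1000) if amp_cpu_time else 0.0,
    }

# Hypothetical query: 120 CPU seconds and 400,000 logical I/Os on a 40-AMP system.
m = dbql_metrics(
    amp_cpu_time=120.0,
    max_amp_cpu_time=6.0,
    total_io=400_000,
    max_amp_io=12_000,
    num_amps=40,
)
print(m)
```

In this made-up example the CPU skew is 2.0, which sits right at the threshold given above for distribution problems.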
CONFIGURATION (from sql_opt_config.yml):
Uses the top queries by CPU time; the count defaults to the configured value and can be overridden via the max_queries parameter
Creates the configured default number of clusters (override via the optimal_k parameter)
Embedding model: model.model_id in the configuration (default bge-small-en-v1.5)
Vector dimensions: embedding.vector_length in the configuration (default 384)
All database and table names are configurable
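The embedding model and vector length fall back to built-in defaults when a key is missing from the configuration. A minimal sketch of that lookup pattern, assuming the parsed contents of sql_opt_config.yml are available as a dict named sql_clustering_config with model and embedding sections (the dict literal below is a hypothetical stand-in, not the real file):

```python
# Hypothetical stand-in for the parsed contents of sql_opt_config.yml.
sql_clustering_config = {
    "model": {"model_id": "bge-small-en-v1.5"},
    "embedding": {},  # vector_length deliberately omitted to show the fallback
}

# Nested .get() calls return the default when a section or a key is absent.
model_id = sql_clustering_config.get("model", {}).get("model_id", "bge-small-en-v1.5")
vector_length = sql_clustering_config.get("embedding", {}).get("vector_length", 384)

print(model_id, vector_length)
```

Chaining .get() with an empty dict as the intermediate default avoids a KeyError when an entire section is missing, so a partial configuration file should still yield usable settings.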
OPTIMIZATION WORKFLOW: After running this tool, use:
sql_Analyze_Cluster_Stats to identify problematic clusters
sql_Retrieve_Cluster_Queries to get actual SQL from target clusters
LLM analysis to identify patterns and propose specific optimizations
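As an illustration of the first follow-up step, the sketch below ranks clusters by total CPU and flags those whose average CPU skew exceeds 2.0. The row layout, field names, and numbers are hypothetical and do not represent the actual output schema of sql_Analyze_Cluster_Stats.

```python
# Hypothetical per-cluster statistics; real column names may differ.
cluster_stats = [
    {"cluster_id": 0, "queries": 120, "avg_cpu": 4.5, "avg_cpu_skew": 1.3},
    {"cluster_id": 1, "queries": 15, "avg_cpu": 90.0, "avg_cpu_skew": 3.1},
    {"cluster_id": 2, "queries": 60, "avg_cpu": 12.0, "avg_cpu_skew": 2.4},
]

def rank_clusters(stats, skew_threshold=2.0):
    """Sort clusters by total CPU (descending) and flag skewed ones."""
    ranked = []
    for row in stats:
        total_cpu = row["queries"] * row["avg_cpu"]  # CPU seconds spent by the whole cluster
        ranked.append({
            **row,
            "total_cpu": total_cpu,
            "skewed": row["avg_cpu_skew"] > skew_threshold,
        })
    return sorted(ranked, key=lambda r: r["total_cpu"], reverse=True)

ranked = rank_clusters(cluster_stats)
for r in ranked:
    print(r["cluster_id"], r["total_cpu"], r["skewed"])
```

Ranking by total rather than average CPU keeps attention on clusters where a fix pays off across many executions, while the skew flag marks clusters where data distribution may be the underlying issue.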
USE CASES:
Identify query families consuming the most system resources
Find queries with similar patterns but different performance
Discover optimization opportunities through clustering analysis
Prioritize DBA effort on highest-impact query improvements
Understand workload composition and resource distribution
PREREQUISITES:
DBC.DBQLSqlTbl and DBC.DBQLogTbl must be accessible
Embedding models and tokenizers must be installed in feature_ext_db
Sufficient space in feature_ext_db for intermediate and final tables
Input Schema
Name | Required | Description | Default |
---|---|---|---|
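max_queries | No | Number of top queries by CPU time to process | Set in sql_opt_config.yml |
optimal_k | No | Number of clusters to create | Set in sql_opt_config.yml |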