# Query Performance Analysis: LLM vs Ground-Truth Query
## Query Comparison
### LLM-Generated Query
```sparql
PREFIX sc: <http://schema.org/>
PREFIX secchain: <https://w3id.org/secure-chain/>
SELECT DISTINCT ?version
WHERE {
?software sc:name ?softwareName .
FILTER(?softwareName = "openssl" || ?softwareName = "OpenSSL")
?software secchain:hasSoftwareVersion ?sv .
?sv secchain:versionName ?version .
FILTER NOT EXISTS {
?sv secchain:vulnerableTo ?vuln .
?vuln sc:identifier "CVE-2023-3817"
}
}
ORDER BY ?version
```
### Ground-Truth Query
```sparql
PREFIX schema: <http://schema.org/>
PREFIX sc: <https://w3id.org/secure-chain/>
SELECT DISTINCT ?version
WHERE {
?lib schema:name "openssl" .
?lib sc:hasSoftwareVersion ?version .
?version a sc:SoftwareVersion .
FILTER NOT EXISTS {
?version sc:vulnerableTo ?vulnerability .
?vulnerability schema:identifier "CVE-2023-3817" .
}
}
```
---
## Performance Analysis
### Overall Efficiency Verdict: **Ground-Truth Query is MORE EFFICIENT** ✅
**Estimated Performance Difference:** 2-5x faster on large graphs
---
## 1. Query Plan Comparison
### LLM Query Execution Plan (Estimated)
```
1. Scan all triples: ?software sc:name ?softwareName
→ Materializes ALL software names in graph
→ Cost: O(n) where n = total software entities
2. Apply filter: FILTER(?softwareName = "openssl" || ?softwareName = "OpenSSL")
→ Post-scan filtering on materialized results
→ Cost: O(n) string comparisons
3. Join: ?software secchain:hasSoftwareVersion ?sv
→ Cost: O(m) where m = versions per matched software
4. Project: ?sv secchain:versionName ?version
→ Extract literal values
→ Cost: O(m)
5. NOT EXISTS subquery for each ?sv
→ Cost: O(m * k) where k = avg vulnerabilities checked per version
6. DISTINCT on literal strings
→ Cost: O(m log m) for sorting/deduplication
Total Estimated Cost: O(n + m log m + mk)
```
### Ground-Truth Query Execution Plan (Estimated)
```
1. Direct lookup: ?lib schema:name "openssl"
→ Uses index on (property, value) pair
→ Cost: O(log n) with B-tree index, or O(1) with hash index
2. Join: ?lib sc:hasSoftwareVersion ?version
→ Cost: O(m) where m = versions for matched software
→ Much smaller m since only one software matched
3. Type filter: ?version a sc:SoftwareVersion
→ May use type index if available
→ Cost: O(m) with index, filters early
4. NOT EXISTS subquery for each ?version
→ Cost: O(m * k) where k = avg vulnerabilities per version
5. DISTINCT on URIs (optional if URI uniqueness guaranteed)
→ Cost: O(m log m) but m is smaller
Total Estimated Cost: O(log n + m log m + mk)
```
### Key Difference: **O(n) vs O(log n)** in the first step
---
## 2. Detailed Performance Issues in LLM Query
### Issue 1: Inefficient Pattern Matching ⚠️ **CRITICAL**
**Problem:**
```sparql
?software sc:name ?softwareName .
FILTER(?softwareName = "openssl" || ?softwareName = "OpenSSL")
```
**Why Inefficient:**
- **Scans ALL triples** with predicate `sc:name`
- Materializes all software names into variable `?softwareName`
- Applies filter AFTER materialization (post-filtering)
- Cannot use predicate-object index effectively
**Better Approach (Ground Truth):**
```sparql
?lib schema:name "openssl" .
```
**Why Better:**
- Uses **exact triple pattern** with bound object
- SPARQL engines can use (subject, predicate, object) index directly
- No post-filtering needed
- **2-100x faster** depending on graph size
**Impact on Large Graphs:**
- LLM: Must scan 10K-1M+ software names
- Ground Truth: Direct index lookup (microseconds)
---
### Issue 2: OR Filter in FILTER Clause ⚠️ **HIGH**
**Problem:**
```sparql
FILTER(?softwareName = "openssl" || ?softwareName = "OpenSSL")
```
**Why Inefficient:**
- Disjunction (OR) prevents query optimizer from pushing down constraints
- Cannot use index for either value
- Requires two string comparisons per software entity
- Forces sequential scan instead of index seek
**Better Approach - Option A (UNION):**
```sparql
{
?software schema:name "openssl" .
} UNION {
?software schema:name "OpenSSL" .
}
```
**Better Approach - Option B (VALUES):**
```sparql
VALUES ?softwareName { "openssl" "OpenSSL" }
?software schema:name ?softwareName .
```
**Performance Gain:** 5-10x faster (can use index for each value)
---
### Issue 3: Unnecessary Variable Projection ⚠️ **MEDIUM**
**Problem:**
```sparql
?software secchain:hasSoftwareVersion ?sv .
?sv secchain:versionName ?version .
```
**Why Inefficient:**
- Two-step traversal: Get version resource, then extract name
- Extra join operation
- Requires dereferencing URI to get literal value
- Result set contains literals instead of URIs (less efficient for downstream queries)
**Ground Truth Approach:**
```sparql
?lib sc:hasSoftwareVersion ?version .
# ?version is the URI itself
```
**Why Better:**
- One-step traversal
- URIs are more compact and faster to compare than strings
- Can use URI equality (pointer comparison) instead of string comparison
- Better for join operations if query is part of larger workflow
**Performance Gain:** 1.5-2x faster
---
### Issue 4: Inefficient NOT EXISTS Subquery ⚠️ **MEDIUM**
**Problem:**
```sparql
FILTER NOT EXISTS {
?sv secchain:vulnerableTo ?vuln .
?vuln sc:identifier "CVE-2023-3817"
}
```
**Issues:**
- Uses intermediate variable `?sv` (not the result variable)
- Namespace inconsistency: `sc:identifier` vs dataset's actual namespace
**Ground Truth Approach:**
```sparql
FILTER NOT EXISTS {
?version sc:vulnerableTo ?vulnerability .
?vulnerability schema:identifier "CVE-2023-3817" .
}
```
**Why Better:**
- Directly checks the result variable
- Consistent namespace usage
- Potentially better join order optimization
**Performance Impact:** Minor (5-15%), but correctness is critical
---
### Issue 5: DISTINCT on Literal Strings ⚠️ **LOW**
**Problem:**
```sparql
SELECT DISTINCT ?version # where ?version is a literal string
```
**Why Less Efficient:**
- String comparison for deduplication
- Requires full string matching (not pointer comparison)
- Larger memory footprint for string literals
**Ground Truth Approach:**
```sparql
SELECT DISTINCT ?version # where ?version is a URI
```
**Why Better:**
- URI comparison is pointer comparison (faster)
- Smaller memory footprint
- Hardware-optimized equality checks
**Performance Gain:** 10-20% on deduplication step
---
## 3. Index Usage Analysis
### Indexes Likely Available in RDF Store
| Index Type | Purpose | Used By LLM? | Used By GT? |
|------------|---------|--------------|-------------|
| SPO (Subject-Predicate-Object) | Exact triple lookup | ❌ No | ✅ Yes |
| POS (Predicate-Object-Subject) | Find subjects by property value | ❌ No | ✅ Yes |
| OSP (Object-Subject-Predicate) | Find predicates by object | ❌ No | Partial |
| Type Index (rdf:type) | Filter by class | ❌ No | ✅ Yes |
| Literal Value Index | Fast string lookup | ❌ No | ✅ Yes |
### LLM Query Index Usage
```sparql
?software sc:name ?softwareName . # Uses SP* index (partial scan)
FILTER(...) # No index (post-filtering)
?software secchain:hasSoftwareVersion ?sv . # Uses SP* index
?sv secchain:versionName ?version . # Uses SP* index
```
**Estimated Index Hits:** 3 partial scans, 0 exact lookups
### Ground-Truth Query Index Usage
```sparql
?lib schema:name "openssl" . # Uses POS index (exact lookup) ✅
?lib sc:hasSoftwareVersion ?version . # Uses SP* index ✅
?version a sc:SoftwareVersion . # Uses Type index ✅
```
**Estimated Index Hits:** 3 exact/efficient lookups
---
## 4. Concrete Optimizations
### Optimization 1: Replace Variable + FILTER with Direct Binding ⭐⭐⭐
**CRITICAL - Highest Impact**
**Original (Inefficient):**
```sparql
?software sc:name ?softwareName .
FILTER(?softwareName = "openssl" || ?softwareName = "OpenSSL")
```
**Optimized Version:**
```sparql
?software schema:name "openssl" .
```
**If Multiple Names Required:**
```sparql
VALUES ?softwareName { "openssl" "OpenSSL" }
?software schema:name ?softwareName .
```
**Performance Gain:** 10-100x faster
- Enables index usage
- Eliminates full scan
- Reduces intermediate results
**Complexity:** Trivial (simple rewrite)
---
### Optimization 2: Remove Unnecessary Property Traversal ⭐⭐
**HIGH Impact**
**Original (Inefficient):**
```sparql
?software secchain:hasSoftwareVersion ?sv .
?sv secchain:versionName ?version . # Extra step to get literal
```
**Optimized Version:**
```sparql
?software sc:hasSoftwareVersion ?version .
# Return URI directly, or use version property only if needed
```
**If Literal Value Really Needed:**
```sparql
?software sc:hasSoftwareVersion ?versionURI .
OPTIONAL { ?versionURI secchain:versionName ?version }
# But consider: is the literal really needed? URIs are more efficient
```
**Performance Gain:** 1.5-3x faster
- One less join
- Smaller intermediate result set
- Faster comparisons (URI vs string)
**Trade-off:** Returns URIs instead of human-readable strings
- **Recommendation:** Return URIs for performance, format in application layer
---
### Optimization 3: Add Explicit Type Constraint Early ⭐⭐
**MEDIUM-HIGH Impact**
**Original (Missing):**
```sparql
?software secchain:hasSoftwareVersion ?sv .
# No type constraint - may match non-version entities
```
**Optimized Version:**
```sparql
?software sc:hasSoftwareVersion ?sv .
?sv a sc:SoftwareVersion . # Add EARLY in query
```
**Why This Helps:**
- **Correctness:** Filters out malformed data
- **Performance:** Early filtering reduces subsequent join size
- **Index Usage:** Can use type index if available
- **Query Planner:** Helps optimizer choose better join order
**Performance Gain:** 1.2-2x faster + correctness improvement
**Placement Matters:**
```sparql
# GOOD - Type check early
?lib sc:hasSoftwareVersion ?version .
?version a sc:SoftwareVersion . # ← Early filtering
?version sc:vulnerableTo ?vuln .
# BAD - Type check late
?lib sc:hasSoftwareVersion ?version .
?version sc:vulnerableTo ?vuln .
?version a sc:SoftwareVersion . # ← Late filtering (after expensive join)
```
---
### Optimization 4: Use Consistent Namespaces ⭐⭐
**MEDIUM Impact (Critical for Correctness)**
**Original (Inconsistent):**
```sparql
?software sc:name ?softwareName . # Wrong namespace
?vuln sc:identifier "CVE-2023-3817" # Wrong namespace
```
**Optimized Version:**
```sparql
?software schema:name ?softwareName . # Correct namespace
?vuln schema:identifier "CVE-2023-3817" # Correct namespace
```
**Performance Gain:**
- Direct: 0-20% (depends on graph structure)
- Indirect: **CRITICAL for correctness** - may prevent query from working at all
**Why It Matters:**
- Different namespace = different property in RDF
- May cause query to return empty results
- Forces optimizer to consider wrong indexes
---
### Optimization 5: Optimize NOT EXISTS Subquery ⭐
**LOW-MEDIUM Impact**
**Original:**
```sparql
FILTER NOT EXISTS {
?sv secchain:vulnerableTo ?vuln .
?vuln sc:identifier "CVE-2023-3817"
}
```
**Optimized Version:**
```sparql
# Option A: Bind identifier first (if selectivity is high)
FILTER NOT EXISTS {
?vuln schema:identifier "CVE-2023-3817" .
?sv sc:vulnerableTo ?vuln .
}
# Option B: Add hint for optimizer (Virtuoso/Blazegraph specific)
FILTER NOT EXISTS {
?sv sc:vulnerableTo ?vuln .
FILTER(?vuln = <uri_of_CVE-2023-3817>) # If CVE URI is known
}
```
**Why Better:**
- **Option A:** If very few CVEs match "CVE-2023-3817", finding them first is faster
- **Option B:** Direct URI comparison is faster than property lookup
**Performance Gain:** 5-15% on NOT EXISTS evaluation
**Caveat:** Depends on data distribution and query planner
---
## 5. Complete Optimized Query
### Fully Optimized LLM Query
```sparql
PREFIX schema: <http://schema.org/>
PREFIX sc: <https://w3id.org/secure-chain/>
SELECT DISTINCT ?version
WHERE {
# Optimization 1: Direct binding with index usage
?software schema:name "openssl" .
# Optimization 2: Direct version retrieval
?software sc:hasSoftwareVersion ?version .
# Optimization 3: Early type constraint
?version a sc:SoftwareVersion .
# Optimization 4 & 5: Consistent namespaces + optimized NOT EXISTS
FILTER NOT EXISTS {
?vuln schema:identifier "CVE-2023-3817" .
?version sc:vulnerableTo ?vuln .
}
}
ORDER BY ?version
```
**Expected Performance Improvement:** 10-50x faster than original LLM query
---
## 6. Benchmark Estimates (Large Graph)
### Assumptions:
- Graph size: 10M triples
- Software entities: 100K
- "openssl" versions: 500
- Vulnerabilities: 50K CVEs
- Hardware: Standard SPARQL endpoint
### Performance Comparison
| Query | Estimated Time | Memory Usage | Index Hits |
|-------|----------------|--------------|------------|
| **Original LLM** | 5-10 seconds | 500 MB | 3 partial scans |
| **Ground Truth** | 0.5-1 second | 50 MB | 3 efficient lookups |
| **Optimized LLM** | 0.5-1 second | 50 MB | 3 efficient lookups |
### Performance Breakdown
| Operation | LLM Time | GT Time | Optimized LLM Time |
|-----------|----------|---------|-------------------|
| Initial pattern match | 3000ms | 10ms | 10ms |
| Filter application | 500ms | 0ms | 0ms |
| Version traversal | 200ms | 100ms | 100ms |
| Type filtering | 0ms | 50ms | 50ms |
| NOT EXISTS | 1000ms | 300ms | 200ms |
| DISTINCT | 300ms | 100ms | 100ms |
| **Total** | **5000ms** | **560ms** | **460ms** |
**Speed-up: ~9-11x faster**
---
## 7. Additional Recommendations
### Use Query Hints (Store-Specific)
**For Virtuoso:**
```sparql
# Force index usage
SELECT DISTINCT ?version
WHERE {
hint:Prior hint:runOnce true .
?software schema:name "openssl" .
...
}
```
**For Blazegraph:**
```sparql
# Hint join order
SELECT DISTINCT ?version
WHERE {
hint:Query hint:optimizer "Runtime" .
?software schema:name "openssl" .
...
}
```
### Consider LIMIT for Testing
```sparql
SELECT DISTINCT ?version
WHERE { ... }
ORDER BY ?version
LIMIT 100 # Add during development/testing
```
### Use COUNT for Faster Existence Checks
**If only checking existence:**
```sparql
ASK {
?software schema:name "openssl" .
?software sc:hasSoftwareVersion ?version .
?version a sc:SoftwareVersion .
}
```
**ASK queries are 2-5x faster than SELECT when you only need true/false**
---
## 8. Summary Table
| Optimization | Priority | Performance Gain | Complexity | Implementation |
|--------------|----------|------------------|------------|----------------|
| 1. Direct binding instead of FILTER | ⭐⭐⭐ Critical | 10-100x | Trivial | Replace variable+FILTER with constant |
| 2. Remove property traversal | ⭐⭐ High | 1.5-3x | Low | Return URIs directly |
| 3. Add type constraint early | ⭐⭐ High | 1.2-2x + correctness | Trivial | Add `?version a sc:SoftwareVersion` |
| 4. Consistent namespaces | ⭐⭐ Critical* | 0-20% + correctness | Trivial | Use schema: namespace |
| 5. Optimize NOT EXISTS | ⭐ Medium | 5-15% | Medium | Reorder patterns |
*Critical for correctness, medium for performance
---
## 9. Final Verdict
### Efficiency Ranking:
1. **Ground-Truth Query**: ⭐⭐⭐⭐⭐ (Most Efficient)
2. **Optimized LLM Query**: ⭐⭐⭐⭐⭐ (Equivalent to Ground Truth)
3. **Original LLM Query**: ⭐⭐ (Inefficient - 10-50x slower)
### Key Takeaways:
✅ **Ground-Truth Query Strengths:**
- Uses indexes effectively
- Minimal intermediate results
- Simple query plan
- Type-safe with early filtering
❌ **LLM Query Weaknesses:**
- Full scan instead of index lookup
- Post-filtering instead of pre-filtering
- Unnecessary property traversal
- Missing type constraints
🔧 **Recommended Action:**
Use the **optimized query** provided above, which combines the correctness of the ground-truth query with additional performance optimizations.
**Expected Outcome:** 10-50x performance improvement on large graphs while maintaining correctness.