# PDSL Design Summary
**Version:** 0.1.0
**Date:** 2025-10-12
**Status:** Design Complete, Implementation Ready
This document summarizes the design decisions, rationale, and implementation timeline for the Probabilistic Domain-Specific Language (PDSL).
## Executive Summary
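PDSL is a domain-specific language for expressing probabilistic knowledge. It sits between natural language and ProbLog: statements read like structured English and translate directly to ProbLog. The design targets a 90%+ success rate when LLMs generate PDSL from natural-language prompts, so that users without a logic-programming background can still build probabilistic models.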
**Key Achievements:**
- Complete language specification with EBNF grammar
- Comprehensive parser architecture
- Full ProbLog translation patterns
- 30+ practical examples across 9 domains
- Type-safe probability validation
- Clear error messages with suggestions
## Design Philosophy
### Core Principles
1. **Readability First**
- Syntax reads like structured English
- Natural mapping from probabilistic statements
- Minimal special characters and symbols
2. **Safety by Design**
- Probability validation at compile time, before any ProbLog is emitted
- Type checking for predicates
- Variable safety enforcement
- Clear, actionable error messages
3. **LLM-Friendly**
- Intuitive syntax for LLM generation
- Consistent patterns
- Minimal ambiguity
- Self-documenting code
4. **Expressiveness**
- Full ProbLog feature coverage
- Support for advanced patterns
- Extensible for future features
5. **Bidirectional**
- PDSL ↔ ProbLog conversion
- Preserve semantics in both directions
- Support for round-trip translation
## Language Design Decisions
### Decision 1: Model Wrapper Syntax
**Choice:** Explicit `probabilistic_model` wrapper with named models
```pdsl
probabilistic_model MedicalDiagnosis {
# statements
}
```
**Rationale:**
- Provides clear scope boundaries
- Enables model composition (future)
- Matches CSDL pattern (proven success)
- Natural for LLM generation
**Alternatives Considered:**
- No wrapper (flat file) - Rejected: lacks structure
- Module syntax - Rejected: too complex for v0.1
### Decision 2: Probabilistic Annotation Operator
**Choice:** Double colon `::`
```pdsl
0.7 :: sunny
```
**Rationale:**
- Matches ProbLog syntax (easy translation)
- Visually distinctive
- Already understood by logic programmers
- Consistent with existing tools
**Alternatives Considered:**
- `P=0.7` - Rejected: looks like assignment
- `@0.7` - Rejected: less clear
- `prob(0.7, sunny)` - Rejected: too verbose
### Decision 3: Observation Syntax
**Choice:** Keyword `observe` before literals
```pdsl
observe fever
observe not raining
```
**Rationale:**
- Clear intent (not ambiguous with facts)
- Natural language mapping
- Distinct from queries
- Easy to parse
**Alternatives Considered:**
- `evidence(fever)` - Rejected: too functional
- `fever = true` - Rejected: looks imperative
- `given fever` - Rejected: less clear for LLMs
### Decision 4: Query Syntax
**Choice:** Keyword `query` before atoms
```pdsl
query flu
query disease(X)
```
**Rationale:**
- Explicit query intent
- Matches observation pattern
- Clear for beginners and LLMs
- Easy to distinguish from facts
**Alternatives Considered:**
- `?- flu` - Rejected: Prolog-specific notation
- `ask flu` - Rejected: less formal
- `P(flu)` - Rejected: mathematical notation
### Decision 5: Variable Naming Convention
**Choice:** Uppercase for variables, lowercase for constants
```pdsl
flies(X) :- bird(X) # X is a variable
bird(sparrow) # sparrow is a constant
```
**Rationale:**
- Prolog convention (widely known)
- Visually clear distinction
- Standard in logic programming
- LLMs already trained on this pattern
**Alternatives Considered:**
- `$X` or `%X` - Rejected: adds visual noise
- Type annotations - Rejected: too complex for v0.1
- Lowercase variables - Rejected: ambiguous
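The case convention makes variables recognizable from the token text alone, which keeps later checks simple. The sketch below shows one way a checker might use it. The `Literal` tuple layout and the function names are hypothetical, and the safety rule (every variable in the head or in a negated literal must also occur in a positive body literal) is an assumed reading of "variable safety", not a rule taken from the specification.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical stand-in for an AST node: (negated, predicate_name, argument_terms).
Literal = Tuple[bool, str, Tuple[str, ...]]


def is_variable(term: str) -> bool:
    """Terms starting with an uppercase letter are variables; all others are constants."""
    return term[:1].isupper()


def variables(terms: Iterable[str]) -> Set[str]:
    """Collect the variable names among a sequence of argument terms."""
    return {t for t in terms if is_variable(t)}


def unsafe_variables(head_args: Tuple[str, ...], body: List[Literal]) -> Set[str]:
    """Return variables used in the head or under negation that no positive literal binds."""
    bound: Set[str] = set()
    needs_binding = variables(head_args)
    for negated, _name, args in body:
        if negated:
            needs_binding |= variables(args)
        else:
            bound |= variables(args)
    return needs_binding - bound
```

For `flies(X) :- bird(X), not penguin(X)` the result is empty, while `flies(X) :- not penguin(X)` reports `X` as unbound.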
### Decision 6: Negation Operator
**Choice:** Keyword `not` in PDSL, translates to `\+` in ProbLog
```pdsl
flies(X) :- bird(X), not penguin(X)
# Translates to: flies(X) :- bird(X), \+ penguin(X).
```
**Rationale:**
- `not` is more natural for LLMs and humans
- `\+` is Prolog-specific (confusing for beginners)
- Translation handles conversion automatically
- Consistent with natural language
**Alternatives Considered:**
- Use `\+` directly - Rejected: unintuitive
- `~` or `!` - Rejected: too symbolic
- `neg(X)` - Rejected: verbose
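A minimal sketch of the keyword rewrite, assuming the translator works on the text of a rule body. The function name is hypothetical, and a real code generator would more likely emit `\+` from a negated-literal AST node than from a regular expression.

```python
import re

# Match the keyword `not` only as a whole word followed by whitespace,
# so predicate names such as `notable` or `do_not` are left untouched.
_NOT_KEYWORD = re.compile(r"\bnot\s+")


def translate_negation(body: str) -> str:
    """Rewrite PDSL `not atom` as ProbLog `\\+ atom` inside a rule body."""
    return _NOT_KEYWORD.sub(lambda _match: "\\+ ", body)
```

Applied to `bird(X), not penguin(X)`, this yields `bird(X), \+ penguin(X)`.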
### Decision 7: Annotated Disjunction Syntax
**Choice:** Semicolon-separated alternatives
```pdsl
0.3 :: a; 0.5 :: b; 0.2 :: c
```
**Rationale:**
- Matches ProbLog syntax (direct translation)
- Visually groups alternatives
- Probability sum validation possible
- Compact representation
**Alternatives Considered:**
- `|` separator - Rejected: looks like Prolog disjunction
- Multiple statements - Rejected: doesn't capture mutual exclusivity
- `choice { ... }` - Rejected: too verbose
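Because the alternatives of an annotated disjunction appear in one statement, their probabilities can be checked together. The sketch below is one possible check, with hypothetical function names. It assumes that a total below 1.0 is acceptable (as in ProbLog, where the remaining mass means no alternative holds) and that only a total above 1.0 is an error.

```python
import math
from typing import List, Tuple


def parse_annotated_disjunction(text: str) -> List[Tuple[float, str]]:
    """Split `p1 :: a; p2 :: b; ...` into (probability, atom) pairs."""
    alternatives = []
    for part in text.split(";"):
        prob_text, atom = part.split("::", 1)
        alternatives.append((float(prob_text), atom.strip()))
    return alternatives


def check_disjunction(alternatives: List[Tuple[float, str]], tol: float = 1e-9) -> float:
    """Validate each probability and the total; return the total on success."""
    for prob, atom in alternatives:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability {prob} for {atom} is outside [0.0, 1.0]")
    total = math.fsum(prob for prob, _atom in alternatives)
    if total > 1.0 + tol:  # tolerance absorbs floating-point rounding
        raise ValueError(f"alternatives sum to {total}, which exceeds 1.0")
    return total
```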
### Decision 8: Comment Syntax
**Choice:** Hash symbol `#` for single-line comments
```pdsl
# This is a comment
0.5 :: rain # Inline comment
```
**Rationale:**
- Common in Python, shell, many DSLs
- Easy to type
- LLMs familiar with this pattern
- Non-intrusive
**Alternatives Considered:**
- `//` - Rejected: C-style, less universal
- `%` - Rejected: Prolog-specific
- `/* */` - Rejected: more complex (planned for v0.2)
### Decision 9: Type System Simplicity
**Choice:** Simple type inference, no explicit annotations in v0.1
**Rationale:**
- Reduces cognitive load
- Matches logic programming tradition
- Sufficient for most use cases
- Can add annotations in future versions
**Alternatives Considered:**
- Explicit type declarations - Rejected: too complex for v0.1
- Gradual typing - Deferred to v0.2+
### Decision 10: Error Message Design
**Choice:** Structured errors with location, message, and suggestion
```
Error: Invalid probability value
Line 3: 1.5 :: unlikely
        ^^^
Probability must be between 0.0 and 1.0
Suggestion: Did you mean 0.15?
```
**Rationale:**
- Actionable guidance for users
- Helps LLMs correct errors
- Industry best practice (Rust, TypeScript)
- Improves learning curve
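The error layout above could be produced by a small record type plus a renderer. The following is a sketch under stated assumptions: the `PdslError` class, its field names, and `check_probability` are hypothetical, and the "shift the decimal point" suggestion heuristic is only one guess at how a suggestion such as `0.15` might be derived.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PdslError:
    title: str
    line_no: int
    source_line: str
    col: int          # 0-based column where the offending span starts
    length: int       # number of characters to underline
    message: str
    suggestion: Optional[str] = None

    def render(self) -> str:
        """Format the error as title, source line, caret underline, message, suggestion."""
        prefix = f"Line {self.line_no}: "
        lines = [
            f"Error: {self.title}",
            prefix + self.source_line,
            " " * (len(prefix) + self.col) + "^" * self.length,
            self.message,
        ]
        if self.suggestion:
            lines.append(f"Suggestion: {self.suggestion}")
        return "\n".join(lines)


def check_probability(line_no: int, source_line: str) -> Optional[PdslError]:
    """Return an error if the annotation before `::` is not a probability in [0.0, 1.0]."""
    token = source_line.split("::", 1)[0].strip()
    value = float(token)
    if 0.0 <= value <= 1.0:
        return None
    suggestion = None
    if value > 1.0:
        scaled = value
        while scaled > 1.0:   # assumed heuristic: move the decimal point left
            scaled /= 10
        suggestion = f"Did you mean {scaled:g}?"
    return PdslError(
        title="Invalid probability value",
        line_no=line_no,
        source_line=source_line,
        col=source_line.index(token),
        length=len(token),
        message="Probability must be between 0.0 and 1.0",
        suggestion=suggestion,
    )
```

Calling `check_probability(3, "1.5 :: unlikely").render()` reproduces the five-line layout shown above, with the carets aligned under `1.5`.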
## Example PDSL Programs
### Example 1: Medical Diagnosis
```pdsl
probabilistic_model MedicalDiagnosis {
# Prior probabilities
0.01 :: flu
0.001 :: covid
# Symptoms
0.9 :: fever :- flu
0.95 :: fever :- covid
0.1 :: fever # Background rate
# Evidence
observe fever
# Query
query flu
query covid
}
```
**Translation to ProbLog:**
```prolog
% Model: MedicalDiagnosis
% Prior probabilities
0.01::flu.
0.001::covid.
% Symptoms
0.9::fever :- flu.
0.95::fever :- covid.
0.1::fever.
% Evidence
evidence(fever, true).
% Query
query(flu).
query(covid).
```
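To make the meaning of the two queries concrete, the sketch below computes the same posteriors by brute-force enumeration over the five independent probabilistic choices in the model (two priors and one choice per `fever` clause). It does not call ProbLog; the variable names are hypothetical, and the enumeration only mirrors ProbLog's distribution semantics for this small ground program.

```python
from itertools import product

# One independent Boolean choice per probabilistic fact or clause in the model.
CHOICES = {
    "flu": 0.01,
    "covid": 0.001,
    "fever_if_flu": 0.9,       # 0.9 :: fever :- flu
    "fever_if_covid": 0.95,    # 0.95 :: fever :- covid
    "fever_background": 0.1,   # 0.1 :: fever
}


def posterior(query: str) -> float:
    """P(query | fever), summing the weights of all possible worlds."""
    names = list(CHOICES)
    joint = 0.0      # total weight of worlds where fever and the query both hold
    evidence = 0.0   # total weight of worlds where fever holds
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= CHOICES[name] if world[name] else 1.0 - CHOICES[name]
        fever = (
            (world["flu"] and world["fever_if_flu"])
            or (world["covid"] and world["fever_if_covid"])
            or world["fever_background"]
        )
        if fever:
            evidence += weight
            if world[query]:
                joint += weight
    return joint / evidence


if __name__ == "__main__":
    print("P(flu | fever)   =", posterior("flu"))
    print("P(covid | fever) =", posterior("covid"))
```

Observing a fever raises the probability of flu well above its 0.01 prior, but the 0.1 background rate still explains most fevers.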
### Example 2: Weather Prediction
```pdsl
probabilistic_model Weather {
# Base probability
0.3 :: rain
# Conditional
0.9 :: cloudy :- rain
0.3 :: cloudy :- not rain
observe cloudy
query rain
}
```
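As a worked check of what `query rain` should return here, Bayes' rule can be applied by hand. Only one of the two `cloudy` clauses can fire in any given world, so the clause probabilities act directly as the conditionals P(cloudy | rain) = 0.9 and P(cloudy | ¬rain) = 0.3:

$$
P(\text{rain} \mid \text{cloudy})
= \frac{P(\text{cloudy} \mid \text{rain})\,P(\text{rain})}
       {P(\text{cloudy} \mid \text{rain})\,P(\text{rain}) + P(\text{cloudy} \mid \neg\text{rain})\,P(\neg\text{rain})}
= \frac{0.9 \times 0.3}{0.9 \times 0.3 + 0.3 \times 0.7}
= \frac{0.27}{0.48}
= 0.5625
$$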
### Example 3: Network Reliability
```pdsl
probabilistic_model Network {
# Component reliability
0.95 :: up(server1)
0.98 :: up(server2)
0.90 :: up(router)
# Service availability
0.99 :: service :- up(server1), up(router)
0.99 :: service :- up(server2), up(router)
observe not service
query up(server1)
query up(router)
}
```
### Example 4: Annotated Disjunction
```pdsl
probabilistic_model Traffic {
# Mutually exclusive traffic levels
0.4 :: traffic(light);
0.4 :: traffic(moderate);
0.2 :: traffic(heavy)
# Travel time is determined by the traffic level
1.0 :: time(15) :- traffic(light)
1.0 :: time(30) :- traffic(moderate)
1.0 :: time(60) :- traffic(heavy)
query traffic(heavy)
query time(30)
}
```
### Example 5: Learning from Data
```pdsl
probabilistic_model DataDriven {
# Learnable parameters
p1 :: disease(X) :- symptom1(X)
p2 :: disease(X) :- symptom2(X)
learn parameters from dataset("medical_data.csv")
query disease(patient123)
}
```
## Grammar Highlights
### EBNF Grammar (Summary)
```ebnf
Program ::= Model+
Model ::= 'probabilistic_model' Identifier '{' Statement* '}'
Statement ::= ProbFact | ProbRule | AnnotatedDisj | Fact | Observation | Query | Learning
ProbFact ::= Probability '::' Atom
ProbRule ::= Probability '::' Atom ':-' Body
AnnotatedDisj ::= ProbFact (';' ProbFact)+
Observation ::= 'observe' Literal
Query ::= 'query' Atom
Body ::= Literal (',' Literal)*
Literal ::= Atom | 'not' Atom
Atom ::= Predicate | Predicate '(' ArgumentList ')'
```
**Key Features:**
- Recursive descent parser friendly
- Unambiguous grammar
- Left-to-right parsing
- Minimal lookahead required
## Parser Architecture Overview
### Compilation Pipeline
```
Source Text
↓
┌────────────┐
│ Lexer │ → Tokens
└────────────┘
↓
┌────────────┐
│ Parser │ → AST
└────────────┘
↓
┌────────────┐
│ Semantic │ → Validated AST
│ Analyzer │
└────────────┘
↓
┌────────────┐
│ Type │ → Type-checked AST
│ Checker │
└────────────┘
↓
┌────────────┐
│ Code │ → ProbLog
│ Generator │
└────────────┘
```
### Key Components
1. **Lexer** - Tokenization with precise error locations
2. **Parser** - Recursive descent, builds typed AST
3. **Semantic Analyzer** - Variable safety, arity checking
4. **Type Checker** - Probability validation, type consistency
5. **Code Generator** - ProbLog emission with optimization
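As an illustration of the first stage, the sketch below tokenizes PDSL text and records a line and column for every token, which is what later stages need for precise error locations. The token kinds, the `Token` type, and the regular expression are assumptions made for this example, not the token set defined in the specification.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int   # 1-based
    col: int    # 1-based


TOKEN_RE = re.compile(
    r"""
    (?P<COMMENT>\#[^\n]*)
  | (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<ANNOT>::)
  | (?P<IMPLIES>:-)
  | (?P<PUNCT>[(){},;])
  | (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<NEWLINE>\n)
  | (?P<SKIP>[ \t\r]+)
    """,
    re.VERBOSE,
)


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens with positions; raise SyntaxError at the first unknown character."""
    line, line_start, pos = 1, 0, 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            col = pos - line_start + 1
            raise SyntaxError(f"Line {line}, column {col}: unexpected character {source[pos]!r}")
        kind = match.lastgroup
        if kind == "NEWLINE":
            line += 1
            line_start = match.end()
        elif kind not in ("SKIP", "COMMENT"):
            # Keywords such as `observe` and `query` are emitted as NAME tokens;
            # in this sketch the parser is expected to tell them apart.
            yield Token(kind, match.group(), line, pos - line_start + 1)
        pos = match.end()
```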
### Parser Algorithm
**Chosen:** Recursive Descent
**Rationale:**
- Simple to implement and understand
- Excellent error recovery
- Direct mapping from grammar
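- Efficient for LL(1) grammars (the summarized rules become LL(1) once the shared `Probability '::' Atom` prefix of `ProbFact` and `ProbRule` is left-factored)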
**Alternatives Considered:**
- LR parser generator - Rejected: overkill for simple grammar
- PEG parser - Rejected: less familiar
- Hand-coded state machine - Rejected: harder to maintain
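The direct mapping from grammar to code can be seen in a small sketch covering only `ProbFact`, `ProbRule`, `Body`, `Literal`, and `Atom`. The class and method names are hypothetical. `ProbFact` and `ProbRule` are handled by a single method with an optional `:-` tail, so one token of lookahead is enough, and the regex tokenizer is deliberately crude (it silently skips characters it does not recognize).

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Atom:
    name: str
    args: Tuple[str, ...] = ()


@dataclass
class Literal:
    atom: Atom
    negated: bool = False


@dataclass
class ProbClause:
    """A ProbFact when `body` is empty, a ProbRule otherwise."""
    prob: float
    head: Atom
    body: Tuple[Literal, ...] = ()


TOKENS = re.compile(r"\d+(?:\.\d+)?|::|:-|[(),]|[A-Za-z_]\w*")


class Parser:
    def __init__(self, text: str) -> None:
        self.toks: List[str] = TOKENS.findall(text)
        self.pos = 0

    def peek(self) -> Optional[str]:
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def expect(self, tok: Optional[str] = None) -> str:
        cur = self.peek()
        if cur is None or (tok is not None and cur != tok):
            raise SyntaxError(f"expected {tok or 'a token'}, found {cur!r}")
        self.pos += 1
        return cur

    def name(self) -> str:
        tok = self.expect()
        if not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected a name, found {tok!r}")
        return tok

    def prob_clause(self) -> ProbClause:
        # Probability '::' Atom (':-' Body)?
        prob = float(self.expect())
        self.expect("::")
        head = self.atom()
        body: List[Literal] = []
        if self.peek() == ":-":
            self.expect(":-")
            body = self.body()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return ProbClause(prob, head, tuple(body))

    def body(self) -> List[Literal]:
        # Literal (',' Literal)*
        literals = [self.literal()]
        while self.peek() == ",":
            self.expect(",")
            literals.append(self.literal())
        return literals

    def literal(self) -> Literal:
        # Atom | 'not' Atom
        if self.peek() == "not":
            self.expect("not")
            return Literal(self.atom(), negated=True)
        return Literal(self.atom())

    def atom(self) -> Atom:
        # Predicate | Predicate '(' ArgumentList ')'
        name = self.name()
        args: List[str] = []
        if self.peek() == "(":
            self.expect("(")
            args.append(self.expect())
            while self.peek() == ",":
                self.expect(",")
                args.append(self.expect())
            self.expect(")")
        return Atom(name, tuple(args))
```

Each grammar rule corresponds to one method, which is the main maintainability argument for recursive descent in a grammar of this size.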
## Translation Examples
### Basic Translation
**PDSL:**
```pdsl
0.7 :: sunny
observe cloudy
query sunny
```
**ProbLog:**
```prolog
0.7::sunny.
evidence(cloudy, true).
query(sunny).
```
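The three basic mappings above can be sketched as a single line-level function. This is an illustration with a hypothetical name (`to_problog`); it works on raw text rather than on the AST the real code generator would traverse, and emitting `evidence(atom, false).` for `observe not atom` is an assumption about the translator's choice (ProbLog accepts that form).

```python
def to_problog(statement: str) -> str:
    """Translate one basic PDSL statement (fact, observation, or query) to ProbLog."""
    stmt = statement.split("#", 1)[0].strip()   # drop a trailing `#` comment
    if stmt.startswith("observe "):
        literal = stmt[len("observe "):].strip()
        if literal.startswith("not "):
            return f"evidence({literal[len('not '):].strip()}, false)."
        return f"evidence({literal}, true)."
    if stmt.startswith("query "):
        return f"query({stmt[len('query '):].strip()})."
    if "::" in stmt:
        prob, atom = stmt.split("::", 1)
        return f"{prob.strip()}::{atom.strip()}."
    raise ValueError(f"unsupported statement: {statement!r}")
```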
### Rule Translation
**PDSL:**
```pdsl
0.9 :: flies(X) :- bird(X), not penguin(X)
```
**ProbLog:**
```prolog
0.9::flies(X) :- bird(X), \+ penguin(X).
```
### Complex Model
**PDSL:**
```pdsl
probabilistic_model BayesNet {
0.3 :: a
0.8 :: b :- a
0.1 :: b :- not a
0.9 :: c :- b
observe c
query a
}
```
**ProbLog:**
```prolog
% Model: BayesNet
0.3::a.
0.8::b :- a.
0.1::b :- \+ a.
0.9::c :- b.
evidence(c, true).
query(a).
```
## Implementation Timeline
### Phase 1: Core Language (Weeks 1-2)
**Week 1: Lexer and Parser**
- Day 1-2: Lexer implementation
- Token definitions
- Lexical analysis
- Error reporting
- Day 3-4: Parser implementation
- AST definitions
- Recursive descent parser
- Basic error recovery
- Day 5: Testing
- Lexer unit tests
- Parser unit tests
- Edge cases
**Week 2: Semantic Analysis**
- Day 1-2: Semantic analyzer
- Symbol table
- Variable safety checking
- Arity consistency
- Day 3: Type checker
- Probability validation
- Type inference
- Constraint checking
- Day 4-5: Testing
- Semantic analysis tests
- Integration tests
- Error message quality
### Phase 2: Code Generation (Week 3)
**Day 1-2: ProbLog Generator**
- AST traversal
- Code emission
- Formatting and optimization
**Day 3: Bidirectional Translation**
- ProbLog → PDSL parser
- Round-trip testing
- Format preservation
**Day 4-5: Testing and Polish**
- End-to-end tests
- Example validation
- Documentation updates
### Phase 3: Integration (Week 4)
**Day 1-2: MCP Server Integration**
- Add probabilistic operation to MCP
- Command handler integration
- Result formatting
**Day 3: Natural Language Processing**
- NLP → PDSL generation
- Pattern matching
- LLM prompt engineering
**Day 4-5: Testing and Benchmarking**
- LLM generation success rate testing
- Performance benchmarks
- Final documentation
### Phase 4: Polish and Release (Week 5)
**Day 1-2: Documentation**
- Complete user guide
- API documentation
- Tutorial videos (optional)
**Day 3-4: Examples and Templates**
- Expand example library
- Create templates
- Domain-specific guides
**Day 5: Release**
- Version 0.1.0 release
- Announcement
- Community feedback
## Total Timeline: 5 Weeks
### Week-by-Week Summary
| Week | Focus | Deliverables |
|------|-------|--------------|
| 1 | Lexer + Parser | Tokenizer, AST, Basic parsing |
| 2 | Semantic Analysis | Type checking, Validation |
| 3 | Code Generation | ProbLog output, Translation |
| 4 | Integration | MCP server, NLP support |
| 5 | Polish + Release | Docs, Examples, v0.1.0 |
## Design Decision Rationale
### Why PDSL over Direct ProbLog?
1. **Readability** - PDSL is more intuitive for non-experts
2. **Type Safety** - Catch errors before execution
3. **LLM Generation** - Optimized for AI-generated code
4. **Abstraction** - Hide ProbLog complexity
5. **Extensibility** - Easier to add features
### Why PDSL Syntax Choices?
1. **`probabilistic_model`** - Clear scoping, proven pattern
2. **`::`** - Matches ProbLog, familiar to users
3. **`observe`** - Explicit intent, clear semantics
4. **`query`** - Symmetric with observe
5. **`not`** - Natural language, not Prolog-specific
### Why This Parser Architecture?
1. **Multi-stage** - Clear separation of concerns
2. **Recursive Descent** - Simple, maintainable
3. **Type Checking** - Catch errors early
4. **Error Recovery** - Helpful messages
5. **Optimization** - Room for future improvements
## Success Metrics
### Target Goals
1. **LLM Generation Success Rate:** 90%+
- Measured by valid PDSL from natural language prompts
- Baseline: CSDL achieved 95%
2. **Translation Accuracy:** 100%
- All valid PDSL must produce valid ProbLog
- Semantic equivalence verified
3. **User Satisfaction:** 85%+
- Measured by ease-of-use surveys
- Compared against direct ProbLog
4. **Performance:** < 100ms compilation
- For programs up to 1000 lines
- Measured on standard hardware
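The performance target could be checked with a small timing harness like the sketch below. Everything here is hypothetical scaffolding: `make_program` generates a synthetic 1000-line model, and `stand_in_compile` is a placeholder to be replaced by the real compiler entry point once it exists. No timing results are implied.

```python
import time
from typing import Callable


def make_program(n_statements: int) -> str:
    """Build a synthetic model with one probabilistic fact per line."""
    body = "\n".join(f"    0.5 :: fact{i}" for i in range(n_statements))
    return "probabilistic_model Bench {\n" + body + "\n}\n"


def best_time_ms(compile_fn: Callable[[str], object], source: str, repeats: int = 5) -> float:
    """Return the fastest of `repeats` runs in milliseconds, to reduce timer noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        compile_fn(source)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


def stand_in_compile(source: str) -> int:
    """Placeholder for the real PDSL compiler: counts annotated statements."""
    return sum(1 for line in source.splitlines() if "::" in line)


if __name__ == "__main__":
    program = make_program(998)   # 998 facts plus the wrapper lines = 1000 lines
    elapsed = best_time_ms(stand_in_compile, program)
    print(f"best of 5: {elapsed:.3f} ms (target: < 100 ms)")
```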
### Validation Strategy
1. **Unit Tests:** 90%+ coverage
2. **Integration Tests:** All examples compile and run
3. **LLM Tests:** Prompt → PDSL success rate
4. **User Studies:** Feedback from 20+ users
## Risk Assessment
### Technical Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Grammar ambiguity | High | Low | Formal grammar specification |
| Poor error messages | Medium | Medium | User testing, iteration |
| Translation bugs | High | Medium | Extensive test suite |
| Performance issues | Low | Low | Profiling, optimization |
### Project Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| Timeline overrun | Medium | Medium | Phased approach, MVP first |
| LLM generation below 90% | High | Low | Follow CSDL patterns |
| ProbLog compatibility | Medium | Low | Use stable ProbLog version |
| User adoption | Medium | Medium | Good docs, examples |
## Future Extensions (v0.2.0+)
### Planned Features
1. **Multi-line comments:** `/* ... */`
2. **List support:** `[1, 2, 3]` for aggregates
3. **Arithmetic:** `X is Y + 1`
4. **Comparison:** `X > Y`, `X =< Y`
5. **Aggregates:** `count`, `sum`, `avg`
6. **Modules:** Import/export between models
7. **Type annotations:** Optional explicit types
8. **Macros:** Code generation templates
### Experimental Features (v0.3.0+)
1. **Continuous distributions:** Gaussian, Beta, etc.
2. **Utility theory:** Decision-theoretic reasoning
3. **Temporal reasoning:** Time-indexed predicates
4. **Causal inference:** `do` operator
5. **Approximate inference:** MCMC, variational methods
6. **Online learning:** Update probabilities incrementally
## Lessons from CSDL
### What Worked Well
1. **Constraint wrapper syntax** - Clear, structured
2. **Natural keyword choices** - "require", "satisfy"
3. **Type inference** - Less boilerplate
4. **Good error messages** - Improved usability
5. **Comprehensive examples** - Aided understanding
### Improvements for PDSL
1. **More formal grammar** - EBNF from the start
2. **Better parser architecture** - Multi-stage pipeline
3. **Bidirectional translation** - ProbLog ↔ PDSL
4. **Performance testing** - Early optimization
5. **LLM-specific testing** - Dedicated validation
## Conclusion
PDSL is designed to make probabilistic reasoning accessible to users who do not write ProbLog directly. By combining:
- **Natural syntax** optimized for LLM generation
- **Type safety** through compile-time validation
- **Full ProbLog compatibility** via clean translation
- **Comprehensive documentation** and examples
...we expect to achieve 90%+ LLM generation success while maintaining 100% semantic correctness.
The 5-week implementation timeline follows the phased, MVP-first approach used for CSDL. The multi-stage parser architecture leaves room for future extensions while keeping the initial implementation manageable.
**PDSL is ready for implementation.**
---
## Appendices
### Appendix A: Grammar Reference
See [PROBABILISTIC_DSL_SPECIFICATION.md](PROBABILISTIC_DSL_SPECIFICATION.md) for complete EBNF grammar.
### Appendix B: Parser Implementation
See [PROBABILISTIC_DSL_PARSER_DESIGN.md](PROBABILISTIC_DSL_PARSER_DESIGN.md) for detailed architecture.
### Appendix C: Translation Patterns
See [PROBABILISTIC_DSL_PROBLOG_TRANSLATION.md](PROBABILISTIC_DSL_PROBLOG_TRANSLATION.md) for all patterns.
### Appendix D: Examples
See [PROBABILISTIC_DSL_EXAMPLES.md](PROBABILISTIC_DSL_EXAMPLES.md) for 30+ examples.
### Appendix E: Quick Start
See [PROBABILISTIC_DSL_QUICKSTART.md](PROBABILISTIC_DSL_QUICKSTART.md) for 5-minute tutorial.
---
**Document Version:** 1.0
**Last Updated:** 2025-10-12
**Authors:** PDSL Design Team
**Status:** ✅ Complete