Skip to main content
Glama
Arize-ai

@arizeai/phoenix-mcp

Official
by Arize-ai
future-improvements.md21.3 kB
# Future Improvements (Post-MVP) This document catalogs potential enhancements to elevate the LDAP implementation from production-ready (A-) to enterprise-grade (A+). These are **not blockers** for the current release but represent opportunities for future iterations. --- ## Priority Matrix | Feature | Impact | Complexity | Target Phase | |---------|--------|------------|--------------| | [Password Policy Handling](#1-password-policy-handling) | High | Medium | **Phase 2** | | [Retry Logic](#2-retry-logic-for-transient-failures) | Medium | Low | **Phase 2** | | [Metrics/Observability](#3-metricsobservability) | Medium | Low | **Phase 2** | | [Nested Group Resolution](#4-nested-group-resolution-for-active-directory) | Medium | High | Phase 3 | | [Connection Health Checks](#5-connection-health-checks) | Low | Medium | Phase 3 | | [Group Search Pagination](#6-group-search-pagination) | Low | Low | As needed | **Phase 2 Recommendation**: Items 1-3 should be bundled in the next LDAP release for significant UX and operational improvements with modest effort. --- ## 1. Password Policy Handling **Impact**: High | **Complexity**: Medium | **Target**: Phase 2 ### Current Behavior All bind failures return generic "Invalid credentials" error. ### Problem Active Directory and OpenLDAP return specific error codes for password-related issues that could provide better UX: | AD Error Code | Meaning | Current UX | Ideal UX | |---------------|---------|------------|----------| | `data 532` | Password expired | "Invalid credentials" | "Password expired, contact admin" | | `data 533` | Account disabled | "Invalid credentials" | "Account disabled" | | `data 701` | Account expired | "Invalid credentials" | "Account expired" | | `data 773` | Must change password | "Invalid credentials" | "Password change required" | | `data 775` | Account locked | "Invalid credentials" | "Account locked, try later" | ### Implementation Sketch ```python from enum import Enum class LDAPAuthResult(Enum): """LDAP authentication result with account status details.""" SUCCESS = "success" INVALID_CREDENTIALS = "invalid_credentials" PASSWORD_EXPIRED = "password_expired" ACCOUNT_DISABLED = "account_disabled" ACCOUNT_LOCKED = "account_locked" MUST_CHANGE_PASSWORD = "must_change_password" def _verify_user_password(self, server, user_dn, password) -> LDAPAuthResult: """Verify password and detect account status issues. Parses Active Directory extended error information from bind failures to provide actionable feedback to users. AD Error Format: "80090308: LdapErr: DSID-0C09044E, comment: AcceptSecurityContext error, data XXX, v23f0" Where XXX is the sub-status code (532, 533, 701, 773, 775, etc.) """ try: # ... existing bind logic ... return LDAPAuthResult.SUCCESS except LDAPInvalidCredentialsResult as e: error_msg = str(e) # Parse AD sub-status codes if "data 532" in error_msg: return LDAPAuthResult.PASSWORD_EXPIRED elif "data 533" in error_msg: return LDAPAuthResult.ACCOUNT_DISABLED elif "data 701" in error_msg: # Account expired (different from password expired) return LDAPAuthResult.ACCOUNT_DISABLED elif "data 773" in error_msg: return LDAPAuthResult.MUST_CHANGE_PASSWORD elif "data 775" in error_msg: return LDAPAuthResult.ACCOUNT_LOCKED return LDAPAuthResult.INVALID_CREDENTIALS ``` ### API Response Updates ```python # In /auth/ldap/login endpoint match auth_result: case LDAPAuthResult.SUCCESS: return create_tokens(user) case LDAPAuthResult.PASSWORD_EXPIRED: raise HTTPException(401, detail="Password has expired. Please contact your administrator.") case LDAPAuthResult.ACCOUNT_LOCKED: raise HTTPException(401, detail="Account is locked. Please try again later.") case LDAPAuthResult.MUST_CHANGE_PASSWORD: raise HTTPException(401, detail="Password change required. Please contact your administrator.") case _: raise HTTPException(401, detail="Invalid username and/or password") ``` ### Security Considerations - **Username Enumeration Risk**: Detailed errors confirm user existence. Consider only showing detailed errors after successful user lookup (which already confirms existence). - **Information Disclosure**: Error messages should be generic enough to not aid attackers but specific enough to help legitimate users. ### References - [MS-ADTS §3.1.1.3.4.2](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/) - AD LDAP error codes - [OpenLDAP ppolicy overlay](https://www.openldap.org/doc/admin24/overlays.html#Password%20Policies) - LDAP password policy controls --- ## 2. Retry Logic for Transient Failures **Impact**: Medium | **Complexity**: Low | **Target**: Phase 2 ### Current Behavior A network blip mid-request fails the entire authentication attempt for that server, then moves to next server. ### Problem Transient network issues (brief DNS hiccups, momentary packet loss) cause unnecessary failures and degrade user experience. ### Implementation Sketch ```python from tenacity import ( retry, stop_after_attempt, retry_if_exception_type, wait_fixed, before_sleep_log, ) from ldap3.core.exceptions import LDAPSocketOpenError, LDAPSocketReceiveError logger = logging.getLogger(__name__) @retry( stop=stop_after_attempt(2), # Original attempt + 1 retry wait=wait_fixed(0.5), # 500ms between attempts retry=retry_if_exception_type(( LDAPSocketOpenError, # Connection refused, timeout LDAPSocketReceiveError, # Connection reset mid-operation )), before_sleep=before_sleep_log(logger, logging.WARNING), reraise=True, ) def _try_authenticate_on_server( self, server: Server, username: str, password: str ) -> LDAPUserInfo | None: """Single server auth attempt with retry for transient network errors. Retry Policy: - Max 1 retry (2 total attempts) - 500ms delay between attempts - Only retries on socket-level errors (network issues) Does NOT retry on: - LDAPInvalidCredentialsResult (wrong password) - LDAPNoSuchObjectResult (user not found) - Other LDAP protocol errors (server-side issues) """ with self._establish_connection(server) as conn: # ... existing authentication logic ... pass ``` ### Metrics Integration ```python # Track retry frequency for monitoring ldap_retry_total = Counter( 'phoenix_ldap_retry_total', 'LDAP authentication retry attempts', ['server', 'error_type'] ) # In retry callback def on_retry(retry_state): ldap_retry_total.labels( server=retry_state.args[1].host, # server argument error_type=type(retry_state.outcome.exception()).__name__ ).inc() ``` ### Configuration Option (Optional) ```bash # Environment variable to disable retries if needed PHOENIX_LDAP_RETRY_ENABLED=true # default PHOENIX_LDAP_RETRY_MAX_ATTEMPTS=2 PHOENIX_LDAP_RETRY_DELAY_MS=500 ``` --- ## 3. Metrics/Observability **Impact**: Medium | **Complexity**: Low | **Target**: Phase 2 ### Current Behavior No visibility into LDAP authentication performance or failure patterns. ### Problem Operations teams cannot: - Track LDAP auth latency trends - Identify failing servers proactively - Debug authentication issues without log diving - Set up alerting for LDAP degradation ### Implementation Sketch ```python from prometheus_client import Counter, Histogram, Gauge # === Counters === ldap_auth_total = Counter( 'phoenix_ldap_auth_total', 'Total LDAP authentication attempts', ['result', 'server'] ) # Labels: # result: success, invalid_credentials, user_not_found, server_error, timeout, all_failed # server: hostname or "all" for aggregate failures ldap_bind_total = Counter( 'phoenix_ldap_bind_total', 'Total LDAP bind operations (service account + user)', ['type', 'server', 'result'] ) # Labels: # type: service_account, user_verification, anonymous # result: success, failure # === Histograms === ldap_auth_duration_seconds = Histogram( 'phoenix_ldap_auth_duration_seconds', 'LDAP authentication end-to-end latency', ['server'], buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0] ) ldap_operation_duration_seconds = Histogram( 'phoenix_ldap_operation_duration_seconds', 'Individual LDAP operation latency', ['operation', 'server'], buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0] ) # Labels: # operation: connect, bind, search_user, search_groups, verify_password # === Gauges === ldap_server_healthy = Gauge( 'phoenix_ldap_server_healthy', 'LDAP server health status (1=healthy, 0=unhealthy)', ['server'] ) ldap_servers_configured = Gauge( 'phoenix_ldap_servers_configured', 'Number of configured LDAP servers' ) ``` ### Usage in Code ```python async def authenticate(self, username: str, password: str) -> LDAPUserInfo | None: """Authenticate with metrics instrumentation.""" for server in self.servers: with ldap_auth_duration_seconds.labels(server=server.host).time(): try: result = await self._authenticate_on_server(server, username, password) if result: ldap_auth_total.labels(result='success', server=server.host).inc() return result else: ldap_auth_total.labels(result='invalid_credentials', server=server.host).inc() return None except LDAPException as e: ldap_auth_total.labels(result='server_error', server=server.host).inc() continue ldap_auth_total.labels(result='all_failed', server='all').inc() return None ``` ### Grafana Dashboard (Future) Consider shipping a pre-built Grafana dashboard JSON for LDAP monitoring: - Authentication success/failure rates - Latency percentiles (P50, P95, P99) - Server health status - Error type breakdown --- ## 4. Nested Group Resolution for Active Directory **Impact**: Medium | **Complexity**: High | **Target**: Phase 3 (customer-driven) ### Current Behavior Only direct group memberships are resolved via `memberOf` attribute. ### Problem Many AD deployments use nested groups: ``` Phoenix Admins (group) └── IT Admins (group) ← Nested group └── alice (user) ← Indirect member of "Phoenix Admins" ``` Current implementation won't see Alice as a member of "Phoenix Admins". ### Option A: LDAP_MATCHING_RULE_IN_CHAIN (Recommended) ```python def _get_nested_groups_ad(self, conn: Connection, user_dn: str) -> list[str]: """Resolve nested AD groups using recursive membership filter. Uses Microsoft's LDAP_MATCHING_RULE_IN_CHAIN (OID 1.2.840.113556.1.4.1941) which recursively expands group membership server-side. Performance: Single LDAP query, server does recursion Compatibility: AD-only (not OpenLDAP) Args: conn: Active LDAP connection user_dn: User's distinguished name Returns: List of all group DNs (direct + transitive) """ # OID 1.2.840.113556.1.4.1941 = LDAP_MATCHING_RULE_IN_CHAIN # Finds all groups where user_dn is a member (directly or transitively) nested_filter = f"(member:1.2.840.113556.1.4.1941:={escape_filter_chars(user_dn)})" conn.search( search_base=self.config.group_search_base_dns[0], search_filter=nested_filter, search_scope=SUBTREE, attributes=['distinguishedName'], ) return [entry.entry_dn for entry in conn.entries] ``` ### Option B: tokenGroups Attribute ```python def _get_nested_groups_via_token_groups(self, conn: Connection, user_dn: str) -> list[str]: """Resolve nested groups via AD's tokenGroups attribute. tokenGroups contains binary SIDs of all groups (including nested). Requires additional lookup to convert SIDs to DNs. Performance: Two queries (get SIDs, resolve to DNs) Compatibility: AD-only Pros: - Returns security groups only (no distribution lists) - Includes domain-local groups from trusts Cons: - Binary SID handling complexity - Requires SID-to-DN resolution query """ # Step 1: Get tokenGroups (binary SIDs) conn.search( search_base=user_dn, search_filter='(objectClass=*)', search_scope=BASE, attributes=['tokenGroups'], ) if not conn.entries: return [] sids = conn.entries[0].tokenGroups.raw_values # Step 2: Resolve SIDs to DNs groups = [] for sid in sids: sid_string = self._binary_sid_to_string(sid) conn.search( search_base=self.config.user_search_base_dns[0], search_filter=f'(objectSid={sid_string})', search_scope=SUBTREE, attributes=['distinguishedName'], ) if conn.entries: groups.append(conn.entries[0].entry_dn) return groups ``` ### Configuration ```bash # Enable nested group resolution (default: false for backward compatibility) PHOENIX_LDAP_NESTED_GROUPS=true # Method selection (if multiple supported in future) PHOENIX_LDAP_NESTED_GROUPS_METHOD=chain # or "tokengroups" ``` ### References - [MS-ADTS LDAP_MATCHING_RULE_IN_CHAIN](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-adts/) - [tokenGroups attribute](https://docs.microsoft.com/en-us/windows/win32/adschema/a-tokengroups) - [SID structure (MS-DTYP)](https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-dtyp/) --- ## 5. Connection Health Checks **Impact**: Low | **Complexity**: Medium | **Target**: Phase 3 ### Current Behavior Failed servers are discovered at authentication time. ### Problem First user to hit a failed server experiences latency while failover occurs. No proactive alerting for server issues. ### Implementation Sketch ```python import asyncio from typing import Optional class LDAPAuthenticator: def __init__(self, config: LDAPConfig): self.servers = self._create_servers() self._healthy_servers: list[Server] = list(self.servers) self._health_check_task: Optional[asyncio.Task] = None async def start_health_checks(self, interval: float = 30.0) -> None: """Background task to check server health. Performs lightweight anonymous bind or rootDSE query to verify connectivity. Updates _healthy_servers list for prioritized failover. Args: interval: Seconds between health check cycles """ logger.info(f"Starting LDAP health checks (interval={interval}s)") while True: await asyncio.sleep(interval) for server in self.servers: healthy = await self._check_server_health(server) if healthy and server not in self._healthy_servers: self._healthy_servers.append(server) logger.info(f"LDAP server {server.host} recovered") ldap_server_healthy.labels(server=server.host).set(1) elif not healthy and server in self._healthy_servers: self._healthy_servers.remove(server) logger.warning(f"LDAP server {server.host} is unhealthy") ldap_server_healthy.labels(server=server.host).set(0) async def _check_server_health(self, server: Server) -> bool: """Quick connectivity check via rootDSE query. RootDSE is always accessible without authentication and returns server capabilities. Faster than bind operations. """ try: conn = Connection( server, auto_bind=False, receive_timeout=5, # Short timeout for health checks ) conn.open() # Query rootDSE (always accessible) conn.search( search_base='', search_filter='(objectClass=*)', search_scope=BASE, attributes=['namingContexts'], ) conn.unbind() return True except LDAPException: return False async def stop_health_checks(self) -> None: """Stop background health check task.""" if self._health_check_task: self._health_check_task.cancel() try: await self._health_check_task except asyncio.CancelledError: pass self._health_check_task = None async def authenticate(self, username: str, password: str) -> LDAPUserInfo | None: """Authenticate with health-aware server selection.""" # Prioritize healthy servers, fall back to all servers if none healthy servers_to_try = self._healthy_servers if self._healthy_servers else self.servers for server in servers_to_try: try: result = await self._authenticate_on_server(server, username, password) if result: return result except LDAPException: continue return None ``` ### FastAPI Lifespan Integration ```python from contextlib import asynccontextmanager @asynccontextmanager async def lifespan(app: FastAPI): # Startup if ldap_config := get_ldap_config(): authenticator = LDAPAuthenticator(ldap_config) authenticator._health_check_task = asyncio.create_task( authenticator.start_health_checks() ) app.state.ldap_authenticator = authenticator yield # Shutdown if hasattr(app.state, 'ldap_authenticator'): await app.state.ldap_authenticator.stop_health_checks() ``` --- ## 6. Group Search Pagination **Impact**: Low | **Complexity**: Low | **Target**: As needed ### Current Behavior POSIX mode group searches use single unpaginated query. ### Problem LDAP servers typically limit results to 1000 entries (configurable). Organizations with >1000 groups in a search base may get truncated results. ### Implementation Sketch ```python from ldap3.extend.standard import paged_search def _search_groups_paged( self, conn: Connection, search_base: str, search_filter: str, page_size: int = 500 ) -> list[str]: """Search groups with pagination for large directories. Uses LDAP Simple Paged Results Control (RFC 2696) to handle directories with more than 1000 groups. Args: conn: Active LDAP connection search_base: Base DN for group search search_filter: LDAP filter for group membership page_size: Results per page (default 500) Returns: List of all matching group DNs """ groups = [] generator = paged_search.paged_search_generator( conn, search_base=search_base, search_filter=search_filter, search_scope=SUBTREE, attributes=['distinguishedName'], paged_size=page_size, ) for entry in generator: if entry['type'] == 'searchResEntry': groups.append(entry['dn']) logger.debug(f"Paged search returned {len(groups)} groups from {search_base}") return groups ``` ### When to Implement Only implement if a customer reports truncation issues. Signs of this problem: - Users missing expected role assignments - Log messages about partial results - Large enterprise with >1000 LDAP groups ### References - [RFC 2696](https://www.rfc-editor.org/rfc/rfc2696) - LDAP Simple Paged Results Control --- ## Implementation Recommendations ### Phase 2 (Next LDAP Release) Bundle these three improvements for significant operational value: 1. ✅ **Metrics/Observability** - Low effort, essential for production operations 2. ✅ **Retry Logic** - Low effort, improves reliability 3. ⚠️ **Password Policy Handling** - Medium effort, major UX improvement **Estimated Effort**: 2-3 days total ### Phase 3 (Future) Implement based on customer demand: 4. ⏳ **Nested Group Resolution** - Only if customers report AD nested group issues 5. ⏳ **Connection Health Checks** - Only for large multi-server deployments 6. ⏳ **Group Search Pagination** - Only if truncation reported ### Not Recommended - **Connection Pooling**: Complexity vs. benefit tradeoff unfavorable for typical Phoenix workloads (<100 auth/sec). Each auth takes <500ms; pool management overhead may exceed connection setup cost. --- ## Related Documentation - [Main LDAP Spec](../ldap-authentication.md) - Implementation overview - [Security Deep-Dive](./security.md) - Current security measures - [Configuration Reference](./configuration.md) - Environment variables - [Protocol Compliance](./protocol-compliance.md) - RFC compliance details

Latest Blog Posts

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/Arize-ai/phoenix'

If you have feedback or need assistance with the MCP directory API, please join our Discord server