SSL/TLS Certificate Renewal Failures from Certificates Managed Outside Infrastructure-as-Code
Production services suffer unexpected outages from expired TLS certificates because those certificates were provisioned manually through cloud console UIs, domain registrar dashboards, or one-off scripts. Bypassing the infrastructure-as-code pipeline leaves no automated renewal and no expiration monitoring in place. So what? When a certificate expires, HTTPS traffic to the affected domain fails with browser security warnings or hard connection failures, making the affected endpoints unavailable immediately and completely. So what? The engineer who originally provisioned the certificate has often left the company or changed teams, and no runbook exists because the manual process was never documented, which turns a 5-minute renewal into a multi-hour investigation of 'where is this certificate even managed?' So what? During the investigation, customer-facing services remain down, support tickets pile up, and the incident escalates to leadership, consuming engineering management bandwidth on a completely preventable operational failure. So what? After the incident, the team adds a calendar reminder for the next renewal rather than automating it, so the same class of failure is likely to recur at the next expiry, 90 days or 1 year later. So what? Certificate expiration incidents erode organizational credibility with customers and partners, who read TLS failures as a signal of operational immaturity, and that perception affects enterprise sales conversations and partnership evaluations. The structural root cause is that certificate provisioning is split across multiple systems (cloud provider ACM, Let's Encrypt, manual CA purchases, CDN-managed certs) with no single inventory or expiration dashboard. Different certificates were added at different times by different people solving immediate needs, and none of them considered lifecycle management.
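As a stopgap for the missing inventory and expiration dashboard, one option is to probe every public endpoint directly, regardless of which system issued its certificate. The following is a minimal sketch using only the Python standard library. The `ENDPOINTS` list, the 30-day `WARN_DAYS` threshold, and the function names are illustrative assumptions, not part of any existing tooling.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical inventory: list every endpoint you serve, whoever issued its certificate.
ENDPOINTS = [("example.com", 443)]
WARN_DAYS = 30  # assumed alert threshold; tune to your renewal lead time


def days_remaining(not_after: str, now: datetime) -> int:
    """Whole days (rounded down) from `now` until the certificate's notAfter time."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def classify(days: int, warn_days: int = WARN_DAYS) -> str:
    """Map days remaining to a coarse status label."""
    if days < 0:
        return "EXPIRED"
    if days <= warn_days:
        return "WARN"
    return "OK"


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def check(endpoints) -> None:
    now = datetime.now(timezone.utc)
    for host, port in endpoints:
        try:
            days = days_remaining(fetch_not_after(host, port), now)
            print(f"{host}:{port} {classify(days)} ({days} days left)")
        except OSError as exc:
            # An already-expired or otherwise invalid certificate fails
            # verification during the handshake and is reported here.
            print(f"{host}:{port} FAILED ({exc})")


if __name__ == "__main__":
    check(ENDPOINTS)
```

Run on a schedule and wired to an alerting channel, a check like this catches certificates that no pipeline knows about. It only observes expiry, though; it does not replace moving issuance and renewal into the infrastructure-as-code pipeline.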
Evidence
Let's Encrypt chose its 90-day certificate lifetime in part to encourage automation, yet Netcraft surveys show millions of expired certificates on the public internet at any given time. High-profile certificate expiration outages have hit Microsoft Teams (February 2020) and Spotify (multiple occasions), and at Equifax an expired certificate on a traffic-inspection device delayed detection of the 2017 breach. The cert-manager project for Kubernetes has over 10,000 GitHub stars, suggesting strong demand for automated certificate lifecycle management. Qualys SSL Labs scans regularly find that 5-10% of surveyed sites have certificate chain issues.