SaaS Outage Tracker
Real incident history for developer tools and SaaS platforms — not what the status page claims, but what actually happened. Each incident includes the date, duration, affected services, and severity level. The Uptime Scorecard compares claimed uptime percentages against actual measured availability. The Post-Mortem Analysis tab dives deeper into major incidents with root causes, communication grades, and lessons learned.
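The scorecard comparison comes down to simple arithmetic: total the incident durations over a reporting period, convert that to a measured availability percentage, and subtract it from the claimed figure. The sketch below is one way to do that, not the tracker's actual method. The `Incident` fields, function names, 30-day period, and 99.99% claim are all hypothetical, and the two durations are illustrative. It also assumes incidents do not overlap and counts every minute of an incident as full downtime, which overstates the impact of partial degradations.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape; the tracker's real schema is not shown.
    service: str
    downtime_minutes: float


def measured_availability(incidents, period_minutes):
    """Return the percentage of the period not covered by incident downtime.

    Assumes incidents do not overlap and that each one counts as a full outage.
    """
    down = sum(i.downtime_minutes for i in incidents)
    down = min(down, period_minutes)  # never report below 0% availability
    return 100.0 * (1 - down / period_minutes)


def uptime_gap(claimed_pct, incidents, period_minutes):
    """Return claimed minus measured availability, in percentage points."""
    return claimed_pct - measured_availability(incidents, period_minutes)


PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day reporting period

# Illustrative durations only, not a real service's incident history.
incidents = [
    Incident("example-service", 47),
    Incident("example-service", 25),
]

measured = measured_availability(incidents, PERIOD_MINUTES)
gap = uptime_gap(99.99, incidents, PERIOD_MINUTES)
print(f"measured: {measured:.3f}%  gap vs claimed: {gap:.3f} points")
```

A positive gap means the claimed uptime is higher than what the recorded incidents support for that period.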
1.1.1.1 public DNS resolver experiencing intermittent SERVFAIL responses for some queries. Authoritative DNS and Workers unaffected.
All new deploys in Oregon region failing with build timeout errors. Running services unaffected. Caused by storage I/O degradation on build infrastructure.
Llama 3.3 70B and Mixtral endpoints returning 503 'capacity exceeded' errors. Smaller models routed to overflow capacity with elevated latency.
Elevated 500 error rates across GPT-4o and o3 endpoints. Streaming requests most affected.
Site publishes stuck in pending state. CMS item updates not reflecting on live sites. Editor functional but changes not deployable.
Feature flag API returning stale values or timing out under load. Event capture and session recordings unaffected. ClickHouse query pressure affecting flag cache refresh.
Claude 3.5 Sonnet streaming requests experiencing 3-5x normal TTFB. Haiku and Opus endpoints unaffected. Caused by GPU cluster rebalancing.
High-volume senders (>100K/hr) experiencing delivery failures. Emails accepted by API but not delivered. Low-volume senders unaffected.
Event-triggered functions experiencing 5-15 minute execution delays instead of sub-second execution. Cron-triggered functions unaffected. Caused by an event queue processing backlog.
SELECT queries returning internal errors for services in us-east-1. INSERT operations queued but not lost. Caused by ZooKeeper coordination layer restart.
App Platform deployments in NYC1 region failing with build timeout. Existing apps running normally. Functions also affected in same region.
Realtime subscriptions experiencing 5-10 second delays instead of sub-second delivery. Queries and mutations unaffected. Caused by a hot partition in the subscription fanout layer.
Search queries timing out or returning partial results in US region. Data ingestion unaffected. Caused by search head captain election loop after maintenance.
GitLab CI runners unable to pick up new jobs. Pipelines stuck in pending state. Merge requests blocked on pipeline status. Web IDE unaffected.
All new deployments failed for 47 minutes due to build infrastructure issue. Existing deployments unaffected.
Error events accepted but appearing in dashboard with 15-30 minute delay. Alert notifications delayed accordingly. Performance data also lagging.
US long-code SMS messages delayed by 5-15 minutes instead of sub-second delivery. Short-code and toll-free unaffected. Caused by a carrier routing issue.
Multiplayer editing failing — users seeing stale content and edit conflicts. Single-user editing worked but changes not syncing between clients.
S3-compatible Object Storage returning 503 errors in Paris region. Compute instances unaffected. Serverless Functions depending on object storage failing.
Complete API unavailability due to LPU cluster maintenance window extended unexpectedly. No inference capacity served during the window.
REST API and native Redis protocol connections experiencing 3-5x normal latency. QStash webhook deliveries delayed by 2-10 minutes.
All builds queued but not starting execution. Running builds completed normally. Docker layer caching service also degraded.
GitHub Actions queue times increased to 30+ minutes. Pages deployments delayed.
Zap triggers firing but actions queued for 30-90 minutes instead of near-instant. Webhook triggers most affected. Scheduled triggers ran on time.
Voice channels disconnecting and gateway WebSocket failing to reconnect for ~30% of users. Bot APIs degraded. Caused by misconfigured edge routing change.
Cloud Run services in europe-west1 experiencing 10-30 second cold starts vs normal <2s. Warm instances unaffected. Caused by container registry caching issue.
Database branching operations timing out. Existing branches and connections fully operational. Caused by storage layer backlog during internal migration.
Fly Machines API returning timeouts. Running apps stayed up but scaling and deploys failed.
Storefront API response times 3-5x higher than normal globally. Checkout flow unaffected. Headless storefronts experienced slow page loads.
Error events delayed 15-30 minutes before appearing in the dashboard. Alert rules not firing on time. Ingest API accepting events without errors.
Amplitude cohort syncs to destinations (Braze, Iterable, etc.) failing silently. Event ingestion and dashboards unaffected. Caused by destination sync worker OOM.
Database queries returning errors. Auth service intermittently failing. File storage operational. Caused by cloud infrastructure provider network issue.
Mixpanel cohort syncs to Braze, Iterable, and other destinations failing. Event ingestion and dashboards unaffected. Destination sync worker OOM restart loop.
PostgreSQL connections timing out for projects in us-east-1. Caused by underlying AWS networking issue.
JSM queues not processing new tickets. Existing tickets accessible but new submissions stuck in limbo. Confluence and Bitbucket unaffected.
APL queries spanning >7 days returning timeout errors. Sub-day queries and ingest unaffected. Caused by query planner regression.
Sign-in and sign-up endpoints returning 500 for ~25 minutes. Users unable to authenticate. Caused by a failed database migration in the auth token service.
NRQL queries returning errors across all accounts. Dashboards blank. Alert conditions not evaluating. Data ingestion continued normally.
Git operations over HTTPS returning timeouts. SSH git operations partially affected. Web UI accessible but showing stale repository state.
Public model predictions queued for 2-5 minutes vs sub-second baseline. Private deployments unaffected. Caused by autoscaler lag during traffic spike.
Edge replicas in EU regions lagging 15-30 seconds behind primary. Read-after-write consistency broken for EU-deployed apps.
MongoDB Atlas Serverless instances in AWS us-east-1 returning connection refused. Dedicated and shared clusters unaffected. Data API also impacted.
Sticky notes and shapes rendering with 5-10 second delays. Board loading times increased 3x. Caused by CDN cache purge propagation issue.
~20% of Ghost(Pro) hosted blogs returning 502. Admin panel inaccessible for affected sites. Newsletter scheduling delayed. Self-hosted unaffected.
Sora video generation jobs accepted but not processing. ChatGPT and standard API unaffected. Caused by video processing worker pool deadlock.
Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes in real time.
Firestore read latency increased 10x in us-central1. Write operations unaffected.
Large file uploads (>100MB) failing with timeout errors. Small file uploads and downloads unaffected. S3-compatible API also impacted.
Checkout pages for ~15% of stores returning 502 errors. API and dashboard unaffected. Caused by CDN edge misconfiguration after SSL certificate renewal.
Anthropic and Google models returning 503 errors through OpenRouter. Direct provider APIs were functional — issue was in OpenRouter's provider proxy layer routing.
All triggered and scheduled workflows failing with internal server error. Workflow editor accessible but executions not starting. Caused by message broker outage.
Visual Editor failing to connect to preview environments. Content API reads/writes normal. Form-based editing unaffected. Caused by WebSocket proxy issue.
Pull request and branch references not syncing from GitHub. Issue state changes via GitHub not reflected. Manual refresh partially resolved for some users.
All queued tasks stuck in 'pending' state. Running tasks completed but new tasks not picked up. Dashboard showing stale execution status.
Pulumi up/destroy operations failing with state lock timeout. Stack exports and imports also affected. CLI operations against local state unaffected.
Events accepted but appearing in dashboards with 30-60 minute delay. Feature flags and session recordings unaffected. ClickHouse ingestion backlog.
Functions requiring A100 and H100 GPUs failing to start with 'no capacity available' errors. CPU functions unaffected. Caused by upstream cloud GPU shortage in us-east-1.
0.5% of API requests returning 500 errors. Payment processing unaffected for most merchants.
Kibana dashboards returning 502 errors for deployments in GCP us-central1. Elasticsearch queries via API unaffected.
Object operations in eu-central-1 experiencing 5-10x normal latency. US regions unaffected. ListBucket operations most impacted.
Index updates queued for 10-20 minutes instead of near-instant. Search queries using existing index unaffected. Caused by temporary indexing cluster pressure.
Response times degraded from <500ms to 3-5 seconds. Wafer-scale compute cluster rebalancing after hardware maintenance. Smaller models unaffected.
Magic link and OTP emails delayed by 3-8 minutes instead of the usual under 10 seconds. Password auth and OAuth flows unaffected. Caused by an email provider rate limiting issue.
Magic link and OTP emails delayed by 3-8 minutes. OAuth and password auth unaffected. Email provider rate limiting triggered by transient traffic spike.
Continuous aggregates not refreshing on schedule. Raw data queries unaffected. Dashboard views showing stale data up to 2 hours old.
Message send/receive delayed by 30-90 seconds for EU workspaces. Search and integrations also affected. Caused by Vitess shard rebalancing operation.
Serverless driver connections failing intermittently in us-east-2. Direct connections unaffected. Caused by pooler autoscaling misconfiguration.
Workflow history queries returning empty results. Running workflows continued but new workflow starts failing due to deduplication check failures.
Physical server failure in Chicago region. Apps with multi-region setup unaffected. Single-region ORD apps experienced downtime.
Packet loss and elevated latency for servers in Gravelines datacenter. VPS and dedicated servers affected. Internal network between DCs unaffected.
Figma files >500MB failing to load with timeout errors. Smaller files unaffected. Dev Mode and Inspect panel also impacted for large files.
Cloudflare Workers and Pages returning 502 errors globally. Root cause: bad config deployment.
Framer editor preview pane showing blank content. Published sites unaffected. Code components failing to render in editor. Publish still functional.
Alert rules in EU region evaluated with 10-20 minute delay. Dashboards and metrics ingestion unaffected. Caused by Cortex ruler pod restart loop.
Google and GitHub OAuth sign-ins returning 'invalid_grant' for new sessions. Existing sessions unaffected. Email/password and magic link auth working normally.
Custom metrics delayed by 5-15 minutes in US1 region. Alerts based on delayed metrics may have fired late.
All Terraform runs entering pending state and not executing. State file locking working but no plan/apply operations completing. Caused by worker pool scaling failure.
S3 bucket operations and Lambda invocations experiencing elevated error rates in us-east-1.
API accepting requests but emails not being delivered. Webhook delivery confirmations delayed. SMTP relay returning 421 temporary errors.
Content Delivery API response times increased 3x in NA region. Management API unaffected. CDN cache serving stale content for some entries.
Custom metrics showing 10-15 minute ingestion lag. Monitors firing late or not at all. Infrastructure metrics less affected. Caused by intake pipeline partition skew.
Smartscape topology updates delayed by 15-30 minutes. Metrics and log ingestion unaffected. Davis AI alerting slightly delayed.
Messages API returning 529 overloaded errors at 3x normal rate. Batch API unaffected. Claude.ai web interface also experiencing slower responses.
SAML-based SSO logins experiencing 10-20 second delays for enterprise connections. SCIM sync and OAuth unaffected. IDP metadata cache refresh caused the issue.
GitHub Copilot returning errors in IDE for all users. Chat feature also affected. Underlying Azure OpenAI deployment experienced capacity issues.
Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes immediately.
Configuration changes in Konnect not propagating to data plane nodes for 10-20 minutes. Existing routes unaffected. New route creation delayed.
Builds queuing for 20+ minutes instead of usual <2 minutes. Caused by surge in traffic.
InfluxDB Cloud write API returning 503 for all organizations in US region. Reads and queries unaffected. Caused by storage engine compaction backlog.
Schema changes via branch promotion failing with timeout errors. Direct database queries unaffected. Caused by Vitess vttablet resource exhaustion.
Docker image pushes to GitLab Container Registry returning 500 errors. Image pulls from existing tags working. CI jobs building Docker images failing.
Newly created apps not resolving via fly.dev subdomain. Existing apps unaffected. Custom domains working. Caused by DNS zone file sync lag between authoritative nameservers.
GPT-4o returning truncated or low-quality responses. Suspected routing issue to degraded model shard.
Vault secret read operations experiencing 5-10x normal latency. Secret writes unaffected. Auth token validation delayed. Caused by storage backend compaction.
Sign-in and sign-up flows taking 5-10 seconds instead of <1s. No auth failures reported.
Claude Sonnet returning malformed tool_use blocks intermittently for ~5% of requests. Issue traced to model serving layer rollback.
CockroachDB Serverless clusters in US regions returning connection refused errors. Dedicated clusters unaffected. Caused by proxy layer autoscaling bug.
All sign-in attempts returning 500 errors. Session validation failing for existing sessions. Caused by database migration that locked the sessions table.
SMTP server returning temporary 451 errors for ~5% of submission attempts. API delivery unaffected. Caused by SMTP authentication service restart.
Tier 4+ accounts receiving Tier 1 rate limits. Batch API unaffected.
Packet loss and elevated latency for servers in Falkenstein data center. Caused by upstream provider fiber cut. Helsinki and Nuremberg DCs unaffected.
M10+ clusters in AWS us-east-1 experiencing simultaneous primary elections. Applications saw connection resets and write failures for 15-18 minutes. Caused by network partition in MongoDB's management plane.
Edge Functions experiencing 10x normal cold start times. Serverless Functions unaffected.
All deployments failing due to Docker build infrastructure issue. Running services unaffected.
Build queue times exceeding 25 minutes for all tiers including Enterprise. Priority queue not respected. Caused by build image registry corruption requiring rebuild.
New deployments stuck in 'building' state. Rollback to previous deployment also failing. Running services unaffected. Caused by Nixpacks builder OOM during concurrent builds.
Serverless Functions returning 504 timeouts regardless of function duration setting. Edge Functions unaffected. Root cause: internal routing table update.
EC2 instance launches failing in 3 of 6 AZs. Lambda cold starts timing out. ECS task placements failing. Caused by internal DNS resolution failures in control plane.
Payment intent webhooks delayed by 30-90 minutes. Dashboard showing events as pending. Payments processed normally but downstream systems not notified. Caused by event bus partition rebalancing.
R2 bucket reads returning intermittent 500 errors in EU regions. Writes unaffected. Workers binding to R2 saw failures.
Projects in Singapore region hitting connection limits. Pooler returning 'too many connections' despite available capacity. Caused by PgBouncer misconfiguration during scaling event.
Git push, pull, and clone operations failing via both SSH and HTTPS. Web interface functional. Authentication service degraded.
Free tier services experiencing 60-90 second cold starts (normal: 10-15s). Paid services unaffected. Caused by aggressive resource reclamation during holiday traffic spike.
DynamoDB reads experiencing elevated latency and error rates in us-east-1. Cognito authentication failures cascaded to services using federated identity.
Ubuntu-latest runners taking 45+ minutes to start. Windows and macOS runners at 50% capacity. Self-hosted runners unaffected. Caused by capacity crunch during end-of-year CI surge.
Webhook events delayed by 10-30 minutes. Payment processing and API calls unaffected. Caused by event processing queue backup.
Workers KV reads returning stale data or errors. API Gateway routes failing for customers using custom domains. Caused by distributed storage consistency issue during rollout.
Tier 4-5 customers hitting rate limits at 10% of their stated capacity. 429 errors returned with incorrect retry-after headers. Caused by capacity reallocation for new model deployment.
Google and email/password sign-in returning errors intermittently. OAuth redirect flows most affected. Phone auth unaffected.
Edge Functions returning 502 errors in EU-west and EU-central. Static assets served normally. Root cause: BGP route leak from upstream provider affected edge PoPs.
Cluster auto-scaling and manual tier changes failing. Existing clusters operational but could not scale up during high load periods.
ChatGPT returning 503 errors. API returning elevated 500 rates across all models. Streaming endpoints most affected. Caused by infrastructure scaling issue.
OAuth token validation service partial failure after deployment. Copilot stopped working in IDEs. API calls with OAuth tokens returned 401. Login via session cookies unaffected.
D1 replication failure from control plane update. SQLite replicas couldn't sync with primary. 60% of edge locations affected. Workers using D1 experienced full outages.
ISR revalidation worker pool exhausted connections to data cache layer. Revalidation requests silently dropped. Sites served stale content. Status page showed operational for 45 min.
Inference cluster capacity insufficient for sustained traffic spike. Auto-scaling couldn't provision GPU instances fast enough. 30%+ of requests returning 529.
Internal indexing partition split caused read inconsistencies. GET requests returned 404 or stale data for recently written objects. Cascading impact on Lambda, ECS, CloudFront.
Traffic surge from new model launches overwhelmed API gateway. All endpoints returning 503 including GPT-4, Embeddings, and Assistants. ChatGPT also degraded.
Machines API experienced OOM crash loop after deploy request surge. No new deployments or scaling possible. Running machines continued serving traffic.
Network configuration change during maintenance caused routing loop. All services in Oregon unreachable. Other regions unaffected.
GoTrue memory leak triggered by OAuth callback traffic spike. All authentication operations failed globally. Existing sessions unaffected.
Configuration push caused cache invalidation storm across all PoPs. KV reads returned errors or stale data. R2 and D1 unaffected.
Storage backend migration caused job dispatcher slowdown. Jobs accepted but not dispatched to runners. Queue grew to 500K+ pending jobs. Self-hosted runners unaffected.
New signing key deployed before public key propagated to edge nodes. All session verifications returned 401. Users logged out of every Clerk-powered app.
V8 isolate pool exhaustion caused Edge Functions to time out or return 504 errors. Static assets and ISR unaffected. All regions impacted simultaneously.
Control plane update caused Lambda sandbox provisioning to take 5-10x longer. Cold starts exceeded 10 seconds. Warm invocations unaffected.
Nixpacks builder update caused builds to hang on certain Node.js projects. Build queue backed up consuming all builder capacity. Existing deployments unaffected.
CouchDB replication lag caused package metadata inconsistencies. Some packages not found, others returned stale versions. Affected npm, yarn, and pnpm.
Webhook queue backed up after database partition rebalance. Payments processed normally but webhook notifications delayed 1-4 hours. Order fulfillment systems stalled.
Rate limiting logic incorrectly classified authenticated pulls as anonymous. CI/CD pipelines using Docker Hub images failed with 429 errors globally.
Bigtable replication lag in nam5 multi-region caused write commits to fail. Reads from cache worked but realtime listeners stopped. Auth and Hosting unaffected.
Vitess version upgrade caused VReplication streams to stall. Deploy requests and branch merges stuck in pending state. Database reads/writes normal.
Database export tools timing out as thousands of free tier users attempted migration before deadline. Export queue backed up 6+ hours. Emergency capacity added.