
SaaS Outage Tracker

Real incident history for developer tools and SaaS platforms — not what the status page claims, but what actually happened. Each incident includes the date, duration, affected services, and severity level. The Uptime Scorecard compares claimed uptime percentages against actual measured availability. The Post-Mortem Analysis tab dives deeper into major incidents with root causes, communication grades, and lessons learned.
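As a rough illustration of the scorecard idea, the sketch below converts duration strings in the format used by this tracker ("2h 10min", "47 min", "2h") into minutes and derives a measured availability percentage for a time window. The function names, the 30-day window, and the claimed figure are hypothetical, and the sketch assumes every incident counts as full downtime for its whole duration, which overstates the impact of degraded or partial incidents.

```python
import re


def parse_duration_minutes(text: str) -> int:
    """Convert a duration such as '2h 10min', '47 min', or '2h' to minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def measured_availability(durations: list[str], window_days: int) -> float:
    """Return availability in percent, treating each incident as full downtime.

    Assumption: incidents do not overlap and each one makes the service
    completely unavailable, which is pessimistic for partial outages.
    """
    window_minutes = window_days * 24 * 60
    downtime = sum(parse_duration_minutes(d) for d in durations)
    return 100.0 * (1 - downtime / window_minutes)


if __name__ == "__main__":
    claimed = 99.99  # hypothetical figure from a status page
    measured = measured_availability(["2h 30min", "47 min"], window_days=30)
    print(f"claimed {claimed:.2f}% vs measured {measured:.3f}%")
```

Weighting each incident by its severity level, rather than counting it as full downtime, would give a less pessimistic estimate.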

Degraded: Cloudflare
2026-04-26
20 min
1.1.1.1 DNS resolver intermittent failures

1.1.1.1 public DNS resolver experiencing intermittent SERVFAIL responses for some queries. Authoritative DNS and Workers unaffected.

1.1.1.1 Resolver
Major: Render
2026-04-24
2h 10min
Deploy failures on Oregon region

All new deploys in Oregon region failing with build timeout errors. Running services unaffected. Caused by storage I/O degradation on build infrastructure.

Deploys, Builds
Major: Groq
2026-04-23
1h 25min
LPU inference cluster overloaded

Llama 3.3 70B and Mixtral endpoints returning 503 'capacity exceeded' errors. Smaller models routed to overflow capacity with elevated latency.

Chat Completions, llama-3.3-70b, mixtral-8x7b
Major: OpenAI
2026-04-22
1h 30min
API 500 errors — all models

Elevated 500 error rates across GPT-4o and o3 endpoints. Streaming requests most affected.

Chat Completions, Embeddings
Major: Webflow
2026-04-22
1h 05min
CMS publishing failures across all sites

Site publishes stuck in pending state. CMS item updates not reflecting on live sites. Editor functional but changes not deployable.

Publishing, CMS, Hosting
Partial: PostHog
2026-04-22
1h 05min
Feature flag evaluations degraded

Feature flag API returning stale values or timing out under load. Event capture and session recordings unaffected. ClickHouse query pressure affecting flag cache refresh.

Feature Flags, Flag API
Partial: Anthropic
2026-04-21
1h 15min
Elevated latency on Claude Sonnet endpoints

Claude 3.5 Sonnet streaming requests experiencing 3-5x normal TTFB. Haiku and Opus endpoints unaffected. Caused by GPU cluster rebalancing.

API, claude-3-5-sonnet
Major: SendGrid
2026-04-21
1h 10min
Transactional email delivery failures for high-volume senders

High-volume senders (>100K/hr) experiencing delivery failures. Emails accepted by API but not delivered. Low-volume senders unaffected.

Mail Send API, Delivery, SMTP
Partial: Inngest
2026-04-20
55 min
Function invocation delays

Event-triggered functions experiencing 5-15 minute execution delays instead of sub-second. Cron-triggered functions unaffected. Event queue processing backlog.

Function Triggers, Event Processing
Major: ClickHouse Cloud
2026-04-20
1h 30min
Query processing failures in AWS us-east-1

SELECT queries returning internal errors for services in us-east-1. INSERT operations queued but not lost. Caused by ZooKeeper coordination layer restart.

Queries, Dashboards
Partial: DigitalOcean
2026-04-20
55 min
App Platform deploy failures in NYC1

App Platform deployments in NYC1 region failing with build timeout. Existing apps running normally. Functions also affected in same region.

App Platform, Functions, Deployments
Degraded: Convex
2026-04-20
15 min
Realtime subscription delays

Realtime subscriptions experiencing 5-10 second delays instead of sub-second. Queries and mutations unaffected. Caused by a hot partition in the subscription fanout layer.

Realtime subscriptions
Partial: Splunk Cloud
2026-04-19
1h 20min
Search head cluster degradation

Search queries timing out or returning partial results in US region. Data ingestion unaffected. Caused by search head captain election loop after maintenance.

Search, Dashboards, Alerts
Major: GitLab.com
2026-04-19
1h 40min
CI/CD pipeline execution failures

GitLab CI runners unable to pick up new jobs. Pipelines stuck in pending state. Merge requests blocked on pipeline status. Web IDE unaffected.

CI/CD, Runners, Pipelines
Major: Vercel
2026-04-18
47 min
Global deployment failures

All new deployments failed for 47 minutes due to build infrastructure issue. Existing deployments unaffected.

Deployments, Build API
Degraded: Sentry
2026-04-18
40 min
Event processing delays in US region

Error events accepted but appearing in dashboard with 15-30 minute delay. Alert notifications delayed accordingly. Performance data also lagging.

Event Processing, Alerts, Performance
Degraded: Twilio
2026-04-18
35 min
Programmable Messaging SMS delivery delays

US long-code SMS messages delayed by 5-15 minutes instead of sub-second. Short-code and toll-free unaffected. Carrier routing issue.

SMS Messaging, Programmable Messaging
Major: Notion
2026-04-17
1h 10min
Real-time collaboration broken

Multiplayer editing failing — users seeing stale content and edit conflicts. Single-user editing worked but changes not syncing between clients.

Real-time Sync, Collaboration, API
Major: Scaleway
2026-04-17
1h 30min
Object Storage API errors in PAR1

S3-compatible Object Storage returning 503 errors in Paris region. Compute instances unaffected. Serverless Functions depending on object storage failing.

Object Storage, Serverless Functions
Major: Groq
2026-04-17
55 min
All model endpoints returning 503 errors

Complete API unavailability due to LPU cluster maintenance window extended unexpectedly. No inference capacity served during the window.

API, All Models
Degraded: Upstash
2026-04-16
45 min
Elevated Redis latency in US East

REST API and native Redis protocol connections experiencing 3-5x normal latency. QStash webhook deliveries delayed by 2-10 minutes.

Redis, QStash
Major: CircleCI
2026-04-16
1h 15min
Build queue processing halted

All builds queued but not starting execution. Running builds completed normally. Docker layer caching service also degraded.

Builds, Queue, Docker Layer Cache
Partial: GitHub
2026-04-15
1h 30min
Actions and Pages degraded

GitHub Actions queue times increased to 30+ minutes. Pages deployments delayed.

Actions, Pages
Major: Zapier
2026-04-15
2h 30min
Zap execution engine backlog

Zap triggers firing but actions queued for 30-90 minutes instead of near-instant. Webhook triggers most affected. Scheduled triggers ran on time.

Zap Execution, Webhooks, Actions
Major: Discord
2026-04-15
2h 5min
Voice and gateway connection failures

Voice channels disconnecting and gateway WebSocket failing to reconnect for ~30% of users. Bot APIs degraded. Caused by misconfigured edge routing change.

Voice, Gateway, Bot API
Degraded: Google Cloud Run
2026-04-15
40 min
Cold start latency spike in europe-west1

Cloud Run services in europe-west1 experiencing 10-30 second cold starts vs normal <2s. Warm instances unaffected. Caused by container registry caching issue.

Services, Cold Starts
Partial: Neon
2026-04-15
1h 10min
Branch creation and deletion failing

Database branching operations timing out. Existing branches and connections fully operational. Caused by storage layer backlog during internal migration.

Branch creation, Branch deletion, Dashboard
Major: Fly.io
2026-04-14
2h 30min
Machines API unresponsive

Fly Machines API returning timeouts. Running apps stayed up but scaling and deploys failed.

Machines API, Deployments
Degraded: Shopify
2026-04-14
35 min
Storefront API elevated latency

Storefront API response times 3-5x higher than normal globally. Checkout flow unaffected. Headless storefronts experienced slow page loads.

Storefront API, GraphQL
Degraded: Sentry
2026-04-14
40 min
Event ingestion delays — US region

Error events delayed 15-30 minutes before appearing in the dashboard. Alert rules not firing on time. Ingest API accepting events without errors.

Event Ingestion, Alerts, Performance
Partial: Amplitude
2026-04-13
1h 35min
Cohort sync failures to downstream tools

Amplitude cohort syncs to destinations (Braze, Iterable, etc.) failing silently. Event ingestion and dashboards unaffected. Caused by destination sync worker OOM.

Cohort Sync, Integrations
Major: Appwrite Cloud
2026-04-13
1h 45min
Database and auth service errors

Database queries returning errors. Auth service intermittently failing. File storage operational. Caused by cloud infrastructure provider network issue.

Database, Auth, Functions
Partial: Mixpanel
2026-04-13
50 min
Cohort sync failures to downstream destinations

Mixpanel cohort syncs to Braze, Iterable, and other destinations failing. Event ingestion and dashboards unaffected. Destination sync worker OOM restart loop.

Cohort Sync, Integrations
Major: Supabase
2026-04-12
2h 15min
Database connectivity issues — US East

PostgreSQL connections timing out for projects in us-east-1. Caused by underlying AWS networking issue.

Database, Auth, Realtime
Major: Jira Cloud
2026-04-12
2h
Jira Service Management queue failures

JSM queues not processing new tickets. Existing tickets accessible but new submissions stuck in limbo. Confluence and Bitbucket unaffected.

JSM Queues, Ticket Creation, Automation
Partial: Axiom
2026-04-12
50 min
Query API timeouts for large time ranges

APL queries spanning >7 days returning timeout errors. Sub-day queries and ingest unaffected. Caused by query planner regression.

Query API, APL
Major: Clerk
2026-04-12
25 min
Authentication API returning 500 errors

Sign-in and sign-up endpoints returning 500 for ~25 minutes. Users unable to authenticate. Caused by a failed database migration in the auth token service.

Sign-in, Sign-up, Session validation
Major: New Relic
2026-04-11
1h 45min
NRQL query engine unavailable

NRQL queries returning errors across all accounts. Dashboards blank. Alert conditions not evaluating. Data ingestion continued normally.

NRQL, Dashboards, Alerting
Major: Bitbucket Cloud
2026-04-11
1h 50min
Git push/pull operations timing out

Git operations over HTTPS returning timeouts. SSH git operations partially affected. Web UI accessible but showing stale repository state.

Git Operations, HTTPS, SSH
Partial: Replicate
2026-04-11
55 min
Prediction queue backlog

Public model predictions queued for 2-5 minutes vs sub-second baseline. Private deployments unaffected. Caused by autoscaler lag during traffic spike.

Public Models, Predictions API
Partial: Turso
2026-04-10
1h 20min
Edge replica sync delays across EU regions

Edge replicas in EU regions lagging 15-30 seconds behind primary. Read-after-write consistency broken for EU-deployed apps.

Edge Replicas, Replication
Major: MongoDB Atlas
2026-04-10
1h 20min
Serverless instance connection failures

MongoDB Atlas Serverless instances in AWS us-east-1 returning connection refused. Dedicated and shared clusters unaffected. Data API also impacted.

Serverless Instances, Data API
Degraded: Miro
2026-04-09
35 min
Board widget rendering delays

Sticky notes and shapes rendering with 5-10 second delays. Board loading times increased 3x. Caused by CDN cache purge propagation issue.

Board Rendering, CDN
Major: Ghost(Pro)
2026-04-09
1h 30min
Ghost(Pro) sites returning 502 errors

~20% of Ghost(Pro) hosted blogs returning 502. Admin panel inaccessible for affected sites. Newsletter scheduling delayed. Self-hosted unaffected.

Hosting, Admin Panel, Newsletter
Major: OpenAI
2026-04-09
2h 40min
Sora video generation queue stuck

Sora video generation jobs accepted but not processing. ChatGPT and standard API unaffected. Caused by video processing worker pool deadlock.

Sora, Video Generation
Degraded: Linear
2026-04-09
30 min
Sync delays for large workspaces

Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes in real time.

Real-time Sync, App
Partial: Firebase
2026-04-08
1h 45min
Firestore read latency spike

Firestore read latency increased 10x in us-central1. Write operations unaffected.

Firestore, Cloud Functions
Partial: Backblaze B2
2026-04-08
45 min
B2 API upload failures in US West

Large file uploads (>100MB) failing with timeout errors. Small file uploads and downloads unaffected. S3-compatible API also impacted.

Upload API, S3 API, Large Files
Degraded: Lemon Squeezy
2026-04-08
45 min
Checkout page loading failures

Checkout pages for ~15% of stores returning 502 errors. API and dashboard unaffected. Caused by CDN edge misconfiguration after SSL certificate renewal.

Checkout, Storefront
Major: OpenRouter
2026-04-08
2h 15min
Multiple model providers unavailable

Anthropic and Google models returning 503 errors through OpenRouter. Direct provider APIs were functional — issue was in OpenRouter's provider proxy layer routing.

Claude models, Gemini models, API routing
Major: n8n Cloud
2026-04-07
1h 15min
Workflow execution failures across all regions

All triggered and scheduled workflows failing with internal server error. Workflow editor accessible but executions not starting. Caused by message broker outage.

Executions, Triggers, Webhooks
Partial: Storyblok
2026-04-07
1h
Visual Editor connection timeouts

Visual Editor failing to connect to preview environments. Content API reads/writes normal. Form-based editing unaffected. Caused by WebSocket proxy issue.

Visual Editor, Preview
Degraded: Linear
2026-04-06
25 min
GitHub integration sync failures

Pull request and branch references not syncing from GitHub. Issue state changes via GitHub not reflected. Manual refresh partially resolved for some users.

GitHub Integration, Sync
Major: Trigger.dev
2026-04-05
1h 45min
Task execution engine unresponsive

All queued tasks stuck in 'pending' state. Running tasks completed but new tasks not picked up. Dashboard showing stale execution status.

Task Execution, Queue, Dashboard
Partial: Pulumi Cloud
2026-04-03
1h 05min
Stack state operations timing out

Pulumi up/destroy operations failing with state lock timeout. Stack exports and imports also affected. CLI operations against local state unaffected.

State Management, Deployments
Partial: PostHog
2026-03-30
1h 10min
Event ingestion lag in US Cloud

Events accepted but appearing in dashboards with 30-60 minute delay. Feature flags and session recordings unaffected. ClickHouse ingestion backlog.

Event Ingestion, Dashboards
Major: Modal
2026-03-29
1h 50min
GPU function cold starts failing

Functions requiring A100 and H100 GPUs failing to start with 'no capacity available' errors. CPU functions unaffected. Caused by upstream cloud GPU shortage in us-east-1.

GPU Functions, Image Builds
Degraded: Stripe
2026-03-28
23 min
Elevated API error rates

0.5% of API requests returning 500 errors. Payment processing unaffected for most merchants.

API
Degraded: Elastic Cloud
2026-03-28
55 min
Kibana dashboard loading failures

Kibana dashboards returning 502 errors for deployments in GCP us-central1. Elasticsearch queries via API unaffected.

Kibana, Dashboards
Partial: Wasabi
2026-03-28
1h 10min
Elevated latency in EU Central region

Object operations in eu-central-1 experiencing 5-10x normal latency. US regions unaffected. ListBucket operations most impacted.

Storage API, EU Central
Degraded: Algolia
2026-03-28
35 min
Indexing delays for large index pushes

Index updates queued for 10-20 minutes instead of near-instant. Search queries using existing index unaffected. Caused by temporary indexing cluster pressure.

Indexing API, Index Updates
Partial: Cerebras
2026-03-28
45 min
Inference latency spike on Llama 3.3 70B

Response times degraded from <500ms to 3-5 seconds. Wafer-scale compute cluster rebalancing after hardware maintenance. Smaller models unaffected.

Llama 3.3 70B inference
Degraded: Stytch
2026-03-27
25 min
Magic link delivery delays

Magic link and OTP emails delayed by 3-8 minutes instead of sub-10s. Password auth and OAuth flows unaffected. Email provider rate limiting issue.

Magic Links, OTP Email
Degraded: Stytch
2026-03-27
25 min
Magic link and OTP email delivery delays

Magic link and OTP emails delayed by 3-8 minutes. OAuth and password auth unaffected. Email provider rate limiting triggered by transient traffic spike.

Magic Links, OTP Email
Partial: Timescale Cloud
2026-03-26
55 min
Continuous aggregate refresh failures

Continuous aggregates not refreshing on schedule. Raw data queries unaffected. Dashboard views showing stale data up to 2 hours old.

Continuous Aggregates, Scheduled Jobs
Partial: Slack
2026-03-26
1h 15min
Message delivery delays in EU workspaces

Message send/receive delayed by 30-90 seconds for EU workspaces. Search and integrations also affected. Caused by Vitess shard rebalancing operation.

Messaging, Search, Integrations
Partial: Neon
2026-03-25
1h 40min
Connection pooler errors in us-east-2

Serverless driver connections failing intermittently in us-east-2. Direct connections unaffected. Caused by pooler autoscaling misconfiguration.

Serverless Driver, Connection Pooler
Major: Temporal Cloud
2026-03-23
1h 50min
Workflow execution history unavailable

Workflow history queries returning empty results. Running workflows continued but new workflow starts failing due to deduplication check failures.

History Service, Workflow Starts
Major: Fly.io
2026-03-22
3h
ORD region hardware failure

Physical server failure in Chicago region. Apps with multi-region setup unaffected. Single-region ORD apps experienced downtime.

Compute, Volumes
Major: OVHcloud
2026-03-22
2h 15min
Network degradation in GRA datacenter

Packet loss and elevated latency for servers in Gravelines datacenter. VPS and dedicated servers affected. Internal network between DCs unaffected.

Networking, VPS, Dedicated Servers
Partial: Figma
2026-03-21
50 min
File loading timeouts for large projects

Figma files >500MB failing to load with timeout errors. Smaller files unaffected. Dev Mode and Inspect panel also impacted for large files.

Editor, Dev Mode, File Loading
Major: Cloudflare
2026-03-20
35 min
Workers and Pages outage

Cloudflare Workers and Pages returning 502 errors globally. Root cause: bad config deployment.

Workers, Pages, KV
Partial: Framer
2026-03-20
55 min
Editor preview rendering failures

Framer editor preview pane showing blank content. Published sites unaffected. Code components failing to render in editor. Publish still functional.

Editor, Preview, Code Components
Partial: Grafana Cloud
2026-03-19
50 min
Alerting evaluation delays in EU stack

Alert rules in EU region evaluated with 10-20 minute delay. Dashboards and metrics ingestion unaffected. Caused by Cortex ruler pod restart loop.

Alerting, Ruler
Partial: Supabase
2026-03-19
55 min
Auth provider OAuth callback failures

Google and GitHub OAuth sign-ins returning 'invalid_grant' for new sessions. Existing sessions unaffected. Email/password and magic link auth working normally.

Auth, OAuth
Partial: Datadog
2026-03-18
42 min
Metrics ingestion delays

Custom metrics delayed by 5-15 minutes in US1 region. Alerts based on delayed metrics may have fired late.

Metrics, Monitors
Major: Terraform Cloud
2026-03-17
1h 25min
Plan and apply runs stuck in queue

All Terraform runs entering pending state and not executing. State file locking working but no plan/apply operations completing. Caused by worker pool scaling failure.

Runs, Plans, Applies
Major: AWS
2026-03-15
4h
us-east-1 S3 and Lambda degraded

S3 bucket operations and Lambda invocations experiencing elevated error rates in us-east-1.

S3, Lambda, API Gateway
Major: Resend
2026-03-15
1h 30min
Email delivery failures across all regions

API accepting requests but emails not being delivered. Webhook delivery confirmations delayed. SMTP relay returning 421 temporary errors.

Email Delivery, SMTP, Webhooks
Degraded: Contentstack
2026-03-15
35 min
Content Delivery API slow responses

Content Delivery API response times increased 3x in NA region. Management API unaffected. CDN cache serving stale content for some entries.

Content Delivery API
Partial: Datadog
2026-03-15
1h 30min
Metric ingestion delays in US1 region

Custom metrics showing 10-15 minute ingestion lag. Monitors firing late or not at all. Infrastructure metrics less affected. Caused by intake pipeline partition skew.

Metrics, Monitors, Dashboards
Degraded: Dynatrace
2026-03-14
40 min
Smartscape topology map delays

Smartscape topology updates delayed by 15-30 minutes. Metrics and log ingestion unaffected. Davis AI alerting slightly delayed.

Smartscape, Davis AI
Partial: Anthropic
2026-03-14
50 min
Messages API rate limit errors

Messages API returning 529 overloaded errors at 3x normal rate. Batch API unaffected. Claude.ai web interface also experiencing slower responses.

Messages API, Claude.ai
Degraded: WorkOS
2026-03-14
22 min
SSO SAML assertion validation latency

SAML-based SSO logins experiencing 10-20 second delays for enterprise connections. SCIM sync and OAuth unaffected. IDP metadata cache refresh caused the issue.

SSO, SAML
Major: GitHub
2026-03-13
1h 20min
Copilot completions unavailable

GitHub Copilot returning errors in IDE for all users. Chat feature also affected. Underlying Azure OpenAI deployment experienced capacity issues.

Copilot, Copilot Chat
Degraded: Linear
2026-03-12
35 min
Sync delays for large workspaces

Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes immediately.

Real-time Sync, App
Degraded: Kong Konnect
2026-03-11
30 min
Control plane sync delays

Configuration changes in Konnect not propagating to data plane nodes for 10-20 minutes. Existing routes unaffected. New route creation delayed.

Control Plane, Config Sync
Partial: Netlify
2026-03-10
55 min
Build queue backlog

Builds queuing for 20+ minutes instead of usual <2 minutes. Caused by surge in traffic.

Builds
Major: InfluxDB Cloud
2026-03-09
1h 40min
Write endpoint rejecting data points

InfluxDB Cloud write API returning 503 for all organizations in US region. Reads and queries unaffected. Caused by storage engine compaction backlog.

Write API, Data Ingestion
Major: PlanetScale
2026-03-08
2h 30min
Branch promotion failures in US East

Schema changes via branch promotion failing with timeout errors. Direct database queries unaffected. Caused by Vitess vttablet resource exhaustion.

Branch Promotions, Schema Changes
Partial: GitLab.com
2026-03-08
50 min
Container Registry push failures

Docker image pushes to GitLab Container Registry returning 500 errors. Image pulls from existing tags working. CI jobs building Docker images failing.

Container Registry, CI/CD
Partial: Fly.io
2026-03-08
2h 40min
DNS propagation failures for new apps

Newly created apps not resolving via fly.dev subdomain. Existing apps unaffected. Custom domains working. Caused by DNS zone file sync lag between authoritative nameservers.

DNS, fly.dev Subdomains, New Apps
Degraded: OpenAI
2026-03-05
45 min
GPT-4o response quality degradation

GPT-4o returning truncated or low-quality responses. Suspected routing issue to degraded model shard.

Chat Completions
Partial: HCP Vault
2026-02-28
45 min
Secret read latency spike in US region

Vault secret read operations experiencing 5-10x normal latency. Secret writes unaffected. Auth token validation delayed. Caused by storage backend compaction.

Secret Engine, Auth
Degraded: Clerk
2026-02-25
18 min
Sign-in latency increase

Sign-in and sign-up flows taking 5-10 seconds instead of <1s. No auth failures reported.

Authentication
Degraded: Anthropic
2026-02-21
35 min
Tool use responses malformed

Claude Sonnet returning malformed tool_use blocks intermittently for ~5% of requests. Issue traced to model serving layer rollback.

Tool Use, Messages API
Major: CockroachDB
2026-02-20
1h 55min
Serverless cluster connection failures

CockroachDB Serverless clusters in US regions returning connection refused errors. Dedicated clusters unaffected. Caused by proxy layer autoscaling bug.

Serverless Clusters, Connection Proxy
Major: Clerk
2026-02-20
42 min
Authentication failures across all sign-in methods

All sign-in attempts returning 500 errors. Session validation failing for existing sessions. Caused by database migration that locked the sessions table.

Sign-in, Session Validation, OAuth
Degraded: Postmark
2026-02-19
28 min
SMTP submission elevated error rates

SMTP server returning temporary 451 errors for ~5% of submission attempts. API delivery unaffected. Caused by SMTP authentication service restart.

SMTP, Email Delivery
Partial: OpenAI
2026-02-18
1h
Rate limits applied incorrectly

Tier 4+ accounts receiving Tier 1 rate limits. Batch API unaffected.

Rate Limits, Chat Completions
Major: Hetzner
2026-02-12
2h 40min
Falkenstein DC network degradation

Packet loss and elevated latency for servers in Falkenstein data center. Caused by upstream provider fiber cut. Helsinki and Nuremberg DCs unaffected.

Cloud Servers, Dedicated Servers, Networking
Major: MongoDB Atlas
2026-02-12
18 min
Unexpected primary elections across multiple clusters

M10+ clusters in AWS us-east-1 experiencing simultaneous primary elections. Applications saw connection resets and write failures for 15-18 minutes. Caused by network partition in MongoDB's management plane.

Cluster Availability, Write Operations
Partial: Vercel
2026-02-10
1h 10min
Edge Functions cold starts

Edge Functions experiencing 10x normal cold start times. Serverless Functions unaffected.

Edge Functions
Major: Railway
2026-02-05
3h
Platform-wide deployment failures

All deployments failing due to Docker build infrastructure issue. Running services unaffected.

Deployments, Builds
Partial: Netlify
2026-02-05
3h 15min
Build queue backlog across all plans

Build queue times exceeding 25 minutes for all tiers including Enterprise. Priority queue not respected. Caused by build image registry corruption requiring rebuild.

Builds, Deploy Previews
Major: Railway
2026-01-28
1h 55min
Deployment pipeline failures and rollback issues

New deployments stuck in 'building' state. Rollback to previous deployment also failing. Running services unaffected. Caused by Nixpacks builder OOM during concurrent builds.

Deployments, Builds, Rollbacks
Major: Vercel
2026-01-23
1h 52min
Serverless Functions timing out globally

Serverless Functions returning 504 timeouts regardless of function duration setting. Edge Functions unaffected. Root cause: internal routing table update.

Serverless Functions, API Routes
Major: AWS
2026-01-22
2h 30min
us-east-1 EC2 and Lambda partial availability

EC2 instance launches failing in 3 of 6 AZs. Lambda cold starts timing out. ECS task placements failing. Caused by internal DNS resolution failures in control plane.

EC2, Lambda, ECS, Internal DNS
Partial: Stripe
2026-01-15
2h 10min
Webhook delivery backlog exceeding 30 minutes

Payment intent webhooks delayed by 30-90 minutes. Dashboard showing events as pending. Payments processed normally but downstream systems not notified. Caused by event bus partition rebalancing.

Webhooks, Event Delivery
Degraded: Cloudflare
2026-01-14
28 min
R2 Storage elevated error rates

R2 bucket reads returning intermittent 500 errors in EU regions. Writes unaffected. Workers binding to R2 saw failures.

R2 Storage, Workers R2 bindings
Major: Supabase
2026-01-09
1h 45min
Database connection storms in ap-southeast-1

Projects in Singapore region hitting connection limits. Pooler returning 'too many connections' despite available capacity. Caused by PgBouncer misconfiguration during scaling event.

Database, Connection Pooler, PostgREST
Major: GitHub
2026-01-08
2h 10min
Git SSH and HTTPS operations failing

Git push, pull, and clone operations failing via both SSH and HTTPS. Web interface functional. Authentication service degraded.

Git Operations, SSH, HTTPS
Degraded: Render
2025-12-22
5h
Free tier cold starts exceeding 60 seconds

Free tier services experiencing 60-90 second cold starts (normal: 10-15s). Paid services unaffected. Caused by aggressive resource reclamation during holiday traffic spike.

Free Tier, Cold Starts
Major: AWS
2025-12-18
3h 20min
us-east-1 DynamoDB and Cognito elevated errors

DynamoDB reads experiencing elevated latency and error rates in us-east-1. Cognito authentication failures cascade-impacted services using federated identity.

DynamoDB, Cognito, AppSync
Partial: GitHub
2025-12-18
4h 20min
Actions runners severely degraded

Ubuntu-latest runners taking 45+ minutes to start. Windows and macOS runners at 50% capacity. Self-hosted runners unaffected. Caused by capacity crunch during end-of-year CI surge.

Actions, Hosted Runners
Partial: Stripe
2025-12-10
41 min
Webhook delivery delays

Webhook events delayed by 10-30 minutes. Payment processing and API calls unaffected. Caused by event processing queue backup.

Webhooks
Major: Cloudflare
2025-12-03
53 min
API Gateway and Workers KV global outage

Workers KV reads returning stale data or errors. API Gateway routes failing for customers using custom domains. Caused by distributed storage consistency issue during rollout.

Workers KV, API Gateway, Custom Domains
Partial: OpenAI
2025-11-28
6h
Aggressive rate limiting on GPT-4 and o1 endpoints

Tier 4-5 customers hitting rate limits at 10% of their stated capacity. 429 errors returned with incorrect retry-after headers. Caused by capacity reallocation for new model deployment.

API, GPT-4, o1, Rate Limits
Partial: Firebase
2025-11-22
1h 15min
Firebase Auth sign-in methods degraded

Google and email/password sign-in returning errors intermittently. OAuth redirect flows most affected. Phone auth unaffected.

Auth
Major: Vercel
2025-11-14
1h 12min
Edge network routing failures across EU regions

Edge Functions returning 502 errors in EU-west and EU-central. Static assets served normally. Root cause: BGP route leak from upstream provider affected edge PoPs.

Edge Functions, Middleware, ISR
Major: MongoDB Atlas
2025-11-12
2h 5min
Atlas cluster scaling failures in AWS us-east-1

Cluster auto-scaling and manual tier changes failing. Existing clusters operational but could not scale up during high load periods.

Cluster Management, Auto-scaling
Major: OpenAI
2025-11-05
3h 20min
ChatGPT and API widespread unavailability

ChatGPT returning 503 errors. API returning elevated 500 rates across all models. Streaming endpoints most affected. Caused by infrastructure scaling issue.

ChatGPT, API, Chat Completions, Assistants
Partial: GitHub
2025-03-12
1h 55min
Copilot and OAuth token validation failures

OAuth token validation service partial failure after deployment. Copilot stopped working in IDEs. API calls with OAuth tokens returned 401. Login via session cookies unaffected.

Copilot, OAuth, API Auth
Partial: Cloudflare
2025-02-28
1h 30min
D1 database unavailability — multiple regions

D1 replication failure from control plane update. SQLite replicas couldn't sync with primary. 60% of edge locations affected. Workers using D1 experienced full outages.

D1, Workers
Partial: Vercel
2025-02-14
2h 00min
ISR revalidation silently failing

ISR revalidation worker pool exhausted connections to data cache layer. Revalidation requests silently dropped. Sites served stale content. Status page showed operational for 45 min.

ISR, Data Cache, On-demand Revalidation
Partial: Anthropic
2025-01-23
2h 05min
Claude API 529 overload errors

Inference cluster capacity insufficient for sustained traffic spike. Auto-scaling couldn't provision GPU instances fast enough. 30%+ of requests returning 529.

Messages API, Claude.ai
Major: AWS
2025-01-08
2h 10min
S3 elevated error rates — us-east-1

Internal indexing partition split caused read inconsistencies. GET requests returned 404 or stale data for recently written objects. Cascading impact on Lambda, ECS, CloudFront.

S3, CloudFront, Lambda, ECS
Major: OpenAI
2024-12-11
4h 15min
Full API outage during Sora/o1 launch

Traffic surge from new model launches overwhelmed API gateway. All endpoints returning 503 including GPT-4, Embeddings, and Assistants. ChatGPT also degraded.

Chat Completions, Embeddings, Assistants, ChatGPT
Major: Fly.io
2024-12-05
2h 40min
Machines API OOM crash loop — global deploy freeze

Machines API experienced OOM crash loop after deploy request surge. No new deployments or scaling possible. Running machines continued serving traffic.

Machines API, Deployments, Volumes
Major: Render
2024-11-22
1h 50min
Oregon region complete outage

Network configuration change during maintenance caused routing loop. All services in Oregon unreachable. Other regions unaffected.

Web Services, Databases, Cron Jobs
Major: Supabase
2024-11-18
1h 45min
Auth service outage — all regions

GoTrue memory leak triggered by OAuth callback traffic spike. All authentication operations failed globally. Existing sessions unaffected.

Auth, OAuth, Magic Links
Major: Cloudflare
2024-11-01
55 min
Workers KV global read failures

Configuration push caused cache invalidation storm across all PoPs. KV reads returned errors or stale data. R2 and D1 unaffected.

Workers KV, Workers
Major: GitHub
2024-10-30
5h 40min
Actions job queue delays worldwide

Storage backend migration caused job dispatcher slowdown. Jobs accepted but not dispatched to runners. Queue grew to 500K+ pending jobs. Self-hosted runners unaffected.

Actions, Packages
Major: Clerk
2024-10-15
1h 20min
Session verification failures — JWKS rotation error

New signing key deployed before public key propagated to edge nodes. All session verifications returned 401. Users logged out of every Clerk-powered app.

Session Verification, Auth API, JWKS
Major: Vercel
2024-10-02
3h 10min
Edge Function cold start failures globally

V8 isolate pool exhaustion caused Edge Functions to timeout or return 504 errors. Static assets and ISR unaffected. All regions impacted simultaneously.

Edge Functions, Middleware, Edge Config
Major: AWS
2024-09-25
3h 20min
Lambda cold start degradation in us-east-1

Control plane update caused Lambda sandbox provisioning to take 5-10x longer. Cold starts exceeded 10 seconds. Warm invocations unaffected.

Lambda, API Gateway
Major: Railway
2024-08-22
3h 05min
Deploy queue blocked by Nixpacks regression

Nixpacks builder update caused builds to hang on certain Node.js projects. Build queue backed up consuming all builder capacity. Existing deployments unaffected.

Deployments, Builds
Major: npm
2024-08-05
2h 30min
Registry publish and install failures

CouchDB replication lag caused package metadata inconsistencies. Some packages not found, others returned stale versions. Affected npm, yarn, and pnpm.

npm Registry, Package Publishing
Major: Stripe
2024-07-19
4h 00min
Webhook delivery delays up to 4 hours

Webhook queue backed up after database partition rebalance. Payments processed normally but webhook notifications delayed 1-4 hours. Order fulfillment systems stalled.

Webhooks, Events API
Major: Docker Hub
2024-07-08
6h 00min
Image pull rate limiting misclassification

Rate limiting logic incorrectly classified authenticated pulls as anonymous. CI/CD pipelines using Docker Hub images failed with 429 errors globally.

Image Pull, Docker Hub API
Major: Firebase
2024-06-03
2h 15min
Firestore write failures — multi-region US

Bigtable replication lag in nam5 multi-region caused write commits to fail. Reads from cache worked but realtime listeners stopped. Auth and Hosting unaffected.

Firestore, Realtime Listeners
Major: PlanetScale
2024-04-10
3h 45min
Deploy request queue stalled during Vitess upgrade

Vitess version upgrade caused VReplication streams to stall. Deploy requests and branch merges stuck in pending state. Database reads/writes normal.

Deploy Requests, Branching, Schema Migrations
Major: PlanetScale
2024-04-06
8h
Mass migration deadline causing export queue saturation

Database export tools timing out as thousands of free tier users attempted migration before deadline. Export queue backed up 6+ hours. Emergency capacity added.

Database Export, CLI, Dashboard
