Soft.xyz is an independent, data-driven platform that compares 232+ developer tools across 69 categories by real metrics — GitHub activity, npm downloads, actual pricing, DX scores, and community sentiment. Unlike pay-to-play review sites, every ranking on Soft.xyz is based on publicly verifiable data pulled from GitHub, npm, and official pricing pages.

How does Soft.xyz collect data?

Soft.xyz pulls data from public APIs including GitHub (stars, forks, commits, open issues, last commit date), npm (weekly download counts), and official pricing pages. Community sentiment data comes from Reddit and Hacker News mention tracking. All metrics are refreshed weekly and timestamped so you can see exactly when data was last updated.

Is Soft.xyz free to use?

Yes, Soft.xyz is completely free to use. All tool pages, comparisons, benchmarks, DX scores, pricing analysis, and category rankings are available at no cost. For teams that need a deeper analysis, we offer a paid Stack Audit report ($199) that evaluates your entire tech stack with personalized recommendations. An API is also available for programmatic access to our dataset.

How many tools does Soft.xyz track?

Soft.xyz tracks 232 developer tools across 69 categories including databases, hosting, authentication, payments, analytics, CI/CD, monitoring, search, CMS, and more. Each tool page includes current GitHub metrics, npm download counts, pricing breakdowns, and side-by-side comparisons with alternatives in the same category.

How often do SaaS services go down?

Most major SaaS services experience multiple incidents per month, ranging from minor degradations to full outages. Our tracker logs real incidents with dates, durations, and severity levels. The Uptime Scorecard compares claimed vs. actual uptime percentages.

Which services have the most outages?

Filter by service on this page to see each provider's incident history. We track major outages, partial disruptions, and performance degradations. The uptime gap between claimed and actual availability reveals which providers over-promise on reliability.

How do I prepare for service outages?

Build redundancy into critical paths, use multiple providers for essential services, implement circuit breakers, and have fallback strategies. Check each service's average resolution time on our Uptime Scorecard to understand how long outages typically last before planning your resilience strategy.
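As an illustration of the circuit-breaker and fallback advice above, here is a minimal Python sketch. The class name `CircuitBreaker`, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and the `primary`/`fallback` callables are all hypothetical names chosen for this example; they are not part of any Soft.xyz tooling or provider SDK, and a production setup would likely need per-provider tuning.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before the circuit opens
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable clock (assumption, eases testing)
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, primary, fallback):
        # While open and still cooling down, skip the primary provider entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call (half-open).
        try:
            result = primary()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, `primary` would wrap the call to the main provider and `fallback` the call to a secondary provider or a cached response. A reasonable starting point for `reset_timeout` is something well below the typical resolution time of the service in question, so that recovery is detected soon after the incident ends.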

SaaS Outage Tracker

Real incident history for developer tools and SaaS platforms — not what the status page claims, but what actually happened. Each incident includes the date, duration, affected services, and severity level. The Uptime Scorecard compares claimed uptime percentages against actual measured availability. The Post-Mortem Analysis tab dives deeper into major incidents with root causes, communication grades, and lessons learned.
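To make the claimed-versus-measured comparison concrete, the sketch below turns incident durations written in this log's format (for example `47 min` or `2h 10min`) into a measured availability figure. It is only an approximation under stated assumptions: every incident is counted as full downtime (which overstates the impact of Degraded and Partial incidents), the window is a fixed number of days, and the function names and the claimed percentage are hypothetical. How the Uptime Scorecard itself weights incidents is not described here.

```python
import re


def parse_duration_minutes(text):
    """Parse strings like '47 min', '2h 10min', or '1h 05min' into whole minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*h)?\s*(?:(\d+)\s*min)?\s*", text)
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def measured_uptime_percent(durations, window_days):
    """Availability over the window, treating each incident as full downtime (assumption)."""
    total_minutes = window_days * 24 * 60
    down_minutes = sum(parse_duration_minutes(d) for d in durations)
    return 100.0 * (total_minutes - down_minutes) / total_minutes


def uptime_gap(claimed_percent, measured_percent):
    """Percentage points by which the claimed figure exceeds the measured one."""
    return claimed_percent - measured_percent


# Two April 2026 incidents logged for one provider, over a 30-day window.
incidents = ["1h 25min", "55 min"]
measured = measured_uptime_percent(incidents, window_days=30)
print(f"measured uptime: {measured:.2f}%")
# 99.9 is a hypothetical claimed figure, not one taken from any provider's SLA.
print(f"gap vs claim: {uptime_gap(99.9, measured):.2f} points")
```

For scale, a 99.9% target leaves about 43 minutes of downtime in a 30-day window, so two incidents totaling 140 minutes already put measured availability at roughly 99.68%.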

Degraded: Cloudflare

2026-04-26

20 min

1.1.1.1 DNS resolver intermittent failures

1.1.1.1 public DNS resolver experiencing intermittent SERVFAIL responses for some queries. Authoritative DNS and Workers unaffected.

1.1.1.1 Resolver

Major: Render

2026-04-24

2h 10min

Deploy failures on Oregon region

All new deploys in Oregon region failing with build timeout errors. Running services unaffected. Caused by storage I/O degradation on build infrastructure.

Deploys, Builds

Major: Groq

2026-04-23

1h 25min

LPU inference cluster overloaded

Llama 3.3 70B and Mixtral endpoints returning 503 'capacity exceeded' errors. Smaller models routed to overflow capacity with elevated latency.

Chat Completions, llama-3.3-70b, mixtral-8x7b

Major: OpenAI

2026-04-22

1h 30min

API 500 errors — all models

Elevated 500 error rates across GPT-4o and o3 endpoints. Streaming requests most affected.

Chat Completions, Embeddings

Major: Webflow

2026-04-22

1h 05min

CMS publishing failures across all sites

Site publishes stuck in pending state. CMS item updates not reflecting on live sites. Editor functional but changes not deployable.

Publishing, CMS, Hosting

Partial: PostHog

2026-04-22

1h 05min

Feature flag evaluations degraded

Feature flag API returning stale values or timing out under load. Event capture and session recordings unaffected. ClickHouse query pressure affecting flag cache refresh.

Feature Flags, Flag API

Partial: Anthropic

2026-04-21

1h 15min

Elevated latency on Claude Sonnet endpoints

Claude 3.5 Sonnet streaming requests experiencing 3-5x normal TTFB. Haiku and Opus endpoints unaffected. Caused by GPU cluster rebalancing.

API, claude-3-5-sonnet

Major: SendGrid

2026-04-21

1h 10min

Transactional email delivery failures for high-volume senders

High-volume senders (>100K/hr) experiencing delivery failures. Emails accepted by API but not delivered. Low-volume senders unaffected.

Mail Send API, Delivery, SMTP

Partial: Inngest

2026-04-20

55 min

Function invocation delays

Event-triggered functions experiencing 5-15 minute execution delays instead of sub-second. Cron-triggered functions unaffected. Event queue processing backlog.

Function Triggers, Event Processing

Major: ClickHouse Cloud

2026-04-20

1h 30min

Query processing failures in AWS us-east-1

SELECT queries returning internal errors for services in us-east-1. INSERT operations queued but not lost. Caused by ZooKeeper coordination layer restart.

Queries, Dashboards

Partial: DigitalOcean

2026-04-20

55 min

App Platform deploy failures in NYC1

App Platform deployments in NYC1 region failing with build timeout. Existing apps running normally. Functions also affected in same region.

App Platform, Functions, Deployments

Degraded: Convex

2026-04-20

15 min

Realtime subscription delays

Realtime subscriptions experiencing 5-10 second delays instead of sub-second. Queries and mutations unaffected. Caused by a hot partition in the subscription fanout layer.

Realtime subscriptions

Partial: Splunk Cloud

2026-04-19

1h 20min

Search head cluster degradation

Search queries timing out or returning partial results in US region. Data ingestion unaffected. Caused by search head captain election loop after maintenance.

Search, Dashboards, Alerts

Major: GitLab.com

2026-04-19

1h 40min

CI/CD pipeline execution failures

GitLab CI runners unable to pick up new jobs. Pipelines stuck in pending state. Merge requests blocked on pipeline status. Web IDE unaffected.

CI/CD, Runners, Pipelines

Major: Vercel

2026-04-18

47 min

Global deployment failures

All new deployments failed for 47 minutes due to build infrastructure issue. Existing deployments unaffected.

Deployments, Build API

Degraded: Sentry

2026-04-18

40 min

Event processing delays in US region

Error events accepted but appearing in dashboard with 15-30 minute delay. Alert notifications delayed accordingly. Performance data also lagging.

Event Processing, Alerts, Performance

Degraded: Twilio

2026-04-18

35 min

Programmable Messaging SMS delivery delays

US long-code SMS messages delayed by 5-15 minutes instead of sub-second. Short-code and toll-free unaffected. Carrier routing issue.

SMS Messaging, Programmable Messaging

Major: Notion

2026-04-17

1h 10min

Real-time collaboration broken

Multiplayer editing failing — users seeing stale content and edit conflicts. Single-user editing worked but changes not syncing between clients.

Real-time Sync, Collaboration, API

Major: Scaleway

2026-04-17

1h 30min

Object Storage API errors in PAR1

S3-compatible Object Storage returning 503 errors in Paris region. Compute instances unaffected. Serverless Functions depending on object storage failing.

Object Storage, Serverless Functions

Major: Groq

2026-04-17

55 min

All model endpoints returning 503 errors

Complete API unavailability due to LPU cluster maintenance window extended unexpectedly. No inference capacity served during the window.

API, All Models

Degraded: Upstash

2026-04-16

45 min

Elevated Redis latency in US East

REST API and native Redis protocol connections experiencing 3-5x normal latency. QStash webhook deliveries delayed by 2-10 minutes.

Redis, QStash

Major: CircleCI

2026-04-16

1h 15min

Build queue processing halted

All builds queued but not starting execution. Running builds completed normally. Docker layer caching service also degraded.

Builds, Queue, Docker Layer Cache

Partial: GitHub

2026-04-15

1h 30min

Actions and Pages degraded

GitHub Actions queue times increased to 30+ minutes. Pages deployments delayed.

Actions, Pages

Major: Zapier

2026-04-15

2h 30min

Zap execution engine backlog

Zap triggers firing but actions queued for 30-90 minutes instead of near-instant. Webhook triggers most affected. Scheduled triggers ran on time.

Zap Execution, Webhooks, Actions

Major: Discord

2026-04-15

2h 5min

Voice and gateway connection failures

Voice channels disconnecting and gateway WebSocket failing to reconnect for ~30% of users. Bot APIs degraded. Caused by misconfigured edge routing change.

Voice, Gateway, Bot API

Degraded: Google Cloud Run

2026-04-15

40 min

Cold start latency spike in europe-west1

Cloud Run services in europe-west1 experiencing 10-30 second cold starts vs normal <2s. Warm instances unaffected. Caused by container registry caching issue.

Services, Cold Starts

Partial: Neon

2026-04-15

1h 10min

Branch creation and deletion failing

Database branching operations timing out. Existing branches and connections fully operational. Caused by storage layer backlog during internal migration.

Branch creation, Branch deletion, Dashboard

Major: Fly.io

2026-04-14

2h 30min

Machines API unresponsive

Fly Machines API returning timeouts. Running apps stayed up but scaling and deploys failed.

Machines API, Deployments

Degraded: Shopify

2026-04-14

35 min

Storefront API elevated latency

Storefront API response times 3-5x higher than normal globally. Checkout flow unaffected. Headless storefronts experienced slow page loads.

Storefront API, GraphQL

Degraded: Sentry

2026-04-14

40 min

Event ingestion delays — US region

Error events delayed 15-30 minutes before appearing in the dashboard. Alert rules not firing on time. Ingest API accepting events without errors.

Event Ingestion, Alerts, Performance

Partial: Amplitude

2026-04-13

1h 35min

Cohort sync failures to downstream tools

Amplitude cohort syncs to destinations (Braze, Iterable, etc.) failing silently. Event ingestion and dashboards unaffected. Caused by destination sync worker OOM.

Cohort Sync, Integrations

Major: Appwrite Cloud

2026-04-13

1h 45min

Database and auth service errors

Database queries returning errors. Auth service intermittently failing. File storage operational. Caused by cloud infrastructure provider network issue.

Database, Auth, Functions

Partial: Mixpanel

2026-04-13

50 min

Cohort sync failures to downstream destinations

Mixpanel cohort syncs to Braze, Iterable, and other destinations failing. Event ingestion and dashboards unaffected. Destination sync worker OOM restart loop.

Cohort Sync, Integrations

Major: Supabase

2026-04-12

2h 15min

Database connectivity issues — US East

PostgreSQL connections timing out for projects in us-east-1. Caused by underlying AWS networking issue.

Database, Auth, Realtime

Major: Jira Cloud

2026-04-12

Jira Service Management queue failures

JSM queues not processing new tickets. Existing tickets accessible but new submissions stuck in limbo. Confluence and Bitbucket unaffected.

JSM Queues, Ticket Creation, Automation

Partial: Axiom

2026-04-12

50 min

Query API timeouts for large time ranges

APL queries spanning >7 days returning timeout errors. Sub-day queries and ingest unaffected. Caused by query planner regression.

Query API, APL

Major: Clerk

2026-04-12

25 min

Authentication API returning 500 errors

Sign-in and sign-up endpoints returning 500 for ~25 minutes. Users unable to authenticate. Caused by a failed database migration in the auth token service.

Sign-in, Sign-up, Session validation

Major: New Relic

2026-04-11

1h 45min

NRQL query engine unavailable

NRQL queries returning errors across all accounts. Dashboards blank. Alert conditions not evaluating. Data ingestion continued normally.

NRQL, Dashboards, Alerting

Major: Bitbucket Cloud

2026-04-11

1h 50min

Git push/pull operations timing out

Git operations over HTTPS returning timeouts. SSH git operations partially affected. Web UI accessible but showing stale repository state.

Git Operations, HTTPS, SSH

Partial: Replicate

2026-04-11

55 min

Prediction queue backlog

Public model predictions queued for 2-5 minutes vs sub-second baseline. Private deployments unaffected. Caused by autoscaler lag during traffic spike.

Public Models, Predictions API

Partial: Turso

2026-04-10

1h 20min

Edge replica sync delays across EU regions

Edge replicas in EU regions lagging 15-30 seconds behind primary. Read-after-write consistency broken for EU-deployed apps.

Edge Replicas, Replication

Major: MongoDB Atlas

2026-04-10

1h 20min

Serverless instance connection failures

MongoDB Atlas Serverless instances in AWS us-east-1 returning connection refused. Dedicated and shared clusters unaffected. Data API also impacted.

Serverless Instances, Data API

Degraded: Miro

2026-04-09

35 min

Board widget rendering delays

Sticky notes and shapes rendering with 5-10 second delays. Board loading times increased 3x. Caused by CDN cache purge propagation issue.

Board Rendering, CDN

Major: Ghost(Pro)

2026-04-09

1h 30min

Ghost(Pro) sites returning 502 errors

~20% of Ghost(Pro) hosted blogs returning 502. Admin panel inaccessible for affected sites. Newsletter scheduling delayed. Self-hosted unaffected.

Hosting, Admin Panel, Newsletter

Major: OpenAI

2026-04-09

2h 40min

Sora video generation queue stuck

Sora video generation jobs accepted but not processing. ChatGPT and standard API unaffected. Caused by video processing worker pool deadlock.

Sora, Video Generation

Degraded: Linear

2026-04-09

30 min

Sync delays for large workspaces

Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes in real time.

Real-time Sync, App

Partial: Firebase

2026-04-08

1h 45min

Firestore read latency spike

Firestore read latency increased 10x in us-central1. Write operations unaffected.

Firestore, Cloud Functions

Partial: Backblaze B2

2026-04-08

45 min

B2 API upload failures in US West

Large file uploads (>100MB) failing with timeout errors. Small file uploads and downloads unaffected. S3-compatible API also impacted.

Upload API, S3 API, Large Files

Degraded: Lemon Squeezy

2026-04-08

45 min

Checkout page loading failures

Checkout pages for ~15% of stores returning 502 errors. API and dashboard unaffected. Caused by CDN edge misconfiguration after SSL certificate renewal.

Checkout, Storefront

Major: OpenRouter

2026-04-08

2h 15min

Multiple model providers unavailable

Anthropic and Google models returning 503 errors through OpenRouter. Direct provider APIs were functional — issue was in OpenRouter's provider proxy layer routing.

Claude models, Gemini models, API routing

Major: n8n Cloud

2026-04-07

1h 15min

Workflow execution failures across all regions

All triggered and scheduled workflows failing with internal server error. Workflow editor accessible but executions not starting. Caused by message broker outage.

Executions, Triggers, Webhooks

Partial: Storyblok

2026-04-07

Visual Editor connection timeouts

Visual Editor failing to connect to preview environments. Content API reads/writes normal. Form-based editing unaffected. Caused by WebSocket proxy issue.

Visual Editor, Preview

Degraded: Linear

2026-04-06

25 min

GitHub integration sync failures

Pull request and branch references not syncing from GitHub. Issue state changes via GitHub not reflected. Manual refresh partially resolved for some users.

GitHub Integration, Sync

Major: Trigger.dev

2026-04-05

1h 45min

Task execution engine unresponsive

All queued tasks stuck in 'pending' state. Running tasks completed but new tasks not picked up. Dashboard showing stale execution status.

Task Execution, Queue, Dashboard

Partial: Pulumi Cloud

2026-04-03

1h 05min

Stack state operations timing out

Pulumi up/destroy operations failing with state lock timeout. Stack exports and imports also affected. CLI operations against local state unaffected.

State Management, Deployments

Partial: PostHog

2026-03-30

1h 10min

Event ingestion lag in US Cloud

Events accepted but appearing in dashboards with 30-60 minute delay. Feature flags and session recordings unaffected. ClickHouse ingestion backlog.

Event Ingestion, Dashboards

Major: Modal

2026-03-29

1h 50min

GPU function cold starts failing

Functions requiring A100 and H100 GPUs failing to start with 'no capacity available' errors. CPU functions unaffected. Caused by upstream cloud GPU shortage in us-east-1.

GPU Functions, Image Builds

Degraded: Stripe

2026-03-28

23 min

Elevated API error rates

0.5% of API requests returning 500 errors. Payment processing unaffected for most merchants.

API

Degraded: Elastic Cloud

2026-03-28

55 min

Kibana dashboard loading failures

Kibana dashboards returning 502 errors for deployments in GCP us-central1. Elasticsearch queries via API unaffected.

Kibana, Dashboards

Partial: Wasabi

2026-03-28

1h 10min

Elevated latency in EU Central region

Object operations in eu-central-1 experiencing 5-10x normal latency. US regions unaffected. ListBucket operations most impacted.

Storage API, EU Central

Degraded: Algolia

2026-03-28

35 min

Indexing delays for large index pushes

Index updates queued for 10-20 minutes instead of near-instant. Search queries using existing index unaffected. Caused by temporary indexing cluster pressure.

Indexing API, Index Updates

Partial: Cerebras

2026-03-28

45 min

Inference latency spike on Llama 3.3 70B

Response times degraded from <500ms to 3-5 seconds. Wafer-scale compute cluster rebalancing after hardware maintenance. Smaller models unaffected.

Llama 3.3 70B inference

Degraded: Stytch

2026-03-27

25 min

Magic link delivery delays

Magic link and OTP emails delayed by 3-8 minutes instead of sub-10s. Password auth and OAuth flows unaffected. Email provider rate limiting issue.

Magic Links, OTP Email

Degraded: Stytch

2026-03-27

25 min

Magic link and OTP email delivery delays

Magic link and OTP emails delayed by 3-8 minutes. OAuth and password auth unaffected. Email provider rate limiting triggered by transient traffic spike.

Magic Links, OTP Email

Partial: Timescale Cloud

2026-03-26

55 min

Continuous aggregate refresh failures

Continuous aggregates not refreshing on schedule. Raw data queries unaffected. Dashboard views showing stale data up to 2 hours old.

Continuous Aggregates, Scheduled Jobs

Partial: Slack

2026-03-26

1h 15min

Message delivery delays in EU workspaces

Message send/receive delayed by 30-90 seconds for EU workspaces. Search and integrations also affected. Caused by Vitess shard rebalancing operation.

Messaging, Search, Integrations

Partial: Neon

2026-03-25

1h 40min

Connection pooler errors in us-east-2

Serverless driver connections failing intermittently in us-east-2. Direct connections unaffected. Caused by pooler autoscaling misconfiguration.

Serverless Driver, Connection Pooler

Major: Temporal Cloud

2026-03-23

1h 50min

Workflow execution history unavailable

Workflow history queries returning empty results. Running workflows continued but new workflow starts failing due to deduplication check failures.

History Service, Workflow Starts

Major: Fly.io

2026-03-22

ORD region hardware failure

Physical server failure in Chicago region. Apps with multi-region setup unaffected. Single-region ORD apps experienced downtime.

Compute, Volumes

Major: OVHcloud

2026-03-22

2h 15min

Network degradation in GRA datacenter

Packet loss and elevated latency for servers in Gravelines datacenter. VPS and dedicated servers affected. Internal network between DCs unaffected.

Networking, VPS, Dedicated Servers

Partial: Figma

2026-03-21

50 min

File loading timeouts for large projects

Figma files >500MB failing to load with timeout errors. Smaller files unaffected. Dev Mode and Inspect panel also impacted for large files.

Editor, Dev Mode, File Loading

Major: Cloudflare

2026-03-20

35 min

Workers and Pages outage

Cloudflare Workers and Pages returning 502 errors globally. Root cause: bad config deployment.

Workers, Pages, KV

Partial: Framer

2026-03-20

55 min

Editor preview rendering failures

Framer editor preview pane showing blank content. Published sites unaffected. Code components failing to render in editor. Publish still functional.

Editor, Preview, Code Components

Partial: Grafana Cloud

2026-03-19

50 min

Alerting evaluation delays in EU stack

Alert rules in EU region evaluated with 10-20 minute delay. Dashboards and metrics ingestion unaffected. Caused by Cortex ruler pod restart loop.

Alerting, Ruler

Partial: Supabase

2026-03-19

55 min

Auth provider OAuth callback failures

Google and GitHub OAuth sign-ins returning 'invalid_grant' for new sessions. Existing sessions unaffected. Email/password and magic link auth working normally.

Auth, OAuth

Partial: Datadog

2026-03-18

42 min

Metrics ingestion delays

Custom metrics delayed by 5-15 minutes in US1 region. Alerts based on delayed metrics may have fired late.

Metrics, Monitors

Major: Terraform Cloud

2026-03-17

1h 25min

Plan and apply runs stuck in queue

All Terraform runs entering pending state and not executing. State file locking working but no plan/apply operations completing. Caused by worker pool scaling failure.

Runs, Plans, Applies

Major: AWS

2026-03-15

us-east-1 S3 and Lambda degraded

S3 bucket operations and Lambda invocations experiencing elevated error rates in us-east-1.

S3, Lambda, API Gateway

Major: Resend

2026-03-15

1h 30min

Email delivery failures across all regions

API accepting requests but emails not being delivered. Webhook delivery confirmations delayed. SMTP relay returning 421 temporary errors.

Email Delivery, SMTP, Webhooks

Degraded: Contentstack

2026-03-15

35 min

Content Delivery API slow responses

Content Delivery API response times increased 3x in NA region. Management API unaffected. CDN cache serving stale content for some entries.

Content Delivery API

Partial: Datadog

2026-03-15

1h 30min

Metric ingestion delays in US1 region

Custom metrics showing 10-15 minute ingestion lag. Monitors firing late or not at all. Infrastructure metrics less affected. Caused by intake pipeline partition skew.

Metrics, Monitors, Dashboards

Degraded: Dynatrace

2026-03-14

40 min

Smartscape topology map delays

Smartscape topology updates delayed by 15-30 minutes. Metrics and log ingestion unaffected. Davis AI alerting slightly delayed.

Smartscape, Davis AI

Partial: Anthropic

2026-03-14

50 min

Messages API rate limit errors

Messages API returning 529 overloaded errors at 3x normal rate. Batch API unaffected. Claude.ai web interface also experiencing slower responses.

Messages API, Claude.ai

Degraded: WorkOS

2026-03-14

22 min

SSO SAML assertion validation latency

SAML-based SSO logins experiencing 10-20 second delays for enterprise connections. SCIM sync and OAuth unaffected. IDP metadata cache refresh caused the issue.

SSO, SAML

Major: GitHub

2026-03-13

1h 20min

Copilot completions unavailable

GitHub Copilot returning errors in IDE for all users. Chat feature also affected. Underlying Azure OpenAI deployment experienced capacity issues.

Copilot, Copilot Chat

Degraded: Linear

2026-03-12

35 min

Sync delays for large workspaces

Workspaces with 10K+ issues experiencing 30-60 second sync delays. New issue creation working but UI not reflecting changes immediately.

Real-time Sync, App

Degraded: Kong Konnect

2026-03-11

30 min

Control plane sync delays

Configuration changes in Konnect not propagating to data plane nodes for 10-20 minutes. Existing routes unaffected. New route creation delayed.

Control Plane, Config Sync

Partial: Netlify

2026-03-10

55 min

Build queue backlog

Builds queuing for 20+ minutes instead of usual <2 minutes. Caused by surge in traffic.

Builds

Major: InfluxDB Cloud

2026-03-09

1h 40min

Write endpoint rejecting data points

InfluxDB Cloud write API returning 503 for all organizations in US region. Reads and queries unaffected. Caused by storage engine compaction backlog.

Write API, Data Ingestion

Major: PlanetScale

2026-03-08

2h 30min

Branch promotion failures in US East

Schema changes via branch promotion failing with timeout errors. Direct database queries unaffected. Caused by Vitess vttablet resource exhaustion.

Branch Promotions, Schema Changes

Partial: GitLab.com

2026-03-08

50 min

Container Registry push failures

Docker image pushes to GitLab Container Registry returning 500 errors. Image pulls from existing tags working. CI jobs building Docker images failing.

Container Registry, CI/CD

Partial: Fly.io

2026-03-08

2h 40min

DNS propagation failures for new apps

Newly created apps not resolving via fly.dev subdomain. Existing apps unaffected. Custom domains working. Caused by DNS zone file sync lag between authoritative nameservers.

DNS, fly.dev Subdomains, New Apps

Degraded: OpenAI

2026-03-05

45 min

GPT-4o response quality degradation

GPT-4o returning truncated or low-quality responses. Suspected routing issue to degraded model shard.

Chat Completions

Partial: HCP Vault

2026-02-28

45 min

Secret read latency spike in US region

Vault secret read operations experiencing 5-10x normal latency. Secret writes unaffected. Auth token validation delayed. Caused by storage backend compaction.

Secret Engine, Auth

Degraded: Clerk

2026-02-25

18 min

Sign-in latency increase

Sign-in and sign-up flows taking 5-10 seconds instead of <1s. No auth failures reported.

Authentication

Degraded: Anthropic

2026-02-21

35 min

Tool use responses malformed

Claude Sonnet returning malformed tool_use blocks intermittently for ~5% of requests. Issue traced to model serving layer rollback.

Tool Use, Messages API

Major: CockroachDB

2026-02-20

1h 55min

Serverless cluster connection failures

CockroachDB Serverless clusters in US regions returning connection refused errors. Dedicated clusters unaffected. Caused by proxy layer autoscaling bug.

Serverless Clusters, Connection Proxy

Major: Clerk

2026-02-20

42 min

Authentication failures across all sign-in methods

All sign-in attempts returning 500 errors. Session validation failing for existing sessions. Caused by database migration that locked the sessions table.

Sign-in, Session Validation, OAuth

Degraded: Postmark

2026-02-19

28 min

SMTP submission elevated error rates

SMTP server returning temporary 451 errors for ~5% of submission attempts. API delivery unaffected. Caused by SMTP authentication service restart.

SMTP, Email Delivery

Partial: OpenAI

2026-02-18

Rate limits applied incorrectly

Tier 4+ accounts receiving Tier 1 rate limits. Batch API unaffected.

Rate Limits, Chat Completions

Major: Hetzner

2026-02-12

2h 40min

Falkenstein DC network degradation

Packet loss and elevated latency for servers in Falkenstein data center. Caused by upstream provider fiber cut. Helsinki and Nuremberg DCs unaffected.

Cloud Servers, Dedicated Servers, Networking

Major: MongoDB Atlas

2026-02-12

18 min

Unexpected primary elections across multiple clusters

M10+ clusters in AWS us-east-1 experiencing simultaneous primary elections. Applications saw connection resets and write failures for 15-18 minutes. Caused by network partition in MongoDB's management plane.

Cluster Availability, Write Operations

Partial: Vercel

2026-02-10

1h 10min

Edge Functions cold starts

Edge Functions experiencing 10x normal cold start times. Serverless Functions unaffected.

Edge Functions

Major: Railway

2026-02-05

Platform-wide deployment failures

All deployments failing due to Docker build infrastructure issue. Running services unaffected.

Deployments, Builds

Partial: Netlify

2026-02-05

3h 15min

Build queue backlog across all plans

Build queue times exceeding 25 minutes for all tiers including Enterprise. Priority queue not respected. Caused by build image registry corruption requiring rebuild.

Builds, Deploy Previews

Major: Railway

2026-01-28

1h 55min

Deployment pipeline failures and rollback issues

New deployments stuck in 'building' state. Rollback to previous deployment also failing. Running services unaffected. Caused by Nixpacks builder OOM during concurrent builds.

Deployments, Builds, Rollbacks

Major: Vercel

2026-01-23

1h 52min

Serverless Functions timing out globally

Serverless Functions returning 504 timeouts regardless of function duration setting. Edge Functions unaffected. Root cause: internal routing table update.

Serverless Functions, API Routes

Major: AWS

2026-01-22

2h 30min

us-east-1 EC2 and Lambda partial availability

EC2 instance launches failing in 3 of 6 AZs. Lambda cold starts timing out. ECS task placements failing. Caused by internal DNS resolution failures in control plane.

EC2, Lambda, ECS, Internal DNS

Partial: Stripe

2026-01-15

2h 10min

Webhook delivery backlog exceeding 30 minutes

Payment intent webhooks delayed by 30-90 minutes. Dashboard showing events as pending. Payments processed normally but downstream systems not notified. Caused by event bus partition rebalancing.

Webhooks, Event Delivery

Degraded: Cloudflare

2026-01-14

28 min

R2 Storage elevated error rates

R2 bucket reads returning intermittent 500 errors in EU regions. Writes unaffected. Workers binding to R2 saw failures.

R2 Storage, Workers R2 bindings

Major: Supabase

2026-01-09

1h 45min

Database connection storms in ap-southeast-1

Projects in Singapore region hitting connection limits. Pooler returning 'too many connections' despite available capacity. Caused by PgBouncer misconfiguration during scaling event.

Database, Connection Pooler, PostgREST

Major: GitHub

2026-01-08

2h 10min

Git SSH and HTTPS operations failing

Git push, pull, and clone operations failing via both SSH and HTTPS. Web interface functional. Authentication service degraded.

Git Operations, SSH, HTTPS

Degraded: Render

2025-12-22

Free tier cold starts exceeding 60 seconds

Free tier services experiencing 60-90 second cold starts (normal: 10-15s). Paid services unaffected. Caused by aggressive resource reclamation during holiday traffic spike.

Free Tier, Cold Starts

Major: AWS

2025-12-18

3h 20min

us-east-1 DynamoDB and Cognito elevated errors

DynamoDB reads experiencing elevated latency and error rates in us-east-1. Cognito authentication failures cascade-impacted services using federated identity.

DynamoDB, Cognito, AppSync

Partial: GitHub

2025-12-18

4h 20min

Actions runners severely degraded

Ubuntu-latest runners taking 45+ minutes to start. Windows and macOS runners at 50% capacity. Self-hosted runners unaffected. Caused by capacity crunch during end-of-year CI surge.

Actions, Hosted Runners

Partial: Stripe

2025-12-10

41 min

Webhook delivery delays

Webhook events delayed by 10-30 minutes. Payment processing and API calls unaffected. Caused by event processing queue backup.

Webhooks

Major: Cloudflare

2025-12-03

53 min

API Gateway and Workers KV global outage

Workers KV reads returning stale data or errors. API Gateway routes failing for customers using custom domains. Caused by distributed storage consistency issue during rollout.

Workers KV, API Gateway, Custom Domains

Partial: OpenAI

2025-11-28

Aggressive rate limiting on GPT-4 and o1 endpoints

Tier 4-5 customers hitting rate limits at 10% of their stated capacity. 429 errors returned with incorrect retry-after headers. Caused by capacity reallocation for new model deployment.

API, GPT-4, o1, Rate Limits

Partial: Firebase

2025-11-22

1h 15min

Firebase Auth sign-in methods degraded

Google and email/password sign-in returning errors intermittently. OAuth redirect flows most affected. Phone auth unaffected.

Auth

Major: Vercel

2025-11-14

1h 12min

Edge network routing failures across EU regions

Edge Functions returning 502 errors in EU-west and EU-central. Static assets served normally. Root cause: BGP route leak from upstream provider affected edge PoPs.

Edge Functions, Middleware, ISR

Major: MongoDB Atlas

2025-11-12

2h 5min

Atlas cluster scaling failures in AWS us-east-1

Cluster auto-scaling and manual tier changes failing. Existing clusters operational but could not scale up during high load periods.

Cluster Management, Auto-scaling

Major | OpenAI

2025-11-05

3h 20min

ChatGPT and API widespread unavailability

ChatGPT returning 503 errors. API returning elevated 500 rates across all models. Streaming endpoints most affected. Caused by infrastructure scaling issue.

ChatGPT, API, Chat Completions, Assistants

Partial | GitHub

2025-03-12

1h 55min

Copilot and OAuth token validation failures

OAuth token validation service partial failure after deployment. Copilot stopped working in IDEs. API calls with OAuth tokens returned 401. Login via session cookies unaffected.

Copilot, OAuth, API Auth

Partial | Cloudflare

2025-02-28

1h 30min

D1 database unavailability — multiple regions

D1 replication failure from control plane update. SQLite replicas couldn't sync with primary. 60% of edge locations affected. Workers using D1 experienced full outages.

D1, Workers

Partial | Vercel

2025-02-14

2h 00min

ISR revalidation silently failing

ISR revalidation worker pool exhausted connections to data cache layer. Revalidation requests silently dropped. Sites served stale content. Status page showed operational for 45 min.

ISR, Data Cache, On-demand Revalidation

Partial | Anthropic

2025-01-23

2h 05min

Claude API 529 overload errors

Inference cluster capacity insufficient for sustained traffic spike. Auto-scaling couldn't provision GPU instances fast enough. 30%+ of requests returning 529.

Messages API, Claude.ai

Major | AWS

2025-01-08

2h 10min

S3 elevated error rates — us-east-1

Internal indexing partition split caused read inconsistencies. GET requests returned 404 or stale data for recently written objects. Cascading impact on Lambda, ECS, CloudFront.

S3, CloudFront, Lambda, ECS

Major | OpenAI

2024-12-11

4h 15min

Full API outage during Sora/o1 launch

Traffic surge from new model launches overwhelmed API gateway. All endpoints returning 503 including GPT-4, Embeddings, and Assistants. ChatGPT also degraded.

Chat Completions, Embeddings, Assistants, ChatGPT

Major | Fly.io

2024-12-05

2h 40min

Machines API OOM crash loop — global deploy freeze

Machines API experienced OOM crash loop after deploy request surge. No new deployments or scaling possible. Running machines continued serving traffic.

Machines API, Deployments, Volumes

Major | Render

2024-11-22

1h 50min

Oregon region complete outage

Network configuration change during maintenance caused routing loop. All services in Oregon unreachable. Other regions unaffected.

Web Services, Databases, Cron Jobs

Major | Supabase

2024-11-18

1h 45min

Auth service outage — all regions

GoTrue memory leak triggered by OAuth callback traffic spike. All authentication operations failed globally. Existing sessions unaffected.

Auth, OAuth, Magic Links

Major | Cloudflare

2024-11-01

55 min

Workers KV global read failures

Configuration push caused cache invalidation storm across all PoPs. KV reads returned errors or stale data. R2 and D1 unaffected.

Workers KV, Workers

Major | GitHub

2024-10-30

5h 40min

Actions job queue delays worldwide

Storage backend migration caused job dispatcher slowdown. Jobs accepted but not dispatched to runners. Queue grew to 500K+ pending jobs. Self-hosted runners unaffected.

Actions, Packages

Major | Clerk

2024-10-15

1h 20min

Session verification failures — JWKS rotation error

New signing key deployed before public key propagated to edge nodes. All session verifications returned 401. Users logged out of every Clerk-powered app.

Session Verification, Auth API, JWKS
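
A verifier that sees a token signed with a key ID it does not recognize can refetch the key set before rejecting the session, which shortens the window in which a rotation like the one above causes 401s. The sketch below is illustrative only and says nothing about how Clerk's edge nodes work: `JwksCache`, `fetch_keys`, and `min_refresh_interval` are hypothetical names, and the refetch only helps once the new public key has actually been published.

```python
import time


class JwksCache:
    """Cache of signing keys by key ID, refetched on an unknown ID.

    `fetch_keys` is a hypothetical callable returning {kid: public_key}.
    """

    def __init__(self, fetch_keys, min_refresh_interval=30.0,
                 clock=time.monotonic):
        self._fetch_keys = fetch_keys
        self._min_interval = min_refresh_interval
        self._clock = clock
        self._keys = {}
        self._last_fetch = None

    def get_key(self, kid):
        """Return the key for `kid`, or None if it is still unknown."""
        key = self._keys.get(kid)
        if key is not None:
            return key
        # Unknown key ID: refetch, but at most once per interval so a
        # flood of bad tokens cannot hammer the key endpoint.
        now = self._clock()
        if (self._last_fetch is None
                or now - self._last_fetch >= self._min_interval):
            self._keys = dict(self._fetch_keys())
            self._last_fetch = now
        return self._keys.get(kid)
```

The rate limit on refetching is the main trade-off: a shorter interval recovers faster after a rotation, while a longer one protects the key endpoint from tokens carrying forged key IDs.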

Major | Vercel

2024-10-02

3h 10min

Edge Function cold start failures globally

V8 isolate pool exhaustion caused Edge Functions to time out or return 504 errors. Static assets and ISR unaffected. All regions impacted simultaneously.

Edge Functions, Middleware, Edge Config

Major | AWS

2024-09-25

3h 20min

Lambda cold start degradation in us-east-1

Control plane update caused Lambda sandbox provisioning to take 5-10x longer. Cold starts exceeded 10 seconds. Warm invocations unaffected.

Lambda, API Gateway

Major | Railway

2024-08-22

3h 05min

Deploy queue blocked by Nixpacks regression

Nixpacks builder update caused builds to hang on certain Node.js projects. Build queue backed up consuming all builder capacity. Existing deployments unaffected.

Deployments, Builds

Major | npm

2024-08-05

2h 30min

Registry publish and install failures

CouchDB replication lag caused package metadata inconsistencies. Some packages not found, others returned stale versions. Affected npm, yarn, and pnpm.

npm Registry, Package Publishing

Major | Stripe

2024-07-19

4h 00min

Webhook delivery delays up to 4 hours

Webhook queue backed up after database partition rebalance. Payments processed normally but webhook notifications delayed 1-4 hours. Order fulfillment systems stalled.

Webhooks, Events API

Major | Docker Hub

2024-07-08

6h 00min

Image pull rate limiting misclassification

Rate limiting logic incorrectly classified authenticated pulls as anonymous. CI/CD pipelines using Docker Hub images failed with 429 errors globally.

Image Pull, Docker Hub API

Major | Firebase

2024-06-03

2h 15min

Firestore write failures — multi-region US

Bigtable replication lag in nam5 multi-region caused write commits to fail. Reads from cache worked but realtime listeners stopped. Auth and Hosting unaffected.

Firestore, Realtime Listeners

Major | PlanetScale

2024-04-10

3h 45min

Deploy request queue stalled during Vitess upgrade

Vitess version upgrade caused VReplication streams to stall. Deploy requests and branch merges stuck in pending state. Database reads/writes normal.

Deploy Requests, Branching, Schema Migrations

Major | PlanetScale

2024-04-06

Mass migration deadline causing export queue saturation

Database export tools timing out as thousands of free tier users attempted migration before deadline. Export queue backed up 6+ hours. Emergency capacity added.

Database Export, CLI, Dashboard
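
The durations logged above can be turned into a rough availability figure. The sketch below parses the duration strings used in this list and converts downtime into a percentage over a window. It is an illustration, not the tracker's own method: `parse_minutes` and `availability_pct` are hypothetical names, the 30-day window is an assumption, and counting a regional or partial incident as full downtime overstates its impact.

```python
import re

# Matches forms used in this list: "3h 20min", "2h 00min", "41 min".
_DURATION = re.compile(r"^(?:(\d+)h)?\s*(?:(\d+)\s*min)?$")


def parse_minutes(text):
    """Convert a duration such as '3h 20min' or '41 min' to minutes."""
    m = _DURATION.match(text.strip())
    if not m or not (m.group(1) or m.group(2)):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(m.group(1) or 0)
    minutes = int(m.group(2) or 0)
    return hours * 60 + minutes


def availability_pct(downtime_minutes, window_days=30):
    """Availability over the window, treating all downtime as total.

    The 30-day default is an assumption made for illustration.
    """
    window = window_days * 24 * 60
    return 100.0 * (1 - downtime_minutes / window)


# Durations of the three AWS incidents listed in this section.
aws = ["3h 20min", "2h 10min", "3h 20min"]
total = sum(parse_minutes(d) for d in aws)  # 200 + 130 + 200 minutes
```

Under these assumptions, a single 3h 20min incident in a 30-day month corresponds to `availability_pct(200)`, which is a little above 99.5%. Entries with no logged duration cannot be included and would need to be handled separately.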
