Deployment Failure Modes: A Catalog of What Goes Wrong

Purpose

This is a catalog, not a tutorial. Every entry here represents a deployment failure I've either caused or diagnosed firsthand across Rails applications running on VPS infrastructure. I'm organizing them by category so I can reference them later and so the patterns become easier to recognize.

Each entry follows the same structure: symptoms, root cause, and the diagnostic steps that actually narrowed things down. I'm deliberately omitting fixes in most cases — the fix depends on your specific stack and constraints. The diagnostic path is what transfers between environments.

Category 1: Asset Compilation Failures

Symptoms

Deployment completes but the application serves pages with missing stylesheets, broken JavaScript, or missing images processed through the asset pipeline. Alternatively, the deploy itself fails during the assets:precompile step with a JavaScript or CSS build error.

Root Cause

The asset compilation environment on the server differs from the development environment in ways that aren't immediately obvious. Common sources: a Node.js version mismatch between local and server, missing system dependencies for native npm packages (particularly node-sass or esbuild), or a Yarn/npm lockfile that references a package version no longer available in the registry.

In one case, the failure was caused by a Tailwind CSS configuration that referenced a file glob pattern that resolved differently on the case-sensitive Linux filesystem than on macOS. The class purging step silently removed classes that existed in development.
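That filesystem trap can be reproduced without deploying. The sketch below (pattern and file list are invented for illustration) flags paths that a glob matches only when case is ignored — exactly the files Tailwind would purge on Linux but keep on macOS:

```ruby
# Flag files a glob finds only under case-insensitive matching (macOS)
# but not under exact matching (Linux). Inputs here are made-up examples.
def case_only_matches(pattern, files)
  exact  = files.select { |f| File.fnmatch?(pattern, f, File::FNM_PATHNAME) }
  folded = files.select { |f| File.fnmatch?(pattern, f, File::FNM_PATHNAME | File::FNM_CASEFOLD) }
  folded - exact # matched only when case is ignored
end

files = ["app/views/home/index.html.erb", "app/javascript/app.js"]
case_only_matches("app/Views/**/*.html.erb", files)
# => ["app/views/home/index.html.erb"] -- found on macOS, purged on Linux
```

Running this against the real content globs from tailwind.config.js and a Dir.glob of the repository would have surfaced the mismatch before deploy.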

Diagnostic Steps

  1. Run assets:precompile manually on the server with RAILS_ENV=production and read the full output. Capistrano and similar tools often truncate or bury the actual error.
  2. Compare node --version and yarn --version between local and server.
  3. Check whether the tmp/cache/assets directory from a previous deploy is polluted. Clear it and recompile.
  4. If the deploy succeeds but assets are wrong, inspect the generated manifest in public/assets/.sprockets-manifest-*.json (Sprockets) or public/assets/.manifest.json (Propshaft) to verify the expected files are present.
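Step 4 can be scripted. This sketch handles both manifest layouts — Sprockets nests compiled names under an "assets" key, Propshaft's file is flat — and reports which logical names are absent (the inlined JSON stands in for the real file on the server):

```ruby
require "json"

# Return the logical asset names missing from a parsed manifest hash.
# Handles both Sprockets ({"assets" => {...}}) and Propshaft (flat) layouts.
def missing_assets(manifest, expected)
  compiled = manifest["assets"] || manifest
  expected.reject { |name| compiled.key?(name) }
end

# On the server, read this from public/assets/.sprockets-manifest-*.json
# (Sprockets) or public/assets/.manifest.json (Propshaft).
manifest = JSON.parse('{"assets":{"application.css":"application-abc123.css"}}')
missing_assets(manifest, %w[application.css application.js])
# => ["application.js"]
```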

Category 2: Migration Failures

Symptoms

The deploy runs migrations and either fails mid-migration (leaving the database in a partially migrated state) or succeeds but the new code expects a schema that doesn't match what the migration actually produced.

Root Cause

The most dangerous variant is a migration that acquires a lock on a high-traffic table and holds it long enough to exhaust the connection pool. This looks like an application-wide outage, not a migration problem, which makes the initial diagnosis confusing.

Other causes: migrations that ran in a different order than expected because timestamps were manually edited, migrations that depend on model code that has already changed in the same deploy, or migrations that succeed on the development database (with its small data volume) but time out on production.

Diagnostic Steps

  1. Always inspect rails db:migrate:status on the server after a failure. Know which migrations ran and which didn't.
  2. For lock-related failures, check pg_locks and pg_stat_activity in PostgreSQL. Look for backends whose wait_event_type is Lock (the waiting column on PostgreSQL 9.5 and earlier) and identify what they're waiting on.
  3. Test migrations against a production-sized dataset before deploying. A migration that runs in 200ms on a table with 1,000 rows may take 40 minutes on a table with 10 million rows.
  4. If the migration modified a column that the running application code depends on, you have a backwards-compatibility problem. This is a deployment sequencing issue, not a migration issue.
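The arithmetic behind step 3 is worth making explicit. A linear extrapolation is a rough lower bound — it assumes cost scales with row count, which is optimistic once index builds or lock contention enter the picture — but it's enough to catch order-of-magnitude surprises before they hit production:

```ruby
# Rough, linear extrapolation of migration time from dev to production row
# counts. A lower bound: index builds and lock waits only make it worse.
def estimated_seconds(dev_seconds:, dev_rows:, prod_rows:)
  dev_seconds * (prod_rows.to_f / dev_rows)
end

estimated_seconds(dev_seconds: 0.2, dev_rows: 1_000, prod_rows: 10_000_000)
# => 2000.0 seconds, i.e. over half an hour for a "200ms" migration
```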

Category 3: Dependency Resolution Failures

Symptoms

bundle install fails during deployment, reporting that it cannot find a compatible set of gem versions. Or it succeeds but installs a different version than expected, causing runtime errors.

Root Cause

The Gemfile.lock on the server doesn't match what was committed. This happens when someone runs bundle update locally without committing the lockfile, when the deploy process runs bundle install without --frozen, or when the server has a different Bundler version that interprets the lockfile differently.

I've also seen failures caused by gems hosted on private gem servers that were temporarily unreachable during deploy, and by gems with native extensions that fail to compile because the server is missing a system library (libxml2-dev, libpq-dev, etc.).

Diagnostic Steps

  1. Verify that Gemfile.lock is committed and matches what's on the server. Run bundle check on the server.
  2. Confirm the Bundler version on the server matches the BUNDLED WITH line at the bottom of Gemfile.lock.
  3. For native extension failures, read the full mkmf.log in the gem's build directory — the actual missing header or library is named there.
  4. If using a private gem source, check network connectivity from the server to that source independently.
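Step 2 is easy to automate in a deploy sanity check. This sketch parses the BUNDLED WITH line out of a lockfile (the heredoc stands in for reading Gemfile.lock) so it can be compared against Bundler::VERSION on the server:

```ruby
# Extract the Bundler version recorded at the bottom of a Gemfile.lock.
# Returns nil if no BUNDLED WITH section is present.
def bundled_with(lockfile_text)
  lines = lockfile_text.lines.map(&:strip)
  idx = lines.index("BUNDLED WITH")
  idx && lines[idx + 1]
end

lock = <<~LOCK
  GEM
    specs:

  BUNDLED WITH
     2.4.19
LOCK

bundled_with(lock)  # => "2.4.19"
# Compare against Bundler::VERSION (or `bundle --version`) on the server.
```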

Category 4: Environment Variable Mismatches

Symptoms

The application boots but behaves incorrectly: connecting to the wrong database, sending emails through the wrong SMTP server, using development-mode settings in production, or crashing at runtime with KeyError for a missing environment variable.

Root Cause

Environment variables are set through a different mechanism than the code deployment, and the two get out of sync. A new feature requires a new environment variable, the code is deployed, but nobody added the variable to the server's environment. Or the variable is set in one process manager (systemd) but not another (cron), so the application works for web requests but fails in background jobs.

Diagnostic Steps

  1. Print the full environment from inside the running application. A one-off Rails console session with ENV.to_h.keys.sort tells you exactly what's available.
  2. Compare against the list of expected variables. If the application uses a gem like dotenv or a config initializer that reads ENV.fetch, grep the codebase for all ENV references to build the expected list.
  3. Check whether the process manager's environment is the same as the shell environment. systemctl show your-app.service -p Environment reveals what systemd actually passes.
  4. For variables that differ between web and worker processes, check each process type independently.
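Step 2's grep-and-diff can be sketched as a script. The regex below covers the two common access patterns, ENV.fetch("X") and ENV["X"]; the inlined source stands in for reading the actual codebase, and anything referenced but absent from the live environment is a candidate for the missing variable:

```ruby
# Collect the environment variable names a set of source strings reference
# via ENV.fetch("NAME") or ENV["NAME"]. Regex covers those two forms only.
def referenced_env_vars(sources)
  sources.flat_map { |src|
    src.scan(/ENV(?:\.fetch\(|\[)\s*["']([A-Z0-9_]+)["']/).flatten
  }.uniq.sort
end

code = <<~RUBY
  smtp = ENV.fetch("SMTP_HOST")
  db   = ENV["DATABASE_URL"]
RUBY

referenced_env_vars([code])        # => ["DATABASE_URL", "SMTP_HOST"]
missing = referenced_env_vars([code]) - ENV.keys
```

In practice the sources array would come from globbing app/, config/, and lib/ for .rb files; run the script once per process type (web, worker, cron) to catch the systemd-versus-cron divergence described above.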

Category 5: DNS and SSL Timing Issues

Symptoms

The deploy succeeds and the application runs, but users see SSL certificate errors, DNS resolution failures, or mixed-content warnings. These often appear intermittently because of DNS propagation or caching.

Root Cause

DNS changes and certificate provisioning are asynchronous. If a deploy depends on a new domain or subdomain, the DNS records may not have propagated to all resolvers. If using Let's Encrypt with automatic renewal, the renewal may fail silently and the certificate expires during a deploy window.

Diagnostic Steps

  1. Check the certificate directly: openssl s_client -connect yourdomain.com:443 -servername yourdomain.com shows the certificate chain and expiry.
  2. Use dig or nslookup against multiple DNS resolvers (not just the local one) to verify propagation.
  3. Check the certificate renewal logs. For Let's Encrypt with Certbot, the logs live in /var/log/letsencrypt/. For Caddy or other automatic issuers, check their respective log paths.
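The s_client check from step 1 translates directly into Ruby, which makes it easy to wire into a monitoring script. The hostname is a placeholder; the fetch sends SNI just as -servername does:

```ruby
require "openssl"
require "socket"

# Fetch the leaf certificate a host presents, sending SNI like -servername.
def peer_cert(host, port: 443)
  tcp = TCPSocket.new(host, port)
  ssl = OpenSSL::SSL::SSLSocket.new(tcp)
  ssl.hostname = host # SNI
  ssl.connect
  ssl.peer_cert
ensure
  ssl&.close
  tcp&.close
end

# Whole days until a certificate's notAfter date.
def days_left(not_after, now: Time.now)
  ((not_after - now) / 86_400).floor
end

# days_left(peer_cert("yourdomain.com").not_after)  # days remaining
```

Alerting when days_left drops below, say, 14 catches the silent-renewal-failure case well before the certificate actually expires.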

Category 6: Health Check Failures

Symptoms

The deploy completes and the application starts, but the load balancer or process manager marks it as unhealthy and routes traffic away (or restarts the process in a loop). From the server's perspective, the application is running. From the infrastructure's perspective, it's down.

Root Cause

The health check endpoint returns an error because it tests a dependency that isn't ready yet — the database connection pool hasn't initialized, a Redis connection hasn't been established, or an external service check is failing. Or the health check has a timeout shorter than the application's boot time.

Diagnostic Steps

  1. Hit the health check endpoint manually from the server itself: curl -v http://localhost:3000/health. Read the response body and status code.
  2. Check the timing. If the health check passes 30 seconds after boot but the load balancer expects a response within 10, the application is healthy but the infrastructure disagrees.
  3. Review what the health check actually tests. A health check that verifies every downstream dependency will fail when any one of them is slow.
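Steps 1 and 2 can be combined into one probe that reports both the status and the timing. The URL and the 10-second budget are assumptions — substitute your load balancer's actual endpoint and timeout:

```ruby
require "net/http"
require "uri"

# Hit a health check endpoint and time it against the load balancer's
# budget. Both the URL and the 10-second default are placeholder values.
def probe(url, budget_seconds: 10)
  started  = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.get_response(URI(url))
  elapsed  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { status: response.code.to_i,
    seconds: elapsed.round(2),
    within_budget: elapsed < budget_seconds }
end

# probe("http://localhost:3000/health")
```

A 200 with within_budget: false is exactly the "application is healthy but the infrastructure disagrees" case from step 2.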

Category 7: Connection Pool Exhaustion

Symptoms

The application works under light traffic but starts returning 500 errors under load, with ActiveRecord::ConnectionTimeoutError in the logs. Database connections sit in idle in transaction state. Restarting the application temporarily resolves it.

Root Cause

The database connection pool size doesn't match the number of threads that need connections. In Puma, each thread in each worker needs its own database connection. If Puma is configured with 5 threads and database.yml has a pool of 5, that works for a single worker — but if you have 4 workers, you need the pool to handle the per-worker thread count, and the PostgreSQL max_connections to handle all workers combined.

Background job processors compound this. Sidekiq threads also need database connections, and because Sidekiq runs as its own process, it maintains its own pool — sized by the same database.yml pool setting — so its concurrency counts against PostgreSQL's max_connections on top of everything Puma uses.


Diagnostic Steps

  1. Check the current pool configuration: ActiveRecord::Base.connection_pool.size in a Rails console.
  2. Check actual usage: ActiveRecord::Base.connection_pool.stat returns a hash with :size, :connections, :busy, and :waiting keys.
  3. On the PostgreSQL side, run SELECT count(*) FROM pg_stat_activity WHERE datname = 'your_db' to see total connections.
  4. Compare total connections against max_connections in PostgreSQL. If you're close to the limit, adding workers or processes will push you over.
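The arithmetic from the root-cause section, made explicit. The numbers here are example values; plug in your own Puma and Sidekiq settings:

```ruby
# Total database connections needed across all processes: each Puma worker
# runs its own thread pool, and each Sidekiq process adds its concurrency.
def required_connections(puma_workers:, puma_threads:, sidekiq_processes:, sidekiq_concurrency:)
  puma_workers * puma_threads + sidekiq_processes * sidekiq_concurrency
end

required_connections(puma_workers: 4, puma_threads: 5,
                     sidekiq_processes: 1, sidekiq_concurrency: 10)
# => 30
```

That total is what must fit under max_connections — with headroom left for consoles, cron jobs, and PostgreSQL's superuser-reserved slots. Within each process, the database.yml pool value only needs to cover that process's own thread count.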

Cross-Cutting Observation

The deployments that fail most confusingly are the ones where two or more of these categories interact. A migration that takes too long exhausts the connection pool. An environment variable mismatch causes the health check to fail. An asset compilation failure is masked by a cached version that eventually expires.

The single most useful habit I've developed is to change one thing per deploy and verify it before changing the next. When three changes go out together and something breaks, the diagnostic space is combinatorial. When one change goes out, you know exactly where to look.

For more on deployment patterns and infrastructure, see the Rails Deployment topic. For a structured approach to setting up reliable deploys, the Deploy Rails on Your Own Server path walks through the full stack from initial server provisioning through to automated deploys.