Debugging Production Rails Issues

[Figure: Flowchart showing the first-five-minutes triage sequence for a production Rails incident]

Production debugging is a different discipline from development debugging. You cannot binding.pry your way through a problem when the process is serving real traffic, the data is not yours to experiment with, and every minute of downtime is a minute your users are staring at an error page. The skills that matter are triage speed, systematic log reading, and knowing which diagnostic tools to reach for before you start guessing.

This guide covers the full production debugging surface for Rails applications: the first-five-minutes checklist that determines whether you resolve an issue quickly or burn hours, structured log analysis across multiple services, memory leak detection, integrating error tracking tools effectively, diagnosing database slow queries under load, and dealing with background job failures that do not surface obviously. It connects to the broader Debugging and Maintenance topic, which covers long-term codebase health alongside incident response.

I have been the person paged at 2 AM for Rails applications of various sizes for over a decade now, and the pattern is remarkably consistent: the developers who resolve incidents fastest are not the ones who know the most about Rails internals. They are the ones who have a repeatable triage process and the discipline to follow it under pressure.

The first-five-minutes checklist

When something breaks in production, the natural instinct is to start reading code. Resist it. Code reading is the slowest path to diagnosis when you do not yet know where to look. Instead, spend the first five minutes gathering context.

Minute one: confirm the symptom. What exactly is broken? There are four distinct categories and each has a different investigation path:

  • Errors (5xx responses): application exceptions reaching the user
  • Performance degradation: pages loading but slowly
  • Functional bugs: wrong data or wrong behaviour, no errors
  • Full outage: nothing responding at all

Do not skip this step. "The site is down" might mean the server is unreachable, the database is offline, Puma crashed, or a single endpoint is erroring while everything else works fine. Each of those has a completely different cause.

Minute two: check the infrastructure basics. Is the server responding to ping? Is disk space available? Is memory exhausted? Is the database process running? These take thirty seconds to check and account for a humbling percentage of production incidents.

df -h                        # disk space
free -m                      # memory
systemctl status puma        # application server process
systemctl status postgresql  # database process

Minute three: establish a timeline. When did the problem start? Was there a deploy in the last hour? A database migration? A config change? A traffic spike? Correlating the incident start time with recent events narrows your search space from "anything in the codebase" to "something that changed at 14:15."

Minute four: check error tracking. If you are running Sentry, Honeybadger, Bugsnag or similar, open it. Look at the error type, the stack trace, the frequency graph and which users are affected. A single stack trace pointing at a specific line of code is worth more than twenty minutes of hypothesis-driven investigation.

Minute five: check application logs. Pull the logs from around the incident start time. Look for error-level messages, unusual SQL patterns, timeout warnings and connection failures. At this point you should have enough context to form a hypothesis about the root cause.

Five minutes. That is all it takes to go from "something is wrong" to "I have a working theory about what is wrong." The developers who skip this sequence and jump straight to reading source code routinely spend an hour finding what the checklist would have surfaced in five minutes.

Structured log analysis

Rails logs are dense with information that most developers never systematically read. During an incident, the ability to cross-reference logs across services is the single most valuable debugging skill you can develop.

A production Rails stack generates multiple log streams, and each one tells a different part of the story:

  • Rails application log: request parameters, SQL queries, rendering times, exception messages
  • Nginx access log: URLs, response codes, upstream response times, client IPs
  • Nginx error log: proxy failures, upstream timeouts, socket errors
  • PostgreSQL log: slow queries, lock waits, deadlocks, connection events
  • Sidekiq log: job execution, failures, retries, queue depth
  • System journal: OOM kills, disk errors, service restarts

The technique that produces results: pick the timestamp of the incident, then read all six log streams for that window. A user reports a 500 error at 14:23. The Rails log shows a PG::ConnectionBad exception. The PostgreSQL log shows max connections reached at 14:22. The system journal shows a Sidekiq restart at 14:21 that spawned 25 new connections simultaneously. Now you know the cause: the Sidekiq restart exhausted the PostgreSQL connection pool.

None of those individual logs would have told you the full story. The cross-reference did.
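This cross-referencing can be done by hand with grep, but a small script that merges the streams onto a single timeline makes the causal chain jump out. A minimal sketch of the idea — the file contents and timestamp format here are illustrative assumptions; real formats vary per service, so the regex would need adjusting for your stack:

```ruby
# Merge several log streams into one timeline around an incident window.
# Assumes each line starts with an ISO-8601-ish timestamp.
require 'time'

def merge_streams(streams, from:, to:)
  events = streams.flat_map do |source, lines|
    lines.filter_map do |line|
      ts = line[/\A(\d{4}-\d{2}-\d{2}[T ][\d:]{8})/, 1]
      next unless ts
      t = Time.parse(ts)
      [t, source, line.strip] if t.between?(from, to)
    end
  end
  events.sort_by(&:first) # one chronological timeline across all services
end

streams = {
  'rails'    => ['2024-05-01 14:23:05 ERROR PG::ConnectionBad'],
  'postgres' => ['2024-05-01 14:22:40 FATAL too many connections'],
  'journal'  => ['2024-05-01 14:21:10 sidekiq.service: restarted']
}

timeline = merge_streams(streams,
                         from: Time.parse('2024-05-01 14:20'),
                         to:   Time.parse('2024-05-01 14:25'))
timeline.each { |t, source, line| puts format('%s %-8s %s', t.strftime('%H:%M:%S'), source, line) }
```

Read top to bottom, the merged timeline tells the connection-pool story from the example above in three lines: restart, exhaustion, exception.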

For applications with high log volume, structured logging changes the game. Gems like lograge collapse Rails' multi-line request logs into single-line JSON entries that you can grep, filter and aggregate. Adding a request ID that propagates through Rails, Sidekiq and your external service calls lets you trace a single user action across every system it touched.

# config/environments/production.rb
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.custom_payload do |controller|
  { request_id: controller.request.request_id }
end
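Once every stream emits JSON carrying a request_id, tracing a single user action becomes a filter rather than a hunt. A sketch of the pattern — the log lines here are invented for illustration:

```ruby
# Pull every JSON log entry that belongs to one request ID.
require 'json'

def trace_request(lines, request_id)
  lines.filter_map do |line|
    entry = JSON.parse(line) rescue nil # skip non-JSON lines
    entry if entry && entry['request_id'] == request_id
  end
end

log_lines = [
  '{"request_id":"abc123","controller":"OrdersController","status":500,"duration":1840.2}',
  '{"request_id":"def456","controller":"HomeController","status":200,"duration":45.1}',
  '{"request_id":"abc123","job":"OrderMailerJob","error":"PG::ConnectionBad"}'
]

trace_request(log_lines, 'abc123').each { |entry| puts entry }
```

The same filter works across the Rails log, the Sidekiq log and any external-service client that logs the propagated ID.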

Memory leak detection

Ruby memory leaks manifest as gradual RSS growth over hours or days until the process is killed by the OOM killer or your monitoring threshold. They are frustrating because they do not produce error messages — the process just gets bigger until it dies.

The diagnostic approach:

1. Confirm it is actually a leak. Ruby processes grow in memory during the first few hundred requests as the heap expands to its working set size. This is not a leak; it is the runtime finding its steady state. A leak is continuous growth that never plateaus, even after thousands of requests.

2. Monitor RSS over time. Use your monitoring tool (Datadog, Prometheus, or even a cron job logging ps -o rss) to graph memory usage per worker over hours. If the line goes up and to the right without flattening, you have a leak.

3. Identify the growth pattern. Does every worker leak at the same rate, or just one? Uniform leaking suggests application code. A single worker leaking suggests something stateful like a connection or file handle.

4. Use diagnostic tools. The derailed_benchmarks gem can identify memory growth per request by running your application in a loop and measuring RSS at each iteration. For more precision, memory_profiler shows object allocations per request, which helps identify code paths that accumulate objects without releasing them.
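For the cron-job approach in step 2, a few lines of Ruby are enough to sample a process's RSS and log it for graphing. A sketch, assuming Linux (with a ps fallback for macOS/BSD) — in practice you would point it at your Puma worker PIDs rather than the script's own:

```ruby
require 'time'

# Resident set size in kilobytes, read from /proc on Linux or ps elsewhere.
def rss_kb(pid = Process.pid)
  if File.readable?("/proc/#{pid}/status")
    File.read("/proc/#{pid}/status")[/^VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{pid}`.to_i
  end
end

def rss_mb(pid = Process.pid)
  (rss_kb(pid) / 1024.0).round(1)
end

# One sample per cron run; append to a log file and graph the series.
puts "#{Time.now.utc.iso8601} pid=#{Process.pid} rss_mb=#{rss_mb}"
```

If the per-worker series climbs without flattening across hours, move on to derailed_benchmarks and memory_profiler to find the allocation site.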

Common sources of memory leaks in Rails applications: unbounded caching (a memoisation hash that grows with each unique input), event listeners that are added but never removed, global variables or class-level arrays that accumulate data, and native extensions with their own memory management bugs. String concatenation in loops is also a classic — each concatenation allocates a new, larger string while the old one waits for GC.
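The unbounded-memoisation case is worth seeing concretely, because the fix is usually just a size cap. A minimal sketch of the leak and a bounded variant — the geocoder names and the 1,000-entry cap are illustrative:

```ruby
# A classic leak: memoisation keyed by unbounded user input. Every unique
# address adds an entry that is never evicted, so RSS grows forever.
class LeakyGeocoder
  def initialize
    @cache = {}
  end

  def lookup(address)
    @cache[address] ||= address.hash # stand-in for a real geocoding call
  end
end

# The fix: cap the cache and evict the oldest entry first. Ruby hashes
# preserve insertion order, so the first key is the oldest.
class BoundedGeocoder
  MAX_ENTRIES = 1_000

  def initialize
    @cache = {}
  end

  def lookup(address)
    unless @cache.key?(address)
      @cache.delete(@cache.keys.first) if @cache.size >= MAX_ENTRIES
    end
    @cache[address] ||= address.hash
  end

  def cache_size
    @cache.size
  end
end
```

Run ten thousand unique lookups through each and the leaky version holds ten thousand entries while the bounded one plateaus at the cap — exactly the "up and to the right" versus "flattens out" distinction from the RSS graph.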

The pragmatic fix when you cannot find the leak quickly: configure Puma's worker killer to restart workers after a memory threshold. This is not a solution — it is a tourniquet. But it keeps your application running while you investigate.

# config/puma.rb — kill the largest worker once total worker
# memory reaches 90% of the 1024 MB budget
before_fork do
  require 'puma_worker_killer'
  PumaWorkerKiller.config do |config|
    config.ram = 1024          # total RAM available to workers, in MB
    config.frequency = 60      # check every 60 seconds
    config.percent_usage = 0.9 # act at 90% of config.ram
  end
  PumaWorkerKiller.start
end

Error tracking integration

An error tracker (Sentry, Honeybadger, Bugsnag, Rollbar) is not optional for production Rails applications. It is the difference between discovering errors from user complaints and discovering them before users notice.

Effective error tracking setup goes beyond installing the gem and adding an API key:

Group errors by root cause, not by message. An ActiveRecord::RecordNotFound on ten different endpoints is ten symptoms of one problem (bad links or missing data), not ten separate issues.

Set up alerting thresholds. A new error type should alert immediately. A known error exceeding its normal frequency should alert. Background noise errors (bot traffic hitting invalid URLs) should be filtered, not ignored entirely but excluded from alerting.
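Filtering the noise without discarding it entirely can be done at the SDK level before events ever reach the tracker. A hedged sketch using sentry-ruby's before_send hook — the filtered exception class is an example policy, not a recommendation for every application:

```ruby
# config/initializers/sentry.rb
Sentry.init do |config|
  config.dsn = ENV['SENTRY_DSN']

  # Drop known background noise before it reaches alerting. Here,
  # bot traffic hitting invalid URLs (ActionController::RoutingError)
  # stays in the local logs but is not reported — tune to your own noise.
  config.before_send = lambda do |event, hint|
    exception = hint[:exception]
    next nil if exception.is_a?(ActionController::RoutingError)
    event
  end
end
```

Returning nil from before_send discards the event; anything else is sent as normal, so the filter is a natural place to encode "logged but not alerted" policies.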

Attach context to every error. The error message alone is often insufficient. Attach the current user ID, the request parameters, the relevant database record IDs and any feature flags that were active. Most trackers support this through a context or scope mechanism.

Sentry.set_context('order', { id: order.id, status: order.status })
Sentry.set_user({ id: current_user.id, email: current_user.email })

Review errors weekly, not just when alerted. Many production issues manifest as a gradual increase in a known error type rather than a sudden spike. Weekly review catches these trends before they become incidents.

Database slow queries

Slow database queries are the most common cause of performance degradation in production Rails applications, and they often appear suddenly when data volume crosses a threshold that changes PostgreSQL's query plan.

Enable slow query logging in PostgreSQL:

# postgresql.conf
log_min_duration_statement = 200  # log queries slower than 200ms

This gives you a timestamped record of every query that exceeds your threshold, including the full SQL. Cross-reference with your Rails logs to identify which controller action and which ActiveRecord call generated the query.

The usual culprits: missing indexes on foreign key columns, N+1 queries that were invisible with small datasets, ORDER BY on unindexed columns, full-text searches without GIN indexes, and queries that return far more data than the application actually uses (selecting all columns when you only need three).

For acute incidents — a query that was fast yesterday and is slow today — check for lock contention. A long-running migration or a stuck transaction can hold locks that force other queries to wait. SELECT * FROM pg_stat_activity WHERE state = 'active' shows you what is running right now.

Background job failures

Sidekiq job failures are insidious because they happen outside the request cycle and are invisible to users until the downstream effects appear: emails not sent, reports not generated, webhooks not delivered.

Monitor your dead job queue. Sidekiq retries failed jobs with exponential backoff, but after exhausting retries, jobs go to the dead set. If you are not monitoring the dead set, failed jobs disappear silently. Check it daily, or set up an alert when the dead set size exceeds a threshold.

Make jobs idempotent. A job that runs twice should produce the same result as a job that runs once. Sidekiq guarantees at-least-once delivery, not exactly-once. If your job charges a credit card without checking whether the charge already succeeded, retries will double-charge users. Ask me how I know.
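Idempotency usually comes down to recording that the side effect happened, keyed by something stable, and checking before acting. A minimal sketch — in a real application the ledger would be a database table with a unique index on the idempotency key; here an in-memory Set keeps the example runnable:

```ruby
require 'set'

# Idempotent charge processing: re-running the same job is a no-op.
class ChargeJob
  def initialize(ledger)
    @ledger = ledger # shared record of completed charges
  end

  def perform(order_id, amount_cents)
    key = "charge-#{order_id}"
    return :already_charged if @ledger.include?(key) # retry hit: do nothing

    # ... call the payment gateway here ...
    @ledger.add(key)
    :charged
  end
end

ledger = Set.new
job = ChargeJob.new(ledger)
first  = job.perform(42, 1999) # normal run: charges the card
second = job.perform(42, 1999) # Sidekiq retry of the same job: no-op
```

With a database-backed ledger, the check and the insert must happen in one transaction (or rely on the unique index raising on conflict), otherwise two concurrent retries can still race past the check.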

Log job arguments and results. When a job fails in production, you need to know what arguments it received and what state the system was in when it ran. Include the job ID, arguments, execution time and any relevant record IDs in your structured logs.

Watch for cascading failures. A database outage causes all database-dependent jobs to fail simultaneously. Sidekiq retries them all, creating a thundering herd when the database comes back. Configure retry jitter and consider circuit breakers for jobs that depend on external services.
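A circuit breaker for external-service jobs can be small: after N consecutive failures, stop calling for a cool-off period instead of hammering a service that is already down. A sketch with illustrative thresholds:

```ruby
# Minimal circuit breaker: opens after `threshold` consecutive failures,
# refuses calls until `cooldown` seconds have passed, then tries again.
class CircuitBreaker
  class OpenError < StandardError; end

  def initialize(threshold: 5, cooldown: 30)
    @threshold = threshold
    @cooldown  = cooldown
    @failures  = 0
    @opened_at = nil
  end

  def call
    if @opened_at && (Time.now - @opened_at) < @cooldown
      raise OpenError, 'circuit open, skipping call'
    end

    result = yield
    @failures  = 0   # success resets the failure count
    @opened_at = nil
    result
  rescue OpenError
    raise
  rescue StandardError
    @failures += 1
    @opened_at = Time.now if @failures >= @threshold
    raise
  end
end
```

A Sidekiq job would wrap its external call in breaker.call { ... } and treat OpenError as a signal to reschedule itself for later rather than retrying immediately — which is exactly what breaks the thundering herd.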

What usually goes wrong

Skipping the triage checklist. Developers jump straight to reading code, form a hypothesis based on intuition, spend an hour investigating the wrong thing, then finally check the logs and find the answer in thirty seconds.

Not having structured logging before the incident. You cannot retroactively add request IDs to logs. Set up lograge and request ID propagation before you need them.

Ignoring the dead job queue. Failed background jobs accumulate silently for weeks until a user complains about missing emails or stale reports.

Treating memory growth as normal. "Our workers just use a lot of memory" is not a diagnosis. If RSS grows continuously, something is leaking. The worker killer buys time but does not fix the problem.

No baseline metrics. You cannot identify "slow" if you do not know what "normal" looks like. Establish baseline response times, error rates and memory usage before you need to debug a deviation.

Investigating without a timeline. Asking "what changed?" without knowing when the problem started is searching without a map.

Checklist summary

  • Establish a written first-five-minutes triage checklist and practice it
  • Set up structured logging with lograge and request ID propagation
  • Configure error tracking with context, alerting thresholds and weekly review
  • Enable PostgreSQL slow query logging at 200ms threshold
  • Monitor Sidekiq dead job queue size with alerting
  • Graph worker memory (RSS) over time to detect leaks early
  • Set up Puma worker killer as a safety net for memory issues
  • Ensure all background jobs are idempotent
  • Record baseline performance metrics so you can detect deviations
  • Document your incident response process — who gets paged, where to look first, how to communicate status

Frequently asked questions

What error tracker should I use for Rails?

Honeybadger and Sentry are both excellent. Honeybadger has a simpler interface and better Rails-specific defaults. Sentry has more features and supports more languages if your stack extends beyond Ruby. Both provide the core capability: grouped errors, stack traces, alerting and context attachment. Pick one and configure it well rather than agonising over the choice.

How do I debug a problem I cannot reproduce locally?

Add more context to your error tracking and logging. Attach the full request parameters, the current user's relevant attributes, and the state of any records involved. Often the problem is data-dependent — a specific combination of record states that does not exist in your development database. Once you identify the data pattern, you can reproduce it locally by creating matching fixtures.

Should I use rails console in production?

With extreme caution and only for read-only investigation. Never modify data through the console unless you have a specific, tested command and a rollback plan. Always use --sandbox mode when you just need to inspect state. And log what you do — if someone asks later what happened at 14:35, "I was poking around in console" is not a reassuring answer.

How do I detect N+1 queries in production?

The bullet gem works in development but is not designed for production use. Instead, use your APM tool's transaction trace view, which shows all SQL queries per request. Alternatively, enable ActiveSupport::Notifications to log query counts per request and alert when a request exceeds a threshold (say, 30 queries).
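The counting logic itself is simple. A sketch of the per-request counter — in a real application you would drive record_query from an ActiveSupport::Notifications subscriber on the "sql.active_record" event rather than calling it directly as done here:

```ruby
# Count queries per request and flag requests that exceed a threshold.
class QueryCounter
  THRESHOLD = 30

  attr_reader :count

  def initialize
    @count = 0
  end

  def record_query(sql)
    @count += 1
  end

  def over_threshold?
    @count > THRESHOLD
  end
end

counter = QueryCounter.new
# Simulate an N+1: one query per comment in a 35-comment thread.
35.times { |i| counter.record_query("SELECT * FROM comments WHERE post_id = #{i}") }
warn "possible N+1: #{counter.count} queries in one request" if counter.over_threshold?
```

Logging the flagged request's controller, action and request ID alongside the count turns this from a curiosity into an actionable alert.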

What memory threshold should trigger a Puma worker restart?

Start with 80-90% of available memory divided by worker count. If your server has 2 GB of RAM and runs 4 Puma workers, each worker's share is roughly 400 MB after accounting for the OS and other processes. Set the restart threshold at 350-400 MB per worker. Adjust based on your application's actual steady-state memory usage.
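The arithmetic in concrete terms — the OS overhead figure is an assumption for illustration; measure your own server's baseline:

```ruby
# Per-worker memory budget for a 2 GB server running 4 Puma workers.
total_ram_mb   = 2048
os_overhead_mb = 400  # assumed: OS, nginx, monitoring agents — measure yours
workers        = 4

per_worker_mb = (total_ram_mb - os_overhead_mb) / workers # => 412
restart_at_mb = (per_worker_mb * 0.9).round               # => 371, within the 350-400 range

puts "budget per worker: #{per_worker_mb} MB, restart at: #{restart_at_mb} MB"
```

If your steady-state RSS per worker is already near the computed threshold, the restart will fire constantly — raise the budget or reduce worker count before blaming a leak.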