Debugging and Maintenance — Keeping Rails Applications Healthy
Debugging a Rails application in production is fundamentally different from debugging in development. The error messages are less helpful, you cannot attach a debugger, the data is real, and the clock is ticking because users are affected. The skills that matter are triage speed, structured log analysis, hypothesis testing and knowing which diagnostic tools to reach for first. This topic covers the debugging and maintenance surface for production Rails applications, from incident response to long-term codebase health. The connected debugging guide provides step-by-step triage workflows. The Make an Old Rails App Safer to Change learning path covers the maintenance side in structured detail.
Production triage: the first five minutes
When a production issue is reported, the first five minutes determine whether you resolve it in thirty minutes or three hours. The difference is almost always about gathering the right information before forming a hypothesis.
A reliable triage sequence:
- Confirm the symptom. Is it an error (500s), a performance problem (slow responses), a functional bug (wrong data), or an outage (site completely down)? Each category has a different investigation path.
- Check the obvious. Is the server running? Is the database reachable? Is disk space full? Is memory exhausted? These account for a startling percentage of production incidents, and they are all checkable in under sixty seconds.
- Establish the timeline. When did it start? Was there a recent deployment? A traffic spike? A database migration? A dependency update? Correlating the symptom start time with recent changes narrows the search space enormously.
- Check error tracking. If you are using an error tracker (Sentry, Honeybadger, Bugsnag), look at the error type, stack trace, frequency and affected users. A single stack trace is worth more than ten minutes of guessing.
- Check application logs. Look at the requests around the time the issue started. Focus on error-level messages, unusual SQL queries, timeout warnings and connection errors.
This sequence sounds obvious, but under pressure, developers routinely skip the second and third steps — checking the obvious and establishing the timeline — and jump straight to reading code, which is the slowest path to resolution.
Structured log analysis
Rails application logs contain enormous amounts of diagnostic information, but most developers only skim them during incidents. Structured log analysis means approaching log data systematically rather than reading it like prose.
For a production Rails application running on a single server, the key log files are:
- Application log (log/production.log): request parameters, SQL queries, rendering times, error messages
- Nginx access log: request URLs, response codes, upstream response times, client IPs
- Nginx error log: proxy failures, upstream timeouts, configuration errors
- PostgreSQL log: slow queries, connection events, lock waits, deadlocks
- Sidekiq log: job execution, failures, retries, queue depth warnings
- System log (/var/log/syslog or journalctl): OOM kills, disk errors, service restarts
When investigating a specific issue, the most useful technique is correlating timestamps across multiple log files. If a user reports a slow request at 14:23, check the Rails log for the request, the Nginx log for the upstream response time, and the PostgreSQL log for any slow queries during that window. Cross-referencing narrows the cause faster than reading any single log in isolation.
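The correlation step can be sketched as a small helper. The timestamp format here is an assumption — many Rails and Nginx log formats can emit ISO-8601-style stamps — so adjust the regex to whatever your own logs produce.

```ruby
require "time"

# Pull lines from any log whose timestamp falls within a window around the
# reported incident time. Assumes each line starts with an ISO-8601-ish stamp,
# e.g. "2024-05-01 14:23:10 ..." -- adjust the regex for your log format.
def lines_near(lines, incident_time, window_seconds: 60)
  lines.select do |line|
    stamp = line[/\A\[?(\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})/, 1]
    next false unless stamp
    (Time.parse(stamp) - incident_time).abs <= window_seconds
  end
end
```

Run it over each log file in turn with the same incident time, and compare the slices side by side instead of scrolling through whole files.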
For applications that produce high log volumes, structured logging with JSON output (using gems like lograge or semantic_logger) makes logs filterable and aggregatable. Instead of parsing free-form text, you can query by request ID, controller, action, status code, duration or any custom field.
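A minimal lograge setup along those lines might look like the sketch below (it goes in config/environments/production.rb); the request_id field is one example of a custom field worth attaching so lines can be joined per request.

```ruby
# config/environments/production.rb -- minimal JSON logging via lograge (sketch).
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new

# Attach the request ID so a single request can be followed across log lines.
config.lograge.custom_payload do |controller|
  { request_id: controller.request.request_id }
end
```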
Dependency audits
Every Rails application depends on dozens of gems, each of which depends on other gems. Over time, this dependency tree accumulates security vulnerabilities, deprecated behavior and incompatibilities. Dependency auditing is not glamorous work, but skipping it is how applications end up running gems with known security exploits.
A practical dependency audit covers:
- Security vulnerabilities. Run bundle audit regularly. It checks your Gemfile.lock against the Ruby Advisory Database and reports known vulnerabilities with severity ratings.
- Outdated gems. Run bundle outdated to see which gems have newer versions available. Prioritise by category: security patches first, then gems with breaking changes that your tests cover, then major version bumps that require investigation.
- Abandoned gems. Check the last commit date and issue activity for gems you depend on. A gem with no commits in two years and unanswered security issues is a liability. Look for maintained forks or alternative gems.
- License compliance. Run license_finder to verify that all gem licenses are compatible with your project's licensing requirements.
How often to audit depends on your application's risk profile. At minimum, run bundle audit weekly and do a full dependency review quarterly.
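One way to enforce that cadence is to put the audit in CI. A sketch of a Rake task wrapping the bundler-audit CLI (the task name and file path are illustrative):

```ruby
# lib/tasks/deps.rake -- fail the build on known vulnerabilities (sketch).
namespace :deps do
  desc "Check Gemfile.lock against the Ruby Advisory Database"
  task :audit do
    # --update refreshes the local advisory database before checking.
    sh "bundle exec bundle-audit check --update"
  end
end
```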
Safe refactoring in legacy codebases
The term "legacy code" gets thrown around loosely, but the practical definition is: code you need to change but are afraid to change because you do not have enough tests to know if your changes break something. Working on legacy Rails codebases requires a specific set of skills that are distinct from greenfield development.
The first rule of legacy refactoring: do not make it worse. Before changing any code, establish what the code currently does. Write characterization tests—tests that assert the current behavior, even if that behavior is wrong—so that you have a safety net for your changes.
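A characterization test in practice: the helper below is hypothetical, and the point is that the assertions pin down today's output — including the surprising negative-amount case — before any refactoring begins.

```ruby
require "minitest/autorun"

# Hypothetical legacy helper we need to change but do not fully trust.
def legacy_price_label(cents)
  "$#{cents / 100}.#{format('%02d', cents % 100)}"
end

class LegacyPriceLabelTest < Minitest::Test
  def test_whole_dollars
    assert_equal "$19.00", legacy_price_label(1_900)
  end

  def test_cents_are_zero_padded
    assert_equal "$0.05", legacy_price_label(5)
  end

  def test_negative_amounts
    # Ruby's floored integer division makes this "$-2.50" -- arguably a bug,
    # but a characterization test records current behavior, not intent.
    assert_equal "$-2.50", legacy_price_label(-150)
  end
end
```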
The second rule: make small, reversible changes. Large refactoring branches that touch fifty files are hard to review, hard to test and hard to roll back. Small changes that each improve one thing are individually low-risk and collectively transformative.
Practical refactoring patterns for Rails legacy code:
- Extract method. Move a block of code from a large method into a named method with clear intent. This is the lowest-risk refactoring and the most impactful for readability.
- Replace conditional with polymorphism. Large case statements or if/elsif chains that switch on type can often be replaced with small classes that each handle one case.
- Introduce service object. Move business logic from a fat model or fat controller into a plain Ruby class with a single public method. This makes the logic testable in isolation.
- Strangler pattern for gem replacement. Wrap the old gem's API with an adapter, write the adapter's tests, then swap the implementation behind the adapter without changing calling code.
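Of these, the service object is the easiest to show in isolation. Everything below — the discount domain, the names — is illustrative; the shape is what matters: a plain Ruby class, one public method, no Rails dependency.

```ruby
# A plain Ruby service object extracted from a hypothetical fat controller.
class ApplyDiscount
  def initialize(total_cents:, coupon: nil)
    @total_cents = total_cents
    @coupon = coupon
  end

  # Returns the discounted total in cents; with no coupon, the total is unchanged.
  def call
    return @total_cents unless @coupon
    discount = (@total_cents * @coupon.fetch(:percent_off) / 100.0).round
    @total_cents - discount
  end
end

ApplyDiscount.new(total_cents: 10_000, coupon: { percent_off: 15 }).call # => 8500
```

Because the class has no Rails dependency, its unit tests run in milliseconds and the controller shrinks to parameter handling plus one call.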
Dead code identification
Every Rails application accumulates dead code over time. Unused models, orphaned views, feature-flagged code where the flag was removed but the code was not, abandoned API endpoints and zombie rake tasks. Dead code is not harmless: it makes the codebase harder to navigate, increases test run time and creates false positives in security audits.
Finding dead code requires a combination of static analysis and runtime observation:
- Static analysis: debride scans Ruby files for methods that are never called. It produces false positives (dynamic dispatch defeats static analysis) but catches the obvious cases.
- Runtime coverage: coverband instruments production code to track which files and methods are actually executed by real traffic. After running for a few weeks, it shows you which code paths are truly dead.
- Route analysis: compare bin/rails routes output with your actual traffic logs. Routes that receive zero hits over a month are candidates for removal.
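The route-analysis comparison can be sketched as a small helper. The function and its inputs are illustrative; route patterns use Rails-style :param segments, turned into regexes for matching against logged paths.

```ruby
# Given route patterns (as printed by bin/rails routes) and the request paths
# seen in an access log, report the routes that received zero hits.
def unused_routes(route_patterns, logged_paths)
  route_patterns.reject do |pattern|
    # Turn "/users/:id" into a regex that matches "/users/42" and so on.
    regex = /\A#{pattern.gsub(/:\w+/, '[^/]+')}\z/
    logged_paths.any? { |path| path.match?(regex) }
  end
end

unused_routes(["/users/:id", "/reports/legacy"], ["/users/42", "/users/7"])
# => ["/reports/legacy"]
```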
The safe approach to removing dead code: do not delete it in one large commit. Comment it out with a dated note, deploy, observe for a week, then delete. If something breaks, the commented code is easy to find and restore.
Test coverage as a maintenance tool
Test coverage in a maintenance context is not about achieving a percentage target. It is about knowing which parts of your application are covered and which are not, so you can make informed decisions about the risk of changes.
The worst test coverage situation is not low coverage. It is uneven coverage: 95% coverage on the parts of the application that never change, and 10% coverage on the parts that change every sprint. Coverage analysis should focus on the code paths that are actively being modified.
For legacy applications, building test coverage incrementally is more practical than a coverage sprint. Every time you fix a bug, write a test that reproduces it first. Every time you refactor a method, write characterization tests before you start. Over months, this approach builds meaningful coverage exactly where it matters most.
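To see where coverage actually sits, SimpleCov's report grouping is useful. A sketch — it goes at the top of test/test_helper.rb, and the group names are illustrative:

```ruby
# test/test_helper.rb -- must run before the application code is loaded.
require "simplecov"

SimpleCov.start "rails" do
  # Group the report by the areas that actually change, so uneven coverage
  # is visible at a glance instead of hidden in one overall percentage.
  add_group "Services", "app/services"
  add_group "Billing", "app/models/billing" # hypothetical hot spot
end
```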
Monitoring as a maintenance practice
Monitoring is not just for deployment. It is a maintenance practice that surfaces problems before users report them. The monitoring setup that matters for long-term maintenance:
- Error rate trends. A slow increase in 500 errors often indicates a data quality problem or a gradual incompatibility, not a sudden failure.
- Response time percentiles. p50 tells you about the typical experience. p99 tells you about the worst experience. Tracking both over time reveals degradation before it becomes critical.
- Background job health. Queue depth, failure rate and processing latency. A growing queue often indicates a problem in the application code, not the queue system.
- Database connection pool saturation. ActiveRecord logs connection pool wait events. If these increase, you are approaching the pool size limit and need to either increase the pool or reduce connection hold times.
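A pool-saturation check might look like the sketch below. The stat hash mirrors the shape returned by ActiveRecord::Base.connection_pool.stat; the thresholds and the function itself are illustrative.

```ruby
# stat is a hash like { size: 10, busy: 9, idle: 1, waiting: 0 }, the shape
# ActiveRecord::Base.connection_pool.stat returns in a running Rails app.
def pool_saturated?(stat, busy_ratio: 0.9)
  stat.fetch(:waiting) > 0 ||
    stat.fetch(:busy).to_f / stat.fetch(:size) >= busy_ratio
end

pool_saturated?({ size: 10, busy: 9, idle: 1, waiting: 0 }) # => true
```

Wired into a periodic job or health endpoint, a check like this surfaces pool pressure before requests start timing out waiting for connections.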
Common debugging and maintenance anti-patterns
- Debugging by reading code instead of reading logs. Code tells you what should happen. Logs tell you what actually happened. Start with logs.
- Fixing symptoms instead of root causes. Adding a rescue clause around a crashing method instead of understanding why it crashes. The crash will manifest differently next time.
- Avoiding dependency updates. Letting gems drift multiple major versions behind because updating is "risky." The risk of updating grows the longer you wait.
- No runbook for common incidents. Every incident that has happened once should have a documented response procedure. The second occurrence should be faster than the first.
- Refactoring without tests. Changing code structure without a safety net of tests that verify the behavior is preserved.
Frequently asked questions
How do I get started improving a legacy Rails codebase?
Start with a dependency audit (bundle audit, bundle outdated) and a test coverage assessment. These give you a map of the risks. Then write characterization tests for the code you need to change first. Do not try to improve everything at once.
What error tracking tool should I use?
Sentry, Honeybadger and Bugsnag are all solid choices with different pricing models. Any of them is better than none. The important thing is that errors are automatically captured, grouped and alerted on, so you hear about problems before your users email you.
How do I convince my team to invest in maintenance?
Track the cost of incidents: time to diagnose, time to fix, user impact, and frequency. Maintenance investment reduces all four metrics. Present it as risk reduction with measurable returns, not as "cleaning up code."
What is the most common root cause of production Rails issues?
Database-related problems: missing indexes causing slow queries, connection pool exhaustion, and migrations that lock tables during deploys. Application logic bugs are a close second, but they are usually easier to diagnose because they produce clear stack traces.