We’ve been running Node.js in production since the 0.4 days. The language is easy to get started with. Keeping it running under real traffic is a different problem.

Process management

The application needs to start at boot, restart on crash, and respond to system signals. Upstart handles this on Ubuntu without additional dependencies:

description "myserver"

env APP_HOME=/var/www/myserver/releases/current
env NODE_ENV=production
env RUN_AS_USER=www-data

start on (net-device-up and local-filesystems and runlevel [2345])
stop on runlevel [016]

respawn
respawn limit 5 60

pre-start script
    test -x /usr/local/bin/node || { stop; exit 0; }
    test -e $APP_HOME/logs || { stop; exit 0; }
end script

script
    chdir $APP_HOME
    exec /usr/local/bin/node bin/cluster app.js \
        -u $RUN_AS_USER \
        -l logs/myserver.out \
        -e logs/myserver.err >> $APP_HOME/logs/upstart
end script

respawn limit 5 60 prevents a crash loop: if the job respawns 5 times within 60 seconds, Upstart gives up rather than restarting it forever. The pre-start script verifies that the Node binary and the log directory exist before attempting to start.

Upstart’s restart command doesn’t re-read the job configuration. After editing the job file, use stop then start so the changes take effect.

Clustering

Node executes JavaScript on a single thread: one process drives one core, and on a multi-core server the rest sit idle. The cluster module forks worker processes that share the same listening port:

var cluster = require('cluster');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
    for (var i = 0; i < numCPUs; i++) {
        cluster.fork();
    }
    cluster.on('exit', function(worker) {
        console.log('worker ' + worker.process.pid + ' died');
        cluster.fork();
    });
} else {
    require('./app.js');
}

The master forks one worker per core. When a worker dies, the master replaces it. The cluster module is labeled “experimental” in the docs but has been stable in production for us.

Even on single-core instances, clustering is worth it for zero-downtime deploys. The cluster module doesn’t do this by itself; the master handles USR2 by spawning new workers with updated code while old workers drain existing connections:

kill -USR2 $(cat /var/run/myserver.pid)

New workers start, old workers stop accepting connections, finish serving in-flight requests, then exit. No dropped connections, no downtime window.

Heartbeat monitoring

A Node process can be “running” while being effectively dead – event loop blocked, memory exhausted, or wedged on a stalled I/O call. The process manager sees it as alive. The health check endpoint doesn’t respond.

A heartbeat monitor catches this:

var http = require('http');

function startHeartbeat(port) {
    setInterval(function() {
        var timedOut = false;
        var req = http.get('http://localhost:' + port + '/health', function(res) {
            res.resume(); // consume the body so the socket is released
            if (res.statusCode !== 200) {
                restartCluster('heartbeat: status ' + res.statusCode);
            }
        });
        req.setTimeout(10000, function() {
            timedOut = true;
            req.abort();
            restartCluster('heartbeat: timeout');
        });
        req.on('error', function(err) {
            // abort() also emits 'error'; don't restart twice for one timeout
            if (!timedOut) restartCluster('heartbeat: ' + err.message);
        });
    }, 30000);
}

Every 30 seconds, hit the health endpoint. If it doesn’t respond within 10 seconds, restart the cluster. This has caught wedged processes that would have stayed unresponsive until someone noticed manually.

Log rotation

Logs fill disks. Logrotate handles this without any Node-specific tooling:

"/var/www/myserver/releases/current/logs/*.out"
"/var/www/myserver/releases/current/logs/*.err" {
    daily
    rotate 7
    compress
    delaycompress
    create 644 www-data www-data
    postrotate
        reload myserver >/dev/null 2>&1 || true
    endscript
}

The postrotate sends SIGHUP, which the application handles by reopening log files. Standard Unix pattern – no log library needed, no external log shipper.

Connection tuning

Node’s default HTTP agent caps the pool at 5 concurrent sockets per host. Under production load, outbound HTTP requests queue up behind this limit:

http.globalAgent.maxSockets = 500;

On the system side: raise ulimit -n for open file descriptors and net.core.somaxconn for the listen backlog. The application’s server.listen backlog should match:

server.listen(port, 'localhost', 511);

511 is Node’s default backlog, not the kernel’s. The kernel silently truncates whatever you pass to net.core.somaxconn, which commonly defaults to 128, so raise both together or the larger value has no effect.
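The system-side knobs, with illustrative values (assumptions to tune for your traffic, not recommendations):

```shell
# File descriptors: every connection is an fd. Make it permanent per-user in
# /etc/security/limits.conf, or with `limit nofile 65536 65536` in the Upstart job.
ulimit -n 65536

# Kernel cap on the accept backlog; server.listen's backlog argument
# is silently truncated to this value.
sysctl -w net.core.somaxconn=1024
```

After raising somaxconn, pass the same number as the backlog argument to server.listen.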

Redis with hiredis

If you’re using Redis for sessions or caching, install the hiredis native driver:

npm install redis hiredis

The Redis client detects hiredis automatically. No code changes. The performance difference is significant – without hiredis, Redis can become a bottleneck at moderate load. With it, the same hardware handles substantially more traffic.

Production debugging

An embedded REPL lets you inspect a running process without restarting it:

var repl = require('repl');
var net = require('net');

net.createServer(function(socket) {
    var r = repl.start({
        prompt: '[' + process.pid + '] > ',
        input: socket,
        output: socket,
        terminal: true,
        useGlobal: false
    });
    r.context.app = app;
    r.on('exit', function() { socket.end(); });
}).listen(1337, 'localhost');

Connect with telnet localhost 1337 and you have access to the running application’s state. Check memory with process.memoryUsage(), inspect caches, trace request handling. Bind to localhost only: anyone who can reach this port can execute arbitrary code inside your application.

Deployments

A modified version of TJ’s deploy.sh handles our deployment pipeline. It’s a shell script – no deployment framework, no runtime dependencies, no agent to install on servers. Git push, SSH in, symlink the release, USR2 the master. Under 200 lines of bash.

Two rules shape all of this: don’t reinvent what Unix already provides, and build redundancy at every level. Upstart restarts crashed processes. Clustering replaces dead workers. Heartbeat monitoring catches wedged processes. Logrotate manages disk space. None of these are Node-specific solutions – they’re Unix patterns applied to a Node application. The application code handles the application problem. The operating system handles the operations problem.


These patterns weren’t mine alone. They came from a runbook we built together on the team and kept iterating on.