Tools, tips and techniques for diagnosing and fixing Ruby memory leaks
At some point in the life of every Rails developer you are bound to hit a memory leak. It may be a tiny amount of constant memory growth, or a spurt of growth that hits your job queue when certain jobs run.
Sadly, most Ruby devs out there simply employ monit, inspeqtor or unicorn worker killers. This allows you to move along and do more important work, tidily sweeping the problem under the carpet.
Unfortunately, this approach leads to quite a few bad side effects. Besides performance pains, instability and larger memory requirements, it also leaves the community with a general lack of confidence in Ruby. Monitoring and restarting processes is a very important tool for your arsenal, but at best it is a stopgap and safeguard. It is not a solution.
We have some amazing tools out there for attacking and resolving memory leaks, especially the easy ones - managed memory leaks.
The first and most important step for dealing with memory problems is graphing memory over time. At Discourse we use a combo of Graphite, statsd and Grafana to graph application metrics.
A while back I packaged up a Docker image for this work which is pretty close to what we are running today. If rolling your own is not your thing you could look at New Relic, Datadog or any other cloud-based metric provider. The key metric you first need to track is RSS for your key Ruby processes. At Discourse we look at max RSS for Unicorn (our web server) and Sidekiq (our job queue).
Discourse is deployed on multiple machines in multiple Docker containers. We use a custom-built Docker container to watch all the other Docker containers. This container is launched with access to the Docker socket so it can interrogate Docker about Docker. It uses docker exec to get all sorts of information about the processes running inside the container.
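If you are rolling your own, a rough sketch of that kind of watcher might look like the following. The container name, process pattern, statsd host and metric name here are all made up for illustration; this is not the actual Discourse watcher.

require 'socket'

STATSD_HOST = "statsd.internal"   # hypothetical statsd endpoint
STATSD_PORT = 8125
CONTAINER   = "web"               # hypothetical container name

socket = UDPSocket.new

loop do
  # ask Docker for the processes inside the container, grab the biggest unicorn RSS (in KB)
  out = `docker exec #{CONTAINER} ps -eo rss,args`
  max_rss = out.lines
               .select { |l| l =~ /unicorn/ }
               .map { |l| l.split.first.to_i }
               .max || 0

  # plain statsd gauge protocol: "name:value|g"
  socket.send("#{CONTAINER}.unicorn.max_rss:#{max_rss}|g", 0, STATSD_HOST, STATSD_PORT)
  sleep 30
end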
Note: Discourse uses the unicorn master process to launch multiple workers and job queues; it is impossible to achieve the same setup (which shares memory among forks) in a one-container-per-process world.
With this information at hand we can easily graph RSS for any container on any machine and look at trends:
Long term graphs are critical to all memory leak analysis. They allow us to see when an issue started. They allow us to see the rate of memory growth and shape of memory growth. Is it erratic? Is it correlated to a job running?
When dealing with memory leaks in c-extensions, having this information is critical. Isolating c-extension memory leaks often involves Valgrind and custom-compiled versions of Ruby that support debugging with Valgrind. It is tremendously hard work that we only want to deal with as a last resort. It is much simpler to isolate that a trend started after upgrading EventMachine to version 1.0.5.
Unlike unmanaged memory leaks, tackling managed leaks is very straightforward. The new tooling in Ruby 2.1 and up makes debugging these leaks a breeze.
Prior to Ruby 2.1 the best we could do was crawl our object space, grab a snapshot, wait a bit, grab a second snapshot and compare. I have a basic implementation of this shipped with Discourse in MemoryDiagnostics; it is rather tricky to get this to work right. You have to fork your process when gathering the snapshots so you do not interfere with your process, and the information you can glean is fairly basic. We can tell certain objects leaked, but cannot tell where they were allocated:
3377 objects have leaked
Summary:
String: 1703
Bignum: 1674

Sample Items:
Bignum: 40 bytes
Bignum: 40 bytes
String:
Bignum: 40 bytes
String:
Bignum: 40 bytes
String:
If we were lucky we would have leaked a number or string that is revealing and that would be enough to nail it down.
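To give a flavor of that pre-2.1 approach, here is a stripped-down sketch (not the actual MemoryDiagnostics code, which also forks so the sampling does not disturb the live process) that diffs object counts by internal type:

def object_counts
  GC.start
  ObjectSpace.count_objects.dup
end

before = object_counts
sleep 60   # wait while the suspected leak accumulates
after = object_counts

after.each do |type, count|
  delta = count - before.fetch(type, 0)
  puts "#{type}: +#{delta}" if delta > 0
end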
Additionally we have GC.stat that could tell us how many live objects we have and so on.
The information was very limited: we could tell quite clearly that we had a leak, but finding the reason could be very difficult.
Note: a very interesting metric to graph is GC.stat[:heap_live_slots]; with that information at hand we can easily tell if we have a managed object leak.
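A minimal way to feed that into your graphs is a background sampler thread; this is just a sketch, so swap the logging line for whatever metrics client you actually use.

# Hypothetical sampler: report the live slot count every 30 seconds.
Thread.new do
  loop do
    live = GC.stat[:heap_live_slots]
    # replace this with your metrics client, e.g. a statsd gauge
    Rails.logger.info("heap_live_slots: #{live}")
    sleep 30
  end
end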
Ruby 2.1 introduced heap dumping; if you also enable allocation tracing you have some tremendously interesting information.
The process for collecting a useful heap dump is quite simple:
Turn on allocation tracing
require 'objspace'
ObjectSpace.trace_object_allocations_start
This will slow down your process significantly and cause your process to consume more memory. However, it is key to collecting good information and can be turned off later. For my analysis I will usually run it directly after boot.
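One way to wire that up (a sketch assuming a Rails-style initializer and a made-up TRACE_ALLOCATIONS environment variable, not something Discourse ships) is to gate it at boot:

# config/initializers/allocation_tracing.rb (hypothetical)
if ENV["TRACE_ALLOCATIONS"]
  require 'objspace'
  ObjectSpace.trace_object_allocations_start
end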
When debugging my latest round of memory issues with Sidekiq at Discourse I deployed an extra Docker image to a spare server. This allowed me extreme freedom without impacting SLA.
Next, play the waiting game
After memory has clearly leaked (you can look at GC.stat or simply watch RSS to determine this has happened), run:
io=File.open("/tmp/my_dump", "w")
ObjectSpace.dump_all(output: io); io.close
To make this process work we need a way to run Ruby code inside a process that has already started. Luckily, the rbtrace gem allows us to do that (and much more). Moreover, it is safe to run in production.
We can easily force Sidekiq to dump its heap with:
bundle exec rbtrace -p $SIDEKIQ_PID -e 'Thread.new{GC.start;require "objspace";io=File.open("/tmp/ruby-heap.dump", "w"); ObjectSpace.dump_all(output: io); io.close}'
rbtrace runs in a restricted context; a nifty trick is breaking out of the trap context with Thread.new.
We can also pull other information out of the running process with rbtrace, for example:
bundle exec rbtrace -p 6744 -e 'GC.stat'
/usr/local/bin/ruby: warning: RUBY_HEAP_MIN_SLOTS is obsolete. Use RUBY_GC_HEAP_INIT_SLOTS instead.
*** attached to process 6744
>> GC.stat
=> {:count=>49, :heap_allocated_pages=>1960, :heap_sorted_length=>1960, :heap_allocatable_pages=>12, :heap_available_slots=>798894, :heap_live_slots=>591531, :heap_free_slots=>207363, :heap_final_slots=>0, :heap_marked_slots=>335775, :heap_swept_slots=>463124, :heap_eden_pages=>1948, :heap_tomb_pages=>12, :total_allocated_pages=>1960, :total_freed_pages=>0, :total_allocated_objects=>13783631, :total_freed_objects=>13192100, :malloc_increase_bytes=>32568600, :malloc_increase_bytes_limit=>33554432, :minor_gc_count=>41, :major_gc_count=>8, :remembered_wb_unprotected_objects=>12175, :remembered_wb_unprotected_objects_limit=>23418, :old_objects=>309750, :old_objects_limit=>618416, :oldmalloc_increase_bytes=>32783288, :oldmalloc_increase_bytes_limit=>44484250}
*** detached from process 6744
### Analyzing the heap dump
With a rich heap dump at hand we can start our analysis; a good first report to run is a count of objects per GC generation. When trace object allocation is enabled the runtime attaches rich information to every allocation: for each object allocated while tracing is on we get the file, line and method where it was allocated, along with its GC generation. The dump is in JSON format and can easily be parsed line by line, e.g.:
{"address":"0x7ffc567fbf98", "type":"STRING", "class":"0x7ffc565c4ea0", "frozen":true, "embedded":true, "fstring":true, "bytesize":18, "value":"ensure in dispatch", "file":"/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activesupport-4.1.9/lib/active_support/dependencies.rb", "line":247, "method":"require", "generation":7, "memsize":40, "flags":{"wb_protected":true, "old":true, "long_lived":true, "marked":true}}
A simple report that shows how many objects are left around from which GC generation is a great start for looking at memory leaks. It's a timeline of object leakage.
require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
      f.each_line do |line|
        data << JSON.parse(line)
      end
    end

    data.group_by { |row| row["generation"] }
        .sort { |a, b| a[0].to_i <=> b[0].to_i }
        .each do |k, v|
          puts "generation #{k} objects #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze
For example this is what I started with:
generation objects 334181
generation 7 objects 6629
generation 8 objects 38383
generation 9 objects 2220
generation 10 objects 208
generation 11 objects 110
generation 12 objects 489
generation 13 objects 505
generation 14 objects 1297
generation 15 objects 638
generation 16 objects 748
generation 17 objects 1023
generation 18 objects 805
generation 19 objects 407
generation 20 objects 126
generation 21 objects 1708
generation 22 objects 369
...
We expect a large number of objects to be retained after boot and sporadically when requiring new dependencies. However, we do not expect a consistent number of objects to be allocated and never cleaned up. So let's zoom into a particular generation:
require 'json'
class Analyzer
  def initialize(filename)
    @filename = filename
  end

  def analyze
    data = []
    File.open(@filename) do |f|
      f.each_line do |line|
        parsed = JSON.parse(line)
        data << parsed if parsed["generation"] == 18
      end
    end

    data.group_by { |row| "#{row["file"]}:#{row["line"]}" }
        .sort { |a, b| b[1].count <=> a[1].count }
        .each do |k, v|
          puts "#{k} * #{v.count}"
        end
  end
end

Analyzer.new(ARGV[0]).analyze
generation 19 objects 407
/usr/local/lib/ruby/2.2.0/weakref.rb:87 * 144
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21 * 72
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42 * 72
/var/www/discourse/lib/freedom_patches/translate_accelerator.rb:65 * 15
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/i18n-0.7.0/lib/i18n/interpolate/ruby.rb:21 * 15
/var/www/discourse/lib/email/message_builder.rb:85 * 9
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/actionview-4.1.9/lib/action_view/template.rb:297 * 6
/var/www/discourse/lib/email/message_builder.rb:36 * 6
/var/www/discourse/lib/email/message_builder.rb:89 * 6
/var/www/discourse/lib/email/message_builder.rb:46 * 6
/var/www/discourse/lib/email/message_builder.rb:66 * 6
/var/www/discourse/vendor/bundle/ruby/2.2.0/gems/activerecord-4.1.9/lib/active_record/connection_adapters/postgresql_adapter.rb:515 * 5
Furthermore, we can chase down the reference path to see who is holding the references to the various objects and rebuild object graphs.
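Heap dump lines can also carry a "references" array of addresses, so a quick-and-dirty way to see who points at a given object (a hypothetical helper, not a polished tool) is just another pass over the file:

require 'json'

# Usage (hypothetical): ruby find_refs.rb /tmp/ruby-heap.dump 0x7ffc567fbf98
dump_file, target = ARGV

File.open(dump_file) do |f|
  f.each_line do |line|
    parsed = JSON.parse(line)
    refs = parsed["references"]
    if refs && refs.include?(target)
      puts "#{parsed['address']} #{parsed['type']} #{parsed['file']}:#{parsed['line']}"
    end
  end
end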
The first thing I attacked in this particular case was code I wrote, which is a monkey patch to Rails localization.
At Discourse we patch the Rails localization subsystem for a couple of reasons, a big one being performance. Consider the following bench:
ENV['RAILS_ENV'] = 'production'
require 'benchmark/ips'
require File.expand_path("../../config/environment", __FILE__)

Benchmark.ips do |b|
  b.report do |times|
    i = -1
    I18n.t('posts') while (i += 1) < times
  end
end
Before monkey patch:
sam@ubuntu discourse % ruby bench.rb
Calculating ------------------------------------- 4.518k i/100ms
------------------------------------------------- 121.230k (±11.0%) i/s - 600.894k
After monkey patch:
sam@ubuntu discourse % ruby bench.rb
Calculating ------------------------------------- 22.509k i/100ms
------------------------------------------------- 464.295k (±10.4%) i/s - 2.296M
So our localization system is running almost 4 times faster, but … it is leaking memory.
Reviewing the code, I could see the offending line: we were sending in a hash from the email message builder that included ActiveRecord objects, and that hash was later used as a key for the translation cache, which allowed up to 2000 entries. Considering that each entry could involve a large number of AR objects, memory leakage was very high.
To mitigate, I changed the keying strategy, shrunk the cache and completely bypassed it for complex localizations.
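As a rough illustration of the keying idea (not the actual patch in translate_accelerator.rb), the cache key can be built from primitive values only, with anything more complex skipping the cache entirely:

# Illustrative only: only cache lookups whose interpolation arguments are
# simple scalars, so the cache never retains ActiveRecord objects.
SIMPLE_TYPES = [String, Symbol, Numeric, NilClass, TrueClass, FalseClass]

def cache_key_for(locale, key, opts)
  return nil unless opts.values.all? { |v| SIMPLE_TYPES.any? { |t| v.is_a?(t) } }
  "#{locale}:#{key}:#{opts.sort_by { |k, _| k.to_s }.inspect}"
end

A nil key here means "do not cache this lookup", which covers the bypass for complex localizations.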
One day later, looking at the memory graphs, we could easily see the impact of this change.
This clearly did not stop the leak but it definitely slowed it down.
### therubyracer is leaking
At the top of our list we can see that our JavaScript engine, therubyracer, is leaking lots of objects; in particular, the weak references it uses to maintain Ruby-to-JavaScript mappings are being kept around for way too long.
To maintain performance at Discourse we keep a JavaScript engine context around for turning Markdown into HTML. This engine is rather expensive to boot up, so we keep it in memory plugging in new variables as we bake posts.
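With therubyracer that pattern boils down to building one V8::Context up front, then assigning variables and calling eval on it per render. A simplified sketch follows; the JavaScript function here is a stand-in for the real Markdown pipeline, not Discourse's PrettyText code.

require 'v8'

# Booting the context (loading the JS engine and libraries) is the expensive
# part, so do it once and keep it around.
CONTEXT = V8::Context.new
CONTEXT.eval("function cook(raw) { return raw.toUpperCase(); }")

def cook(raw)
  CONTEXT["raw"] = raw       # plug in new variables per post
  CONTEXT.eval("cook(raw)")
end

puts cook("hello world")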
Since our code is fairly isolated, a repro is trivial. First, let's see how many objects we are leaking using the memory_profiler gem:
ENV['RAILS_ENV'] = 'production'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

# warmup
3.times { PrettyText.cook("hello world") }

MemoryProfiler.report do
  50.times { PrettyText.cook("hello world") }
end.pretty_print
At the top of our report we see:
retained objects by location
-----------------------------------
901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/2.1.0/weakref.rb:87
901  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:21
600  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/weak.rb:42
250  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/context.rb:97
50  /home/sam/.rbenv/versions/2.1.2.discourse/lib/ruby/gems/2.1.0/gems/therubyracer-0.12.1/lib/v8/object.rb:8
So we are leaking 54 or so objects each time we cook a post; this adds up really fast. We may also be leaking unmanaged memory here, which could compound the problem.
Since we have a line number it is easy to track down the source of the leak:
require 'weakref'
class Ref
  def initialize(object)
    @ref = ::WeakRef.new(object)
  end

  def object
    @ref.__getobj__
  rescue ::WeakRef::RefError
    nil
  end
end

class WeakValueMap
  def initialize
    @values = {}
  end

  def [](key)
    if ref = @values[key]
      ref.object
    end
  end

  def []=(key, value)
    @values[key] = V8::Weak::Ref.new(value)
  end
end
This WeakValueMap object keeps growing forever and nothing seems to be clearing values from it. The intention of the WeakRef usage was to ensure these objects can be released when no one holds references to them anymore. Trouble is, the reference to the wrapper itself is now kept around for the entire lifetime of the JavaScript context.
A fix is pretty straightforward:
class WeakValueMap
  def initialize
    @values = {}
  end

  def [](key)
    if ref = @values[key]
      ref.object
    end
  end

  def []=(key, value)
    ref = V8::Weak::Ref.new(value)
    ObjectSpace.define_finalizer(value, self.class.ensure_cleanup(@values, key, ref))
    @values[key] = ref
  end

  def self.ensure_cleanup(values, key, ref)
    proc {
      values.delete(key) if values[key] == ref
    }
  end
end
We define a finalizer on the wrapped object that ensures we clean up all these wrapping objects and keep our WeakValueMap small. Note that the finalizer proc is built in a class method so its closure does not capture the wrapped object itself; if it did, the object could never be collected and the finalizer would never run.
The effect is staggering:
ENV['RAILS_ENV'] = 'production'
require 'objspace'
require 'memory_profiler'
require File.expand_path("../../config/environment", __FILE__)

def rss
  `ps -eo pid,rss | grep #{Process.pid} | awk '{print $2}'`.to_i
end

PrettyText.cook("hello world")

# MemoryProfiler has a helper that runs the GC multiple times to make
# sure all objects that can be freed are freed.
MemoryProfiler::Helpers.full_gc

puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"

20.times do
  5000.times { |i| PrettyText.cook("hello world") }
  MemoryProfiler::Helpers.full_gc
  puts "rss: #{rss} live objects #{GC.stat[:heap_live_slots]}"
end
Before
rss: 137660 live objects 306775
rss: 259888 live objects 570055
rss: 301944 live objects 798467
rss: 332612 live objects 1052167
rss: 349328 live objects 1268447
rss: 411184 live objects 1494003
rss: 454588 live objects 1734071
rss: 451648 live objects 1976027
rss: 467364 live objects 2197295
rss: 536948 live objects 2448667
rss: 600696 live objects 2677959
rss: 613720 live objects 2891671
rss: 622716 live objects 3140339
rss: 624032 live objects 3368979
rss: 640780 live objects 3596883
rss: 641928 live objects 3820451
rss: 640112 live objects 4035747
rss: 722712 live objects 4278779
/home/sam/Source/discourse/lib/pretty_text.rb:185:in `block in markdown': Script Timed Out (PrettyText::JavaScriptError)
	from /home/sam/Source/discourse/lib/pretty_text.rb:350:in `block in protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `synchronize'
	from /home/sam/Source/discourse/lib/pretty_text.rb:348:in `protect'
	from /home/sam/Source/discourse/lib/pretty_text.rb:161:in `markdown'
	from /home/sam/Source/discourse/lib/pretty_text.rb:218:in `cook'
	from tmp/mem_leak.rb:30:in `block (2 levels) in <main>'
	from tmp/mem_leak.rb:29:in `times'
	from tmp/mem_leak.rb:29:in `block in <main>'
	from tmp/mem_leak.rb:27:in `times'
	from tmp/mem_leak.rb:27:in `<main>'
After
rss: 137556 live objects 306646
rss: 259576 live objects 314866
rss: 261052 live objects 336258
rss: 268052 live objects 333226
rss: 269516 live objects 327710
rss: 270436 live objects 338718
rss: 269828 live objects 329114
rss: 269064 live objects 325514
rss: 271112 live objects 337218
rss: 271224 live objects 327934
rss: 273624 live objects 343234
rss: 271752 live objects 333038
rss: 270212 live objects 329618
rss: 272004 live objects 340978
rss: 270160 live objects 333350
rss: 271084 live objects 319266
rss: 272012 live objects 339874
rss: 271564 live objects 331226
rss: 270544 live objects 322366
rss: 268480 live objects 333990
rss: 271676 live objects 330654
Looks like both RSS and the live object count stabilize after the fix.
The current tooling offered in Ruby gives us spectacular visibility into the Ruby runtime. The tooling around this new instrumentation is improving but is still rather rough in spots.
As a .NET developer in a previous lifetime I really missed having an excellent memory profiler tool; luckily we now have all the information needed to build tools like this going forward.
Good luck hunting memory leaks, I hope this information helps you and please think twice next time you deploy a “unicorn OOM killer”.
A huge thank you goes to Koichi Sasada and Aman Gupta for building us all this new memory profiling infrastructure.
PS: another excellent resource worth reading is Oleg Dashevskii’s “How I spent two weeks hunting a memory leak in Ruby”