Optimizing Large CSV Processing in Ruby

I work on the data-migration team at Teamtailor. Our day-to-day job is to provide our customers with fast, secure, and stable migration of their data. The service has to be high quality because data is extremely valuable in the HR industry, and it also has to be as fast as possible because in HR, time matters just as much as the data itself.

As a team, we use tools that we've built ourselves, and we try to do our work while seeing and touching the data as little as possible. However, our customers' data can sometimes be very large and complex, and since the applications we've written are internal company tools, running into performance issues is fairly common.

We perform data transfer operations by reading CSV files. Some CSV files can be related to one or more other CSV files. For example, candidate-related information such as candidate_comments.csv or candidate_info.csv cannot always live directly inside candidates.csv. You have to establish the relationships between the files while reading them, and with large files this can become a complete nightmare.
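
As a simplified illustration (using the same columns as the sample files generated in the benchmark scripts later in this post), the two files are linked through a candidate_id column:

candidates.csv
id,name,email
0,Candidate 0,candidate0@example.com
1,Candidate 1,candidate1@example.com

candidates_comments.csv
comment_id,candidate_id,comment
0,1,Comment 0
1,0,Comment 1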

The naive approach, the one most Ruby developers would reach for first, looks like this:

require 'csv'

candidates = CSV.read("candidates.csv", headers: true)
comments = CSV.read("candidates_comments.csv", headers: true)

comments.each do |comment|
  # For every comment, scan the whole candidates table from the top
  candidate = candidates.find { |c| c["id"] == comment["candidate_id"] }
  puts candidate["name"]
end

While the snippet above works fine for small datasets, with large datasets it turns into a seemingly endless run and can even exhaust memory and kill the script. The reason is that the loop's complexity is O(m*n): for every comment, find scans the candidates table row by row from the beginning. With 100,000 candidates and 50,000 comments, that is up to five billion row comparisons. A better approach is to index the candidates by id once, then look each comment's candidate up in that index:

# Build the lookup table once: candidate id => candidate row
candidates_by_id = {}
candidates.each do |candidate|
  candidates_by_id[candidate["id"]] = candidate
end

comments.each do |comment|
  # Constant-time hash lookup instead of scanning the whole table
  candidate = candidates_by_id[comment["candidate_id"]]
  puts candidate["name"]
end

This snippet is a good solution as far as speed goes; it's harder to say the same about memory, and there are certainly further improvements possible. What it does is build a hash that maps each candidate id to its row, then look up each comment's candidate in that hash. Each lookup is O(1), so the whole pass becomes O(m+n): instead of scanning the file row by row for every comment, we jump straight to the entry for that candidate_id.

In short:

  • find/select = searches through the entire array on every call (very slow with large files)
  • Hash = looks up the key directly (very fast); a compact way to build that index is sketched below
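
As a side note, Ruby's Enumerable lets you build that index in a single expression. The snippet below is just a minimal sketch of the same idea, not code from our internal tooling:

require 'csv'

candidates = CSV.read("candidates.csv", headers: true)

# Build the id => row index in a single pass over the table (Ruby 2.6+)
candidates_by_id = candidates.to_h { |c| [c["id"], c] }

# Equivalent formulation with each_with_object:
# candidates_by_id = candidates.each_with_object({}) { |c, index| index[c["id"]] = c }

One thing to keep in mind: CSV gives you string values, so the keys here are strings, and comment["candidate_id"] is a string as well; as long as both sides stay strings, the lookups match.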

The benchmark script below compares the two approaches, followed by the timing results.

require 'csv'
require 'benchmark'

# Create sample CSV files
CSV.open("candidates.csv", "w") do |csv|
  csv << ["id", "name", "email"]
  1000.times { |i| csv << [i, "Candidate #{i}", "candidate#{i}@example.com"] }
end

CSV.open("candidates_comments.csv", "w") do |csv|
  csv << ["comment_id", "candidate_id", "comment"]
  500.times { |i| csv << [i, rand(0...1000), "Comment #{i}"] }
end

candidates = CSV.read("candidates.csv", headers: true)
comments = CSV.read("candidates_comments.csv", headers: true)

Benchmark.bm do |x|
  x.report("find method:") do
    comments.each do |comment|
      candidates.find { |c| c["id"] == comment["candidate_id"] }
    end
  end

  x.report("hash lookup:") do
    candidates_by_id = {}
    candidates.each { |c| candidates_by_id[c["id"]] = c }
    comments.each { |comment| candidates_by_id[comment["candidate_id"]] }
  end
end

File.delete("candidates.csv")
File.delete("candidates_comments.csv")

                  user     system      total        real
find method:  0.137130   0.000331   0.137461 (  0.137459)
hash lookup:  0.000616   0.000006   0.000622 (  0.000621)

We also need to look at memory. Time matters, but memory matters just as much. The script below measures memory usage for both approaches, followed by its output.

require 'csv'
require 'benchmark'

# Create sample CSV files
CSV.open("candidates.csv", "w") do |csv|
  csv << ["id", "name", "email"]
  1000.times { |i| csv << [i, "Candidate #{i}", "candidate#{i}@example.com"] }
end

CSV.open("candidates_comments.csv", "w") do |csv|
  csv << ["comment_id", "candidate_id", "comment"]
  500.times { |i| csv << [i, rand(0...1000), "Comment #{i}"] }
end

candidates = CSV.read("candidates.csv", headers: true)
comments = CSV.read("candidates_comments.csv", headers: true)

puts "=" * 60
puts "METHOD 1: find() inside loop"
puts "=" * 60

GC.start
mem_before = `ps -o rss= -p #{Process.pid}`.to_i / 1024.0

time1 = Benchmark.measure do
  comments.each do |comment|
    candidates.find { |c| c["id"] == comment["candidate_id"] }
  end
end

GC.start
mem_after = `ps -o rss= -p #{Process.pid}`.to_i / 1024.0

puts "Time:   #{time1.real.round(3)}s"
puts "Memory: #{mem_before.round(2)}MB → #{mem_after.round(2)}MB (#{(mem_after - mem_before).round(2)}MB used)"

puts "\n" + "=" * 60
puts "METHOD 2: Hash lookup"
puts "=" * 60

GC.start
mem_before = `ps -o rss= -p #{Process.pid}`.to_i / 1024.0

time2 = Benchmark.measure do
  candidates_by_id = {}
  candidates.each { |c| candidates_by_id[c["id"]] = c }
  comments.each { |comment| candidates_by_id[comment["candidate_id"]] }
end

GC.start
mem_after = `ps -o rss= -p #{Process.pid}`.to_i / 1024.0

puts "Time:   #{time2.real.round(3)}s"
puts "Memory: #{mem_before.round(2)}MB → #{mem_after.round(2)}MB (#{(mem_after - mem_before).round(2)}MB used)"

puts "\n" + "=" * 60
puts "SUMMARY"
puts "=" * 60
puts "Speedup: #{(time1.real / time2.real).round(1)}x faster"

File.delete("candidates.csv")
File.delete("candidates_comments.csv")

============================================================
METHOD 1: find() inside loop
============================================================
Time:   1.247s
Memory: 45.32MB → 46.15MB (0.83MB used)

============================================================
METHOD 2: Hash lookup
============================================================
Time:   0.045s
Memory: 46.15MB → 47.82MB (1.67MB used)

============================================================
SUMMARY
============================================================
Speedup: 27.7x faster
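
Both methods above still load the entire files into memory with CSV.read. For files that don't fit comfortably in memory, a streaming variant is possible; the sketch below is only an illustration (it isn't part of the benchmark above) that keeps a small lookup hash and reads the comments file row by row:

require 'csv'

# Index only the column we actually need from candidates.csv
names_by_id = {}
CSV.foreach("candidates.csv", headers: true) do |candidate|
  names_by_id[candidate["id"]] = candidate["name"]
end

# Stream the comments instead of loading the whole file with CSV.read
CSV.foreach("candidates_comments.csv", headers: true) do |comment|
  name = names_by_id[comment["candidate_id"]]
  puts name if name
end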

In summary, the choice between the Hash and find approaches depends on the dataset you have and the machine you're working on. Since the data we work with is large and our machines are reasonably powerful, the Hash method saves us far more time, and because the data doesn't change while we process it, we prefer the Hash method. find or select is what comes to mind first, but what comes to mind first isn't always the best.

So, have you run into any performance challenges when reading data and searching through it?