Deduplicate diffs #36

Open
raphink opened this issue Jan 8, 2016 · 0 comments
raphink commented Jan 8, 2016

When running the catalog diff, it is very common for many resources to share the same diff across machines. This results in heavy JSON reports (I've had reports of up to 30MB) full of repeated diff content.

I suggest we could add deduplication to the JSON format, using a combination of size/hash as key (to be on the safish side). Something along the lines of:

{
  "diff_content": {
    "1234_abcd123def": [
      " class { 'mountain':",
      "+    ensure => nodragon",
      "  }"
    ]
  },
  "ererebor.middle-earth": {
    "differences_as_diff": {
      "class[mountain]": "1234_abcd123def"
    }
  }
}

I'm pretty sure this would greatly reduce the size of the JSON reports, while not greatly increasing the complexity of displaying them.
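To give an idea of what the display side would need, here is a minimal sketch of the lookup, assuming the report layout proposed above. `resolve_diffs` is a hypothetical helper name, not something that exists in the project, and the inline report is just the example from this issue:

```ruby
require 'json'

# Hypothetical helper (not part of the project): returns a
# { resource => diff } hash for one node, resolving each deduplicated
# key through the top-level 'diff_content' table.
def resolve_diffs(report, node)
  table = report.fetch('diff_content')
  report.fetch(node).fetch('differences_as_diff').transform_values do |key|
    table.fetch(key)
  end
end

# Sample report in the proposed format (same data as the example above)
report = JSON.parse(<<~JSON)
  {
    "diff_content": {
      "1234_abcd123def": [" class { 'mountain':", "+    ensure => nodragon", "  }"]
    },
    "ererebor.middle-earth": {
      "differences_as_diff": { "class[mountain]": "1234_abcd123def" }
    }
  }
JSON

resolve_diffs(report, 'ererebor.middle-earth').each do |resource, lines|
  puts resource
  puts lines
end
```

A viewer would only need this one extra lookup per resource, which is why I don't expect the display code to get much more complex.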

Here's a little experiment on existing reports:

#!/usr/bin/ruby
#
# Pass the JSON report as argument

require 'json'
require 'digest'

file = File.read(ARGV[0])
old_size = file.size
json = JSON.parse(file)
reserved = ['date', 'max_diff', 'most_changed', 'most_differences', 'total_nodes', 'total_percentage', 'with_changes', 'pull_output', 'fact_search']
machines = json.keys - reserved

json['diff_content'] = {}

machines.each do |m|
  json[m]['differences_as_diff'].each do |k, v|
    # A diff is either an array of lines or a single string
    size = v.size
    diff = v.is_a?(Array) ? v.join("\n") : v
    hash = Digest::MD5.hexdigest(diff)
    # Size plus hash as key, to limit the impact of hash collisions
    key = "#{size}_#{hash}"
    json['diff_content'][key] = v
    json[m]['differences_as_diff'][k] = key
  end
end

file = json.to_json
new_size = file.size
percentage_gained = (old_size - new_size)*100/old_size
puts "Gained #{percentage_gained}%"

I've tried it against a few reports I have here. With the demo1.json file I use for the catalog-diff-viewer demo, I get an 8% gain. On production reports, I've gained between 1% and 46% in size.

The gain can be improved by using an indexed array instead of pseudo-unique keys:

#!/usr/bin/ruby
#
# Pass the JSON report as argument

require 'json'
require 'digest'

file = File.read(ARGV[0])
old_size = file.size
json = JSON.parse(file)
reserved = ['date', 'max_diff', 'most_changed', 'most_differences', 'total_nodes', 'total_percentage', 'with_changes', 'pull_output', 'fact_search']
machines = json.keys - reserved

json['diff_content'] = []
diff_content_index = []

machines.each do |m|
  json[m]['differences_as_diff'].each do |k, v|
    size = v.size
    diff = v.is_a?(Array) ? v.join("\n") : v
    hash = Digest::MD5.hexdigest(diff)
    key = "#{size}_#{hash}"

    # Store each distinct diff once; its position in the array is its ID
    index = diff_content_index.index(key)
    if index.nil?
      json['diff_content'] << v
      diff_content_index << key
      index = diff_content_index.size - 1
    end
    json[m]['differences_as_diff'][k] = index
  end
end

file = json.to_json
new_size = file.size
percentage_gained = (old_size - new_size)*100/old_size
puts "Gained #{percentage_gained}%"

With this method, the demo1.json file gets a 12% gain, and my production tests give me up to 47% gain. The diff content can still be easily accessed by index:

irb(main):012:0> json['gandalf01']['differences_as_diff']['class[Puppet]']
=> 0
irb(main):013:0> json['diff_content'][0]
=> [" \t                                   stringify_facts =>                               false,", " \t                                  }", " \t     agent_noop => false", "+\t     agent_restart_command => \"/usr/sbin/service puppet reload\"", " \t     agent_template => \"puppet/agent/puppet.conf.erb\"", " \t     allow_any_crl_auth => false", " \t     auth_allowed => ["]
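For consumers that still expect the current format, the indexed report could be expanded back. The following is only a sketch under the assumptions of the scripts above: `expand_report` is a hypothetical name, the reserved key list is copied from those scripts (plus `diff_content`), and the sample data is made up for illustration:

```ruby
require 'json'

# Non-node keys, as in the scripts above, plus the new 'diff_content' table
RESERVED = %w[date max_diff most_changed most_differences total_nodes
              total_percentage with_changes pull_output fact_search
              diff_content].freeze

# Hypothetical helper: returns a copy of the report in which every
# integer index is replaced by the diff stored at that position in
# 'diff_content'. The input report is left untouched.
def expand_report(report)
  table = report.fetch('diff_content')
  expanded = report.reject { |key, _| key == 'diff_content' }
  (expanded.keys - RESERVED).each do |node|
    diffs = expanded[node]['differences_as_diff']
    expanded[node] = expanded[node].merge(
      'differences_as_diff' => diffs.transform_values { |idx| table.fetch(idx) }
    )
  end
  expanded
end

# Made-up sample: two nodes sharing the same diff
indexed = {
  'date' => 'sample',
  'diff_content' => [[' agent_noop => false', '+agent_restart_command => "x"']],
  'gandalf01' => { 'differences_as_diff' => { 'class[Puppet]' => 0 } },
  'gandalf02' => { 'differences_as_diff' => { 'class[Puppet]' => 0 } }
}

puts JSON.pretty_generate(expand_report(indexed))
```

Comparing the expanded report with the original one would also be a cheap way to check that the deduplication is lossless.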
natemccurdy pushed a commit to natemccurdy/puppet-catalog-diff that referenced this issue Jun 21, 2021