
Compression Level is ignored. #142

Open
wilcoln opened this issue Mar 3, 2020 · 2 comments
wilcoln commented Mar 3, 2020

I want to compress a file that is already in HDFS, using different compression levels.
To do so, I wrote the following program:

Compress.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.compression.lzo.LzoCodec;

public class Compress {

  // Pass-through reducer: writes each line as the output key with an
  // empty value, so the job's only real effect is recompressing the text.
  public static class VoidReducer extends Reducer<LongWritable, Text, Text, Text> {

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values)
        context.write(value, new Text(""));
    }
  }

  public static void main(String[] args) throws Exception {

    Configuration conf = new Configuration();
    int level = Integer.parseInt(args[2]);
    conf.setInt("io.compression.codec.lzo.compression.level", level);

    Job job = Job.getInstance(conf);
    job.setJobName("Compressor Job");
    job.setJarByClass(Compress.class);
    job.setMapperClass(Mapper.class); // identity mapper
    job.setReducerClass(VoidReducer.class);
    job.setNumReduceTasks(1);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    TextInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

    // Submit and wait for completion.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Then I compile and run the following commands:

$ javac -classpath $(hadoop classpath) *.java
$ jar -cvf Compress.jar Compress*.class
$ hadoop jar Compress.jar Compress file.txt test1 1
$ hadoop jar Compress.jar Compress file.txt test7 7

The file file.txt is 1 GB in size. When I then check the sizes of test1 and test7 with
hdfs dfs -du -s -h, I get 594.6 M for each.
This shows that the compression level is ignored.
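
For reference, a standalone check that takes MapReduce out of the picture can pin this down faster. The sketch below (a hypothetical LevelCheck harness, not part of the job above) reuses the same configuration key and assumes hadoop-lzo and the native LZO library are available on the local classpath:

import java.io.ByteArrayOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import com.hadoop.compression.lzo.LzoCodec;

public class LevelCheck {

  // Compress the given bytes at the given level and return the output size.
  static long compressedSize(byte[] data, int level) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("io.compression.codec.lzo.compression.level", level);
    // ReflectionUtils hands the Configuration to the codec (LzoCodec is Configurable).
    LzoCodec codec = ReflectionUtils.newInstance(LzoCodec.class, conf);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    CompressionOutputStream cos = codec.createOutputStream(out);
    cos.write(data);
    cos.close(); // finishes the compressed stream
    return out.size();
  }

  public static void main(String[] args) throws Exception {
    // args[0]: a local copy of (part of) the input file
    byte[] data = Files.readAllBytes(Paths.get(args[0]));
    System.out.println("level 1: " + compressedSize(data, 1) + " bytes");
    System.out.println("level 7: " + compressedSize(data, 7) + " bytes");
  }
}

If the two sizes printed are identical here as well, the level is being dropped inside the codec itself rather than somewhere in the MapReduce job configuration.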

@wilcoln changed the title from "Comression Level is ignored." to "Compression Level is ignored." on Mar 3, 2020
toddlipcon (Contributor) commented:

Your code looks fine at first glance. I'm not actively maintaining this project anymore -- it's largely in maintenance mode as most people have moved on to using better file formats like Parquet along with LZ4 or Snappy. I'd suggest doing some debugging of your own -- rebuild hadoop-lzo with logging at the point where the compressor is created and see if it's getting passed through properly, and follow the breadcrumbs from there.
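
One concrete hypothesis worth testing along these lines: the LZO library itself only supports compression levels in the lzo1x_999 family, so if hadoop-lzo defaults to the lzo1x_1 strategy there may simply be no level for it to apply. The sketch below assumes the strategy is selected with the io.compression.codec.lzo.compressor key; both key names should be verified against the hadoop-lzo sources:

import org.apache.hadoop.conf.Configuration;

public class StrategyCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // In liblzo, lzo1x_1 (believed to be hadoop-lzo's default strategy)
    // takes no level parameter; only the lzo1x_999 family compresses at
    // levels 1-9. Key names are assumptions to verify against the sources.
    conf.set("io.compression.codec.lzo.compressor", "LZO1X_999");
    conf.setInt("io.compression.codec.lzo.compression.level", 7);
    // ...then pass conf to Job.getInstance(conf) exactly as in Compress.java.
  }
}

If test7 shrinks noticeably once LZO1X_999 is set, the level was never ignored for lzo1x_999; it just does not exist for lzo1x_1.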

wilcoln (Author) commented Mar 7, 2020

Ok, thanks.
