Using the LZO compression codec in Hadoop

While working with Hadoop, the files we handle are often very large. It usually makes sense to compress such files before using them with Hive or Pig. Hadoop supports various compression formats, and each format has its own advantages and disadvantages.

Let us start with the different compression options available to us.

Name       Tool       Splittable
gzip       gzip       No
LZO        lzop       Yes (if indexed)
bzip2      bzip2      Yes
Snappy     N/A        No

Normally, you will want to choose a format that allows the file to be split, so that you can use the power of MapReduce to process it in parallel. Otherwise you will be forced to process the whole file with a single mapper.
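The "Splittable" column matters because of how decompression works. As a rough illustration (pure Python, with the standard library's gzip module standing in for Hadoop's gzip codec), plain text can be read starting from any byte offset, but a gzip stream cannot be decompressed from the middle:

```python
import gzip
import zlib

data = b"sample line\n" * 10000
compressed = gzip.compress(data)

# Plain text is splittable: a reader can start at any offset
# and still make sense of the bytes.
assert data[24:36] == b"sample line\n"

# A gzip stream is not: decompression must start at the stream
# header, so a split taken from the middle of the file fails.
# This is why a gzipped file forces a single mapper.
try:
    zlib.decompress(compressed[len(compressed) // 2:], wbits=31)
    mid_split_works = True
except zlib.error:
    mid_split_works = False

assert not mid_split_works
```

An indexed LZO file avoids this problem because the index records where each compressed block starts, so a mapper can begin decompressing at any block boundary.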

I normally prefer the LZO format. Data compressed with LZO is very fast to decompress, which makes it faster to read and process. However, this codec does not reduce the file size as much as bzip2 does.
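LZO itself is not in the Python standard library, but the size trade-off can be sketched with the codecs that are; here gzip and bz2 stand in for the formats in the table above (the exact numbers will differ for lzop):

```python
import bz2
import gzip

# Some moderately repetitive sample data, as log or CSV data often is.
raw = "".join("id,%d,value\n" % i for i in range(50000)).encode()

gz = gzip.compress(raw)
bz = bz2.compress(raw)

# bzip2 typically achieves a better ratio than gzip (and far better
# than LZO), at the cost of much slower compression and decompression.
print("raw:  ", len(raw))
print("gzip: ", len(gz))
print("bzip2:", len(bz))
```

The general pattern is: the better the compression ratio, the more CPU time you pay at read time, which is exactly why fast-to-decompress LZO is attractive for MapReduce inputs.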

How can you work on big files?

First of all, you need to have the LZO tools installed on your machine. Following are the commands for this installation.

  Installing on Mac
  sudo port install lzop lzo2

  Installing on Red Hat and CentOS based systems
  sudo yum install lzo-devel lzop

  Installing on Debian (Ubuntu, Lubuntu, etc.) based systems
  sudo apt-get install liblzo2-dev lzop


Now you need to download the hadoop-lzo jar file. You can download the jar file from here, or you can compile and create it yourself from the project site.

Put the jar file in the lib folder of your Hadoop installation. If you followed the Hadoop installation steps from this blog, then paste the jar file in the "/home/hduser/hadoop/lib" folder. If you are using a Cloudera installation, then most likely you should paste it in the "/usr/local/hadoop/lib" path.

To use this compression format with MapReduce, we also need to set some properties.
Copy the following properties into the "core-site.xml" file, between the configuration tags. Basically, we are telling Hadoop which compression codecs to use.

<property>
<name>io.compression.codecs</name>
<value>com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>

<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

To also compress the job output, we need to add the following properties.

<property>
<name>mapred.output.compress</name>
<value>true</value>
</property>

<property>
<name>mapred.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If you also want to compress the intermediate output of the MapReduce program, i.e. the output of the mapper tasks, then set the following properties as well.

<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>

<property>
<name>mapred.map.output.compression.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>

After all the configuration is done, you can use the LZO compressor.

Suppose I have a text file that I want to compress and load into a Hive table. I can do the following to achieve this.

Run the following command to compress the "input.txt" file and load it into the Hadoop file system.

          lzop -c "/home/hduser/input.txt" | hadoop fs -put - /user/hduser/input.lzo

This command compresses the file (the `-` tells `hadoop fs -put` to read from stdin) and puts it on the Hadoop file system at the "/user/hduser/input.lzo" path.

Our next step is to index this file using the Hadoop LZO indexer so that MapReduce can split it.

Run following command to achieve this.

hadoop jar /home/hduser/hadoop/lib/hadoop-lzo.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /user/hduser/input.lzo

This will create the "input.lzo.index" file in the "/user/hduser/" folder in HDFS.
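Conceptually, the index file is just a list of byte offsets at which LZO compressed blocks begin; MapReduce then aligns each input split to the next block boundary, since decompression can only start at a block start. A minimal sketch of that lookup (the offsets here are made up for illustration):

```python
import bisect

# Hypothetical block offsets as an LZO indexer might record them:
# the byte position where each compressed block starts.
block_offsets = [0, 256_000, 530_000, 790_000, 1_050_000]

def split_start(desired_offset):
    """Align a split boundary to the next LZO block start, since
    decompression can only begin at a block boundary."""
    i = bisect.bisect_left(block_offsets, desired_offset)
    return block_offsets[i] if i < len(block_offsets) else None

# A split that would begin at byte 300,000 actually starts reading
# at the next block boundary, 530,000.
assert split_start(0) == 0
assert split_start(300_000) == 530_000
```

Without the index, MapReduce has no way of knowing where the block boundaries are, which is why an unindexed .lzo file falls back to a single mapper.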

Now open the Hive shell and set the following properties.

SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;

Now we can create a table that uses the compressed data.

create table user_tb
(id int,
 name string
) row format delimited
fields terminated by ','
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Once the table is successfully created, we need to load the data into it with the following command.

load data inpath '/user/hduser/input*' into table user_tb;

After this step, you can use this table just like any other table in Hive.

Please feel free to leave any comments and feedback. Thanks... :)


