INSTALL

Detailed Installation Instructions
----------------------------------

BioPig is a framework for genomic data analysis using Apache Pig and Hadoop.  The software builds a
jar file that can then be used on hadoop clusters or through the pig query language.

---------------------------------------------------------
0. Requirements
---------------------------------------------------------

You will need java and maven to build the software.  To run, you need
a hadoop cluster (version 2.70 or later) and Pig (version 0.15.0 or later).

---------------------------------------------------------
1. add non-maven libraries to your local maven repository
---------------------------------------------------------

Maven is great, however not all libraries are always available from
maven repos.  The directory nonmavenlibs contains a set of jars that
aren't found in any repo.  you will need to add them to your local
maven repository (by default that's ~/.m2).  just run the following
script:

    % cd nonmavenlibs; ./installLibsInMaven.sh

this will generate appropriate maven metadata.

---------------------------------
2. build
---------------------------------

% mvn install  -Dmaven.test.skip=true

This produces a tar file in biopig/target/biopig-core-1.0.0-bin.tar.gz
that includes the apps, the configuration directory, and the library files.

dev$ tar -ztf biopig/target/biopig-core-1.0.0-bin.tar.gz
biopig-core-1.0.0/INSTALL
biopig-core-1.0.0/LICENSE
biopig-core-1.0.0/README
biopig-core-1.0.0/lib/biopig-job.jar
biopig-core-1.0.0/apps/
biopig-core-1.0.0/apps/barcode.pig
biopig-core-1.0.0/apps/contigKmer.pig
biopig-core-1.0.0/apps/dereplicate.pig
biopig-core-1.0.0/apps/kmerGenerator.pig
biopig-core-1.0.0/apps/kmerStats.pig
biopig-core-1.0.0/apps/n50.pig
biopig-core-1.0.0/apps/velvetAssembly.pig

---------------------------------
3. Installation
---------------------------------

To install, untar the built tar file in some location.

% tar -zxvf biopig/target/biopig-core-1.0.0-bin.tar.gz -C /usr/local/
==> produces a directory /usr/local/biopig-1.0.0/

add /usr/local/biopig-1.0.0/lib/biopig-job.jar to your PIG_CLASSPATH environment variable and
you are off.

---------------------------------
4. Running apps
---------------------------------
**Configuration

The file biopig/etc/BioPig.xml.magellan is read as the configuration file. You should modify it to refer to programs such as blastall that are accessible from hadoop. Also make sure that the temp folder you specify have global write and execute permissions (so that the hadoop user that runs your map-red steps can use it) -- chmod 777 tmp. You must also make sure that you have +x perms for all folders leading up to the temp folder. It is recommended that you create the tmp folder in a scratch area.

When you have made changes to the biopig jar file (e.g., when you change the configuration, or when you change the biopig Java code), you copy that file to 

cp biopig-core-1.0.0-job.jar biopig/lib/biopig-core-1.0.0-job.jar

** Running BioPig Scripts

There are sample scripts for running BioPig jobs
First set the parameters in the .pig file,
then you run the scripts like this (assuming biopig/ is in your $PATH): 

Run pig app in mapreduce mode
pig biopig/apps/kmerCount.pig

Run pig app in local mode
pig -x local biopig/apps/kmerCount.pig