Apache Pig
Introduction to Apache Pig
Apache Pig is a platform built on top of the Apache Hadoop ecosystem to process large and varied datasets. It was originally developed at Yahoo and was later converted into an open source Apache project. Apache Pig analyzes large amounts of data by expressing dataset transformations as data flows. In the MapReduce framework, programs must be translated into a series of Map and Reduce stages. Since this is not a programming model most data analysts are familiar with, it is difficult for them to get accustomed to it. Apache Pig was built on top of Hadoop to fill this gap.
Pig Latin and Runtime environment
Pig Latin is a scripting language which can be used to perform ETL operations (Extract, Transform and Load) as well as raw data analysis. Much like SQL, Pig Latin is used to load data, apply constraints and filters, and dump the data in the required structure.
A Java runtime environment (JRE) is required to run Pig programs. Pig converts all the operations into Map and Reduce tasks which can be efficiently processed on Hadoop. This allows the programmer to focus on the whole operation rather than on the individual mapper and reducer functions. The main reason behind the development of Pig was to offer an easier way to program applications on MapReduce; previously, Java was the only programming language available to process and analyze datasets stored in HDFS.
Pig finds applications in data analysis wherever query-style operations are applied to a dataset: for example, finding all the rows where the variable income is greater than $50,000, or combining two different datasets on the basis of a key value. Pig can also be used when an algorithm has to be applied to a dataset iteratively. It is ideal for ETL operations and for managing data with an irregular schema, and it encourages expressing transformations as a sequential procedure. A sketch of the first two use cases is given below.
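As a minimal illustration of those first two use cases, the following Pig Latin sketch filters on income and joins two datasets on a key. The file names, field names, and types are hypothetical, chosen only for illustration.

people = LOAD 'people.csv' USING PigStorage(',') AS (name:chararray, income:int);
high_income = FILTER people BY income > 50000;   -- rows with income greater than $50,000
accounts = LOAD 'accounts.csv' USING PigStorage(',') AS (name:chararray, balance:double);
joined = JOIN high_income BY name, accounts BY name;   -- combine two datasets on a key
DUMP joined;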
Features of Apache Pig
Let’s go through the following features of Apache Pig:
- Pig provides a rich set of operators, including operations such as filter, join and sort.
- Pig Latin is easy to write and shares similarities with the SQL query language; being good at one of them helps with the other, which makes programs easy to write.
- Pig provides various optimization opportunities. The programmer can concentrate on the semantics of the language, while the tasks related to executing the programs are handled by Apache Pig itself.
- Apache Pig provides the ability to extend the system, i.e. extensibility. Using the existing operators, users can develop their own functions to read, process, and write data.
- User defined functions can also be developed in other programming languages such as Java, and these functions can be called and merged into Pig scripts (see the sketch after this list).
- A wide range of data, both structured and unstructured, can be handled and managed using Apache Pig, and the results are stored in HDFS.
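Extensibility in practice usually means registering a jar of Java UDFs and calling them from Pig Latin. The following is a minimal sketch; myudfs.jar, the com.example.pig.CleanText class, and data.txt are hypothetical names used only for illustration.

REGISTER myudfs.jar;                              -- hypothetical jar of user defined functions written in Java
DEFINE CLEAN_TEXT com.example.pig.CleanText();    -- hypothetical UDF class inside that jar
raw = LOAD 'data.txt' AS (line:chararray);
cleaned = FOREACH raw GENERATE CLEAN_TEXT(line);  -- apply the UDF to each record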
Apache Pig vs. MapReduce
Apache Pig is often preferred over writing MapReduce modules directly, which otherwise requires knowledge of a programming language such as Java or Python.
- Apache Pig differs from MapReduce in its approach and data flow: MapReduce is a low-level model, while Apache Pig is a high-level model for data processing.
- Pig Latin delivers the benefits of MapReduce without writing Java code or its implementations, which makes the concepts much easier for the user to grasp.
- Varied MapReduce tasks can be expressed in a single Pig Latin query. This shortens the code by a great extent and reduces the development period by a large degree.
- Data operations such as sorting, joins and filtering are a big task in MapReduce; the same functions can be performed easily using Apache Pig.
- A join operation is simple to execute in Apache Pig, whereas in MapReduce it requires creating and initiating multiple MapReduce tasks that must be executed sequentially to complete the desired functionality (see the sketch after this list).
- Nested data types such as maps, tuples and bags are provided as well, which are not present in MapReduce.
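To make the join comparison concrete, here is a minimal Pig Latin sketch in which a single JOIN statement replaces what would be a chain of hand-written MapReduce jobs. The file names and fields are hypothetical.

users = LOAD 'users.csv' USING PigStorage(',') AS (user_id:chararray, city:chararray);
ratings = LOAD 'ratings.csv' USING PigStorage(',') AS (user_id:chararray, stars:int);
joined = JOIN users BY user_id, ratings BY user_id;   -- one statement, one logical join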
Apache Pig Components
The following components form the foundation of the Apache Pig architecture and its working. Let’s go through these components one by one.
1. Parser
The Parser handles the script and performs checks such as syntax checking and type checking. Its output is a directed acyclic graph (DAG) that represents the Pig Latin statements and their logical operators.
In the DAG, the nodes represent the logical operators and the edges represent the flow of the data.
2. Optimizer
The logical plan represented by the DAG is passed to the logical optimizer, which carries out optimizations such as projection and pushdown.
3. Compiler
The optimized logical plan is given to the compiler, which converts it into a sequence of MapReduce jobs.
4. Execution engine
After the above stages, the MapReduce jobs are submitted to Hadoop in order and executed there to produce the required results. The short sketch below shows how these plans can be inspected.
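These stages can be observed directly. Once Pig is installed and a grunt shell is open (both are covered below), the EXPLAIN command prints the logical, physical, and MapReduce plans that the parser, optimizer, and compiler produce for a relation. A minimal sketch, assuming a yelp.csv file like the one used later in this article:

grunt> reviews = LOAD 'yelp.csv' USING PigStorage(',');
grunt> EXPLAIN reviews;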
Download and Install Apache Pig
Apache Pig can be downloaded using the following command on a Linux operating system.
$ wget http://mirror.symnds.com/software/Apache/pig/pig-0.12.0/pig-0.12.0.tar.gz
The downloaded package can be untarred using the following command.
$ tar xvzf pig-0.12.0.tar.gz
Set the required path to the Apache Pig directory structure using the command as follows.
export PATH=$PATH:/home/hduser/pig/bin
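Assuming the bin directory above matches your installation, the setup can be sanity-checked by asking Pig to print its version.

$ pig -version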
Executing Apache Pig Script
Yelp Dataset
We are going to use the Yelp reviews dataset, which contains reviews of different businesses. The data has been extracted from the Yelp website and stored in a file named yelp.csv. The yelp.csv file contains the variables user_id, business_id, date, stars, review_length, votes_funny, votes_useful, votes_cool, votes_total, pos_words, neg_words and net_sentiment.
Below are the sample rows extracted from the YELP reviews dataset.
Xqd0DzHaiyRqVH3WRG7hzg vcNAWiLM4dR7D2nwwJ7nCA 5/17/2007 5 94 0 2 1 3 4 1 3
H1kH6QZV7Le4zqTRNxoZow vcNAWiLM4dR7D2nwwJ7nCA 3/22/2010 2 114 0 2 0 2 3 7 -4
zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 2/14/2012 4 55 0 1 1 2 6 0 6
KBLW4wJA_fwoWmMhiHRVOA vcNAWiLM4dR7D2nwwJ7nCA 3/2/2012 4 97 0 0 0 0 3 0 3
zvJCcrpm2yOZrxKffwGQLA vcNAWiLM4dR7D2nwwJ7nCA 5/15/2012 4 53 0 2 1 3 1 2 -1
Qrs3EICADUKNFoUq2iHStA vcNAWiLM4dR7D2nwwJ7nCA 4/19/2013 1 212 0 0 0 0 4 8 -4
jE5xVugujSaskAoh2DRx3Q vcNAWiLM4dR7D2nwwJ7nCA 1/2/2014 5 62 0 0 0 0 6 0 6
QnhQ8G51XbUpVEyWY2Km-A vcNAWiLM4dR7D2nwwJ7nCA 1/8/2014 5 67 0 0 0 0 4 1 3
tAB7GJpUuaKF4W-3P0d95A vcNAWiLM4dR7D2nwwJ7nCA 8/1/2014 1 194 0 1 0 1 5 2 3
GP-h9colXgkT79BW7aDJeg vcNAWiLM4dR7D2nwwJ7nCA 12/12/2014 5 52 0 0 0 0 8 0 8
uK8tzraOp4M5u3uYrqIBXg UsFtqoBl7naz8AVUBZMjQQ 11/8/2013 5 75 0 0 0 0 12 0 12
I_47G-R2_egp7ME5u_ltew UsFtqoBl7naz8AVUBZMjQQ 3/29/2014 3 137 0 0 0 0 5 0 5
PP_xoMSYlGr2pb67BbqBdA UsFtqoBl7naz8AVUBZMjQQ 10/29/2014 2 61 0 0 0 0 10 0 10
JPPhyFE-UE453zA6K0TVgw 11/28/2014 4 63 1 1 1 3 7 2 5
fhNxoMwwTipzjO8A9LFe8Q cE27W9VPgO88Qxe4ol6y_g 8/19/2012 3 86 0 1 0 1 8 3 5
-6rEfobYjMxpUWLNxszaxQ cE27W9VPgO88Qxe4ol6y_g 4/18/2013 1 218 0 1 0 1 7 4 3
KZuaJtFindQM9x2ZoMBxcQ cE27W9VPgO88Qxe4ol6y_g 7/14/2013 1 108 0 0 0 0 3 1 2
H9E5VejGEsRhwcbOMFknmQ cE27W9VPgO88Qxe4ol6y_g 8/16/2013 4 186 0 0 0 0 7 0 7
ljwgUJowB69klaR8Au-H7g cE27W9VPgO88Qxe4ol6y_g 7/11/2014 4 74 0 0 0 0 3 1 2
JbAeIYc89Sk8SWmrBCJs9g HZdLhv6COCleJMo7nPl-RA 6/10/2013 5 121 3 7 7 17 6 2 4
l_szjd-ken3ma6oHDkTYXg HZdLhv6COCleJMo7nPl-RA 12/23/2013 2 50 1 1 1 3 4 1 3
zo_soThZw8eVglPbCRNC9A HZdLhv6COCleJMo7nPl-RA 9/4/2014 4 27 0 0 0 0 3 0 3
LWbYpcangjBMm4KPxZGOKg mVHrayjG3uZ_RLHkLj-AMg 12/1/2012 5 184 0 5 0 5 14 1 13
m1FpV3EAeggaAdfPx0hBRQ mVHrayjG3uZ_RLHkLj-AMg 3/15/2013 5 10 0 0 0 0 1 1 0
8fApIAMHn2MZJFUiCQto5Q mVHrayjG3uZ_RLHkLj-AMg 3/30/2013 5 228 0 2 1 3 17 6 11
uK8tzraOp4M5u3uYrqIBXg mVHrayjG3uZ_RLHkLj-AMg 10/20/2013 4 75 0 1 0 1 7 1 6
6wvlM5L4_EroGXbnb_92xQ mVHrayjG3uZ_RLHkLj-AMg 11/7/2013 5 37 0 0 0 0 6 1 5
345nDw0oC-jOcglqxmzweQ mVHrayjG3uZ_RLHkLj-AMg 3/22/2014 5 67 0 2 1 3 6 0 6
u9ULAsnYTdYH65Haj5LMSw mVHrayjG3uZ_RLHkLj-AMg 9/29/2014 4 24 0 0 0 0 2 1 1
This file has 1,569,264 rows of data across 12 columns.
Apache Pig can be used in the following modes:
- Local Mode
- Cluster Mode
Running Apache Pig in Local Mode:
We can run Apache Pig in local mode using the following command.
$ pig -x local

After executing the above command on the terminal, the below output is observed.

2017-08-03 17:00:23,258 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2017-08-03 17:00:23,259 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig/myscripts/pig_1388027786256.log
2017-08-03 17:00:23,281 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2017-08-03 17:00:23,381 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>
Running Apache Pig in Cluster Mode:
We can run Apache Pig in cluster mode using the following command.
$ pig

2017-08-03 17:37:23,274 [main] INFO org.apache.pig.Main - Apache Pig version 0.12.0 (r1529718) compiled Oct 07 2013, 12:20:14
2017-08-03 17:37:23,274 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hadoop/pig/myscripts/pig_1388027982272.log
2017-08-03 17:37:23,300 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2017-08-03 17:37:23,463 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:54310
2017-08-03 17:37:23,672 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: hdfs://localhost:9001
grunt>
The above command opens a grunt shell, in which Pig Latin statements can be executed interactively. This is handy for verifying or testing data flows without writing a complete script. Now let’s move ahead and test our data using Pig Latin.
Pig Latin
The data present in the dataset can be queried to test various Pig Latin implementations and procedures. The first step is to make data accessible in Pig.
The following command can be used to load the yelp reviews data into a variable using Pig Latin.
grunt> reviews = LOAD '/home/hadoop/pig/myscripts/yelp.csv' USING PigStorage(',') as (user_id,business_id,date,stars,review_length,votes_funny,votes_useful,votes_cool,votes_total,pos_words,neg_words,net_sentiment);
In the previous command, reviews is what Pig calls a relation or alias; it is not a variable. This statement does not trigger any MapReduce job. The PigStorage(',') clause is used because the records in the file are separated by commas.
The names of the fields present in the dataset can be given using the ‘as’ keyword, which assigns a name to every field or column present in the dataset. A typed variant is sketched below.
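The ‘as’ clause can also declare a type for each field, which avoids explicit casts in later statements. Below is a sketch of the same LOAD with types; the types are our assumption about the columns.

grunt> reviews = LOAD '/home/hadoop/pig/myscripts/yelp.csv' USING PigStorage(',') as (user_id:chararray, business_id:chararray, date:chararray, stars:int, review_length:int, votes_funny:int, votes_useful:int, votes_cool:int, votes_total:int, pos_words:int, neg_words:int, net_sentiment:int);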
Testing the loaded data
To test whether the data has been successfully loaded using the previous command, a DUMP command can be used.
grunt> DUMP reviews;

After executing the previous command, the terminal prints a large amount of text which forms the output of the DUMP command. Only partial output is shown below.

2017-03-08 17:40:08,550 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2017-03-08 17:40:08,633 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2017-03-08 17:40:08,748 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2017-03-08 17:40:08,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2017-03-08 17:40:08,805 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2017-03-08 17:40:08,853 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
................
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.2 0.12.0 hadoop 2013-12-25 23:03:04 2013-12-25 23:03:05 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0006 reviews MAP_ONLY file:/ptmp/ptemp-5323122347/tmp2718191010,
Input(s):
Successfully read records from: "/home/hadoop/pig/myscripts/yelp.csv"
Output(s):
Successfully stored records in: "file:/ptmp/ptemp-5323122347/tmp2718191010,"
Job DAG:
job_local_0006
................
(Xqd0DzHaiyRqVH3WRG7hzg ,vcNAWiLM4dR7D2nwwJ7nCA,17-05-2007,5,94,0,2)
(H1kH6QZV7Le4zqTRNxoZow ,vcNAWiLM4dR7D2nwwJ7nCA,22-03-2010,2,114,0,2)
(zvJCcrpm2yOZrxKffwGQLA,vcNAWiLM4dR7D2nwwJ7nCA,14-02-2012,4,55,0,1)
(KBLW4wJA_fwoWmMhiHRVOA,vcNAWiLM4dR7D2nwwJ7nCA,02-03-2012,4,97,0,0)
(zvJCcrpm2yOZrxKffwGQLA,vcNAWiLM4dR7D2nwwJ7nCA,15-05-2012,4,53,0,2)
(Qrs3EICADUKNFoUq2iHStA,vcNAWiLM4dR7D2nwwJ7nCA,19-04-2013,1,212,0,0)
(jE5xVugujSaskAoh2DRx3Q,vcNAWiLM4dR7D2nwwJ7nCA,02-01-2014,5,62,0,0)
(QnhQ8G51XbUpVEyWY2Km-A,vcNAWiLM4dR7D2nwwJ7nCA,08-01-2014,5,67,0,0)
(tAB7GJpUuaKF4W-3P0d95A,vcNAWiLM4dR7D2nwwJ7nCA,01-08-2014,1,194,0)
Once a DUMP statement is executed, a MapReduce job is started. From the previous output, it can be seen that the data has been successfully loaded into the reviews relation. Two lighter-weight checks are sketched below.
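For quicker checks that avoid dumping the whole dataset, the schema and a small generated sample can be inspected with the DESCRIBE and ILLUSTRATE commands (DESCRIBE only reports a schema when one was declared with ‘as’):

grunt> DESCRIBE reviews;
grunt> ILLUSTRATE reviews;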
Performing Queries
After loading the data, the desired queries can be performed on the dataset. For example, we can list the reviews whose net_sentiment value is less than 10.
grunt> netsentiment_less_than_ten = FILTER reviews BY (int)net_sentiment < 10;
grunt> DUMP netsentiment_less_than_ten;
The above statements filter the alias reviews and store the results in a new alias netsentiment_less_than_ten, which will hold only the review records where net_sentiment is less than 10.
The DUMP command only displays the information on the standard output. The data can be stored in a file using the following command.
grunt> STORE netsentiment_less_than_ten INTO '/user/hadoop/netsentiment_less_than_ten';
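Aggregations follow the same pattern of building new relations from old ones. As a closing sketch, and assuming stars was loaded with an int type as in the typed LOAD shown earlier, the statements below compute the average star rating per business and preview ten results:

grunt> by_business = GROUP reviews BY business_id;
grunt> avg_stars = FOREACH by_business GENERATE group AS business_id, AVG(reviews.stars) AS avg_stars;
grunt> preview = LIMIT avg_stars 10;
grunt> DUMP preview;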