Pig Basics for SQL Server DBAs

The estimated reading time for this post is 8 minutes

You might wonder what Pig has to do with SQL Server. At first glance the two might not seem related, but SQL Server now integrates with Hadoop and Big Data, and corporations are adopting these platforms, so SQL Server DBAs should learn some basics of Pig to perform data extraction from a Hadoop cluster. This blog post introduces Pig along with some basic commands and examples. You may refer to my previous blog post, ‘What is Big Data and Hadoop?’, for background.

What is Pig?

Apache Pig is a platform for creating MapReduce jobs on Hadoop. Pig uses a high-level, SQL-like programming language called Pig Latin. The benefits of Pig include:

  • Run a MapReduce job with a few simple lines of code.
  • Process structured data with a schema, or unstructured data without one.
  • Pig Latin uses a familiar SQL-like syntax.
  • Pig scripts read and write data from HDFS.
  • Pig Latin is a data flow language, a logical fit for many MapReduce algorithms.
  • Pig Latin can be used for Extract-Transform-Load (ETL) processing.

Pig was created at Yahoo! to make it easier to analyze the data in HDFS without the complexities of writing a traditional MapReduce program.

The developers of Pig published their philosophy to summarize the goals of Pig, using comparisons to actual pigs:

  • Pigs eat anything: Pig can process any data, structured or unstructured.
  • Pigs live anywhere: Pig can run on any parallel data processing framework, so Pig scripts do not have to run only on Hadoop.
  • Pigs are domestic animals: Pig is designed to be easily controlled and modified by its users.
  • Pigs fly: Pig is designed to process data quickly.

What is Pig Latin?

Pig Latin is a high-level data flow scripting language. Pig Latin scripts can be executed in one of the following ways:

  1. Pig Script: write a Pig Latin program in a text file and execute it using the Pig executable.
  2. Grunt Shell: enter Pig statements manually, one at a time, from a CLI tool known as the Grunt interactive shell (the first two options are sketched after this list).
  3. Embedded in Java: use the PigServer class to execute a Pig query from within Java code.
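
As a rough sketch of the first two options, assuming the pig executable is on your PATH (the script name here is a placeholder):

pig                         # start the Grunt interactive shell
pig myscript.pig            # run a Pig Latin script file on the Hadoop cluster
pig -x local myscript.pig   # run the same script in local mode, useful for testing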

Pig executes in a unique fashion – some commands build on previous commands, while certain commands trigger a MapReduce job:

  • During execution each statement is processed by the Pig interpreter.
  • If a statement is valid, it gets added to a logical plan built by the interpreter.
  • The steps in the logical plan are not actually executed until a DUMP or STORE command is reached, as the sketch below illustrates.
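
For example, in the following sketch (the file name is a placeholder), no job runs until the final line:

Stocks = LOAD 'marketdata.txt' USING TextLoader();  -- added to the logical plan, no job yet
Lines = LIMIT Stocks 10;                            -- still only building the plan
DUMP Lines;                                         -- triggers the MapReduce job and prints the result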

Pig Latin Relation Names

Each processing step of a Pig Latin script results in a new data set, referred to as a Relation. You assign names to relations, and the name of the relation is referred to as its Alias. For example, consider the following Pig Latin statement:

Stocks = LOAD 'marketdata.txt' USING TextLoader();

The alias Stocks is assigned to the relation created by the LOAD command; each record in this relation is a line of text from the marketdata.txt file. The Stocks alias now represents the collection of records in marketdata.txt.

TextLoader is a simple way of loading each line of text in a file into a record, no matter what the format of the data is.

Relation names are not variables, even though they look like variables. You can reassign an alias to a different relation, but that is not recommended.
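
For example, the following sketch reassigns the Stocks alias, which Pig allows but which makes scripts harder to follow:

Stocks = LOAD 'marketdata.txt' USING TextLoader();
Stocks = LIMIT Stocks 100;  -- legal, but reusing an alias is discouraged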

Pig Latin Field Names

You can also define field names when using the LOAD command to define a relation. Use the AS keyword to define field names.

Salaries = LOAD 'SalariesData.csv' USING PigStorage(',') AS (Designation,Gender,Age,Salary,Tax);

The alias for this relation is Salaries, and Salaries has five field names: Designation, Gender, Age, Salary and Tax. Field names can be used in subsequent processing commands. For example, when filtering a relation, you can refer to its fields in the BY clause, as shown in the following statement:

HighSalaries = FILTER Salaries BY Salary > 500000;

Field names refer to the values of the current record as the data passes through the pipeline of the Pig application. The HighSalaries relation will contain all records whose Salary field is greater than 500,000.
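
Since the LOAD statement above declares no field types, Pig defaults each field to bytearray. You can confirm this with the DESCRIBE command; the output looks roughly like the following:

DESCRIBE Salaries;
Salaries: {Designation: bytearray,Gender: bytearray,Age: bytearray,Salary: bytearray,Tax: bytearray}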

Remember that both field names and relation names must satisfy the following naming criteria:

  • Start with an alphabetic character.
  • Can contain alphabetic and numeric characters, as well as the underscore (_) character.
  • All characters must be ASCII.

Field names, relation names, and user-defined functions are case-sensitive in Pig Latin scripts. However, Pig Latin keywords such as LOAD and AS are not case-sensitive.

Pig Data Types

Pig has the following built-in scalar data types:

  • int: a 32-bit signed integer.
  • long: a 64-bit signed integer.
  • float: a 32-bit floating-point number.
  • double: a 64-bit floating-point number.
  • chararray: a string of Unicode characters.
  • bytearray: a blob, or array of bytes.
  • boolean: can be either true or false.
  • datetime: stores a date and time in the format YYYY-MM-DD HH:MM:SS.MS.
  • bigdecimal and biginteger: for performing precision arithmetic.

Pig Complex Types

Pig has three complex types:

  1. Tuple: ordered set of fields. A tuple is analogous to a row in a SQL table, with the fields being SQL columns.
  2. Bag: Unordered collection of tuples.
  3. Map: Collection of key value pairs.

Complex type formats:

Tuple:

(KL,Hamid,J. Fard,9999)

Bag:

{ (KL,Hamid,J. Fard,9999) , (KL, Melissa, Lenjap, 10000) , (KL, John, Smith, 10001) }

Map:

[State#KL,FN#Hamid,LN#J. Fard,ID#9999]

Nested complex types (a Bag as an element of a Tuple):

(Employee, { (KL,Hamid,J. Fard,9999) , (KL, Melissa, Lenjap, 10000) , (KL, John, Smith, 10001) } )

Defining a Schema

As mentioned above, Pig will eat any kind of data, but if your data has a known structure to it, then you can define a schema for it. The schema is typically defined when you load the data using the AS keyword. Consider the following statements:

Employee = LOAD 'HRdata.csv' USING PigStorage(',') AS (Firstname:chararray, Lastname:chararray, DOB:datetime, Salary:double);

If you load an Employee record that has more than four fields, the extra fields are truncated. If you load an Employee record that has fewer than four fields, Pig pads the end of the record with nulls.

The schema can also specify complex types. For example, suppose we have the following dataset in a file named 'Car_Insurance.txt':

Toyota,Vios, { (W999,1234) , (W989,3345) }
Honda, City, { (H456,7834) , (U3453,8134) }
Mercedes, E250, { (I899, 87223) , (Y3431, 9234) }

The corresponding relation might look like this:

Accidents = LOAD 'Car_Insurance.txt' AS (Manu:chararray, Model:chararray, Detail:bag{d:(PlateNumber:chararray,InsuranceNumber:int)});

Note: bags are odd in that the tuple inside the bag must have a name, which is d in the Accidents relation. However, you will never actually use the name d in any future Pig commands.
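
Running DESCRIBE on this relation shows the nested schema; the output looks roughly like this:

DESCRIBE Accidents;
Accidents: {Manu: chararray,Model: chararray,Detail: {d: (PlateNumber: chararray,InsuranceNumber: int)}}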

The GROUP Operator

One of the most common operators in Pig is GROUP, which collects all records with the same value for a provided key and puts them together into a bag. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:

  • The first field is named ‘group’ and is the same type as the group key.
  • The second field takes the name of the original relation and is type bag.

Suppose we have the following data set:

F,17,30000
M,19,75000
M,21,80000
F,17,38000
M,17,25000
F,21,56000

Let’s group the records together by Age:

Salaries = LOAD 'SalaryData.txt' USING PigStorage(',') AS (Gender:chararray, Age:int, Salary:double);

SalaryByAge = GROUP Salaries BY Age;
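
With the sample data above, DUMP SalaryByAge would produce one tuple per distinct Age, roughly as follows (tuple and bag ordering may vary):

(17,{(F,17,30000.0),(F,17,38000.0),(M,17,25000.0)})
(19,{(M,19,75000.0)})
(21,{(M,21,80000.0),(F,21,56000.0)})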

You can also group a relation by multiple keys. The keys must be listed in parentheses:

SalariesByAgeGender = GROUP Salaries BY (Age, Gender);

GROUP ALL Operator

You can group all of the records of a relation into a single tuple using the ALL option. Consider the following statement:

SalariesByAll = GROUP Salaries ALL;
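
The result is a single tuple whose first field is the literal 'all' and whose second field is a bag containing every record. GROUP ... ALL is commonly paired with aggregate functions such as COUNT; a minimal sketch:

Total = FOREACH SalariesByAll GENERATE COUNT(Salaries);
DUMP Total;  -- with the six-row sample above, this prints (6)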

Relation without a Schema

If you do not define a schema, then the fields of a relation are specified using an index that starts at $0. This works well for datasets that have a lot of columns or for data that is not structured. Consider the following statements:

Salaries = LOAD 'Salaries.txt' USING PigStorage(',');

The Salaries relation has four columns but does not define a schema.

SalariesGroup = GROUP Salaries BY $2;
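
Positional references work in other operators as well. For instance, this sketch (the alias name is a placeholder) projects the first and third fields of the schema-less relation:

Result = FOREACH Salaries GENERATE $0, $2;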

The FOREACH GENERATE Operator

The FOREACH..GENERATE operator transforms records based on a set of expressions you define. The operator works on each record in the data set. The result of a FOREACH is a new tuple, typically with a different schema. The FOREACH operator is a great tool for transforming your data into different data sets. The expressions in a FOREACH can contain fields, constants, mathematical expressions, the results of invoking a Pig function, and many other variations and nestings.

Let’s look at an example. The following command takes in the Salaries relation and generates a new relation that contains only two of the columns in Salaries:

Result = FOREACH Salaries GENERATE Age, Salary;

You can perform mathematical computations in the GENERATE clause.

Result = FOREACH Salaries GENERATE Age, Salary, Salary * 0.06 AS Tax;

Specifying Ranges in FOREACH

In the GENERATE clause, you can specify a range of fields, which is useful when working with datasets that have many fields. Consider the following statements (explained in the comments):

Salaries = LOAD 'Salarydata.txt' USING PigStorage(',') AS (Gender:chararray, Age:int, Salary:double, Zip:int);

C = FOREACH Salaries GENERATE Age..Zip;   -- Age through Zip: Age, Salary, Zip

D = FOREACH Salaries GENERATE Age..;      -- Age through the last field: Age, Salary, Zip

E = FOREACH Salaries GENERATE ..Salary;   -- the first field through Salary: Gender, Age, Salary

FOREACH with GROUP

Consider the following statements, which count the number of records for each gender:

Salaries = LOAD 'Salarydata.txt' USING PigStorage(',') AS (Gender:chararray, Age:int, Salary:double, Zip:int);

GResult = GROUP Salaries BY Gender;

Final = FOREACH GResult GENERATE group, COUNT(Salaries);

DUMP Final;
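
GResult holds one tuple per gender, each containing a bag of the matching Salaries records, and COUNT reduces each bag to its size. Assuming a dataset like the earlier sample, with three records per gender, the DUMP would print something like:

(F,3)
(M,3)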

The FILTER Operator

The FILTER operator selects tuples from a relation based on specified Boolean expressions. Use FILTER to select the rows you want. Consider the following statement:

G = FILTER Salaries BY Salary >= 20000.0;

Conditions can be combined using AND or OR. Note that AND binds more tightly than OR, so the following statement selects rows where either (Salary >= 20000.0 AND Gender == 'F') is true or Age > 17:

G = FILTER Salaries BY Salary >= 20000.0 AND Gender == 'F' OR Age > 17;

Use the NOT operator to reverse a condition. In Pig, MATCHES must match the entire string, so a pattern needs leading and trailing .* to find a substring. The following statement filters out all rows whose name contains an h or H:

Dataset:

Hamid
Melissa
John
Fred
Jane
Kevin

Names = LOAD 'names.txt' AS (FN:chararray);

NO_H = FILTER Names BY NOT FN MATCHES '.*[hH].*';

The LIMIT Operator

The LIMIT command limits the number of output tuples for a relation. Consider the following statement:

Salaries = LOAD 'Salarydata.txt' USING PigStorage(',') AS (Gender:chararray, Age:int, Salary:double, Zip:int);

L = LIMIT Salaries 3;

There is no guarantee which three tuples will be returned, and the tuples that are returned can change from one run to the next.
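
If you need a deterministic result, a common idiom is to sort before limiting. A minimal sketch (alias names are placeholders):

Ordered = ORDER Salaries BY Salary DESC;
Top3 = LIMIT Ordered 3;  -- the three highest salaries
DUMP Top3;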

Hamid J. Fard

I am a SQL Server Data Platform Expert with more than 9 years of professional experience. I am currently Microsoft Certified Master: SQL Server 2008, Microsoft Certified Solutions Master: Charter-Data Platform, Microsoft Data Platform MVP, and CIW Database Design Specialist. I also do Pig, Hive and Sqoop development on the Hadoop Big Data platform. After a few years as a production database administrator I jumped into the role of Data Platform Expert. Being a consultant allows me to work directly with customers to help answer questions regarding database issues for SQL Server.
