You might wonder how Pig relates to SQL Server. At first glance the two may seem unrelated, but SQL Server is already integrated with Hadoop and Big Data, and corporations are going to use that integration, so SQL Server DBAs should learn the basics of Pig to perform data extraction from a Hadoop cluster. This blog post introduces Pig along with some basic commands and examples. You may refer to my previous blog post, ‘What is Big Data and Hadoop?’, for background.
What is Pig?
Apache Pig is a platform on Hadoop for creating MapReduce jobs. Pig uses a high-level, SQL-like programming language called Pig Latin. The benefits of Pig include:
- Run a MapReduce job with a few simple lines of code.
- Process structured data with a schema, or unstructured data without one.
- Pig Latin uses a familiar SQL-Like syntax.
- Pig scripts read and write data from HDFS.
- Pig Latin is a data flow language, a logical solution for many MapReduce algorithms.
- Pig Latin can be used for Extract-Transform-Load (ETL) processing.
The developers of Pig published their philosophy to summarize the goals of Pig, using comparisons to actual pigs:
- Pigs eat anything: Pig can process any data, structured or unstructured.
- Pigs live anywhere: Pig can run on any parallel data processing framework, so Pig scripts do not have to run only on Hadoop.
- Pigs are domestic animals: Pig is designed to be easily controlled and modified by its users.
- Pigs fly: Pig is designed to process data quickly.
What is Pig Latin?
Pig Latin is a high-level data flow scripting language. Pig Latin scripts can be executed in one of the following ways:
- Pig Script: write a Pig Latin program in a text file and execute it using the Pig executable.
- Grunt Shell: enter Pig statements manually, one at a time, from a CLI tool known as the Grunt interactive shell.
- Embedded in Java: use the PigServer class to execute a Pig query from within Java code.
Pig executes in a unique fashion, where some commands build on previous commands while certain commands trigger a MapReduce job:
- During execution each statement is processed by the Pig interpreter.
- If a statement is valid, it gets added to a logical plan built by the interpreter.
- The steps in the logical plan do not actually execute until a DUMP or STORE command is invoked, as the sketch below illustrates.
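Here is a minimal sketch (the file name 'mydata.txt' is an assumption for illustration). The LOAD and FILTER statements only extend the logical plan; the MapReduce job is triggered by the final DUMP:

A = LOAD 'mydata.txt';              -- added to the logical plan; nothing executes yet
B = FILTER A BY $0 IS NOT NULL;     -- still just plan-building
DUMP B;                             -- DUMP triggers the MapReduce job that runs both steps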
Pig Latin Relation Names
Each processing step of a Pig Latin script results in a new data set, referred to as a Relation. You assign names to relations, and the name of the relation is referred to as its Alias. For example, consider the following Pig Latin statement:
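A sketch of such a statement (assuming marketdata.txt resides in HDFS; the built-in TextLoader reads each line as a single chararray field):

Stocks = LOAD 'marketdata.txt' USING TextLoader() AS (line:chararray);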
The alias Stocks is assigned to the relation created by the LOAD command; each record in that relation is a line of text from the marketdata.txt file. The Stocks alias now represents the collection of records in marketdata.txt.
Relation names are not variables, even though they look like variables. You can reassign an alias to a different relation, but that is not recommended.
Pig Latin Field Names
You can also define field names when using the LOAD command to define a relation. Use the AS keyword to define field names, as in the following statement:
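A sketch of such a LOAD (the file name and comma delimiter are assumptions; the field names come from the text below):

Salaries = LOAD 'salaries.txt' USING PigStorage(',') AS (Designation:chararray, Gender:chararray, Age:int, Salary:double, Tax:double);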
The alias for this relation is Salaries, and Salaries has five field names: Designation, Gender, Age, Salary and Tax. Field names can be used in subsequent processing commands. For example, when filtering a relation, you can refer to its fields in the BY clause, as shown in the following statement:
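A sketch of that FILTER statement, matching the description below:

HighSalaries = FILTER Salaries BY Salary > 500000;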
Field names carry the values of the current record as the data passes through the pipeline of the Pig application. The HighSalaries relation will contain all records whose Salary field is greater than 500,000.
Remember that both field names and relation names must satisfy the following naming criteria:
- Start with an alphabetic character.
- Can contain alphabetic and numeric characters, as well as the underscore (_) character.
- All characters must be ASCII.
Pig Data Types
Pig has the following built-in scalar data types:
- int: a 32-bit signed integer
- long: a 64-bit signed integer
- float: a 32-bit floating-point number
- double: a 64-bit floating-point number
- chararray: a string of Unicode characters
- bytearray: a blob, or array of bytes
- boolean: either true or false
- datetime: stores a date and time (ISO 8601 format, e.g. 1970-01-01T00:00:00.000+00:00)
- bigdecimal and biginteger: for arbitrary-precision arithmetic
Pig Complex Types
Pig has three complex types:
- Tuple: ordered set of fields. A tuple is analogous to a row in a SQL table, with the fields being SQL columns.
- Bag: Unordered collection of tuples.
- Map: Collection of key value pairs.
The complex types are written in the following literal formats:
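For example, with illustrative values:

(19, 2)                      -- a tuple with two fields
{(19, 2), (18, 1)}           -- a bag containing two tuples
[open#apache, users#1000]    -- a map with two key/value pairs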
Defining a Schema
As mentioned above, Pig will eat any kind of data, but if your data has a known structure, then you can define a schema for it. The schema is typically defined when you load the data, using the AS keyword. Consider the following statements:
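A minimal sketch (the file 'employees.txt' and its comma-delimited layout are assumptions):

Employees = LOAD 'employees.txt' USING PigStorage(',') AS (name:chararray, age:int, salary:double);
DESCRIBE Employees;  -- prints the schema of the relation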
The schema can also specify complex types. For example, suppose we have a dataset in a file named ‘Car_Insurance.txt’ whose records contain complex fields. The corresponding relation might look like this:
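A sketch, assuming hypothetical fields: an id, a vehicle tuple, a bag of claim tuples, and a map of attributes:

-- a hypothetical tab-delimited input line might look like:
-- 1	(Toyota,Corolla)	{(2019,1200.0),(2021,800.0)}	[color#red]
CarInsurance = LOAD 'Car_Insurance.txt' AS (
    id:int,
    vehicle:tuple(make:chararray, model:chararray),
    claims:bag{claim:(year:int, amount:double)},
    attributes:map[]
);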
The GROUP Operator
One of the most common operators in Pig is GROUP, which collects all records with the same value for a provided key and puts them together into a bag. The result of a GROUP operation is a relation that includes one tuple per group. This tuple contains two fields:
- The first field is named ‘group’ and is the same type as the group key.
- The second field takes the name of the original relation and is of type bag.
As an example, let’s group the records of the Salaries relation together by Age:
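A minimal sketch, reusing the Salaries relation defined earlier:

SalariesByAge = GROUP Salaries BY Age;
-- each tuple of SalariesByAge has a 'group' field (the Age value)
-- and a 'Salaries' field: a bag of all records with that Age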
GROUP ALL Operator
You can group all of the records of a relation into a single tuple using the ALL option. Consider the following statement:
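A sketch, again using the Salaries relation:

AllSalaries = GROUP Salaries ALL;
-- the result is a single tuple: the first field is the literal 'all',
-- the second is a bag containing every record of Salaries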
Relation without a Schema
If you do not define a schema, then the fields of a relation are specified using an index that starts at $0. This works well for datasets that have a lot of columns or for data that is not structured. Consider the following statements:
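A sketch, assuming a comma-delimited file loaded without a schema:

Raw = LOAD 'salaries.txt' USING PigStorage(',');  -- no AS clause, so no schema
FirstTwo = FOREACH Raw GENERATE $0, $1;           -- reference fields by position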
The FOREACH GENERATE Operator
The FOREACH..GENERATE operator transforms records based on a set of expressions you define. The operator works on each record in the data set. The result of a FOREACH is a new tuple, typically with a different schema. The FOREACH operator is a great tool for transforming your data into different data sets. The expressions in a FOREACH can contain fields, constants, mathematical expressions, the results of invoking a Pig function, and many other variations and nestings.
Let’s look at an example. The following command takes in the Salaries relation and generates a new relation that contains only two of the columns in Salaries:
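A sketch (the choice of the two columns is an assumption):

Pay = FOREACH Salaries GENERATE Designation, Salary;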
Specifying Ranges in FOREACH
Within the GENERATE clause, you can specify a range of fields, which is useful when working with datasets that have many fields. Consider the following statements:
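A sketch using the .. range operator on the Salaries relation (the field choices are illustrative):

Middle = FOREACH Salaries GENERATE Gender..Salary;  -- Gender, Age and Salary
Leading = FOREACH Salaries GENERATE ..Age;          -- Designation through Age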
FOREACH with GROUP
Consider the following statement:
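A sketch that aggregates the groups built earlier (the AVG aggregate is an illustrative choice):

AvgSalaryByAge = FOREACH SalariesByAge GENERATE group AS Age, AVG(Salaries.Salary) AS AvgSalary;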
The FILTER Operator
The FILTER operator selects tuples from a relation based on specified Boolean expressions. Use FILTER to select the rows you want. Consider the following statement:
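A sketch (the Boolean conditions are illustrative):

YoungHighEarners = FILTER Salaries BY (Age < 30) AND (Salary > 300000);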
Use the NOT operator to reverse a condition. The following statement uses NOT to filter out all rows that match a regular expression:
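A sketch (the regular expression is an assumption):

NonManagers = FILTER Salaries BY NOT (Designation MATCHES '.*Manager.*');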
The LIMIT Operator
The LIMIT command limits the number of output tuples for a relation. Consider the following statement:
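A sketch that keeps at most ten tuples (the count is arbitrary):

First10 = LIMIT Salaries 10;
DUMP First10;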