You are previewing Programming Pig.

Programming Pig

Cover of Programming Pig by Alan Gates Published by O'Reilly Media, Inc.
  1. Programming Pig
    1. SPECIAL OFFER: Upgrade this ebook with O’Reilly
    2. Preface
      1. Data Addiction
      2. Who Should Read This Book
      3. Conventions Used in This Book
      4. Code Examples in This Book
      5. Using Code Examples
      6. Safari® Books Online
      7. How to Contact Us
      8. Acknowledgments
    3. 1. Introduction
      1. What Is Pig?
      2. Pig’s History
    4. 2. Installing and Running Pig
      1. Downloading and Installing Pig
      2. Running Pig
    5. 3. Grunt
      1. Entering Pig Latin Scripts in Grunt
      2. HDFS Commands in Grunt
      3. Controlling Pig from Grunt
    6. 4. Pig’s Data Model
      1. Types
      2. Schemas
    7. 5. Introduction to Pig Latin
      1. Preliminary Matters
      2. Input and Output
      3. Relational Operations
      4. User Defined Functions
    8. 6. Advanced Pig Latin
      1. Advanced Relational Operations
      2. Integrating Pig with Legacy Code and MapReduce
      3. Nonlinear Data Flows
      4. Controlling Execution
      5. Pig Latin Preprocessor
    9. 7. Developing and Testing Pig Latin Scripts
      1. Development Tools
      2. Testing Your Scripts with PigUnit
    10. 8. Making Pig Fly
      1. Writing Your Scripts to Perform Well
      2. Writing Your UDF to Perform
      3. Tune Pig and Hadoop for Your Job
      4. Using Compression in Intermediate Results
      5. Data Layout Optimization
      6. Bad Record Handling
    11. 9. Embedding Pig Latin in Python
      1. Compile
      2. Bind
      3. Run
      4. Utility Methods
    12. 10. Writing Evaluation and Filter Functions
      1. Writing an Evaluation Function in Java
      2. Algebraic Interface
      3. Accumulator Interface
      4. Python UDFs
      5. Writing Filter Functions
    13. 11. Writing Load and Store Functions
      1. Load Functions
      2. Store Functions
    14. 12. Pig and Other Members of the Hadoop Community
      1. Pig and Hive
      2. Cascading
      3. NoSQL Databases
      4. Metadata in Hadoop
    15. A. Built-in User Defined Functions and Piggybank
      1. Built-in UDFs
      2. Piggybank
    16. B. Overview of Hadoop
      1. MapReduce
      2. Hadoop Distributed File System
    17. Index
    18. About the Author
    19. Colophon
    20. SPECIAL OFFER: Upgrade this ebook with O’Reilly
O'Reilly logo

Schemas

Pig has a very lax attitude when it comes to schemas. This is a consequence of Pig’s philosophy of eating anything. If a schema for the data is available, Pig will make use of it, both for up-front error checking and for optimization. But if no schema is available, Pig will still process the data, making the best guesses it can based on how the script treats the data. First, we will look at ways that you can communicate the schema to Pig; then, we will examine how Pig handles the case where you do not provide it with the schema.

The easiest way to communicate the schema of your data to Pig is to explicitly tell Pig what it is when you load the data:

dividends = load 'NYSE_dividends' as
    (exchange:chararray, symbol:chararray, date:chararray, dividend:float);

Pig now expects your data to have four fields. If it has more, it will truncate the extra ones. If it has less, it will pad the end of the record with nulls.

It is also possible to specify the schema without giving explicit data types. In this case, the data type is assumed to be bytearray:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);

Warning

You would expect that this also would force your data into a tuple with four fields, regardless of the number of actual input fields, just like when you specify both names and types for the fields. And in Pig 0.9 this is what happens. But in 0.8 and earlier versions it does not; no truncation or null padding is done in the case where you do not provide explicit types for the fields.

Also, when you declare a schema, you do not have to declare the schema of complex types, but you can if you want to. For example, if your data has a tuple in it, you can declare that field to be a tuple without specifying the fields it contains. You can also declare that field to be a tuple that has three columns, all of which are integers. Table 4-1 gives the details of how to specify each data type inside a schema declaration.

Table 4-1. Schema syntax

Data typeSyntaxExample
intintas (a:int)
longlongas (a:long)
floatfloatas (a:float)
doubledoubleas (a:double)
chararraychararrayas (a:chararray)
bytearraybytearrayas (a:bytearray)
mapmap[] or map[type], where type is any valid type. This declares all values in the map to be of this type.as (a:map[], b:map[int])
tupletuple() or tuple(list_of_fields), where list_of_fields is a comma-separated list of field declarations.as (a:tuple(), b:tuple(x:int, y:int))
bagbag{} or bag{t:(list_of_fields)}, where list_of_fields is a comma-separated list of field declarations. Note that, oddly enough, the tuple inside the bag must have a name, here specified as t, even though you will never be able to access that tuple t directly.(a:bag{}, b:bag{t:(x:int, y:int)})

The runtime declaration of schemas is very nice. It makes it easy for users to operate on data without having to first load it into a metadata system. It also means that if you are interested in only the first few fields, you only have to declare those fields.

But for production systems that run over the same data every hour or every day, it has a couple of significant drawbacks. One, whenever your data changes, you have to change your Pig Latin. Two, although this works fine on data with 5 columns, it is painful when your data has 100 columns. To address these issues, there is another way to load schemas in Pig.

If the load function you are using already knows the schema of the data, the function can communicate that to Pig. (Load functions are how Pig reads data; see Load for details.) Load functions might already know the schema because it is stored in a metadata repository such as HCatalog, or it might be stored in the data itself (if, for example, the data is stored in JSON format). In this case, you do not have to declare the schema as part of the load statement. And you can still refer to fields by name because Pig will fetch the schema from the load function before doing error checking on your script:

mdata = load 'mydata' using HCatLoader();
cleansed = filter mdata by name is not null;
...

But what happens when you cross the streams? What if you specify a schema and the loader returns one? If they are identical, all is well. If they are not identical, Pig will determine whether it can adapt the one returned by the loader to match the one you gave. For example, if you specified a field as a long and the loader said it was an int, Pig can and will do that cast. However, if it cannot determine a way to make the loader’s schema fit the one you gave, it will give an error. See Casts for a list of casts Pig can and cannot insert to make the schemas work together.

Now let’s look at the case where neither you nor the load function tell Pig what the data’s schema is. In addition to being referenced by name, fields can be referenced by position, starting from zero. The syntax is a dollar sign, then the position: $0 refers to the first field. So it is easy to tell Pig which field you want to work with. But how does Pig know the data type? It does not, so it starts by assuming everything is a bytearray. Then it looks at how you use those fields in your script, drawing conclusions about what you think those fields are and how you want to use them. Consider the following:

--no_schema.pig
daily = load 'NYSE_daily';
calcs = foreach daily generate $7 / 1000, $3 * 100.0, SUBSTRING($0, 0, 1), $6 - $3;

In the expression $7 / 1000, 1000 is an integer, so it is a safe guess that the eighth field of NYSE_daily is an integer or something that can be cast to an integer. In the same way, $3 * 100.0 indicates $3 is a double, and the use of $0 in a function that takes a chararray as an argument indicates the type of $0. But what about the last expression, $6 - $3? The - operator is used only with numeric types in Pig, so Pig can safely guess that $3 and $6 are numeric. But should it treat them as integers or floating-point numbers? Here Pig plays it safe and guesses that they are floating points, casting them to doubles. This is the safer bet because if they actually are integers, those can be represented as floating-point numbers, but the reverse is not true. However, because floating-point arithmetic is much slower and subject to loss of precision, if these values really are integers, you should cast them so that Pig uses integer types in this case.

There are also cases where Pig cannot make any intelligent guess:

--no_schema_filter
daily = load 'NYSE_daily';
fltrd = filter daily by $6 > $3;

> is a valid operator on numeric, chararray, and bytearray types in Pig Latin. So, Pig has no way to make a guess. In this case, it treats these fields as if they were bytearrays, which means it will do a byte-to-byte comparison of the data in these fields.

Pig also has to handle the case where it guesses wrong and must adapt on the fly. Consider the following:

--unintended_walks.pig
player     = load 'baseball' as (name:chararray, team:chararray,
               pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate bat#'base_on_balls' - bat#'ibbs';

Because the values in maps can be of any type, Pig has no idea what type bat#'base_on_balls' and bat#'ibbs' are. By the rules laid out previously, Pig will assume they are doubles. But let’s say they actually turn out to be represented internally as integers.[5] In that case, Pig will need to adapt at runtime and convert what it thought was a cast from bytearray to double into a cast from int to double. Note that it will still produce a double output and not an int output. This might seem nonintuitive; see How Strongly Typed Is Pig? for details on why this is so. It should be noted that in Pig 0.8 and earlier, much of this runtime adaption code was shaky and often failed. In 0.9, much of this has been fixed. But if you are using an older version of Pig, you might need to cast the data explicitly to get the right results.

Finally, Pig’s knowledge of the schema can change at different points in the Pig Latin script. In all of the previous examples where we loaded data without a schema and then passed it to a foreach statement, the data started out without a schema. But after the foreach, the schema is known. Similarly, Pig can start out knowing the schema, but if the data is mingled with other data without a schema, the schema can be lost. That is, lack of schema is contagious:

--no_schema_join.pig
divs  = load 'NYSE_dividends' as (exchange, stock_symbol, date, dividends);
daily = load 'NYSE_daily';
jnd   = join divs by stock_symbol, daily by $1;

In this example, because Pig does not know the schema of daily, it cannot know the schema of the join of divs and daily.

Casts

The previous sections have referenced casts in Pig without bothering to define how casts work. The syntax for casts in Pig is the same as in Java—the type name in parentheses before the value:

--unintended_walks_cast.pig
player     = load 'baseball' as (name:chararray, team:chararray,
               pos:bag{t:(p:chararray)}, bat:map[]);
unintended = foreach player generate (int)bat#'base_on_balls' - (int)bat#'ibbs';

The syntax for specifying types in casts is exactly the same as specifying them in schemas, as shown previously in Table 4-1.

Not all conceivable casts are allowed in Pig. Table 4-2 describes which casts are allowed between scalar types. Casts to bytearrays are never allowed because Pig does not know how to represent the various data types in binary format. Casts from bytearrays to any type are allowed. Casts to and from complex types currently are not allowed, except from bytearray, although conceptually in some cases they could be.

Table 4-2. Supported casts

 To intTo longTo floatTo doubleTo chararray
From int Yes.Yes.Yes.Yes.
From longYes. Any values greater than 231 or less than –231 will be truncated. Yes.Yes.Yes.
From floatYes. Values will be truncated to int values.Yes. Values will be truncated to long values. Yes.Yes.
From doubleYes. Values will be truncated to int values.Yes. Values will be truncated to long values.Yes. Values with precision beyond what float can represent will be truncated. Yes.
From chararrayYes. Chararrays with nonnumeric characters result in null.Yes. Chararrays with nonnumeric characters result in null.Yes. Chararrays with nonnumeric characters result in null.Yes. Chararrays with nonnumeric characters result in null. 

One type of casting that requires special treatment is casting from bytearray to other types. Because bytearray indicates a string of bytes, Pig does not know how to convert its contents to any other type. Continuing the previous example, both bat#'base_on_balls' and bat#'ibbs' were loaded as bytearrays. The casts in the script indicate that you want them treated as ints.

Pig does not know whether integer values in baseball are stored as ASCII strings, Java serialized values, binary-coded decimal, or some other format. So it asks the load function, because it is that function’s responsibility to cast bytearrays to other types. In general this works nicely, but it does lead to a few corner cases where Pig does not know how to cast a bytearray. In particular, if a UDF returns a bytearray, Pig will not know how to perform casts on it because that bytearray is not generated by a load function.

Before leaving the topic of casts, we need to consider cases where Pig inserts casts for the user. These casts are implicit, compared to explicit casts where the user indicates the cast. Consider the following:

--total_trade_estimate.pig
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
            date:chararray, open:float, high:float, low:float, close:float,
            volume:int, adj_close:float);
rough = foreach daily generate volume * close;

In this case, Pig will change the second line to (float)volume * close to do the operation without losing precision. In general, Pig will always widen types to fit when it needs to insert these implicit casts. So, int and long together will result in a long; int or long and float will result in a float; and int, long, or float and double will result in a double. There are no implicit casts between numeric types and chararrays or other types.



[5] That is not the case in the example data. For that to be the case, you would need to use a loader that did load the bat map with these values as integers.

The best content for your career. Discover unlimited learning on demand for around $1/day.