Thursday, November 05, 2015

Null Handling in Hadoop Pig Latin

When you load a dataset, PigStorage converts empty chararray fields to null. So a freshly loaded relation never contains an empty string, only nulls.

However, a constant '' written in a Pig script is not treated as null.

So '' IS NOT NULL evaluates to true.
'' IS NULL does not evaluate to true.

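If A is a relation immediately after a load, a comparison such as $0 == '' will never be true: the field is either a non-empty string or null, and comparing null with anything yields null rather than true.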
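
As a minimal sketch of the point above, the two filters below behave differently on freshly loaded data. The file name 'data.txt', the tab delimiter, and the field names are hypothetical, chosen only for illustration.

```pig
-- Hypothetical tab-delimited input whose first field is sometimes empty.
A = LOAD 'data.txt' USING PigStorage('\t') AS (name:chararray, city:chararray);

-- Matches nothing: empty fields were loaded as null, and null == '' is not true.
empty_names = FILTER A BY name == '';

-- Matches the rows whose first field was empty in the file.
null_names = FILTER A BY name IS NULL;
```

To find fields that were empty in the input, test with IS NULL instead of comparing against ''.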

If you build a field manually with GENERATE, the value is kept exactly as you wrote it.

B = FOREACH A GENERATE $0, $1, ''; -- The third field stays an empty string
C = FOREACH A GENERATE $0, $2, (chararray) null; -- The third field stays null
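
Because a generated '' stays an empty string, one possible way to turn loaded nulls back into empty strings is a conditional expression inside GENERATE. This is only a sketch; the file name and field names are hypothetical.

```pig
-- Hypothetical input; empty fields arrive as null after the load.
A = LOAD 'data.txt' USING PigStorage('\t') AS (name:chararray, city:chararray);

-- Replace a null name with an empty string, keep other values unchanged.
B = FOREACH A GENERATE (name IS NULL ? '' : name) AS name, city;

-- Unlike on A, this comparison can now be true.
empty_names = FILTER B BY name == '';
```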

Sorting for NULLs

NULL is always treated as the smallest value. With ORDER BY ... DESC, nulls come last; with ASC, they come first.
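
A short sketch of both sort directions, again assuming a hypothetical 'data.txt' with a name field that is sometimes empty:

```pig
-- Hypothetical input; empty name fields are loaded as null.
A = LOAD 'data.txt' USING PigStorage('\t') AS (name:chararray, city:chararray);

-- Ascending: rows with a null name come first.
asc_sorted = ORDER A BY name ASC;

-- Descending: rows with a null name come last.
desc_sorted = ORDER A BY name DESC;
```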