What is the Hive SQL COALESCE function?
There is a function called COALESCE that exists in Hive SQL. It has the following syntax.
COALESCE(field1, field2, … , fieldn)
What does it do?
Of the fields above (field1, field2, … , fieldn)
, for each record returned it returns the value of the first field of the ones listed in the COALESCE function that is not NULL. If all of the fields are NULL, it returns NULL.
For example, let’s say one had 3 date fields, datefield1, datefield2, and datefield3
from the table tblDates
.
primary_key | datefield1 | datefield2 | datefield3 |
---|---|---|---|
1 | NULL | NULL | 1993-06-04 |
The code:
SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found
FROM
tblDates
WHERE
primary_key = 1
will return ‘1993-06-04’
OK, what on earth do I use this for?
You use it to create proxy values for NULLs in a field.
It is very useful when you have dirty data and have to use another field as an approximation of the dirty data’s actual value. If you have a field that is full of NULLs, you can use another field to put values in for those NULLs that you think provide a good approximate value of what should be there.
I want an example!
OK.
Let’s say you are William the Conqueror and have just decided to put together the Domesday Book, where you need to get everyone’s birth date so that you can figure out who and how much you can tax in your new conquered land of England and Wales. Let’s say to tax them that you need to know that they are legal adults over the age of 18 but birth records are not very good in 1066. Let’s also say you magically have an apache server with hadoop on it.
What do you do?
Let’s say your tblDomesDayProxyBirthDays table has 4 fields
table fields | data status of the field |
---|---|
thouGreatethsOfKeyeths | an 11th century version of a Social Security number |
thouDayOfBirth | You have some birthdays but not all |
thouBaptism | You have more baptisms but not all |
thouLordRegistrationDate | You have all the days they first registered with the local lord of the manor |
Let’s say you have these three records in tblDomesDayProxyBirthDays.
thouGreatethsOfKeyeths | thouDayOfBirth | thouBaptism | thouLordRegistrationDate |
---|---|---|---|
1 | 1048-02-15 | 1048-03-01 | 1048-04-01 |
2 | NULL | 1049-08-29 | 1049-10-29 |
3 | NULL | NULL | 1050-07-01 |
Well, you need their birthday to figure out if they are adults you can tax. However, you do not always have their “thouDayOfBirth” as shown above.
COALESCE allows you to use other data from other fields as a proxy.
If you use the code:
SELECT
thouGreatethsOfKeyeths
, COALESCE(thouDayOfBirth, thouBaptism, thouLordRegistrationDate) AS proxy_bday
FROM
tblDomesDayBirthDays
You will get:
thouGreatethsOfKeyeths | proxy_bday |
1 | 1048-02-15 |
2 | 1049-08-29 |
3 | 1050-07-01 |
For the first subject, you have their actual birthday. For the second subject, you have their first day of kindergarten. For the third subject, you have the day they registered with the lord of the manor. You now have a birthday for all of your subjects, even if some of them are not as accurate as you would like.
Related books picked – and if possible read – by me. Sponsored by Amazon Associates.