I currently work using SQL for Hive for Hadoop. Hive SQL is a little eccentric. If you have worked in other forms of SQL, it takes some getting used to.
Below are five blog posts from folkstalk.com which have helped me immensely.
- Date and Time Functions
- Numeric functions
- String functions
- Conditional functions
- Collection functions
Why are these posts any good?
- Date and Time Functions – Dates and Times in Hadoop are stored as strings in the format ‘YYYY-MM-DD HH:MM:SS.T’ (T is tenth of a second), not a date/time datatype or an integer. It gives you a good sense of how to deal with this.
- Numeric functions – It’s pretty much the same as other forms of SQL. Given how nerve rattling Hive SQL can be at first, it’s reassuring to know that some things are still the same.
- String functions – Check out the CONCAT function. It’s an actual function in this case and not a symbol like ‘&’.
- Conditional functions – Check out the COALESCE function. This one gets used quite a bit when one is trying to clean data by combining multiple fields. All it does is return the first non NULL value from the fields listed. It’s surprisingly useful.
- Collection functions – Check out the CAST function. This is how you change a value from one type to another. For example, to change the field ‘beers_drunk’ from a string to an integer would be
CAST(beers_drunk AS int)