(As an FYI I'm relatively new to Hive and have no previous SQL
experience, so have been struggling a bit with the Language manual which
seems to assume previous SQL experience)
Suppose I have a table, within which there is a column which contains
domain names (ie such as hadoop.apache.org). I want to perform a count
of all second-level domains, ie hadoop.apache.org and hive.apache.org
would count in the same bucket.
Now I could count things for a particular second-level domain like this:
year, month, day, hour, COUNT(1) as count
year = 2011 AND
month = 05 AND
day = 15 AND
domain RLIKE ".*[.]apache[.]org"
year, month, day, hour
however I'm not seeing a way to sum up all second-level domains rather
than a particular one. I basically want to group everything using a
regular expression along the lines of ".*[.][^.]*[.][^.]*" and then
output lines with a count for the common portion. Any pointers in the
correct direction would be welcome.