I've written my own custom SerDe to handle some log files in a custom
format, and since I'd like to (eventually) use the JDBC driver down
the line, I want to retain the column types for the output.
Part of the reason for this is that we're using OpenCSV
(http://opencsv.sourceforge.net/) to produce them in the first place,
so it'd be good to use it again to parse the files when used for
querying in Hive.
I've implemented my own SerDe, originally using
MetadataTypedColumnsetSerDe as a basis. However, whenever I run a
query, no data is returned, regardless of how much data I load into
the table. The load itself proceeds fine. I am using the version of
Hive from Cloudera's CDH3 distribution (based on 0.5.0).
My create table statement is:
CREATE TABLE my_test_table (col_name_1 STRING, col_name_2 INT, ...
etc) COMMENT 'Some comment' PARTITIONED BY (part_col_1 STRING)
ROW FORMAT SERDE "com.my.package.named.MyNewSerDe" STORED AS TEXTFILE;
I have switched on debug logging and added a number of debug
statements to my code. I've found that when I run a simple query
(like "select * from my_test_table limit 10;") that executes
locally, Hive does find the class: it calls the initialize method
and calls the getObjectInspector method a number of times.
Subsequently though, it calls initialize on LazySimpleSerDe three
times. The first two times it has dummy column names (_col0) and the
correct column types in the correct order. The last time it contains
no column names or types at all.
Presumably I'm missing something fairly simple somewhere (a missing
class extension, the wrong class returned by getSerializedClass(), or
perhaps an incorrectly constructed ObjectInspector?), but for the
life of me I can't spot it. The underlying files are just CSVs
produced with the OpenCSV library mentioned above.
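For reference, the per-row conversion my deserialize method is aiming for is roughly the following (a stdlib-only sketch: a naive String.split stands in for OpenCSV's CSVParser, which would handle quoting properly, and the type names are just the Hive primitive type names for my columns — the actual SerDe works against Hive's ObjectInspector machinery, omitted here for brevity):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Main {

    // Hypothetical stand-in for the SerDe's deserialize(): split one CSV
    // row and convert each field to its declared Hive column type.
    static List<Object> deserializeRow(String row, List<String> columnTypes) {
        // OpenCSV's CSVParser would replace this naive split.
        String[] fields = row.split(",", -1);
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < columnTypes.size(); i++) {
            String raw = i < fields.length ? fields[i] : null;
            if (raw == null || raw.isEmpty()) {
                out.add(null); // missing field -> SQL NULL
                continue;
            }
            switch (columnTypes.get(i)) {
                case "int":    out.add(Integer.parseInt(raw));   break;
                case "bigint": out.add(Long.parseLong(raw));     break;
                case "double": out.add(Double.parseDouble(raw)); break;
                default:       out.add(raw); // "string" and anything else
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            deserializeRow("abc,42", Arrays.asList("string", "int")));
    }
}
```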
I'd be very grateful for any suggestions.