FAQ
I have a file in HDFS which has the following format:
c1<space>c2<space>c3<space>c4<tab>c5<space>c6<space>c7

where cX represents column X.

Can someone please show me how I can create a table in Hive for this?

I tried the following but it gave an error:
CREATE TABLE test (
c1 STRING,
c2 STRING,
c3 STRING,
c4 STRING,
c5 STRING,
c6 STRING,
c7 STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('regex'='(\w+) (\w+) (\w+) (\w+)\t(\w+) (\w+) (\w+)')
STORED AS TEXTFILE;

hive> load data inpath '/user/hadoop/test' into table test;

hive> select * from test;
OK
Failed with exception
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: This
table does not have serde property "input.regex"!

Thank you very much =)

  • Zheng Shao at Sep 10, 2009 at 4:59 am
    WITH SERDEPROPERTIES ('input.regex'='(\\w+) (\\w+) (\\w+) (\\w+)\\t(\\w+) (\\w+) (\\w+)')

    The double backslash is needed because a Hive string constant consumes one
    level of escaping, and the regular expression consumes another.
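
The two levels of escaping can be seen outside Hive as well. The following is a minimal sketch in Java, whose string literals also consume one level of backslash escaping before the regex engine sees the pattern. It is an analogy, not Hive's own parsing code; the class name `RegexEscapeDemo` and the sample row are assumptions made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexEscapeDemo {
    // The Java literal "\\w" becomes the two characters \w after the string
    // layer removes one level of escaping; the regex layer then reads \w as
    // "word character" and \t as a tab.
    static final String INPUT_REGEX =
        "(\\w+) (\\w+) (\\w+) (\\w+)\\t(\\w+) (\\w+) (\\w+)";

    // Returns the seven captured columns, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = Pattern.compile(INPUT_REGEX).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1);
        }
        return cols;
    }

    public static void main(String[] args) {
        // Hypothetical row: four space-separated fields, a tab, three more.
        String[] cols = parse("10 20 30 40\t50 60 70");
        System.out.println(String.join(",", cols)); // prints 10,20,30,40,50,60,70
    }
}
```

A row that uses a space where the tab is expected does not match, so `parse` returns null for it.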

    Please let us know where you saw the 'regex'='...' syntax. It is outdated,
    and we need to update it.

    Zheng
  • Mayuran Yogarajah at Sep 10, 2009 at 6:48 pm

    I really wish I had just made all the columns tab separated =/
    Thanks for the regexp, it now correctly parses the file.

    A question: RegexSerDe only works with string columns, so I had to use
    strings for my columns even though they are all numbers. What are the
    implications of this? Is there a performance penalty?

    Also, I notice that the file used to load the table no longer exists in
    HDFS after I issue the LOAD DATA INPATH command. Is this expected? Is
    there some way to get around this?

    thanks
  • Zheng Shao at Sep 10, 2009 at 7:10 pm
    1. Yes, performance will be affected, mainly because we do one regex
    match per row and create a lot of String objects. If the columns are
    defined as int and the table uses the default row format, those String
    objects are not created.

    2. Yes that's expected. You can do the following if you don't want to move
    the original files:

    CREATE EXTERNAL TABLE mytable (...) ... LOCATION "hdfs://.../"



    Zheng
  • Mayuran Yogarajah at Sep 11, 2009 at 4:06 am

    Is there anything I can do to alleviate this performance penalty without
    reformatting the data?

    thanks
  • Zheng Shao at Sep 11, 2009 at 4:10 am
    You can write your own specialized SerDe to make it more efficient.

    Basically, copy and paste RegexSerDe, and:
    1. use your own string scan instead of a regex match,
    2. return org.apache.hadoop.io.Text instead of java.lang.String (and reuse
    the same Text object for the same field across rows).
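
As a rough illustration of step 1, the sketch below splits a row with a single left-to-right scan instead of a regex match. It is not RegexSerDe code and does not implement the SerDe interface; the class name `FieldScanner` and the sample row are assumptions. It still returns `String` values, since reusing `org.apache.hadoop.io.Text` objects (step 2) would need Hadoop on the classpath.

```java
class FieldScanner {
    // Expected layout: c1 c2 c3 c4<TAB>c5 c6 c7
    private static final char[] DELIMS = {' ', ' ', ' ', '\t', ' ', ' '};

    // Splits one row by searching for each expected delimiter in order.
    // Returns null when a delimiter is missing. Unlike the regex, this scan
    // does not check that each field consists only of word characters.
    static String[] scan(String line) {
        String[] cols = new String[DELIMS.length + 1];
        int start = 0;
        for (int i = 0; i < DELIMS.length; i++) {
            int end = line.indexOf(DELIMS[i], start);
            if (end < 0) {
                return null;
            }
            cols[i] = line.substring(start, end);
            start = end + 1;
        }
        // Everything after the last delimiter is the final column.
        cols[DELIMS.length] = line.substring(start);
        return cols;
    }

    public static void main(String[] args) {
        // Hypothetical row: four space-separated fields, a tab, three more.
        String[] cols = scan("10 20 30 40\t50 60 70");
        System.out.println(String.join(",", cols)); // prints 10,20,30,40,50,60,70
    }
}
```

The trade-off is stricter validation in the regex version against less work per row here; whether the difference matters in practice would need to be measured on real data.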

    Zheng

Discussion Overview
group: user @
categories: hive, hadoop
posted: Sep 10, '09 at 4:15a
active: Sep 11, '09 at 4:10a
posts: 6
users: 2
website: hive.apache.org

2 users in discussion

Zheng Shao: 3 posts Mayuran Yogarajah: 3 posts
