FAQ
I am working on a MapReduce application that will take input from lots of
small XML files rather than one big XML file. Each XML file has some records
that I want to parse, and I want to insert the data into an HBase table. How
should I go about parsing the XML files and feeding them into the map
functions? Should I have one mapper per XML file, or is there another way of
doing this? Thanks for your help and time.

Regards,
Vipul Sharma,

  • Amandeep Khurana at Nov 3, 2009 at 12:01 am
    Are the XMLs in flat files or stored in HBase?

    1. If they are in flat files, you can use the StreamXmlRecordReader, if
    that works for you (a driver sketch follows at the end of this message).

    2. Or you can read the XML into a single string and process it however you
    want. (This can be done whether it is in a flat file or stored in an HBase
    table.) I have XMLs in an HBase table and parse and process them as
    strings.

    One mapper per file doesn't make sense. If the data is in HBase, have one
    mapper per region. If it is in flat files, you can create mappers depending
    on how many files you have. You can tune this for your particular
    requirement; there is no "right" way to do it.
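    [A minimal driver sketch for option 1, assuming the old JobConf API and
    the hadoop-streaming jar on the classpath; the <record> begin/end tags and
    the class name XmlJobDriver are placeholders, not details from the thread:

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.streaming.StreamInputFormat;

        public class XmlJobDriver {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(XmlJobDriver.class);
            conf.setJobName("xml-parse");

            // StreamXmlRecordReader delivers one complete <record>...</record>
            // block per map() call, so no record is cut by a split boundary.
            conf.setInputFormat(StreamInputFormat.class);
            conf.set("stream.recordreader.class",
                     "org.apache.hadoop.streaming.StreamXmlRecordReader");
            conf.set("stream.recordreader.begin", "<record>");  // placeholder tag
            conf.set("stream.recordreader.end", "</record>");   // placeholder tag

            // Set the mapper/reducer classes and output types for the real job.
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
          }
        }
    ]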
  • Vipul Sharma at Nov 3, 2009 at 12:39 am
    Okay, I think I was not clear about the question in my first post. Let me
    try again.

    I have an application that receives a large number of XML files every
    minute, which are copied over to HDFS. Each file is around 1 MB and
    contains several records. The files are well-formed XML, each with a
    starting tag <startingtag> and an end tag </startingtag>. I want to parse
    these files and put the relevant output data in HBase.

    Now, as input to the map function, I could read all the unread files into a
    string and parse them inside the map function using DOM or something like
    that. But then how do I deal with the multiple starting tags <startingtag>
    and ending tags </startingtag> in the string, since we concatenated several
    files together? (One way to handle this is sketched after this message.)
    And how do I manage splits, since Hadoop would want to split at the default
    block size, which might break the well-formed structure of the XML files?

    Another way to go about it would be to have a for loop in the driver class
    and provide one file at a time. I don't think that is a good way, since the
    files are very small and we would get almost no parallelization.

    Is there a way that I can input a list or array of files to the map
    function and do the parsing inside it? How would I take care of the splits
    and the XML tags if I do that?

    I hope I was clearer this time.

    Regards,
    Vipul Sharma,
    Cell: 281-217-0761
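
    [The sketch below handles the concatenation concern raised above: scan the
    string for begin/end tag pairs and DOM-parse each block on its own. This is
    a rough sketch, not code from the thread: the "records" table, the
    "data:xml" column, and the id attribute used as the row key are all
    hypothetical, and the HBase calls assume the 0.20-era client API:

        import java.io.ByteArrayInputStream;
        import javax.xml.parsers.DocumentBuilder;
        import javax.xml.parsers.DocumentBuilderFactory;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.util.Bytes;
        import org.w3c.dom.Document;

        public class ConcatenatedXmlParser {
          private static final String BEGIN = "<startingtag>";
          private static final String END = "</startingtag>";

          // Walks a string holding several concatenated XML documents,
          // DOM-parses each <startingtag>...</startingtag> block, and
          // writes one HBase row per block.
          public static void parseAndStore(String concatenated) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            HTable table = new HTable(new HBaseConfiguration(), "records"); // hypothetical table
            int from = 0;
            while (true) {
              int start = concatenated.indexOf(BEGIN, from);
              if (start < 0) break;                 // no more documents
              int end = concatenated.indexOf(END, start);
              if (end < 0) break;                   // truncated trailing document
              String oneDoc = concatenated.substring(start, end + END.length());
              Document dom = builder.parse(
                  new ByteArrayInputStream(oneDoc.getBytes("UTF-8")));
              // Hypothetical row key: an id attribute on the root element.
              String rowKey = dom.getDocumentElement().getAttribute("id");
              Put put = new Put(Bytes.toBytes(rowKey));
              put.add(Bytes.toBytes("data"), Bytes.toBytes("xml"),
                      Bytes.toBytes(oneDoc));
              table.put(put);
              from = end + END.length();
            }
            table.flushCommits(); // push any buffered writes
          }
        }
    ]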

  • Amandeep Khurana at Nov 3, 2009 at 12:59 am

    On Mon, Nov 2, 2009 at 4:39 PM, Vipul Sharma wrote:

    Okay, I think I was not clear about the question in my first post. Let me
    try again.

    I have an application that receives a large number of XML files every
    minute, which are copied over to HDFS. Each file is around 1 MB and
    contains several records. The files are well-formed XML, each with a
    starting tag <startingtag> and an end tag </startingtag>. I want to parse
    these files and put the relevant output data in HBase.

    Now, as input to the map function, I could read all the unread files into a
    string and parse them inside the map function using DOM or something like
    that. But then how do I deal with the multiple starting tags <startingtag>
    and ending tags </startingtag> in the string, since we concatenated several
    files together? And how do I manage splits, since Hadoop would want to
    split at the default block size, which might break the well-formed
    structure of the XML files?
    So you have multiple XMLs in a single file, and you have many such files.
    In that case, the best answer is the StreamXmlRecordReader.

    Or you can write your own InputFormat to create splits such that each
    split is an XML file in itself, or each record in a split is a complete
    XML message.
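
    [A sketch of that second option: an InputFormat that never splits a file,
    so each small XML document reaches a mapper as one intact record. It uses
    the old mapred API; the class names are made up for illustration:

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.FileInputFormat;
        import org.apache.hadoop.mapred.FileSplit;
        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;

        // Each ~1 MB file becomes exactly one record, so a well-formed
        // document is never broken by a split boundary.
        public class WholeXmlFileInputFormat
            extends FileInputFormat<NullWritable, Text> {

          @Override
          protected boolean isSplitable(FileSystem fs, Path file) {
            return false; // never split: keeps each XML document intact
          }

          @Override
          public RecordReader<NullWritable, Text> getRecordReader(
              InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
          }

          static class WholeFileRecordReader
              implements RecordReader<NullWritable, Text> {
            private final FileSplit split;
            private final JobConf conf;
            private boolean processed = false;

            WholeFileRecordReader(FileSplit split, JobConf conf) {
              this.split = split;
              this.conf = conf;
            }

            public boolean next(NullWritable key, Text value) throws IOException {
              if (processed) return false;
              byte[] contents = new byte[(int) split.getLength()];
              Path file = split.getPath();
              FSDataInputStream in = file.getFileSystem(conf).open(file);
              try {
                IOUtils.readFully(in, contents, 0, contents.length);
              } finally {
                IOUtils.closeStream(in);
              }
              value.set(contents, 0, contents.length); // the whole XML document
              processed = true;
              return true;
            }

            public NullWritable createKey() { return NullWritable.get(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return processed ? split.getLength() : 0; }
            public float getProgress() { return processed ? 1.0f : 0.0f; }
            public void close() {}
          }
        }
    ]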

  • Vipul Sharma at Nov 3, 2009 at 10:50 pm

    So you have multiple XMLs in a single file, and you have many such files.
    In that case, the best answer is the StreamXmlRecordReader.

    Or you can write your own InputFormat to create splits such that each
    split is an XML file in itself, or each record in a split is a complete
    XML message.
    Thanks, Amandeep! I want to use the splits efficiently. If I use one split
    per XML file, it won't be very efficient, since the XML files are small
    (~1 MB each). What I want to do is have multiple files in one split (one
    possible shape for this is sketched below). I will update you once I am
    done with this.
    --
    Vipul Sharma
    sharmavipul AT gmail DOT com
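
    [For packing several small files into one split, the old mapred API
    provides org.apache.hadoop.mapred.lib.CombineFileInputFormat. The sketch
    below is one possible shape for that, with hypothetical class names; note
    the per-file reader must expose the four-argument constructor that
    CombineFileRecordReader instantiates reflectively:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.InputSplit;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.RecordReader;
        import org.apache.hadoop.mapred.Reporter;
        import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
        import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
        import org.apache.hadoop.mapred.lib.CombineFileSplit;

        // Packs many ~1 MB files into each split so mappers are not dominated
        // by per-task startup cost; each file still arrives as one whole record.
        public class CombinedXmlInputFormat
            extends CombineFileInputFormat<NullWritable, Text> {

          public CombinedXmlInputFormat() {
            setMaxSplitSize(64L * 1024 * 1024); // cap bytes per split (example value)
          }

          @Override
          public RecordReader<NullWritable, Text> getRecordReader(
              InputSplit split, JobConf job, Reporter reporter) throws IOException {
            return new CombineFileRecordReader<NullWritable, Text>(
                job, (CombineFileSplit) split, reporter,
                (Class) WholeFileReader.class);
          }

          // Reads the idx-th file of the combined split as a single record.
          public static class WholeFileReader
              implements RecordReader<NullWritable, Text> {
            private final Path path;
            private final long length;
            private final Configuration conf;
            private boolean processed = false;

            // The constructor signature CombineFileRecordReader expects.
            public WholeFileReader(CombineFileSplit split, Configuration conf,
                                   Reporter reporter, Integer idx) {
              this.path = split.getPath(idx);
              this.length = split.getLength(idx);
              this.conf = conf;
            }

            public boolean next(NullWritable key, Text value) throws IOException {
              if (processed) return false;
              byte[] contents = new byte[(int) length];
              FSDataInputStream in = path.getFileSystem(conf).open(path);
              try {
                IOUtils.readFully(in, contents, 0, contents.length);
              } finally {
                IOUtils.closeStream(in);
              }
              value.set(contents, 0, contents.length); // one whole XML document
              processed = true;
              return true;
            }

            public NullWritable createKey() { return NullWritable.get(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return processed ? length : 0; }
            public float getProgress() { return processed ? 1.0f : 0.0f; }
            public void close() {}
          }
        }

    The 64 MB cap is an arbitrary example value; without some cap, a single
    split could absorb an entire directory of files.]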
