Processing xml files
I just started learning Hadoop and got done with the wordcount MapReduce
example. I also briefly looked at Hadoop streaming.

Some questions
1) What should be my first step now? Are there more examples
somewhere that I can try out?
2) The second question is around practical usability with XML files. Our
XML files are not big, around 120 KB in size, but Hadoop is really
meant for big files, so how do I go about processing these XML files?
3) Are there any samples or advice on how to do processing with XML files?


Looking for help and pointers.


  • Aleksandr Elbakyan at May 24, 2011 at 11:25 pm
    Hello,

    We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.
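
    For illustration, a rough sketch of such a conversion (assuming records
    made of <column id>/<value> elements; the element names, output columns
    and file layout are assumptions, not our actual converter):

    #!/usr/bin/env python
    # Hypothetical XML-to-TSV converter: reads small XML files made of
    # <column id="..."><value>...</value></column> elements and writes one
    # tab-delimited line per file. Assumes each file is well formed (a single
    # root element, with any namespace prefixes declared).
    import sys
    import xml.etree.ElementTree as ET

    COLUMNS = ['Name', 'age']   # assumed output column order

    def xml_to_fields(path):
        root = ET.parse(path).getroot()
        fields = {}
        for column in root.iter('column'):
            fields[column.get('id', '')] = column.findtext('value', default='').strip()
        return fields

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            fields = xml_to_fields(path)
            print('\t'.join(fields.get(c, '') for c in COLUMNS))

    The resulting tab-delimited file is then ordinary line-oriented input
    for the streaming job.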

    Regards,
    Aleksandr

  • Mohit Anchlia at May 24, 2011 at 11:42 pm

    On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan wrote:
    Hello,

    We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.
    Can you please give more info?
    Do you append data from multiple XML files as lines into one file, or
    some other way? If so, how big do you let the files get?
    How do you create these files, assuming your XML is stored somewhere
    else in a DB or filesystem? Do you read them one by one?
    What are your experiences using text files instead of XML?
    Is there a reason why XML files can't or shouldn't be used directly in Hadoop?
    Any performance implications?
    Any suggested reading in this area?

    Our XML is something like:

    <column id="Name" security="sensitive" xsi:type="Text">
    <value>free a last</value>
    </column>
    <column id="age" security="no" xsi:type="Text">
    <value>40</value>
    </column>

    And we would, for example, want to know how many customers are above a
    certain age, or of a certain age with a certain income, etc.

    Sorry for all the questions. I am new and trying to get a grasp and
    also learn how I would actually solve our use case.
  • Aleksandr Elbakyan at May 24, 2011 at 11:55 pm
    Can you please give more info?
    We currently have an off-Hadoop process which uses a Java XML parser to convert it to a flat file. We have files from a couple of KB to tens of GB.
    Do you append data from multiple XML files as lines into one file, or
    some other way? If so, how big do you let the files get?


    We currently feed our process a folder with the converted files. We don't size it in any way; we let Hadoop handle it.

    How do you create these files, assuming your XML is stored somewhere
    else in a DB or filesystem? Do you read them one by one?


    What are your experiences using text files instead of XML?
    If you are using a streaming job it is easier to build your logic if you have one flat format; you can try to parse the XML in your mapper and convert it for the reducer, but why don't you just write a small app which will convert it?
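
    For what it's worth, a rough sketch of the parse-in-the-mapper variant
    (it assumes each XML record has already been flattened onto a single
    line, since streaming feeds the mapper line by line, and uses the column
    ids from your sample; everything else is an assumption):

    #!/usr/bin/env python
    # Hypothetical streaming mapper: each stdin line is one complete XML
    # record; parse it and emit tab-delimited fields for the reducer.
    import sys
    import xml.etree.ElementTree as ET

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        try:
            record = ET.fromstring(line)
        except ET.ParseError:
            continue    # skip malformed records instead of failing the task
        fields = {c.get('id'): c.findtext('value', default='').strip()
                  for c in record.iter('column')}
        print('%s\t%s' % (fields.get('Name', ''), fields.get('age', '')))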

    Is there a reason why XML files can't or shouldn't be used directly in Hadoop?
    Any performance implications?
    If you are using Pig, there is an XML loader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html

    If you have a well-defined schema it is easier to work with big data :)

    Any suggested reading in this area?
    Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

    Our XML is something like:

    <column id="Name" security="sensitive" xsi:type="Text">
    <value>free a last</value>
    </column>
    <column id="age" security="no" xsi:type="Text">
    <value>40</value>
    </column>

    And we would, for example, want to know how many customers are above a
    certain age, or of a certain age with a certain income, etc.

    Hadoop has built-in counters; did you look into the word count example from Hadoop?
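
    For example, a rough word-count-style sketch over tab-delimited
    Name<TAB>age records (the field position, the threshold and the counter
    name are all assumptions): the mapper emits a 1 for every customer over
    the threshold and the reducer sums them; writing a
    reporter:counter:<group>,<name>,<amount> line to stderr also bumps a
    Hadoop counter from a streaming task.

    #!/usr/bin/env python
    # Hypothetical streaming mapper/reducer: count customers above an age
    # threshold, word-count style. Run as "count.py map" or "count.py reduce".
    import sys

    AGE_FIELD = 1        # zero-based position of the age column (assumed)
    AGE_THRESHOLD = 30   # "customers above a certain age" (assumed)

    def mapper():
        for line in sys.stdin:
            parts = line.rstrip('\n').split('\t')
            try:
                age = int(parts[AGE_FIELD])
            except (IndexError, ValueError):
                continue
            if age > AGE_THRESHOLD:
                # Streaming tasks can bump a Hadoop counter via stderr.
                sys.stderr.write('reporter:counter:Customers,above_threshold,1\n')
                print('above_threshold\t1')

    def reducer():
        total = 0
        for line in sys.stdin:
            parts = line.rstrip('\n').split('\t')
            if len(parts) == 2:
                total += int(parts[1])
        print('above_threshold\t%d' % total)

    if __name__ == '__main__':
        if len(sys.argv) > 1 and sys.argv[1] == 'map':
            mapper()
        else:
            reducer()

    These would be wired up as the -mapper and -reducer commands of the
    usual Hadoop streaming jar, with HDFS input and output paths.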


    Regards,
    Aleksandr

  • Mohit Anchlia at May 25, 2011 at 12:21 am
    Thanks, some more questions :)
    On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan wrote:
    Can you please give more info?
    We currently have an off-Hadoop process which uses a Java XML parser to convert it to a flat file. We have files from a couple of KB to tens of GB.
    Do you convert it into a flat file and write it to HDFS? Do you write
    all the files to the same directory in DFS, or do you group directories
    by day, for example? So, say, 2011/01/01 contains 10 files; you store
    the results of those 10 files somewhere, and then under 2011/02/02 you
    store another, say, 20 files. Now you analyze the 20 files and use the
    results from the earlier 10 files to do the aggregation. If so, how do
    you do it? Or how should I do it, since it would be overhead to process
    those files again?

    Please point me to examples so that you don't have to teach me Hadoop
    or Pig processing :)
    Do you append data from multiple XML files as lines into one file, or
    some other way? If so, how big do you let the files get?


    We currently feed our process a folder with the converted files. We don't size it in any way; we let Hadoop handle it.
    I hadn't thought about it. I was just thinking in terms of using big
    files. So when using small files, Hadoop will automatically distribute
    the files across the cluster, I am assuming based on some hashing.
    How do you create these files, assuming your XML is stored somewhere
    else in a DB or filesystem? Do you read them one by one?


    What are your experiences using text files instead of XML?
    If you are using a streaming job it is easier to build your logic if you have one flat format; you can try to parse the XML in your mapper and convert it for the reducer, but why don't you just write a small app which will convert it?
    Is there a reason why XML files can't or shouldn't be used directly in Hadoop?
    Any performance implications?
    If you are using Pig, there is an XML loader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
    Which one is better: converting files to flat files, or using XML as
    is? How do I make that decision?

    If you have a well-defined schema it is easier to work with big data :)

    Any suggested reading in this area?
    Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.
    I will download the Pig tutorial and see how that works. Are there any
    other XML-related examples you can point me to?

    Thanks a lot!
  • Aleksandr Elbakyan at May 25, 2011 at 1:24 am
    Hello,

    We currently have a complicated process which has more than 20 jobs piped to each other.
    We are using a shell script to control the flow; I saw some other company that was using Spring Batch. We use Pig, streaming and Hive.

    Note one thing: if you are using EC2 for your jobs, all local files need to be stored in /mnt. Currently our cluster is organized this way in HDFS: we process our data hourly and rotate the final result back to the beginning of the pipeline for the next run. Each process's output is the next process's input, so we keep all data for the current execution in the same dated folder (e.g. 20111212 if you run daily, 201112121416 if hourly) and add a subfolder for each subprocess under it.
    Example
    /user/{domain}/{date}/input
    /user/{domain}/{date}/process1
    /user/{domain}/{date}/process2
    /user/{domain}/{date}/process3
    /user/{domain}/{date}/process4
    Our process1 takes as input the currently converted files plus the output from the last run.
    After we start the job we load the converted files into the input location and move them out of the local space so we will not reprocess them.

    Not sure if there are examples; this all depends on the architecture of the project you are doing. I bet if you put everything you need to do on a whiteboard you will find the best folder structure for yourself :)
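
    As a rough illustration only (the paths, step names and run_step command
    below are placeholders, not our actual scripts), a dated, chained run can
    be driven with something as simple as:

    #!/usr/bin/env python
    # Hypothetical pipeline driver for the dated layout above: each step
    # writes under the same dated prefix and reads the previous step's output.
    import subprocess
    from datetime import datetime

    DOMAIN = 'example'                              # placeholder for {domain}
    DATE = datetime.now().strftime('%Y%m%d%H%M')    # e.g. 201112121416 for an hourly run
    BASE = '/user/%s/%s' % (DOMAIN, DATE)
    STEPS = ['process1', 'process2', 'process3', 'process4']

    def run_step(input_path, output_path):
        # Placeholder: a real flow would launch a streaming/Pig/Hive job here.
        subprocess.check_call(['echo', 'run job:', input_path, '->', output_path])

    if __name__ == '__main__':
        prev = BASE + '/input'
        for step in STEPS:
            out = '%s/%s' % (BASE, step)
            run_step(prev, out)   # this step's input is the previous step's output
            prev = out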

    Regards,
    Aleksandr


  • Mohit Anchlia at May 25, 2011 at 4:42 pm

    On Tue, May 24, 2011 at 6:23 PM, Aleksandr Elbakyan wrote:
    Not sure if there are examples; this all depends on the architecture of the project you are doing. I bet if you put everything you need to do on a whiteboard you will find the best folder structure for yourself :)
    Thanks for the info! Is there a way to point me to an example where
    the output of one process is then piped to another? For example, the
    results of day 1 are stored in the day1/output directory; how do I then
    feed these results into day 2's job? I am assuming these results will
    go directly to the reducer, since they have already gone through a
    map-reduce cycle. Just looking for some example to get a little more
    feel for how to do it. So far I have installed Hadoop and gone through
    the basic Hadoop word count tutorial, but I still lack knowledge of
    some important features.
