FAQ
Hi,

Will Hadoop and MapReduce allow me to parse a large quantity of Open XML files
distributed in the same filesystem, using multiple jobs?

Thx

Alexandre Jaquet


  • Alex Loddengaard at Jun 13, 2009 at 12:24 am
    When you refer to "filesystem," do you mean HDFS?

    It's very common to store lots of text files in HDFS and run multiple jobs
    to process / learn about those text files. As for XML support, you can use
    Java libraries (or Python libraries if you're using Hadoop streaming) to
    parse the XML; Hadoop itself doesn't have much XML support. I hope this
    answers your question.

    Alex
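
    As a rough illustration of that approach (a sketch, not code from this thread): a minimal Java mapper that parses each record with the JDK's built-in DOM parser (JAXP). It assumes a custom InputFormat hands the mapper one whole XML document per record (the stock TextInputFormat splits on newlines and would not guarantee that), and the class name and the "count root element names" output are invented for illustration.

        import java.io.IOException;
        import java.io.StringReader;

        import javax.xml.parsers.DocumentBuilder;
        import javax.xml.parsers.DocumentBuilderFactory;

        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.w3c.dom.Document;
        import org.xml.sax.InputSource;

        // Hypothetical sketch: each input value is assumed to hold one complete
        // XML document; Hadoop itself provides no XML parsing, so a standard
        // Java library (here, JAXP/DOM) does that work inside the mapper.
        public class XmlParseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text rootName = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    DocumentBuilder builder =
                            DocumentBuilderFactory.newInstance().newDocumentBuilder();
                    Document doc = builder.parse(
                            new InputSource(new StringReader(value.toString())));

                    // Emit the root element name as an example key; a real job
                    // would extract whatever fields it actually needs here.
                    rootName.set(doc.getDocumentElement().getTagName());
                    context.write(rootName, ONE);
                } catch (Exception e) {
                    // Count and skip malformed documents instead of failing the task.
                    context.getCounter("XmlParse", "MALFORMED").increment(1);
                }
            }
        }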
  • Alexandre Jaquet at Jun 13, 2009 at 8:42 am
    Thanks Alex,

Is parsing the documents a task done within the reducer? Do we collect the
data (the document input) within a mapper and then parse it?

    Thanks in advance

    Alexandre Jaquet

  • Alex Loddengaard at Jun 15, 2009 at 5:29 pm
    Well, you define what your job does, but I expect that nearly all MR jobs do
    their parsing in the mapper, not in the reducer. You may find these two
    videos useful:

    <http://www.cloudera.com/hadoop-training-mapreduce-hdfs>
    <http://www.cloudera.com/hadoop-training-programming-with-hadoop>

    Hope this helps!

    Alex
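
    To make that division of labour concrete (again a sketch, not anything from this thread): a driver that wires a parsing mapper (the hypothetical XmlParseMapper above) to Hadoop's stock IntSumReducer, so all XML parsing happens on the map side and the reducer only aggregates counts. The class names and the counting output are assumed for illustration.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

        // Hypothetical driver: the mapper does all the parsing, and the reducer
        // only sums the (rootName, 1) pairs the mapper emits.
        public class XmlParseJob {
            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "xml parse");
                job.setJarByClass(XmlParseJob.class);

                job.setMapperClass(XmlParseMapper.class);  // parsing happens here
                job.setReducerClass(IntSumReducer.class);  // aggregation only

                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);

                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));

                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }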
  • Alexandre Jaquet at Jun 15, 2009 at 6:46 pm
    Hi Alex,

First, thanks again for responding. I saw that Katta's search engine already
allows full-text search, using PDFBox to search and index PDF files ;) I will
study your training videos tonight to learn how to implement the job for XML :))


Discussion Overview
group: common-user
categories: hadoop
posted: Jun 12, '09 at 8:31p
active: Jun 15, '09 at 6:46p
posts: 5
users: 2
website: hadoop.apache.org...
irc: #hadoop