Hi,
I use Hadoop 0.20.3-dev on Ubuntu, in pseudo-distributed mode on a
single-node cluster. I have already run MapReduce programs for
wordcount and for building an inverted index.

I am trying to run the wordcount program on a Wikipedia dump. It is a
single XML file containing Wikipedia page data in the following form:

<page>
<title>Amr El Halwani</title>
<id>16000008</id>
<revision>
<id>368385014</id>
<timestamp>2010-06-16T13:32:28Z</timestamp>
<text xml:space="preserve">
Some multi-line text goes here.
</text>
</revision>
</page>


I want to do a wordcount of the text contained between the <text>
and </text> tags. Please let me know the correct way of doing this.

What works:
------------------
$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount wikixml wikixml-op2

Straight out of the documentation, the following also works:
---------------------------------------------------------------------------------
$HADOOP_HOME/bin/hadoop jar
contrib/streaming/hadoop-0.20.2-streaming.jar -inputreader
"StreamXmlRecordReader,begin=<text>,end=</text>" -input wiki_head
-output wiki_head_op -mapper /bin/cat -reducer /usr/bin/wc

What I am interested in doing is:
-------------------------------------------------
1. Use my Java classes in WordCount.jar (or something similar) as
mapper and reducer (and driver).
2. If possible, pass the configuration options, like the begin and end
tags of the XML, from inside my Java program itself.
3. If possible, specify my intent to use StreamXmlRecordReader from
inside the Java program itself (see the sketch after this list).
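
A minimal driver sketch for points 2 and 3, assuming the old
org.apache.hadoop.mapred API (which hadoop-streaming 0.20 is built on),
the streaming jar on the job classpath, and hypothetical
WordCountMap/WordCountReduce classes standing in for the ones in
WordCount.jar:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class XmlWordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(XmlWordCount.class);
        conf.setJobName("xml-wordcount");

        // The same settings the streaming command passes via -inputreader:
        conf.set("stream.recordreader.class",
                 "org.apache.hadoop.streaming.StreamXmlRecordReader");
        conf.set("stream.recordreader.begin", "<text>");
        conf.set("stream.recordreader.end", "</text>");
        conf.setInputFormat(StreamInputFormat.class);

        // Placeholders for the classes already in WordCount.jar. Note that
        // StreamXmlRecordReader delivers each <text>...</text> block as the
        // map input key (a Text), with an empty value.
        conf.setMapperClass(WordCountMap.class);
        conf.setReducerClass(WordCountReduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}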

Please let me know what I should read/do to solve these issues.

Bibek


  • Steve Lewis at Oct 12, 2010 at 5:11 pm
Look at the classes org.apache.hadoop.mapreduce.lib.input.LineRecordReader
and org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

What you need to do is copy those and change the LineRecordReader to
look for the <page> tag.
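
A rough skeleton of that approach, to give the idea (a sketch, not
tested: the class names are hypothetical, and only minimal
split-boundary handling is included):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits everything from <page> through </page> as one record.
public class PageRecordReader extends RecordReader<LongWritable, Text> {
    private static final byte[] BEGIN = "<page>".getBytes();
    private static final byte[] END = "</page>".getBytes();
    private FSDataInputStream in;
    private long start, end, pos;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    public void initialize(InputSplit split, TaskAttemptContext ctx)
            throws IOException {
        FileSplit fs = (FileSplit) split;
        start = fs.getStart();
        end = start + fs.getLength();
        Path file = fs.getPath();
        in = file.getFileSystem(ctx.getConfiguration()).open(file);
        in.seek(start);
        pos = start;
    }

    public boolean nextKeyValue() throws IOException {
        if (!scanTo(BEGIN, false)) return false;   // find the next <page>
        key.set(pos);                              // approximate offset
        value.set(BEGIN, 0, BEGIN.length);
        return scanTo(END, true);                  // copy through </page>
    }

    // Read forward until 'marker' is seen; optionally append what was read
    // (marker included) to 'value'. Returns false on EOF, or when still
    // searching for a record start past the end of the split.
    private boolean scanTo(byte[] marker, boolean record) throws IOException {
        int m = 0;
        byte[] one = new byte[1];
        while (true) {
            int b = in.read();
            if (b == -1) return false;
            pos++;
            if (record) { one[0] = (byte) b; value.append(one, 0, 1); }
            m = (b == marker[m]) ? m + 1 : (b == marker[0] ? 1 : 0);
            if (m == marker.length) return true;
            if (!record && m == 0 && pos > end) return false;
        }
    }

    public LongWritable getCurrentKey() { return key; }
    public Text getCurrentValue() { return value; }
    public float getProgress() {
        return end == start ? 1f
                : Math.min(1f, (pos - start) / (float) (end - start));
    }
    public void close() throws IOException { in.close(); }
}

// In its own source file: the matching input format, analogous to
// TextInputFormat.
public class XmlPageInputFormat extends FileInputFormat<LongWritable, Text> {
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext ctx) {
        return new PageRecordReader();
    }
}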


    --
    Steven M. Lewis PhD
    4221 105th Ave Ne
    Kirkland, WA 98033
    206-384-1340 (cell)
    Institute for Systems Biology
    Seattle WA
  • Paul Ingles at Oct 12, 2010 at 5:29 pm
I found that we needed to 'borrow' Mahout's XmlInputFormat to get this working correctly. I posted a small blog article on it a while back: http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

You could either add a dependency on the Mahout jars or copy the class source and compile it in your tree.
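
Roughly, the driver side looks like this (a sketch, not tested:
xmlinput.start and xmlinput.end are the configuration keys Mahout's
reader checks, and the import path below is where the class lived in
Mahout at the time; it moved in later releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat;

public class MahoutXmlWordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Record delimiters read by Mahout's XmlInputFormat. If the <text>
        // tag carries attributes (as Wikipedia dumps do), match on "<text"
        // without the closing bracket instead.
        conf.set("xmlinput.start", "<text>");
        conf.set("xmlinput.end", "</text>");

        Job job = new Job(conf, "mahout-xml-wordcount");
        job.setJarByClass(MahoutXmlWordCount.class);
        job.setInputFormatClass(XmlInputFormat.class);
        // ...set mapper, reducer, and output types as in a normal wordcount.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}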

    Hth,
    Paul

    Sent from my iPhone
  • Bibek Paudel at Oct 12, 2010 at 8:57 pm

    I have read your post. Thanks.

Could you please tell me how I can "add a dependency on the Mahout
jars"? Is it by using the "-libjars" option on the command line?

    Thanks,
    Bibek
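
For reference: Hadoop's generic -libjars option is the usual route; it
ships extra jars with the job, but it only takes effect if the driver
parses generic options via ToolRunner/GenericOptionsParser. A
hypothetical invocation (the jar path is a placeholder):

$HADOOP_HOME/bin/hadoop jar WordCount.jar WordCount -libjars /path/to/mahout-examples.jar wikixml wikixml-op

Alternatively, the class source can be copied into your own tree and
packed into WordCount.jar, as Paul suggested.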

Discussion Overview
group: common-user @ hadoop.apache.org
categories: hadoop
posted: Oct 12, '10 at 12:03p
active: Oct 12, '10 at 8:57p
posts: 4
users: 3
website: hadoop.apache.org...
irc: #hadoop
