Configuration and Hadoop cluster setup
I am trying to run Hadoop on a cluster of 3 nodes. The namenode and the
jobtracker web UIs work. I have the namenode running on node A and the
jobtracker running on node B. Is it true that the namenode and jobtracker
cannot run on the same box? Also, if I want to run the examples on the
cluster, is there anything special that needs to be done? When I run the
WordCount example on machine C (which is a tasktracker and not the
jobtracker), the LocalJobRunner is invoked all the time. I am guessing this
means the map tasks are running locally. How can I distribute this across
the cluster? Please advise.

Thanks
Avinash

  • Dennis Kubes at May 24, 2007 at 11:39 pm

    The namenode and the jobtracker can most definitely run on the same box.
    As far as I know this is the preferred configuration.

    Also if I want to run the examples on the cluster is
    there anything special that needs to be done. When I run the example
    WordCount on a machine C (which is a task tracker and not a job tracker)
    the
    LocalJobRunner is invoked all the time. I am guessing this means that the
    map tasks are running locally. How can I distribute this on the cluster ?
    Please advice.
    Are the conf files on machine C the same as on the namenode/jobtracker?
    Do they point to the namenode and jobtracker, or do they point to "local"
    in the hadoop-site.xml file? Also, we have found it easier (although not
    necessarily better) to start jobs from the namenode server.

    It would be helpful to have more information about your setup and what is
    happening, as that would help me and others on the list debug what may be
    occurring.

    Dennis Kubes
  • Phantom at May 24, 2007 at 11:51 pm
    Yes, the files are the same, and I am starting the jobs on the namenode
    server. I also figured out what my problem was with respect to not being
    able to start the namenode and jobtracker on the same machine: I had to
    reformat the file system. But all this still doesn't make the WordCount
    sample run in a distributed fashion; I can tell because the LocalJobRunner
    is being used. Do I need to specify the config file to the running instance
    of the program? If so, how do I do that?

    Thanks
    A
  • Mahadev Konar at May 25, 2007 at 12:24 am
    Hi,
    When you run the job, you need to set the environment variable
    HADOOP_CONF_DIR to the configuration directory that has the configuration
    file pointing to the right jobtracker.

    Regards
    Mahadev
  • Vishal Shah at May 25, 2007 at 7:22 am
    Hi Avinash,

    Can you share your hadoop-site.xml, mapred-default.xml and slaves files?
    Most probably, you have not set the jobtracker properly in the
    hadoop-site.xml conf file. Check the mapred.job.tracker property in
    your file. It should look something like this:

    <property>
    <name>mapred.job.tracker</name>
    <value>fully.qualified.domainname:40000</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.
    </description>
    </property>

    -vishal.

  • Phantom at May 25, 2007 at 8:37 pm
    Here is a copy of my hadoop-site.xml. What am I doing wrong?

    <configuration>
    <property>
    <name>fs.default.name</name>
    <value>dev030.sctm.com:9000</value>
    </property>

    <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop</value>
    </property>

    <property>
    <name>mapred.job.tracker</name>
    <value>dev030.sctm.com:50029</value>
    </property>

    <property>
    <name>mapred.job.tracker.info.port</name>
    <value>50030</value>
    </property>

    <property>
    <name>mapred.min.split.size</name>
    <value>65536</value>
    </property>

    <property>
    <name>dfs.replication</name>
    <value>1</value>
    </property>

    </configuration>

  • Hairong Kuang at May 25, 2007 at 9:03 pm
    Have you tried Mahadev's suggestion? You need to set HADOOP_CONF_DIR to
    the directory in which your hadoop-site.xml is located, or use
    hadoop --config <conf_dir> to submit your job.

    Hairong

  • Phantom at May 25, 2007 at 9:19 pm
    At last I managed to get this working along the lines of what I wanted. I
    had to modify the sample to set the property explicitly: I did
    jobConf.set("mapred.job.tracker", "<host:port>").

    If my map job is going to process a file, does it have to be in HDFS, and
    if so, how do I get it there? Is there any resource I can read to get a
    better understanding?

    Thanks
    Avinash
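
    A minimal sketch of a driver that does what Avinash describes, using the
    0.12-era mapred API, is below. The jobtracker and namenode addresses are the
    ones from the hadoop-site.xml quoted earlier in this thread; the mapper and
    reducer class names are assumed to be the MapClass and Reduce inner classes
    of the bundled WordCount example, and the input/output paths are
    placeholders, so treat this as an illustration rather than tested code.

    import org.apache.hadoop.examples.WordCount;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterWordCount {
      public static void main(String[] args) throws Exception {
        // Force the job onto the cluster instead of the LocalJobRunner by
        // setting the two properties on the JobConf directly.
        JobConf conf = new JobConf(ClusterWordCount.class);
        conf.setJobName("wordcount");
        conf.set("mapred.job.tracker", "dev030.sctm.com:50029"); // jobtracker from this thread
        conf.set("fs.default.name", "dev030.sctm.com:9000");     // namenode from this thread

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCount.MapClass.class); // example's mapper (name assumed)
        conf.setReducerClass(WordCount.Reduce.class);  // example's reducer (name assumed)

        conf.setInputPath(new Path(args[0]));   // input directory in the default filesystem
        conf.setOutputPath(new Path(args[1]));  // output directory in the default filesystem

        JobClient.runJob(conf);
      }
    }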
  • Doug Cutting at May 25, 2007 at 9:32 pm

    Phantom wrote:
    If my Map job is going to process a file does it have to be in HDFS
    No, but they usually are. Job inputs are resolved relative to the
    default filesystem. So, if you've configured the default filesystem to
    be HDFS, and you pass a filename that's not qualified by a filesystem as
    the input to your job, then your input should be in HDFS.

    But inputs don't have to be in the default filesystem nor must they be
    in HDFS. They need to be in a filesystem that's available to all nodes.
    They could be in NFS, S3, or Ceph instead of HDFS. They could even be
    in a non-default HDFS system.
    and if so how do I get it there?
    If HDFS is configured as your default filesystem:

    bin/hadoop fs -put localFileName nameInHdfs

    Doug
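
    The same copy can also be done from Java through the FileSystem API. A
    rough sketch, assuming the 0.12-era FileSystem methods, with made-up local
    and HDFS paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutFile {
      public static void main(String[] args) throws Exception {
        // Reads hadoop-default.xml and hadoop-site.xml from the conf dir on the classpath.
        Configuration conf = new Configuration();
        // The default filesystem -- HDFS, if fs.default.name points at the namenode.
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of "bin/hadoop fs -put localFileName nameInHdfs".
        fs.copyFromLocalFile(new Path("/home/user/test2.dat"),   // hypothetical local path
                             new Path("/user/input/test2.dat")); // hypothetical HDFS path
      }
    }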
  • Avinash Lakshman at May 25, 2007 at 9:47 pm
    I am trying to run the WordCount sample against a file on my local file
    system. So I kick start my program as
    "java -D/home/alakshman/hadoop-0.12.3/conf
    org.apache.hadoop.examples.WordCount -m 10 -r 4 ~/test2.dat /tmp/out-dir".
    When I run this, I get the following in the jobtracker log file (what should
    I be doing to fix this):

    2007-05-25 14:41:32,733 INFO org.apache.hadoop.mapred.TaskInProgress: Error
    from task_0001_m_000000_3: java.lang.IllegalArgumentException: Wrong FS:
    file:/home/alakshman/test2.dat, expected: hdfs://dev030.sctm.facebook.com:9000
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:216)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.getPath(DistributedFileSystem.java:110)
        at org.apache.hadoop.dfs.DistributedFileSystem$RawDistributedFileSystem.exists(DistributedFileSystem.java:170)
        at org.apache.hadoop.fs.FilterFileSystem.exists(FilterFileSystem.java:168)
        at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:331)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:245)
        at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:54)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:139)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)


  • Koji Noguchi at May 25, 2007 at 11:04 pm
    Doug,

    I may be wrong, but last time I tried (on 0.12.3), MapRed didn't work with
    a non-default filesystem as input (output worked fine).

    https://issues.apache.org/jira/browse/HADOOP-71
    https://issues.apache.org/jira/browse/HADOOP-1107

    Mine failed with "org.apache.hadoop.mapred.InvalidInputException: Input
    path does not exist".
    It basically checked the default file system instead of the one passed in.

    Koji


  • Phantom at May 28, 2007 at 8:13 pm
    Is there a workaround? I want to run the WordCount sample against a file on
    my local filesystem. If this is not possible, do I need to put my file into
    HDFS and then point my program at that location?

    Thanks
    Avinash
  • Doug Cutting at May 29, 2007 at 5:56 pm

    Is your local filesystem accessible to all nodes in your system?

    Doug
  • Phantom at May 29, 2007 at 6:01 pm
    Yes it is.

    Thanks
    A

    On 5/29/07, Doug Cutting wrote:

    Phantom wrote:
    Is there a workaround ? I want to run the WordCount sample against a
    file on
    my local filesystem. If this is not possible do I need to put my file into
    HDFS and then point that location to my program ?
    Is your local filesystem accessible to all nodes in your system?

    Doug
  • Phantom at May 29, 2007 at 6:53 pm
    Either I am totally confused or this configuration stuff is confusing the
    hell out of me; I am pretty sure it is the former. I am looking for advice
    here on how I should do this. I have my fs.default.name set to
    hdfs://<host>:<port>. In my JobConf setup I set the same value for
    fs.default.name. Now I have two options, and I would appreciate it if some
    expert could tell me which option I should take and why.

    (1) Set my fs.default.name to hdfs://<host>:<port> and also specify it in
    the JobConf configuration. Copy my sample input file into HDFS using
    "bin/hadoop fs -put" from my local file system. I then need to specify this
    file to my WordCount sample as input. Should I specify this file with the
    hdfs:// prefix?

    (2) Set my fs.default.name to file://<host>:<port> and also specify it in
    the JobConf configuration. Just specify the input path to the WordCount
    sample, and everything should work if the path is available to all machines
    in the cluster?

    Which way should I go?

    Thanks
    Avinash
  • Mahadev Konar at May 29, 2007 at 8:15 pm
    Hi Avinash,
    The way MapReduce works in a distributed environment is:
    1) Set up the cluster in distributed fashion as described in the wiki:
    http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop
    2) Run MapReduce jobs with the command:
    bin/hadoop jar job.jar
    Before doing this you need to set the HADOOP_CONF_DIR env variable to point
    to the conf directory that contains the distributed configuration.

    The input files need to be uploaded to HDFS first, and then in your JobConf
    you need to set job.setInputPath(tempDir), where tempDir is the input
    directory for the MapReduce job, i.e. the directory where you uploaded the
    files. You can take a look at the examples in the Hadoop examples directory
    for this.
    Hope this helps.

    Regards
    Mahadev
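
    Inside a driver like the sketch earlier in this thread, the input-path step
    Mahadev describes is just a couple of lines on the JobConf (a fragment, not
    a complete program; conf is the JobConf, and the HDFS directories here are
    hypothetical):

    Path tempDir = new Path("/user/avinash/input");    // HDFS dir the input files were uploaded to
    conf.setInputPath(tempDir);                        // read the job's input from that directory
    conf.setOutputPath(new Path("/user/avinash/out")); // HDFS dir for the job's output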
  • Doug Cutting at May 29, 2007 at 8:30 pm

    Phantom wrote:
    (1) Set my fs.default.name to hdfs://<host>:<port> and also specify it in
    the JobConf configuration. Copy my sample input file into HDFS using
    "bin/hadoop fs -put" from my local file system. I then need to specify this
    file to my WordCount sample as input. Should I specify this file with the
    hdfs:// prefix?

    (2) Set my fs.default.name to file://<host>:<port> and also specify it in
    the JobConf configuration. Just specify the input path to the WordCount
    sample, and everything should work if the path is available to all machines
    in the cluster?

    Which way should I go?
    Either should work. So should a third option, which is to have your job
    input in the non-default filesystem, but there's currently a bug that
    prevents that from working. But the above two should work. The second
    assumes that the input is available on the same path in the native
    filesystem on all nodes.

    When naming files in the default filesystem you do not need to specify
    their filesystem, since it is the default, but it is not an error to
    specify it.

    The most common mode of distributed operation is (1): use an HDFS
    filesystem as your fs.default.name, copy your initial input into that
    filesystem with 'bin/hadoop fs -put localPath hdfsPath', then specify
    'hdfsPath' as your job's input. The "hdfs://host:port" is not required
    at this point, since it is the default.

    Doug
  • Avinash Lakshman at May 29, 2007 at 8:36 pm
    I did run it the way you suggested, but I am running into a slew of
    ClassNotFoundExceptions for the MapClass. Exporting the CLASSPATH doesn't
    seem to fix it. How do I get around it?

    Thanks
    Avinash
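
    The usual cause of this is that the job jar never reaches the tasktrackers;
    exporting CLASSPATH only affects the client JVM. Mahadev's earlier
    suggestion of submitting with "bin/hadoop jar job.jar" addresses it, and the
    jar can also be named on the JobConf so it is shipped with the job. A rough
    sketch, assuming the 0.12-era JobConf API and a hypothetical jar path:

    // Either let JobConf locate the jar that contains the job's classes...
    JobConf conf = new JobConf(WordCount.class);
    // ...or point at the job jar explicitly so it is distributed to the tasktrackers.
    conf.setJar("/home/alakshman/wordcount-job.jar"); // hypothetical path to the job jar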

  • Doug Cutting at May 29, 2007 at 5:54 pm

    Koji Noguchi wrote:
    I may be wrong, but last time I tried (on 0.12.3), MapRed didn't work
    for non-default filesystem as an input.
    (output worked fine.)

    https://issues.apache.org/jira/browse/HADOOP-71
    https://issues.apache.org/jira/browse/HADOOP-1107
    You're probably right. That is a bug. It's partly fixed by:

    https://issues.apache.org/jira/browse/HADOOP-1226

    This causes all paths from DFS to be fully qualified, fixing
    HADOOP-1107, I think. The SequenceFile bug may still be outstanding.
    We should try to fix that too for 0.14.

    Doug
  • Yu-yang chen at May 25, 2007 at 11:00 pm
    I think you have not included your nodes A, B, and C in your slaves file;
    that may be why.
    Your hadoop-site.xml seems OK to me.

    yu-yang
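
    For reference, the slaves file (in the conf directory) is just a list of the
    hostnames that should run datanodes and tasktrackers, one per line. A sketch
    with made-up hostnames for nodes A, B, and C:

    nodeA.sctm.com
    nodeB.sctm.com
    nodeC.sctm.com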

  • Phantom at May 25, 2007 at 7:24 pm
    I tried this. Before running the WordCount sample I did an export
    HADOOP_CONF_DIR=<my conf dir>. It doesn't seem to help; I still see the
    LocalJobRunner being used.

    Thanks
    Avinash
  • Dennis Kubes at May 25, 2007 at 11:32 pm
    I don't know if this will make a difference or not:

    <property>
    <name>fs.default.name</name>
    <value> dev030.sctm.com:9000</value>
    </property>

    <property>
    <name>mapred.job.tracker</name>
    <value> dev030.sctm.com:50029 </value>
    </property>

    Your fs.default.name and mapred.job.tracker variables both seem to have
    spaces (or an unprintable character) in front of the values. Can you
    try removing these and seeing if the WordCount works correctly?

    Dennis Kubes

