FAQ
I want to stream data from logs into HDFS in production, but I do NOT
want my production machine to be part of the computation cluster. The
reason I want to do it this way is to take advantage of HDFS without
putting computation load on my production machine. Is this possible?
Furthermore, is this unnecessary because the computation would not put a
significant load on my production box (obviously it depends on the
map/reduce implementation, but I'm asking in general)?

I should note that our prod machine hosts our core web application and
database (saving up for another box :-).

Thanks,
Shahab

  • Edward Capriolo at Oct 31, 2008 at 6:52 pm
    Shahab,

    This can be done.
    If your client speaks Java, you can connect to Hadoop and write as a stream.

    If your client does not have Java, the Thrift API will generate stubs
    in a variety of languages.

    Thrift API: http://wiki.apache.org/hadoop/HDFS-APIs

    Shameless plug -- if you just want to stream data, I created a simple
    socket server:
    http://www.jointhegrid.com/jtgweb/lhadoopserver/index.jsp

    So you do not have to be part of the cluster to write to it.
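
    For example, here is a rough, untested sketch of writing straight to HDFS
    with the plain FileSystem API; the NameNode host/port and the file paths
    below are just placeholders:

    // Sketch only: stream local log lines into HDFS from a box that is not
    // part of the cluster. Host, port, and paths are made-up examples.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class HdfsLogWriter {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode; no DataNode or
        // TaskTracker has to run on this machine.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

        FileSystem fs = FileSystem.get(conf);
        BufferedReader in = new BufferedReader(
            new FileReader("/var/log/webapp/access.log"));
        FSDataOutputStream out = fs.create(new Path("/logs/webapp/access.log"));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            out.writeBytes(line + "\n");
          }
        } finally {
          out.close();
          in.close();
          fs.close();
        }
      }
    }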
  • Shahab mehmandoust at Oct 31, 2008 at 6:58 pm
    Definitely speaking Java... Do you think I'm being paranoid about the
    possible load?

    Shahab
  • Norbert Burger at Oct 31, 2008 at 8:47 pm
    What are you using to "stream logs into the HDFS"?

    If the command-line tools (i.e., "hadoop dfs -put") work for you, then all
    you need is a Hadoop install. Your production node doesn't need to be a
    DataNode.
  • Jerome Boulon at Oct 31, 2008 at 8:58 pm
    Hi,
    We have deployed a new monitoring system, Chukwa
    (http://wiki.apache.org/hadoop/Chukwa), that does exactly that. This
    system also provides an easy way to post-process your log files and
    extract useful information using M/R.

    /Jerome.

  • Shahab mehmandoust at Oct 31, 2008 at 10:04 pm
    Currently I'm just researching, so I'm playing with the idea of streaming
    log data into HDFS.

    I'm confused about: "...all you need is a Hadoop install. Your production
    node doesn't need to be a DataNode." If my production node is *not* a
    DataNode, then how can I do "hadoop dfs -put"?

    I was under the impression that when I install HDFS on a cluster, each
    node in the cluster is a DataNode.

    Shahab
  • Norbert Burger at Nov 3, 2008 at 2:34 am
    You don't need to run the TaskTracker/DataNode JVMs in order to access your
    HDFS. All you need is a Hadoop installation with conf/hadoop-site.xml
    pointing to your cluster. In other words, install Hadoop locally, copy
    conf/hadoop-site.xml from one of your datanodes, and then you'll be able to
    run "hadoop dfs put" from outside your cluster.

Discussion Overview
group: common-user
categories: hadoop
posted: Oct 31, '08 at 6:41p
active: Nov 3, '08 at 2:34a
posts: 7
users: 4
website: hadoop.apache.org...
irc: #hadoop
