Hi all,

I understand Hadoop and the Hadoop ecosystem (Pig for ETL, Hive as a data warehouse, Sqoop as an import tool). I have worked and learned on a single-node cluster with demo data.

Hadoop suits Unix platforms best. Please help me understand the requirements, from start to finish, for using Hadoop in production.

What is needed to use Hadoop on a real-time project?

For example, Hadoop automation on Unix and alerts when a process fails.

Please shed some light on using Hadoop in real time and on which objectives are recommended.


Thanks & Regards
Yogesh Kumar


  • Ruslan Al-Fakikh at Oct 2, 2012 at 11:48 am
    Hi,

    There are too many issues to discuss, I guess. I would recommend
    reading Hadoop: The Definitive Guide by Tom White. Several chapters
    address these questions.
    Also, what did you mean by 'real time'? Hadoop is not designed to
    return query results in real time. It is meant for offline data
    analysis, because each query can take minutes or hours to finish.
    AFAIK, HBase provides some real-time functionality though.
    For Hadoop automation, you can try Oozie. We use Opswise at our company.
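
The "alert on failure" part of the question can be sketched without any scheduler at all. The following is a minimal sketch, not how Oozie or Opswise work: it assumes the job is launched as an ordinary Unix command whose exit code signals success or failure. The names `run_with_alert` and `alert` are hypothetical, and the demo command is a stand-in for a real job command line.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_with_alert(command: Sequence[str], alert: Callable[[str], None]) -> bool:
    """Run a command; call alert(message) if it cannot start or exits non-zero."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:  # e.g. the executable was not found
        alert(f"could not start {command[0]}: {exc}")
        return False
    if result.returncode != 0:
        # Include the last few stderr lines so the alert is actionable.
        tail = result.stderr.strip().splitlines()[-5:]
        alert(f"{command[0]} exited with code {result.returncode}: " + " | ".join(tail))
        return False
    return True


if __name__ == "__main__":
    # Stand-in job so the sketch runs anywhere; swap in the real job command line.
    demo = [sys.executable, "-c", "import sys; sys.exit(3)"]
    run_with_alert(demo, alert=lambda msg: print("ALERT:", msg))
```

In practice the `alert` callable would send mail or page someone, and a scheduler such as cron or a workflow tool would decide when the wrapper runs.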

    Best Regards
  • Michael Segel at Oct 2, 2012 at 12:04 pm
    Funny that the OP asks about 'real time'...

    This comes up quite often and it's always misunderstood.

    First, when we say 'real time', many take it to mean subjective real time. Real 'real time' would require some sort of RTOS underneath.

    Second, Hadoop is a parallelized framework made up of several components: a distributed scheduler, a distributed disk, and tools to manipulate the data.

    You can use Hadoop in subjective real time scenarios.

    One common pattern is to use M/R to process the data and HBase to deliver ad-hoc access to individual records, returning a result in sub-second response time.
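
The split described above (slow batch processing, fast keyed reads) can be illustrated in a few lines. This is only a sketch under loud assumptions: plain Python functions stand in for the map and reduce phases, an in-memory dict stands in for the HBase table, and the log format and the names `map_phase`, `reduce_phase`, `serving_table`, and `lookup` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (user, bytes) pairs from 'user bytes' log lines."""
    for line in lines:
        user, size = line.split()
        yield user, int(size)


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the values per key, as a reducer would."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Batch side: runs over the whole data set; may take minutes or hours on real data.
log_lines = ["alice 120", "bob 40", "alice 30"]
serving_table = reduce_phase(map_phase(log_lines))  # dict stands in for an HBase table


def lookup(user: str) -> int:
    """Serving side: one keyed read, no scan of the raw data."""
    return serving_table.get(user, 0)
```

The point of the design is that the expensive work happens ahead of time, so the read path only ever touches one precomputed row.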

    I think that there's an upcoming talk at Strata in NY on using Hadoop (HBase and Solr) to provide real-time access.

    Outside of that, yeah, Tom White's book is a great start; however, some of the feedback I've heard is that it's a dry read.
    But then again, most technical books are. :-)

  • Hank Cohen at Oct 2, 2012 at 6:06 pm
    There is an important difference between real time and real fast.

    Real time means that system response must meet a fixed schedule.
    Real fast just means sooner is better.

    Real-time systems always have hard schedules. The schedule could be in microseconds to control a laser for making masks for semiconductor manufacturing, milliseconds to control the ignition in your car or flight controls on an F-22, or seconds for even slower-moving processes. In a real-time system, missing the schedule can mean that very bad things happen: planes fall from the sky, your laser printer fries its imaging drum, factories explode, etc.

    Most transaction processing is happy with real fast.
    The folks doing high velocity trading are pretty close to real time but they probably will be happy with real fast.
    If real fast systems miss a schedule then someone loses money.
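
The distinction can be made concrete with a handful of response times. The sketch below is illustrative only: the latency samples and the 10 ms deadline are made-up numbers, and `mean_latency` and `missed_deadlines` are hypothetical helper names. A "real fast" view looks at the average; a "real time" view looks at whether any single response was late.

```python
from statistics import mean
from typing import List, Sequence


def mean_latency(latencies_ms: Sequence[float]) -> float:
    """'Real fast' view: sooner is better, so report the average."""
    return mean(latencies_ms)


def missed_deadlines(latencies_ms: Sequence[float], deadline_ms: float) -> List[int]:
    """'Real time' view: list the index of every response that was late."""
    return [i for i, latency in enumerate(latencies_ms) if latency > deadline_ms]


# Made-up samples: fast on average, but one response blows the deadline.
samples_ms = [2.0, 3.0, 2.0, 25.0, 3.0]
deadline_ms = 10.0

average = mean_latency(samples_ms)               # below the deadline on average
late = missed_deadlines(samples_ms, deadline_ms)  # yet one response is late
```

Here the average sits under the deadline, which would satisfy a "real fast" requirement, while the single late response would count as a failure under a hard real-time one.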

    The reason that RTOS-type operating systems are popular for real-time applications is that they don't allow operations to spend indeterminate amounts of time in uninterruptible states. Java will never qualify as a real-time system because it has garbage collection, and garbage collection can lock up a system for an indefinite amount of time while it goes through marking and counting.

    Hank Cohen
  • Ted Dunning at Oct 2, 2012 at 11:13 pm

    On Tue, Oct 2, 2012 at 7:05 PM, Hank Cohen wrote:

    > There is an important difference between real time and real fast
    >
    > Real time means that system response must meet a fixed schedule.
    > Real fast just means sooner is better.

    Good thought, but real-time can also include a fixed schedule and a
    specified list of exceptional conditions which would prevent meeting the
    schedule.

    It may also include a fixed schedule that must be met some fraction of the
    time (usually very near 100% of the time).

    Without providing exceptions, you basically force the designer to lie about
    how reliable their system is.
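
A requirement of this shape (a deadline, a fraction of the time it must be met, and a list of named exceptions) can be written down as data and checked. This is a minimal sketch, assuming each measurement is tagged with the condition it was taken under; the class names, field names, and the `"power_loss"` condition are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    condition: Optional[str] = None  # None means normal operation


@dataclass(frozen=True)
class Requirement:
    deadline_ms: float
    required_fraction: float  # 1.0 means "always"
    exceptions: FrozenSet[str] = frozenset()  # conditions excluded from the guarantee

    def is_met(self, samples: Sequence[Sample]) -> bool:
        # Only samples taken outside the listed exceptional conditions count.
        in_scope = [s for s in samples if s.condition not in self.exceptions]
        if not in_scope:
            return True  # nothing was measured under guaranteed conditions
        on_time = sum(1 for s in in_scope if s.latency_ms <= self.deadline_ms)
        return on_time / len(in_scope) >= self.required_fraction


samples = [Sample(5.0), Sample(6.0), Sample(7.0), Sample(50.0, "power_loss")]

always = Requirement(deadline_ms=10.0, required_fraction=1.0)
with_exception = Requirement(10.0, 1.0, frozenset({"power_loss"}))
mostly = Requirement(deadline_ms=10.0, required_fraction=0.75)
```

With these made-up samples, the unconditional "always" requirement fails, while the same deadline with `power_loss` listed as an exception, or with a 75% target, is met. Writing the exceptions down is what lets the designer state the guarantee honestly.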

    > Real time systems always have hard schedules. The schedule could be in
    > microseconds to control a laser for making masks for semiconductor
    > manufacturing, milliseconds to control the ignition in your car or flight
    > controls on an F-22 or seconds for even slower moving processes. In a real
    > time system missing the schedule can mean that very bad things happen:
    > planes fall from the sky, your laser printer fries its imaging drum,
    > factories explode etc.

    It can mean that. But if you specify the exceptional situations, you can
    specifically mitigate them.

    > Most transaction processing is happy with real fast
    > The folks doing high velocity trading are pretty close to real time but
    > they probably will be happy with real fast.
    > If real fast systems miss a schedule then someone loses money.

    Yeah. And if you talk to these guys, they know the difference and ask for
    real-time.


    > The reason that RTOS type operating systems are popular for real time
    > applications is that they don't allow operations to spend indeterminate
    > amounts of time in uninterruptable states. Java will never qualify as a
    > real time system because it has garbage collection and garbage collection
    > can lock up a system for an indefinite amount of time while it goes through
    > marking and counting.

    You are behind the times on a few counts.

    - Java's collectors don't "count".

    - Java can be real-time:

    http://rtjava.blogspot.co.uk/2009/07/real-time-java-vms.html

    - Garbage collection can be deterministic and real-time:

    http://www.ibm.com/developerworks/java/library/j-rtj4/index.html
    http://docs.oracle.com/cd/E13221_01/wlrt/docs30/intro_wlrt/tuning.html
  • Gauthier, Alexander at Oct 3, 2012 at 12:31 am
    Owned.

    From: Ted Dunning
    Sent: Tuesday, October 02, 2012 4:13 PM
    To: user@hadoop.apache.org
    Subject: Re: HADOOP in Production


  • Hank Cohen at Oct 3, 2012 at 3:02 am
    Good points.

    I'm not trying to be exhaustive in the discussion of real time systems.
    My only intent was to point out the difference between real time and fast response.
    There are lots of real time requirements that do not require a particularly fast response but the response needs to be on time. Timing errors can introduce noise at a pretty astounding rate.
    Real fast often uses the same technology as real time but the consequences of missing a deadline are different.

    As for the high velocity traders, I'm an advocate of the Robin Hood Tax.
    A little friction might help the world.

    Hank Cohen


Discussion Overview
group: hdfs-user @
categories: hadoop
posted: Oct 1, '12 at 1:37p
active: Oct 3, '12 at 3:02a
posts: 7
users: 6
website: hadoop.apache.org...
irc: #hadoop
