Hi,

I have a setup where logs are periodically bundled up and dumped into
Hadoop DFS as large sequence files.

It works fine for all my MapReduce jobs.

Now I need to handle ad hoc queries for pulling out logs based on user
and time range.

I don't really need a full indexer (like Lucene) for this purpose.

My first thought is to run a periodic MapReduce job to generate a large
text file sorted by user id.

The text file will have (sequence file name, offset) entries to retrieve the logs.
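The index described above could be sketched like this; a self-contained simulation in which the file names, timestamps, and record layout are all illustrative, not taken from the actual setup:

```python
import bisect

# Simulated index entries: (user_id, timestamp, seqfile_name, byte_offset),
# kept sorted by (user_id, timestamp), the order the periodic
# MapReduce job would emit them in.
index = sorted([
    ("alice", 1254380400, "logs-00001.seq", 0),
    ("alice", 1254384000, "logs-00001.seq", 4096),
    ("bob",   1254380500, "logs-00002.seq", 0),
    ("alice", 1254390000, "logs-00003.seq", 8192),
])

def lookup(user, t_start, t_end):
    """Return (seqfile, offset) pairs for one user within a time range."""
    lo = bisect.bisect_left(index, (user, t_start, "", -1))
    hi = bisect.bisect_right(index, (user, t_end, "\xff", 1 << 62))
    return [(name, off) for (_, _, name, off) in index[lo:hi]]

print(lookup("alice", 1254380400, 1254385000))
```

With the index sorted, each ad hoc query is two binary searches plus a scan of the matching slice, so no full pass over the log data is needed at query time.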


I am guessing many of you have run into similar requirements... Any
suggestions on doing this better?

ishwar


  • Habermaas, William at Oct 1, 2009 at 6:06 pm
    I had a similar requirement and wrote my reducer output to HBase. I used
    HBase versions to segregate the data by timestamp and formed the HBase
    keys to satisfy my retrieval requirements.

    Bill
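    A minimal sketch of the kind of key design Bill describes, with a plain sorted dict standing in for an HBase table (the composite-key layout is an assumption for illustration, not his actual code):

    ```python
    def make_key(user_id, ts):
        # Zero-padded timestamp keeps lexicographic order equal to numeric order.
        return f"{user_id}#{ts:010d}"

    # Rows kept sorted by key, like an HBase table.
    table = {}
    for user, ts, loc in [("alice", 100, "f1:0"), ("alice", 200, "f1:4096"), ("bob", 150, "f2:0")]:
        table[make_key(user, ts)] = loc

    def scan(user_id, t_start, t_end):
        # A user + time-range query becomes one contiguous key-range scan.
        start, stop = make_key(user_id, t_start), make_key(user_id, t_end)
        return [v for k, v in sorted(table.items()) if start <= k <= stop]

    print(scan("alice", 100, 300))
    ```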

    -----Original Message-----
    From: ishwar ramani
    Sent: Thursday, October 01, 2009 1:48 PM
    To: common-user
    Subject: indexing log files for adhoc queries - suggestions?

  • Mayuran Yogarajah at Oct 1, 2009 at 6:53 pm

    Have you looked into Hive? It's perfect for ad hoc queries.

    M
  • Amandeep Khurana at Oct 2, 2009 at 10:29 pm
    Hive is a SQL-like abstraction over MapReduce. It lets you execute
    SQL-like queries over your data without actually having to write
    the MR job yourself; behind the scenes it converts each query into a job.

    HBase might be what you are looking for. You can put your logs into
    HBase and query them directly, as well as run MR jobs over them...

    --


    Amandeep Khurana
    Computer Science Graduate Student
    University of California, Santa Cruz
  • Otis Gospodnetic at Oct 3, 2009 at 4:54 am
    Use Pig or Hive. There is lots of overlap and some differences, but it looks like both projects' future plans mean even more overlap, though I haven't heard any mention of convergence or merging.

    Otis
    --
    Sematext is hiring -- http://sematext.com/about/jobs.html?mls
    Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


  • Amandeep Khurana at Oct 3, 2009 at 5:19 am
    There's another option: Cascading.

    With Pig and Cascading you can use HBase as a backend, so that might
    be something you can explore too... The choice will depend on what
    kind of querying you want to do: real-time or batch-processed.
  • Otis Gospodnetic at Oct 3, 2009 at 5:23 am
    My understanding is that *no* tools built on top of MapReduce (Hive, Pig, Cascading, CloudBase...) can be real-time, where real-time means processing the data and producing output in under 5 seconds or so.

    I believe Hive can read HBase now, too.

    Otis


  • Amandeep Khurana at Oct 3, 2009 at 8:07 am
    HBase is built on HDFS, but you don't need MapReduce just to read
    records from it, so it's possible to access it in real time. The 0.20
    release is comparable to MySQL as far as random reads go...

    I haven't heard of Hive talking to HBase yet, but that would be a good
    feature to have for sure.
  • Omer Trajman at Oct 3, 2009 at 12:43 pm
    You might consider loading the logs into a parallel database for the ad hoc queries (full disclosure: I work for a database company).

    For repeated ad hoc queries, a distributed database will give you the scalability of HDFS and also structure the data to handle fast predicates and relational aggregates.

    -Omer


  • Ishwar ramani at Oct 5, 2009 at 9:32 pm
    Thanks for all the suggestions.

    From my understanding so far, as a primitive first cut I can use HBase
    for indexing: client id <version as timestamp> -> log location.

    If I have future requirements for more complex queries, I can extend
    Hive or Pig over HBase.

    ishwar
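    The first cut outlined here, one row per client id with timestamped versions pointing at log locations, could be simulated like this (all names are illustrative):

    ```python
    from collections import defaultdict

    # One row per client; each "cell version" is a (timestamp, log_location)
    # pair, mimicking HBase's timestamped versions.
    store = defaultdict(list)

    def put(client_id, ts, location):
        store[client_id].append((ts, location))

    def get_range(client_id, t_start, t_end):
        # Return log locations for one client within a time range.
        return [loc for ts, loc in sorted(store[client_id]) if t_start <= ts <= t_end]

    put("client42", 1000, "logs-a.seq:0")
    put("client42", 2000, "logs-b.seq:512")
    print(get_range("client42", 900, 1500))
    ```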


Discussion Overview
group: common-user
categories: hadoop
posted: Oct 1, '09 at 5:49p
active: Oct 5, '09 at 9:32p
posts: 10
users: 6
website: hadoop.apache.org...
irc: #hadoop
