Hi,

I am currently evaluating whether Hadoop might be an alternative to our current system. We provide a web analytics solution for very large websites and run every analysis on all collected data; we do not aggregate the data. As a result, each query processes a very large amount of data. We currently use an in-memory database by Exasol with a great deal of RAM, so that results are delivered within a few seconds, or within a minute for more complicated queries.

The solution, however, is quite expensive, and given the growth of data I'd like to explore alternatives. I have read about NoSQL datastores and about Hadoop, but I am not sure whether they are actually an option for our web analytics solution. We collect data via a tracking pixel, which sends data to a tracking server, which writes it to disk once a visitor's session is done. Our current solution has a large number of tables, and the queries run against the data can be quite complex:

How many users who came via a given keyword and were from a given city actually bought the advertised product? Of these users, what other pages did they look at? Etc.

Would this be a good case for HBase, Hadoop, Map/Reduce and perhaps Mahout?

Thanks for any thoughts,
Benjamin

_______________________________________
Benjamin Dageroth, Business Development Manager
Webtrekk GmbH
Boxhagener Str. 76-78, 10245 Berlin
fon 030 - 755 415 - 360
fax 030 - 755 415 - 100
benjamin.dageroth@webtrekk.com
http://www.webtrekk.com<http://www.webtrekk.de/>
Amtsgericht Berlin, HRB 93435 B
Geschäftsführer Christian Sauer


_______________________________________


  • John Martyniak at Nov 3, 2009 at 2:10 pm
    Benjamin,

    That is pretty much the exact use case for Hadoop.

    Hadoop is built for handling very large datasets and delivering
    processed results. HBase is built for ad hoc data, so instead of
    complicated table joins you have very wide rows (many columns)
    holding aggregate data, and you use HBase to return results from
    those rows.

    We currently use Hadoop/HBase to collect and process lots of data,
    then take the results of the processing to populate a Solr index
    and a MySQL database, which is then used to feed the front ends. It
    seems to work pretty well, in that it greatly reduces the number of
    rows and the size of the queries in the DB/index.

    We are exploring using HBase to feed the front ends in place of the
    MySQL DBs. So far the jury is out on the performance, but it does
    look promising.

    -John
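
The wide-row idea described above can be sketched in plain Python, with a dict standing in for an HBase table. This is only an illustration under stated assumptions: the sample records and the column names (`visit:keyword`, `page:0`, ...) are hypothetical, and no real HBase client API is used.

```python
from collections import defaultdict

# Hypothetical normalized records, as they might sit in separate RDBMS tables.
visits = [
    {"session_id": "s1", "keyword": "shoes", "city": "Berlin"},
    {"session_id": "s2", "keyword": "hats", "city": "Hamburg"},
]
pageviews = [
    {"session_id": "s1", "page": "/home"},
    {"session_id": "s1", "page": "/checkout"},
    {"session_id": "s2", "page": "/home"},
]


def build_wide_rows(visits, pageviews):
    """Fold both inputs into one wide row per session, keyed by session id.

    Column names follow a "family:qualifier" pattern; they are assumptions.
    """
    rows = defaultdict(dict)
    for v in visits:
        row = rows[v["session_id"]]
        row["visit:keyword"] = v["keyword"]
        row["visit:city"] = v["city"]
    for pv in pageviews:
        row = rows[pv["session_id"]]
        # Number the page columns in arrival order: page:0, page:1, ...
        index = sum(1 for col in row if col.startswith("page:"))
        row[f"page:{index}"] = pv["page"]
    return dict(rows)


wide_rows = build_wide_rows(visits, pageviews)
# One key lookup now returns everything known about session "s1",
# so no join is needed at read time.
print(wide_rows["s1"])
```

The trade-off is that the join work moves to load time: the rows must be rebuilt or updated whenever new pageviews arrive.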


  • Benjamin Dageroth at Nov 3, 2009 at 3:16 pm
    Hi John,

    Thanks a lot for the fast answer. I was unsure because we would like to avoid aggregating the data, so that our users can come up with all kinds of filters and conditions for their queries and always drill down to single users of their website. I am not sure how this works when SQL is not directly available. We currently use complex SQL queries for this; would these have to be rewritten as Map/Reduce tasks that produce the final result?

    Or how would one go about actually replacing an RDBMS?

    Thanks a lot,
    Benjamin



  • Ricky Ho at Nov 3, 2009 at 7:02 pm
    I think you should break down your question into three parts:

    1) How do I collect the raw data?
    2) How do I process the raw data collected in (1) to produce useful information?
    3) How do I store the information for large-scale consumption?


    In (1), it looks like you are collecting raw data using a tracking pixel, but where do you store this raw data? The best way to store it may depend on your answer to (2).

    In (2), which algorithms are you choosing to process the raw data? Are they parallelizable? Hadoop is certainly a candidate you should consider. But keep in mind that Hadoop is a batch-oriented processing framework, which means your processing will be chopped into chunks. Double-check that your application is OK with this processing model.

    In (3), what is the size of the output information, and how will it be consumed? Most likely your analytics application has a high tolerance for data integrity issues, which is why you can use a more scalable non-SQL DB that sacrifices some data integrity but provides a lot of scalability. Note that the decision in (3) is completely independent of (2).

    Rgds,
    Ricky
    http://horicky.blogspot.com

  • John Martyniak at Nov 3, 2009 at 7:14 pm
    Benjamin,

    Well, instead of SQL you have code that you can use to manipulate the
    data. If possible, I would look for a way to pre-process as much of
    the data as you can before putting it into HBase, and then use
    additional Map/Reduce jobs to provide any further customizations.

    I don't think that you can "replace" the RDBMS without re-envisioning
    the data, meaning that you will need to re-model it so that it fits
    the HBase architecture, which means no relationships.

    By the way, most of this can be done; it just requires some work and a
    rethinking of the way that you do things, both for Map/Reduce and HBase.

    -John
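
As a rough illustration of "code instead of SQL", the example question from the original post (users who arrived via a keyword, came from a city, and bought the product, plus the other pages they viewed) could be expressed as a map step and a reduce step. The sketch below runs in plain Python rather than on Hadoop, and the record fields (`user_id`, `keyword`, `city`, `bought`, `pages`) are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical session records; the field names are illustrative assumptions.
sessions = [
    {"user_id": "u1", "keyword": "shoes", "city": "Berlin", "bought": True,
     "pages": ["/home", "/shoes", "/checkout"]},
    {"user_id": "u2", "keyword": "shoes", "city": "Berlin", "bought": False,
     "pages": ["/home"]},
    {"user_id": "u3", "keyword": "shoes", "city": "Munich", "bought": True,
     "pages": ["/shoes"]},
    {"user_id": "u1", "keyword": "shoes", "city": "Berlin", "bought": True,
     "pages": ["/home", "/faq"]},
    {"user_id": "u4", "keyword": "shoes", "city": "Berlin", "bought": True,
     "pages": ["/shoes", "/checkout"]},
]


def map_session(session, keyword, city):
    """Map step: plays the role of the WHERE clause.

    Emits (user_id, page) pairs only for sessions that match the filter.
    """
    if (session["keyword"] == keyword
            and session["city"] == city
            and session["bought"]):
        for page in session["pages"]:
            yield session["user_id"], page


def reduce_pages(pairs):
    """Reduce step: plays the role of GROUP BY user_id."""
    pages_by_user = defaultdict(set)
    for user_id, page in pairs:
        pages_by_user[user_id].add(page)
    return dict(pages_by_user)


pairs = [pair for s in sessions for pair in map_session(s, "shoes", "Berlin")]
pages_by_user = reduce_pages(pairs)
buyer_count = len(pages_by_user)  # distinct users matching all conditions
print(buyer_count, pages_by_user)
```

Each new filter combination means a new full pass over the data, which is why pre-processing as much as possible before loading matters for interactive use.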

  • Utku Can Topçu at Nov 4, 2009 at 12:39 am
    Hey,

    Hadoop, HBase and Hive really do scale for web analytics. I have been
    doing web analytics with Hadoop for more than a year.

    In my case, I periodically rotate logs and put them on HDFS. (I should
    think about writing directly to HDFS, but it's not a critical issue for
    me right now.)
    Once the log files are on HDFS, a single map/reduce job runs over the
    newly added data line by line.

    The key point here is that we need to think of web analytics as a series
    of abstractions over the raw data. Each abstraction might correspond to
    a map/reduce job.

    The big question arises right here: what does the initial analysis do
    with the log files?

    Abstraction #1:
    I assume each log line represents either a pageview or an event; we can
    generalize an event as a pageview too, and I will do so here.
    An event comes with some valuable information, such as:
    [session identifier, unique visitor identifier, browser- and
    locale-related data, page-related data, location-related data, etc.]

    Abstraction #2:
    Our map/reduce job should map an Event to a Session Event in order to
    get a newer abstraction over the raw data. Session Events should then be
    reduced into Sessions, using the Session Identifiers as keys.
    At the end of this step, we have our session data sorted out as
    (key, value) pairs, where the keys are the Session Identifiers and the
    values are the Sessions. This means we can now store Sessions in a
    key/value database, which in this case is HBase.

    One can think of additional abstractions from this point on. I can come
    up with many ideas, some of which are fairly mature and some of which
    are just dreams and/or premature thoughts.

    Regards,
    Utku
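
The two abstractions above can be sketched as a tiny map/reduce pipeline in plain Python. This is a minimal sketch, not Hadoop code: the tab-separated log layout (session id, visitor id, timestamp, page) and all function names are assumptions made for the example, and the sort before grouping stands in for Hadoop's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log lines: session id, visitor id, timestamp, page.
log_lines = [
    "s1\tv1\t100\t/home",
    "s2\tv2\t105\t/home",
    "s1\tv1\t130\t/products",
]


def map_event(line):
    """Map: parse one raw log line into a (session_id, event) pair."""
    session_id, visitor_id, timestamp, page = line.split("\t")
    return session_id, {"visitor": visitor_id, "ts": int(timestamp), "page": page}


def reduce_session(session_id, events):
    """Reduce: collapse all events of one session into a single Session."""
    ordered = sorted(events, key=itemgetter("ts"))
    session = {
        "visitor": ordered[0]["visitor"],
        "pages": [e["page"] for e in ordered],
    }
    return session_id, session


def run_job(lines):
    """Run map, then sort by key (the shuffle), then reduce per key."""
    mapped = sorted((map_event(line) for line in lines), key=itemgetter(0))
    return dict(
        reduce_session(key, [event for _, event in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    )


sessions = run_job(log_lines)
# The result is a set of (session id -> Session) pairs, the shape that
# would be written to a key/value store.
print(sessions)
```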


  • Ricky Ho at Nov 4, 2009 at 2:42 pm
    Good point. Hadoop can be used as a distributed DB loader.

    Just curious: how would you compare this with writing directly to HBase (bypassing the log and Hadoop step)?

    Rgds,
    Ricky


  • Utku Can Topçu at Nov 4, 2009 at 3:49 pm
    The reason I chose DB loading is that each Session (the WAA calls this a
    Visit) is composed of multiple Events, where an Event is one line of the
    log file. Because every session spans one or more log lines, many session
    constants (e.g. IP address, User-Agent, unique visitor cookie) are
    duplicated across those lines. In addition, the Reducer goes over the
    sessions once more before loading them into the DB, so that we can compute
    some session-specific calculations on the fly.

    I hope this explains my design choice clearly enough.

    Regards,
    Utku
    On Wed, Nov 4, 2009 at 4:41 PM, Ricky Ho wrote:

    Good point. Hadoop can be used as a distributed DB loader.

    Just curious: how would you compare this with writing directly to HBase
    (bypassing the log and Hadoop step)?

    Rgds,
    Ricky

    -----Original Message-----
    From: Utku Can Topçu
    Sent: Tuesday, November 03, 2009 4:39 PM
    To: common-user@hadoop.apache.org
    Subject: Re: AW: Web Analytics Use case?

    Hey,

    Hadoop, HBase and Hive really do scale for web analytics; I have been
    doing web analytics with Hadoop for more than a year.

    In my case, I periodically rotate logs and put them on HDFS. (I should
    think about writing directly to HDFS, but it's not a critical issue for me
    right now.)
    Once the log files are on HDFS, a single map/reduce job runs over the
    newly introduced data line by line.

    The key point here is that we need to think of web analytics as a series
    of abstractions on the raw data. Each abstraction might correspond to a
    map/reduce job.

    The big question arises right here: what does the initial analysis do
    with the log files?

    Abstraction #1:
    I assume each log line represents either a pageview or an event; we can
    generalize an event as a pageview too, and I will certainly do so.
    An event comes with some valuable information, such as:
    [Session identifier, unique visitor identifier, browser and locale related
    data, page related data, location related data, etc.]

    Abstraction #2:
    Our map/reduce job should map an Event to a Session Event in order to get
    a newer abstraction on the raw data. Session Events should then be reduced
    into Sessions, with the Session Identifiers as keys.
    At the end of the first abstraction, we have our session data sorted out
    as (key, value) pairs, where the keys are the Session Identifiers and the
    values are, presumably, the Sessions. This means we can now store Sessions
    in a key/value database, which in this case is HBase.
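
    The two abstractions above can be sketched in plain Python, without any
    Hadoop API. This is only an illustration under assumptions: the
    tab-separated log format, the field names, and the function names
    (map_event, reduce_session, sessionize) are hypothetical and not taken
    from the thread. The grouping dictionary stands in for the shuffle phase
    that Hadoop would perform between map and reduce.

```python
from collections import defaultdict


def map_event(line):
    """Map step: parse one log line into a (session_id, event) pair.

    Assumed (hypothetical) format: session_id, visitor_id, user_agent, page,
    separated by tabs.
    """
    session_id, visitor_id, user_agent, page = line.rstrip("\n").split("\t")
    event = {"visitor_id": visitor_id, "user_agent": user_agent, "page": page}
    return session_id, event


def reduce_session(session_id, events):
    """Reduce step: fold all events of one session into one Session record.

    Session constants (visitor id, user agent) are stored once instead of
    once per log line.
    """
    first = events[0]
    return {
        "session_id": session_id,
        "visitor_id": first["visitor_id"],
        "user_agent": first["user_agent"],
        "pages": [e["page"] for e in events],
    }


def sessionize(lines):
    """Simulate the shuffle: group mapped events by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        key, event = map_event(line)
        grouped[key].append(event)
    return {key: reduce_session(key, events) for key, events in grouped.items()}


log = [
    "s1\tv1\tFirefox\t/home",
    "s2\tv2\tSafari\t/home",
    "s1\tv1\tFirefox\t/product",
]
sessions = sessionize(log)
```

    The resulting dictionary is keyed by session identifier, which is the
    shape one would then write into a key/value store.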

    One can think of additional abstractions from this point. I can come up
    with many ideas, some of which are fairly mature and some of which are
    just dreams and/or premature thoughts.

    Regards,
    Utku


    On Tue, Nov 3, 2009 at 9:14 PM, John Martyniak
    <john@beforedawnsolutions.com> wrote:

    Benjamin,

    Well, instead of SQL you have code that you can use to manipulate the
    data. If possible, I would look for a way to pre-process as much of the
    data as you can before putting it into HBase, and then use additional
    Map/Reduce jobs to provide any further customizations.

    I don't think that you can "replace" the RDBMS without re-visualizing the
    data, meaning that you will need to re-model it so that it fits the HBase
    architecture, which means no relationships.
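
    As a rough illustration of what "no relationships" can mean in practice,
    the sketch below folds three related tables into one wide row per
    session, using "family:qualifier" column names in the style of HBase.
    The table contents, the column families (info, page, order) and the
    function name to_wide_row are assumptions made up for this example; it
    uses plain Python dictionaries and no HBase client.

```python
# Hypothetical normalized tables, as they might look in an RDBMS.
sessions = [{"session_id": "s1", "city": "Berlin", "keyword": "shoes"}]
pageviews = [
    {"session_id": "s1", "seq": 0, "page": "/home"},
    {"session_id": "s1", "seq": 1, "page": "/product/42"},
]
orders = [{"session_id": "s1", "product": "42"}]


def to_wide_row(session, pageviews, orders):
    """Fold one session and its child rows into a single {column: value} map."""
    sid = session["session_id"]
    row = {"info:city": session["city"], "info:keyword": session["keyword"]}
    for pv in pageviews:
        if pv["session_id"] == sid:
            # Zero-padded sequence numbers keep the columns in visit order.
            row["page:%04d" % pv["seq"]] = pv["page"]
    for order in orders:
        if order["session_id"] == sid:
            row["order:" + order["product"]] = "1"
    return sid, row


row_key, row = to_wide_row(sessions[0], pageviews, orders)
```

    With this layout, a question such as "did this session buy product 42,
    and which pages did it view?" is answered from one row, with no join.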

    By the way, most of this can be done; it just requires some work and a
    rethinking of the way that you do things, both for Map/Reduce and HBase.

    -John



    On Nov 3, 2009, at 10:16 AM, Benjamin Dageroth wrote:

    Hi John,
    Thanks a lot for the fast answer. I was unsure because we would like to
    avoid aggregating the data, so that our users can come up with all kinds
    of filters and conditions for their queries and always drill down to
    single users of their website. I am not sure how this works when SQL is
    not directly available. We are currently using complex SQL queries for
    this; would these then have to be rewritten as Map/Reduce tasks which
    provide the final result?

    Or how would one go about actually replacing an RDBMS?

    Thanks a lot,
    Benjamin




  • Ricky Ho at Nov 4, 2009 at 9:53 pm
    Why can't you do the session-specific calculation and aggregation at the
    spot where the session data is gathered?

    One of the main uses of Map/Reduce is when the aggregation is done across
    a very scattered data set. But it looks like the kind of processing you
    describe is very localized. I mean, the same session pretty much always
    hits the same server, so you can do the aggregation at that same spot.
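
    A minimal sketch of aggregating at the collection point, assuming all
    events of a session arrive at the same tracking server. The class and
    method names (SessionAggregator, on_event, on_session_end) and the
    aggregate fields are hypothetical; a real collector would also need
    session timeouts and persistent storage.

```python
class SessionAggregator:
    """Keeps running per-session totals in memory on the tracking server."""

    def __init__(self):
        self.open_sessions = {}

    def on_event(self, session_id, page):
        # Update the running aggregate as soon as the event arrives.
        agg = self.open_sessions.setdefault(
            session_id, {"pageviews": 0, "pages": []}
        )
        agg["pageviews"] += 1
        agg["pages"].append(page)

    def on_session_end(self, session_id):
        # Hand back the finished aggregate and forget the session;
        # a real system would write it to storage at this point.
        return self.open_sessions.pop(session_id)


agg = SessionAggregator()
agg.on_event("s1", "/home")
agg.on_event("s2", "/home")
agg.on_event("s1", "/checkout")
finished = agg.on_session_end("s1")
```

    No shuffle is needed here because the grouping key (the session) never
    leaves the machine that collects it.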

    Rgds,
    Ricky


  • Utku Can Topçu at Nov 5, 2009 at 5:33 am
    Ricky,

    You're absolutely right. I've already started developing a new data
    collection system that populates sessions on the fly. Until that
    development is finished, I felt I needed a layered approach of
    abstractions. Session aggregation might eventually supersede the initial
    Hadoop run.

    The thing with these log files is that they are simply the accumulation
    of the past 5 years. When I started this (5 years ago), I had no idea
    what I would be facing :)



Discussion Overview
group: common-user@
categories: hadoop
posted: Nov 3, '09 at 1:28p
active: Nov 5, '09 at 5:33a
posts: 10
users: 4
website: hadoop.apache.org...
irc: #hadoop
