I think you should break your question down into three sub-questions:
1) How do I collect the raw data?
2) How do I process the raw data that I collect above to produce useful information?
3) How do I store the information for large-scale consumption needs?
In (1), it looks like you can collect the raw data using a tracking pixel, but where do you store it? The best way to store it may depend on your answer in (2).
In (2), what algorithms are you choosing to process the raw data? Are they parallelizable? Hadoop is certainly a candidate you should consider, but be reminded that Hadoop is a batch-oriented processing framework, which means your processing will be chopped into chunks. Double-check that your application is OK with this processing model; see the sketch below.
In (3), what is the size of the output information, and how will it be consumed? Most likely your analytics application has a high tolerance for data-integrity issues, and that's why you can use a more scalable NoSQL DB, which sacrifices some data integrity but provides a lot of scalability. Note that the decision in (3) is completely independent of (2).
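To make the batch model in (2) concrete, here is a minimal sketch of a Hadoop job (new MapReduce API) answering the kind of question raised further down the thread: purchases per keyword and city. The tab-separated session-log layout and field positions are assumptions for the example only, not anything from the actual setup.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeywordCityConversions {

  // Mapper: one input line per finished visitor session, assumed to be
  // sessionId <TAB> keyword <TAB> city <TAB> purchased.
  // Emits (keyword|city, 1) for sessions that ended in a purchase.
  public static class SessionMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split("\t");
      if (f.length < 4) return;            // skip malformed lines
      if ("1".equals(f[3])) {              // purchased?
        outKey.set(f[1] + "|" + f[2]);     // keyword|city
        ctx.write(outKey, ONE);
      }
    }
  }

  // Reducer: sums the 1s per (keyword, city) pair.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> vals, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : vals) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "keyword-city conversions");
    job.setJarByClass(KeywordCityConversions.class);
    job.setMapperClass(SessionMapper.class);
    job.setCombinerClass(SumReducer.class);  // safe: summing is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Each mapper processes its own input split independently - that is the chopped-into-chunks model - and the output, one count per (keyword, city) pair, is usually small enough to load into whichever store you pick in (3).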
Rgds,
Ricky
http://horicky.blogspot.com

-----Original Message-----
From: Benjamin Dageroth
Sent: Tuesday, November 03, 2009 7:16 AM
To: [email protected]
Subject: RE: Web Analytics Use case?
Hi John,
Thanks a lot for the fast answer. I was unsure because we would like to avoid aggregating the data, so that our users can come up with all kinds of filters and conditions for their queries and always drill down to individual users of their website. I am not sure how this works when SQL is not directly available. We currently use complex SQL queries for this; would these then have to be rewritten as map/reduce jobs that produce the final result?
Or how would one go about actually replacing an RDBMS?
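To make the question concrete: a query like SELECT keyword, COUNT(*) FROM sessions WHERE purchased = 1 GROUP BY keyword decomposes into a filter (the map side) and a grouped count (the reduce side). A toy, framework-free sketch of that decomposition in plain Java, with invented field names:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupByAsMapReduce {
  // Invented record layout, just for the illustration.
  record Session(String keyword, String city, boolean purchased) {}

  public static void main(String[] args) {
    List<Session> sessions = List.of(
        new Session("shoes", "Berlin", true),
        new Session("shoes", "Berlin", false),
        new Session("hats", "Hamburg", true));

    // "map" side: filter rows and emit the grouping key;
    // "reduce" side: count per key.
    Map<String, Long> purchasesPerKeyword = sessions.stream()
        .filter(Session::purchased)                        // WHERE purchased = 1
        .collect(Collectors.groupingBy(Session::keyword,   // GROUP BY keyword
                                       Collectors.counting())); // COUNT(*)

    System.out.println(purchasesPerKeyword);  // e.g. {hats=1, shoes=1}
  }
}

In Hadoop the same two halves become the Mapper and Reducer classes, with the framework doing the grouping between them.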
Thanks a lot,
Benjamin
_______________________________________
Benjamin Dageroth, Business Development Manager
Webtrekk GmbH
Boxhagener Str. 76-78, 10245 Berlin
phone 030 - 755 415 - 360
fax 030 - 755 415 - 100
[email protected]
http://www.webtrekk.com
Amtsgericht Berlin, HRB 93435 B
Managing Director: Christian Sauer
_______________________________________
-----Original Message-----
From: John Martyniak
Sent: Tuesday, November 3, 2009 15:09
To: [email protected]
Subject: Re: Web Analytics Use case?
Benjamin,
That is pretty much the exact use case for Hadoop.
Hadoop is a system built for handling very large datasets and delivering processed results. HBase is built for ad-hoc data access: instead of having complicated table joins etc., you have very large rows (multiple columns) holding aggregate data, and you use HBase to return results from that.
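A minimal sketch of that wide-row read path with the HBase Java client (a newer client API than what existed at the time of this thread); the table, column family, and qualifier names are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VisitorRowRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("visitor_stats"))) {

      // One wide row per visitor: pre-joined aggregates live in columns,
      // so reading a visitor is a single keyed lookup, not a join.
      Get get = new Get(Bytes.toBytes("visitor#12345"));
      Result row = table.get(get);

      long pageViews = Bytes.toLong(
          row.getValue(Bytes.toBytes("agg"), Bytes.toBytes("page_views")));
      String lastKeyword = Bytes.toString(
          row.getValue(Bytes.toBytes("agg"), Bytes.toBytes("last_keyword")));

      System.out.println(pageViews + " views, keyword=" + lastKeyword);
    }
  }
}

The point of the design is that the join work is done once, at write time, by the batch job, so serving a visitor's aggregates is a single keyed lookup instead of a multi-table join.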
We currently use Hadoop/HBase to collect and process lots of data, then take the results from the processing to populate a Solr index and a MySQL database, which are then used to feed the front ends. It seems to work pretty well in that it greatly reduces the number of rows and the size of the queries in the DB/index.

We are exploring using HBase to feed the front ends in place of the MySQL DBs; so far the jury is out on the performance, but it does look promising.
-John
On Nov 3, 2009, at 8:28 AM, Benjamin Dageroth wrote:

Hi,
I am currently evaluating whether Hadoop might be an alternative to our current system. We provide a web analytics solution for very large websites and run every analysis on all collected data - we do not aggregate the data. This results in very large amounts of data being processed for each query, and currently we use an in-memory database from Exasol with really a lot of RAM, so that delivering the results does not take longer than a few seconds, and for more complicated queries not longer than a minute.
The solution, however, is quite expensive, and given the growth of the data I'd like to explore alternatives. I have read about NoSQL datastores and about Hadoop, but I am not sure whether it is actually an option for our web analytics solution. We collect data via a tracking pixel, which passes it to a tracking server, which writes it to disk once a visitor's session is done. Our current solution has a large number of tables, and the queries run against the data can be quite complex:
How many users who came via a given keyword and were from a given city actually bought the advertised product? Of these users, what other pages did they look at? Etc.
Would this be a good case for HBase, Hadoop, Map/Reduce, and perhaps Mahout?
Thanks for any thoughts,
Benjamin
_______________________________________
Benjamin Dageroth, Business Development Manager
Webtrekk GmbH
Boxhagener Str. 76-78, 10245 Berlin
phone 030 - 755 415 - 360
fax 030 - 755 415 - 100
[email protected]
http://www.webtrekk.com
Amtsgericht Berlin, HRB 93435 B
Managing Director: Christian Sauer
_______________________________________