Grokbase Groups Pig user October 2008
My latest stuff looks at Apache logs, aggregates to text files, and then I have a simple Perl script that +='s into MySQL tables. A few thoughts:

* Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files. Anyone work on such a thing? I could do the simple case(s), but would need some help with more complicated ones.

* How about a MOVE function? Would be nice to move files once done processing them.

* I have yet to get into hadoop, but it would be nice to have an incoming directory, then a processed directory. Really, I would like to have a daemon that watches a directory that churns through logs exactly once. That's kind of how hadoop works, right?
* How about a LOAD function that can read from S3, or maybe the MOVE could move from S3 to local storage, or vice versa?

Thoughts?

Thanks,
Earl




  • Ian Holsman at Oct 20, 2008 at 12:37 am

    Earl Cahill wrote:
    My latest stuff looks at apache logs, aggregates to txt files, then I have a simple perl script that +='s into mysql tables. A few thoughts

    * Would sure be nice if I could just STORE my aggregations into any jdbc-friendly database, like mysql, instead of text files. Anyone work on such a thing? I could do the simple case(s), but would need some help with more complicated ones.
    I think a generic 'MySQLStore()' might be a bit troublesome, especially
    if you need to deal with Bags & Maps. But you could encode this kind of
    thing in a properties file that gets passed as an argument to it, similar
    to how a delimiter is passed into PigStorage(), mapping the fields in the
    schema to columns/tables.

    With that, you would just create a connection in 'putNext' and push each
    row through via INSERT ... ON DUPLICATE KEY UPDATE
    (http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html).

    MySQL connections are established quite fast, so I'd argue you wouldn't
    even need pooling.

    That said, it might be faster to create a load file and use LOAD DATA
    (http://dev.mysql.com/doc/refman/5.0/en/load-data.html).
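
    A rough sketch of that putNext idea using plain JDBC; the class, table,
    column, and property names here are made up for illustration, and the
    actual Pig StoreFunc plumbing is left out:

    // Illustrative only: push one row into MySQL with
    // INSERT ... ON DUPLICATE KEY UPDATE, driven by a properties file.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.Properties;

    public class MySqlRowWriter {
        private final Connection conn;
        private final PreparedStatement stmt;

        public MySqlRowWriter(Properties props) throws Exception {
            // e.g. props holds jdbc.url, jdbc.user, jdbc.password
            conn = DriverManager.getConnection(
                    props.getProperty("jdbc.url"),
                    props.getProperty("jdbc.user"),
                    props.getProperty("jdbc.password"));
            // the field-to-column mapping is encoded in the statement itself
            stmt = conn.prepareStatement(
                    "INSERT INTO page_counts (site, log_date, hits) VALUES (?, ?, ?) "
                    + "ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)");
        }

        // roughly what a putNext() body would do, one row at a time
        public void writeRow(List<Object> fields) throws Exception {
            for (int i = 0; i < fields.size(); i++) {
                stmt.setObject(i + 1, fields.get(i));
            }
            stmt.executeUpdate();
        }

        public void close() throws Exception {
            stmt.close();
            conn.close();
        }
    }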


    regards
    Ian

    (And no, I haven't written one of these yet, but I think I will need one
    shortly.)



    * How about a MOVE function? Would be nice to move files once done processing them.

    * I have yet to get into hadoop, but it would be nice to have an incoming directory, then a processed directory. Really, I would like to have a daemon that watches a directory that churns through logs exactly once. That's kind of how hadoop works, right?
    * How about a LOAD function that can read from S3, or maybe the MOVE could move from S3 to local storage, or vice versa? Thoughts?

    Thanks,
    Earl


  • Earl Cahill at Oct 20, 2008 at 4:34 pm

    I think a generic 'MySQLStore()' might be a bit troublesome,
    I was envisioning DbStore(), which would hopefully work with any JDBC-compliant driver, perhaps at first croaking when the driver isn't MySQL. As you mention, MySQL makes the += stuff a breeze, but that is a fair amount harder and costlier for other databases.
    especially if you need to deal with Bags & Maps.
    When I mentioned complicated cases, this is what I meant. The simple case for me would be ArrayList-type stuff.
    saying that, it might be faster to create a load file, and use LOAD DATA
    Not sure how you do += with LOAD DATA. If you know of a way, that would be ideal. In my testing from a couple of years ago, LOAD DATA was amazingly fast.

    I think I would vote for eventually doing multi-row inserts like this:


    INSERT INTO table (a,b,c) VALUES (1,2,3),(4,5,6)
    ON DUPLICATE KEY UPDATE c=VALUES(a)+VALUES(b);
    If it is OK to depend on java.util.concurrent, we could use a static ConcurrentLinkedQueue that gets drained when a threshold is met (say, every 50 or 500 rows), when an exception is hit, or (if possible) when we're all done. For the every-50 case, the system could generate a potentially large query and have a prepared statement waiting; we would just have to walk the queue, setting values accordingly, and execute it.
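
    A rough sketch of that queue-and-flush idea; the table, columns, and batch
    size are placeholders only:

    // Sketch only: buffer rows in a ConcurrentLinkedQueue and flush them as
    // one multi-row INSERT ... ON DUPLICATE KEY UPDATE once a threshold is hit.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class BatchedUpsert {
        private static final int THRESHOLD = 500;
        private final Queue<Object[]> pending = new ConcurrentLinkedQueue<Object[]>();
        private final Connection conn;

        public BatchedUpsert(Connection conn) {
            this.conn = conn;
        }

        public void add(Object[] row) throws Exception {
            pending.add(row);
            if (pending.size() >= THRESHOLD) {
                flush();
            }
        }

        // builds "INSERT INTO t (a, b, c) VALUES (?,?,?),(?,?,?),..." for the batch
        public synchronized void flush() throws Exception {
            int rows = pending.size();
            if (rows == 0) {
                return;
            }
            StringBuilder sql = new StringBuilder("INSERT INTO t (a, b, c) VALUES ");
            for (int i = 0; i < rows; i++) {
                sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
            }
            sql.append(" ON DUPLICATE KEY UPDATE c = c + VALUES(c)");
            PreparedStatement stmt = conn.prepareStatement(sql.toString());
            int p = 1;
            for (int i = 0; i < rows; i++) {
                for (Object field : pending.poll()) {
                    stmt.setObject(p++, field);
                }
            }
            stmt.executeUpdate();
            stmt.close();
        }
    }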

    I think the user would have to tell us how to connect via a conf file or constructor, and maybe supply some SQL with question marks. For a first shot, I would say single-row inserts, like:


    INSERT INTO table (a, b, c) VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE c = c + ?
    Then we would just need a mechanism for knowing which values from the ArrayList go where and what the types are. Perhaps that gets easier on the types branch.
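
    For the type question, one simple-minded approach (purely illustrative,
    pending whatever the types branch provides) is to bind each value by its
    runtime class:

    // Illustrative helper: bind values from the tuple's field list onto a
    // prepared statement, choosing the setter by runtime type.
    import java.sql.PreparedStatement;
    import java.sql.Types;
    import java.util.List;

    public final class ParamBinder {
        public static void bind(PreparedStatement stmt, List<Object> values) throws Exception {
            for (int i = 0; i < values.size(); i++) {
                Object v = values.get(i);
                int pos = i + 1; // JDBC parameters are 1-based
                if (v == null) {
                    stmt.setNull(pos, Types.VARCHAR);
                } else if (v instanceof Integer) {
                    stmt.setInt(pos, (Integer) v);
                } else if (v instanceof Long) {
                    stmt.setLong(pos, (Long) v);
                } else if (v instanceof Double) {
                    stmt.setDouble(pos, (Double) v);
                } else {
                    stmt.setString(pos, v.toString()); // fall back to text
                }
            }
        }
    }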

    This DbStore() functionality would simplify my process a fair amount and I am thinking others would feel the same.


    Thanks,
    Earl

  • Ted Dunning at Oct 21, 2008 at 5:43 am
    Multiple single-record insert statements with an occasional commit are plenty
    fast enough for most purposes. That is definitely easier than the large
    prepared-statement approach.

    And this would definitely be useful for all kinds of ETL processes that pump
    aggregates out to a conventional DB for reporting. Copying back to a
    conventional file system and doing a load is a pain in the arse.
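
    A minimal sketch of the single-insert-plus-occasional-commit approach; the
    table, columns, and commit interval are arbitrary placeholders:

    // Illustrative only: plain single-row upserts with autocommit off,
    // committing every few thousand rows.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class SimpleLoader {
        public static void load(Connection conn, List<Object[]> rows) throws Exception {
            conn.setAutoCommit(false);
            PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO t (a, b, c) VALUES (?, ?, ?) "
                    + "ON DUPLICATE KEY UPDATE c = c + VALUES(c)");
            int n = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    stmt.setObject(i + 1, row[i]);
                }
                stmt.executeUpdate();
                if (++n % 5000 == 0) {
                    conn.commit(); // the "occasional commit"
                }
            }
            conn.commit();
            stmt.close();
        }
    }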
    On Mon, Oct 20, 2008 at 9:33 AM, Earl Cahill wrote:

    ... I think the user would have to tell us how to connect via a conf file or
    constructor, and maybe supply some SQL with question marks. For a first shot,
    I would say single-row inserts, like:


    INSERT INTO table (a, b, c) VALUES (?, ?, ?)
    ON DUPLICATE KEY UPDATE c = c + ?
    Then we would just need a mechanism for knowing which values from the
    ArrayList go where and what the types are. Perhaps that gets easier on the
    types branch.

    This DbStore() functionality would simplify my process a fair amount and I
    am thinking others would feel the same.
  • Earl Cahill at Oct 21, 2008 at 6:12 am
    Multiple single-record insert statements with an occasional commit are plenty fast enough for most purposes.
    Guess I am just thinking of doing several million inserts. Certainly we could start with single record inserts and go from there.

    And this would definitely be useful for all kinds of ETL processes that pump aggregates out to a conventional DB for reporting. Copying back to a conventional file system and doing a load is a pain in the arse.
    Indeed it is. I am likely a bit out from working on this, and I have yet to get the types branch to build properly (and haven't tried real hard), but once I do, I plan on diving into this.

    Thanks,
    Earl

  • Ted Dunning at Oct 21, 2008 at 7:24 am

    On Mon, Oct 20, 2008 at 11:11 PM, Earl Cahill wrote:

    Multiple single-record insert statements with an occasional commit are plenty
    fast enough for most purposes.

    Guess I am just thinking of doing several million inserts. Certainly we
    could start with single record inserts and go from there.
    If you still have several million rows to deal with, you should still be
    working in Pig, not dumping the mess onto a relational database.

    :-)

Discussion Overview
group: user
categories: pig, hadoop
posted: Oct 18, '08 at 5:18a
active: Oct 21, '08 at 7:24a
posts: 6
users: 3
website: pig.apache.org
