Sorry if this is not the correct list to post this on; it was the
closest I could find.

We are using a taildir('/var/log/foo/') source on all of our agents. If
an agent goes down and data cannot be sent to the collector for some
time, what happens when the agent becomes available again? Will the
agent tail the whole directory starting from the beginning of all files,
thus adding duplicate data to our sink?

I've read that I could set the startFromEnd parameter to true. In that
case, however, if an agent goes down we would lose any data that gets
written to our files until the agent comes back up. How do people handle
this? It seems like you have to accept either duplicate or missing data.

Thanks
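
Editor's sketch of the trade-off described above: one general way to avoid both re-reading from the start and skipping to the end is to persist the byte offset already consumed for each file and resume from it after a restart. This is not taken from the thread and is not a description of how Flume behaves; the function names, the JSON checkpoint file, and its layout are all hypothetical, chosen only for illustration.

```python
import json
import os


def load_offset(checkpoint_path, log_path):
    """Return the saved byte offset for log_path, or 0 if none exists."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f).get(log_path, 0)


def read_new_lines(log_path, start):
    """Read complete lines appended after byte offset `start`.

    Returns (lines, new_offset). A partial trailing line is left unread.
    """
    # If the file is shorter than the saved offset, assume it was
    # truncated or rotated and start again from the beginning.
    if os.path.getsize(log_path) < start:
        start = 0
    with open(log_path, "rb") as f:
        f.seek(start)
        data = f.read()
    # rfind returns -1 when there is no newline, which makes end == 0.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, start + end


def commit_offset(checkpoint_path, log_path, offset):
    """Persist the offset, replacing the checkpoint file atomically."""
    offsets = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            offsets = json.load(f)
    offsets[log_path] = offset
    tmp = checkpoint_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, checkpoint_path)
```

A caller would load the offset, read the new lines, send them downstream, and only then call commit_offset. Committing after sending means a crash between the two steps can replay a small batch, so duplicates are limited to that window rather than the whole file; committing before sending would instead risk losing that batch.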


  • James Seigel at Mar 17, 2011 at 1:56 am
    I believe, sir, there should be a Flume support group at Cloudera.
    I'm guessing most of us here haven't used it and therefore aren't
    much help.

    This is vanilla Hadoop land. :)

    Cheers and good luck!
    James

    On a side note, how much data are you pumping through it?


    Sent from my mobile. Please excuse the typos.
  • Mark at Mar 17, 2011 at 3:24 am
    Sorry about that.

    FYI, about 1 GB/day across 4 collectors at the moment.

Discussion Overview
group: common-user @
categories: hadoop
posted: Mar 17, '11 at 1:53a
active: Mar 17, '11 at 3:24a
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop