Currently on PAUSE you have to explicitly delete old uploads.

How about changing it so you have to explicitly KEEP old uploads
that appear to have been superseded?

PAUSE already has a mechanism to delete files at some future point in
time. That's currently only used as part of a safety/sanity check to
delay deletions that were manually invoked.

I envisage PAUSE having a set of rules it would apply monthly, say,
to automatically select files for "purging".

The rules might look something like this:

File does not have deletion date set, and
File is older than 3 months, and
File has a later upload
- in the same directory
- with the same major version
- with a higher minor version
- which is also more than 3 months old

(Naturally these are just suggestions. Let's not bikeshed the fine
details yet. It's the approach we need to discuss first.)
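
The rules above could be sketched roughly as follows. This is only an illustration in Python, not PAUSE code: the `Upload` record, its field names, and `select_for_purge` are hypothetical, "3 months" is approximated as 90 days, and matching on a distribution name is an added assumption that the rules leave implicit.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Upload:
    """Hypothetical record for one uploaded file (not the PAUSE schema)."""
    directory: str
    dist: str                        # distribution name (assumed field)
    major: int
    minor: int
    uploaded: date
    delete_on: Optional[date] = None  # scheduled deletion date, if any


def select_for_purge(uploads: List[Upload], today: date,
                     min_age: timedelta = timedelta(days=90)) -> List[Upload]:
    """Return the uploads that the suggested rules would select for purging."""
    cutoff = today - min_age
    selected = []
    for f in uploads:
        if f.delete_on is not None:   # rule: no deletion date set
            continue
        if f.uploaded >= cutoff:      # rule: file is older than 3 months
            continue
        superseded = any(
            g is not f
            and g.directory == f.directory   # in the same directory
            and g.dist == f.dist             # same distribution (assumption)
            and g.major == f.major           # with the same major version
            and g.minor > f.minor            # with a higher minor version
            and g.uploaded > f.uploaded      # a later upload
            and g.uploaded < cutoff          # also more than 3 months old
            for g in uploads
        )
        if superseded:
            selected.append(f)
    return selected
```

Files returned by such a function would then be given a deletion date a month ahead, which also keeps them from being selected again on the next run.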

Files selected in this way would be scheduled to be deleted in a month
and an email would be sent to the authors, just as if they'd selected
the files for deletion via PAUSE.

All that's needed, in addition to the above script, is a way for authors
to indicate that a particular file shouldn't be purged. The database
could use a far-future date for that, which the UI could present as a
"do not purge" checkbox against the file.

Tim.
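
The far-future date idea could look something like the sketch below. It is a hedged illustration only: the sentinel value, `apply_checkbox`, and `is_due` are hypothetical names, and the real PAUSE database may represent deletion dates differently.

```python
from datetime import date
from typing import Optional

# Hypothetical sentinel: a deletion date so far in the future that it never
# comes due. Because a date IS set, a rule such as "file does not have a
# deletion date set" skips the file without any extra logic.
DO_NOT_PURGE = date(9999, 12, 31)


def apply_checkbox(delete_on: Optional[date], checked: bool) -> Optional[date]:
    """Map the state of a "do not purge" checkbox onto the stored date."""
    if checked:
        return DO_NOT_PURGE
    if delete_on == DO_NOT_PURGE:
        return None           # unchecking clears the pin
    return delete_on          # leave a real scheduled deletion untouched


def is_due(delete_on: Optional[date], today: date) -> bool:
    """True when a scheduled deletion date has been reached."""
    return delete_on is not None and delete_on <= today
```

The appeal of this design is that no new column or flag is needed: the existing delayed-deletion mechanism already knows how to ignore files whose date has not yet arrived.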

  • David Golden at Mar 25, 2010 at 11:53 am

    On Thu, Mar 25, 2010 at 7:12 AM, Tim Bunce wrote:
    Currently on PAUSE you have to explicitly delete old uploads.

    How about changing it so you have to explicitly KEEP old uploads
    that appear to have been superseded?
    I would support an option to purge automatically only if the default
    is NOT to purge.

    I don't think it's a good idea to make it hard for people to find
    older versions of a distribution -- where hard means "have to track it
    down on backpan". (Though we could make clients better about it, I
    suppose.)

    Of all the things I'd like to do with PAUSE, this feature is very,
    very low on my list.

    -- David
  • Ruslan Zakirov at Mar 25, 2010 at 12:01 pm
    Hi,

    I think David is not alone here. Since PAUSE already has interfaces for
    users to delete files, it's possible to write a cpanpurge utility that
    does this without changing PAUSE: a small YAML config in the home
    directory listing protected files, plus an interactive mode.

    --
    Best regards, Ruslan.
  • Ovid at Mar 25, 2010 at 12:14 pm

    --- On Thu, 25/3/10, David Golden wrote:

    I don't think it's a good idea to make it hard for people to find
    older versions of a distribution -- where hard means "have to track it
    down on backpan". (Though we could make clients better about it, I
    suppose.)
    I don't have a particular opinion about this, but this issue could be mitigated if CPAN linked to the backpan.

    Cheers,
    Ovid
  • Eric Wilhelm at Mar 25, 2010 at 8:37 pm
    # from David Golden
    # on Thursday 25 March 2010 04:52:
    Of all the things I'd like to do with PAUSE, this feature is very,
    very low on my list.
    Make it easier to create a "mock pause" perhaps? Then at least someone
    could demo their suggested feature and maybe it would be easier to
    drop-in to the official system once it has been proven (e.g. one could
    run a mirror as a demo of "what upstream would do" to test their idea
    over the course of a month or whatever.) My bikeshed is on wheels.

    --Eric
    --
    The only thing that could save UNIX at this late date would be a new $30
    shareware version that runs on an unexpanded Commodore 64.
    --Don Lancaster (1991)
    ---------------------------------------------------
    http://scratchcomputing.com
    ---------------------------------------------------
  • David Cantrell at Mar 25, 2010 at 12:44 pm

    On Thu, Mar 25, 2010 at 11:12:32AM +0000, Tim Bunce wrote:

    How about changing it so you have to explicitly KEEP old uploads
    that appear to have been superseded?
    Why? Is there a problem with the size of the CPAN? Have any mirror
    maintainers complained?

    --
    David Cantrell | even more awesome than a panda-fur coat

    I caught myself pulling grey hairs out of my beard.
    I'm definitely not going grey, but I am going vain.
  • Barbie at Mar 25, 2010 at 1:43 pm

    On Thu, Mar 25, 2010 at 11:12:32AM +0000, Tim Bunce wrote:
    Currently on PAUSE you have to explicitly delete old uploads.
    Which often is a good thing. While BACKPAN exists, it isn't somewhere
    that many go to look for old distributions. For me and probably others,
    BACKPAN-only distributions are ones that have been specifically marked
    by the maintainers as obsolete, badly broken or similar.

    Automatic deletes from CPAN would change that.

    There are many distributions on CPAN where older versions work on a
    particular perl/os but more recent ones don't. Latest isn't necessarily
    the greatest.

    If you are going to perform this then it should really feed off the CPAN
    Testers to know if a specific release has been marked as being the
    latest working release for a particular perl/os.

    I would also suggest extending the timeframe considerably to perhaps 3
    or maybe 5 years.

    Lastly I would also personally be annoyed if only the latest versions
    were available, as I often make great use of the diff tool on
    search.cpan.org. Having only the latest version renders that great tool
    redundant :(
    Files selected in this way would be scheduled to be deleted in a month
    and an email would be sent to the authors, just as if they'd selected
    the files for deletion via PAUSE.
    There are already many authors who have non-responding email addresses
    (I will get around to publicising that list at some point), so some
    will likely disappear down a blackhole. What if you're about to delete a
    set of distributions that should really be kept available? No one would
    be listening to know that it should still be kept.

    I would prefer a suggestion email to authors to delete, rather than an
    email telling them that their distributions will be deleted unless they
    do something.

    Cheers,
    Barbie.
    --
    Birmingham Perl Mongers <http://birmingham.pm.org>
    Memoirs Of A Roadie <http://barbie.missbarbell.co.uk>
    CPAN Testers Blog <http://blog.cpantesters.org>
    YAPC Conference Surveys <http://yapc-surveys.org>
  • Graham Barr at Mar 25, 2010 at 1:47 pm

    On Mar 25, 2010, at 8:42 AM, Barbie wrote:

    Lastly I would also personally be annoyed if only the latest versions
    were available, as I often make great use of the diff tool on
    search.cpan.org. Having only the latest version renders that great tool
    redundant :(
    I use that too :-) and it is very annoying that some authors automatically delete
    previous releases when they upload a new one.

    Graham.
  • David Cantrell at Mar 25, 2010 at 2:13 pm

    On Thu, Mar 25, 2010 at 01:42:58PM +0000, Barbie wrote:

    There are many distributions on CPAN that older versions work on a
    particular perl/os, but more recent ones don't. Latest isn't necessarily
    the greatest.

    If you are going to perform this then it should really feed off the CPAN
    Testers to know if a specific release has been marked as being the
    latest working release for a particular perl/os.
    You just described cpXXXan: http://cpxxxan.barnyard.co.uk/

    --
    David Cantrell | Bourgeois reactionary pig

    You know you're getting old when you fancy the
    teenager's parent and ignore the teenager
    -- Paul M in uknot
  • Jarkko Hietaniemi at Mar 25, 2010 at 3:00 pm
    I have one case where the v1 and v2 of a module are simply
    incompatible, but v1 still works, and unless the users have a
    compelling reason, they won't migrate. Pulling the rug from under
    them would be quite unsportsmanlike.

    Deletion should be opt-in, and there should be a way to "pin" some
    releases as unreapable. And warning emails (yes, some email addresses
    are blackholes) to the author well in advance: "your module X version
    Y will be deleted as you requested in Z weeks because there are P
    newer releases ..."

    --
    There is this special biologist word we use for 'stable'. It is
    'dead'. -- Jack Cohen
  • Chris Nandor at Mar 25, 2010 at 3:14 pm
    What Jarkko said.

    --
    Chris Nandor pudge@pobox.com http://pudge.net/
    Slashdot / Geeknet pudge@slashdot.org http://slashdot.org/
  • Ask Bjørn Hansen at Mar 25, 2010 at 3:10 pm

    On Mar 25, 2010, at 4:12, Tim Bunce wrote:

    Currently on PAUSE you have to explicitly delete old uploads.

    How about changing it so you have to explicitly KEEP old uploads
    that appear to have been superseded?
    I like it.

    I agree with Jarkko that there should be a way to "pin" some versions and the configuration should be "more than N newer releases" or some such.

    I think it should be on by default though. Older than 3 (or 6?) months and at least 2 or 3 (or more?) newer releases or some such.

    For most authors this won't change anything -- but it'll help those who unhelpfully _never_ delete anything.

    On Search CPAN maybe BackPAN could be used to pull in older versions for diffs etc...


    - ask
  • Chris Nandor at Mar 25, 2010 at 3:36 pm

    On Mar 25, 2010, at 08:10, Ask Bjørn Hansen wrote:

    I agree with Jarkko that there should be a way to "pin" some versions and the configuration should be "more than N newer releases" or some such.

    I think it should be on by default though. Older than 3 (or 6?) months and at least 2 or 3 (or more?) newer releases or some such.
    I like that solution better, BUT, there's a significant chance that some things will fall through the cracks (for authors who don't get the notices, for example), and because we put out release software on the CPAN that people rely on, I have to agree with Jarkko and vote to err on the side of safety first.

    I'd rather spend more energy getting people to opt in, than opt them in by default.

    --
    Chris Nandor pudge@pobox.com http://pudge.net/
    Slashdot / Geeknet pudge@slashdot.org http://slashdot.org/
  • Andy Armstrong at Mar 25, 2010 at 3:39 pm

    On 25 Mar 2010, at 15:36, Chris Nandor wrote:
    I like that solution better

    [snip]

    But solution to what? Are we convinced there's actually a problem here?

    --
    Andy Armstrong, Hexten
  • Andy Lester at Mar 25, 2010 at 3:48 pm

    On Mar 25, 2010, at 10:38 AM, Andy Armstrong wrote:

    But solution to what? Are we convinced there's actually a problem here?
    The first two rules of optimization club:

    1) You do not optimize.
    2) You do not optimize without measuring.

    As soon as someone can explain specifics of the problem, including magnitude, I can begin to be concerned.

    xoxo,
    Andy

    --
    Andy Lester => andy@petdance.com => www.theworkinggeek.com => AIM:petdance
  • Ricardo Signes at Mar 25, 2010 at 3:54 pm
    * Andy Armstrong [2010-03-25T11:38:46]
    On 25 Mar 2010, at 15:36, Chris Nandor wrote:
    I like that solution better
    But solution to what? Are we convinced there's actually a problem here?
    I am entirely unconvinced.

    --
    rjbs
  • Ask Bjørn Hansen at Mar 25, 2010 at 3:55 pm

    On Mar 25, 2010, at 8:38, Andy Armstrong wrote:

    I like that solution better

    [snip]

    But solution to what? Are we convinced there's actually a problem here?
    CPAN has almost 200k files. www.cpan.org says there are "17627 modules". rsyncing a gazillion files doesn't work that well (on the server). Helping authors remember to delete things that are now irrelevant from the main CPAN system will make it easier to run mirrors and keep them fresh.


    - ask
  • Eric Wilhelm at Mar 25, 2010 at 8:23 pm
    # from Ask Bjørn Hansen
    # on Thursday 25 March 2010 08:55:
    But solution to what? Are we convinced there's actually a problem
    here?
    CPAN has almost 200k files.  www.cpan.org says there are "17627
    modules".  rsyncing a gazillion files doesn't work that well (on the
    server).  Helping authors remember to delete things that are now
    irrelevant from the main CPAN system will make it easier to run
    mirrors and keep them fresh.
    Maybe CPAN mirrors are more easily updated than via a generic rsync? Is
    the burden only network/cpu for checking whether a bunch of old
    archives have changed, or does disk matter?

    --Eric
    --
    The opinions expressed in this e-mail were randomly generated by
    the computer and do not necessarily reflect the views of its owner.
    --Management
    ---------------------------------------------------
    http://scratchcomputing.com
    ---------------------------------------------------
  • Geoffrey Broadwell at Mar 25, 2010 at 9:08 pm

    On Thu, 2010-03-25 at 13:23 -0700, Eric Wilhelm wrote:
    # from Ask Bjørn Hansen
    # on Thursday 25 March 2010 08:55:
    But solution to what? Are we convinced there's actually a problem
    here?
    CPAN has almost 200k files. www.cpan.org says there are "17627
    modules". rsyncing a gazillion files doesn't work that well (on the
    server). Helping authors remember to delete things that are now
    irrelevant from the main CPAN system will make it easier to run
    mirrors and keep them fresh.
    Maybe CPAN mirrors are more easily updated than via a generic rsync? Is
    the burden only network/cpu for checking whether a bunch of old
    archives have changed, or does disk matter?
    Forgive a lurker, but wasn't that the point of this:

    http://search.cpan.org/~andk/File-Rsync-Mirror-Recent-0.0.7/

    When I saw that announced, I remember thinking "Yay, large archive rsync
    problem solved!" Did it not work out?


    -'f
  • Barbie at Mar 25, 2010 at 10:43 pm

    On Thu, Mar 25, 2010 at 02:08:45PM -0700, Geoffrey Broadwell wrote:

    Forgive a lurker, but wasn't that the point of this:

    http://search.cpan.org/~andk/File-Rsync-Mirror-Recent-0.0.7/

    When I saw that announced, I remember thinking "Yay, large archive rsync
    problem solved!" Did it not work out?
    It currently supports all the fast CPAN mirrors. The CPAN Testers mirror
    is currently 10 seconds behind PAUSE :)

    Cheers,
    Barbie.
    --
    Birmingham Perl Mongers <http://birmingham.pm.org>
    Memoirs Of A Roadie <http://barbie.missbarbell.co.uk>
    CPAN Testers Blog <http://blog.cpantesters.org>
    YAPC Conference Surveys <http://yapc-surveys.org>
  • Eric Wilhelm at Mar 25, 2010 at 10:43 pm
    # from Geoffrey Broadwell
    # on Thursday 25 March 2010 14:08:
    Maybe CPAN mirrors are more easily updated than via a generic rsync?
    Is the burden only network/cpu for checking whether a bunch of old
    archives have changed, or does disk matter?
    Forgive a lurker, but wasn't that the point of this:

    http://search.cpan.org/~andk/File-Rsync-Mirror-Recent-0.0.7/

    When I saw that announced, I remember thinking "Yay, large archive
    rsync problem solved!"  Did it not work out?
    It sounds like it has the tech solved. Now mirror admins just need to
    know about it and how to use it.

    The !!!! PRE-ALPHA ALERT !!!! in the documentation may seem like a
    big stop sign for potential users. But presumably we don't need to do
    anything to PAUSE/the CPAN for admins to quicken their mirror
    process -- just need some feedback, docs, and a frontend for it to gain
    widespread use. Now it's just a simple matter of education. :-D

    --Eric
    --
    If the collapse of the Berlin Wall had taught us anything, it was that
    socialism alone was not a sustainable economic model.
    --Robert Young
    ---------------------------------------------------
    http://scratchcomputing.com
    ---------------------------------------------------
  • David Golden at Mar 25, 2010 at 11:40 pm
    It's a real memory hog. I don't think it needs to be, but haven't had the
    tuits to prove that assertion.

    David

  • Ask Bjørn Hansen at Mar 25, 2010 at 11:53 pm

    On Mar 25, 2010, at 14:08, Geoffrey Broadwell wrote:

    Maybe CPAN mirrors are more easily updated than via a generic rsync? Is
    the burden only network/cpu for checking whether a bunch of old
    archives have changed, or does disk matter?
    Forgive a lurker, but wasn't that the point of this:

    http://search.cpan.org/~andk/File-Rsync-Mirror-Recent-0.0.7/

    When I saw that announced, I remember thinking "Yay, large archive rsync
    problem solved!" Did it not work out?
    Yes - we use that for some/most of the central mirrors; but the other several thousand mirrors don't use it (for various good reasons).

    It also (currently) doesn't support tiered mirrors; and it's not used for "CPAN"; only for the "PAUSE data" -- again for various good reasons.



    - ask
  • Ask Bjørn Hansen at Mar 25, 2010 at 11:55 pm

    On Mar 25, 2010, at 13:23, Eric Wilhelm wrote:

    Maybe CPAN mirrors are more easily updated than via a generic rsync? Is
    the burden only network/cpu for checking whether a bunch of old
    archives have changed, or does disk matter?
    Most CPAN mirrors use rsync. It's not realistic to make them change that ("Hello all mirror operators -- so that tool that you use for ALL YOUR MIRRORS; well ... maybe you can use something else for us?").

    rsync is all disk i/o -- relatively negligible network and CPU.


    - ask
  • Eric Wilhelm at Mar 26, 2010 at 12:15 am
    # from Ask Bjørn Hansen
    # on Thursday 25 March 2010 16:55:
    Most CPAN mirrors use rsync.  It's not realistic to make them change
    that ("Hello all mirror operators -- so that tool that you use for
    ALL YOUR MIRRORS; well ... maybe you can use something else for
    us?").

    rsync is all disk i/o -- relatively negligible network and CPU.
    If you're concerned about the load on upstream mirrors, is it possible
    that the rsync daemon running there knows about these things (e.g.
    in-memory or otherwise cached list of timestamps/checksums) without the
    disk activity?

    --Eric
    --
    A counterintuitive sansevieria trifasciata was once literalized
    guiltily.
    --Product of Artificial Intelligence
    ---------------------------------------------------
    http://scratchcomputing.com
    ---------------------------------------------------
  • Adam Kennedy at Mar 26, 2010 at 4:36 am
    What he said.

    Most people don't mirror CPAN. They mirror many things.

    This is the same reason we've struggled with statistics. How do you
    ask someone mirroring three dozen different things to put in a special
    log-munging tool just for us?

    Adam K
    On Fri, Mar 26, 2010 at 10:55 AM, Ask Bjørn Hansen wrote:
    Most CPAN mirrors use rsync.  It's not realistic to make them change that ("Hello all mirror operators -- so that tool that you use for ALL YOUR MIRRORS; well ... maybe you can use something else for us?").
  • Geoffrey Broadwell at Mar 26, 2010 at 6:38 am

    On Fri, 2010-03-26 at 15:36 +1100, Adam Kennedy wrote:
    What he said.

    Most people don't mirror CPAN. They mirror many things.

    This is the same reason we've struggled with statistics. How do you
    ask someone mirroring three dozen different things to put in a special
    log-munging tool just for us?
    By producing a tool that is compatible with rsync (or a fork, or a
    patch, whatever) that *also* does the extra stuff you want, and selling
    it as the bee's knees for mirror admins, so that they have a good reason
    to use it for all their mirroring needs instead of the unimproved rsync.

    "It slices, it dices, it makes julienne mirrors!"


    -'f
  • Nadim khemir at Mar 25, 2010 at 8:55 pm

    On Mar 25, 2010, at 8:38, Andy Armstrong wrote:

    I like that solution better

    [snip]

    But solution to what? Are we convinced there's actually a problem here?
    CPAN has almost 200k files. www.cpan.org says there are "17627 modules".
    rsyncing a gazillion files doesn't work that well (on the server). Helping
    authors remember to delete things that are now irrelevant from the main CPAN
    system will make it easier to run mirrors and keep them fresh.

    - ask
    So the problem is not a 'purging' problem (which a few confused with
    deleting modules) but rather a synchronization problem between the CPAN
    mirrors. I think we all agree that all modules should be kept safely
    somewhere, but only a few need to be synchronized to all the mirrors.

    cpan, cpanp and cpanm (others?) could have an older_versions_url_list
    that would be used if the module version is not part of what the
    author/community wants mirrored. Very old versions are, I think, seldom
    asked for (something that would need figures to confirm).

    Also, I'd bet that 95% of Perl users don't know what BACKPAN is.


    Nadim.
  • Lars Thegler at Mar 26, 2010 at 9:55 am

    On Thu, Mar 25, 2010 at 4:55 PM, Ask Bjørn Hansen wrote:
    On Mar 25, 2010, at 8:38, Andy Armstrong wrote:

    I like that solution better
    [snip]

    But solution to what? Are we convinced there's actually a problem here?
    CPAN has almost 200k files.  www.cpan.org says there are "17627 modules".  rsyncing a gazillion files doesn't work that well (on the server).  Helping authors remember to delete things that are now irrelevant from the main CPAN system will make it easier to run mirrors and keep them fresh.
    I appreciate that the number of files on CPAN has implications for the
    infrastructure, but I feel a need to have some more factual info
    before conceding to such measures.

    Also, having _software_ determine what is 'irrelevant' is a dangerous
    path indeed.

    One of the strengths of CPAN is the low barrier of entry. If we lower
    the barrier of exit, I'm not at all convinced we end up in a
    significantly better place.

    /Lars
  • Andy Lester at Mar 26, 2010 at 4:02 pm

    On Mar 26, 2010, at 4:55 AM, Lars Thegler wrote:

    I appreciate that the number of files on CPAN has implications for the
    infrastructure, but I feel a need to have some more factual info
    before conceding to such measures.
    Absolutely. This factual info would ideally look like this:

    "Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners."

    Without that, how can module authors be bothered to care?


    xoxo,
    Andy


    --
    Andy Lester => andy@petdance.com => www.theworkinggeek.com => AIM:petdance
  • Arthur Corliss at Mar 26, 2010 at 5:20 pm

    On Fri, 26 Mar 2010, Andy Lester wrote:

    Absolutely. This factual info would ideally look like this:

    "Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners."

    Without that, how can module authors be bothered to care?
    If you don't mind me interjecting, I still can't be bothered to care. We
    have basically a 12GB data set, and we're worried about that? I see that as
    a small barrier to bringing on new mirrors on constrained pipes, but
    ultimately that's not that big a deal. Hell, there are single versions of
    some Linux distros that are bigger than that.

    End sum: I personally don't think this is the most pressing issue facing
    CPAN. Just issue a best practices guide to all the module authors (or
    include it as on-line documentation in PAUSE) and be done with it.

    --Arthur Corliss
    Live Free or Die
  • Jarkko Hietaniemi at Mar 26, 2010 at 10:43 pm

    On Friday-201003-26 13:20, Arthur Corliss wrote:
    On Fri, 26 Mar 2010, Andy Lester wrote:

    Absolutely. This factual info would ideally look like this:

    "Of the 17,000 distros on CPAN, there are 8,000 that have versions more than a year older than the most recent one. If those distros with versions more than a year out of date were purged, the number of files would decrease from 200,000 to 120,000. This would save 7GB out of the 12GB that a full CPAN mirror takes now. Removing that 7GB would mean Benefit X to mirror owners."

    Without that, how can module authors be bothered to care?
    If you don't mind me interjecting, I still can't be bothered to care. We
    have basically a 12GB data set, and we're worried about that? I see that a
    small barrier to bringing on new mirrors on constrained pipes, but
    ultimately that's not that big a deal. Hell, there's single versions of
    some Linux distros that are bigger than that.
    The total size is not the problem. The number of files is. Vanilla
    rsync is horribly inefficient (not the protocol, which is genius, mind)
    because a client coming by and asking for updates basically ends up
    requiring the moral equivalent of
    "find . -type f -print". Let me repeat that: each client. Not fun.
  • Arthur Corliss at Mar 26, 2010 at 11:03 pm

    On Fri, 26 Mar 2010, Jarkko Hietaniemi wrote:

    The total size is not the problem. The number of files is. Vanilla
    rsync is horribly inefficient (not the protocol, which is genius, mind)
    because a client coming by and asking for updates basically ends up
    requiring the moral equivalent of
    "find . -type f -print". Let me repeat that: each client. Not fun.
    Why use rsync, then? Why not have checkpointed logs on cpan with
    additions/removals logged by date so you can roll forward on the client,
    processing only those files? It would be trivial to set up and a lot more
    efficient.
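
A minimal sketch of the roll-forward idea, assuming a hypothetical log format in which each entry is `(sequence, action, path)` with `action` being `"add"` or `"remove"`, and entries arrive in order. No such log is confirmed to exist on CPAN or PAUSE; the sequence number could equally be a timestamp.

```python
from typing import Iterable, Set, Tuple

# Hypothetical log entry: (sequence number or timestamp, action, path)
LogEntry = Tuple[int, str, str]


def roll_forward(log: Iterable[LogEntry],
                 checkpoint: int) -> Tuple[int, Set[str], Set[str]]:
    """Replay entries newer than `checkpoint`.

    Returns the new checkpoint plus the sets of paths the mirror should
    fetch and delete. Only files named in the log are touched, so the
    client never has to walk the whole tree.
    """
    to_fetch: Set[str] = set()
    to_delete: Set[str] = set()
    newest = checkpoint
    for seq, action, path in log:
        if seq <= checkpoint:
            continue                  # already applied on an earlier run
        if action == "add":
            to_delete.discard(path)   # a later add cancels a pending delete
            to_fetch.add(path)
        elif action == "remove":
            to_fetch.discard(path)    # no point fetching a file that is gone
            to_delete.add(path)
        else:
            raise ValueError(f"unknown action: {action!r}")
        newest = max(newest, seq)
    return newest, to_fetch, to_delete
```

The client would store the returned checkpoint after a successful run and pass it back in next time.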

    --Arthur Corliss
    Live Free or Die
  • Jarkko Hietaniemi at Mar 26, 2010 at 11:07 pm

    On Friday-201003-26 19:02, Arthur Corliss wrote:
    On Fri, 26 Mar 2010, Jarkko Hietaniemi wrote:

    The total size is not the problem. The number of files is. Vanilla
    rsync is horribly inefficient (not the protocol, which is genius, mind)
    because a client coming by and asking for updates basically ends up
    requiring the moral equivalent of
    "find . -type f -print". Let me repeat that: each client. Not fun.
    Why use rsync, then? Why not have checkpointed logs on cpan with
    additions/removals logged by date so you can roll forward on the client,
    processing only those files? It would be trivial to set up and a lot more
    efficient.
    We await your implementation breathlessly. By the time all the CPAN
    mirrors have started using that, we probably will be rather blue in
    the face.
  • Arthur Corliss at Mar 26, 2010 at 11:33 pm

    On Fri, 26 Mar 2010, Jarkko Hietaniemi wrote:

    We await your implementation breathlessly. By the time all the CPAN mirrors
    have started using that, we probably will be rather blue in
    the face.
    Now, let's not be that way. :-) You need to pick your problem domain. You
    guys can try to go through a lot of machinations to establish storage
    policies which account for the million corner cases necessary to support all
    the various versions of libraries & perl, and are relatively painless to
    implement without raising the ire of all the contributors.... or just
    improve the efficiency of synchronizing the mirrors.

    <G> I know what sounds a hell of a lot easier and faster to me... *Really*
    fast for anyone familiar with the PAUSE code base.

    Rsync by itself is definitely a bad idea for the number of files, I agree
    whole-heartedly. But it's the weakest and simplest link to replace.

    Would I be happy to help? Sure. But do I feel like diving into a
    foreign code base all by myself? No. I don't have that many spare cycles.

    --Arthur Corliss
    Live Free or Die
  • Andy Armstrong at Mar 27, 2010 at 7:45 am

    On 26 Mar 2010, at 23:32, Arthur Corliss wrote:
    But it's the weakest and simplest link to replace.

    Quite a bit of the discussion here on this topic has revolved around an explanation of why that isn't the case. Setting up rsync is trivial for mirror operators. Any alternative would likely be less so.

    --
    Andy Armstrong, Hexten
  • Ask Bjørn Hansen at Mar 26, 2010 at 11:44 pm

    On Mar 26, 2010, at 16:02, Arthur Corliss wrote:

    Why use rsync, then? Why not have checkpointed logs on cpan with
    additions/removals logged by date so you can roll forward on the client,
    processing only those files? It would be trivial to set up and a lot more
    efficient.

    I find it curious that everyone who's actually involved in syncing the files or running mirror servers seems to think it generally sounds like a good idea, and everyone who isn't says it's "not worth the effort".

    Anyway -- we have some other ideas for cutting down the number of files that we already agreed on but that just need an announcement (which I promised to write up, oops). No, I'm not going to make Tim's mistake and suggest it here first.

    Tim: Next time just get the paint in your preferred color. :-)


    - ask
  • Arthur Corliss at Mar 27, 2010 at 12:23 am
  • Jan Dubois at Mar 27, 2010 at 12:54 am

    On Fri, 26 Mar 2010, Arthur Corliss wrote:
    But what the hell do I know. I don't run a *CPAN* mirror, so I must be
    freaking clueless...
    It's not about what you know, but about what you are willing to
    do yourself.

    At some point you have to accept that the people who *do* the work
    decide *how* they do it.

    There is not much point in just telling volunteers that they should
    not be doing something but instead be doing something else if you are
    not willing to take the burden of doing this other thing yourself.

    Volunteers are not free labor that the talking masses can direct with
    majority votes. :)

    Cheers,
    -Jan
  • Elaine Ashton at Mar 27, 2010 at 12:59 am

    On Mar 26, 2010, at 8:23 PM, Arthur Corliss wrote:

    Sure, I don't run a CPAN mirror, but I do manage many, many terabytes of
    storage as part of my day job. I think it's a tad presumptuous to disregard
    input just because we're not in your inner sanctum. As I mentioned in a
    follow up e-mail: this is simply a matter of selecting the correct problem
    domain. I believe that streamlining the mirroring process will provide
    greater gains for less effort.

    That's not to say that pursuing other efficiencies isn't worthwhile, just
    that you need to prioritize.

    But what the hell do I know. I don't run a *CPAN* mirror, so I must be
    freaking clueless...
    Oh, don't be such a drama queen. I rebuilt and helped run nic.funet.fi for 2 years, which is the canonical mirror for a large number of mirrors, and the perspective of having a few terabytes spinning in storage changes quite dramatically when you are actually serving a few terabytes to thousands of clients. CPAN grew to be quite a burden on the site, not only because of the high demand but also because of the multitude of small files, and I'm sure other mirrors feel similarly burdened.

    The sort of pruning Tim brought up has long been an idea, but with the current and growing size of the archive, something does need to be done to alleviate the burden not only on the canonical mirrors, but also on the random folks who want to grab a local mirror for themselves. In my present work environment, 12GB isn't a lot of disk space, but it's a lot considering I don't need to install perl modules daily and the vast majority of it I'll likely never use. It would be a kindness to both the mirror operators and to the end-users to trim it down to a manageable size.

    As for efficiency, rsync remains a good tool for the job that works on nearly every platform which is a rather tall order to match with any other solution. Relegating the cruft to BackPAN to make the current CPAN slimmer and less demanding on all fronts is an idea that would be welcomed by more than just mirror ops.

    The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was), but I think it's rare and the practice should be discouraged.

    e.
  • Andy Armstrong at Mar 27, 2010 at 7:49 am

    On 27 Mar 2010, at 00:59, Elaine Ashton wrote:
    The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was), but I think it's rare and the practice should be discouraged.

    Maybe that could be solved by having the clients (and maybe search.cpan.org) automagically fall back to a backpan mirror?

    And, yes, if it's considered a good idea I /am/ prepared to do something about it.

    --
    Andy Armstrong, Hexten
  • Nadim khemir at Mar 27, 2010 at 10:29 am
    On 27 Mar 2010, at 00:59, Andy Armstrong wrote:
    On 27 Mar 2010, at 00:59, Elaine Ashton wrote:
    The only snag I can foresee in trimming back on the abundance of modules is
    the case where some modules have version requirements for other modules where
    it will barf with a mismatch/newer version of the required module (I bumped
    into this recently but can't remember exactly which module it was), but I think
    it's rare and the practice should be discouraged.

    Maybe that could be solved by having the clients (and maybe search.cpan.org)
    automagically fall back to a backpan mirror?
    And, yes, if it's considered a good idea I am prepared to do something about
    it.

    Exactly what I wrote in my previous mail; nobody commented, so I was wondering
    if I was wrong!

    In any case, we do now have a better understanding of the problem and, most
    importantly, we have a "real" user (Elaine) wishing for something to be done.

    Andreas, Chris, Tatsuhiko and others have done a tremendous job implementing
    stuff, but I must admit that I would have liked to see a list of what they are
    implementing, not to mention a context diagram. IMVHO the
    first thing we should do is have a requirement list of what CPAN actors
    (clients, PAUSE, mirrors, search engines, ...) should do. Maybe that document
    already exists somewhere.

    The implications for CPAN, ExtUtils, Module::Build, and all the other,
    still unknown, modules are, I believe, not to be underestimated.

    Andy (since you are the first to really volunteer (and now you don't have any
    choice anymore;)), count me in for whatever development time is needed to get
    things moving.

    Ask, this thread is getting a tad long, and although I'm very happy to see more
    input, requirements and ideas, would it be possible to see some condensed
    results somewhere?

    Cheers, Nadim.
  • Arthur Corliss at Mar 27, 2010 at 6:52 pm

    On Fri, 26 Mar 2010, Elaine Ashton wrote:

    Oh, don't be such a drama queen. I rebuilt and helped run nic.funet.fi for 2 years, which is the canonical mirror for a large number of mirrors, and the perspective of having a few terabytes spinning in storage changes quite dramatically when you are actually serving a few terabytes to thousands of clients. CPAN grew to be quite a burden on the site, not only because of the high demand but also because of the multitude of small files, and I'm sure other mirrors feel similarly burdened.
    Don't be such an arrogant prick. You guys made baseless assumptions about
    people's experience with storage management in an attempt to disregard their
    opinions. That's being a dick by any metric.
    The sort of pruning Tim brought up has long been an idea, but with the current and growing size of the archive, something does need to be done to alleviate the burden not only on the canonical mirrors, but also on the random folks who want to grab a local mirror for themselves. In my present work environment, 12GB isn't a lot of disk space, but it's a lot considering I don't need to install perl modules daily and the vast majority of it I'll likely never use. It would be a kindness to both the mirror operators and to the end-users to trim it down to a manageable size.
    I think I was quite explicit in saying that efficiencies should be pursued
    in multiple areas, but the predominant bitch I took away from your thread
    dealt with the burden of synchronizing mirrors. What's the easiest way to
    address that pain? I don't believe it's your method. I'd look into the
    size issue *after* you address the incredible inefficiencies of a simple
    rsync.
    As for efficiency, rsync remains a good tool for the job that works on nearly every platform which is a rather tall order to match with any other solution. Relegating the cruft to BackPAN to make the current CPAN slimmer and less demanding on all fronts is an idea that would be welcomed by more than just mirror ops.
    Rsync is an excellent tool for smaller file sets. I use it to sync my own
    mirrors; those mirrors are typically ~10k files. Am I surprised that it
    doesn't scale when you're stat'ing every single file? No. Which is why
    alternatives should be considered. A simple FTP client playing a
    transaction log forward is trivial.

    I maintain several mirrors, most with rsync. But that's with a clear
    understanding of the size of the file set. Use the right tool for the job.
    And it seems apparent to me that rsync isn't the right tool for ~200k files.
    The only snag I can foresee in trimming back on the abundance of modules is the case where some modules have version requirements for other modules where it will barf with a mismatch/newer version of the required module (I bumped into this recently but can't remember exactly which module it was), but I think it's rare and the practice should be discouraged.
    Try doing a simple cost-benefit analysis. What you guys are proposing will
    help. But not as much as simpler alternatives. Like replacing rsync with a
    perl script and modifying PAUSE to log the transactions.
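
    As a rough illustration of the "play a transaction log forward" idea, here is a minimal sketch. It is written in Python purely for illustration; the proposal above is for a perl script. The `LogEntry` record, its `stamp`/`op`/`path` fields and the `roll_forward` function are all hypothetical names, and nothing here reflects an actual PAUSE log format. The client remembers the stamp of the last entry it processed and only looks at newer entries, so it never has to stat the whole tree.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class LogEntry:
    stamp: int  # hypothetical: e.g. epoch seconds or a sequence number
    op: str     # "add" or "remove"
    path: str   # path relative to the mirror root


def roll_forward(entries: Iterable[LogEntry],
                 checkpoint: int) -> Tuple[Set[str], Set[str], int]:
    """Return (paths to fetch, paths to delete, new checkpoint).

    Only entries newer than `checkpoint` are considered. Later entries
    for the same path override earlier ones, so a file that was added
    and then removed is never downloaded.
    """
    to_fetch: Set[str] = set()
    to_delete: Set[str] = set()
    last = checkpoint
    for entry in sorted(entries, key=lambda e: e.stamp):
        if entry.stamp <= checkpoint:
            continue  # already processed in an earlier run
        if entry.op == "add":
            to_fetch.add(entry.path)
            to_delete.discard(entry.path)
        elif entry.op == "remove":
            to_delete.add(entry.path)
            to_fetch.discard(entry.path)
        else:
            raise ValueError(f"unknown operation: {entry.op!r}")
        last = entry.stamp
    return to_fetch, to_delete, last


log = [
    LogEntry(1, "add", "authors/id/A/AA/Foo-1.00.tar.gz"),
    LogEntry(2, "add", "authors/id/A/AA/Foo-1.01.tar.gz"),
    LogEntry(3, "remove", "authors/id/A/AA/Foo-1.00.tar.gz"),
]
fetch, delete, checkpoint = roll_forward(log, checkpoint=1)
```

    The transfer itself (FTP, HTTP or anything else) is deliberately left out; the sketch only shows how a checkpoint turns a full-tree comparison into work proportional to the number of changes.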

    --Arthur Corliss
    Live Free or Die
  • Nicholas Clark at Mar 27, 2010 at 7:33 pm

    On Sat, Mar 27, 2010 at 10:52:05AM -0800, Arthur Corliss wrote:

    I think I was quite explicit in saying that efficiencies should be pursued
    in multiple areas, but the predominant bitch I took away from your thread
    dealt with the burden of synchronizing mirrors. What's the easiest way to
    address that pain? I don't believe it's your method. I'd look into the
    size issue *after* you address the incredible inefficiencies of a simple
    rsync.
    "I"

    You?

    Or someone else?


    I am quite happy to agree that your understanding and experience of storage
    management is better than mine. But that's not the key question, in a
    volunteer organisation. The questions I ask, repeating Jan's comments in
    another message, are.

    Nicholas Clark
  • Arthur Corliss at Mar 27, 2010 at 7:52 pm

    On Sat, 27 Mar 2010, Nicholas Clark wrote:

    "I"

    You?

    Or someone else?


    I am quite happy to agree that your understanding and experience of storage
    management is better than mine. But that's not the key question, in a
    volunteer organisation. The questions I ask, repeating Jan's comments in
    another message, are.
    Oh, I understand that fully. And I'd be happy to lend some of my time. But
    you don't make people inclined to help when people are lobbing snarky
    comments like "we'll wait breathlessly for you to do it." The impression
    I'm getting from most of you right now is that you're hell bent on solving
    the problem your way, and no one is interested in exploring the technical
    merits of other approaches.

    Hell, I would even help with work towards your desired method *if* I thought
    that was the consensus after a genuine exchange and consideration of ideas.
    I definitely won't should it appear that we have some kind of elitist cabal
    that will make their decision in isolation. If that's going to be the case
    then this should have never been raised on an open forum like the module
    author's list.

    Quite frankly, at times some discussions on this list fail the concept of a
    technical meritocracy, and tend towards an established aristocracy.

    --Arthur Corliss
    Live Free or Die
  • Jarkko Hietaniemi at Mar 27, 2010 at 9:41 pm

    Oh, I understand that fully. And I'd be happy to lend some of my time. But
    you don't make people inclined to help when people are lobbing snarky
    comments like "we'll wait breathlessly for you to do it."
    The time-honored tradition of many open source communities is to talk.
    And talk. And talk. The problem is that this solves nothing. To do, does.

    You are free to decide to take this as a personal insult.
  • Arthur Corliss at Mar 28, 2010 at 12:45 am

    On Sat, 27 Mar 2010, Jarkko Hietaniemi wrote:

    The time-honored tradition of many open source communities is to talk. And
    talk. And talk. The problem is that this solves nothing. To do, does.

    You are free to decide to take this as a personal insult.
    I didn't take it as an insult, I took it as what it was -- a dodge. You
    already have your minds made up and are not willing to evaluate options
    on their merits.

    Let's just be honest about what's going on here.

    --Arthur Corliss
    Live Free or Die
  • Andreas J. Koenig at Mar 28, 2010 at 4:07 am

    On Sat, 27 Mar 2010 16:44:49 -0800 (AKDT), Arthur Corliss said:
    On Sat, 27 Mar 2010, Jarkko Hietaniemi wrote:
    The time-honored tradition of many open source communities is to
    talk. And talk. And talk. The problem is that this solves nothing.
    To do, does.
    You are free to decide to take this as a personal insult.
    I didn't take it as an insult, I took it as what it was -- a dodge. You
    already have your minds made up and are not willing to evaluate options
    on their merits.
    Says the author of a module named Paranoid. A lovely coincidence.
    Let's just be honest about what's going on here.
    If you want to study the CPAN "checkpointed logs" solution running on
    the very CPAN for exactly one year now: File::Rsync::Mirror::Recent

    What needs to be done is really extremely trivial: rewrite it in C and
    convince the rsync people to include it in the rsync code base. Just that.

    So are you a taker, Arthur?

    --
    andreas
  • Eric Wilhelm at Mar 28, 2010 at 8:08 am
    # from Andreas J. Koenig
    # on Saturday 27 March 2010 21:02:
    If you want to study the CPAN "checkpointed logs" solution running on
    the very CPAN for exactly one year now: File::Rsync::Mirror::Recent

    What needs to be done is really extremely trivial: rewrite it in C and
    convince the rsync people to include it in the rsync code base. Just that.
    Or even write an rsync daemon (or proxy perhaps) in Perl. So, when the
    client asks for a file, you can answer without checking the disk. Can
    something like that work with an unmodified client, or does the amount
    of data needed to answer a naive client overwhelm any potential gain?

    Unfortunately the protocol is not formally documented and the perl code
    I've seen (File::RsyncP) seems to be lagging:

    http://lists.samba.org/archive/rsync/2008-October/021912.html

    If it's possible for a mirror operator to install something that will
    immediately save them a ton of disk I/O without any changes upstream or
    downstream, then the person who makes the decision (and does the work)
    gets the benefit. Scenarios where authors or downstream mirrors must
    do something special are a tougher sell.

    --Eric
    --
    Turns out the optimal technique is to put it in reverse and gun it.
    --Steven Squyres (on challenges in interplanetary robot navigation)
    ---------------------------------------------------
    http://scratchcomputing.com
    ---------------------------------------------------
  • Adam Kennedy at Mar 31, 2010 at 2:04 am
    I've said nothing till now, because I figured more noise wouldn't help much.

    But I quite like the rsync daemon/proxy idea, and as it so happens I'm
    attending the OzLabs Unconference in 3 weeks' time to hang out with
    Tridge, Rusty and the other Australian C/Kernel/Samba/RSync elites.

    So I'd be happy to raise any issues or ideas in this area with them in
    person over beers.

    Adam K
    On Sun, Mar 28, 2010 at 7:08 PM, Eric Wilhelm wrote:
    Or even write an rsync daemon (or proxy perhaps) in Perl.  So, when the
    client asks for a file, you can answer without checking the disk.  Can
    something like that work with an unmodified client, or does the amount
    of data needed to answer a naive client overwhelm any potential gain?

    Unfortunately the protocol is not formally documented and the perl code
    I've seen (File::RsyncP) seems to be lagging:
  • Nicholas Clark at Mar 31, 2010 at 10:11 am

    On Wed, Mar 31, 2010 at 01:03:51PM +1100, Adam Kennedy wrote:
    I've said nothing till now, because I figured more noise wouldn't help much.

    But I quite like the rsync daemon/proxy idea, and as it so happens I'm
    attending the OzLabs Unconference in 3 weeks' time to hang out with
    Tridge, Rusty and the other Australian C/Kernel/Samba/RSync elites.

    So I'd be happy to raise any issues or ideas in this area with them in
    person over beers.
    I can see two possibly useful things (and I have no idea if either is yet
    possible, or a great understanding of how the protocol works)

    1: stateful rsync daemon which doesn't scan all the time, either by
    a: Actually having a means to update
    b: Simply telling fibs, and pretending that the file system it scanned
    $n minutes ago is still current. (Which I think would work, at least for
    a mirror where files aren't edited (much) - if the server discovers that
    the client's view of that file *is* out of date, then scan that file for
    real, and give the up to date truth)

    2: federated (or federate-able) server (or proxy) - so that you can say
    "hand this subtree off to that other server"
    This would allow the (fast, existing, C) rsync server to serve most of
    (say) funet.fi, handing off to a stateful server for the CPAN subtree.
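
    Option 1b above could be sketched roughly as follows. This is a toy model in Python, not rsync code, and it says nothing about whether the rsync protocol allows it. `CachedIndex`, `lookup`, `max_age` and the `(size, mtime)` tuple are hypothetical names chosen for illustration. The server answers from a remembered scan while the entry is younger than `max_age` and the client's view agrees with it; only on disagreement or expiry does it stat that one file for real.

```python
import os
import time
from typing import Callable, Dict, Optional, Tuple

Meta = Tuple[int, int]  # (size in bytes, mtime in whole seconds)


def stat_meta(path: str) -> Optional[Meta]:
    """Touch the disk for real; None means the file does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_size, int(st.st_mtime))


class CachedIndex:
    """Answer metadata queries from memory, re-checking disk only when needed."""

    def __init__(self,
                 max_age: float = 600.0,
                 stat: Callable[[str], Optional[Meta]] = stat_meta,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age
        self._stat = stat
        self._clock = clock
        self._cache: Dict[str, Tuple[Meta, float]] = {}
        self.real_stats = 0  # how often we actually went to disk

    def _refresh(self, path: str) -> Optional[Meta]:
        self.real_stats += 1
        meta = self._stat(path)
        if meta is None:
            self._cache.pop(path, None)
        else:
            self._cache[path] = (meta, self._clock())
        return meta

    def lookup(self, path: str, client_meta: Optional[Meta]) -> Optional[Meta]:
        """Return the server's view of `path`, given the client's view."""
        entry = self._cache.get(path)
        if entry is not None:
            meta, seen_at = entry
            fresh = self._clock() - seen_at <= self._max_age
            if fresh and meta == client_meta:
                # The "fib": client and cache agree, so skip the disk.
                return meta
        # Cache miss, expired entry, or disagreement: check for real.
        return self._refresh(path)
```

    The trade-off is the one named above: a change on disk can go unnoticed for up to `max_age` seconds when the client already holds the old version, which seems tolerable for a mirror where files are rarely edited in place.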

    Nicholas Clark
