On Thu, Apr 01, 2010 at 12:39:27AM -0400, David Nicol wrote:
> On Wed, Mar 31, 2010 at 7:43 AM, Ask Bjørn Hansen wrote:
> > The main point here is that we can't use 20 inodes per distribution.
>
> so don't. How much reengineering would be needed to keep CPAN in a
> database instead of a file system?

Random thoughts...

* If you squint a little you can view git as a database with excellent
replication support.

* cpanminus already supports installing from a git repo.

* For backwards compatibility a simple perl web server could provide a
classic CPAN http mirror 'view' over a git repo like gitpan.
This cpan-git-server would create and serve up cached distro tarballs on demand.
Someone could whip up one to work over gitpan as a proof of concept.

* The need for widespread mirroring is less significant than it was in
years past. (Also using git as the inter-mirror transport of source files
means there'll be much less traffic between mirrors. Effectively only
the diffs between releases.)

* New approaches to replication, such as git, don't have to be supported
by existing mirror providers. A new set of cpan-git-mirror providers could
emerge.

* Any cpan-git-mirror provider running a cpan-git-server could be
included in the list of mirrors used by existing installers.

* Over time the number of cpan-git-mirror's and cpan-git-server's could
grow and the number of traditional CPAN ftp/rsync mirrors could fall.

Tim.
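
The "cached distro tarballs on demand" idea above can be sketched roughly as follows. This is only an illustration, not an existing cpan-git-server: the function name `serve_tarball` and the injected `build` callback (which would export a release from the git repo) are assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Callable


def serve_tarball(cache_dir: Path, tarball_name: str,
                  build: Callable[[str], bytes]) -> bytes:
    """Return the tarball bytes, building and caching them on first request."""
    cached = cache_dir / tarball_name
    if not cached.exists():
        # Cache miss: ask the (hypothetical) builder to export the release
        # from the git repo, then keep the result for later requests.
        cached.write_bytes(build(tarball_name))
    return cached.read_bytes()


# Stand-in builder that records how often it is called.
calls = []


def fake_build(name: str) -> bytes:
    calls.append(name)
    return b"tarball for " + name.encode()


with tempfile.TemporaryDirectory() as tmp:
    first = serve_tarball(Path(tmp), "Foo-Bar-1.00.tar.gz", fake_build)
    second = serve_tarball(Path(tmp), "Foo-Bar-1.00.tar.gz", fake_build)
```

The second request is answered from the cache, so the builder runs only once per release.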

  • Nicholas Clark at Apr 1, 2010 at 4:24 pm

    On Thu, Apr 01, 2010 at 03:50:49PM +0100, Tim Bunce wrote:

    > * The need for widespread mirroring is less significant than it was in
    > years past. (Also using git as the inter-mirror transport of source files
    > means there'll be much less traffic between mirrors. Effectively only
    > the diffs between releases.)

    [Really really not relevant now, but somehow I still feel the urge to note it]

    but *some* mirroring is still necessary. If a client assumes that the
    master repository is up at all times, so that it can always pull from a
    particular tag corresponding to a release, then that client is going to fail
    sooner or later.

    Nicholas Clark
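
The point about not depending on a single master can be sketched as a client that tries each source in turn. This is a minimal illustration only: the host names and the `fetch` callback are hypothetical stand-ins, not part of any existing installer.

```python
from typing import Callable, Iterable


def fetch_with_fallback(sources: Iterable[str],
                        fetch: Callable[[str], bytes]) -> bytes:
    """Try each source in order and return the first successful result."""
    errors = []
    for source in sources:
        try:
            return fetch(source)
        except OSError as exc:
            # Remember why this source failed, then move on to the next one.
            errors.append(f"{source}: {exc}")
    raise OSError("all sources failed: " + "; ".join(errors))


def fake_fetch(source: str) -> bytes:
    # Stand-in for a real network fetch: the master is simulated as down.
    if source == "master.example.org":
        raise ConnectionError("master is down")
    return b"release data from " + source.encode()


data = fetch_with_fallback(
    ["master.example.org", "mirror1.example.org"], fake_fetch
)
```
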
  • Michael G Schwern at Apr 1, 2010 at 5:11 pm

    On Thu, Apr 1, 2010 at 7:50 AM, Tim Bunce wrote:
    > On Thu, Apr 01, 2010 at 12:39:27AM -0400, David Nicol wrote:
    > > On Wed, Mar 31, 2010 at 7:43 AM, Ask Bjørn Hansen wrote:
    > > > The main point here is that we can't use 20 inodes per distribution.
    > >
    > > so don't. How much reengineering would be needed to keep CPAN in a
    > > database instead of a file system?
    >
    > Random thoughts...

    FWIW I've had similar thoughts. I was discussing them with David
    Wheeler in relation to the proposed PgAN (Postgres).

    > * If you squint a little you can view git as a database with excellent
    > replication support.

    For bonus points, it's smaller. A bare gitpan is 5 gigs. BackPAN is 14.

    > * cpanminus already supports installing from a git repo.
    >
    > * For backwards compatibility a simple perl web server could provide a
    > classic CPAN http mirror 'view' over a git repo like gitpan.
    > This cpan-git-server would create and serve up cached distro tarballs on demand.
    > Someone could whip up one to work over gitpan as a proof of concept.

    It's potentially even simpler over gitpan as github will produce
    tarballs. You just need to map the URLs. I say potentially because
    github will produce a tarball named after the commit checksum, not the
    tag. Something I've been on them to fix.

    > * The need for widespread mirroring is less significant than it was in
    > years past. (Also using git as the inter-mirror transport of source files
    > means there'll be much less traffic between mirrors. Effectively only
    > the diffs between releases.)

    Not being a sysadmin, this is my gut feeling. Relative to hard drive
    prices, CPAN (hell, BackPAN) has shrunk. I'd imagine the same to be
    the case relative to network capacity.

    > * New approaches to replication, such as git, don't have to be supported
    > by existing mirror providers. A new set of cpan-git-mirror providers could
    > emerge.
    >
    > * Any cpan-git-mirror provider running a cpan-git-server could be
    > included in the list of mirrors used by existing installers.
    >
    > * Over time the number of cpan-git-mirror's and cpan-git-server's could
    > grow and the number of traditional CPAN ftp/rsync mirrors could fall.

    The central thesis is correct, git provides a very simple, very
    compact database that sorts things by version and by distribution.
    The downside is CPAN doesn't really do things by distribution, so that
    would have to be worked out. IMO this is a Good Thing that needs to
    be done.

    See http://use.perl.org/~schwern/journal/40014 for gitpan's issues
    with identifying distributions.
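
The "map the URLs" step might look something like the sketch below. It is an assumption-laden illustration: the URL template and host `git.example.org` are hypothetical, and the regular expression only handles tidy `Dist-Name-1.23.tar.gz` file names. Real CPAN file names are messier, which is the distribution-identification problem mentioned above.

```python
import re
from typing import NamedTuple


class Release(NamedTuple):
    author: str
    dist: str
    version: str


# Matches classic mirror paths such as authors/id/T/TI/TIMB/DBI-1.609.tar.gz
_PATH = re.compile(
    r"^authors/id/./../(?P<author>[^/]+)/"
    r"(?P<dist>.+)-(?P<version>v?\d[^-/]*)\.tar\.gz$"
)


def parse_cpan_path(path: str) -> Release:
    """Split a classic CPAN mirror path into author, distribution and version."""
    m = _PATH.match(path)
    if m is None:
        raise ValueError(f"unrecognised CPAN path: {path}")
    return Release(m["author"], m["dist"], m["version"])


def tarball_url(release: Release, template: str) -> str:
    """Fill a (hypothetical) per-repo tarball URL template for this release."""
    return template.format(dist=release.dist, version=release.version)


# Hypothetical template: one repo per distribution, one tag per version.
TEMPLATE = "https://git.example.org/{dist}/archive/{version}.tar.gz"

rel = parse_cpan_path("authors/id/T/TI/TIMB/DBI-1.609.tar.gz")
url = tarball_url(rel, TEMPLATE)
```
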
  • Adam Kennedy at Apr 2, 2010 at 12:29 am

    On Fri, Apr 2, 2010 at 4:11 AM, Michael G Schwern wrote:
    > > * The need for widespread mirroring is less significant than it was in
    > > years past. (Also using git as the inter-mirror transport of source files
    > > means there'll be much less traffic between mirrors. Effectively only
    > > the diffs between releases.)
    >
    > Not being a sysadmin, this is my gut feeling. Relative to hard drive
    > prices, CPAN (hell, BackPAN) has shrunk. I'd imagine the same to be
    > the case relative to network capacity.

    I keep a year's worth of historical Apache logs for the
    cpan.strawberryperl.com redirector, which I would imagine is one of
    the more heavily used mirrors.

    If anyone wanted access to these logs to get some idea of how much
    network traffic mirrors do, I can make them available.

    Adam K
  • Arthur Corliss at Apr 1, 2010 at 5:51 pm

    On Thu, 1 Apr 2010, Tim Bunce wrote:

    > Random thoughts...
    >
    > * If you squint a little you can view git as a database with excellent
    > replication support.
    >
    > * cpanminus already supports installing from a git repo.
    >
    > * For backwards compatibility a simple perl web server could provide a
    > classic CPAN http mirror 'view' over a git repo like gitpan.
    > This cpan-git-server would create and serve up cached distro tarballs on demand.
    > Someone could whip up one to work over gitpan as a proof of concept.
    >
    > * The need for widespread mirroring is less significant than it was in
    > years past. (Also using git as the inter-mirror transport of source files
    > means there'll be much less traffic between mirrors. Effectively only
    > the diffs between releases.)
    >
    > * New approaches to replication, such as git, don't have to be supported
    > by existing mirror providers. A new set of cpan-git-mirror providers could
    > emerge.
    >
    > * Any cpan-git-mirror provider running a cpan-git-server could be
    > included in the list of mirrors used by existing installers.
    >
    > * Over time the number of cpan-git-mirror's and cpan-git-server's could
    > grow and the number of traditional CPAN ftp/rsync mirrors could fall.

    From what I've heard about git, this sounds like a workable idea, and being
    an established and semi-portable tool, should alleviate the whininess over
    replacing the sacred cow, er, rsync.

    --Arthur Corliss
    Live Free or Die
  • Tim Bunce at Apr 1, 2010 at 8:12 pm

    On Thu, Apr 01, 2010 at 08:03:53PM +0300, Burak Gürsoy wrote:
    > From: Tim Bunce On Behalf Of Tim Bunce
    > Subject: Distributing the CPAN
    > > * cpanminus already supports installing from a git repo.
    > > * Over time the number of cpan-git-mirror's and cpan-git-server's could
    > > grow and the number of traditional CPAN ftp/rsync mirrors could fall.
    >
    > There is a part missing in this scenario. Mirroring gitPAN can be a
    > good idea since it has the actual released distros [...]

    Yes, I was envisaging something like gitPAN. Though if this took off
    then moving the tarball->git import logic to the PAUSE server would
    probably be a good idea.

    Tim.
  • David E. Wheeler at Apr 1, 2010 at 8:56 pm

    On Apr 1, 2010, at 1:12 PM, Tim Bunce wrote:

    > Yes, I was envisaging something like gitPAN. Though if this took off
    > then moving the tarball->git import logic to the PAUSE server would
    > probably be a good idea.

    /me stashes these ideas away for PGAN…
  • Ask Bjørn Hansen at Apr 1, 2010 at 11:17 pm

    On Apr 1, 2010, at 16:50, Tim Bunce wrote:

    > * The need for widespread mirroring is less significant than it was in
    > years past. (Also using git as the inter-mirror transport of source files
    > means there'll be much less traffic between mirrors. Effectively only
    > the diffs between releases.)

    The bandwidth isn't an issue -- the disk IO is.

    Maybe there'd be less disk IO with git if all of CPAN was in one big repository; but there are many good reasons for it not to be.

    If we had a repository per distribution we're back to square one; more or less.


    - ask
  • Tim Bunce at Apr 2, 2010 at 12:03 pm

    On Fri, Apr 02, 2010 at 01:16:58AM +0200, Ask Bjørn Hansen wrote:
    > On Apr 1, 2010, at 16:50, Tim Bunce wrote:
    >
    > > * The need for widespread mirroring is less significant than it was in
    > > years past. (Also using git as the inter-mirror transport of source files
    > > means there'll be much less traffic between mirrors. Effectively only
    > > the diffs between releases.)
    >
    > The bandwidth isn't an issue -- the disk IO is.

    Anyone know how much IO (stats, reads etc) it takes for a git server to
    know that nothing has changed when a client does a fetch?

    > Maybe there'd be less disk IO with git if all of CPAN was in one big
    > repository; but there are many good reasons for it not to be.
    >
    > If we had a repository per distribution we're back to square one; more
    > or less.

    I agree that one big repo isn't the way to go. So we need one repo per
    distribution. Given that, we need an efficient way to communicate which
    of the distro repos have changed.

    I'm no expert with git but I wonder if submodules may help here:
    http://www.kernel.org/pub/software/scm/git/docs/git-submodule.html
    https://git.wiki.kernel.org/index.php/GitSubmoduleTutorial

    Imagine a cpan-all 'superproject' repo that has all the distros as
    submodules. This repo would be tiny when cloned because it only
    contains empty directories for the distros plus the metadata for where
    the upstream distro repo lives and what the current commit is.
    When a distro is updated the cpan-all repo would be updated
    to reference the latest version of the distro.

    Given its small size it could be regularly and widely sync'd.
    (And may prove to be a very useful thing in itself for branching and
    tagging etc. I see _lots_ of possibilities there.)

    For a cpan-git-mirror to update the individual distro submodule repos
    it would simply do "git submodule update". (I thought this might go and
    do a "git fetch" on all the submodule repos, but it doesn't. I checked.)

    So, for a cpan-git-mirror to update itself it only needs to do:

    cd cpan-all && git pull && git submodule update

    The git pull of the cpan-all repo would be very fast as it's tiny.
    The git submodule update will only do anything for repos that
    cpan-all indicates have changed (or are new).

    Hopefully someone with more git foo than me can sanity check it.
    Assuming I'm not talking nonsense, I think this has great potential.

    Tim.
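
The "only do anything for repos that have changed" step can be modelled as a comparison of the commits pinned by the superproject before and after a pull. This is a conceptual sketch, not how git implements it: the function name, the dictionaries, and the shortened commit ids are all made up for illustration. Note also that on a fresh clone the submodules would first need `git submodule update --init`.

```python
from typing import Dict, List


def repos_to_update(local: Dict[str, str], upstream: Dict[str, str]) -> List[str]:
    """Return distros whose pinned commit is new or differs from the local pin."""
    return sorted(
        name for name, commit in upstream.items()
        if local.get(name) != commit
    )


# Hypothetical pins: distro name -> commit recorded in the cpan-all superproject.
local_pins = {"DBI": "a1b2c3", "Moose": "d4e5f6"}
upstream_pins = {"DBI": "a1b2c3", "Moose": "0f9e8d", "Plack": "777aaa"}

changed = repos_to_update(local_pins, upstream_pins)
```

Here only Moose (updated) and Plack (new) would need fetching; DBI is untouched, which is why the per-sync cost tracks the number of new releases rather than the size of the archive.
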
  • Ask Bjørn Hansen at Apr 2, 2010 at 3:08 pm

    On Apr 2, 2010, at 14:03, Tim Bunce wrote:

    > Imagine a cpan-all 'superproject' repo that has all the distros as
    > submodules. This repo would be tiny when cloned because it only
    > contains empty directories for the distros plus the metadata for where
    > the upstream distro repo lives and what the current commit is.
    > When a distro is updated the cpan-all repo would be updated
    > to reference the latest version of the distro.

    That's a really good idea actually. That'd mean, too, that it's possible to "reset" a distribution (to get rid of excessive size etc).

    It'd be fun to try on the gitpan data...


    - ask (on a sketchy 3g connection out in the country)

Discussion Overview
group: cpan-workers @
categories: perl
posted: Apr 1, '10 at 2:50p
active: Apr 2, '10 at 3:08p
posts: 10
users: 7
website: cpan.org
