On 3/13/13 2:31 PM, David Golden wrote:
tl;dr: non-unique distribution names are annoying and create a
security hole on rt.cpan.org. Fixing it may not be trivial.

"Distributions", releases of a single project, are largely informal
entities, yet they're a basic CPAN structure. It would be good to
normalize and formalize them.

## Terminology and context ##

By "distribution", I generally mean the unique path of a CPAN
distribution in the authors/id/X/XY directory. I may occasionally
refer to this as a "distfile" for specific clarity. The "distribution
name" is the portion of the basename without version or suffix.

distribution: DAGOLDEN/Foo-Bar-1.23.tar.gz
author: DAGOLDEN
distribution name: Foo-Bar
version: 1.23
suffix: .tar.gz
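
As an illustration only (this is not how PAUSE actually parses uploads), the breakdown above could be sketched in Python as follows. The suffix list and the version pattern are assumptions made for the example, and `parse_distfile` is a hypothetical helper name.

```python
import re
from typing import NamedTuple, Optional

# Assumed list of archive suffixes; real uploads may use others.
SUFFIXES = (".tar.gz", ".tar.bz2", ".tgz", ".zip")


class Distfile(NamedTuple):
    author: str
    name: str
    version: str
    suffix: str


def parse_distfile(path: str) -> Optional[Distfile]:
    """Split 'AUTHOR/Name-Version.suffix' into its parts, or return None."""
    parts = path.split("/")
    if len(parts) < 2 or not parts[-2]:
        return None
    author, basename = parts[-2], parts[-1]

    suffix = next((s for s in SUFFIXES if basename.endswith(s)), None)
    if suffix is None:
        return None
    stem = basename[: -len(suffix)]

    # Assumption: the version is the part after the last hyphen and
    # starts with a digit (optionally preceded by "v").
    match = re.fullmatch(r"(?P<name>.+)-(?P<version>v?\d[\w.]*)", stem)
    if match is None:
        return None
    return Distfile(author, match["name"], match["version"], suffix)


print(parse_distfile("DAGOLDEN/Foo-Bar-1.23.tar.gz"))
```

Anything that does not fit the assumed pattern comes back as `None` rather than a guess.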

Distributions contain modules (.pm files), which contain packages
(namespaces declared by "package NAME"). PAUSE indexes packages and
associates them with a source distribution. PAUSE has a system of
permissions for packages and ensures that distributions are unique.

I'd suggest that what you describe is a *release* of a distribution.

Here's how BackPAN::Index does it...

A distribution has a name and a list of releases.

id: Foo-Bar
- DAGOLDEN/Foo-Bar-1.23.tar.gz
- DAGOLDEN/Foo-Bar-1.22.tar.gz
- MORBO/Foo-Bar-1.00.tar.gz

It is effectively "the project" and may make more sense to call it
"project" to avoid ambiguity over "distribution".

Releases have a file (which is the same as the identifier, but does not
have to be), an author (really "releaser"), a version and a
distribution. They have other stuff, but this is enough to get the
basic release vs distribution relationship.

id: DAGOLDEN/Foo-Bar-1.23.tar.gz
releaser: DAGOLDEN
version: 1.23
distribution: Foo-Bar
"Foo::Bar": 1.23,
"Foo::Bar::Baz": 1.23

Currently the release contains most of the meta information about the
distribution, such as mailing list, stability, contact info and version
control. It may make sense to make the distribution the formal home of
the project meta data, but keep the existing mechanism for updating it,
which is to include it with the latest release. Effectively, most of
the project meta data is aliased to the latest release.
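
To make the release vs distribution relationship concrete, here is a minimal Python sketch. It is not BackPAN::Index's actual schema: the class and field names are made up for illustration, and the `uploaded` field is an assumed integer timestamp used to decide which release is "latest".

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class Release:
    id: str                 # e.g. "DAGOLDEN/Foo-Bar-1.23.tar.gz"
    releaser: str
    version: str
    distribution: str       # name of the distribution (the "project")
    uploaded: int           # assumed upload timestamp, used for ordering
    meta: Dict[str, str] = field(default_factory=dict)


@dataclass
class Distribution:
    name: str
    releases: List[Release] = field(default_factory=list)

    def latest(self) -> Optional[Release]:
        """Return the most recently uploaded release, if any."""
        return max(self.releases, key=lambda r: r.uploaded, default=None)

    @property
    def meta(self) -> Dict[str, str]:
        """Project meta data is aliased to the latest release's meta data."""
        latest = self.latest()
        return latest.meta if latest else {}


def group_releases(releases: Iterable[Release]) -> Dict[str, Distribution]:
    """Collect releases under the distribution they claim to belong to."""
    dists: Dict[str, Distribution] = {}
    for rel in releases:
        dists.setdefault(rel.distribution, Distribution(rel.distribution))
        dists[rel.distribution].releases.append(rel)
    return dists
```

Ordering by upload time rather than by version string is a deliberate simplification here, since comparing version strings correctly is its own problem.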

I observed the following after PAUSE accepted the distributions and
indexed the packages:

(1) metacpan.org and search.cpan.org incorrectly linked my
distributions to BinGOs'. E.g. they both believe the latest
"Acme-CPAN-Testers-UNKNOWN" is mine, though their contents and primary
maintainers are completely different.

(2) rt.cpan.org treated both distributions as having the same RT
queue. I gained administrative access to BinGOs' existing queues.

(3) cpantesters.org treated both distributions as one for the purpose
of aggregating test reports[1]

These are good observations about how (not) easy it is to get permission
information out of CPAN/PAUSE.

## Solutions ##

Here's where I start brainstorming. If we can get some good
discussion on this list, then maybe we could finalize a plan at the QA
hackathon, which will have a number of the relevant
maintainers/administrators attending.

(a) We could do nothing; we've lived with it and can continue to live
with it and will police any incidents on a one-off basis

In several projects (Gitpan and BackPAN::Index being two) I've found
that working out what a "distribution" is turns out to be either very
difficult to get correct, or you live with a large number of broken
distribution lists. It would make working with CPAN much easier if
discovering what a distribution is, and what its releases are, was easy
and correct.

Which is to say, this is not just a security problem. The cost of our
messy concept of distributions is a barrier to doing interesting things
with CPAN.

(b) We could extend PAUSE's permission system to distribution names as
well, so that distribution names would have primary/co-maint rights
just as packages do. This would not fix any existing duplicates, but
would prevent future infractions. It means changing a lot of PAUSE
code, but would allow RT, search sites and CPAN Testers (CT) to pretty
much remain as is.

IMO this is a necessary piece of missing CPAN meta data which everyone
else has to piece together again and again.

We could also retroactively fix duplicates, once and for all, as they
are reported.
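
A rough Python sketch of what option (b) implies, under stated assumptions: permissions are first-come, the first uploader becomes the primary maintainer, and only the primary can grant co-maint. None of this is PAUSE's real code or schema, and `DistPermissions` is a hypothetical name.

```python
from typing import Dict, Set


class DistPermissions:
    """Toy model of primary/co-maint rights keyed by distribution name."""

    def __init__(self) -> None:
        self._primary: Dict[str, str] = {}
        self._comaint: Dict[str, Set[str]] = {}

    def may_upload(self, author: str, dist_name: str) -> bool:
        owner = self._primary.get(dist_name)
        if owner is None:
            return True  # assumption: an unclaimed name is first-come
        return author == owner or author in self._comaint.get(dist_name, set())

    def record_upload(self, author: str, dist_name: str) -> bool:
        """Accept or reject an upload; the first uploader becomes primary."""
        if not self.may_upload(author, dist_name):
            return False
        self._primary.setdefault(dist_name, author)
        return True

    def grant_comaint(self, granter: str, dist_name: str, grantee: str) -> bool:
        """Only the primary maintainer may hand out co-maint (assumed rule)."""
        if self._primary.get(dist_name) != granter:
            return False
        self._comaint.setdefault(dist_name, set()).add(grantee)
        return True
```

With a table like this, a second author uploading under an already claimed distribution name would be rejected at upload time, and the same table could be exposed for RT, search sites and CT to query.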

(c) We could restrict PAUSE to allow only "well formed" distribution
names[2] -- ones matching a module name inside containing a package of
a corresponding name. E.g. "Foo-Bar", containing "Foo/Bar.pm" with
package "Foo::Bar". The existing package permissions system becomes
the chokepoint to restrict abuse.

-1 I don't think this is necessary if B is in place, and B is a much
better solution.
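
For reference, the "well formed" rule in (c) amounts to something like the Python sketch below. The input shape (a mapping from file paths in the archive to the packages each file declares) and the function names are assumptions for illustration; PAUSE does not expose such a structure as far as this example is concerned.

```python
from typing import Dict, Set, Tuple


def expected_module(dist_name: str) -> Tuple[str, str]:
    """Map 'Foo-Bar' to the module path and package name it should contain."""
    parts = dist_name.split("-")
    return "/".join(parts) + ".pm", "::".join(parts)


def is_well_formed(dist_name: str, packages_by_file: Dict[str, Set[str]]) -> bool:
    """True if some file ending in Foo/Bar.pm declares package Foo::Bar."""
    path, package = expected_module(dist_name)
    for file_path, packages in packages_by_file.items():
        if file_path == path or file_path.endswith("/" + path):
            if package in packages:
                return True
    return False


print(is_well_formed("Foo-Bar", {"lib/Foo/Bar.pm": {"Foo::Bar"}}))
print(is_well_formed("libwww-perl", {"lib/LWP.pm": {"LWP"}}))
```

The second call shows why a name like libwww-perl fails this rule and would need grandfathering.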

We've always had a policy of being very liberal with what we allow. Not
everything is a Perl library, and when something isn't, PAUSE will not
try to index it. I'm ok with that.

If you have something which falls outside the normal structure, say for
some reason Foo-Bar-X.YZ.tar.gz doesn't have a lib/Foo/Bar.pm, the meta
data would be trusted. If it says it's release X.YZ of the Foo-Bar
distribution then it is. The permissions system in B protects the rest
and a permissions/distribution API lets external sites query it.

Existing distributions with non-conforming names (e.g. libwww-perl)
either change for their next release or get grandfathered somehow.

I'm happy to grandfather in existing packages, especially major ones
like libwww-perl, where people have learned to look for
libwww-perl-X.YZ.tar.gz and not LWP-X.YZ.tar.gz.

(d) RT, search sites and CT stop using distribution name as a key and
revert either to package names or to distfile in some fashion. This
is not a trivial amount of code change and -- in the case of RT --
might make RT much more complicated and less useful.

The distribution name is still a good identifier and I'd rather see the
distribution meta problem solved.

(e) We could develop a new, unique way to identify collections of
related packages. This could be based on some combination of
distribution name and the name of an authorized package it contains,
or perhaps just on the name of a "primary" package. RT and the search
sites would need to migrate to a new data model and probably change
their HTTP routes to match.

Sounds complicated and unnecessary. It's hard for humans to express.
The set of authorized packages and who did the release changes from
release to release. Colons give some filesystems (OS X) indigestion.

group: cpan-workers
posted: Mar 13, '13 at 9:31p
active: Mar 20, '13 at 3:13a


