Hello,

I'm working with T.J. Mather on updating Geo::PostalCode. One of the
things we're looking at is how to manage the ZIP code database that's
necessary for its operation. I've proposed creating a
Geo::PostalCode::US module as a very simple subclass of
Geo::PostalCode, and bundling the ZIP code data with that module. The
data is 1.3MB uncompressed, and compressed the entire module is about
480K. We have some concerns this may be frowned on, however, so I
thought I'd get some feedback here.

The advantage of having the data on CPAN is that the entire module is
self-sufficient and widely mirrored. It makes it much easier to
install, and if you have a CPAN distribution on CD or in a local
mirror, you have everything you need. The disadvantage is that it
takes up 480K on every single CPAN mirror.

Are there other modules that do this? Is there a consensus on what's
appropriate? And what are the list members' opinions on the matter?

Thanks!

-----ScottG.

  • Andy Lester at Jan 5, 2005 at 3:58 am

    The advantages of having the data on CPAN is that the entire module is
    self-sufficient and widely mirrored. It makes it much easier to
    install, and if you have a CPAN distribution on CD or in a local
    mirror, you have everything you need. The disadvantage is that it
    takes up 480K on every single CPAN mirror.
    I'd be more concerned about the updates. Make a ::Data subdistro of it,
    like brian d foy has done with Business::ISBN and Business::ISBN::Data.
    When the code updates, you don't have to push out half a gig again.
    Code and data are separate distros.

    xoxo,
    Andy

    --
    Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance
  • Chris Josephes at Jan 5, 2005 at 4:27 am

    On Tue, 4 Jan 2005, Scott W Gifford wrote:

    The advantages of having the data on CPAN is that the entire module is
    self-sufficient and widely mirrored. It makes it much easier to
    install, and if you have a CPAN distribution on CD or in a local
    mirror, you have everything you need. The disadvantage is that it
    takes up 480K on every single CPAN mirror.
    Is the data currently encoded in a standard format? How difficult is it
    for end users and developers to get updated copies of the data?
    Are there other modules that do this? Is there a consensus on what's
    appropriate? And what are the list members opinions on the matter?
    I once wrote a Perl script that embedded base64-encoded PNG images as default
    icons, but the total damage was much smaller than 480K. One
    concern I would have is how dependent your users would be on you as the
    primary provider of data updates.

    Would updates always be in the form of a re-versioned perl module?


    --------------------
    Christopher Josephes
    cpj1@visi.com
  • Scott W Gifford at Jan 5, 2005 at 4:59 am
    A private emailer wrote:

    [...]
    Even better, isn't all this on a USPS server? Whatever tool you use
    to grab their server database, include it and do that as part of the
    build process, or perhaps offer it as an option, the alternative
    being to go to the USPS server every time.
    Three things I don't like about that. First, it gives a single point
    of failure for the system, as compared to relying on CPAN's large
    network of mirrors; worse, it would break if USPS's server ever
    changed the file location or format. Second, if you want to use
    something like CPAN on CD to install onto a non-Internet connected
    machine, downloading from an external server will break; you'll have
    to arrange for that one file to be copied over and installed by hand.
    Third, if you cache CPAN modules for installation to many machines,
    this will bypass the cache.

    Andy Lester <andy@petdance.com> writes:

    [...]
    I'd be more concerned about the updates. Make ::Data subdistro of
    it, like brian d foy has done with Business::ISBN and
    Business::ISBN::Data.
    Right, the data is in a new module called Geo::PostalCode::US;
    Geo::PostalCode is the main module. The code for Geo::PostalCode::US
    is 26 lines plus some cleverness in Makefile.PL, so I'm not too
    concerned about separating code and data for this module.
    When the code updates, you don't have to push out half a gig again.
    Just to be clear, it's half a meg; I wouldn't put half a gig onto
    CPAN.

    Chris Josephes <cpj1@visi.com> writes:

    [...]
    Is the data currently encoded in a standard format? How difficult is it
    for end users and developers to get updated copies of the data. [...]
    One concern I would have is how dependent would your users be on you
    to be the primary provider of the data updates?
    The main Geo::PostalCode module provides easy instructions for
    generating your own location database from a simple tab-separated
    value file. Geo::PostalCode::US is just for easier download and
    installation.
    Would updates always be in the form of a re-versioned perl module?
    Yes. They'd be very infrequent. The data we're using now is from
    1999.

    I'm working on this in response to a user on PerlMonks who had a very
    difficult time getting Geo::PostalCode installed and set up right on a
    hosting provider. It's much nicer to be able to install it with the
    standard "perl Makefile.PL; make; make test; make install".

    ----ScottG.
  • Dana Hudes at Jan 5, 2005 at 5:04 am

    On Tue, 4 Jan 2005, Scott W Gifford wrote:

    A private emailer wrote:

    [...]
    Even better isn't all this on a USPS server? Whatever tool you use
    to grab their server database, include it and do that as part of the
    build process or perhaps offer it as an option , the alternative
    being to go to the USPS server every time.
    Three things I don't like about that. First, it gives a single point
    of failure for the system, as compared to relying on CPAN's large
    network of mirrors; worse, it would break if USPS's server ever
    changed the file location or format. Second, if you want to use
    something like CPAN on CD to install onto a non-Internet connected
    machine, downloading from an external server will break; you'll have
    to arrange for that one file to be copied over and installed by hand.
    Third, if you cache CPAN modules for installation to many machines,
    this will bypass the cache.
    The alternative is stale data. The USPS server is authoritative.
    This is like saying I should distribute the entire registry for all
    of .COM with Net::Whois.
    I guess it comes down to how often the ZIP codes change.
    I have no idea, but of course it's less frequent than the registry of .COM.

    You can always offer a tool in the scripts/ directory of your code distro
    to build the database from the Internet. That addresses your cache issue.
  • Scott W Gifford at Jan 5, 2005 at 5:20 am

    Dana Hudes writes:
    On Tue, 4 Jan 2005, Scott W Gifford wrote:

    A private emailer wrote:

    [...]
    Even better isn't all this on a USPS server? Whatever tool you use
    to grab their server database, include it and do that as part of the
    build process or perhaps offer it as an option , the alternative
    being to go to the USPS server every time.
    Three things I don't like about that.
    [...]
    Third, if you cache CPAN modules for installation to many machines,
    this will bypass the cache.
    the alternative is stale data. The USPS server is authoritative. [...]
    I guess it comes down to how often the ZIP codes change.
    [...]

    The file we have now is from the 1999 US census, so it seems likely
    we'll get a new one with the census, every 10 years. I think it's
    unlikely the data format (a dBase file with a Word document describing
    the fields) will be the same in 2009, and I don't think there's any
    guarantee about what the URL will be when it is published; the current
    one is:

    http://www.census.gov/geo/www/tiger/zip1999.html

    So I don't think we have any real options as far as keeping data
    up-to-date automatically. If the data were published in a
    standardized place in a standardized format, I'd be more inclined to
    agree about the evils of stale data.
    You can always offer a tool in the scripts/ directory of your code
    distro to build the database from the Internet. That addresses your
    cache issue.
    I was thinking of somebody using something like CPAN::Mini to create a
    local cache of CPAN modules. I thought there was a CPAN::Cache module
    that somehow downloaded only one copy for a group of machines in the
    same place to save bandwidth, but maybe that was just a rather dull
    dream. In any event, using a script to download it doesn't help with
    caching for either of these circumstances, although of course it will
    with a Web cache.

    ----ScottG.
  • _brian_d_foy at Jan 5, 2005 at 6:43 pm
    In article <Pine.LNX.4.58.0501050002020.14260@screamer.tcp-ip.info>,
    Dana Hudes wrote:
    On Tue, 4 Jan 2005, Scott W Gifford wrote:

    A private emailer wrote:

    [...]
    Even better isn't all this on a USPS server? Whatever tool you use
    to grab their server database, include it and do that as part of the
    build process or perhaps offer it as an option , the alternative
    being to go to the USPS server every time.
    Three things I don't like about that. First, it gives a single point
    of failure for the system,
    the alternative is stale data. The USPS server is authoritative.
    It may be authoritative, but it certainly is slow. I would like a
    data file that I can use without a net connection and for several
    thousand records. I don't want to make several thousand requests
    to a web site.
    I guess it comes down to how often the ZIP codes change.
    I have no idea but of course its less frequent than the registry of .COM .
    You can always offer a tool in the scripts/ directory of your code distro
    to build the database from the Internet. That addresses your cache issue.
    The USPS folks offer all the data on CD. Those of us that really care
    about such things would rather just build it from those files, I
    think :)

    In Business::ISBN::Data, I included the script I used to convert the
    information from the ISBN folks to the data file I needed. Failing
    that, instructions are the next best thing.

    --
    brian d foy, comdog@panix.com
  • _brian_d_foy at Jan 5, 2005 at 6:38 pm

    In article <qszfz1gmt78.fsf@timepilot.gpcc.itd.umich.edu>, Scott W
    Gifford wrote:
    I'm working with T.J. Mather on updating Geo::PostalCode. One of the
    things we're looking at is how to manage the ZIP code database that's
    necessary for its operation.
    I eventually bundled the ISBN data for Business::ISBN separately.
    Geo::IP has a separate and updatable data file which users need
    to download separately.

    That's the way to go I think.

    --
    brian d foy, comdog@panix.com
  • Graciliano M. P. at Jan 5, 2005 at 8:22 pm
    You should compress the data and then append it with __DATA__ or as a string
    saved with Base64 to avoid binary errors.

    Here's a simple sample:

    use Compress::Zlib qw(compress uncompress) ;
    use MIME::Base64 qw(encode_base64 decode_base64) ;

    my $uncompressed = 'some sample data' ;

    my $base64 = encode_base64( compress($uncompressed) ) ;

    print "$base64\n" ;

    my $original = uncompress( decode_base64($base64) ) ;

    print "$original\n" ;

    Regards,
    Graciliano M. P.

    ----- Original Message -----
    From: "Scott W Gifford" <gifford@umich.edu>
    To: <module-authors@perl.org>
    Sent: Wednesday, January 05, 2005 12:48 AM
    Subject: [Spam] Including a 480K data file with a module

    [...]

  • Ken Williams at Jan 6, 2005 at 12:45 am

    On Jan 5, 2005, at 3:29 PM, Graciliano M. P. wrote:

    You should compress the data and than append it with __DATA__ or as a
    string
    saved with Base64 to avoid binary errors.
    I think not - then you'd have to decompress it every time you used the
    module. Just gzip it and put it on whatever distribution site you
    choose (which may or may not be CPAN) and then users will decompress
    before installation.

    -Ken
  • Rhet Turnbull at Jan 6, 2005 at 2:25 am

    module. Just gzip it and put it on whatever distribution site you
    choose (which may or may not be CPAN) and then users will decompress
    before installation.
    Why not just keep the file gzipped and use IO::Zlib to read it at run
    time? The speed difference will probably not even be noticeable.
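
    A minimal sketch of that approach, assuming a gzipped tab-separated file
    with ZIP code, latitude and longitude columns. The column layout, the
    sample rows and the load_zip_table name are made up for illustration and
    are not the real Geo::PostalCode data format.

    ```perl
    use strict;
    use warnings;
    use IO::Zlib;
    use File::Temp qw(tempfile);

    # Write a tiny gzipped sample file so the sketch runs on its own.
    # The rows are invented sample data, not real Census records.
    my ($tmp_fh, $path) = tempfile(SUFFIX => '.gz', UNLINK => 1);
    close $tmp_fh;
    my $out = IO::Zlib->new($path, 'wb') or die "cannot write $path: $!";
    $out->print("48104\t42.27\t-83.73\n");
    $out->print("10001\t40.75\t-73.99\n");
    $out->close;

    # Read the gzipped file line by line at run time, without unpacking it on disk.
    sub load_zip_table {
        my ($file) = @_;
        my $in = IO::Zlib->new($file, 'rb') or die "cannot read $file: $!";
        my %table;
        while (defined(my $line = $in->getline)) {
            chomp $line;
            my ($zip, $lat, $lon) = split /\t/, $line;
            $table{$zip} = { lat => $lat, lon => $lon };
        }
        $in->close;
        return \%table;
    }

    my $table = load_zip_table($path);
    print "48104 => $table->{48104}{lat}, $table->{48104}{lon}\n";
    ```

    The cost is a hard dependency on IO::Zlib and a decompression pass on every
    load, so whether it is worth it depends on how often the table is read.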


  • Ken Williams at Jan 6, 2005 at 2:32 am

    On Jan 5, 2005, at 8:24 PM, Rhet Turnbull wrote:

    module. Just gzip it and put it on whatever distribution site you
    choose (which may or may not be CPAN) and then users will decompress
    before installation.
    Why not just keep the file gzipped and use IO::Zlib to read it at run
    time? The speed difference will probably not even be noticeable.
    Because not everybody has IO::Zlib available to them on the platform
    they're targeting. But pretty much everyone can decompress a file once
    using WinZip or another machine or whatever.

    -Ken
  • Scott W Gifford at Jan 6, 2005 at 6:12 am

    "Ken Williams" <ken@mathforum.org> writes:
    On Jan 5, 2005, at 3:29 PM, Graciliano M. P. wrote:

    You should compress the data and than append it with __DATA__ or as
    a string
    saved with Base64 to avoid binary errors.
    I think not - then you'd have to decompress it every time you used the
    module. Just gzip it and put it on whatever distribution site you
    choose (which may or may not be CPAN) and then users will decompress
    before installation.
    Why not just leave it uncompressed, and let the compression of the
    whole package into Geo-PostalCode-US-0.1.tar.gz take care of
    compressing it? Otherwise I'd end up with a gzip file inside of a
    gzip file---no extra compression, and I have to require an extra
    module to uncompress it on the client side.

    -----ScottG.
  • Ken Williams at Jan 6, 2005 at 1:47 pm

    On Jan 6, 2005, at 12:12 AM, Scott W Gifford wrote:

    "Ken Williams" <ken@mathforum.org> writes:
    On Jan 5, 2005, at 3:29 PM, Graciliano M. P. wrote:

    You should compress the data and than append it with __DATA__ or as
    a string
    saved with Base64 to avoid binary errors.
    I think not - then you'd have to decompress it every time you used the
    module. Just gzip it and put it on whatever distribution site you
    choose (which may or may not be CPAN) and then users will decompress
    before installation.
    Why not just leave it uncompressed, and let the compression of the
    whole package into Geo-PostalCode-US-0.1.tar.gz take care of
    compressing it?
    Yeah, that's what I mean.

    -Ken
  • _brian_d_foy at Jan 6, 2005 at 6:04 pm
    In article <qszvfabrsqc.fsf@rygar.gpcc.itd.umich.edu>, Scott W Gifford
    wrote:

    Why not just leave it uncompressed, and let the compression of the
    whole package into Geo-PostalCode-US-0.1.tar.gz take care of
    compressing it?
    I recommend not doing that. I had a lot of problems distributing
    the data for Business::ISBN with the code. Most notably, installing
    the module overwrote any updated data the user had added.

    --
    brian d foy, comdog@panix.com
  • Scott W Gifford at Jan 6, 2005 at 6:57 pm

    _brian_d_foy writes:

    In article <qszvfabrsqc.fsf@rygar.gpcc.itd.umich.edu>, Scott W Gifford
    wrote:

    Why not just leave it uncompressed, and let the compression of the
    whole package into Geo-PostalCode-US-0.1.tar.gz take care of
    compressing it?
    I recommend not doing that. I had a lot of problems distributing
    the data for Business::ISBN with the code. Most notably, installing
    the module overwrote any updated data the user had added.
    If the user has custom data, they would just install Geo::PostalCode
    and build their own database (it includes a short script to do this,
    and the process takes about 2 minutes). Geo::PostalCode::US doesn't
    replace Geo::PostalCode, but just adds a second module with the data.

    -----ScottG.
  • Chris Josephes at Jan 6, 2005 at 7:55 pm

    On Thu, 6 Jan 2005, Scott W Gifford wrote:

    If the user has custom data, they would just install Geo::PostalCode
    and build their own database (it includes a short script to do this,
    and the process takes about 2 minutes). Geo::PostalCode::US doesn't
    replace Geo::PostalCode, but just adds a second module with the data.
    I think the thing that is still gnawing at me is having one perl module
    that is just data. Other guys do it, but the data in Geo::PostalCode is
    actually really, really useful.

    What about this option?

    1. Include the datafiles, either compressed or uncompressed with the
    module distribution.

    2. Include a "make install-data" target that will unzip the data, and then
    install it in a standard location your module will search by default.
    Would /usr/share/postal/us be a bad place? Is anyone else doing this?

    3. Whenever the Geo::PostalCode code is updated, include the latest datafiles.

    4. If the datafiles get updated, and it does not coincide with a code
    release, make sure people know where to get those files (either point them
    to the Census website, or provide your own zipfile distro).

    That way you bundle the data with the code, and you give end-users the
    option of whether or not they want to use the data you provide.


    --------------------
    Christopher Josephes
    cpj1@visi.com
  • Scott W Gifford at Jan 6, 2005 at 8:08 pm
    Chris Josephes writes:

    [...]
    2. Include a "make install-data" target that will unzip the data, and then
    install it in a standard location your module will search by default.
    Would /usr/share/postal/us be a bad place? Is anyone else doing this?
    The problem is that you can't easily do this from perl -MCPAN -e
    shell.

    None of the objections I've seen so far have related to the size of
    the data. There have been many concerns about the freshness, but
    because of how rarely it's updated (every 10 years) and the lack of a
    standard way to get and process the data, I think that there isn't
    really a better alternative. There have been many concerns about
    having to re-distribute the data when the code changes, or the code
    when the data changes, but the data is in a separate module with very
    little code (26 lines), so I don't really think that's a problem,
    either.

    If you have any objections about the size of the file, or any other
    objections that haven't been brought up already, let me know; otherwise
    I'll recommend to TJ that he upload the Geo::PostalCode::US module
    with the data.

    Thanks for all the feedback!

    -----ScottG.
  • Sébastien Aperghis-Tramoni at Jan 6, 2005 at 8:35 pm

    Scott W Gifford wrote:

    If you have any objections about the size of the file, or any other
    objections that haven't been brought up already, let me know; otherwise
    I'll recommend to TJ that he upload the Geo::PostalCode::US module
    with the data.
    Just to make a point about the size, I don't think that an archive of
    less than half a megabyte is an issue. Looking in my MiniCPAN, I see
    three distributions of more than 10 MB (Bio-Affymetrix,
    bioperl-microarray, Bio-PrimerDesigner). Tk is 5.7 MB, and some of the
    modules that deal with Asian languages (Unicode-Unihan,
    Lingua-ZH-CCDICT, Encode) are between 2 and 4 MB.

    My 2 eurocents.

    Regards,

    Sébastien Aperghis-Tramoni
    -- - --- -- - -- - --- -- - --- -- - --[ http://maddingue.org ]
    Close the world, txEn eht nepO
  • Chris Josephes at Jan 6, 2005 at 10:12 pm

    On Thu, 6 Jan 2005, Chris Josephes wrote:

    2. Include a "make install-data" target that will unzip the data, and then
    install it in a standard location your module will search by default.
    Would /usr/share/postal/us be a bad place? Is anyone else doing this?
    As a quick follow-up to my own post, I'll admit that this location doesn't
    make any sense in the Windows world. My bad. Maybe C:\WINDOWS\POSTAL\US
    ??

    --------------------
    Christopher Josephes
    cpj1@visi.com
  • Hugh S. Myers at Jan 6, 2005 at 11:01 pm
    I've a similar although smaller situation with one of my modules and I
    finally decided that the only platform-independent solution was to keep
    the DB in question in the install tree of the module. That way it is the
    same regardless of OS.

    --hsm
    -----Original Message-----
    From: Chris Josephes
    Sent: Thursday, January 06, 2005 3:12 PM
    To: module-authors@perl.org
    Subject: Re: Including a 480K data file with a module
    On Thu, 6 Jan 2005, Chris Josephes wrote:

    2. Include a "make install-data" target that will unzip the data, and then
    install it in a standard location your module will search by default.
    Would /usr/share/postal/us be a bad place? Is anyone else doing this?
    As a quick follow-up to my own post, I'll admit that this location doesn't
    make any sense in the Windows world. My bad. Maybe C:\WINDOWS\POSTAL\US
    ??

    --------------------
    Christopher Josephes
    cpj1@visi.com
  • Scott W Gifford at Jan 7, 2005 at 9:06 am

    "Hugh S. Myers" <hsmyers@sdragons.com> writes:

    I've a similar although smaller situation with one of my modules and I
    finally decided that the only platform-independent solution was to keep
    the DB in question in the install tree of the module. That way it is the
    same regardless of OS.
    That's what I did, too. I couldn't find another straightforward way
    to figure out a good place to put it, and where I could find it later.
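
    One way to find such a file again later can be sketched as follows: derive
    the data path from the module's own location. The data_path_for name, the
    install path and the postalcode.db file name are assumptions for
    illustration, not necessarily what Geo::PostalCode::US does.

    ```perl
    use strict;
    use warnings;
    use File::Spec;

    # Given the path of a module file (normally __FILE__), return the path of a
    # data file kept in the directory named after the module, e.g.
    # .../Geo/PostalCode/US.pm  ->  .../Geo/PostalCode/US/<name>
    sub data_path_for {
        my ($module_file, $name) = @_;
        my $dir = File::Spec->rel2abs($module_file);
        $dir =~ s/\.pm\z//;    # strip the extension: US.pm -> US
        return File::Spec->catfile($dir, $name);
    }

    # Inside a module this would be called as data_path_for(__FILE__, ...).
    # The install path and file name below are hypothetical.
    my $path = data_path_for('/usr/lib/perl5/Geo/PostalCode/US.pm', 'postalcode.db');
    print "$path\n";
    ```

    Because the path is computed relative to the module file, the lookup does
    not depend on any OS-specific shared-data directory.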

    ----ScottG.

Discussion Overview
group: module-authors @
categories: perl
posted: Jan 5, '05 at 3:49a
active: Jan 7, '05 at 9:06a
posts: 22
users: 10
website: cpan.org...
