FAQ
Reviewers: rsc, volker.dobler,

Message:
Hello rsc@golang.org, dr.volker.dobler@gmail.com (cc:
golang-dev@googlegroups.com),

I'd like you to review this change to
https://code.google.com/p/go


Description:
exp/publicsuffix: new package.

The tables were generated by:

go run gen.go -version "publicsuffix.org's effective_tld_names.dat, hg
revision 05b11a8d1ace (2012-11-09)" >table.go

go run gen.go -version "publicsuffix.org's effective_tld_names.dat, hg
revision 05b11a8d1ace (2012-11-09)" -test >table_test.go

The input data is temporarily filtered to the .ao, .ar, .arpa, .uk and
.zw domains, so that code review is easier while still covering the
interesting * and ! rules. A follow-up changelist will check in the
unfiltered public suffix list.

Update issue 1960.

Please review this at https://codereview.appspot.com/6873056/

Affected files:
A src/pkg/exp/publicsuffix/gen.go
A src/pkg/exp/publicsuffix/list.go
A src/pkg/exp/publicsuffix/list_test.go
A src/pkg/exp/publicsuffix/table.go
A src/pkg/exp/publicsuffix/table_test.go

Search Discussions

  • Rob Pike at Dec 5, 2012 at 4:25 am
    I'd rather not check in the full list. It's a lot of data that will
    just be out of date soon, breaking stuff and bloating the repo. Can't
    we generate it at build time?

    -rob
  • David Symonds at Dec 5, 2012 at 4:37 am

    On Wed, Dec 5, 2012 at 3:25 PM, Rob Pike wrote:

    I'd rather not check in the full list. It's a lot of data that will
    just be out of date soon, breaking stuff and bloating the repo. Can't
    we generate it at build time?
    We don't do that for the unicode tables. Doing it for this would add a
    a build-time dependency on a web site; it would be a gigantic nuisance
    if no-one can build Go just because some Mozilla-run webserver is
    down.
  • Andrew Gerrand at Dec 5, 2012 at 4:42 am
    But we do pull time zone information at runtime.

    Hm.
  • Russ Cox at Dec 5, 2012 at 4:51 am
    The raw file is being compiled into some other preprocessed form.
    We're only talking about the preprocessed tables being in the repo,
    not the raw file.

    How big is the source file form of the generated full list?
    How often does the publicsuffix.org list get updated?
    Is there a version history?

    For reference, the Unicode tables.go is 160 kB and we
    edited it five times this year.

    Russ
  • Nigel Tao at Dec 5, 2012 at 5:16 am
    Just an idea: it could live in the go.net subrepo. Punycode could also
    live there, I guess. That could push exp/cookiejar into go.net too,
    since it's not really useful without a public suffix list
    implementation.

    On Wed, Dec 5, 2012 at 3:50 PM, Russ Cox wrote:
    How big is the source file form of the generated full list?
    The generated table.go is 390K, but half of that is comments, since
    most lines look like:
    {0x40100c95, 0x006a7f09}, // 0x0629 (0x0c95-0x0ca5) + yamaguchi
    {0x401c0ca5, 0x006aa709}, // 0x062a (0x0ca5-0x0cc1) + yamanashi
    {0x20010cc1, 0x006bb608}, // 0x062b (0x0cc1-0x0cc2) _* yokohama
    {0x40000000, 0x00018705}, // 0x062c (-------------) + aisai
    {0x40000000, 0x00027303}, // 0x062d (-------------) + ama
    {0x40000000, 0x00036a04}, // 0x062e (-------------) + anjo
    {0x40000000, 0x00059005}, // 0x062f (-------------) + asuke
    {0x40000000, 0x000d9306}, // 0x0630 (-------------) + chiryu
    {0x40000000, 0x000d9905}, // 0x0631 (-------------) + chita

    The node table has 5840 rows, and the "const text = `etc`" string is
    27990 bytes, presumably the latter could get smaller if we dedupe it a
    la exp/html/atom's text string.

    The pkg/linux_amd64/exp/publicsuffix.a file is 562K. That package
    imports exp/cookiejar and strings. exp/cookiejar.a is 152K and
    strings.a is 334K. In comparison, unicode.a is 626K.

    The raw effective_tld_names.dat file has 6949 lines and weighs in at
    99K. For me, effective_tld_names.dat.gz is 33K, so I don't know where
    bradfitz is getting his 96K number from.

    How often does the publicsuffix.org list get updated?
    Is there a version history?
    http://hg.mozilla.org/mozilla-central/filelog/e8f0504ccbb9/netwerk/dns/effective_tld_names.dat
    suggests that there were 40 changes so far in 2012, although a couple
    of those were formatting changes.
  • Brad Fitzpatrick at Dec 5, 2012 at 5:21 am
    The raw effective_tld_names.dat file is 96 KB gzipped. That seems
    reasonable to keep in the repo, if we only keep deltas from that and not
    new 100 KB files every week or month.
    On Tue, Dec 4, 2012 at 8:50 PM, Russ Cox wrote:

    The raw file is being compiled into some other preprocessed form.
    We're only talking about the preprocessed tables being in the repo,
    not the raw file.

    How big is the source file form of the generated full list?
    How often does the publicsuffix.org list get updated?
    Is there a version history?

    For reference, the Unicode tables.go is 160 kB and we
    edited it five times this year.

    Russ
  • Rob Pike at Dec 5, 2012 at 5:19 am
    I think of this as more like the time zone data than the unicode
    table. Regardless, it's data that most programs will not need and I
    strongly object to checking in a version of every internet database
    that comes along.

    There are going to be more of these. If you insist on building it in,
    I'd prefer a subrepo to set the proper precedent. You don't need it
    built in anyway, since you'll have the repo built when you're on the
    plane.

    I don't know much about the data so I can't judge the merit of the
    hybrid approach but it does seem to solve all problems: little churn,
    small hit to repo, can work in in the Marianas trench, full
    flexibility available for show-offs.

    -rob
  • Brad Fitzpatrick at Dec 5, 2012 at 5:31 am
    The Marianas trench rebuttal is unwarranted: we've never required a network
    connection to run all.bash so far. This shouldn't be what breaks that.

    For comparison, we have 2 MB of images checked in to the repo. This is 96
    KB compressed. With an domain-specific compression scheme we could make
    that even less.

    I don't care about this database much but it's hard to discount its
    importance. It's as important to the web as HTTP and an HTML5 parser is,
    which we seem cool with having in the main repo.

    On Tue, Dec 4, 2012 at 9:19 PM, Rob Pike wrote:

    I think of this as more like the time zone data than the unicode
    table. Regardless, it's data that most programs will not need and I
    strongly object to checking in a version of every internet database
    that comes along.

    There are going to be more of these. If you insist on building it in,
    I'd prefer a subrepo to set the proper precedent. You don't need it
    built in anyway, since you'll have the repo built when you're on the
    plane.

    I don't know much about the data so I can't judge the merit of the
    hybrid approach but it does seem to solve all problems: little churn,
    small hit to repo, can work in in the Marianas trench, full
    flexibility available for show-offs.

    -rob
  • Russ Cox at Dec 5, 2012 at 5:37 am
    We shouldn't check in the raw data file, even gzipped. Nothing but the
    table generator will ever read that file, and it can read it from the
    internet, just like Unicode's maketables.go does.

    The only thing we should be talking about checking in is the generated
    tables.go file, which cannot be compressed (or it can't be compiled).

    Russ
  • Brad Fitzpatrick at Dec 5, 2012 at 5:40 am

    On Tue, Dec 4, 2012 at 9:37 PM, Russ Cox wrote:

    We shouldn't check in the raw data file, even gzipped. Nothing but the
    table generator will ever read that file, and it can read it from the
    internet, just like Unicode's maketables.go does.

    The only thing we should be talking about checking in is the generated
    tables.go file, which cannot be compressed (or it can't be compiled).

    I was under the impression this thread had turned into concerns about the
    repository's disk usage.

    It's not clear to me what this thread is even about:

    -- the web's grossness
    -- the necessity of more tables
    -- build system complexity
    -- build time
    -- network availability
    -- disk space

    ... and in what order.
  • Rob Pike at Dec 5, 2012 at 6:03 am
    This thread is about where and how to make this data accessible to Go programs.

    Does it need to be in the main repo?
    Does all of it need to be in the main repo?
    Would it be sufficient to put it in go.net?
    How should it be represented?
    How should it be maintained?

    -rob
  • David Symonds at Dec 5, 2012 at 6:07 am
    I'm thinking this package and cookiejar should just move to go.net right
    now, and we can figure out the long-term plan later.
  • Brad Fitzpatrick at Dec 5, 2012 at 6:10 am

    On Tue, Dec 4, 2012 at 10:03 PM, Rob Pike wrote:

    This thread is about where and how to make this data accessible to Go
    programs.

    Does it need to be in the main repo?
    Does all of it need to be in the main repo?

    Would it be sufficient to put it in go.net?
    >

    depends what the concerns are. what are your concerns?

    How should it be represented?
    How should it be maintained?
    etc

    -rob
  • Rob Pike at Dec 5, 2012 at 6:04 am
    This thread is about where and how to make this data accessible to Go programs.

    Does it need to be in the main repo?
    Does all of it need to be in the main repo?
    Would it be sufficient to put it in go.net?
    How should it be represented a

    -rob
  • David Symonds at Dec 5, 2012 at 6:03 am
    Yeah, I thought tables.go was the only thing being considered for checking
    in.
  • Brad Fitzpatrick at Dec 5, 2012 at 4:49 am
    I frequently build-hack-build Go on airplanes, buses, and ski lodges, all
    with terrible Internet access. I would personally trade some disk space in
    exchange for the ability to continue doing so. I left my previous project
    in part because it was too tedious to build.

    That said, it may not have to be in the repo. Support for the file format
    should be the repo, but perhaps only a minimal subset of the data gets
    included with Go and the rest is either fetched & cached lazily (at even
    runtime), or looked for at a well-known location on disk ala MIME type
    tables. (which we also only ship a minimal subset of)

    On Tue, Dec 4, 2012 at 8:25 PM, Rob Pike wrote:

    I'd rather not check in the full list. It's a lot of data that will
    just be out of date soon, breaking stuff and bloating the repo. Can't
    we generate it at build time?

    -rob
  • Dr Volker Dobler at Dec 5, 2012 at 7:12 am
    Just two remarks from a quick skim:

    nodes is an array? Would the GC have to scan
    this array or is it treated as one?

    I found the kyoto/kobe test in the testdata
    suite which can be found at
    http://mxr.mozilla.org/mozilla-central/source/netwerk/test/unit/data/test_psl.txt?raw=1
    often broke my work. Could you include those too?
    On 2012/12/05 00:47:08, nigeltao wrote:
    Hello mailto:rsc@golang.org, mailto:dr.volker.dobler@gmail.com (cc:
    mailto:golang-dev@googlegroups.com),
    I'd like you to review this change to
    https://code.google.com/p/go

    https://codereview.appspot.com/6873056/
  • Nigel Tao at Dec 5, 2012 at 10:03 am

    On Wed, Dec 5, 2012 at 6:12 PM, wrote:
    nodes is an array? Would the GC have to scan
    this array or is it treated as one?
    The GC skips the nodes array. For example, take this program:

    $ cat main.go
    package main

    var u = [...][2]uint32{
    {0x0101, 0x0102},
    {0x0201, 0x0202},
    {0x0301, 0x0302},
    }

    var p = [...][2]*uint32{
    {&u[0][0], &u[0][1]},
    {&u[1][0], &u[1][1]},
    {&u[2][0], &u[2][1]},
    }

    func main() {
    println(u[0][0])
    println(p[0][0])
    }


    The compiler output is:

    $ go tool 6g -SS -o /dev/null main.go | grep GLOBL
    0040 (main.go:7) GLOBL u+0(SB),16,$24(AL*0)
    0040 (main.go:13) GLOBL p+0(SB),$48(AL*0)
    [etc]

    Note that the u global variable is marked "16". This matches
    $GOROOT/src/cmd/6l/6.out.h:37:

    #define NOPTR (1<<4)

    which means that the value doesn't directly or indirectly contain any
    pointers, and thus doesn't need scanning during garbage collection.
    Keep grepping for NOPTR in $GOROOT/src/cmd if you're curious about the
    details (e.g. it leads to the haspointers function in
    $GOROOT/src/cmd/gc/reflect.c, which is where it's decided that
    [...][2]uint32 has no pointers).

    I'll look at the the kyoto/kobe test tomorrow.
  • Dr Volker Dobler at Dec 5, 2012 at 1:42 pm
    Current code breaks for the jp rules like *.kobe.jp
    which should match kobe.jp. But the fix seems trivial
    (see below).

    Your algorithm is very clean and nice.
    Maybe some more comments about how the rules are encoded
    into the tree of nodes might be useful for those who get
    confronted with this code for the first time.

    You named the table generator "gen" as does atom and md5
    while unicode and norm named it "maketable(s)".
    Any chance of unifying these? Any rule when to pick
    which name?

    The generated table.go can be stripped further and seems
    of reasonable size to be checked in. The comments per
    node should be suppressed by default and turned on via
    a command line flag to gen (if checked in).

    It is okay if all this goes to go.net, but I do not
    think that the public suffixes are just some "internet
    database that comes along": Cookies belong to HTTP,
    they are used for security relevant tasks and their
    security depends (to some degree) on not allowing domain
    cookies on certain domains. Unfortunately this list is
    long but not locally accessible like time zone data.
    A "striped version" is useless, even dangerous. The main
    issue I see with a built-in list is, that the Go release
    cycle is much longer than the lists (or current browser)
    and users might rely on an outdated list (which might be
    as bad as none or a stripped list). There is just one
    solution here: fetch at runtime.

    Fetching at runtime would allow a long running server to
    updated its list. Fetch at runtime seems okay, as you
    won't need the list if you don't have internet connection.
    Building Go itself or any other project would still be
    possible without network connectivity. No large, quickly
    out-dating table needs to be checked in (neither the raw,
    nor some generated table.go). Fetch at runtime doesn't
    necessarily mean fetch the raw list from the net: a local
    server, local file or falling back from net to local file
    might be simple and usable.

    Anyway, even fetch at runtime needs an internal
    representation of the rules and the proposed solution is
    a very nice one.





    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go
    File src/pkg/exp/publicsuffix/gen.go (right):

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode83
    src/pkg/exp/publicsuffix/gen.go:83: res, err := http.Get(*url)
    It would be nice to be able to use a local file on disc too.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode215
    src/pkg/exp/publicsuffix/gen.go:215: fmt.Fprintf(w, "const text = ")
    Maybe a short doc comment for text too?

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode236
    src/pkg/exp/publicsuffix/gen.go:236: // [2] nodeType [1] wildcard [13]
    number of children [16] first child.
    A reversed layout of
    numChildren firstChild nodeType wildcard
    would result in an encoding with the two most significant
    bytes set to zero for all nodes without children.
    If output as %x instead of %08x, this would save some more
    bytes of checked in table data.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode241
    src/pkg/exp/publicsuffix/gen.go:241: var nodes = [...][2]uint32{
    Using a uint64 instead of [2]uint32 reduces each line of
    the generated table by 6 byte (total of 22k or 6%) but
    might be slower on 32 bit hardware.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode277
    src/pkg/exp/publicsuffix/gen.go:277: children map[string]*node
    If you use a slice instead of a map for the children,
    you can sort it once and do not need to "collect keys,
    sort keys, iterate map in sorted keys order" as you do
    twice below in assignNodeIndex and printTable.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode325
    src/pkg/exp/publicsuffix/gen.go:325: fmt.Fprintf(w, "{0x%08x, 0x%08x},
    // 0x%04x (%s) %s%c %s\n",
    Could use "0x%x" as format which would save some bytes
    in the generated table (while decreasing readability).

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list.go
    File src/pkg/exp/publicsuffix/list.go (right):

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list.go#newcode47
    src/pkg/exp/publicsuffix/list.go:47: switch u {
    Empty nodes do match wildcard nodes. Adding
    case nodeTypeEmpty:
    if wildcard { suffix = 1 + dot }
    fixes all the testcases, including the kobe.jp
    ones.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.go
    File src/pkg/exp/publicsuffix/list_test.go (right):

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.go#newcode163
    src/pkg/exp/publicsuffix/list_test.go:163: got :=
    slowPublicSuffix(tc.domain)
    Very nice idea.

    https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.go#newcode242
    src/pkg/exp/publicsuffix/list_test.go:242: // TODO(nigeltao): add the
    "Effective Top Level Domain Plus 1" tests from
    These are the ones which discovered the kobe.jp bug.
    Feel free to just copy my testcases.

    https://codereview.appspot.com/6873056/
  • Florian Weimer at Dec 5, 2012 at 9:22 pm

    The input data is temporarily filtered to the .ao, .ar, .arpa, .uk and
    .zw domains, so that code review is easier while still covering the
    interesting * and ! rules. A follow-up changelist will check in the
    unfiltered public suffix list.
    What's the technical need for this list?

    As far as I can tell, it's just a cargo cult. Few things need it (I
    can see that Noscript benefits from it, for instance). The cookie
    restrictions never made much sense to me because they are trivially
    bypassed with a couple of server round trips (among cooperating web
    sites).
  • Russ Cox at Dec 6, 2012 at 3:58 am
    I don't want to move cookiejar code to go.net. I think it's
    embarrassing that we can't make HTTP requests with cookies because we
    don't have a cookiejar implementation, and I want to keep the pressure
    on to get one for Go 1.1. We might not make it, but I don't want to
    give up yet. I realize that the process is generating a lot of
    discussion, but for the most part I think it is helping us arrive a
    better API and implementation.

    I would like to continue iterating on the design of the table here,
    with an eye toward keeping both the in-memory and in-Go-file size as
    small as possible. Once Nigel and Volker are happy with both of those,
    I am not worried about the size of the generator program and the
    generated Go file (don't check in the original web version). In the
    current repository a few hundred K is not a significant difference. I
    would be more worried if we were checking in new tables every day, but
    we're not. However...

    Although I am not worried about the size, I do like Florian's
    suggestion of just not having the list at all. Is that tenable? Is it
    only full web browsers that really need to care? One alternative would
    be to implement a parser for the actual web format of the list and say
    that if you care your client must supply the file content to use.

    I would also like to know what the deal is with punycode. Why does it
    come up here? Why does it matter for cookies? Can we make the client
    be in charge of dealing with punycode too instead of making the cookie
    jar do it?

    Thanks.
    Russ
  • Nigel Tao at Dec 6, 2012 at 4:38 am

    On Thu, Dec 6, 2012 at 2:53 PM, Russ Cox wrote:
    Although I am not worried about the size, I do like Florian's
    suggestion of just not having the list at all. Is that tenable? Is it
    only full web browsers that really need to care? One alternative would
    be to implement a parser for the actual web format of the list and say
    that if you care your client must supply the file content to use.
    It would be possible to parse at runtime, but you might have to care
    then about how efficient the table-building code is.

    It's tenable to provide a 'dev null' public suffix list that always
    returns the last label (the au in foo.com.au). It would mean that
    evil.com.au could affect the cookies for google.com.au, which is
    obviously bad for a full web browser but a simple web-scraping program
    might not care.

    A slightly smarter list is to return the z in x.y.z if z is in
    {com,edu,gov,int,mil,net,org}, otherwise return y.z. I'm told that the
    original Netscape cookie spec did this, and it works for a lot of the
    internet, including example.com, example.com.au and example.co.uk, but
    it isn't correct for example.fr, example.shibuya.tokyo.jp or
    example.pvt.k12.ma.us.

    I believe that Florian is right in that co-operating web sites can
    work around the publicsuffix restrictions on cookies, but I think the
    concern is more about malicious web sites.

    I would also like to know what the deal is with punycode. Why does it
    come up here? Why does it matter for cookies? Can we make the client
    be in charge of dealing with punycode too instead of making the cookie
    jar do it?
    Punycode matters because some elements of
    http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1
    are IDNs (which I'm leaving as TODO for now). For example, grepping
    for "Hong Kong" finds:

    // xn--j6w193g ("Hong Kong" Chinese-Han) : HK
    // https://www2.hkirc.hk/register/rules.jsp
    香港

    I think the client should be responsible for punycoding the domain,
    but if we're generating the table ahead of time, we'll have to
    punycode at that time too.
  • Andrew Gerrand at Dec 6, 2012 at 4:53 am
    What are the security implications of the suffix list updates?

    If the list is checked into the main Go repo, does that mean people at Go
    1.1.0 might be left with an insecure HTTP client if a change is made
    between 1.1.0 and 1.1.1?

    Andrew
  • Russ Cox at Dec 6, 2012 at 4:58 am
    I can't believe I'm saying this but what does IE do?
  • Florian Weimer at Dec 6, 2012 at 8:36 am

    * Russ Cox:

    I can't believe I'm saying this but what does IE do?
    Here's a somewhat dated description, but I think the rationale still
    applies (but IE's behavior may have changed since):

    <http://blogs.msdn.com/b/ieinternals/archive/2009/09/19/private-domain-names-and-public-suffixes-in-internet-explorer.aspx>

    It's hard to come up with a short summary.

    The bypass I mentioned involves having something like this in web
    pages:

    <script src="https://www.enyo.de/metrics/guid.js">

    And the web server would look at a "GUID" cookie, set a new one if it
    does not exist, and return page content like this, reproducing the
    cookie value:

    enyoMetricsGUID = "90b57a93-4433-42b5-81cc-321b09edf868";

    The web page including the script then has access to this global
    cookie, and could use XMLHttpRequest to report it back to its server.

    Regarding the protection from session fixation attacks, these have to
    be targeted at a specific web site (because a valid session cookie is
    required). Setting such a cookie globally therefore does not make
    much sense. But it seems that all web browsers support third party
    cookies by default (and I recently discovered that some Wordpress
    services rely on them), so you could directly set the cookie on the
    victim domain, public suffix list or not.

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupgolang-dev @
categoriesgo
postedDec 5, '12 at 12:47a
activeDec 6, '12 at 8:36a
posts26
users8
websitegolang.org

People

Translate

site design / logo © 2022 Grokbase