Current code breaks for the jp rules like *.kobe.jp
which should match kobe.jp. But the fix seems trivial
(see below).
Your algorithm is very clean and nice.
Maybe some more comments about how the rules are encoded
into the tree of nodes might be useful for those who get
confronted with this code for the first time.
You named the table generator "gen" as does atom and md5
while unicode and norm named it "maketable(s)".
Any chance of unifying these? Any rule when to pick
which name?
The generated table.go can be stripped further and seems
of reasonable size to be checked in. The comments per
node should be suppressed by default and turned on via
a command line flag to gen (if checked in).
It is okay if all this goes to go.net, but I do not
think that the public suffixes are just some "internet
database that comes along": Cookies belong to HTTP,
they are used for security relevant tasks and their
security depends (to some degree) on not allowing domain
cookies on certain domains. Unfortunately this list is
long but not locally accessible like time zone data.
A "striped version" is useless, even dangerous. The main
issue I see with a built-in list is, that the Go release
cycle is much longer than the lists (or current browser)
and users might rely on an outdated list (which might be
as bad as none or a stripped list). There is just one
solution here: fetch at runtime.
Fetching at runtime would allow a long running server to
updated its list. Fetch at runtime seems okay, as you
won't need the list if you don't have internet connection.
Building Go itself or any other project would still be
possible without network connectivity. No large, quickly
out-dating table needs to be checked in (neither the raw,
nor some generated table.go). Fetch at runtime doesn't
necessarily mean fetch the raw list from the net: a local
server, local file or falling back from net to local file
might be simple and usable.
Anyway, even fetch at runtime needs an internal
representation of the rules and the proposed solution is
a very nice one.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.goFile src/pkg/exp/publicsuffix/gen.go (right):
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode83src/pkg/exp/publicsuffix/gen.go:83: res, err := http.Get(*url)
It would be nice to be able to use a local file on disc too.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode215src/pkg/exp/publicsuffix/gen.go:215: fmt.Fprintf(w, "const text = ")
Maybe a short doc comment for text too?
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode236src/pkg/exp/publicsuffix/gen.go:236: // [2] nodeType [1] wildcard [13]
number of children [16] first child.
A reversed layout of
numChildren firstChild nodeType wildcard
would result in an encoding with the two most significant
bytes set to zero for all nodes without children.
If output as %x instead of %08x, this would save some more
bytes of checked in table data.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode241src/pkg/exp/publicsuffix/gen.go:241: var nodes = [...][2]uint32{
Using a uint64 instead of [2]uint32 reduces each line of
the generated table by 6 byte (total of 22k or 6%) but
might be slower on 32 bit hardware.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode277src/pkg/exp/publicsuffix/gen.go:277: children map[string]*node
If you use a slice instead of a map for the children,
you can sort it once and do not need to "collect keys,
sort keys, iterate map in sorted keys order" as you do
twice below in assignNodeIndex and printTable.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/gen.go#newcode325src/pkg/exp/publicsuffix/gen.go:325: fmt.Fprintf(w, "{0x%08x, 0x%08x},
// 0x%04x (%s) %s%c %s\n",
Could use "0x%x" as format which would save some bytes
in the generated table (while decreasing readability).
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list.goFile src/pkg/exp/publicsuffix/list.go (right):
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list.go#newcode47src/pkg/exp/publicsuffix/list.go:47: switch u {
Empty nodes do match wildcard nodes. Adding
case nodeTypeEmpty:
if wildcard { suffix = 1 + dot }
fixes all the testcases, including the kobe.jp
ones.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.goFile src/pkg/exp/publicsuffix/list_test.go (right):
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.go#newcode163src/pkg/exp/publicsuffix/list_test.go:163: got :=
slowPublicSuffix(tc.domain)
Very nice idea.
https://codereview.appspot.com/6873056/diff/12001/src/pkg/exp/publicsuffix/list_test.go#newcode242src/pkg/exp/publicsuffix/list_test.go:242: // TODO(nigeltao): add the
"Effective Top Level Domain Plus 1" tests from
These are the ones which discovered the kobe.jp bug.
Feel free to just copy my testcases.
https://codereview.appspot.com/6873056/