On Wed, Mar 23, 2016 at 11:43 PM, wrote:
>> Re the overall pictures at http://imgur.com/a/5sWM4 and at
>> https://blog.klauspost.com/rebalancing-deflate-compression-levels/
>> it's not obvious to me whether the graphs are showing the old or new
>> algorithms for levels 1, 2 and 3. In fact, I think the best graph
>> would show both.
>
> I didn't want to go into details about the re-balancing, since we can
> take that discussion separately, but the point was to show that:
>
> * There is a big gap from the proposed level 3 to level 4, which will be

In any case, I will repeat myself in case there was a
mis-communication: even if there is (or was) a big gap between level 3
and level 4, it would be great to see both the old and new algorithms
for levels 1-3 on the graph.

>> Flate is normally Huffman + LZ77. In
>> https://go-review.googlesource.com/#/c/20982/ you are also proposing a
>> Huffman-only compression level. How does that relate (if at all) to
>> this proposal?
>
> It is not directly related, so I separated it out in a separate CL, and
> didn't mention it here to have the discussion a bit more focused.

It's probably still worth mentioning at some point on golang-dev. But
it's mentioned now, and the change LGTM (modulo some minor issues).
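
As an aside, for anyone following along, a Huffman-only level is used
just like any other level constant. A minimal sketch, assuming the
constant is exported as flate.HuffmanOnly (the name isn't the point
here):

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
)

func main() {
	// Huffman-only entropy-codes the bytes but skips LZ77 matching
	// entirely, which is mostly interesting for input where matching
	// is expensive or fruitless.
	in := bytes.Repeat([]byte("abcdefgh"), 4096)

	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.HuffmanOnly)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := zw.Write(in); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Huffman-only: %d bytes in, %d bytes out\n", len(in), buf.Len())
}
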

>> Should we have more than one such level? Conversely, when introducing
>> Snappy style matching, should we replace levels 1, 2 and 3, or just
>> level 1 aka BestSpeed?
>
> Level 2+3 offer good progression. The "generic" deflater leaves a big
> gap to level 1, and hits a performance ceiling. This is what you see in
> my blog post, which was written before I wrote the specialized level 2+3.

That may be so (and, again, seeing both before and after for levels 1-3
on the graphs would help inform the discussion), but in practice, do
people use anything other than BestSpeed, BestCompression, NoCompression
or DefaultCompression? I don't know the answer, but if not, it might not
be worth having three copy/pastes of the new algorithm if nobody uses
levels 2 or 3.
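
For what it's worth, the comparison I have in mind is easy to script. A
rough sketch (the test file path is just a placeholder, and real
measurements would obviously also need timing and a proper corpus):

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"io/ioutil"
	"log"
)

func main() {
	data, err := ioutil.ReadFile("testdata/large.bin") // placeholder input
	if err != nil {
		log.Fatal(err)
	}
	// Print the output size at every level, so the gap (or lack of one)
	// between levels 1-3 and level 4 is easy to eyeball.
	for level := flate.BestSpeed; level <= flate.BestCompression; level++ {
		var buf bytes.Buffer
		zw, err := flate.NewWriter(&buf, level)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := zw.Write(data); err != nil {
			log.Fatal(err)
		}
		if err := zw.Close(); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("level %d: %d -> %d bytes\n", level, len(data), buf.Len())
	}
}
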

>> Please don't do that. We have had problems with this before:
>
> Don't worry, I know what I am doing. Each deflate block is never longer
> than 64KB, so skipping is reset for every 64KB of data processed. This
> works as intended. If it didn't, 10GB.tar would end up considerably worse.

I'm afraid I don't find that entirely persuasive. That sort of
reassurance would not have prevented the bug that
https://github.com/golang/snappy/commit/d1d908a252c22fd7afd36190d5cffb144aa8f777
fixed.
Even if you are now resetting every 64KB, that still means that the "s
+= 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
64KB block. Specifically, it is exponential instead of quadratic (see,
for example, the table after "It prints" on the golang/snappy commit
linked to immediately above), which I think is too aggressive.
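
To make the difference concrete, here is a small, self-contained sketch
(not the actual deflate or snappy code) that counts how many hash probes
each schedule makes while crossing a single 64KB block in which no
matches are ever found; the skip>>5 variant is my reading of the C++
snappy heuristic, so treat it as an approximation:

package main

import "fmt"

func main() {
	const blockSize = 64 * 1024

	// Schedule A (the CL under discussion): s += 1 + ((s - lit) >> 4).
	// The step is proportional to the distance already covered, so the
	// position grows geometrically, i.e. exponentially in the probe count.
	probesA := 0
	for s, lit := 0, 0; s < blockSize; probesA++ {
		s += 1 + ((s - lit) >> 4)
	}

	// Schedule B (roughly the C++ snappy heuristic): the gap between hash
	// probes grows by one byte every 32 probes, so the position grows
	// quadratically in the probe count.
	probesB := 0
	for s, skip := 0, 32; s < blockSize; probesB++ {
		s += skip >> 5
		skip++
	}

	fmt.Printf("schedule A: %d probes to cross a 64KB block\n", probesA)
	fmt.Printf("schedule B: %d probes to cross a 64KB block\n", probesB)
}

Schedule A gives up on finding matches far sooner than schedule B within
the same 64KB window, which is the aggressiveness I'm worried about.
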
We clearly disagree on where to pick the trade-off for speed vs size,
but I feel that "do more or less what C++ snappy does" is a safe
choice, and using a novel algorithm has some risk: I discovered and
fixed one bug in d1d908a2 but there may be others, on other classes of
input.
I know that the compression experts that made google/snappy test
algorithmic tweaks on large corpora of test data, whether samples of a
web crawl or of RPC traffic, although it's clearly not practical to
check in 100s of megabytes of test data into the google/snappy git
repository. If you can convince the C++ google/snappy maintainers to
change their skipping algorithm to yours, then I'm happy to adopt your
skipping algorithm. But until then, I don't think I've seen enough
evidence to move off the safe choice.
For example, https://go-review.googlesource.com/#/c/20513/ points to a
spreadsheet at https://docs.google.com/spreadsheets/d/1VLxi-ac0BAtf735HyH3c1xRulbkYYUkFecKdLPH7NIQ/edit?usp=sharing
but to repeat what I said earlier on that code review: if I'm reading
the spreadsheet linked to from the CL description right, there are only
three test files used to compare the before and after size. I don't
see links to the actual test files, but all three of them ("very
compressible", "random data", "xml") sound like they're either all
compressible or all incompressible. Three isn't a large sample size,
and it doesn't cover things like PDF files which, IIUC, are a textual
format (i.e. compressible) that can embed binary resources such as JPEG
images (i.e. incompressible) all in the one file.
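
To make that concrete, even a synthetic "PDF-like" file would be a
useful addition to the test set. A purely illustrative sketch (this
input and the choice of levels are my own invention, not anything from
the CL's spreadsheet):

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
	"math/rand"
)

func main() {
	// Interleave compressible text with incompressible (pseudo-random)
	// blobs, mimicking a PDF that embeds JPEG images.
	var in bytes.Buffer
	rng := rand.New(rand.NewSource(1))
	text := bytes.Repeat([]byte("<< /Type /Page /Contents 4 0 R >>\n"), 2048)
	blob := make([]byte, 64*1024)
	for i := 0; i < 8; i++ {
		in.Write(text)
		rng.Read(blob)
		in.Write(blob)
	}

	for _, level := range []int{1, 3, 6, 9} {
		var out bytes.Buffer
		zw, err := flate.NewWriter(&out, level)
		if err != nil {
			log.Fatal(err)
		}
		zw.Write(in.Bytes()) // error handling elided for brevity
		zw.Close()
		fmt.Printf("level %d: %d -> %d bytes\n", level, in.Len(), out.Len())
	}
}
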

> Btw, now that you mention it, you still haven't fixed Snappy which will
> completely stop compressing data once it encounters 1MB uncompressible data

This is drifting off-topic, but that was fixed in
https://github.com/golang/snappy/commit/bf2ded9d81f5c22b62cf76363673c6f9765a6703
which was committed a month ago. The test cases added in that commit
include a 2MB input of which the first 1MB is uncompressible.
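
For reference, that regression test is roughly the following shape (a
paraphrase of the idea, not the test code from the commit):

package main

import (
	"fmt"
	"math/rand"

	"github.com/golang/snappy"
)

func main() {
	const mb = 1 << 20

	// 2MB input: the first half is incompressible, the second half (all
	// zeroes) is trivially compressible. Before the fix, the encoder
	// stopped compressing after the incompressible prefix and emitted
	// nearly the whole 2MB verbatim; after the fix, the output is well
	// under 2MB because the second half still compresses.
	src := make([]byte, 2*mb)
	rand.New(rand.NewSource(1)).Read(src[:mb])

	dst := snappy.Encode(nil, src)
	fmt.Printf("encoded %d bytes to %d bytes\n", len(src), len(dst))
}
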
Finally, I hope that I haven't come across as being too negative in
the various flate discussions. I have made a number of criticisms, but
I aspire to be constructive. Please let me know if you are not finding
that to be the case. I do appreciate the work you've done, and I do
think that it will significantly improve Go 1.7 and beyond.