On Wed, Mar 23, 2016 at 11:43 PM, wrote:
Re the overall pictures at http://imgur.com/a/5sWM4 and at
https://blog.klauspost.com/rebalancing-deflate-compression-levels/,
it's not obvious to me whether the graphs are showing the old or new
algorithms for levels 1, 2 and 3. In fact, I think the best graph
would show both.
I didn't want to go into details about the re-balancing, since we can take
that discussion separately, but the point was to show that:

* There is a big gap from the proposed level 3 to level 4, which will be
Is part of this bullet point missing? The sentence appears incomplete.

In any case, I will repeat myself in case there was a
miscommunication: even if there is (or was) a big gap between level 3
and level 4, it would be great to see both the old and new algorithms
for levels 1-3 on the graph.

Flate is normally Huffman + LZ77. In
https://go-review.googlesource.com/#/c/20982/ you are also proposing a
Huffman-only compression level. How does that relate (if at all) to
this proposal?
It is not directly related, so I split it out into a separate CL and
didn't mention it here, to keep the discussion a bit more focused.
OK. It is an API change (an addition), so I think it's worth at least
mentioning at some point on golang-dev. But it's mentioned now, and
the change LGTM (modulo some minor issues).
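For concreteness, here is a minimal sketch of how such a mode would be
selected through the existing flate API, assuming the constant proposed
in CL 20982 ends up being called flate.HuffmanOnly; the point is that it
slots in as just another level passed to flate.NewWriter.

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
)

func main() {
	var buf bytes.Buffer
	// HuffmanOnly skips LZ77 matching entirely and only applies Huffman
	// entropy coding, trading compression ratio for speed.
	w, err := flate.NewWriter(&buf, flate.HuffmanOnly)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(bytes.Repeat([]byte("golang-dev "), 1000)); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compressed %d bytes to %d\n", 11*1000, buf.Len())
}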

Should we have more than one such level? Conversely, when introducing
Snappy style matching, should we replace levels 1, 2 and 3, or just
level 1 aka BestSpeed?
Levels 2 and 3 offer a good progression. The "generic" deflater leaves a big
gap down to level 1, and hits a performance ceiling. This is what you see in
my blog post, which was written before I wrote the specialized levels 2 and 3.
They may offer good progression (again, seeing both the before and the
after for levels 1-3 on the graphs would help inform the discussion),
but in practice, do people use anything other than BestSpeed,
BestCompression, NoCompression or DefaultCompression? I don't know the
answer, but if not, it might not be worth having three copy/pastes of
the new algorithm if nobody uses levels 2 or 3.

Please don't do that. We have had problems with this before:
Don't worry, I know what I am doing. Each deflate block is never longer
than 64KB, so skipping is reset for every 64KB of data processed. This works
as intended. If it didn't, 10GB.tar would end up considerably worse.
I'm sorry, but I don't find "don't worry, I know what I am doing"
persuasive. That sort of reassurance would not have prevented the bug
that https://github.com/golang/snappy/commit/d1d908a252c22fd7afd36190d5cffb144aa8f777
fixed.

Even if you are now resetting every 64KB, that still means that the "s
+= 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
64KB block. Specifically, it is exponential instead of quadratic (see,
for example, the table after "It prints" on the golang/snappy commit
linked to immediately above), which I think is too aggressive.
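To make the aggressiveness comparison concrete, here is a small standalone
simulation (not the actual flate or snappy code, and only an approximation
of the C++ snappy heuristic of probing every byte for the first 32 bytes,
then every second byte, then every third, and so on) that counts how many
hash lookups each skipping rule performs while scanning a 64KB block in
which no matches are found:

package main

import "fmt"

func main() {
	const blockSize = 64 << 10 // 64KB, the block size mentioned above

	// Rule quoted above from the proposed flate code: s += 1 + ((s - lit) >> 4).
	// With lit fixed at the start of the literal run, each step grows in
	// proportion to the distance already scanned, so s grows exponentially
	// in the number of lookups.
	s, lit, n := 0, 0, 0
	for s < blockSize {
		s += 1 + ((s - lit) >> 4)
		n++
	}
	fmt.Printf("s += 1 + ((s-lit)>>4) skipping: %d lookups to cross 64KB\n", n)

	// Approximation of the C++ snappy heuristic: probe every byte for the
	// first 32 bytes, then every second byte, then every third, and so on,
	// so the offset grows roughly quadratically in the number of lookups.
	offset, skip, m := 0, 32, 0
	for offset < blockSize {
		offset += skip >> 5
		skip++
		m++
	}
	fmt.Printf("snappy-style skipping:          %d lookups to cross 64KB\n", m)
}

Under these assumptions the quoted rule crosses the 64KB block in far fewer
lookups than the snappy-style rule, which is the "too aggressive" behaviour
described above. The exact numbers are only illustrative, since in the real
encoders the skipping resets whenever a match is found (and, per the above,
at each 64KB block boundary).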

We clearly disagree on where to pick the trade-off for speed vs size,
but I feel that "do more or less what C++ snappy does" is a safe
choice, and using a novel algorithm has some risk: I discovered and
fixed one bug in d1d908a2 but there may be others, on other classes of
input.

I know that the compression experts who made google/snappy test
algorithmic tweaks on large corpora of test data, whether samples of a
web crawl or of RPC traffic, although it's clearly not practical to
check hundreds of megabytes of test data into the google/snappy git
repository. If you can convince the C++ google/snappy maintainers to
change their skipping algorithm to yours, then I'm happy to adopt your
skipping algorithm. But until then, I don't think I've seen enough
evidence to move off the safe choice.

For example, https://go-review.googlesource.com/#/c/20513/ points to a
spreadsheet at https://docs.google.com/spreadsheets/d/1VLxi-ac0BAtf735HyH3c1xRulbkYYUkFecKdLPH7NIQ/edit?usp=sharing
but to repeat what I said earlier on that code review: if I'm reading
the spreadsheet linked to from the CL description right, there are only
three test files used to compare the before and after size. I don't
see links to the actual test files, but all three of them ("very
compressible", "random data", "xml") sound like they're either all
compressible or all incompressible. Three isn't a large sample size,
and it doesn't cover things like PDF, which, IIUC, is a textual
format (i.e. compressible) that can embed binary resources such as
JPEG images (i.e. incompressible) all in the same file.

Btw, now that you mention it, you still haven't fixed Snappy, which will
completely stop compressing data once it encounters 1MB of incompressible data.
This is drifting off-topic, but that was fixed in
https://github.com/golang/snappy/commit/bf2ded9d81f5c22b62cf76363673c6f9765a6703
which was committed a month ago. The test cases added in that commit
include a 2MB input of which the first 1MB is incompressible.


Finally, I hope that I haven't come across as being too negative in
the various flate discussions. I have made a number of criticisms, but
I aspire to be constructive. Please let me know if you are not finding
that to be the case. I do appreciate the work you've done, and I do
think that it will significantly improve Go 1.7 and beyond.
