Hi!

Thanks for the feedback.

I have adjusted the CL, since the benchmark numbers changed after CL
20929 was merged.

I have added several more benchmarks to the spreadsheet (a disk image,
10GB.tar, a GOB stream, Go object files, Go source files, web server
benchmarks), as well as the revision numbers used for the benchmarks. Files
for each benchmark can now be downloaded.

https://docs.google.com/spreadsheets/d/1VLxi-ac0BAtf735HyH3c1xRulbkYYUkFecKdLPH7NIQ/edit?usp=sharing

It should show the strengths and weaknesses of the revised algorithm,
including where it doesn't perform as well as the current master. However,
the "bad" cases are mostly on content that is already very fast to
compress, while this CL should hopefully reduce the "worst case" (in terms
of speed) compression scenarios, where it provides a very good speedup. I
want to be transparent about this trade-off.


I have made some small adjustments to the code:
* I have taken out most of the shared code, so each function now only has
the main encoding loop.
* I have removed an optimization that may be too aggressive for a standard
package: skipping Huffman compression if no matches are found in a 64KB
block. That would hurt, for instance, base64-encoded random data, so it
now always attempts a Huffman encode.
* Hash values are now 16 bits instead of 14 bits. This does not hurt
performance and allows for better compression than the previous version
(see the sketch after this list).
* Minor tweaks.
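
To give an idea of what the hash-size change means, here is a minimal
sketch of a multiplicative hash parameterised by the number of hash bits;
the multiplier and table layout are illustrative assumptions, not the exact
code from the CL:

package main

import "fmt"

// Sketch only: a multiplicative hash over four input bytes, sized by
// hashBits. Going from 14 to 16 bits quadruples the match table
// (16384 -> 65536 entries), so fewer positions collide.
const hashBits = 16 // previously 14

func hash4(u uint32) uint32 {
	const prime = 2654435761 // assumed multiplier, for illustration
	return (u * prime) >> (32 - hashBits)
}

func main() {
	table := make([]int32, 1<<hashBits) // most recent position per hash value
	fmt.Println("table entries:", len(table))
	fmt.Println("hash of 'abcd':", hash4(0x64636261))
}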

On Thursday, 24 March 2016 13:10:46 UTC+1, Nigel Tao wrote:


> In any case, I will repeat myself in case there was a
> mis-communication, but even if there is or was or both being a big gap
> between level 3 and level 4, it would be great to see both the old and
> new algorithms for levels 1-3 on the graph.
I will do that when we look at re-balancing the levels. I have updated the
spreadsheet with the new numbers, which you are welcome to grab if you have
the time.

> They may offer good progression (again, seeing both the before and the
> after for levels 1-3 on the graphs would help inform the discussion),
> but in practice, do people use anything other than BestSpeed,
> BestCompression, NoCompression or DefaultCompression? I don't know the
> answer, but if not, it might not be worth having three copy/pastes of
> the new algorithm if nobody uses levels 2 or 3.
You have a good point. I cannot be the judge of whether it should be
included. I have tried to factor out the common code in the CL now - at
least the parts that could be shared without measurably sacrificing
performance.

If it is still "too much", I would go for adding "encodeL2" as BestSpeed.
It is often very close to "encodeL1" in terms of speed (within 5-10%), but
offers about 1% better compression, so if we only replace level 1, I would
choose that. Maybe other people have input on whether it should be a
choice, or whether we should just implement one level.


> Even if you are now resetting every 64KB, that still means that the "s
> += 1 + ((s - lit) >> 4)" skipping can be very aggressive within each
> 64KB block. Specifically, it is exponential instead of quadratic (see,
> for example, the table after "It prints" on the golang/snappy commit
> linked to immediately above), which I think is too aggressive.
I have some "mixed content" data: a typical backup set and a virtual disk
image. I do not see any significant compression loss from the aggressive
skipping. IMO the risk is small, since at most 64KB will be affected. In my
opinion that is a reasonable trade-off for levels 1-3, where levels 2 and 3
are tuned less aggressively.
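
To make the aggressiveness concrete, here is a small self-contained
simulation of the quoted skip expression over one 64KB block in which no
matches are found; the loop and the printed count are mine and only
illustrate how quickly the step size grows:

package main

import "fmt"

func main() {
	const blockSize = 64 << 10
	lit := 0 // start of the current literal run
	s := lit
	checks := 0
	for s < blockSize {
		checks++ // one hash lookup / match attempt at position s
		s += 1 + ((s - lit) >> 4)
	}
	fmt.Printf("match attempts in a %d byte block with no matches: %d\n",
		blockSize, checks)
}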


> I know that the compression experts that made google/snappy test
> algorithmic tweaks on large corpora of test data, whether samples of a
> web crawl or of RPC traffic, although it's clearly not practical to
> check in 100s of megabytes of test data into the google/snappy git
> repository. If you can convince the C++ google/snappy maintainers to
> change their skipping algorithm to yours, then I'm happy to adopt your
> skipping algorithm. But until then, I don't think I've seen enough
> evidence to move off the safe choice.
I have added a bunch more test sets, which you can download and test. I
cannot "compete" with a secret test set, and we are talking about an
algorithm where the user has selected "I want the fastest option". If we
were talking about "DefaultCompression", I would definitely go for the safe
option.


> compressible or all incompressible. Three isn't a large sample size,
> and it doesn't cover things like PDF files which, IIUC, is like a
> textual format (i.e. compressible) that can embed binary resources
> such as JPEG images (i.e. incompressible) all in the one file.
The 10GB test set is good "mixed content", as is the disk image. If you
have any more specific ones, I would love to test them.


> Finally, I hope that I haven't come across as being too negative in
> the various flate discussions. I have made a number of criticisms, but
> I aspire to be constructive. Please let me know if you are not finding
> that to be the case. I do appreciate the work you've done, and I do
> think that it will significantly improve Go 1.7 and beyond.
And I also love a good discussion. For issues like this it is important to
discuss trade-offs, but I don't like to be presented with "facts" that
subtly imply I don't know what I am doing. We both do sub-optimal things
from time to time, but bringing up the same thing for the 4th or 5th time
and implying "you made a mistake there, therefore this is garbage" does not
feel like a constructive way of discussing this.

I don't think we have wildly different views on this. I think I cater a
bit more toward the typical use case. Even though we as "compression
experts" know that gzip-compressing a JPEG is usually a silly thing to do,
it happens quite often in practice (think backups and disk images), so
having the compressor quickly (5x faster) "skip" such content is important
in my design. That is a much more likely scenario than the small chance of
slightly sub-optimal compression within a 64KB block, so I optimize for the
former.

The optimization I just removed is another such case. When no matches were
found in a 64KB block, the block was always skipped (stored). For *most*
content that is fine, but for something like base64-encoded random data it
would never reduce the entropy. To me this is a clear case of a bad worst
case in a realistic usage scenario: yes, people will compress base64
encoded data, and it should be compressed.
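
To make that concrete, here is a small check (not part of the CL) using the
standard compress/flate package: base64-encoded random data uses only 64
distinct symbols, roughly 6 bits of entropy per 8-bit byte, so an entropy
coder alone still shrinks it noticeably even though LZ77 finds no matches:

package main

import (
	"bytes"
	"compress/flate"
	"crypto/rand"
	"encoding/base64"
	"fmt"
	"log"
)

func main() {
	// Random bytes are incompressible, but the base64 encoding of them
	// is not: Huffman coding alone gets close to 6 bits per byte.
	raw := make([]byte, 1<<20)
	if _, err := rand.Read(raw); err != nil {
		log.Fatal(err)
	}
	encoded := []byte(base64.StdEncoding.EncodeToString(raw))

	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(encoded); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("base64 input: %d bytes, deflated: %d bytes (%.0f%%)\n",
		len(encoded), buf.Len(), 100*float64(buf.Len())/float64(len(encoded)))
}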

We are not dealing with "perfect" compression here, which makes this
discussion harder than average. We are dealing with a spectrum: "give me
fast" (BestSpeed), "give me something reasonable" (DefaultCompression) and
"give me close to optimal" (BestCompression). For the "fast" end you *will*
be able to construct worst cases (both in terms of speed and compression),
but they should be rare. To me (and hopefully to users) it is obvious that
BestSpeed will not deliver the best compression, because it makes certain
choices for speed. A worst case where speed is very low is bad when you
have selected a "fast" option, and my choices in the code should reflect
that.
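
For reference, that spectrum is simply the level the caller passes to the
writer; a trivial usage sketch with the standard compress/flate API (the
sample data is made up, only the constants matter):

package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
)

// compress returns the deflated size of data at the given level, showing
// the speed/size trade-off the caller opts into.
func compress(data []byte, level int) int {
	var buf bytes.Buffer
	w, err := flate.NewWriter(&buf, level)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write(data); err != nil {
		log.Fatal(err)
	}
	if err := w.Close(); err != nil {
		log.Fatal(err)
	}
	return buf.Len()
}

func main() {
	data := bytes.Repeat([]byte("the quick brown fox jumps over the lazy dog. "), 1000)
	for _, l := range []int{flate.NoCompression, flate.BestSpeed, flate.DefaultCompression, flate.BestCompression} {
		fmt.Printf("level %2d: %d bytes\n", l, compress(data, l))
	}
}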


/Klaus


