On 2014-02-09 05:59, Karl Williamson wrote:
The regex optimizer under certain circumstances uses a "synthetic start
class" (SSC) to aid in finding suitable places in the target string
being matched to try out the pattern. For example, given a pattern
whose alternatives can each begin only with an 'a' or a 'b', it would
generate an SSC equivalent to /[ab]/, as we know the pattern will match
those, and only those, two characters.
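To make the idea concrete, here is a minimal sketch (hypothetical names, not the engine's actual code) of how an SSC is used: the expensive full-match attempt is skipped at every position whose character cannot possibly start a match.

```python
# Hypothetical sketch of synthetic-start-class (SSC) filtering:
# try the full (possibly expensive) pattern only at positions whose
# character is in the SSC, instead of at every position.
import re

def find_with_ssc(pattern, ssc_chars, text):
    """Attempt the full pattern only where the SSC allows a match to start."""
    compiled = re.compile(pattern)
    ssc = set(ssc_chars)
    for i, ch in enumerate(text):
        if ch in ssc:                    # cheap per-position filter
            m = compiled.match(text, i)  # expensive full match attempt
            if m:
                return m
    return None

# A pattern whose alternatives can only begin with 'a' or 'b',
# so its SSC is equivalent to /[ab]/:
m = find_with_ssc(r"a\d+|b\w+", "ab", "xxx yyy a42 zzz")
```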

I have noticed that not infrequently the generated SSC matches every
single possible code point. This isn't helpful, and we might as well
just not use the SSC in those cases. But suppose the SSC matches every
single code point but one, as \N does. This is helpful only for
ruling out \n, amongst the billions of possible code points. Is it
worthwhile to use an SSC in this case? If not, what if it rules
out 2, 3, ... possibilities? At what point does it become worthwhile?
I don't know, and so am asking if anybody has any ideas.
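One way to think about the threshold question, as a rough sketch (names hypothetical): the SSC saves work roughly in proportion to the fraction of input positions it rejects, so an SSC whose rejection rate on representative input is near zero is pure overhead.

```python
def ssc_rejection_rate(ssc, sample_text):
    """Fraction of positions in a representative sample that the SSC
    rules out; near 0 means the SSC is pure per-position overhead."""
    if not sample_text:
        return 0.0
    rejected = sum(1 for ch in sample_text if ch not in ssc)
    return rejected / len(sample_text)

# An SSC matching every code point but "\n" (like \N) rejects almost
# nothing in ordinary text, so it is rarely worth the per-position test:
everything_but_newline = set(map(chr, range(256))) - {"\n"}
rate = ssc_rejection_rate(everything_but_newline, "line one\nline two\n")
```

This also captures the context dependence discussed below: the same SSC can have a high rejection rate on one corpus and a negligible one on another.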

In some ways the answer depends on context. Suppose the SSC ruled out
only Cyrillic characters. If you are processing Russian text, that
could well be worthwhile; but if you're not, this particular SSC won't
help. The optimizer doesn't know your plans, so it doesn't know the
answer.

I'm looking for a general rule, and would appreciate any insight.
With (pre-generated) Bloom filters (for example, a bitmask over code
points tested with AND) it could well be worth it.
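A sketch of that bitmask idea (names and sizes hypothetical): pre-generate a Bloom-filter-like bitmap over code points, so the per-position SSC test is a shift and an AND rather than a set lookup. False positives are possible, but a clear bit means the character definitely cannot start a match.

```python
# Hypothetical Bloom-filter-style prefilter for an SSC:
# one bit per (code point mod BITS), stored in a single big integer.
BITS = 1 << 10       # filter size, a power of two
MASK = BITS - 1

def build_filter(chars):
    """OR one bit per code point into the bitmap."""
    bitmap = 0
    for ch in chars:
        bitmap |= 1 << (ord(ch) & MASK)
    return bitmap

def may_match(bitmap, ch):
    """True if ch *might* be in the SSC; False means definitely not."""
    return bool(bitmap & (1 << (ord(ch) & MASK)))

# Filter for an SSC equivalent to /[ab]/:
ssc_filter = build_filter("ab")
```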


"Compared to the DFA-based repeated scanning approach, our DFA-based
one-pass scanning approach has 12 to 42 times performance improvements.
Compared to the NFA-based implementation, our DFA scanner is 50 to 700
times faster on traffic dumps obtained from MIT and Berkeley networks."


Discussion: perl5-porters; posted Feb 9, 2014 at 4:59a; last activity
Jun 29, 2014 at 8:01a.