I happen to be working on the logical expression simplification effort
(https://issues.apache.org/jira/browse/PIG-1399), but not on the filter
split front. So I guess our interests will have some overlaps.
I think the filter logic split problem can be divided into 2 parts:
1) the filtering logic that can be applied to individual input sources;
and 2) the filtering logic that has to be applied when merged(or joined)
inputs are processed.
The benefits for 1) are any benefits if the underlying storage supports
predicate pushdown; plus the memory/CPU savings by PIG for not
processing the unqualified rows.
For 2), the purpose is not paying higher evaluation costs than
For 1), no normal form is necessary. The original logical expression
can be trimmed off any sub-expressions that are not constants nor just
from a particular input source. The complexity is linear with the tree
size; while the use of normal form could potentially lead to exponential
complexity. The difficulty with this approach is how to generate the
filtering logic for 2); while CNF can be used to easily figure out the
logic for 2). However, the exact logic in 2) might not be cheaper to
evaluate than the original logical expression. An example is "Filter J2
by ((C1 < 10) AND (a3+b3>10)) OR ((C2 == 5) AND (a2+b2 >5))". In 2) the
filtering logic after CNF will be "((C1 < 10) OR (a2+b2 > 5)) AND
((a3+b3>10) OR (C2 == 5)) AND ((a3+b3 >10) OR (a2+b2 > 5))". The cost
will be 5 logical evaluations (3 OR plus 2 AND), which could be reduced
to 4, compared with 3 logical evaluations in the original form.
In summary, if only 1) is desired, the tree trimming is enough. If 2) is
desired too, then CNF could be used but its complexity should be
controlled and the cost of the filtering logic evaluation in 2) should
be computed and compared with the original expression evaluation cost.
Further optimization is possible in this direction.
Another potential optimization to consider is to support logical
expression tree of multiple children vs. the binary tree after taking
into consideration of the commutative property of OR and AND operations.
The advantages are less tree traversal costs and easier to change the
evaluation ordering within the same sub-tree in order to maximize the
possibilities to short-cut the evaluations. Although this is general for
all logical expressions, this tends to be more suitable for normal form
handlings as normal forms group the sub-expressions by the operators
that act on the sub-expressions.
From: Swati Jain
Sent: Monday, July 05, 2010 2:34 AM
Cc: Daniel Dai
Subject: PIG Logical Optimization: Use CNF in SplitFilter
I am interested in implementing logical optimization rules and to target
this I have studied currently implemented logical rules and the rule
framework. In particular, I felt that rules dealing with LOfilter are
able to handle complicated boolean expressions. I would like to share
suggestions to improve handling of boolean expressions in LOFilter to
1. SplitFilter Rule : SplitFilter rule is splitting one LOFilter into
"AND". However it will not be able to split LOFilter if the top level
operator is "OR". For example:
A = load 'file_a' USING PigStorage(',') as (a1:int,a2:int,a3:int);
B = load 'file_b' USING PigStorage(',') as (b1:int,b2:int,b3:int);
C = load 'file_c' USING PigStorage(',') as (c1:int,c2:int,c3:int);
J1 = JOIN B by b1, C by c1;
J2 = JOIN J1 by $0, A by a1;
D = *Filter J2 by ( (c1 < 10) AND (a3+b3 > 10) ) OR (c2 == 5);*
In the above example current rule is not able to any filter condition
any join as it contains columns from all branches (inputs). But if we
convert this expression into "Conjunctive Normal Form" (CNF) then we
be able to push filter condition c1< 10 and c2 == 5 below both join
conditions. Here is the CNF expression for highlighted line:
( (c1 < 10) OR (c2 == 5) ) AND ( (a3+b3 > 10) OR (c2 ==5) )
*Suggestions:* It would be a good idea to convert LOFilter's boolean
expression into CNF, it would then be easy to push parts (conjuncts) of
LOFilter boolean expression selectively.
I have started thinking about the design for implementing this
(arbitrary boolean expression to CNF) and would appreciate any feedback