tidb/pkg/planner/funcdep/doc.go

// Copyright 2022 PingCAP, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package funcdep

// Theory to Practice
//
// For more rigorous examination of functional dependencies and their
// interaction with various SQL operators, see the following Master's Thesis:
//
//   Norman Paulley, Glenn. (2000).
//   Exploiting Functional Dependence in Query Optimization.
//   https://cs.uwaterloo.ca/research/tr/2000/11/CS-2000-11.thesis.pdf

// TODO: Add the RFC design.

// NOTE 1.
// when handling Lax FD, we don't care the null value in the dependency, which means
// as long as null-attribute coverage of the determinant can make a Lax FD as strict one.

// The definition of "lax" used in the paper differs from the definition used by this
// library. For a lax dependency A~~>B, the paper allows this set of rows:
//
//	a  b
//	-------
//	1  1
//	1  NULL
//
//	This alternate definition is briefly covered in section 2.5.3.2 of the paper (see definition
//	2.19). The reason for this change is to allow a lax dependency to be upgraded to a strict
//	dependency more readily, needing only the determinant columns to be not-null rather than
//  both determinant and dependant columns.
//
// This is on the condition that, for definite values of determinant of a Lax FD, it won't
// have two same definite dependant value. That's true, because there is no way can derive
// to this kind of FD.
//
// Even in our implementation of outer join, the only way to produce duplicate definite
// determinant is the join predicate. But for now, we only maintain the equivalence and
// some strict FD of it.
//
//   t(a,b) left join t1(c,d,e) on t.a = t1.c and b=1
//  a  b  |  c     d     e
//  ------+----------------
//  1  1  |  1    NULL   1
//  1  2  | NULL  NULL  NULL
//  2  1  | NULL  NULL  NULL
//
// Actually it's possible, the lax FD {a} -> {c} can be derived but not that useful. we only
// maintain the {c} ~> {a} for existence after outer join. Besides, there two Cond-FD should
// be preserved waiting for be visible again once with the null-reject on the condition of
// null constraint columns. (see below)
//
// NOTE 2.
// When handle outer join, it won't produce lax FD with duplicate definite determinant values and
// different dependency values.
//
// In implementation，we come across some lax FD dependent on null-reject of some other cols. For
// example.
//   t(a,b) left join t1(c,d,e) on t.a = t1.c and b=1
//  a  b  |  c     d     e
//  ------+----------------
//  1  1  |  1    NULL   1
//  1  2  | NULL  NULL  NULL
//  2  1  | NULL  NULL  NULL
//
// here constant FD {} -> {b} won't be existed after the outer join is done. Notice null-constraint
// {c,d,e} -| {c,d,e}, this FD should be preserved and will be visible again when some null-reject
// predicate take effect on the null-constraint cols.
//
// It's same for strict equivalence {t.a} = {t1.c}. Notice there are no lax equivalence here, because
// left side couldn't be guaranteed to be definite or null. like a=2 here. Let's collect all of this
// on-condition FD down, correspondent with a null-constraints column set, name it as Cond-FD.
//
// lax equivalencies are theoretically possible, but it won't be constructed from an outer join unless
// t already has a constant FD in column `a` here before outer join take a run. So the lax equivalence
// has some pre-conditions as you see, and it couldn't cover the case shown above. Let us do it like a
// Cond-FD does.
//
// The FD constructed from the join predicate should be considered as Cond-FD. Here like equivalence of
// {a} == {c} and constant FD {b} = 1 (if the join condition is e=1, it's here too). We can say that for
// every matched row, this FDs is valid, while for the other rows, the inner side are supplied of null
// rows. So this FDs are stored as ncEdges with nc condition of all inner table cols.
//
// We introduced invisible FD with null-constraint column to solve the problem above named as Cond-FD.
// For multi embedded left join, we take the following case as an example.
//    a,b         c,d,e
// 	-----------+-----------
//   1    2    |    1  1  1
// 	 2    2    |
//  -----------+-----------
//
//  left join on (a=c) res:
//   a   b    c     e     e
//  -------------------------
//   1   2    1     1     1
//   2   2 +- null null  null -+
//         |                   |
//         +-------------------+
//                              \
//                               \
//  the Cond-FD are < a=c with {c,d,e} > the latter is as null constraint cols
//
//    e,f
//  -----------------------
//   1   2
//   2   2
//   3   3
//  -----------------------
//
//  left join on (e=a) res:
//   e   f    a     b      c    d    e
//  -----------------------------------
//   1   2    1     2      1    1    1
//   2   2    2     2  +- null null null --+---------------> Cond-FD are <a=c with {c,d,e}> still exists.
//   3   3 +-null null |  null null null   |---+
//         |           +-------------------+   |
//         +-----------------------------------+-----------> New Cond-FD are <e=a with {a,b,c,d,e}> occurs.
//
//
// the old Cond-FD with null constraint columns set {c,d,e} is preserved cause new appended cols are all null too.
// the new Cond-FD with null constraint columns set {a,b,c,d,e} are also meaningful, even if the null-reject column
// is one of {c,d,e} which may reduce one of the matched row out of the result, the equivalence {a}={e} still exist.
//
// Provide that the result of the first left join is like:
//  left join on (a=c) res:
//   a     b    c     e     e
//  ---------------------------
//   1     2    1     1     1
//   null  2  null  null  null
//
//  THEN: left join on (e=a) res:
//   e   f    a     b     c    d    e
//  ---------------------------------
//   1   2    1     2     1    1    1
//   2   2    null null null null null
//   3   3    null null null null null
//
//  Even like that, the case of old Cond-FD and new Cond-FD are existed too. Seems the null-constraint column set of
//  old Cond-FD {c,d,e} can be expanded as {a,b,c,d,e} visually, but we couldn't derive the inference of the join predicate
//  (e=a). The null-reject of column `a` couldn't bring the visibility to the old Cond-FD theoretically, it just happened
//  to refuse that row with a null value in column a.
//
// Think about adding one more row in first left join result.
//
//  left join on (a=c) res:
//   a     b    c     e     e
//  ---------------------------
//   1     2    1     1     1
//   null  2  null  null  null
//   3     3  null  null  null
//
//  THEN: left join on (e=a) res:
//   e   f    a     b     c    d    e
//  ---------------------------------
//   1   2    1     2     1    1    1
//   2   2    null null null null null
//   3   3    3     3   null null null
//
//  Conclusion:
//  As you see that's right we couldn't derive the inference of the join predicate (e=a) to expand old Cond-FD's nc
//  {c,d,e} as {a,b,c,d,e}. So the rule for Cond-FD is quite simple, just keep the old ncEdge from right, appending
//  the new ncEdges in current left join.
//
//  If the first left join result is in the outer side of the second left join, just keep the ncEdge from left as well,
//  appending the new ncEdges in current left join.
//
//  For a inner join, both side of the join result won't be appended with null-supplied rows, so we can simply collect
//  the ncEdges from both join side together.