The result sets of SELECTs that use functions whose results always
vary, or whose results depend upon the user executing the query,
should not be cached. The list of such functions is the same as that
specified for the MariaDB query cache:
https://mariadb.com/kb/en/mariadb/query-cache/
If user or system variables are used in a SELECT statement, the
result will not be cached. This ensures that a wrong result will
not be returned.
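For example, statements along these lines would not be cached;
NOW() and CURRENT_USER() are among the functions listed on the page
above:

    SELECT NOW();              -- result varies with time
    SELECT CURRENT_USER();     -- result depends on the executing user
    SELECT @my_var;            -- uses a user variable
    SELECT @@max_connections;  -- uses a system variable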
As before, the cache will be used if there is no ongoing transaction
(which includes autocommit being on), or if the ongoing transaction
is explicitly read-only.
In addition, the cache will be used and populated during any other
transaction as long as only pure read statements are executed. After
the first non-read statement, the use of the cache is disabled.
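To illustrate:

    SELECT * FROM t;               -- autocommit, no transaction:
                                   -- the cache may be used
    START TRANSACTION READ ONLY;
    SELECT * FROM t;               -- explicitly read-only:
                                   -- the cache may be used
    COMMIT;
    START TRANSACTION;
    SELECT * FROM t;               -- only reads so far:
                                   -- the cache may be used
    UPDATE t SET a = 1;            -- first non-read statement
    SELECT * FROM t;               -- the cache is no longer used
    COMMIT;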
- Add note about filters having become available in 2.1.
- Add insertstream to release notes and change log.
- Add complete example to cache documentation.
- Hard TTL; the maximum time a value will be used from the cache.
- Soft TTL; the time after which the cached value should be updated
  from the server.
So as not to fetch the same value multiple times unnecessarily, when
the soft TTL has been reached the value is updated for the first
client only, while all other clients continue to use the stale value
until it has been updated.
With distinct soft and hard TTLs there is thus a definite upper bound
on how old a used value can be.
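For illustration, the TTLs would be specified in the filter section
of the MaxScale configuration; the values here are examples only:

    [Cache]
    type=filter
    module=cache
    # A value older than 30 seconds is refreshed for the first client
    # that requests it; other clients keep getting the stale value.
    soft_ttl=30
    # A value older than 60 seconds is never used.
    hard_ttl=60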
0 is now the default for all cache configuration parameters, and in
all cases the meaning is the same: no limit. Internally, for the sake
of consistency, all limits but the TTLs are now 64-bit.
The maximum count and maximum size of the cache can now be
specified, and a storage can declare what capabilities it has.
If a storage module cannot enforce the maximum count or maximum
size limits, the storage is decorated with an LRU storage that
can.
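The decoration idea in a nutshell; this is a minimal C++ sketch with
an illustrative interface, not the actual storage API:

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Illustrative stand-in for the storage interface.
    struct Storage
    {
        virtual ~Storage() = default;
        virtual bool get(const std::string& key, std::vector<uint8_t>* pValue) = 0;
        virtual void put(const std::string& key, std::vector<uint8_t> value) = 0;
        virtual void del(const std::string& key) = 0;
    };

    // Decorates a storage that cannot enforce the limits itself with
    // LRU eviction on both entry count and total size; 0 means no
    // limit, matching the configuration defaults.
    class LRUStorage : public Storage
    {
    public:
        LRUStorage(std::unique_ptr<Storage> sStorage,
                   uint64_t max_count, uint64_t max_size)
            : m_sStorage(std::move(sStorage))
            , m_max_count(max_count)
            , m_max_size(max_size)
        {
        }

        bool get(const std::string& key, std::vector<uint8_t>* pValue) override
        {
            auto it = m_items.find(key);
            if (it == m_items.end() || !m_sStorage->get(key, pValue))
            {
                return false;
            }
            // Mark the entry as the most recently used one.
            m_lru.splice(m_lru.begin(), m_lru, it->second.position);
            return true;
        }

        void put(const std::string& key, std::vector<uint8_t> value) override
        {
            del(key);   // Replace any existing entry.
            m_lru.push_front(key);
            m_items[key] = Item{m_lru.begin(), value.size()};
            m_size += value.size();
            m_sStorage->put(key, std::move(value));
            evict();
        }

        void del(const std::string& key) override
        {
            auto it = m_items.find(key);
            if (it != m_items.end())
            {
                m_sStorage->del(key);
                m_size -= it->second.size;
                m_lru.erase(it->second.position);
                m_items.erase(it);
            }
        }

    private:
        struct Item
        {
            std::list<std::string>::iterator position;
            size_t size;
        };

        void evict()
        {
            // Drop least recently used entries until both limits hold.
            while ((m_max_count != 0 && m_items.size() > m_max_count)
                   || (m_max_size != 0 && m_size > m_max_size))
            {
                del(m_lru.back());
            }
        }

        std::unique_ptr<Storage> m_sStorage;
        uint64_t m_max_count;
        uint64_t m_max_size;
        uint64_t m_size = 0;
        std::list<std::string> m_lru;
        std::unordered_map<std::string, Item> m_items;
    };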
Since it's a cache, we do not need to retain the data from one
MaxScale invocation to the next. Consequently, we can turn off
write ahead logging completely if we simply wipe the RocksDB
database at each startup. In addition, the location of the
cache directory can now be specified explicitly, so it can be
placed, for instance, on a RAM disk.
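For example, assuming the RocksDB storage module and a
cache_directory option for it (names here are illustrative), the
configuration could look like:

    [Cache]
    type=filter
    module=cache
    storage=storage_rocksdb
    # Place the RocksDB database on a RAM disk.
    storage_options=cache_directory=/mnt/ramdisk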
With the advent of qc_get_field_info, columns can now be matched.
However, there is still some non-determinism caused by the table
information not containing contextual information (exactly where
the table is used).
Further, suppose table X contains the column a and table Z contains
the column b; then, given a statement like
SELECT a, b FROM X, Z;
we cannot know whether a is in X or Z, or b in X or Z, without being
aware of the schema, which we currently are not.
Consequently, as long as MaxScale is not aware of the schema, some
heuristics must be applied. For instance, if exactly one table is
referred to, then we can assume that columns that are not explicitly
qualified are from that table.
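For example:

    SELECT a FROM X;            -- only X is referred to, so a is
                                -- assumed to be a column of X
    SELECT X.a, Z.b FROM X, Z;  -- explicit qualification removes
                                -- the ambiguity entirely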
The rule tests are currently rather rudimentary and need to be
expanded.
The concept of 'allowed_references' was removed from the
documentation and the code. Now that COM_INIT_DB is tracked,
we will always know what the default database is and hence
can create a cache key that distinguishes between identical
queries targeting different default databases (not yet
implemented in this change).
The rules for the cache are expressed using a JSON object.
There are two decisions to be made: when to store data to the
cache and when to use data from the cache. The latter is
obviously dependent on the former.
In this change, the 'store' handling is implemented; 'use'
handling will be in a subsequent change.
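A sketch of the intended shape of such a rules object; the concrete
attributes shown here are illustrative:

    {
        "store": [
            {
                "attribute": "table",
                "op": "=",
                "value": "db.tbl"
            }
        ],
        "use": [
            {
                "attribute": "user",
                "op": "=",
                "value": "'bob'@'%'"
            }
        ]
    }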
When a query has been sent to a backend, the response is now
processed to the extent that the cache is capable of figuring
out how many rows are being returned, so that the cache setting
`max_resultset_rows` can be processed.
The code is now also written in such a manner that it should be
insensitive to how a packet has been split up into a chain of
GWBUFs.
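A minimal sketch of the technique, with plain byte chunks standing in
for GWBUFs: the parsing state is carried over from one chunk to the
next, so a MySQL packet header or payload split at any point is
handled, and counting complete packets is what makes counting
returned rows possible. This is a hypothetical helper, not the actual
cache code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // Counts complete MySQL packets in a byte stream that may arrive
    // in arbitrarily split chunks.
    struct PacketCounter
    {
        size_t npackets = 0;   // complete packets seen so far

        void feed(const uint8_t* data, size_t len)
        {
            size_t i = 0;
            while (i < len)
            {
                if (m_remaining > 0)
                {
                    // Inside a payload; consume what this chunk has.
                    size_t n = std::min(m_remaining, len - i);
                    m_remaining -= n;
                    i += n;
                    if (m_remaining == 0)
                    {
                        ++npackets;
                    }
                }
                else
                {
                    // Collect the 4-byte header (3-byte little-endian
                    // payload length + 1-byte sequence number), which
                    // may itself be split across chunks.
                    m_header[m_nheader++] = data[i++];
                    if (m_nheader == 4)
                    {
                        m_remaining = m_header[0]
                            | (m_header[1] << 8)
                            | (m_header[2] << 16);
                        m_nheader = 0;
                        if (m_remaining == 0)
                        {
                            ++npackets;   // packet with empty payload
                        }
                    }
                }
            }
        }

    private:
        size_t m_remaining = 0;   // payload bytes still to consume
        uint8_t m_header[4];
        size_t m_nheader = 0;     // header bytes collected so far
    };

Each buffer in the chain would then be fed in turn, e.g. with
counter.feed(GWBUF_DATA(buf), GWBUF_LENGTH(buf)).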
The cache filter consists of two separate components: the cache
itself, which evaluates whether a particular query is subject to
caching, and the actual cache storage. The storage is loaded at
runtime by the cache filter, currently using a custom mechanism;
once the new plugin loading macros/mechanism is in place, I'll see
if that can be used.
There are a few open questions/issues.
- Can a GWBUF delivered to the filter contain more than one MySQL
packet? If yes, then some queueing mechanism needs to be
introduced. Currently the code is written so that the packets
are processed in a loop, which will not work.
- Currently, the storage API is synchronous. That may work with a
storage built upon RocksDB, which writes asynchronously to disk,
but not with memcached, which can be (and in MaxScale's case
would have to be) used asynchronously.
Reading may be problematic with RocksDB, as values are returned
synchronously, which will stall the calling thread. However, as
RocksDB uses in-memory caching and it is possible to arrange,
for instance, for selects targeting the same table to be stored
together, it is not obvious what the impact would be.
So as not to block the MaxScale worker threads, there would have
to be a separate thread pool for RocksDB access, with the data
then moved across (a sketch follows after this list).
But initially, the interface is synchronous.
- How is the cache configured? The original requirement mentions
all sorts of parameters - database name, table name, column name,
presence of a WHERE clause, regexp, date/time of query, user -
but it is not altogether clear exactly how they should be specified
and how they should interact. So initially all SELECTs will
be subject to caching with a TTL.
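As referenced in the storage bullet above, a minimal sketch of
driving a blocking lookup from another thread so that the calling
worker thread is not stalled; all names are hypothetical:

    #include <cstdint>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    using Value = std::vector<uint8_t>;

    // Stand-in for a storage whose get() blocks, e.g. one built on
    // RocksDB.
    struct SyncStorage
    {
        bool get(const std::string& key, Value* pValue)
        {
            (void)key;
            *pValue = Value{'v'};   // pretend the value was found
            return true;
        }
    };

    // Runs the blocking lookup on another thread and delivers the
    // result through a callback. A real implementation would use a
    // dedicated thread pool owning the storage, and would marshal
    // the callback back to the worker thread that issued the request.
    void async_get(SyncStorage& storage, std::string key,
                   std::function<void(bool, Value)> cb)
    {
        std::thread([&storage, key = std::move(key), cb = std::move(cb)]() {
            Value value;
            bool found = storage.get(key, &value);
            cb(found, std::move(value));
        }).detach();
    }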