158 Commits

Author SHA1 Message Date
01b1d469a8 MXS-2435 Handle recoverable Clustrix errors
If
- transaction replay is enabled,
- an error is returned and
- the error is one of the recoverable Clustrix errors
we will retry the transaction.

If it succeeds, the client will not notice anything except
a short delay.

Note that the error message is looked for irrespective of whether
the backend is Clustrix or not. However, as errors are not common
the price for doing that can probably be ignored.

However, a bigger problem is that explicit knowledge of different
backends should *not* be coded into routers.
2019-04-26 10:54:57 +03:00
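
The three conditions listed in the commit above can be read as a single predicate. The following is only a minimal sketch of that reading, not the router's actual code: `ReplyInfo`, `should_retry_transaction` and the numeric error codes are hypothetical names introduced for illustration, and the commit itself matches on the error message rather than on a code set.

```cpp
#include <set>

// Hypothetical summary of a backend reply; not a MaxScale type.
struct ReplyInfo
{
    bool is_error;    // true if the backend returned an error packet
    int  error_code;  // placeholder for whatever identifies the error
};

// Retry only when all three conditions from the commit message hold:
// replay is enabled, an error was returned, and the error is recoverable.
bool should_retry_transaction(bool trx_replay_enabled,
                              const ReplyInfo& reply,
                              const std::set<int>& recoverable_errors)
{
    return trx_replay_enabled
           && reply.is_error
           && recoverable_errors.count(reply.error_code) > 0;
}
```

Because the check runs only when an error has already been returned, its cost on the normal path is limited to the first two boolean tests, which matches the commit's remark that the price can probably be ignored.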
c643f9bc8d Merge branch '2.3' into develop 2019-04-12 13:23:49 +03:00
ec890b33cd Prevent checksum mismatch on second trx replay
If a transaction replay has to be executed twice due to a failure of the
original candidate master, the query queue could contain replayed
queries. The replayed queries would be placed into the queue if a new
connection needs to be created before the transaction replay can start.
2019-04-05 13:33:16 +03:00
6421af1bb4 Backport query queue changes to 2.3
Backported the changes that convert the query queue in readwritesplit into
a proper queue. This change combines both
5e3198f8313b7bb33df386eb35986bfae1db94a3 and
6042a53cb31046b1100743723567906c5d8208e2 into one commit.
2019-04-05 13:33:16 +03:00
2aa3515fc8 Merge commit '09cb4a885f88d30b5108d215dcdaa5163229a230' into develop 2019-04-04 14:34:17 +03:00
a217dde1f0 MXS-2419: Queue queries executed during trx replay
By storing the queries in the query queue and routing it once the
transaction replay is done, we prevent two problems:

* Multiple transaction replays would overwrite the m_interrupted_query
  buffer that was used to store any queries executed during the
  transaction replay.

* Incorrect ordering of queries when the query queue is not empty and a
  new query is executed during transaction replay.
2019-04-03 12:57:05 +03:00
5242cd5ebf Readwritesplit: Graceful maintenance mode
Allowing transactions on the master to end even if the server is in
maintenance mode makes it possible to terminate connections at a known
point. This helps prevent interrupted transactions, which in turn reduces
the errors that are visible to the clients.
2019-04-02 14:21:54 +03:00
74eeb64fba Don't close connections to servers being drained
The connections to servers being drained should not be closed like they
should be for servers in maintenance mode. The change in functionality
between 2.3 and develop caused the connections to be discarded if the
server was in either maintenance or drain mode.
2019-03-21 18:19:10 +02:00
9bc721afb6 Merge commit '11ee74bad327e7fb15e8388d20e7838b9e49cadf' into 2.3 2019-03-21 17:52:42 +02:00
6042a53cb3 Replace raw GWBUF pointers with mxs::Buffer
Now that the query queue is stored in an actual container, it is only
logical to use mxs::Buffer instead of GWBUF as the stored type.
2019-03-18 13:18:52 +02:00
5e3198f831 Replace the plain GWBUF query queue with std::deque
Using a std::deque to store the queries retains the exact state of the
object thus removing the need to parse the query again. It also removes
the need to split the queue into individual packets which makes the code
cleaner.
2019-03-18 13:18:52 +02:00
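
As an illustration of the idea in the commit above, a queue of individual objects lets each queued query carry its own state. This is a minimal sketch under assumptions: `QueuedQuery` and `QueryQueue` are hypothetical stand-ins, and the real code stores buffer objects rather than strings.

```cpp
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a buffer that keeps its classification with it.
struct QueuedQuery
{
    std::string sql;
    bool        is_write;  // classification computed once, before queuing
};

class QueryQueue
{
public:
    void push(QueuedQuery q) { m_queue.push_back(std::move(q)); }

    bool empty() const { return m_queue.empty(); }

    // Precondition: !empty(). Returns the oldest query with its state intact,
    // so it does not have to be parsed again or split out of a shared buffer.
    QueuedQuery pop()
    {
        QueuedQuery q = std::move(m_queue.front());
        m_queue.pop_front();
        return q;
    }

private:
    std::deque<QueuedQuery> m_queue;
};
```

Each element is one complete query, so FIFO order and per-query metadata are preserved without any packet splitting.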
0001babd26 Clean up readwritesplit routing functions
Moved the more verbose parts of the routing code into subfunctions and
arranged it so that more relevant parts are closer to each other. Also
added the SQL statement that is being delayed to the message.
2019-03-18 13:18:52 +02:00
4bf9fa872c MXS-2313: Use servers of same rank in readwritesplit
When a readwritesplit session has a connection to a master server, servers
of the same rank as the master are used. If no master connection is
available, the server with the highest rank among all connected servers is
used. If there are no open connections, the server with the best rank is
chosen and a connection to it is made.

Connections whose rank differs from the current rank of the session will
be discarded. This reduces the use of servers with different ranks when
the master server of a session fails. Without the
active pruning of connections, slave connections to primary clusters
without masters would remain in use even after the primary master
fails. This guarantees full switchover to a secondary cluster if a master
change occurs.
2019-03-18 13:12:59 +02:00
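
The rank selection and pruning rules described in the commit above could be sketched roughly as follows. All names here (`Backend`, `pick_session_rank`, `discard_other_ranks`) are hypothetical, and the sketch assumes that a numerically lower rank value is the better one, which the commit message does not state.

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical view of a backend; not the router's real class.
struct Backend
{
    int  rank;       // assumption: lower value means better rank
    bool connected;
    bool is_master;
};

// Chooses the rank a session should use, following the three cases in order.
// Returns INT_MAX if the list is empty.
int pick_session_rank(const std::vector<Backend>& backends)
{
    // 1. A connected master fixes the rank.
    for (const auto& b : backends)
    {
        if (b.connected && b.is_master)
        {
            return b.rank;
        }
    }

    // 2. Otherwise use the best rank among the connected servers.
    int best = INT_MAX;
    for (const auto& b : backends)
    {
        if (b.connected)
        {
            best = std::min(best, b.rank);
        }
    }
    if (best != INT_MAX)
    {
        return best;
    }

    // 3. No open connections: use the best rank among all servers.
    for (const auto& b : backends)
    {
        best = std::min(best, b.rank);
    }
    return best;
}

// Closes every connection whose rank differs from the session rank and
// returns how many were closed.
std::size_t discard_other_ranks(std::vector<Backend>& backends, int session_rank)
{
    std::size_t closed = 0;
    for (auto& b : backends)
    {
        if (b.connected && b.rank != session_rank)
        {
            b.connected = false;
            ++closed;
        }
    }
    return closed;
}
```

Pruning after every rank decision is what keeps stale slave connections of another rank from lingering once the session has moved on.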
ba448cb12c MXS-2313: Clean up readwritesplit connection creation
The connection creation is now internal to RWSplitSession. This makes the
code more readable by removing the need to pass parameters and allowing
easier reuse of existing functions. The various conditions required to
create connections are now also checked in only one place.
2019-03-18 13:12:58 +02:00
4dda31ffe3 Merge branch '2.2' into 2.3 2019-03-16 09:30:56 +02:00
995c890664 Fix uninitialized pointers in readwritesplit 2019-03-15 15:41:39 +02:00
667a9f1c6f Merge branch '2.3' into develop 2019-03-15 12:31:08 +02:00
09dc92973e Discard connections as the last step
The discarding of connections in maintenance mode must be done after any
results have been written to them. This prevents closing of the connection
before the actual result is returned.
2019-03-14 12:15:30 +02:00
b537176248 Fix parsing of non-query packets
Packets that do not contain SQL should not be parsed.
2019-03-13 15:44:02 +02:00
1c3a5bda83 Merge branch '2.3' into develop 2019-03-11 12:29:56 +02:00
710e5df27b MXS-2365: Fix classification of queued queries
Queries in the query queue need to be explicitly parsed since they are
stored in a single buffer and thus share the query classification
information. In the next major version this should be changed into an
array of individual buffers instead of a shared buffer.
2019-03-08 14:45:18 +02:00
24ea222ed6 MXS-2350: Allow lazy connection creation
The lazy connection creation reduces the burden that short sessions place
on the backend servers. This also prevents the problems caused by early
disconnections that happen when only one server is used but multiple
connections are created. This does not solve the problem (MXS-619) but it
does mitigate it to acceptable levels.

This commit also adds a change to the weighting algorithm that prefers
existing connections over unopened ones. This helps avoid the
flip-flopping that happens when the absolute scores are very similar. The
hard-coded value might need to be tuned once testing is done.
2019-03-08 08:20:44 +02:00
95317725ce Merge branch '2.3' into develop 2019-03-07 16:21:03 +02:00
b97976c4ee MXS-2323: Close stale connections
Cleaning up and closing stale connections to servers in maintenance mode
helps administrators see when a server is no longer in use.
2019-03-07 15:59:26 +02:00
6038f1f386 Merge branch '2.3' into develop 2019-02-01 13:55:54 +02:00
24c9b62a2f Add verbose logging for session command failures
If the routing of a session command fails due to problems with the backend
connections, a more verbose error message is logged. The added status
information in the Backend class makes tracking the original cause of the
problem a lot easier due to knowing where, when and why the connection was
closed.
2019-01-31 14:23:26 +02:00
a3fa2f8111 Merge branch '2.3' into develop 2019-01-16 16:31:14 +02:00
021d48f94c Log low-level reason and idle time on master failure
If the connection to the master is lost, knowing what type of an error
caused the call to handleError helps deduce what was the real reason for
it. Logging the idle time of the connection helps detect when the
wait_timeout of a connection is exceeded.
2019-01-16 09:43:49 +02:00
7cac2c009d Merge branch '2.3' into develop 2019-01-10 12:43:46 +02:00
9cac927542 MXS-2220 Move server response calculation functions inside class 2019-01-10 10:26:53 +02:00
147f0bb656 Extend master failure error message
The error now describes the failure mode in more detail. This should make
post mortem analysis of failed connections a lot easier.
2019-01-09 20:05:38 +02:00
f0f9c21d1c Merge branch '2.3' into develop 2019-01-07 10:54:42 +02:00
40485d746c MXS-2220 Change server name to constant string 2019-01-03 12:13:15 +02:00
9adbd2f8f0 Cache the local server statistics object
By storing the server statistics object inside the session, the lookup
involved in getting a worker-local value is avoided. Since the lookup is
done multiple times for a single query, it is beneficial to store it in
the session.

As the worker-local value is never deleted, it is safe to store a
reference to it in the session. It is also never updated concurrently so
no atomic operations are necessary.
2019-01-03 09:37:59 +02:00
1fa3b133c7 Make keepalive ping checks more efficient
The code now only checks the need for a keepalive ping once every
keepalive interval. Reduced the number of mxs_clock calls to one so that
all servers use the same value.
2019-01-03 09:37:59 +02:00
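
One way to picture the keepalive change above is a checker that returns early unless a full interval has passed and that applies a single clock reading to every connection. This is a sketch under assumptions: `Conn` and `KeepaliveChecker` are hypothetical, and the caller is assumed to read the clock once and pass the value in as `now`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-connection state.
struct Conn
{
    std::int64_t last_activity;  // clock value of the last read or write
    int          pings_sent;
};

class KeepaliveChecker
{
public:
    explicit KeepaliveChecker(std::int64_t interval)
        : m_interval(interval)
        , m_last_check(0)
    {
    }

    // `now` is a single clock reading shared by all connections.
    // Returns the number of connections that were pinged.
    int tick(std::int64_t now, std::vector<Conn>& conns)
    {
        if (now - m_last_check < m_interval)
        {
            return 0;  // checked recently enough, skip the loop entirely
        }
        m_last_check = now;

        int pinged = 0;
        for (auto& c : conns)
        {
            if (now - c.last_activity >= m_interval)
            {
                ++c.pings_sent;  // stand-in for sending a real ping
                c.last_activity = now;
                ++pinged;
            }
        }
        return pinged;
    }

private:
    std::int64_t m_interval;
    std::int64_t m_last_check;
};
```

The early return keeps the per-query cost to one subtraction and one comparison for most calls.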
4d0a40ef9f Add missing pointer initialization
The change from SRWBackend to RWBackend* had some side effects, namely the
missing automatic initialization into zero values.
2018-12-28 08:19:23 +02:00
20fe9b9dca MXS-2196: Rename session states
Minor renaming of the session state enum values. Also exposed the session
state stringification function in the public header and removed the
stringification macro.
2018-12-13 13:27:45 +02:00
48efa6d027 MXS-2213: Clear stored PS information
The information stored for each prepared statement would not be cleared
until the end of the session. This is a problem if the sessions last for a
very long time as the stored information is unused once a COM_STMT_CLOSE
has been received.

In addition to this, the session command response maps were not cleared
correctly if all backends had processed all session commands.
2018-12-11 13:54:10 +02:00
77477d9648 MXS-2196: Rename dcb_role_t to DCB::Role 2018-12-05 15:30:44 +02:00
0d09b56f58 MXS-2025 RWBackends as a vector of unique_ptr:s
For lifetime management keep RWBackends in a vector of unique_ptrs.
RWSplitSession keeps the unique_ptrs very private, and provides a vector
of plain pointers for all other interfaces.
2018-12-05 10:23:57 +02:00
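
The ownership split described in the commit above might look roughly like this. The sketch is illustrative only: `Session` is a hypothetical stand-in for the real session class, and `RWBackend` is reduced to a name so that the example stays self-contained.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the real backend class.
struct RWBackend
{
    explicit RWBackend(std::string n)
        : name(std::move(n))
    {
    }

    std::string name;
};

using PRWBackends = std::vector<RWBackend*>;

class Session
{
public:
    explicit Session(const std::vector<std::string>& names)
    {
        for (const auto& n : names)
        {
            // The unique_ptr owns the object; the raw pointer only observes it.
            m_owned.push_back(std::unique_ptr<RWBackend>(new RWBackend(n)));
            m_raw.push_back(m_owned.back().get());
        }
    }

    // Other interfaces only ever see the non-owning pointers.
    const PRWBackends& backends() const { return m_raw; }

private:
    std::vector<std::unique_ptr<RWBackend>> m_owned;  // lifetime management
    PRWBackends                             m_raw;    // non-owning view
};
```

The raw pointers stay valid for as long as the session lives, because the objects are heap-allocated and do not move when the owning vector grows.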
20b62a3f3d MXS-2025 Change RWBackend usage to a vector of raw ptrs.
This is essentially just a search and replace to change SRWBackend to
RWBackend* and SRWBackendList to PRWBackends, a vector of raw
pointers. In the next few commits vector<unique_ptr<RWBackend>>
will be used for lifetime management.

There are a lot of diffs from the global search and replace. Only a few manual
edits had to be done.

list-src -x build | xargs sed -ri 's/SRWBackends/prwbackends/g'
list-src -x build | xargs sed -ri 's/const mxs::SRWBackend\&/const mxs::RWBackend\*/g'
list-src -x build | xargs sed -ri 's/const SRWBackend\&/const RWBackend\*/g'
list-src -x build | xargs sed -ri 's/mxs::SRWBackend\&/mxs::RWBackend\*/g'
list-src -x build | xargs sed -ri 's/mxs::SRWBackend/mxs::RWBackend\*/g'
list-src -x build | xargs sed -ri 's/SRWBackend\(\)/nullptr/g'
list-src -x build | xargs sed -ri 's/mxs::SRWBackend\&/mxs::RWBackend\*/g'
list-src -x build | xargs sed -ri 's/mxs::SRWBackend/mxs::RWBackend\*/g'
list-src -x build | xargs sed -ri 's/SRWBackend\&/RWBackend\*/g'
list-src -x build | xargs sed -ri 's/SRWBackend\b/RWBackend\*/g'
list-src -x build | xargs sed -ri 's/prwbackends/PRWBackends/g'
2018-12-05 10:23:57 +02:00
d96a7dedc5 MXS-2205 Convert maxscale/poll.h to .hh 2018-12-04 14:51:02 +02:00
da83551493 MXS-2189: Prevent unwanted trx replay
When a transaction is being executed on a slave and the master fails, the
transaction replay would start.
2018-11-27 12:52:45 +02:00
1abcbd64bd MXS-2187: Allow multiple transaction retries
By resetting the replay state the transaction replay can start again on a
new server. This allows the replay process to work when a master server is
shutting down.
2018-11-27 12:52:44 +02:00
e6325d39fb Delay initial transaction replay
By delaying the replay for a second, we give the monitor a small chance to
adapt to master failures. It'll also prevent rapid re-querying if multiple
transaction replays are supported.
2018-11-27 12:52:44 +02:00
851793cb86 Fix transaction replay debug assertion
A transaction that just completed will go through the start_trx_replay
function as from the client protocol's point of view the transaction is
still open. The debug assertion did not take this into account and would
fail if a successful commit was the last thing done on a master that failed.

Also fixed the formatting.
2018-11-27 12:52:44 +02:00
7bf5c07835 Ignore errors sent by servers in shutdown
When a server is stopping, it'll send an error to the client before
terminating the TCP connection. The code in readwritesplit would detect
this error and create a hangup event on the DCB. This would cause it to
appear as if the TCP connection was broken and the router would
immediately try to reconnect to the same server.

By ignoring the error and allowing the connection to die on its own, we
avoid immediately reconnecting and retrying any transactions on the
stopping server. This increases the chances that the monitor will see it
first and assign the server states correctly before the transaction replay
is attempted.
2018-11-26 09:42:12 +02:00
925670ae2f Fix false master failure log message
The message would be logged even if the session continues.
2018-11-26 09:42:11 +02:00
cab8a4bde8 MXS-2144: Treat server shutdown as a network error
If the server where a query is being executed is shutting down,
readwritesplit should treat it as an error to make retrying of the query
possible.

By treating server shutdowns as network errors, the same code path that is
used for actual network errors can be taken. This removes the need for any
extra retrying logic for this particular case.
2018-11-14 16:23:47 +02:00
c32bb18862 Fix transaction replay checksum mismatches
The transaction replay could get mixed up with new queries if the client
managed to perform one while the delayed routing was taking place. A
proper way to solve this would be to cork the client DCB until the
transaction is fully replayed. As this change would be relatively more
complex than simply labeling the queries that are being retried, the
corking implementation is left for later, when a more complete solution can
be designed.

This commit also adds some of the missing info logging for the transaction
replaying which makes analysis of failures easier.
2018-11-13 16:48:03 +02:00