Responses generated by replayed session commands must not be treated as
actual responses to retained statements. In 2.5 this is not a problem as
it is done implicitly with the pre-assignment of the server that delivers
the session command response.
This could happen if a session command triggers a master reconnection and
the connection fails while the history replay is ongoing. The code assumed
that history replay would only happen when a query was in the query queue.
It appears that rollback errors are possible outside of
transactions. Since this was not something we expected to see, logging it
as an error allows us to see why this happens in production deployments.
The errors that are ignored by readwritesplit are now stored as the
current close reason in the Backend. This allows the information about the
error to be retained and it can be used later in the error handler to
display the true reason why the connection was closed.
This allows the same verbose information to be logged in the cases where
it is of use. Mostly this information can be used to figure out why a
certain session was closed.
By doing the reconnection only when a new query arrives, we prevent the
excessive reconnecting that is done when a server's actual and monitored
states are in conflict.
If a master failed during an ongoing session command history replay, it
would be treated as if a normal session command failed which would result
in the already executed session command being re-executed on all servers
at the wrong logical position.
To fix this, the history replay must be distinguished from normal session
command execution. When a connection replaying the history fails, the
query routing simply needs to be attempted again.
When a query returns a WSREP error, most of the time it is not something
the client application is expecting. To prevent this from affecting the
client, it can be treated the same way a transaction rollback is treated:
ignore the error and try again.
The expected response counter was not decremented if a transaction replay
was started. This caused the connections to hang which in turn caused the
failure of the mxs1507_trx_stress test case.
The assertion in routeQuery that expects there to be at least one ongoing
query would be triggered if a query was received after a master had failed
but before the session would close. To make sure the internal logic stays
consistent, the error handler should only decrement the expected response
count if the session can continue.
This could end up in infinite mutual recursion if no responses are
expected. Although this does not happen now that MXS-2587 is fixed, the
code should not even be there.
If a transaction replay fails, no queries must be routed before the
connection is closed. This could happen if the client received the error
from the replay failure and closes the connection before the fake hangup
generated by the replay failure is processed.