Using delayed_call rather than usleep. This caused a fair amount of changes to
the timing (or delaying) aspects. Also some other small changes; more config
and all durations in milliseconds.
There's now double bookkeeping:
- All delayed calls are in a map whose key is the next
invocation time. Since it's a map (and not an unordered_map)
it's sorted just the way we want to have it.
- In addition, there's an unordered set of delayed calls for each tag.
With this arrangement we can easily invoke the delayed calls
in the right order and be able to efficiently remove all
delayed calls related to a particular tag.
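For illustration, the bookkeeping could look roughly as follows (the container
choices and names are a sketch, not the actual member declarations):

#include <map>
#include <unordered_map>
#include <unordered_set>
#include <cstdint>

class DelayedCall;  // holds the callback, tag and next invocation time

// Delayed calls ordered by their next invocation time; a multimap is used in
// this sketch so that several calls may share the same time point.
std::multimap<int64_t, DelayedCall*> calls_by_time;

// For each tag, the set of delayed calls registered with that tag, so that
// all calls of a tag can be removed efficiently when the tag is canceled.
std::unordered_map<void*, std::unordered_set<DelayedCall*>> calls_by_tag;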
When canceling, a DelayedCall instance must be removed from the
collection holding all delayed calls. Consequently, a priority_queue
cannot be used as it 1) does not provide access to the underlying
collection and 2) the underlying collection (vector or deque)
is a bad choice if items in the middle need to be removed.
The interface for canceling calls is now geared towards the needs
of sessions. Basically the idea is as follows:
class MyFilterSession : public maxscale::FilterSession
{
    ...

    int routeQuery(GWBUF* pPacket)
    {
        ...
        if (needs_to_be_delayed())
        {
            // Schedule delayed_routeQuery() to be called with pPacket
            // after 5000ms, using this session as the tag.
            Worker* pWorker = Worker::current();
            void* pTag = this;
            pWorker->delayed_call(5000, pTag, this,
                                  &MyFilterSession::delayed_routeQuery,
                                  pPacket);
            return 1;
        }
        ...
    }

    bool delayed_routeQuery(Worker::Call::action_t action, GWBUF* pPacket)
    {
        if (action == Worker::Call::EXECUTE)
        {
            routeQuery(pPacket);
        }
        else
        {
            ss_dassert(action == Worker::Call::CANCEL);
            gwbuf_free(pPacket);
        }

        return false;
    }

    ~MyFilterSession()
    {
        // Cancel any delayed calls still registered with this session's tag,
        // so they cannot fire after the session has been destroyed.
        void* pTag = this;
        Worker::current()->cancel_delayed_calls(pTag);
    }
};
The alternative, returning some key that the caller must keep
around, seems more cumbersome for the general case.
It's now possible to provide Worker with a function to call
at a later time. The function may be a free function or a
member function (with the object), taking zero or one argument
of any kind. The argument must be copyable.
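As a rough sketch, a free function taking one argument might be registered as
follows (the exact overloads and the callback convention are assumptions here,
mirroring the member-function example shown earlier in this section):

// Hypothetical free function used as a delayed call; the action argument and
// the bool return value (whether to call again) are assumed.
bool log_stats(Worker::Call::action_t action, int interval)
{
    if (action == Worker::Call::EXECUTE)
    {
        // ... do the periodic work using 'interval' ...
    }

    return false;  // do not reschedule
}

// Call log_stats(..., 42) after roughly 1000ms on the current worker.
Worker::current()->delayed_call(1000, log_stats, 42);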
There's currently no way to cancel a call; that must be added,
as typically the delayed call is associated with a session,
and if the session is closed before the delayed call is made,
bad things are likely to happen.
The Worker::Timer class and the Worker::DelegatingTimer class template are
timers built on top of timerfd_create(2). As such, each consumes a file
descriptor and hence they cannot be created independently for each
timer need.
Each Worker now has a private timer member variable on top of
which a general timer mechanism will be provided.
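The underlying mechanism is roughly the following (a plain timerfd sketch;
epoll_fd stands for the worker's epoll instance and is not an actual MaxScale
variable):

#include <sys/timerfd.h>
#include <sys/epoll.h>

int fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);

itimerspec ts = {};
ts.it_value.tv_sec = 1;      // first tick after 1 second
ts.it_interval.tv_sec = 1;   // and then once a second
timerfd_settime(fd, 0, &ts, nullptr);

epoll_event ev = {};
ev.events = EPOLLIN;
ev.data.fd = fd;
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev);  // EPOLLIN is reported when the timer fires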
The maximum number of workers and the maximum number of routing workers
are now hardwired to 128 and 100, respectively. It is still so that
all workers must be created at startup and destroyed at
shutdown; creating/destroying workers at runtime is not
possible.
Worker is now the base class of all workers. It has a message
queue and can be run in a thread of its own, or in the calling
thread. Worker cannot be used as such; a concrete worker
class must be derived from it. Currently there is only one
concrete class, RoutingWorker.
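As a sketch of the intended usage (the virtual hooks shown are assumptions;
the actual interface a concrete worker must implement may differ):

// Hypothetical concrete worker; the message loop itself comes from Worker.
class MyWorker : public Worker
{
public:
    // Assumed hook: called in the worker's thread before the message loop.
    bool pre_run()
    {
        return true;  // initialization succeeded
    }

    // Assumed hook: called after the message loop has exited.
    void post_run()
    {
    }
};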
There is some overlap in functionality between Worker and
RoutingWorker, as there is e.g. a need for broadcasting a
message to all routing workers, but not to other workers.
Currently other workers cannot be created, as the array
holding the pointers to the workers is exactly as large as
there will be RoutingWorkers. That will be changed so that
the maximum number of threads is hardwired to some ridiculous
value such as 128. That's the first step on the path towards
a situation where the number of worker threads can be changed
at runtime.
A new class mxs::Worker will be introduced and mxs::RoutingWorker
will be inherited from that. mxs::Worker will basically only be a
thread with a message-loop.
Once available, all current non-worker threads (except the one
implicitly created by microhttpd) can be created by inheriting
from that; in practice that means the housekeeping thread, all
monitor threads and possibly the logging thread.
The benefit of this arrangement is that there will then be a general
mechanism for cross-thread communication without having to use any
shared data structures.
The old hkheartbeat variable was changed to the mxs_clock() function that
simply wraps an atomic load of the variable. This allows it to be
correctly read by MaxScale as well as opening up the possibility of
converting the value load to a relaxed memory order read.
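Roughly, the idea is as follows (a sketch; the exact type of the tick counter
is an assumption):

#include <atomic>
#include <cstdint>

static std::atomic<int64_t> hkheartbeat;  // the old tick variable, now behind a function

int64_t mxs_clock()
{
    // A plain (sequentially consistent) load for now; this could later be
    // relaxed to hkheartbeat.load(std::memory_order_relaxed).
    return hkheartbeat.load();
}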
Renamed the header and associated macros. Removed inclusion of the
heartbeat header from the housekeeper header and added it to the files
that were missing it.
The variable containing the number of workers must be updated
only after the workers have been successfully created.
Failure to do this led to a crash in Worker::shutdown_all() if a
terminating signal was received after the worker initialization
had failed.
With a granularity of 1 second, the load will from a human
perspective reflect the current situation. That also means
that the maxadmin output shows "natural" steps: 1s, 1m and 1h.
By definition, the load is calculated using the following formula:
L = 100 * ((T - t) / T)
where T is a time period and t the time of that period that the worker
spends in epoll_wait(). So, if there is so much work that epoll_wait()
always returns immediately, then the load is 100 and if the thread
spends the entire period in epoll_wait(), then the load is 0.
The basic idea is that the timeout given to epoll_wait() is adjusted
so that epoll_wait() will always return at roughly 10-second intervals.
By making a note of when we are about to enter epoll_wait() and when we
return from it, we have all the information we need for calculating the
load.
Due to the nature of things, we will not be able to calculate the load
at exact 10-second boundaries, but it will be pretty close. And the load
is always calculated using the true length of the period.
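A sketch of the idea (now_ms(), running, epoll_fd, events and MAX_EVENTS are
placeholders, not the actual implementation):

#include <sys/epoll.h>
#include <algorithm>
#include <cstdint>

int64_t period_start = now_ms();   // when the current ~10s period began
int64_t waited = 0;                // time spent inside epoll_wait() during the period

while (running)
{
    int64_t before = now_ms();
    int timeout = static_cast<int>(std::max<int64_t>(0, (period_start + 10000) - before));

    int n = epoll_wait(epoll_fd, events, MAX_EVENTS, timeout);

    int64_t after = now_ms();
    waited += after - before;

    // ... deliver the n events ...

    if (after - period_start >= 10000)
    {
        int64_t T = after - period_start;      // true length of the period
        int load = 100 * (T - waited) / T;     // e.g. 2s waited out of 10s => load 80
        // ... record the 10-second load value ...
        period_start = after;
        waited = 0;
    }
}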
We will then calculate 1 minute load by averaging the load value for 6
consecutive 10-second periods and the 1 hour load by averaging the load
value of 60 consecutive 1 minute loads.
So, while the 10-second load represents the load of the most recently
measured 10-second period (and not the load of the most recent 10
seconds), the 1 minute load and the 1 hour load represent the load of
the most recent minute and hour, respectively.
Since a shutdown message will now be sent via the regular epoll route,
there is no need to regularly wake up from epoll in order to check
whether shutdown has been initiated; we can simply wait in epoll_wait
until told to wake up.
Pre-loading users for all threads at startup significantly reduces the
chance of failures caused by the lazy initialization of the user database
done by the authenticators.
If users are not loaded at startup and the connection limit for all
servers is reached, authentication in MaxScale will fail not due to too
many connections but due to the lack of authentication data. This causes
repeated reloading of users, which floods the log with messages and puts
unnecessary stress on the cluster itself.
The internal header directory conflicted with in-source builds causing a
build failure. This is fixed by renaming the internal header directory to
something other than maxscale.
The renaming pointed out a few problems in a couple of source files that
appeared to include internal headers when the headers were in fact public
headers.
Fixed maxctrl in-source builds by making the copying of the sources
optional.
By moving the initialization into Worker::run, all threads, including the
main thread, are properly initialized. This was not noticed before as
qc_sqlite initialized the main thread in the process initialization
callback.
The polling mechanism can now optionally be used for managing
the lifetime of an object placed into the poll set.
If an MXS_POLL_DATA has a non-null 'free' function, then the reference count
of the data will be increased before calling the handler and
decreased after. In that case, if the reference count reaches 0,
the free function will be called.
Note that the reference counts of *all* MXS_POLL_DATAs returned
by 'epoll_wait' will be increased before the events are delivered
to the handlers individually for each MXS_POLL_DATA, and then once
all events have been delivered, the reference count of each
MXS_POLL_DATA will be decreased.
This ensures that if there are interdependencies between different
MXS_POLL_DATAs returned by one call to 'epoll_wait', the case where
an MXS_POLL_DATA is deleted before its events have been delivered
can be avoided by using the reference count for lifetime management.
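Schematically, the delivery loop then looks roughly like this (the handler
signature and the inc/dec helpers are assumptions for illustration, not the
actual declarations):

int n = epoll_wait(epoll_fd, events, MAX_EVENTS, timeout);

// First pass: take a reference on every returned MXS_POLL_DATA.
for (int i = 0; i < n; ++i)
{
    MXS_POLL_DATA* data = static_cast<MXS_POLL_DATA*>(events[i].data.ptr);
    if (data->free)
    {
        mxs_poll_data_inc_ref(data);  // hypothetical helper
    }
}

// Second pass: deliver the events individually.
for (int i = 0; i < n; ++i)
{
    MXS_POLL_DATA* data = static_cast<MXS_POLL_DATA*>(events[i].data.ptr);
    data->handler(data, thread_id, events[i].events);  // assumed handler signature
}

// Third pass: release the references; 'free' is called when the count hits 0.
for (int i = 0; i < n; ++i)
{
    MXS_POLL_DATA* data = static_cast<MXS_POLL_DATA*>(events[i].data.ptr);
    if (data->free && mxs_poll_data_dec_ref(data) == 0)
    {
        data->free(data);
    }
}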
In subsequent commits, the reference count will be taken into use
in the lifetime management of DCBs.
It is now possible to specify the thread stack size to be used
when a new thread is created. This will subsequently be used
to allow the stack size to be specified for worker threads.
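For reference, this boils down to standard pthread attribute handling
(thread_main and pArg are placeholders, not actual MaxScale names):

#include <pthread.h>

pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setstacksize(&attr, 1024 * 1024);   // e.g. a 1 MiB stack

pthread_t thread;
pthread_create(&thread, &attr, thread_main, pArg);
pthread_attr_destroy(&attr);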
The template class wraps a HashMap such that only a few operations
are allowed. Usage requires specializing a RegistryTraits class
template for each entry type.
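A specialization might look roughly like this (the entry type and the members
the registry template expects are assumptions, not the actual interface):

#include <cstdint>

// Hypothetical entry type used only for this sketch.
struct MySession
{
    uint64_t id;
};

template<>
struct RegistryTraits<MySession>
{
    typedef uint64_t   id_type;
    typedef MySession* entry_type;

    static id_type get_id(entry_type entry)
    {
        return entry->id;
    }

    static entry_type null_entry()
    {
        return NULL;
    }
};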
The top level resource self links pointed to the collection instead of the
resource itself. The individual resources now also have a links field that
contains the self link to the resource. This should make navigation of the
API easier as all objects have valid links in them.
Various small changes to part2, as suggested by comments and otherwise.
Mostly renaming; the working logic should not change.
Exception: the session id was changed to 64 bits in the container and associated
functions. Another commit will change it to 64 bits in the session itself.
MySQL sessions are added to a hashmap when created and removed when closed.
MYSQL_COM_PROCESS_KILL is now detected, the thread_id is read and the kill
command is sent to all worker threads to find the correct session. If found, a
fake hangup event is created for the client dcb.
As is, this function is of little use since the client could just disconnect
itself instead. Later on, additional commands of this nature will be added.
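One way to express the kill handling with the worker task mechanism described
later in this section (class names, the task interface and the helper
functions below are illustrative assumptions, not the actual code):

// Hypothetical disposable task broadcast to all workers; each worker checks
// its own session registry for the given thread id.
class KillTask : public Worker::DisposableTask
{
public:
    KillTask(uint64_t thread_id)
        : m_thread_id(thread_id)
    {
    }

    // Assumed to run in the thread context of each worker.
    void execute(Worker& worker)
    {
        MXS_SESSION* pSession = worker_find_session(worker, m_thread_id);  // hypothetical lookup
        if (pSession)
        {
            poll_fake_hangup_event(pSession->client_dcb);  // fake hangup for the client dcb
        }
    }

private:
    uint64_t m_thread_id;
};

// When MYSQL_COM_PROCESS_KILL is detected:
Worker::broadcast(std::auto_ptr<Worker::DisposableTask>(new KillTask(thread_id)));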
The old polling message system is obsolete now that the worker messages
are implemented. The old system was only used to clean up the persistent
connection pool of a server.
It is now possible to define whether tasks are executed immediately or put
on the event queue of the worker thread. If task execution is in automatic
mode and the currently executing thread is a worker thread, the
Task->execute method is called immediately.
This allows tasks to be posted from within worker threads. This is
intended to be used when purging stale persistent connections and
printing diagnostic output via MaxAdmin. All of these actions are done
from within a worker thread.
- Posting a task to a worker for execution (without implicit wait)
is called "post".
- Posting a task to every worker for execution (without implicit wait)
is called "broadcast".
In these cases the task must be provided as a pointer or auto_ptr, to
indicate that the pointed-to task must remain alive for longer than
the duration of the function call.
- Posting a task to every worker for execution *and* waiting for all workers
to have executed the task is called "execute", and the two variants are
now called "execute_concurrently" and "execute_serially".
In these cases the task is provided as a reference, since the functions
will return only when all workers have (in concurrent or serial fashion)
executed the task. That is, it need not remain alive for longer than the
duration of the function call.
Concurrently executing a task on all workers *and* waiting until
all workers have executed the task seems to be common enough to
warrant a helper function for that purpose.
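For instance (a sketch; where the helper lives and its exact signature are
assumptions):

MyTask task;  // a Worker::Task implementation

// Returns only once every worker has executed the task, so the task may
// safely live on the stack.
Worker::execute_concurrently(task);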
When the Worker mechanism has been initialized the current_worker_id
of the calling thread is set to 0. That way, connections can be created
after Worker::init() has been called, but before the workers have been
started. Such connections will be handled by the worker that is running
in the main thread.
A Worker::Task is an object that can be sent to a worker for
execution. The task is sent to the worker using the messaging
mechanism where the `execute` function of the task will be
called in the thread context of the worker.
There are two kinds of tasks: regular tasks and disposable tasks.
The former are just sent to the worker for execution while the
latter are sent and subsequently disposed of, once the task has
been executed.
A disposable task can be sent to either one worker or to all
workers. In the latter case, the task will be deleted once it
has been executed by all workers.
A semaphore can be associated with a regular task. Once the task
has been executed by the worker, the semaphore will automatically
be posted. That way, it is trivial to send a task for execution
to a worker and wait until the task has been executed. For instance:
Semaphore sem;
MyTask task;

pWorker->execute(&task, &sem);  // send the task to the worker
sem.wait();                     // wait until the worker has executed it

const MyResult& result = task.result();
The low level mechanism for posting and broadcasting messages will
be removed.
All workers share an epoll instance that is added level-triggered
to the epoll instance of each Worker. This is intended to be used
together with listening sockets.
When a listening socket is added to the shared epoll instance the
effect is that EPOLLIN will be active for it whenever there is a
connection pending on a listening socket added to that epoll
instance.
When that occurs, all workers will return from their epoll_wait() calls.
When the workers subsequently call epoll_wait() on the shared epoll
instance, that call will return with an event, provided some other thread
has not yet called accept() on the listening socket.
As each worker extracts just one event at a time and calls accept() just
once before calling epoll_wait() again, the client connections will be
distributed evenly across all workers, provided the load on the workers
is roughly the same. If it isn't, then a worker with less load will get
more connections to handle (which will even out the load).
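In outline (a sketch; listen_fd, worker_epoll_fd and the surrounding event
loop are placeholders, not the actual implementation):

#include <sys/epoll.h>
#include <sys/socket.h>

// A single shared epoll instance holds the listening sockets.
int shared_fd = epoll_create1(EPOLL_CLOEXEC);

epoll_event listen_ev = {};
listen_ev.events = EPOLLIN;
listen_ev.data.fd = listen_fd;
epoll_ctl(shared_fd, EPOLL_CTL_ADD, listen_fd, &listen_ev);

// Each worker adds the shared instance (level-triggered, the default) to its
// own epoll instance, so a pending connection wakes up every worker.
epoll_event shared_ev = {};
shared_ev.events = EPOLLIN;
shared_ev.data.fd = shared_fd;
epoll_ctl(worker_epoll_fd, EPOLL_CTL_ADD, shared_fd, &shared_ev);

// In a worker's event loop, when shared_fd reports EPOLLIN:
epoll_event pending;
if (epoll_wait(shared_fd, &pending, 1, 0) == 1)            // extract just one event
{
    int client_fd = accept(pending.data.fd, NULL, NULL);   // accept just once
    // ... register client_fd with this worker's own epoll instance ...
}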