For #2383
1. Limit the number of concurrent transactions of a routine load job.
2. Create the new routine load task when the txn becomes VISIBLE, not when it becomes COMMITTED.
For #2267
1. All non-master daemon threads should also be started after the catalog is ready.
For #2354
1. `fixLoadJobMetaError()` should be called after all metadata has been read, including the image and edit logs.
2. A mini load job should be set to CANCELLED, instead of UNKNOWN, when the corresponding
transaction is not found.
Pure DocValue optimization for doris-on-es
Future todo:
Currently, every tuple scan checks whether pure_docvalue is enabled. This is wasteful: the check should be done once for the whole scan, outside the per-tuple loop. I will address this in a future change.
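The todo above can be illustrated with a minimal sketch. This is not the actual scanner code: the function names, the `ctx` dictionary, and the row layout are hypothetical, and only the `pure_docvalue` flag comes from the description.

```python
def scan_per_tuple_check(rows, ctx):
    """Current behaviour (sketch): the flag is consulted once per tuple."""
    out = []
    for row in rows:
        # Hypothetical row layout: each row carries both representations.
        if ctx["pure_docvalue"]:
            out.append(row["doc_values"])
        else:
            out.append(row["source"])
    return out


def scan_hoisted_check(rows, ctx):
    """Proposed behaviour (sketch): decide once, outside the per-tuple loop."""
    key = "doc_values" if ctx["pure_docvalue"] else "source"
    return [row[key] for row in rows]
```

Both functions return the same values. The second one only moves the decision out of the loop.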
The timeline of this issue is as follows:
1. For some reason, the master lost contact with the other two followers.
Judging from the master's logs, the master did not print any logs for almost 40 seconds.
We suspect that it was stuck due to a full GC or some other reason, causing the
other two followers to conclude that the master had been disconnected.
2. After the other two followers held a re-election, they continued to provide service.
3. The original master node was manually restarted afterwards. On the first restart,
it needed to roll back some committed logs, so it had to be shut down and restarted again.
After the second restart, it returned to normal.
The main cause is that the master was stuck for 40 seconds for some reason.
This issue requires further observation.
In the meantime, to alleviate this problem, we decided to make bdbje's heartbeat timeout
a configurable value. The default is 30 seconds. It can be set to 1 minute
to avoid this problem for now.
There is a bug in Doris version 0.10.x. When a load job in PENDING or LOADING
state was replayed from the image (not through the edit log), we forgot to add
the corresponding callback id to the CallbackFactory. As a result, the
subsequent finish txn edit logs could not properly finish the job during the
replay process. Consequently, when the FE restarts, load jobs
that should have been completed re-enter the PENDING state,
resulting in load tasks being submitted repeatedly.
Those wrong images are unrecoverable, so we have to reset all load jobs
in PENDING or LOADING state when restarting the FE, depending on the status of
each job's corresponding txn, to avoid submitting jobs repeatedly.
If the corresponding txn exists, set the load job's state according to the txn's status.
If the txn does not exist, it may have been removed due to label expiration,
so we cannot know whether the txn was aborted or became visible. In that case we have to set
the job's state to UNKNOWN, which needs to be handled manually.
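The reset rule above can be sketched roughly as follows. The description only says that the job state "depends on" the txn status and that a missing txn yields UNKNOWN. The concrete mapping below (VISIBLE to FINISHED, ABORTED to CANCELLED, anything else left unchanged) is an assumption made for illustration, as are the function name and the use of plain strings for states.

```python
def reset_job_state(job_state, txn_status):
    """Return the state a load job should have after FE restart (sketch).

    txn_status is None when the txn can no longer be found,
    e.g. because its label has expired.
    """
    if job_state not in ("PENDING", "LOADING"):
        return job_state            # only unfinished jobs are reset
    if txn_status is None:
        return "UNKNOWN"            # from the description: must be handled manually
    if txn_status == "VISIBLE":
        return "FINISHED"           # assumption: a visible txn means the load succeeded
    if txn_status == "ABORTED":
        return "CANCELLED"          # assumption: an aborted txn means the load failed
    return job_state                # assumption: other txn states leave the job as-is
```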
Previously, only the Master FE had node info metrics indicating which nodes are alive.
But this info should be available on every FE, so that the monitoring system
can get all metrics from any FE.
DecommissionJob is also a type of AlterJob.
When AlterJobV2 was introduced, DecommissionJob was not modified accordingly.
In fact, the Decommission operation does not need to generate a Job; it only needs to mark the corresponding Backend's state as Decommission. After that, the tablet repair logic will try to migrate the tablets off that Backend. SystemHandler only needs to check all nodes marked as decommissioned and then drop the nodes that have been emptied.
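A rough sketch of the job-less flow described above. The dictionary layout, the field names (`decommissioned`, `tablet_num`), and both function names are hypothetical; they are not the actual SystemHandler or Backend API.

```python
def mark_decommissioned(backends, backend_id):
    """Decommission = just flag the backend; no job object is created."""
    backends[backend_id]["decommissioned"] = True


def drop_emptied_backends(backends):
    """Periodic check (sketch): drop flagged backends that hold no tablets."""
    emptied = [
        bid for bid, be in backends.items()
        if be["decommissioned"] and be["tablet_num"] == 0
    ]
    for bid in emptied:
        del backends[bid]
    return emptied
```

In this sketch the tablet repair logic is assumed to run elsewhere and to decrease `tablet_num` as tablets are migrated away.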
This variable is mainly for the INSERT operation, because an INSERT has both a query part and a load part.
The exec_mem_limit variable alone cannot distinguish the memory limits of the two parts.
Enhance Doris on ES error messages and fix some field data conversion errors.
For varchar/char columns, Elasticsearch users sometimes post non-string values to the Elasticsearch index. Because the value is read from _source, we cannot handle every JSON type individually, so we simply convert the value to its original string representation. This may be a bit of a hack, but it works around the issue.
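The workaround can be illustrated with a small sketch: a value read from `_source` for a varchar/char column is kept if it is already a string, and otherwise serialized back to JSON text. The function name is hypothetical, and `json.dumps` stands in for whatever serialization the real code uses, so the result is a canonical JSON form rather than necessarily the exact bytes the user posted.

```python
import json


def to_string_repr(value):
    """Return a string for any JSON value found in a varchar/char field (sketch)."""
    if isinstance(value, str):
        return value
    # Numbers, booleans, null, arrays and objects are written back as JSON text.
    return json.dumps(value, separators=(",", ":"))
```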
In the previous implementation, a clone task would continue downloading files
even if an error had occurred, which may cause unexpected problems. This
change list refactors it so that when an error happens, the clone task
fails as a whole and tries to clone from another remote source.
Besides the above change, I call FileUtils::remove_all and create_dir
instead of the boost equivalents, which may throw exceptions. In addition,
AgentMasterClient is replaced with ThriftRpcHelper, so that
connections can be reused.
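The new failure handling can be sketched as below. This is only an illustration of "stop at the first error and move on to the next source". The function name, the `download` callback, and the use of `OSError` as the failure signal are assumptions, not the actual clone task code.

```python
def clone_from_sources(sources, file_names, download):
    """Try each remote source in turn; abandon a source at its first error (sketch).

    Returns the source that delivered every file, or None if all sources failed.
    """
    for src in sources:
        try:
            for name in file_names:
                download(src, name)   # any failure aborts this source entirely
        except OSError:
            continue                  # fail as a whole, then try the next source
        return src
    return None
```

A real implementation would also need to clean up the partially downloaded files before moving on, which this sketch leaves out.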
This commit adds a new SQL mode named MODE_PIPES_AS_CONCAT.
Description:
1. If this mode is active, '||' is handled differently from the original behavior (where '||' and 'or' are treated as the same operator in Doris): it concatenates two expressions and returns a new string. For example, 'a' || 'b' = 'ab' and 1 || 0 = '10'.
2. Users can activate this mode with "SET sql_mode = PIPES_AS_CONCAT" and deactivate it with "SET sql_mode = ''".
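The two interpretations of '||' can be modelled with a toy evaluator. This is a simplified sketch, not the Doris analyzer: the function name is hypothetical, NULL handling is ignored, and the logical-OR branch uses Python truthiness, which only matches SQL for numeric operands.

```python
def eval_pipes(left, right, pipes_as_concat):
    """Evaluate `left || right` under the given sql_mode setting (sketch)."""
    if pipes_as_concat:
        # PIPES_AS_CONCAT: both operands are cast to string and concatenated.
        return str(left) + str(right)
    # Default: '||' is a synonym for OR.
    return bool(left) or bool(right)
```

With the mode active, `eval_pipes('a', 'b', True)` gives `'ab'` and `eval_pipes(1, 0, True)` gives `'10'`, matching the examples in the description.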