MXS-2045: Update avrorouter tutorial

Removed unnecessary sections and updated some of the text to be more
specific. Expanded the explanations on where the replication starts and
how the avrorouter needs to be configured. Added example output from the
cdc.py client and removed the relatively useless maxavrocheck section.
Markus Mäkelä 2018-09-20 10:49:10 +03:00
parent 1689fe7f48
commit e888bcac3b


This tutorial is a short introduction to the
[Avrorouter](../Routers/Avrorouter.md): how to set it up and how it interacts
with the binlogrouter.
The first part configures the services and sets them up for the binary log to
Avro file conversion. The second part of this tutorial uses the client listener
interface of the avrorouter and shows how to communicate with the service over
the network.

## Preparing the master server
The master server that we will be replicating from needs to have binary logging
enabled, `binlog_format` set to `row` and `binlog_row_image` set to
`full`. These can be enabled by adding the following two lines to the _my.cnf_
file of the master.
```
binlog_format=row
binlog_row_image=full
```
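If the master is already running, these settings are easy to verify before
moving on. A quick check from any MariaDB client session on the master (note
that enabling `log_bin` itself requires a server restart):

```
-- Verify that binary logging is enabled and row events contain full images
SHOW VARIABLES LIKE 'log_bin';
SHOW VARIABLES LIKE 'binlog_format';
SHOW VARIABLES LIKE 'binlog_row_image';
```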
The avrorouter is configured as a service that uses the binlogrouter service as
its data source. The parts of the MaxScale configuration relevant to this
tutorial are shown below (the replication listener port is an example value;
the CDC listener port `4001` matches the _cdc.py_ invocation used later). The
full configuration also defines the _replication-service_ itself and a MaxAdmin
listener (`protocol=maxscaled`, `socket=default`) used for the administrative
commands later in this tutorial.

```
# The Avro conversion service
[avro-service]
type=service
router=avrorouter
source=replication-service
filestem=binlog
start_index=15

# The listener for the replication-service
[replication-listener]
type=listener
service=replication-service
protocol=MySQLClient
port=3306

# The client listener for the avro-service
[avro-listener]
type=listener
service=avro-service
protocol=CDC
port=4001
```
The `source` parameter in the _avro-service_ points to the _replication-service_
we defined before. This service will be the data source for the avrorouter. The
_filestem_ is the prefix of the binlog file names and _start_index_ is the
number of the binlog file to start from. With these parameters, the avrorouter
will start reading events from the binlog `binlog.000015`.
After the services were defined, we added the listeners for the
_replication-service_ and the _avro-service_. The _CDC_ protocol is a new
protocol introduced with the avrorouter and is currently the only protocol the
avrorouter supports.
Note that the _filestem_ and _start_index_ must point to the file that is the
first binlog that the binlogrouter will replicate. For example, if the first
file you are replicating is `my-binlog-file.001234`, set the parameters to
`filestem=my-binlog-file` and `start_index=1234`.
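Expressed as a configuration fragment, that example would look like this (the
file names are hypothetical):

```
# First binlog to replicate is my-binlog-file.001234 (hypothetical name)
filestem=my-binlog-file
start_index=1234
```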
For more information on the avrorouter options, read the [Avrorouter
Documentation](../Routers/Avrorouter.md).
# Preparing the data in the master server
Note that the tables need to be created in the binary logs before the
conversion process is started.
If the binary logs contain data modification events for tables that aren't
created in the binary logs, the Avro schema of the table needs to be manually
created. There are multiple ways to do this:
- Dump the database to a slave, configure it to replicate from the master and
point MaxScale to this slave (this is the recommended method as it requires no
extra steps)
- Use the [_cdc_schema_ Go utility](../Routers/Avrorouter.md#avro-schema-generator)
and copy the generated .avsc files to the _avrodir_
- Use the [Python version of the schema generator](../../server/modules/protocol/examples/cdc_schema.py)
and copy the generated .avsc files to the _avrodir_
If you used the schema generator scripts, all Avro schema files for tables that
are not created in the binary logs need to be in the location pointed to by the
_avrodir_ parameter. The files use the following naming scheme:
`<database>.<table>.<schema_version>.avsc`. For example, the schema file name of
the _test.t1_ table would be `test.t1.0000001.avsc`.
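As an illustration, a hand-written schema file for the `test.t1` table used
later in this tutorial (`test.t1.0000001.avsc`) could look like the following
sketch; the field layout mirrors the schema records that the avrorouter itself
emits:

```
{
    "namespace": "MaxScaleChangeDataSchema.avro",
    "type": "record",
    "name": "ChangeRecord",
    "fields": [
        {"name": "id", "type": "int", "real_type": "int", "length": -1}
    ]
}
```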
# Starting MariaDB MaxScale
Start MariaDB MaxScale and set up the binlogrouter by connecting to the
listener of the _replication-service_ with a MySQL client and executing a few
commands.
```
CHANGE MASTER TO MASTER_HOST='172.18.0.1',
MASTER_PORT=3000,
MASTER_LOG_FILE='binlog.000015',
MASTER_LOG_POS=4,
MASTER_USER='maxuser',
MASTER_PASSWORD='maxpwd';
START SLAVE;
```
**NOTE:** GTID replication is not currently supported and file-and-position
replication must be used.
This will start the replication of binary logs from the master server at
172.18.0.1, listening on port 3000. The first file that the binlogrouter
replicates is `binlog.000015`, the same file that was configured as the
starting file for the avrorouter.

For more details about the SQL commands, refer to the
[Binlogrouter](../Routers/Binlogrouter.md) documentation.
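To check that the binlogrouter has connected to the master and is receiving
binlogs, you can run the usual replication status query against the same
MaxScale listener; the binlogrouter emulates a normal replication slave in this
respect:

```
-- Run against the MaxScale binlogrouter listener, not against the master
SHOW SLAVE STATUS\G
```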
After the binary log streaming has started, the avrorouter will automatically
start processing the binlogs.
# Creating and Processing Data

Next, create a simple test table and populate it with some data by executing
the following statements.
```
CREATE TABLE test.t1 (id INT);
INSERT INTO test.t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
```
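Inserts are only one of the event types that the avrorouter captures. If you
also want to see `update_before`, `update_after` and `delete` events in the
data stream later on, modify the rows as well, for example:

```
-- Produces update_before/update_after and delete events in the change stream
UPDATE test.t1 SET id = 11 WHERE id = 1;
DELETE FROM test.t1 WHERE id = 10;
```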
To use the _cdc.py_ command line client to connect to the CDC service, we must first
create a user. This can be done via maxadmin by executing the following command.
```
maxadmin call command cdc add_user avro-service maxuser maxpwd
```
This will create the _maxuser:maxpwd_ credentials, which can then be used to
request a JSON data stream of the `test.t1` table that was created earlier.
```
cdc.py -u maxuser -p maxpwd -h 127.0.0.1 -P 4001 test.t1
```
The output is a stream of JSON events describing the changes made to the
database.
```
{"namespace": "MaxScaleChangeDataSchema.avro", "type": "record", "name": "ChangeRecord", "fields": [{"name": "domain", "type": "int"}, {"name": "server_id", "type": "int"}, {"name": "sequence", "type": "int"}, {"name": "event_number", "type": "int"}, {"name": "timestamp", "type": "int"}, {"name": "event_type", "type": {"type": "enum", "name": "EVENT_TYPES", "symbols": ["insert", "update_before", "update_after", "delete"]}}, {"name": "id", "type": "int", "real_type": "int", "length": -1}]}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 1, "timestamp": 1537429419, "event_type": "insert", "id": 1}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 2, "timestamp": 1537429419, "event_type": "insert", "id": 2}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 3, "timestamp": 1537429419, "event_type": "insert", "id": 3}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 4, "timestamp": 1537429419, "event_type": "insert", "id": 4}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 5, "timestamp": 1537429419, "event_type": "insert", "id": 5}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 6, "timestamp": 1537429419, "event_type": "insert", "id": 6}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 7, "timestamp": 1537429419, "event_type": "insert", "id": 7}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 8, "timestamp": 1537429419, "event_type": "insert", "id": 8}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 9, "timestamp": 1537429419, "event_type": "insert", "id": 9}
{"domain": 0, "server_id": 3000, "sequence": 11, "event_number": 10, "timestamp": 1537429419, "event_type": "insert", "id": 10}
```
The first record is always the JSON format schema of the table, describing the
names and types of its fields. All records that follow it represent the changes
that have happened on the database.
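Because the output is a stream of newline-delimited JSON records, it is easy to
post-process. The following is a minimal sketch of a hypothetical `consume.py`
script (not part of MaxScale) that separates the schema record from the change
records when the output of _cdc.py_ is piped into it:

```
#!/usr/bin/env python3
# Hypothetical consumer for the cdc.py output stream: the first record is
# the table schema, every following record is a single change event.
import json
import sys

schema = None
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    if schema is None:
        schema = record
        print("columns:", [field["name"] for field in schema["fields"]])
    else:
        print(record["event_type"], "event, id =", record.get("id"))
```

It would be invoked as
`cdc.py -u maxuser -p maxpwd -h 127.0.0.1 -P 4001 test.t1 | ./consume.py`.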