Implement TiDB database monitoring #20826

Open
wants to merge 3 commits into
base: master
90 changes: 67 additions & 23 deletions mysql/README.md
Expand Up @@ -8,7 +8,7 @@ The MySQL integration tracks the performance of your MySQL instances. It collect

Enable [Database Monitoring][32] (DBM) for enhanced insights into query performance and database health. In addition to the standard integration, Datadog DBM provides query-level metrics, live and historical query snapshots, wait event analysis, database load, and query explain plans.

MySQL version 5.6, 5.7, 8.0, and MariaDB versions 10.5, 10.6, 10.11 and 11.1 are supported.
MySQL versions 5.6, 5.7, and 8.0, MariaDB versions 10.5, 10.6, 10.11, and 11.1, and TiDB versions 8.1 and later are supported.

## Setup

Expand Down Expand Up @@ -68,34 +68,22 @@ mysql> GRANT PROCESS ON *.* TO 'datadog'@'%';
Query OK, 0 rows affected (0.00 sec)
```

Verify the replication client. Replace `<UNIQUEPASSWORD>` with the password you created above:
##### TiDB-specific setup

```shell
mysql -u datadog --password=<UNIQUEPASSWORD> -e "show slave status" && \
echo -e "\033[0;32mMySQL grant - OK\033[0m" || \
echo -e "\033[0;31mMissing REPLICATION CLIENT grant\033[0m"
```
For TiDB, user setup is similar to MySQL and MariaDB, with the following differences:

If enabled, metrics can be collected from the `performance_schema` database by granting an additional privilege:
- TiDB does not have `performance_schema`, so skip the `performance_schema` grant.
- TiDB does not support the `REPLICATION CLIENT` privilege. The grant is not needed because TiDB uses different replication mechanisms.
- The `innodb_index_stats` table is not available in TiDB.
- TiDB does not support stored procedures, so there is no need to create the procedure used for explain plans.

```shell
mysql> show databases like 'performance_schema';
+-------------------------------+
| Database (performance_schema) |
+-------------------------------+
| performance_schema |
+-------------------------------+
1 row in set (0.00 sec)

mysql> GRANT SELECT ON performance_schema.* TO 'datadog'@'%';
Query OK, 0 rows affected (0.00 sec)
```

To collect index metrics, grant the `datadog` user an additional privilege:
For TiDB, create the user with these commands:

```shell
mysql> CREATE USER 'datadog'@'%' IDENTIFIED BY '<UNIQUEPASSWORD>';
Query OK, 0 rows affected (0.00 sec)

mysql> GRANT SELECT ON mysql.innodb_index_stats TO 'datadog'@'%';
mysql> GRANT PROCESS ON *.* TO 'datadog'@'%';
Query OK, 0 rows affected (0.00 sec)
```

Expand Down Expand Up @@ -141,6 +129,27 @@ For a full list of available configuration options, see the [sample `mysql.d/con

To collect `extra_performance_metrics`, your MySQL server must have `performance_schema` enabled - otherwise set `extra_performance_metrics` to `false`. For more information on `performance_schema`, see [MySQL Performance Schema Quick Start][9].

##### TiDB configuration

For TiDB instances, some configuration options should be adjusted:

```yaml
init_config:

instances:
  - host: 127.0.0.1
    username: datadog
    password: "<YOUR_CHOSEN_PASSWORD>"
    port: 4000 # Default TiDB port
    options:
      replication: false # TiDB uses different replication mechanisms
      galera_cluster: false
      extra_status_metrics: true
      extra_innodb_metrics: false # TiDB doesn't have InnoDB
      disable_innodb_metrics: true # Disable InnoDB metrics for TiDB
      extra_performance_metrics: false # TiDB doesn't have performance_schema
```
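
The option values above can be checked mechanically before a configuration is deployed. The following is a minimal sketch, not part of the Agent or of this PR: `TIDB_RECOMMENDED_OPTIONS` and `tidb_option_mismatches` are hypothetical names, and the recommended values are copied from the example above.

```python
# Hypothetical helper (an assumption for illustration, not an Agent API):
# flags option values that differ from the TiDB example above.
TIDB_RECOMMENDED_OPTIONS = {
    "replication": False,
    "extra_innodb_metrics": False,
    "disable_innodb_metrics": True,
    "extra_performance_metrics": False,
}


def tidb_option_mismatches(options):
    """Return {option: (actual, recommended)} for values that differ.

    Options missing from `options` are skipped, because their defaults
    are not stated in this section.
    """
    mismatches = {}
    for name, recommended in TIDB_RECOMMENDED_OPTIONS.items():
        if name in options and options[name] != recommended:
            mismatches[name] = (options[name], recommended)
    return mismatches
```

For example, `tidb_option_mismatches({"replication": True})` reports `replication` as the only mismatch, while an empty `options` mapping reports nothing.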

**Note**: The `datadog` user should be set up in the MySQL integration configuration as `host: 127.0.0.1` instead of `localhost`. Alternatively, you may also use `sock`.

[Restart the Agent][10] to start sending MySQL metrics to Datadog.
Expand Down Expand Up @@ -251,6 +260,14 @@ LABEL "com.datadoghq.ad.init_configs"='[{}]'
LABEL "com.datadoghq.ad.instances"='[{"server": "%%host%%", "username": "datadog","password": "<UNIQUEPASSWORD>"}]'
```

For TiDB instances, add the appropriate configuration options:

```yaml
LABEL "com.datadoghq.ad.check_names"='["mysql"]'
LABEL "com.datadoghq.ad.init_configs"='[{}]'
LABEL "com.datadoghq.ad.instances"='[{"server": "%%host%%", "username": "datadog", "password": "<UNIQUEPASSWORD>", "port": 4000, "options": {"disable_innodb_metrics": true, "extra_performance_metrics": false}}]'
```
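
Hand-writing the JSON inside a label is error-prone, so the value can also be generated. This is a minimal sketch under stated assumptions: `tidb_autodiscovery_instances` is a hypothetical helper name, and the keys and values are the ones used in the label above.

```python
import json


def tidb_autodiscovery_instances(password="<UNIQUEPASSWORD>", port=4000):
    """Build the JSON string for the `com.datadoghq.ad.instances` label.

    Hypothetical helper for illustration; the keys mirror the TiDB label
    shown above.
    """
    instance = {
        "server": "%%host%%",
        "username": "datadog",
        "password": password,
        "port": port,
        "options": {
            "disable_innodb_metrics": True,
            "extra_performance_metrics": False,
        },
    }
    # json.dumps writes Python booleans as JSON `true` / `false`.
    return json.dumps([instance])
```

The returned string can be pasted between the single quotes of the `LABEL` line.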

See [Autodiscovery template variables][12] for details on using `<UNIQUEPASSWORD>` as an environment variable instead of a label.

#### Log collection
Expand Down Expand Up @@ -551,6 +568,24 @@ The check does not collect all metrics by default. Set the following boolean con
| ---------------------- | ----------- |
| mysql.info.schema.size | GAUGE |

#### TiDB limitations

When using this integration with TiDB, be aware of the following limitations:

- **InnoDB metrics**: TiDB doesn't use the InnoDB storage engine, so all InnoDB-related metrics are unavailable
- **Performance Schema**: TiDB doesn't have MySQL's `performance_schema`, so performance metrics requiring it are unavailable
- **Replication metrics**: TiDB uses a different replication mechanism (Raft consensus), so traditional MySQL replication metrics don't apply
- **MyISAM metrics**: TiDB doesn't support MyISAM, so key cache metrics are unavailable
- **Binary log metrics**: TiDB has a different binlog implementation, so traditional MySQL binlog metrics may not be available
- **Statement metrics**: TiDB uses `information_schema.cluster_statements_summary` instead of `performance_schema.events_statements_summary_by_digest`
- **Activity monitoring**: TiDB uses `information_schema.cluster_processlist` instead of `performance_schema.events_statements_current`

For Database Monitoring features:
- Query samples and explain plans are collected from `information_schema.cluster_statements_summary`, with some approximations.
- Wait events are not available because TiDB does not track them the way MySQL does. The integration reports `N/A` for all wait events.
- Some query metrics are approximated (for example, rows examined is estimated from keys processed).
- TiDB explain plans are read from the `PLAN` column of the `information_schema.cluster_statements_summary` table, which contains pre-collected execution plans in text format with embedded execution statistics. These are not real-time explain plans like those collected for MySQL and MariaDB.
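
Activity timing is one such approximation. In this PR's `activity.py`, event timers are derived from the `TIME` column of `CLUSTER_PROCESSLIST` (elapsed seconds) rather than from `performance_schema` timers. The sketch below mirrors that arithmetic; `approximate_event_timers` and `NS_PER_MS` are hypothetical names used only for illustration.

```python
import time

NS_PER_MS = 1_000_000


def approximate_event_timers(query_time_s, now_ms=None):
    """Return (event_timer_start, event_timer_end) in nanoseconds.

    `query_time_s` is the elapsed time reported in the TIME column.
    `now_ms` defaults to the current wall-clock time in milliseconds and
    is a parameter only so the function can be tested deterministically.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    elapsed_ms = int(query_time_s * 1000) if query_time_s else 0
    # Clamp at zero so a very large TIME value cannot yield a negative start.
    start_ms = max(0, now_ms - elapsed_ms)
    return start_ms * NS_PER_MS, now_ms * NS_PER_MS
```

Because the end timer is always the collection time, the reported duration equals `TIME` as of the moment of sampling. It is not a measured statement duration.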

### Events

The MySQL check does not include any events.
Expand All @@ -571,6 +606,15 @@ See [service_checks.json][22] for a list of service checks provided by this inte
- [Database user lacks privileges][29]
- [How to collect metrics with a SQL Stored Procedure?][30]

### TiDB-specific troubleshooting

**Missing metrics**: Warnings about missing InnoDB or `performance_schema` metrics are expected when monitoring TiDB. Set `disable_innodb_metrics: true` and `extra_performance_metrics: false` in your configuration.

**Connection issues**: TiDB typically runs on port 4000 instead of MySQL's default 3306. Make sure to specify the correct port in your configuration.

**High metric collection time**: The `CLUSTER_*` tables in TiDB aggregate data from all TiDB nodes, which can be slow in large clusters. Consider increasing the collection interval if needed.
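
For the connection issue described above, a quick TCP check can confirm that the configured port is reachable before looking at credentials or grants. This is a minimal sketch using only the Python standard library; `can_reach` is a hypothetical helper name, and the default of 4000 is the TiDB port mentioned above.

```python
import socket


def can_reach(host, port=4000, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This only tests network reachability. It does not authenticate or
    confirm that the listener is actually TiDB.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If `can_reach("127.0.0.1")` returns `False` while `can_reach("127.0.0.1", 3306)` returns `True`, the instance is probably listening on the MySQL default port rather than the TiDB default.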

## Further Reading

Additional helpful documentation, links, and articles:
1 change: 1 addition & 0 deletions mysql/changelog.d/20826.changed
@@ -0,0 +1 @@
Implement TiDB database monitoring
203 changes: 202 additions & 1 deletion mysql/datadog_checks/mysql/activity.py
Expand Up @@ -7,7 +7,7 @@
import time
from contextlib import closing
from enum import Enum
from typing import Dict, List # noqa: F401
from typing import Dict, List, Tuple # noqa: F401

import pymysql

Expand Down Expand Up @@ -130,6 +130,37 @@
)
"""

# TiDB specific constants
TIDB_ACTIVITY_QUERY_LIMIT = 100

# TiDB specific activity query
TIDB_ACTIVITY_QUERY = """\
SELECT
    ID as processlist_id,
    USER as processlist_user,
    HOST as processlist_host,
    DB as processlist_db,
    COMMAND as processlist_command,
    STATE as processlist_state,
    INFO as sql_text,
    TIME as query_time,
    MEM as memory_usage,
    TxnStart as txn_start_time
FROM INFORMATION_SCHEMA.CLUSTER_PROCESSLIST
WHERE
    COMMAND != 'Sleep'
    AND INFO IS NOT NULL
    AND INFO != ''
    -- Exclude our own monitoring queries
    AND INFO NOT LIKE '%CLUSTER_PROCESSLIST%'
    AND INFO NOT LIKE '%datadog-agent%'
    -- Exclude other system queries
    AND INFO NOT LIKE '%INFORMATION_SCHEMA%'
    AND INFO NOT LIKE '%performance_schema%'
ORDER BY TIME DESC
LIMIT {}
""".format(TIDB_ACTIVITY_QUERY_LIMIT)


class MySQLVersion(Enum):
# 8.0
Expand Down Expand Up @@ -183,6 +214,12 @@ def run_job(self):
                'Waiting for events_waits_current availability to be determined by the check, skipping run.'
            )
        if self._check.events_wait_current_enabled is False:
            # Use TiDB-specific activity collection
            if self._check._get_is_tidb(self._db):
                self._log.debug("TiDB detected, using TiDB-specific activity collection")
                self._collect_tidb_activity()
                return

            azure_deployment_type = self._config.cloud_metadata.get("azure", {}).get("deployment_type")
            if azure_deployment_type != "flexible_server":
                self._check.record_warning(
Expand All @@ -201,6 +238,170 @@ def run_job(self):
        self._check_version()
        self._collect_activity()

    @tracked_method(agent_check_getter=agent_check_getter)
    def _collect_tidb_activity(self):
        # type: () -> None
        """Collect activity data from TiDB CLUSTER_PROCESSLIST"""
        tags = [t for t in self._tags if not t.startswith('dd.internal')]

        with closing(self._get_db_connection().cursor(CommenterDictCursor)) as cursor:
            rows = self._get_tidb_activity(cursor)
            rows = self._normalize_tidb_rows(rows)

            # Group rows by the HOST value reported by CLUSTER_PROCESSLIST.
            # NOTE: HOST is the client address of the session. TiDB documents a
            # separate INSTANCE column that identifies the TiDB server, but
            # TIDB_ACTIVITY_QUERY does not select it, so HOST is used here.
            rows_by_node = {}
            for row in rows:
                # Fall back to 'unknown' when the key is missing or the value is NULL
                node_instance = row.get('processlist_host') or 'unknown'
                rows_by_node.setdefault(node_instance, []).append(row)

            # Create and send separate events for each group
            for node_instance, node_rows in rows_by_node.items():
                event = self._create_tidb_activity_event(node_rows, tags, node_instance)
                payload = json.dumps(event, default=self._json_event_encoding)
                self._check.database_monitoring_query_activity(payload)
                self._check.histogram(
                    "dd.mysql.activity.collect_activity.payload_size",
                    len(payload),
                    tags=tags + ["tidb_node_instance:{}".format(node_instance)] + self._check._get_debug_tags(),
                )

    @tracked_method(agent_check_getter=agent_check_getter, track_result_length=True)
    def _get_tidb_activity(self, cursor):
        # type: (pymysql.cursor) -> List[Dict[str]]
        """Execute TiDB activity query"""
        self._log.debug("Running TiDB activity query [%s]", TIDB_ACTIVITY_QUERY)
        cursor.execute(TIDB_ACTIVITY_QUERY)
        return cursor.fetchall()

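    def _derive_tidb_wait_event(self, state):
        # type: (str) -> Tuple[str, str]
        """
        Derive wait event and wait event group from TiDB processlist state.
        Returns (wait_event, wait_event_group)
        """
        # Placeholder: `state` is currently ignored and every row is reported
        # as 'N/A', because TiDB does not expose MySQL-style wait events.
        return 'N/A', 'N/A'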

    def _normalize_tidb_rows(self, rows):
        # type: (List[Dict[str]]) -> List[Dict[str]]
        """Normalize TiDB activity rows to match expected format"""
        normalized_rows = []
        estimated_size = 0

        for row in rows:
            # Generate unique identifiers for TiDB
            thread_id = row.get('processlist_id', 0)

            # Derive wait event from state
            state = row.get('processlist_state', '')
            wait_event, wait_event_group = self._derive_tidb_wait_event(state)

            # Convert TiDB fields to match MySQL activity format
            normalized_row = {
                'thread_id': thread_id,
                'processlist_id': row.get('processlist_id'),
                'processlist_user': row.get('processlist_user'),
                'processlist_host': row.get('processlist_host'),
                'processlist_db': row.get('processlist_db'),
                'processlist_command': row.get('processlist_command'),
                'processlist_state': row.get('processlist_state'),
                'sql_text': row.get('sql_text'),
                'query_time': row.get('query_time', 0),
                'memory_usage': row.get('memory_usage', 0),
                'txn_start_time': row.get('txn_start_time'),
                # Derived wait events
                'wait_event': wait_event,
                'wait_event_type': wait_event_group,
            }

            # Add query truncation state
            if normalized_row['sql_text'] is not None:
                normalized_row['query_truncated'] = get_truncation_state(normalized_row['sql_text']).value

            # Obfuscate the query
            normalized_row = self._obfuscate_and_sanitize_row(normalized_row)

            estimated_size += self._get_estimated_row_size_bytes(normalized_row)
            if estimated_size > MySQLActivity.MAX_PAYLOAD_BYTES:
                return normalized_rows

            normalized_rows.append(normalized_row)

        return normalized_rows

    def _create_tidb_activity_event(self, active_sessions, tags, node_instance):
        # type: (List[Dict[str]], List[str], str) -> Dict[str]
        """Create activity event payload for TiDB"""
        # Convert rows to MySQL-compatible activity format
        mysql_activity = []

        for row in active_sessions:
            # Calculate timing information
            # Use milliseconds to avoid overflow issues
            current_time_ms = int(time.time() * 1000)
            query_time_s = row.get('query_time', 0)
            query_time_ms = int(query_time_s * 1000) if query_time_s else 0
            event_start_ms = max(0, current_time_ms - query_time_ms)

            # Generate event IDs based on thread_id and timestamp.
            # NOTE: hash() of a str is salted per Python process, so these IDs
            # are not stable across Agent restarts. The modulo keeps the value
            # non-negative and below 2**31.
            event_id = hash(str(row['thread_id']) + str(current_time_ms)) % (2**31)

            activity = {
                # Essential identifiers
                'thread_id': row['thread_id'],
                'processlist_id': row['processlist_id'],
                'processlist_user': row['processlist_user'],
                'processlist_host': row['processlist_host'],
                'processlist_db': row['processlist_db'],
                'processlist_command': row['processlist_command'],
                'processlist_state': row['processlist_state'],
                'sql_text': row.get('sql_text'),
                'current_schema': row.get('processlist_db'),
                'query_signature': row.get('query_signature'),
                'dd_commands': row.get('dd_commands', []),
                'dd_tables': row.get('dd_tables', []),
                'dd_comments': row.get('dd_comments', []),
                'query_truncated': row.get('query_truncated'),
                # Event identifiers
                'event_id': event_id,
                'end_event_id': event_id,  # Same as event_id for TiDB
                # Timing information
                'event_timer_start': event_start_ms * 1000000,  # Convert to nanoseconds
                'event_timer_end': current_time_ms * 1000000,  # Convert to nanoseconds
                'lock_time': 0,  # TiDB doesn't provide lock time in CLUSTER_PROCESSLIST
                # Wait event info
                'wait_event': row.get('wait_event', 'CPU'),
                'wait_timer_start': event_start_ms * 1000000,  # Same as event timer
                'wait_timer_end': current_time_ms * 1000000,
                # Additional MySQL compatibility fields
                'object_name': None,  # TiDB doesn't track file operations
                'object_type': None,
                'operation': None,
                'source': '',
            }

            mysql_activity.append(activity)

        event = {
            "host": self._check.reported_hostname,
            "ddagentversion": datadog_agent.get_version(),
            "ddsource": "mysql",
            "dbm_type": "activity",
            "collection_interval": self.collection_interval,
            "ddtags": tags,
            "timestamp": time.time() * 1000,
            "cloud_metadata": self._config.cloud_metadata,
            'service': self._config.service,
            "mysql_activity": mysql_activity,
        }

        # For TiDB, add the specific node instance for this activity event
        if node_instance:
            event['tidb'] = {'node_instance': node_instance}

        return event

    def _check_version(self):
        # type: () -> None
        if self._check.version.version_compatible((8,)):