Kafka
Introduction
Kafka monitoring involves collecting and analyzing metrics related to the performance and health of Apache Kafka clusters. By monitoring these metrics, administrators can identify bottlenecks, optimize resource usage, detect potential issues like lag or broker failure, and ensure smooth and reliable message streaming and processing. This proactive approach helps in maintaining the overall efficiency, scalability, and stability of the Kafka environment.
Getting Started
Compatibility
The Kafka O11ySource is designed to work with Kafka version 7 and later, and it has been tested with Kafka 7.3.
Data Collection Method
The Kafka O11ySource is configured to collect Kafka Broker, Kafka Consumer Group, and Kafka Zookeeper metrics in both standalone and cluster mode.
vuSmartMaps uses the vumetric agent to collect these metrics.
Prerequisites
Inputs for Configuring Data Source
- Instance Name: Enter a name for the Kafka Package instance. This should be a unique identifier for the specific Kafka Cluster Package deployment you want to monitor.
- Kafka Cluster ID: Cluster ID for which the package is being created.
- Package Type: Select the package type to deploy.
- Kafka Script Path: Path of the Kafka scripts (e.g. /bin/).
- Kafka Host: IP address on which the Kafka Broker is exposed.
- Kafka Port: Port on which the Kafka Broker is exposed.
- Kafka Broker ID: Enter a name that uniquely identifies the Kafka Broker instance.
- Zookeeper Host: IP address on which Zookeeper runs.
- Zookeeper Port: Port on which Zookeeper is exposed.
- Zookeeper Keeper ID: Enter a name that uniquely identifies the Zookeeper instance.
- Jolokia URL: Enter the Jolokia URL used to fetch Kafka metrics.
- Polling Interval [seconds]: How frequently data is gathered. The interval should be between 60 and 86400 seconds.
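The inputs listed above can be sanity-checked before they are entered. The following is a minimal sketch and not part of vuSmartMaps: the `KafkaSourceInputs` class, its field names, and `validate_inputs` are hypothetical stand-ins for a few of the form fields. Only the 60 to 86400 second polling range comes from the field description; the port range is the general TCP limit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KafkaSourceInputs:
    """Hypothetical container mirroring a few of the form fields."""
    instance_name: str
    kafka_host: str
    kafka_port: int
    polling_interval_s: int


def validate_inputs(inputs: KafkaSourceInputs) -> List[str]:
    """Return a list of problems found; an empty list means the inputs look valid."""
    problems = []
    if not inputs.instance_name.strip():
        problems.append("Instance Name must not be empty")
    # Valid TCP ports are 1..65535.
    if not 1 <= inputs.kafka_port <= 65535:
        problems.append("Kafka Port must be between 1 and 65535")
    # Range taken from the Polling Interval field description.
    if not 60 <= inputs.polling_interval_s <= 86400:
        problems.append("Polling Interval must be between 60 and 86400 seconds")
    return problems
```

For example, `validate_inputs(KafkaSourceInputs("kafka-prod", "192.0.2.10", 9092, 30))` would report the polling interval as out of range; the instance name and IP address here are placeholders.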
Firewall Requirement
To collect data from this O11ySource, ensure the following ports are opened:
Source IP | Destination IP | Destination Port | Protocol | Direction |
---|---|---|---|---|
vuSmartMaps IP | IP address of the Kafka server | 9092, 2181, 8778* | TCP | Outbound |
IP address of the Kafka server | vuSmartMaps Kafka Broker IP | 9092* | TCP | Inbound |
*Before providing the firewall requirements, please update the port based on the customer environment.
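Once the firewall rules are in place, reachability of the listed ports can be verified from the collecting host. The sketch below uses only the Python standard library; the helper names `port_is_reachable` and `check_ports` are illustrative, and the port numbers should be replaced with the ones used in the customer environment. A successful TCP connection only shows that the port is open, not that Kafka, Zookeeper, or Jolokia is answering correctly on it.

```python
import socket
from typing import Dict, List


def port_is_reachable(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host: str, ports: List[int]) -> Dict[int, bool]:
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_reachable(host, port) for port in ports}
```

A call such as `check_ports("192.0.2.10", [9092, 2181, 8778])` returns a dictionary keyed by port; the IP address is a placeholder for the Kafka server.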
Configuring the Target
Configure Metrics Collection from Kafka Server
- On each Kafka instance, port 9092 should be open for external requests. The following metrics are collected from the running Kafka instance:
  - Partition Metrics: Crucial for monitoring the performance and health of topics within a cluster. Key metrics include under-replicated partitions, which indicate potential data loss risks, and messages in/out rates, which help assess producer and consumer performance. Regularly tracking these metrics ensures effective load balancing and optimal data availability.
  - Consumer Group Metrics: Provide insights into the performance and behavior of consumer applications. Key metrics include consumer lag, which measures the difference between the latest message offset and the last consumed offset and shows whether consumers are keeping up with producers. Other important metrics include the rate of messages consumed and the time taken for processing, which help assess the efficiency and responsiveness of consumer groups. Monitoring these metrics ensures that consumer applications operate smoothly and can handle the expected workload.
- On each Kafka instance, port 8778 should be open for external requests, with Jolokia metrics enabled.
  - Jolokia Metrics: Monitor essential performance indicators of a Kafka server, including memory usage, garbage collection counts, and data throughput metrics such as BytesInPerSec and BytesOutPerSec. Key metrics also include consumer and producer request rates, under-replicated partitions, and leader election rates, providing a comprehensive view of the broker's health and operational efficiency.
- On each Zookeeper instance, port 2181 should be open for external requests, with Jolokia metrics enabled.
  - Zookeeper Metrics: Include connection statistics such as num_alive_connections and packet counts (packets_received, packets_sent), which assess communication load. They also track znode and node counts to show the complexity of the data structure, alongside latency metrics (latency_min, latency_max, latency_avg) for evaluating server response times. Additionally, file descriptor metrics offer insights into resource utilization.
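Consumer lag, as described in the list above, is the difference between the latest message offset in a partition and the last offset the consumer group has consumed. The sketch below illustrates that arithmetic only; it is not the agent's implementation. The function names are hypothetical, and flooring the result at zero is an assumption made here to handle the case where the two offsets are read at slightly different moments.

```python
from typing import Dict, Tuple


def consumer_lag(latest_offset: int, consumed_offset: int) -> int:
    """Lag for one partition: latest offset minus last consumed offset.

    Floored at 0 (an assumption) in case the consumed offset was read
    after the latest offset and is momentarily ahead of it.
    """
    return max(latest_offset - consumed_offset, 0)


def group_lag(partitions: Dict[int, Tuple[int, int]]) -> int:
    """Total lag for a consumer group.

    `partitions` maps partition id -> (latest_offset, consumed_offset).
    """
    return sum(
        consumer_lag(latest, consumed)
        for latest, consumed in partitions.values()
    )
```

For instance, a partition whose latest offset is 1500 and whose last consumed offset is 1420 has a lag of 80 messages. A lag that keeps growing across polling intervals suggests the consumers are not keeping up with the producers.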
Configuration Steps
- Enable the O11ySource.
- Select the Sources tab and press the + button to add the details of the Kafka instance to be monitored.
- Set up the metrics collection configuration, which includes the cluster ID and the type of package. The O11ySource is already configured with the metrics to collect, though you can adjust the metric collection intervals. Afterwards, select Save and Continue to proceed with downloading the Healthbeat agent.
- The following packages will be available for download based on the OS:
  - <Healthbeat full install package>: Downloads the full Healthbeat agent package with the required configurations for a fresh installation.
  - <Healthbeat config update package>: Downloads the agent configuration package to update an existing Healthbeat installation.
- Download the agent installation or update package, then click Finish to close the data source window.
Metrics Collected
Name | Description | Data Type |
---|---|---|
timestamp | Detailed timestamp | DateTime64 |
tenant_id | Tenant ID | LowCardinality(String) |
bu_id | Business unit ID | LowCardinality(String) |
host | Hostname of the server | LowCardinality(String) |
target | Target server or service | LowCardinality(String) |
cluster_id | Cluster ID | LowCardinality(String) |
broker_id | Broker ID | LowCardinality(String) |
sub_type | Subtype of the metric | LowCardinality(String) |
no_of_brokers | Number of brokers | UInt64 |
brokerStatus | Status of the broker | LowCardinality(String) |
kafka_broker_id | Kafka broker ID | UInt64 |
kafka_host_address | Kafka host address | LowCardinality(String) |
timestamp | Detailed timestamp | DateTime64 |
tenant_id | Tenant ID | LowCardinality(String) |
bu_id | Business unit ID | LowCardinality(String) |
host | Hostname of the server | LowCardinality(String) |
target | Target server or service | LowCardinality(String) |
sub_type | Subtype of the metric | LowCardinality(String) |
cluster_id | Cluster ID | LowCardinality(String) |
kafka_broker_id | Kafka broker ID | UInt64 |
kafka_broker_address | Kafka broker address | LowCardinality(String) |
kafka_topic_name | Kafka topic name | LowCardinality(String) |
kafka_partition_topic_broker_id | Kafka partition topic broker ID | LowCardinality(String) |
kafka_partition_partition_is_leader | Indicates if the partition is leader | Boolean |
kafka_partition_partition_insync_replica | Indicates if the partition is an in-sync replica | Boolean |
kafka_partition_partition_leader | Leader of the partition | UInt64 |
kafka_partition_partition_replica | Replica of the partition | UInt64 |
kafka_partition_offset_newest | Newest offset in the partition | UInt64 |
kafka_partition_offset_oldest | Oldest offset in the partition | UInt64 |
kafka_partition_id | Partition ID | UInt64 |
kafka_partition_topic_id | Kafka partition topic ID | LowCardinality(String) |
kafka_consumergroup_offset | Consumer group offset | Int64 |
kafka_consumergroup_meta | Consumer group metadata | String |
kafka_consumergroup_consumer_lag | Consumer group lag | UInt64 |
kafka_consumergroup_error_code | Consumer group error code | UInt64 |
kafka_consumergroup_client_host | Consumer group client host | LowCardinality(String) |
kafka_consumergroup_client_member_id | Consumer group client member ID | String |
kafka_consumergroup_client_id | Consumer group client ID | String |
kafka_consumergroup_id | Consumer group ID | String |
timestamp | Detailed timestamp | DateTime64 |
tenant_id | Tenant ID | LowCardinality(String) |
bu_id | Business unit ID | LowCardinality(String) |
host | Hostname of the server | LowCardinality(String) |
target | Target server or service | LowCardinality(String) |
cluster_id | Cluster ID | LowCardinality(String) |
message | Message details | String |
GROUP | Consumer group | String |
LAG | Lag of the consumer group | UInt64 |
cOFFSET | Current offset | UInt64 |
eOFFSET | End offset | UInt64 |
cnID | Consumer node ID | String |
HOST | Host of the consumer | String |
TOPIC | Topic name | String |
PARTITION | Partition number | UInt64 |
clID | Client ID | String |
timestamp | Detailed timestamp | DateTime64 |
tenant_id | Tenant ID | LowCardinality(String) |
bu_id | Business unit ID | LowCardinality(String) |
host | Hostname of the server | LowCardinality(String) |
target | Target server or service | LowCardinality(String) |
cluster_id | Cluster ID | LowCardinality(String) |
broker_id | Broker ID | LowCardinality(String) |
type | Type of the metric | LowCardinality(String) |
sub_type | Subtype of the metric | LowCardinality(String) |
metric_type | Type of the metric | LowCardinality(String) |
mbeans | MBean information | String |
CollectionCount | Count of collections | UInt64 |
CollectionCount_diff | Collection count difference | UInt64 |
CollectionTime | Time spent in collection | UInt64 |
CollectionTime_diff | Collection time difference | UInt64 |
topic | Topic name | String |
bytes_per_sec | Bytes per second | Float64 |
messages_per_sec | Messages per second | Float64 |
jolokia_metrics_Threading_ThreadCount | Thread count in Threading metrics | UInt64 |
jolokia_metrics_Threading_TotalStartedThreadCount | Total started thread count in Threading metrics | UInt64 |
jolokia_metrics_Threading_PeakThreadCount | Peak thread count in Threading metrics | UInt64 |
jolokia_metrics_Threading_DaemonThreadCount | Daemon thread count in Threading metrics | UInt64 |
jolokia_metrics_broker_IsrShrinks_OneMinuteRate | ISR shrinks one minute rate | Float64 |
jolokia_metrics_broker_IsrShrinks_FiveMinuteRate | ISR shrinks five minute rate | Float64 |
jolokia_metrics_broker_ActiveControllerCount | Active controller count | UInt64 |
jolokia_metrics_broker_DelayedOperationPurgatory_Produce | Delayed operation purgatory for produce | UInt64 |
jolokia_metrics_broker_DelayedOperationPurgatory_Fetch | Delayed operation purgatory for fetch | UInt64 |
jolokia_metrics_broker_GlobalPartitionCount | Global partition count | UInt64 |
jolokia_metrics_broker_TotalTimeMs_Produce_50thPercentile | Total time for produce (50th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_Produce_Mean | Mean total time for produce | Float64 |
jolokia_metrics_broker_TotalTimeMs_Produce_75thPercentile | Total time for produce (75th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_Produce_Min | Minimum total time for produce | UInt64 |
jolokia_metrics_broker_TotalTimeMs_Produce_95thPercentile | Total time for produce (95th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_Produce_99thPercentile | Total time for produce (99th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_Produce_Max | Maximum total time for produce | UInt64 |
jolokia_metrics_broker_TotalTimeMs_Produce_Count | Count of total time for produce | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_Min | Minimum total time for fetch consumer | UInt64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_95thPercentile | Total time for fetch consumer (95th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_99thPercentile | Total time for fetch consumer (99th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_Max | Maximum total time for fetch consumer | UInt64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_Count | Count of total time for fetch consumer | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_50thPercentile | Total time for fetch consumer (50th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_Mean | Mean total time for fetch consumer | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchConsumer_75thPercentile | Total time for fetch consumer (75th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_99thPercentile | Total time for fetch follower (99th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_Max | Maximum total time for fetch follower | UInt64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_Count | Count of total time for fetch follower | UInt64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_50thPercentile | Total time for fetch follower (50th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_Mean | Mean total time for fetch follower | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_75thPercentile | Total time for fetch follower (75th percentile) | Float64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_Min | Minimum total time for fetch follower | UInt64 |
jolokia_metrics_broker_TotalTimeMs_FetchFollower_95thPercentile | Total time for fetch follower (95th percentile) | Float64 |
jolokia_metrics_broker_IsrExpands_FiveMinuteRate | ISR expands five minute rate | Float64 |
jolokia_metrics_broker_IsrExpands_OneMinuteRate | ISR expands one minute rate | Float64 |
jolokia_metrics_broker_OfflinePartitionsCount | Count of offline partitions | UInt64 |
jolokia_metrics_broker_RequestsPerSec_FetchConsumer_FifteenMinuteRate | Fetch consumer requests per second (15-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchConsumer_FiveMinuteRate | Fetch consumer requests per second (5-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchConsumer_MeanRate | Fetch consumer requests per second (mean rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchConsumer_OneMinuteRate | Fetch consumer requests per second (1-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchFollower_FifteenMinuteRate | Fetch follower requests per second (15-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchFollower_FiveMinuteRate | Fetch follower requests per second (5-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchFollower_MeanRate | Fetch follower requests per second (mean rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_FetchFollower_OneMinuteRate | Fetch follower requests per second (1-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_Produce_FifteenMinuteRate | Produce requests per second (15-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_Produce_FiveMinuteRate | Produce requests per second (5-minute rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_Produce_MeanRate | Produce requests per second (mean rate) | Float64 |
jolokia_metrics_broker_RequestsPerSec_Produce_OneMinuteRate | Produce requests per second (1-minute rate) | Float64 |
jolokia_metrics_broker_LeaderElection_FifteenMinuteRate | Leader election rate (15-minute rate) | Float64 |
jolokia_metrics_broker_LeaderElection_FiveMinuteRate | Leader election rate (5-minute rate) | Float64 |
jolokia_metrics_broker_LeaderElection_MeanRate | Leader election rate (mean rate) | Float64 |
jolokia_metrics_broker_LeaderElection_OneMinuteRate | Leader election rate (1-minute rate) | Float64 |
jolokia_metrics_broker_UncleanLeaderElection_MeanRate | Unclean leader election rate (mean rate) | Float64 |
jolokia_metrics_broker_UncleanLeaderElection_OneMinuteRate | Unclean leader election rate (1-minute rate) | Float64 |
jolokia_metrics_broker_UncleanLeaderElection_FifteenMinuteRate | Unclean leader election rate (15-minute rate) | Float64 |
jolokia_metrics_broker_UncleanLeaderElection_FiveMinuteRate | Unclean leader election rate (5-minute rate) | Float64 |
jolokia_metrics_broker_topic_net_failed_fetch_request_per_sec | Failed fetch requests per second for topic | Float64 |
jolokia_metrics_broker_topic_net_failed_produce_request_per_sec | Failed produce requests per second for topic | Float64 |
jolokia_metrics_broker_topic_net_produce_request_per_sec | Produce requests per second for topic | Float64 |
jolokia_metrics_broker_UnderReplicatedPartitions | Under-replicated partitions count | UInt64 |
jolokia_metrics_KafkaServer_BrokerState | Broker state | UInt64 |
jolokia_metrics_Memory_HeapMemoryUsage_max | Max heap memory usage | Float64 |
jolokia_metrics_Memory_HeapMemoryUsage_used | Used heap memory | Float64 |
jolokia_metrics_Memory_HeapMemoryUsage_init | Initial heap memory | Float64 |
jolokia_metrics_Memory_HeapMemoryUsage_committed | Committed heap memory | Float64 |
jolokia_metrics_Memory_NonHeapMemoryUsage_used | Used non-heap memory | Float64 |
jolokia_metrics_Memory_NonHeapMemoryUsage_init | Initial non-heap memory | Float64 |
jolokia_metrics_Memory_NonHeapMemoryUsage_committed | Committed non-heap memory | Float64 |
jolokia_metrics_Memory_NonHeapMemoryUsage_max | Max non-heap memory | Int64 |
timestamp | Detailed timestamp | DateTime64 |
tenant_id | Tenant ID | LowCardinality(String) |
bu_id | Business unit ID | LowCardinality(String) |
host | Hostname of the server | LowCardinality(String) |
target | Target server or service | LowCardinality(String) |
sub_type | Subtype of the data | LowCardinality(String) |
cluster_id | Cluster ID | LowCardinality(String) |
keeper_id | Keeper ID | LowCardinality(String) |
service_node_name | Name of the service node | LowCardinality(String) |
service_address | Service address | LowCardinality(String) |
zookeeper_mntr_server_state | Zookeeper server state | LowCardinality(String) |
zookeeper_mntr_num_alive_connections | Number of alive connections to Zookeeper | UInt64 |
zookeeper_mntr_znode_count | Count of znodes in Zookeeper | UInt64 |
zookeeper_mntr_approximate_data_size | Approximate data size in Zookeeper | UInt64 |
zookeeper_mntr_max_file_descriptor_count | Maximum number of file descriptors in Zookeeper | UInt64 |
zookeeper_mntr_packets_received | Packets received by Zookeeper | UInt64 |
zookeeper_mntr_packets_sent | Packets sent by Zookeeper | UInt64 |
zookeeper_mntr_watch_count | Watch count in Zookeeper | UInt64 |
zookeeper_mntr_outstanding_requests | Outstanding requests in Zookeeper | UInt64 |
zookeeper_mntr_open_file_descriptor_count | Open file descriptor count in Zookeeper | UInt64 |
zookeeper_mntr_ephemerals_count | Count of ephemeral nodes in Zookeeper | UInt64 |
zookeeper_mntr_latency_min | Minimum latency in Zookeeper | UInt64 |
zookeeper_mntr_latency_max | Maximum latency in Zookeeper | UInt64 |
zookeeper_mntr_latency_avg | Average latency in Zookeeper | Float64 |
zookeeper_server_mode | Mode of the Zookeeper server | LowCardinality(String) |
zookeeper_server_outstanding | Outstanding requests on the Zookeeper server | UInt64 |
zookeeper_server_count | Zookeeper server count | UInt64 |
zookeeper_server_epoch | Zookeeper server epoch | UInt64 |
zookeeper_server_received | Packets received by Zookeeper server | UInt64 |
zookeeper_server_zxid | Zookeeper transaction ID (ZXID) | String |
zookeeper_server_node_count | Node count on the Zookeeper server | UInt64 |
zookeeper_server_sent | Packets sent by Zookeeper server | UInt64 |
zookeeper_server_connections | Connections to Zookeeper server | UInt64 |
zookeeper_mntr_packets_received_diff | Difference in received packets by Zookeeper (previous to current) | Int64 |
zookeeper_mntr_packets_sent_diff | Difference in sent packets by Zookeeper (previous to current) | Int64 |
zookeeper_server_sent_diff | Difference in packets sent by Zookeeper server (previous to current) | Int64 |
zookeeper_server_received_diff | Difference in packets received by Zookeeper server (previous to current) | Int64 |
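Several columns in the table end in `_diff` and are described as the difference between the previous and the current value of a cumulative counter. The sketch below shows one way such a column could be derived from successive samples. It is an illustration under that reading of the descriptions, not the agent's actual logic, and the function name `counter_diffs` is hypothetical.

```python
from typing import List, Optional


def counter_diffs(samples: List[int]) -> List[Optional[int]]:
    """Difference between each sample and the one before it.

    The first sample has no predecessor, so its diff is None.
    """
    if not samples:
        return []
    diffs: List[Optional[int]] = [None]
    for previous, current in zip(samples, samples[1:]):
        diffs.append(current - previous)
    return diffs
```

With this approach, a counter that resets (for example after a service restart) produces a negative difference, so consumers of the data may want to treat negative values separately from normal growth.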