Zabbix Monitoring Agent Performance Comparison

This article compares the performance of two types of monitoring agents that connect to a database. In this test we use a PostgreSQL database and retrieve the transaction counts per database as an example.

Introduction

The out-of-the-box solution for database monitoring with Zabbix is to connect to the database for each and every key/value combination that Zabbix uses:
UserParameter=mysql.size[*],bash -c 'echo "select sum($(case "$3" in both|"") echo "data_length+index_length";; data|index) echo "$3_length";; free) echo "data_free";; esac)) from information_schema.tables$([[ "$1" = "all" || ! "$1" ]] || echo " where table_schema=\"$1\"")$([[ "$2" = "all" || ! "$2" ]] || echo "and table_name=\"$2\"");" | HOME=/var/lib/zabbix mysql -N'

This example is the default way for Zabbix to collect an item from the database (MySQL in this case) and return it to the server. We will call this way of collecting information single value collection.

Especially for per-database information, this means that you have to log in to MySQL for each column of each database in the instance. This results in many login/logout actions. Another way of collecting data is to collect all the values in one database query. We will call this multi value collection.
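For the PostgreSQL statistics used in the tests below, a comparable single value UserParameter might look like this (a sketch; the pgsql.tup key name and the -A -t flags are my own, not from a shipped template):

```shell
# Hypothetical Zabbix agent configuration: one new database connection per
# requested item, returning exactly one number ($1 = database name,
# $2 = column suffix, e.g. pgsql.tup[postgres,inserted]).
UserParameter=pgsql.tup[*],psql -U robert postgres -A -t -c "select tup_$2 from pg_stat_database where datname = '$1'"
```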

If we compare a single item retrieval to multiple values, we get the following result:

Single value collection
real    0m0.263s
user    0m0.061s
sys     0m0.077s

Multi value collection
real    0m0.019s
user    0m0.001s
sys     0m0.004s

The problem is: we need some “caching” on the agent to hold all the collected values until Zabbix requests them. This, however, is not a database related issue. If CPU/IO really is an issue, the monitoring agent can also be run from a separate machine, which would alleviate the load on the database server.
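Such an agent-side cache could be sketched as a small shell helper; the cache path, refresh mechanism (e.g. cron), and function names below are my own assumptions, not part of Zabbix:

```shell
#!/bin/sh
# Sketch of agent-side caching: refresh the cache with ONE multi value
# query (e.g. from cron), then answer each item request from the file
# instead of opening a new database connection per key.

CACHE=/tmp/pg_stat_cache.txt   # hypothetical cache location

refresh_cache() {
    # -A -t -F ' ': unaligned, tuples-only, space-separated output
    psql -U robert postgres -A -t -F ' ' -c \
        "select datname, tup_inserted, tup_deleted, tup_updated, tup_fetched, tup_returned from pg_stat_database" \
        > "$CACHE"
}

# lookup <datname> <field>: field 2 = tup_inserted ... field 6 = tup_returned
lookup() {
    awk -v db="$1" -v col="$2" '$1 == db { print $col }' "$CACHE"
}
```

A UserParameter would then call lookup instead of psql, so only the periodic refresh touches the database.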

Test

We wrote a test that loops over either of the two monitoring options:
while true
do
    "$1"
    sleep 10
done

The parameter is either the single item pickup or the multi value query. When we run this test multiple times, we can compare the behaviour of the two processes.

The sleep simulates the interval at which each check is executed. Usually an interval of 30 seconds is the smallest used in practice, but for testing purposes we used 10 second intervals.

The actual tests we are running:

single_item_test.sh
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'postgres'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'postgres'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'postgres'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'postgres'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'postgres'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'template0'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'template0'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'template0'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'template0'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'template0'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'template1'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'template1'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'template1'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'template1'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'template1'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'gotest'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'gotest'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'gotest'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'gotest'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'gotest'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'powa'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'powa'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'powa'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'powa'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'powa'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'pgbench'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'pgbench'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'pgbench'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'pgbench'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'pgbench'"
psql -U robert postgres -c "select tup_inserted from pg_stat_database where datname = 'replication'"
psql -U robert postgres -c "select tup_deleted from pg_stat_database where datname = 'replication'"
psql -U robert postgres -c "select tup_updated from pg_stat_database where datname = 'replication'"
psql -U robert postgres -c "select tup_fetched from pg_stat_database where datname = 'replication'"
psql -U robert postgres -c "select tup_returned from pg_stat_database where datname = 'replication'"
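As an aside, the repetitive script above does not have to be written out by hand; it can be generated with two loops. gen_queries is a hypothetical helper that prints the same 35 psql invocations:

```shell
#!/bin/sh
# The 35 queries of single_item_test.sh, generated instead of hand-written.
# gen_queries only PRINTS the psql command lines; pipe them to sh to run.
gen_queries() {
    for db in postgres template0 template1 gotest powa pgbench replication; do
        for col in tup_inserted tup_deleted tup_updated tup_fetched tup_returned; do
            printf "psql -U robert postgres -c \"select %s from pg_stat_database where datname = '%s'\"\n" "$col" "$db"
        done
    done
}

gen_queries          # inspect the generated commands
# gen_queries | sh   # execute: one login/logout per value, 35 in total
```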

multi_item_test.sh
psql -U robert postgres -c "select datname, tup_inserted, tup_deleted, tup_updated, tup_fetched, tup_returned from pg_stat_database"
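To actually feed such a multi row result to Zabbix without per-item logins, one option would be to push all values in a batch with zabbix_sender; the host name, key names, and helper function below are assumptions for illustration:

```shell
#!/bin/sh
# Turn the multi value query result (one row per database, space-separated)
# into zabbix_sender input lines of the form "<host> <key> <value>".
to_sender_lines() {
    host=$1
    awk -v h="$host" '{
        printf "%s pgsql.tup_inserted[%s] %s\n", h, $1, $2
        printf "%s pgsql.tup_deleted[%s] %s\n",  h, $1, $3
        printf "%s pgsql.tup_updated[%s] %s\n",  h, $1, $4
        printf "%s pgsql.tup_fetched[%s] %s\n",  h, $1, $5
        printf "%s pgsql.tup_returned[%s] %s\n", h, $1, $6
    }'
}

# Usage sketch (one psql login delivers every item):
#   psql -U robert postgres -A -t -F ' ' -c \
#     "select datname, tup_inserted, tup_deleted, tup_updated, tup_fetched, tup_returned from pg_stat_database" \
#   | to_sender_lines dbhost | zabbix_sender -z zabbix.example.com -i -
```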

Comparison

If we trace the calls, we immediately see that we are basically not measuring the psql processes themselves, but rather the shell waiting for each psql process to finish. Diving into the two runs gives the following result (abbreviated for readability):

$ strace -c -fF ./single_item_test.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
75.82 0.128002 1829 70 35 wait4
5.29 0.008922 2 3619 mmap
4.77 0.008049 3 2388 mprotect
3.89 0.006566 3 2074 665 open
1.56 0.002638 2 1406 read
1.22 0.002055 20 105 socket


$ strace -c -fF ./multi_item_test.sh > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
70.07 0.001957 979 2 1 wait4
3.29 0.000092 1 117 mmap
2.79 0.000078 4 20 8 stat
2.65 0.000074 1 68 19 open
2.61 0.000073 1 76 mprotect
2.26 0.000063 4 18 9 access

We can see that the single item test made far more calls waiting for “something” than the multi item test, and apart from that, each wait took longer per call. This leads to the difference in total test duration: almost all of the time (75% and 70% respectively) is spent here.

So we need to dive into the wait4 calls:

$ strace -e wait4 ./multi_item_test.sh > /dev/null
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10248
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10248, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffc3f7ebc50, WNOHANG, NULL) = -1 ECHILD (No child processes)
+++ exited with 0 +++


$ strace -e wait4 ./single_item_test.sh > /dev/null
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10274
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10274, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffcc3089c50, WNOHANG, NULL) = -1 ECHILD (No child processes)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10276
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10276, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffcc3089c50, WNOHANG, NULL) = -1 ECHILD (No child processes)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10278
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10278, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffcc3089c50, WNOHANG, NULL) = -1 ECHILD (No child processes)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10280
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10280, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffcc3089c50, WNOHANG, NULL) = -1 ECHILD (No child processes)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10282
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=10282, si_uid=499, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffcc3089c50, WNOHANG, NULL) = -1 ECHILD (No child processes)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10284
...

The actual definition of the wait4 call is:

pid_t wait4(pid_t pid, int *stat_loc, int options, struct rusage *rusage);

Where pid = -1 means that wait4() returns status for any child process, rather than for one specific process ID. The exit status can then be evaluated with the macros seen in the trace above (WIFEXITED, WEXITSTATUS).

So the shell is actually waiting for each psql child process to finish executing its query.

The -fF options make strace follow all child processes as well. These are the “client side” processes of the PostgreSQL database call. The “server side” PostgreSQL processes are much harder to spot, since they are so short-lived. The only thing we can watch is the “client side” process waiting for the “server side” process to deliver the data, which shows up in the client's socket waits:

Single Item Test
1.21 0.001889 18 105 socket

Multi Item Test
0.72 0.000035 12 3 socket

The difference is 0.001854 seconds. In other words: both collections spend only around 1–2% of their time waiting on the server side process, and the single value collection waits just 0.0018 seconds longer in our test case.

Conclusion

When we first started these tests I was under the impression that doing many small queries against the database would cause significant overhead on the database compared to the multi value lookup. There is indeed a visible difference between the two approaches to collecting statistics from a PostgreSQL database: the single item collection takes about 0.25 seconds longer to run than the multi value collection, which takes only 0.019 seconds.

However, I think these test results show that a significant amount of that time is spent by the client waiting for the query results, and that only a marginal 0.0018 seconds extra is spent on server side processing. I don't think this should be a reason to invest in building a complex caching mechanism for the monitoring agent.
