System impact of monitoring

Statistics collection impact

If you look through the chapter of the Tuning Guide that describes the individual statistic reports, you will see that some of them contain sentences like:

The collection of mutex locking statistics imposes a slight performance penalty and is not enabled by default.

Some statistics collection turns on extra code or storage paths in the runtime. For each of these, an effort has been made in the runtime to minimize the costs, but they are clearly nonzero. Measuring application performance is the best way to characterize these effects.

For a few of the statistics (e.g. eventbus detailed=true), memory consumption is also unbounded. In the documentation you will see a statement like:

The collection of these statistics imposes a slight performance penalty and consumes shared memory for each method invocation, and is not enabled by default.

Leaving such a statistic enabled risks running the node out of memory.
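
The difference between bounded and unbounded collection can be illustrated with a toy model. This is only a sketch: the class names `AggregateStats` and `DetailedStats` and the `simulate` helper are hypothetical, it uses ordinary process memory rather than shared memory, and it does not represent how the runtime actually stores its statistics.

```python
class AggregateStats:
    """Bounded collection: a fixed number of integers, however many calls occur."""

    def __init__(self):
        self.calls = 0
        self.total_ns = 0

    def record(self, method, elapsed_ns):
        # Updating two integers; storage does not grow.
        self.calls += 1
        self.total_ns += elapsed_ns

    def entries(self):
        return 2


class DetailedStats:
    """Unbounded collection: one stored record per method invocation."""

    def __init__(self):
        self.records = []

    def record(self, method, elapsed_ns):
        # Every invocation adds a record that stays until it is cleared.
        self.records.append((method, elapsed_ns))

    def entries(self):
        return len(self.records)


def simulate(stats, invocations):
    """Feed a number of hypothetical method invocations into a collector."""
    for i in range(invocations):
        stats.record("method_%d" % (i % 3), 100 + i)
    return stats.entries()
```

In this model the aggregate collector holds the same two values after any number of invocations, while the detailed collector holds one record per invocation and keeps growing for as long as it stays enabled.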

Statistics reporting impact

Generally, statistics reporting can be more expensive (both computationally and in terms of contention) than statistics collection. The reasons for this are:

  • Most statistics are stored as a set of integer values, which are relatively inexpensive to update. But nearly all reporting involves row-by-row string formatting of the statistic data, often including type look-ups.

  • The synchronization for the collection of some statistics (particularly those in performance-sensitive paths) is parallelized to allow concurrent writers. Where possible, this is done using atomic memory counter primitives, otherwise using pools of mutexes, or in some cases a single mutex. For the runtime execution path, the locking is minimized. But for the statistics clear path, where the statistic is protected by mutexes, one or all of the mutexes get locked. The report with the worst impact on the system would be the allocator report with the detailed=true option. This data reporting can cause expensive contention in the shared memory allocator.

  • The returning of report data through the administration plugin service uses the same runtime support that an application does: creating objects, consuming shared memory, and so on. A large report (generally one that uses the detailed=true option) can temporarily consume a great deal of shared memory.
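
The first two points can be sketched with a counter protected by a pool of mutexes. This is a minimal illustration, not the runtime's implementation: the `StripedCounter` class, its stripe count, and the `run_writers` helper are all hypothetical, and locking every stripe during `report` is an assumption made here to obtain a consistent snapshot.

```python
import threading


class StripedCounter:
    """A counter split into stripes, each protected by its own mutex."""

    def __init__(self, stripes=8):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._counts = [0] * stripes

    def increment(self):
        # Collection path: lock only one stripe, chosen by thread identity,
        # so writers on different stripes do not block each other.
        i = threading.get_ident() % len(self._locks)
        with self._locks[i]:
            self._counts[i] += 1

    def _lock_all(self):
        # Always acquire in the same order to avoid deadlock.
        for lock in self._locks:
            lock.acquire()

    def _unlock_all(self):
        for lock in reversed(self._locks):
            lock.release()

    def report(self):
        # Reporting path: every stripe is locked (an assumption of this
        # sketch), then each row is formatted as a string.
        self._lock_all()
        try:
            total = sum(self._counts)
            rows = ["stripe %d: %d" % (i, c) for i, c in enumerate(self._counts)]
        finally:
            self._unlock_all()
        return total, rows

    def clear(self):
        # Clear path: all mutexes are locked, stalling every writer.
        self._lock_all()
        try:
            for i in range(len(self._counts)):
                self._counts[i] = 0
        finally:
            self._unlock_all()


def run_writers(counter, threads=4, per_thread=1000):
    """Drive the counter from several concurrent writer threads."""
    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

An increment touches one lock and one integer. A report or a clear takes every lock in the pool, so each concurrent writer waits until it finishes, and the report additionally pays for formatting every row.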

Recommendations

Run statistics reporting judiciously in a production system. Do not design an operator console where a room full of operators all continuously issue statistics commands.

Unless there is a good reason, avoid detailed=true reporting.

Measure. Using your existing application performance testing, measure the impact of collecting the desired statistics. Also measure the impact of reporting the desired statistics.
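
A measurement of this kind can be as simple as timing the same workload with and without collection enabled. The sketch below is only a stand-in for real application performance testing: `workload`, `best_time`, and `collection_overhead` are hypothetical names, and the dictionary update is a placeholder for whatever statistic is being evaluated.

```python
import time


def workload(iterations, stats=None):
    """Stand-in for application work; optionally updates a statistic."""
    total = 0
    for i in range(iterations):
        total += i * i
        if stats is not None:
            # Placeholder for the collection cost being measured.
            stats["calls"] = stats.get("calls", 0) + 1
    return total


def best_time(fn, repeats=5):
    """Return the fastest of several runs to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def collection_overhead(iterations=100_000):
    """Compare the workload with and without statistics collection."""
    baseline = best_time(lambda: workload(iterations))
    with_stats = best_time(lambda: workload(iterations, stats={}))
    # Relative overhead: 0.10 means collection made the run 10% slower.
    return baseline, with_stats, (with_stats - baseline) / baseline
```

The same comparison applies to reporting: run the workload once undisturbed and once while a report is issued repeatedly from another thread or process, then compare the timings.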