Metrics

Architecture

Real-time metrics are captured during execution of all services in the ModelOps environment. Two types of metrics are captured:

  • technical - metrics on resource consumption and response times. These metrics are used to determine if a deployed environment has the resources required to achieve its service level agreements (SLAs) defined by the business.
  • model quality - measurements of how well the model(s) in a scoring pipeline are performing. These metrics are used to improve the effectiveness of a deployed scoring flow and its associated models.

Metrics are reported by the ModelOps components as they execute. They are collected in a Metrics Store, provided by Prometheus, which is installed in the ModelOps cloud infrastructure.

LiveView monitors the Metrics Store in real time and aggregates the raw metric values to support rich visualization of a subset of the metrics in the ModelOps UI.

[Figure: Architecture]

LiveView

LiveView provides continuous query access to metric data captured by the Metrics Store to support real-time visualization directly in the ModelOps UI.

The metrics loaded from the Metrics Store into LiveView are controlled by a white-list file containing regular expressions, one per line. Every metric whose name matches a white-list regular expression is loaded. The white-list file is shipped in the LiveView application archive.
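The white-list behavior described above can be sketched as follows. This is only an illustration: the example patterns are hypothetical (the shipped white-list contents are not reproduced here), and the sketch assumes a pattern must match the whole metric name, which the text does not specify.

```python
import re

def load_whitelist(text):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]

def is_whitelisted(name, patterns):
    """Return True if the whole metric name matches any pattern (assumed semantics)."""
    return any(p.fullmatch(name) for p in patterns)

# Hypothetical white-list contents, for illustration only.
whitelist_text = """
modelops_model_quality_.*
builtin_cpu_.*_utilization_percentage
"""

patterns = load_whitelist(whitelist_text)
names = [
    "modelops_model_quality_regression_r_squared",
    "builtin_cpu_idle_utilization_percentage",
    "builtin_node_shared_memory_kilobytes",
]
# Only names matching at least one pattern would be loaded into LiveView.
loaded = [n for n in names if is_whitelisted(n, patterns)]
```

With these hypothetical patterns, the shared-memory metric is not loaded because no line of the white-list matches it.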

The metrics data is stored in these LiveView tables:

  • Metrics - stores metric values
  • MetricsMetadata - stores metric meta-data

The Metrics table has these fields:

Field | Type | Description
EventTime | timestamp | Timestamp of the value
Name | string | Metric name (see Data Model)
Label | string | Metric label: a comma-separated list of <name> = <value> pairs that identifies a specific metric value (see Data Model)
Value | double | Metric value
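As a rough illustration of the Label field, the sketch below splits a label string into name/value pairs. The function name `parse_label` and the sample label are hypothetical, and the sketch assumes that label values contain no commas or equals signs; the exact escaping rules are not given in this section.

```python
def parse_label(label):
    """Split a '<name> = <value>, ...' label string into a dict."""
    pairs = {}
    for item in label.split(","):
        if not item.strip():
            continue  # tolerate empty segments such as a trailing comma
        name, _, value = item.partition("=")
        # Strip surrounding whitespace and optional double quotes.
        pairs[name.strip()] = value.strip().strip('"')
    return pairs

# Hypothetical label string for one row of the Metrics table.
labels = parse_label("container = credit-risk, namespace = modelops, pod = flow-0")
```

Parsing the label once per row makes it straightforward to filter or group metric values by an individual label.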

The MetricsMetadata table has these fields:

Field | Type | Description
Name | string | Metric name
Type | string | Metric type, one of counter, gauge, histogram, or summary (see Metric Types)
Description | string | Metric description
Units | string | Metric unit (see Base Units)

A size limit for the Metrics table is automatically maintained using these LiveView alerts:

  • Time window metrics trimming - removes all metrics older than a configurable time (defaults to 5 minutes).
  • Memory limit metrics trimming - a fail-safe that ensures the Metrics table does not exceed a configurable maximum size, even with time window trimming (defaults to 50 megabytes).
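The two trimming rules can be pictured with a small in-memory buffer. This is not how LiveView alerts are implemented; the class `MetricsBuffer`, the per-row size estimate, and the reading of 50 megabytes as 50 * 1024 * 1024 bytes are all assumptions made for the sketch, which also assumes rows arrive in timestamp order.

```python
from collections import deque

class MetricsBuffer:
    """Toy stand-in for the Metrics table with the two trimming rules."""

    def __init__(self, max_age_seconds=300.0, max_bytes=50 * 1024 * 1024):
        self.max_age_seconds = max_age_seconds  # 5 minute default from the text
        self.max_bytes = max_bytes              # 50 MB default (binary MB assumed)
        self.rows = deque()                     # oldest row on the left
        self.total_bytes = 0

    @staticmethod
    def _row_size(name, label):
        # Rough estimate: both strings plus 8 bytes each for timestamp and value.
        return len(name) + len(label) + 16

    def _drop_oldest(self):
        row = self.rows.popleft()
        self.total_bytes -= row[4]

    def add(self, event_time, name, label, value):
        size = self._row_size(name, label)
        self.rows.append((event_time, name, label, value, size))
        self.total_bytes += size
        self.trim(now=event_time)

    def trim(self, now):
        # Rule 1: time window trimming removes rows older than the window.
        cutoff = now - self.max_age_seconds
        while self.rows and self.rows[0][0] < cutoff:
            self._drop_oldest()
        # Rule 2: memory limit fail-safe removes the oldest rows until under the cap.
        while self.rows and self.total_bytes > self.max_bytes:
            self._drop_oldest()
```

The memory rule only takes effect when the time window alone leaves too much data, which matches its description as a fail-safe.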

There is no size limit enforced on the MetricsMetadata table.

Technical

The technical metrics captured by the components in the ModelOps environment are summarized below. These technical metrics are used to support elastic scaling of ModelOps components as required using the standard Kubernetes Horizontal Pod Autoscaler (HPA).

General

Every running container has these metrics captured:

Scoring Flows

Containers running scoring flows capture these additional metrics:

Metric Name | Metric Type | Description
builtin_cpu_idle_utilization_percentage | HISTOGRAM | Percent idle CPU utilization for machine hosting node
builtin_cpu_system_utilization_percentage | HISTOGRAM | Percent system CPU utilization for machine hosting node
builtin_cpu_user_utilization_percentage | HISTOGRAM | Percent user CPU utilization for machine hosting node
builtin_engine_<engine-name>_heap_memory_utilization_bytes | HISTOGRAM | Heap memory used (bytes) for engine <engine-name>
builtin_engine_<engine-name>_heap_memory_utilization_percentage | HISTOGRAM | Percent heap memory used for engine <engine-name>
builtin_engine_<engine-name>_queue_<queue-name>_depth_second | METER | Queue <queue-name> depth per second for engine <engine-name>
builtin_engine_<engine-name>_tuples_rate | METER | Scoring pipeline request rate for engine <engine-name>
builtin_node_shared_memory_kilobytes | HISTOGRAM | Shared memory used (kilobytes) for node
builtin_node_shared_memory_percentage | HISTOGRAM | Percent shared memory used for node
builtin_node_transactions_deadlocks_rate | METER | Transaction deadlock rate for node
builtin_node_transactions_latency_average_microseconds | METER | Average transaction latency (microseconds) for node
builtin_node_transactions_total_rate | METER | Transaction rate for node

Model Quality

Scoring flows may optionally publish calculated metrics to support monitoring of model quality. The metrics are calculated by comparing observed (expected) values with predicted (calculated) values, defined as follows:

  • Observed Values - previously recorded desired values. Observed values are contained in the input request, along with the data to score.
  • Predicted Values - values produced by scoring given input data with a model. Predicted values are contained in the response from a scoring server after scoring.

Calculated metrics are available in real time via the Metrics Store, and also in the result data stored in a data sink.

[Figure: Model Quality Metrics]

The diagram above shows both the observed and input values received from a data source, which are then processed in a scoring flow. The Score processing step in the scoring flow adds the predicted values and model identifier to the request data after scoring, and passes the request data on to the Compute Metrics processing step. The Compute Metrics processing step uses the observed and predicted values to calculate model quality. The calculated metrics are added to the result data and published to the Metrics Store in the Publish Metrics processing step. Finally, the result data is sent to the data sink, where all values are stored to facilitate post-processing of the data.

The supported calculated metrics are summarized in the tables below for different model types.

Classification Models

Supported metrics for classification models.

Metric Name | Metric Type | Description
modelops_model_quality_classification_misclassification_rate | GAUGE | Misclassification rate. The proportion of misclassified instances in the dataset scored by the classification model.
modelops_model_quality_classification_chi_square | GAUGE | Chi square. A measure of the difference between the observed and predicted frequencies of the outcomes of a set of input variables.
modelops_model_quality_classification_g_square | GAUGE | G square. A likelihood-ratio (maximum likelihood) statistical significance test, increasingly used in situations where chi-squared tests were previously recommended. The G-test of goodness-of-fit is also known as the likelihood ratio test, the log-likelihood ratio test, or the G² test, and is preferred when the sample size is large.
modelops_model_quality_classification_f1_score | GAUGE | F1 score. The harmonic mean of precision and recall, so both contribute equally to the score. The F1 score reaches its best value at 1 and its worst value at 0.
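The classification metrics in the table above can be sketched with their textbook definitions. This is a minimal illustration, not the product's implementation: the function names are hypothetical, the F1 score is shown for the binary case only (how multi-class results are averaged is not stated here), and treating the predicted class counts as the "expected" frequencies in the chi-square and G-square formulas is an assumption.

```python
import math

def misclassification_rate(observed, predicted):
    """Fraction of rows where the predicted class differs from the observed class."""
    wrong = sum(1 for o, p in zip(observed, predicted) if o != p)
    return wrong / len(observed)

def f1_score(observed, predicted, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == positive and p == positive)
    fp = sum(1 for o, p in pairs if o != positive and p == positive)
    fn = sum(1 for o, p in pairs if o == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def chi_square(observed_counts, expected_counts):
    """Pearson chi-square: sum of (O - E)^2 / E; every E must be positive."""
    return sum((observed_counts.get(k, 0) - e) ** 2 / e
               for k, e in expected_counts.items())

def g_square(observed_counts, expected_counts):
    """G statistic: 2 * sum of O * ln(O / E); classes with O = 0 contribute nothing."""
    return 2 * sum(o * math.log(o / expected_counts[k])
                   for k, o in observed_counts.items() if o > 0)

# Illustrative data only.
observed = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 1, 1]
rate = misclassification_rate(observed, predicted)  # 2 of 6 rows differ
f1 = f1_score(observed, predicted)
```

In this sample, precision and recall are both 3/4, so the F1 score is also 0.75.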

Regression Models

Supported metrics for regression models.

Metric Name | Metric Type | Description
modelops_model_quality_regression_mean_error | GAUGE | Mean error. The mean of the prediction errors for the dataset scored by the regression model. A prediction error is the difference between the actual or true value, available in the input row, and the value predicted by the regression model: e(i) = y_o(i) - y_p(i), where e(i) is the prediction error for the ith row, y_p(i) is the predicted value for the ith row, y_o(i) is the observed value in the ith row, and i is the row index (1, 2, 3, …).
modelops_model_quality_regression_mean_absolute_error | GAUGE | Mean absolute error. The mean of the absolute prediction errors for the dataset scored by the regression model. An absolute error is the absolute difference between the actual or true value, available in the input row, and the value predicted by the regression model: e(i) = abs(y_o(i) - y_p(i)), where e(i) is the absolute prediction error for the ith row, y_p(i) is the predicted value for the ith row, y_o(i) is the observed value in the ith row, and i is the row index (1, 2, 3, …).
modelops_model_quality_regression_mean_squared_error | GAUGE | Mean squared error. The mean of the squared prediction errors for the dataset scored by the regression model. A squared prediction error is the square of the difference between the actual or true value, available in the input row, and the value predicted by the regression model: e(i) = (y_o(i) - y_p(i))^2, where e(i) is the squared prediction error for the ith row, y_p(i) is the predicted value for the ith row, y_o(i) is the observed value in the ith row, and i is the row index (1, 2, 3, …).
modelops_model_quality_regression_root_mean_squared_error | GAUGE | Root mean squared error. Square root of the mean squared error, also known as the standard error value for the y estimate.
modelops_model_quality_regression_r_squared | GAUGE | R squared. The R² (coefficient of determination) regression score, which compares the residual sum of squares (SSE) with the total sum of squares (TSS): R² = 1 - SSE/TSS. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse. A constant model that always predicts the mean value of y, disregarding the input features, gets an R² score of 0.0.
modelops_model_quality_regression_sum_of_squared_errors | GAUGE | Sum of squared errors. The sum of squares of the residual errors for the dataset scored by the regression model. A residual error is the difference between the actual or true value, available in the input row, and the value predicted by the regression model: e(i) = y_o(i) - y_p(i), where e(i) is the residual error for the ith row, y_o(i) is the observed value in the ith row, y_p(i) is the predicted value for the ith row, and i is the row index (1, 2, 3, …).
modelops_model_quality_regression_total_sum_of_squares | GAUGE | Total sum of squares. The sum of squared differences between the actual or true values and their overall mean.
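The regression formulas in the table above fit together as shown in this sketch. The function name `regression_metrics` and the dictionary keys are hypothetical, chosen to mirror the metric name suffixes; the arithmetic follows the definitions in the table, with e(i) = y_o(i) - y_p(i).

```python
import math

def regression_metrics(observed, predicted):
    """Compute the regression quality metrics from paired observed/predicted values."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]  # e(i) = y_o(i) - y_p(i)
    mean_observed = sum(observed) / n
    sse = sum(e * e for e in errors)                        # sum of squared errors
    tss = sum((o - mean_observed) ** 2 for o in observed)   # total sum of squares
    mse = sse / n
    return {
        "mean_error": sum(errors) / n,
        "mean_absolute_error": sum(abs(e) for e in errors) / n,
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "sum_of_squared_errors": sse,
        "total_sum_of_squares": tss,
        # R squared is undefined when all observed values are identical (TSS = 0).
        "r_squared": 1 - sse / tss if tss > 0 else float("nan"),
    }

# Illustrative data only.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```

For this sample the errors are 1, 0, -1, and 0, so the mean error is 0 while the mean absolute error is 0.5, which shows why the two metrics are reported separately.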

Clustering Models

Supported metrics for clustering models.

Metric Name | Metric Type | Description
modelops_model_quality_clustering_sum_of_squared_errors | GAUGE | Sum of squared errors for clustering. A prototype-based cohesion measure where the squared Euclidean distance is used. For each point, the error is the distance to the nearest cluster.
modelops_model_quality_clustering_silhouette_score | GAUGE | Silhouette score. A measure of how well samples are clustered with samples that are similar to themselves. Clustering models with a high silhouette coefficient are said to be dense, where samples in the same cluster are similar to each other, and well separated, where samples in different clusters are not very similar to each other. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
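A small sketch of both clustering metrics, using their standard definitions. The function names are hypothetical. The sketch assumes that "nearest cluster" means the nearest cluster centroid, that there are at least two clusters, and that a point alone in its cluster scores 0, a common convention that this section does not confirm.

```python
import math

def clustering_sse(points, centroids):
    """Sum over points of the squared Euclidean distance to the nearest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points (needs at least two clusters)."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        same = []      # distances to other points in the same cluster
        others = {}    # distances to points in each other cluster
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if j == i:
                continue
            if other_label == label:
                same.append(math.dist(p, q))
            else:
                others.setdefault(other_label, []).append(math.dist(p, q))
        if not same:
            scores.append(0.0)  # singleton cluster convention (assumed)
            continue
        a = sum(same) / len(same)                              # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in others.values())      # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Illustrative data: two tight, well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sse = clustering_sse(points, [(0.0, 0.5), (10.0, 0.5)])
score = silhouette_score(points, labels)
```

Because the two sample clusters are far apart relative to their spread, the silhouette score comes out close to 1.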

Visualization

Visual model monitoring can be done directly against the Metrics Store using any tool that supports Prometheus, for example Grafana. This allows support for a broad range of model monitoring tools and rich customizations.
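As one illustration of reading the Metrics Store directly, the sketch below builds an instant query URL for the Prometheus HTTP API (`/api/v1/query`) and parses the documented response shape. The base URL `http://prometheus.example:9090` is a placeholder, the helper names are hypothetical, and the sample response body is made up for the example; no request is actually sent.

```python
import json
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build a Prometheus instant query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def parse_vector(body):
    """Extract (labels, value) pairs from an instant query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    # Each sample carries its labels in "metric" and [timestamp, "value"] in "value".
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]

# Placeholder endpoint; substitute the address of the deployed Metrics Store.
url = instant_query_url(
    "http://prometheus.example:9090",
    "avg by (pod) (modelops_model_quality_regression_r_squared)",
)

# Made-up response body in the shape returned by the Prometheus HTTP API.
sample_body = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"pod": "flow-0"}, "value": [1700000000, "0.9"]}]},
})
samples = parse_vector(sample_body)
```

Tools such as Grafana issue equivalent queries on the user's behalf, so this is mainly useful for scripted checks.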

In addition, there is built-in support for visual model monitoring using a subset of the captured metrics described above. This provides an out-of-the-box, high-level overview of model quality and a rough indication of resource utilization.

Metrics

All of the model quality metrics are available. In addition, these technical metrics are available:

  • builtin_engine_<engine-name>_tuples_rate
  • builtin_node_transactions_latency_average_microseconds
  • builtin_engine_<engine-name>_heap_memory_utilization_percentage
  • builtin_cpu_idle_utilization_percentage
  • builtin_cpu_system_utilization_percentage
  • builtin_cpu_user_utilization_percentage

Labels

All of the visualized metrics support these labels:

  • container - scoring pipeline name
  • instance - scoring pipeline instance
  • namespace - scoring pipeline namespace
  • pod - scoring flow pod name

Labels define separate indexes on the metrics to support aggregation of the values into an easy-to-understand visualization.
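To make the role of labels concrete, the sketch below averages metric values grouped by one label. The function name `aggregate_by_label` and the sample values are hypothetical, and averaging is only one possible aggregation; the aggregation actually used by the visualization is not described here.

```python
from collections import defaultdict

def aggregate_by_label(samples, label_name):
    """Average metric values per distinct value of one label."""
    groups = defaultdict(list)
    for labels, value in samples:
        # Samples missing the label are grouped under an empty string.
        groups[labels.get(label_name, "")].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}

# Illustrative samples: (labels, value) pairs for a single metric.
samples = [
    ({"namespace": "modelops", "pod": "flow-0"}, 0.25),
    ({"namespace": "modelops", "pod": "flow-0"}, 0.75),
    ({"namespace": "modelops", "pod": "flow-1"}, 1.0),
]
per_pod = aggregate_by_label(samples, "pod")
per_namespace = aggregate_by_label(samples, "namespace")
```

Grouping by `pod` keeps the two pods separate, while grouping by `namespace` rolls all three samples into a single value.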