Telemetry to solve dynamic analysis of a distributed system

In modern software development, distributed solutions have become common because of the flexibility they bring to large companies. The downside is that when such systems are developed, especially by many teams, global design problems may not be obvious; they can slow down development, make errors hard to locate, and degrade overall system performance. In addition, timely reaction to system degradation is complicated by the distributed nature of the architecture: manually configuring rules for reporting problematic situations is time-consuming and still incomplete, whereas automatic detection of possible system anomalies lets engineers (especially Software Reliability Engineers) focus on the actual problems. For this reason, applications that can dynamically analyse the system for problems have great potential. The use of telemetry for system analysis is actively studied and gaining traction, so further research is valuable. This work aims to substantiate, theoretically and practically, the possibility of using telemetry to analyse a distributed information system and to detect harmful architectural practices and anomalous events. To do this, a detailed overview of the problems related to the topic and of the feasibility of using telemetry is provided first; the next section briefly describes the history of monitoring systems and the key points of the latest OpenTelemetry standard, reviews popular application performance monitoring systems, and defines innovative features to be researched further.
The main part explains the approach used to collect and process telemetry, the reasoning behind the use of Neo4j as a data storage solution, a practical overview of graph theory algorithms that help analyse the collected data, and the way the PCA algorithm is employed to detect unusual situations in the whole system rather than in individual metrics. The results provide an example of using the presented software with Neo4j Bloom to visualise and analyse data collected over several hours from the OpenTelemetry Demo test system. The last section contains additional remarks on the results of the study.


Introduction
In recent years, distributed architectures such as microservices have received much attention and popularity due to the opportunities that the architectural pattern opens up in terms of optimization, technology stack diversification, and more [5]. When built correctly, distributed systems simplify the development process when many teams are involved, reduce the complexity of changes and the dependence of teams on each other, and speed up development. Other commonly mentioned advantages, such as the reliability of the whole system when separate components go down and ease of understanding (dubious given the scattered nature of the use case logic), are secondary, because the additional complexity that microservices bring [27] slows down the development of such a system. Moreover, the commonly cited advantage of technological heterogeneity only increases the effort required to maintain the codebase [3,28,32]. On the other hand, while prone to losing the overall application structure and distinct module separation, a monolithic architecture is preferable at the start of a new project, when much uncertainty is involved and frequent global changes are required. Physical boundaries between services complicate refactoring because the application state and dependencies are distributed. Therefore, more tools are needed to analyze the system and respond to problems.
The development of distributed information systems requires more effort, especially when it comes to monitoring the entire system and finding problematic areas [27], because, unlike a monolith, such a system has many components developed in parallel, which may have structural flaws [13,25], also referred to as architectural smells: design decisions that hinder maintainability and extensibility. To locate global problems, an information system analysis is often conducted to find and quickly address such shortcomings [21]. Several approaches exist, such as static analysis of the codebase of each system component or analysis of the system logs. Both options are complex because they require adjustment for each system, technology, and programming language. The static approach, unlike the dynamic one, analyzes the system without running it; this makes it possible to correct some local code smells, but not the problems of the system as a whole, because of its low accuracy and insufficient information about runtime behaviour [4]. Dynamic analysis, in contrast, is based on information gathered at runtime and therefore represents system utilization more accurately. The most prominent dynamic analysis approach is telemetry, which combines the three pillars of system observability: logs, metrics, and traces.
Therefore, the purpose of this work is the theoretical and practical substantiation of the possibility of using the OpenTelemetry standard to analyze a distributed information system: detecting and quickly responding to harmful architectural practices and anomalous events. The tasks are the following:
• research of state-of-the-art approaches to telemetry analysis;
• modeling of the extract, transform and load (ETL) process and of the subsequent telemetry analysis;
• analysis of the received data to identify harmful practices and anomalies.

Theoretical background
The topic of system observability is far from new. In the world of distributed systems, Google is considered a pioneer in the study of observability. In 2010, Google engineers published a paper called "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" [24], which prompted the emergence of the first systems for request trace visualization: Jaeger and Zipkin. However, these applications solved the same problem while being incompatible, causing vendor lock-in, so over time the development of the OpenTracing [30] standard began. This standard provides a layer between the application and monitoring systems to track and collect requests. It did not solve the whole problem, so the OpenCensus standard was later developed to focus on the collection of system metrics and logs; it also included an alternative implementation of trace collection, which ultimately created more problems, as developers now had to choose between the two standards. For this reason, in 2019, both standards were combined into OpenTelemetry [15] to solve the following tasks:
• gathering traces, metrics, and logs in one place;
• finding anomalies through charts;
• finding the location and cause of anomalies through the review of problematic request traces.
At that time, many system monitoring applications were already on the market, so one of the tasks was to maintain compatibility with them. The latest standard therefore provides a specification that describes approaches to metrics collection and conventional descriptions of processes, such as interaction over HTTP or RPC [19], without forcing vendor lock-in. As a result, OpenTelemetry is currently the most active project of the Cloud Native Computing Foundation [18].
One of the central notions introduced in the standards is telemetry: a set of metrics, logs and, most importantly, traces [14], which, in the case of our topic, can be used to build a system model in the form of a directed graph [4] that is later analyzed to identify bad practices and problem areas of the system. The OpenTelemetry standard is relatively new, so its possible use cases are still actively researched, but the central area of use is the visualization of requests (figure 1) with the ability to search for problematic areas, for example, the cause of poor service performance or the root cause of an incorrectly working business process [14].
The idea of using telemetry to improve the structure of a system can be traced back to several research papers released in recent years [6,7,20]. The list of problems that can currently be identified this way is relatively small, which opens up opportunities for further study of this topic [21].
To understand the scope of future research, a review of popular telemetry visualization and application monitoring solutions was conducted. The critical overview is presented below.
Signoz is a relatively new open-source product offering mostly basic monitoring capabilities but with more details than other open-source solutions. It supports advanced filtering and customizable dashboards, enables notifications, and has a simple system graph visualization that shows service dependencies and error and request rates (figure 2).
ServiceNow Cloud Observability is a closed platform offering comprehensive multi-functional metrics analysis and charting capabilities. It includes a correlation engine that detects and analyzes anomalies by comparing a problematic period of time with a baseline one and finding differences in attributes among equivalent trace spans.
Honeycomb (figure 1) focuses on identifying the causes of abnormal situations in the system, which helps to find performance problems much more quickly. Otherwise, it has standard capabilities such as system map visualization with some filters, alerting, and notification.
New Relic is an enterprise-level application monitoring system with numerous capabilities, from monitoring to anomaly analysis. Compared to the previous systems, anomaly analysis is automatic and is included in many places to show differences between groups of services, similar requests, and degradation of performance and quality. The analysis considers load seasonality to exclude expected spikes from the results.
Based on the review above, the features found were split into two categories: base features, which are the industry standard for such systems, and innovative features, which are valuable capabilities that make a system stand out.
A comparison of the applications' capabilities is shown in table 1.
Base features include:
• a list of system services with key metrics shown for each separately (percentage of errors over a period of time, percentiles of service request execution time);
• the possibility of building dashboards with custom queries;
• review of problems and exceptions encountered in system requests. In more advanced systems, issue management and collaboration capabilities are present;
• filtering and viewing trace visualizations. Such views usually include several representations, such as a request graph that shows the path and the services involved, and various diagrams that show the time spent in a particular service;
• alert management. Usually, the use case implies calculating the compliance rate for SLO measurements and sending notifications;
• system graph visualization, which may be comprehensive and include multiple layers to visualize physical relations (belonging to a particular node, pod, ...) and logical relations with call dependencies.
https://doi.org/10.55056/jec.728
Innovative features include:
• comparison of groups of request traces to find factors contributing to changes in performance. For example, very slow queries may be seen only for a small portion of users who use additional parameters;
• automatic detection of problems in the system (anomaly detection). More advanced implementations may adapt to expected changes, for example, the difference between day and night load or increased usage during some period of the day;
• root cause analysis: automatic determination of the causes of problems in the system, pointing to a problematic service, endpoint, or release;
• integration with the infrastructure, to display more information about the location of service instances and more server-related metrics;
• integration with external systems such as GitHub and Continuous Integration (CI) and Continuous Delivery (CD) platforms, to be able to jump quickly from one system to a contextually related place in another (e.g. a source file), collect deployment events, track service versions, and display additional service metadata that is stored centrally in the repository.
After analyzing the above functionality, two areas of improvement were found:
• identification of bad architectural practices, which cause big problems in distributed systems development: when individual teams work on separate parts of the application, it may be tough to track dependencies and see the bigger picture, which leads to degradation of performance and maintainability. The architectural smells that will be visualized are the following:
- bottleneck: a component on which many other components depend through synchronous requests. This may lead to system fragility when the component is unavailable;
- cyclic dependency: a cluster of components highly dependent on each other, causing high coupling. This practice indicates incorrectly separated responsibilities of the components;
- nano-service: a service that depends on many others through synchronous requests. Often, this means the service is too small but still requires effort to support, not to mention the overhead that synchronous requests add because network calls are slower than in-process invocation;
• anomaly detection in the whole system: analysis of key metrics of system components to find problematic areas. Compared to other available solutions, the current implementation considers the whole system instead of separate requests and use cases.
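The bottleneck and nano-service definitions above both reduce to counting synchronous dependencies, so a minimal sketch can illustrate them. The function name, the edge format, and the `threshold` value for "many" are assumptions made for this illustration only and are not taken from the described implementation.

```python
from collections import defaultdict


def degree_smells(sync_calls, threshold=3):
    """Flag bottleneck and nano-service candidates.

    sync_calls: iterable of (caller, callee) service names, one pair per
    synchronous dependency. threshold is a hypothetical cut-off for "many".
    """
    dependents = defaultdict(set)    # callee -> services that call it
    dependencies = defaultdict(set)  # caller -> services it calls
    for caller, callee in sync_calls:
        if caller != callee:  # ignore calls a service makes to itself
            dependents[callee].add(caller)
            dependencies[caller].add(callee)
    # Bottleneck: many components depend on this one.
    bottlenecks = sorted(s for s, d in dependents.items() if len(d) >= threshold)
    # Nano-service: this component depends on many others.
    nano_services = sorted(s for s, d in dependencies.items() if len(d) >= threshold)
    return bottlenecks, nano_services


calls = [("a", "db"), ("b", "db"), ("c", "db"),
         ("gw", "a"), ("gw", "b"), ("gw", "c")]
print(degree_smells(calls))
```

In a real system the cut-off would have to be tuned, since a gateway or a shared database may legitimately have many connections.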

Defining data and storage for architectural smells detection
The proposed analytical system receives a constant stream of telemetry data and aggregates it by updating the system model in the form of a directed graph stored in a graph database management system (DBMS). Then, the model can be used for analysis, searching for structural anti-patterns.
Constructing the system's graph model involves processing traces of requests (figure 4). Once they are received, the process creates or updates information about the available resources (services, storages, proxies), the operations available in each service (operation), and the individual sub-requests (hop), and stores the changes in the storage.
To build a system graph for further analysis, the data storage must have the following information:
• resources: the interacting components of the system. A resource must have a name, a type (service, storage), and the dates of creation and last use;
• operations: defined by one resource and called by other ones; they have statistics on the number of calls and errors, and the dates of creation and last use;
• calls: connections between a resource and an operation. They have the creation date, the date of last use, a type (synchronous, asynchronous), and the number of errors.
Neo4j was chosen as the storage of the system model since it physically stores data as a graph, which makes it possible to use graph algorithms to find bad practices in the system, namely:
• clustering coefficient: measures the degree of vertex connectivity; helps to show service groups in the system [11];
• degree centrality: measures the number of connections of a vertex; makes it possible to calculate the affinity (coupling) metrics of components in the system [10];
• strongly connected components: finds groups where each vertex is reachable from any other; helps to identify cyclic dependencies in the system [12].
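To make the link between strongly connected components and cyclic dependencies concrete, here is a minimal sketch of Tarjan's algorithm over a plain adjacency mapping. It is an illustration in ordinary Python, not the Neo4j implementation referenced in [12]; the graph format and function names are assumptions.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph: {node: iterable of successor nodes}."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for succ in graph.get(node, ()):
            if succ not in index:
                visit(succ)
                lowlink[node] = min(lowlink[node], lowlink[succ])
            elif succ in on_stack:
                lowlink[node] = min(lowlink[node], index[succ])
        if lowlink[node] == index[node]:  # node is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == node:
                    break
            components.append(component)

    for node in list(graph):
        if node not in index:
            visit(node)
    return components


def cyclic_dependencies(graph):
    """Components with more than one member are dependency cycles."""
    return [sorted(c) for c in strongly_connected_components(graph) if len(c) > 1]


services = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
print(cyclic_dependencies(services))
```

The recursive form is adequate for service graphs of modest size; very deep graphs would need an iterative variant to stay within Python's recursion limit.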
The graph DBMS structure is presented in figure 5. The data storage has two types of nodes: resources and operations. Resources are related to operations by the "Provides" relationship. Calls are represented by the "Calls" relationship, which aggregates statistics for all identical calls from one resource to an operation of another resource (figure 6).
The ETL process begins with the instrumentation of the system: the installation of modules for popular libraries that collect telemetry, and manual changes in service code that provide more details about particular processes in the system. The telemetry is then sent to the OpenTelemetry Collector [16], a separate modular application developed by the authors of the standard that unifies the process of collecting, transforming, and exporting telemetry to various popular monitoring systems. Figure 7 shows how metrics (blue) and traces (red) are emitted and periodically sent to the collector by every service in the OpenTelemetry Demo project. Telemetry is received by modules called "receivers", which can receive or extract data from various systems, such as Jaeger and Prometheus. In this case, however, the OTLP protocol, developed explicitly for telemetry transport, is used. The collector has two other kinds of modules: processors, which help to transform and filter telemetry, and exporters, which send the telemetry to external systems. Numerous modules are available, but our task is to create a custom exporter that takes batches of traces, extracts the necessary data, and loads it into Neo4j.
The developed module takes a group of trace objects as input (the detailed structure of a trace, with an explanation, can be found in the standard source code [31]) and loops through each span. A span defines some operation in the system (see figure 4): it can be the start of a service operation (a "server" span for a direct request, "consumer" for asynchronous event handling), a call to another service's operation (a "client" span for a request, "producer" for an asynchronous event), or an in-process operation (an "internal" span). Since a trace is a chain of consecutive spans, every span except the root has a parent. Following the chain, we can distinguish individual operations, resources, and calls. All this data is inserted into the database as follows (snippet of a Cypher request for upserting a resource in the system):
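A minimal sketch of such an upsert, assuming a `Resource` label and hypothetical property names (`name`, `type`, `createdAt`, `lastUsedAt`); it illustrates the MERGE-based pattern only and is not the original request. The helper just builds the query text and its parameters, which could then be passed to a Neo4j driver session.

```python
from datetime import datetime, timezone

# Hypothetical label and property names. MERGE creates the node when no
# resource with this name exists and matches the existing node otherwise.
UPSERT_RESOURCE = """
MERGE (r:Resource {name: $name})
ON CREATE SET r.type = $type, r.createdAt = $seen_at, r.lastUsedAt = $seen_at
ON MATCH SET r.lastUsedAt = $seen_at
RETURN r
""".strip()


def resource_upsert(name, resource_type, seen_at=None):
    """Return (query, parameters) for upserting one resource."""
    seen_at = seen_at or datetime.now(timezone.utc)
    params = {"name": name, "type": resource_type, "seen_at": seen_at.isoformat()}
    return UPSERT_RESOURCE, params


query, params = resource_upsert("checkout", "service")
print(query)
print(params)
```

Passing values as parameters rather than formatting them into the query text keeps the statement cacheable and avoids injection through span attributes.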

Methods of anomalies detection
The problem of finding and analyzing anomalies is quite common in computer science and often varies depending on the domain in which the analysis takes place. For example, when reading data from sensors for further analysis, it is essential to find and correct outliers. When analyzing a business process, it is sometimes necessary to find unusual events to analyze what led to them. In software reliability engineering, the mean time to detect is one of the most critical indicators, because the earlier a problem is found, the earlier it is fixed.
Analysis of anomalous changes is already present, at least in New Relic. However, it is present at the level of individual services, not the entire system. Although there is not enough data to confirm this, the platform appears to analyze metrics, including key metrics, using exponential smoothing [1], a method of forecasting a single variable that, depending on the variant, can take seasonality into account [2]. However, it is also possible to find anomalies from the opposite direction, at a larger scale, using multivariate algorithms, which is the approach used in this work.
An anomaly is an abnormal situation defined as a substantial difference between expected and actual measurements. Therefore, the process of finding an anomaly includes predicting the value of a particular measurement based on historical data [8].
The problem of finding anomalies in multivariate datasets is quite popular and critical because few, if any, measurements are univariate [26].
Algorithms are divided into the following training approaches:
• unsupervised: the dataset used for model training does not include labels indicating anomalous situations;
• semi-supervised: only part of the dataset has anomalous situations labelled;
• supervised: the whole dataset is labelled; this is the least commonly used type of algorithm, as it is difficult to get fully labelled data.
Due to the difficulty of obtaining labelled data, unsupervised models are the most popular. At the same time, models for semi-labelled datasets make it possible to add feedback and a correction loop. As part of this work, an unsupervised model is reviewed. Although "None of the unsupervised methods is statistically better than the others" [8], which is attributed to the complexity of training on unlabelled data, where extra parameters only interfere, Principal Component Analysis (PCA) was chosen: a statistical method of multivariate analysis used to identify the main structural components in a dataset. The main goal of PCA is to reduce the dimensionality of the data while explaining the dataset in as much detail as possible; the simplicity of the approach makes it well suited to multivariate datasets and one of the most commonly used algorithms.
Essentially, PCA converts the initial correlated variables into new linear combinations called principal components. The first principal component is defined in such a way that it explains the most significant part of the data variance. Each successive principal component is chosen to be orthogonal to the previous ones and to explain as much of the residual variance as possible.
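In symbols, the description above corresponds to the standard PCA formulation below. The notation is an assumption introduced here for illustration: x is the vector of metric values at one time point, n is the number of training observations, μ is their mean, and W_k is the matrix whose columns are the first k principal components.

```latex
\begin{align}
  C &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\top},
  \qquad C\,w_j = \lambda_j w_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \\
  z &= W_k^{\top}(x - \mu) \\
  \hat{x} &= \mu + W_k z \\
  e(x) &= \lVert x - \hat{x} \rVert^{2} = \sum_{m} \left(x_m - \hat{x}_m\right)^{2}
\end{align}
```

An observation that follows the correlations learned from the training period is reconstructed almost exactly, so e(x) stays small; an observation that breaks those correlations yields a large e(x), and the individual terms of the sum indicate which metrics contribute most.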
In the case of identifying anomalies in the system, we are interested in the following information:
• calls: the number of incoming, outgoing, and internal calls (synchronous, and asynchronous when a queue or another message broker is used), with and without errors;
• duration: the time spent processing requests.
To obtain the necessary data as metrics and to group the collected data, a dedicated connector component is needed that transforms traces into call and duration metrics. Thus, the collector receives information about the request via the OpenTelemetry Protocol (OTLP) and then groups and extracts the necessary metrics to export later. Each of the system's components (resources) collects metrics for a certain period. Metrics have different types of values. For example, the number of calls has the sum type, which is a counter of certain events for a period and, in this case, is a monotonic sequence because the number of calls never decreases.
It is also important to note that the metrics are returned as a delta (the value of aggregationTemporality is 1) and not as a cumulative value, because we are interested in the number of calls in a certain period, not the absolute value. Each metric can have multiple points that represent different attribute-defined dimensions (the dimensions are customizable), so separate counters have been set up for different request types (span.kind) and statuses (status.code).
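The difference between the two temporalities can be shown with a small sketch. The helper below is hypothetical and only illustrates the arithmetic: it turns cumulative counter readings into per-period deltas, under the assumptions that the counter starts at zero and that a drop in the value means the counter was reset.

```python
def cumulative_to_delta(readings):
    """Convert cumulative counter readings into per-period increments."""
    deltas = []
    previous = None
    for value in readings:
        if previous is None or value < previous:
            # First reading, or the counter was reset (assumption):
            # the reading itself is the increment for this period.
            deltas.append(value)
        else:
            deltas.append(value - previous)
        previous = value
    return deltas


print(cumulative_to_delta([5, 8, 8, 15, 3, 4]))
```

With delta temporality this conversion is unnecessary, because each exported point already holds the number of calls for its period.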
In this form, however, the data cannot be used directly. First, all metrics for individual services are separated (figure 8) and converted to time series (figure 9), which are later combined by timestamp (figure 10). The intermediate results clearly show the correlation between the different metrics of the system components (figure 11), which is confirmed by a correlation map (figure 12).
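The combination step can be sketched as follows. The input layout (a mapping from metric name to timestamped values) and the choice to fill missing points with zero are assumptions for this illustration; zero is a natural default for delta counters, where a missing point means no calls in that period.

```python
def combine_series(series_by_name, fill=0.0):
    """Align per-metric time series on a shared, sorted time axis.

    series_by_name: {metric name: {timestamp: value}}
    Returns (timestamps, names, rows), one row per timestamp and
    one column per metric.
    """
    names = sorted(series_by_name)
    timestamps = sorted({ts for series in series_by_name.values() for ts in series})
    rows = [[series_by_name[name].get(ts, fill) for name in names]
            for ts in timestamps]
    return timestamps, names, rows


series = {
    "checkout.calls": {1: 4, 2: 6},
    "cart.calls": {2: 3, 3: 1},
}
timestamps, names, rows = combine_series(series)
print(names)
for ts, row in zip(timestamps, rows):
    print(ts, row)
```

Each row of the result is one multivariate observation of the whole system, which is the input format a multivariate model such as PCA expects.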
To identify anomalies, the data sample is split into two periods: the first is used to train the PCA statistical model, and the second is compared with the values predicted (reconstructed) by the model to estimate the error for all metrics together and for each specific metric.
Snippet of model training:
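A minimal, dependency-free sketch of this train-then-score procedure, given for illustration only: it is not the original snippet, and in practice a library PCA implementation would normally be used. The function names, the power-iteration solver, and the toy data are assumptions.

```python
import math
import random


def fit_pca(rows, n_components=1, iterations=200, seed=0):
    """Fit PCA on training rows (one row per time point, one column per metric).

    Components are found by power iteration with deflation on the sample
    covariance matrix. Returns (mean, components).
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.random() + 0.1 for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # no variance left to explain
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j]
                         for i in range(d) for j in range(d))
        components.append(v)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components


def reconstruction_errors(row, mean, components):
    """Squared reconstruction error of one observation, per metric."""
    centered = [x - m for x, m in zip(row, mean)]
    reconstructed = [0.0] * len(row)
    for component in components:
        score = sum(c * x for c, x in zip(component, centered))
        reconstructed = [r + score * c for r, c in zip(reconstructed, component)]
    return [(x - r) ** 2 for x, r in zip(centered, reconstructed)]


# Training period: two perfectly correlated metrics.
train = [[float(t), float(t)] for t in range(10)]
mean, components = fit_pca(train, n_components=1)
# Analysis period: one point that follows the correlation, one that breaks it.
print(sum(reconstruction_errors([5.0, 5.0], mean, components)))
print(sum(reconstruction_errors([5.0, 9.0], mean, components)))
```

The sum of the per-metric errors gives the overall error for a time point, while the individual values show which metrics are responsible for a spike.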

Results
The OpenTelemetry Demo project [17], specially designed for testing applications that work with telemetry, was used as the test system. This distributed system has components built with different technologies and is automatically loaded by a load generator service.

Visualization of the service graph using Neo4j tools
After running the whole system, the graph database has the following data (figure 13). The graph has many nodes of the operation type (orange circles) and slightly fewer services (purple circles). The "calls" and "provides" relationships are depicted as arrows between them. To simplify the graph, a function from the APOC library [9] for Neo4j is used to visualize the graph projection and show service dependencies (figure 14).
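The projection described above collapses each resource-to-operation call and the corresponding operation ownership into a single service-to-service edge. The sketch below shows that idea on plain Python data; the input formats and the function name are assumptions, and it does not reproduce the APOC-based query.

```python
from collections import Counter


def project_dependencies(provides, calls):
    """Collapse resource -> operation calls into resource -> resource edges.

    provides: {operation: resource that provides it}
    calls: iterable of (calling resource, operation, number of calls)
    Returns a Counter {(caller, provider): total number of calls}.
    """
    edges = Counter()
    for caller, operation, count in calls:
        provider = provides.get(operation)
        if provider is None or provider == caller:
            continue  # unknown operation or a call within the same resource
        edges[(caller, provider)] += count
    return edges


provides = {"PlaceOrder": "checkout", "GetCart": "cart", "EmptyCart": "cart"}
calls = [
    ("frontend", "PlaceOrder", 10),
    ("checkout", "GetCart", 10),
    ("checkout", "EmptyCart", 9),
]
print(project_dependencies(provides, calls))
```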
A snippet of a virtual relationship visualization query:
In the resulting diagram (figure 15), clusters are marked with distinct colours, and their size indicates the dependence of services on peers. From the diagram, it is also clear that the checkout service has many dependencies. This way, you can quickly analyze the application's architecture and see the parts that must be refactored to prevent the whole application from halting because of a single bottleneck component.

Time interval anomalies analysis
A time interval of a few hours was chosen to detect anomalies. It was processed with the PCA algorithm, and once the errors for each time point are obtained, a visual analysis can be performed for spikes in the error values (figure 16).
As the plot shows, between 6:50 a.m. and 7 a.m. some changes led to a relatively large error. The error graph for each of the features shows that feature 44 is involved in this error; a more detailed analysis of this metric shows that all its values stay near 0, while there is one outlier with a value of about 12.

Discussion
Compared to static analysis approaches, dynamic analysis shows an accurate picture of the entire system and of all query paths actually in use, and accurately indicates the components that cause performance problems at a particular moment; static analysis of individual modules is better suited to identifying code smells. Telemetry, in turn, combines all key indicators and adds context that provides more information for analysis.
The practical use of a simple statistical unsupervised PCA algorithm has demonstrated that such a model can identify anomalies, which can significantly simplify the work of engineers: instead of looking at dozens of charts and responding to user messages in support, they are alerted to anomalous situations in the system automatically. Compared with approaches that analyze each metric of the system separately (using appropriate statistical methods, for example, those used in New Relic [1]), this method gives the general picture of the situation in the entire system and also points to the cause of the problem. Compared to supervised algorithms, especially neural networks [22,23], the proposed method removes the need to retrain the model to adapt to regular changes (e.g., a natural increase in the number of users of the system), because the analysis takes place in a specific window. This window must be large enough to cover a sufficient amount of data for training and analysis and, at the same time, must not be too sensitive to seasonal changes (for example, activity during the day vs. activity at night); its size needs to be tested and determined on a real system.

Conclusions
The paper discusses the use of telemetry for dynamic analysis of the system for anomalous events and architectural smell detection.
An analysis of the problems related to distributed systems development, with a detailed comparison of monolithic and distributed architectures, was carried out, which made it possible to determine the need for applications for monitoring and rapid response to problems in an extensive system. Studies on the use of telemetry for dynamic system analysis published in recent years have shown the potential of this approach. The history of system monitoring and the main aspects of the latest OpenTelemetry standard were reviewed, and popular application performance monitoring solutions were compared in order to list their features, divide them into essential and innovative groups, and define the tasks for the study.
Later, the primary data flows required for analysis were identified, and a model of a graph DBMS was built. The model includes the following entities: operations, resources, and relationships, which determine the direction of resource dependence and the ownership of operations. After that, telemetry extraction, processing, and loading using the OpenTelemetry Collector were reviewed. The main types of anomaly detection algorithms were studied, and the multivariate PCA statistical method was chosen to analyse unlabelled telemetry data. A custom component of the collector application was developed to transform and insert information into the Neo4j datastore. The features needed to find anomalies are the numbers and durations of calls within the system. An algorithm for collecting

Figure 2: Signoz system graph (visual reconstruction of the result to improve readability).

Figure 4: Example of a request tree.

Figure 5: Simplified diagram of the structure of a graph DBMS.

Figure 8: Results of a metric timeseries per service extraction.

Figure 11: Chart of metric values over time.

Figure 13: Visualization of the full graph of services, operations and connections between them using Neo4j Browser (visual reconstruction of the result to improve readability).

Figure 14: Visualization of the dependency graph in Neo4j Browser (visual reconstruction of the result to improve readability).

Figure 15: Visualization of the dependency graph of services considering clustering and centrality algorithms in Neo4j Bloom (visual reconstruction of the result to improve readability).

Figure 16: The result of displaying the data reconstruction error for all and individual metrics.

Table 1: Application features comparison.

In figure 14, you can see the dependence of the checkout service on many others. To confirm this, let us use Neo4j Bloom to visualize the Local Clustering Coefficient [11] and Degree Centrality [10] algorithms.