Improving Observability in a Software System

Russ Cam
27 September 2024

Applying a combination of techniques and best practices to improve observability

A pane of glass

Ensuring the performance and reliability of systems is critical. The ability to observe and measure the internal state of a system based on the data it produces has become a fundamental practice in modern software engineering. At the heart of observability lies telemetry data: logs, metrics, and traces, often referred to as "the three pillars", each offering a different window into system behaviour.

While many organizations already collect telemetry data through platforms like the Elastic Stack (ELK), the challenge often lies in optimizing this data to make it more actionable. Simply collecting it isn't enough: how can you extract meaningful insights efficiently? This post explores some techniques and practices to improve observability.

Specifically, we'll focus on:

  - Single Pane of Glass: unifying logs, metrics, and tracing
  - Message Fingerprinting: aggregating similar log messages
  - MessageTemplates: persisting variable data for querying
  - Using a Common Schema, such as the Elastic Common Schema (ECS)

Let's dive into these concepts and how they can enhance the observability of your software system.

Single Pane of Glass: Unifying Logs, Metrics, and Tracing

When issues arise in a complex software system, isolating the root cause requires visibility into different forms of telemetry: logs, metrics, and traces. Unfortunately, these often live in separate systems, making the debugging process cumbersome and fragmented.

The concept of a Single Pane of Glass in observability refers to providing a unified interface that allows engineers to navigate seamlessly between these data types. While logs, metrics, and traces may not necessarily live in the same system (and there can be good reasons why they don't), you should aim for a tightly integrated experience that simplifies navigation between them.

Why It's Important

When something goes wrong in a distributed system, the symptoms manifest in different ways: through anomalous metrics, application logs, or traces of distributed requests. For instance, you might notice a spike in CPU usage (a metric) that correlates with a specific error in your application logs. Similarly, tracing can help pinpoint how a specific request flows through different services and where it might be experiencing latency or errors.

Without a single pane of glass, you would have to:

  1. Open your metrics dashboard (e.g., Prometheus, Grafana, Datadog) to identify when the CPU spike that triggered your alert occurred.

  2. Jump to your logging system (e.g., Elastic Stack, Splunk) to find related logs.

  3. Navigate to your tracing tool (e.g., Jaeger, OpenTelemetry, Dynatrace) to correlate with sampled traces for that time period.

This siloed approach wastes valuable time and increases the risk of missing critical information.

Implementing a Single Pane of Glass

While tools like Elastic Stack, Datadog, and Honeycomb provide a unified solution for logs, metrics, and traces, not all observability stacks will be fully unified. However, you can still build an integrated workflow by linking these tools together where possible by:

  - Propagating shared identifiers, such as trace IDs, correlation IDs, and consistent service names, across logs, metrics, and traces
  - Embedding deep links between tools, so that a metrics dashboard links directly to the logs and traces for the same service and time window (a sketch follows below)
  - Aligning timestamps and time ranges, so that a window selected in one tool carries over to the others

In short, while these different observability data types may be collected in separate systems, the user experience should make it feel like they're all part of a cohesive whole.
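
As a simple illustration of the deep-linking idea, the sketch below builds cross-tool links from a shared trace ID and time window. The base URLs and query parameters are hypothetical placeholders, not the link formats of any particular product; substitute whatever formats your own tools expose.

package com.searchpioneer;

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class ObservabilityLinks {
    // Hypothetical base URLs; replace with your own dashboards
    private static final String METRICS_BASE = "https://grafana.example.com/d/host-overview";
    private static final String LOGS_BASE = "https://kibana.example.com/app/discover";

    // Link from an alert or trace to the metrics dashboard for the same window
    static String metricsDashboard(Instant from, Instant to) {
        return METRICS_BASE + "?from=" + from.toEpochMilli() + "&to=" + to.toEpochMilli();
    }

    // Link from a metrics dashboard or trace view to the correlated logs
    static String logsForTrace(String traceId) {
        var query = URLEncoder.encode("trace.id:\"" + traceId + "\"", StandardCharsets.UTF_8);
        return LOGS_BASE + "?query=" + query;
    }

    public static void main(String[] args) {
        var now = Instant.now();
        System.out.println(metricsDashboard(now.minusSeconds(900), now));
        System.out.println(logsForTrace("0af7651916cd43dd8448eb211c80319c"));
    }
}

Even this modest level of integration, surfaced as links in alerts and dashboards, removes much of the manual copying of timestamps and IDs between tools.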

Message Fingerprinting: Aggregating Similar Log Messages

Logs are invaluable for understanding the state of your system, but when it comes to identifying patterns or diagnosing issues, it's easy to be overwhelmed by the sheer volume that modern applications produce; a large distributed application can emit hundreds of millions of log messages daily. Which warn- or error-level log messages are happening most frequently? Which info-level messages are produced in large volumes that cost a significant amount to store and process, but no longer provide value?

Message fingerprinting is an approach to help address this by grouping together logs with the same structure but different variable data. The idea is to generate a unique fingerprint or hash for each type of log message, allowing you to easily aggregate and analyze patterns across your logs.

Example

Consider the following log entries:

User 1234 encountered error: Database connection failed.
User 5678 encountered error: Database connection failed.

produced by the following SLF4J logger call in Java:

var userID = 1234;
logger.error("User {} encountered error: Database connection failed.", userID);

While these log messages differ by User ID, they are essentially reporting the same issue. With message fingerprinting, you can group these logs based on a fingerprint of the raw log message format, which excludes variable data like the user ID.

Implementing Message Fingerprinting

There are a few steps to implement message fingerprinting:

Define log message templates

Ensure that log messages follow a consistent format. Use placeholders for variable information (e.g., user ID, error message) that will be ignored when generating the fingerprint. There's a good chance you're already doing this with your logging libraries of choice.

Generate fingerprints

Use a hashing algorithm to create a unique identifier for each log message based on its structure, ignoring the variable parts. For the example above, hash the message format User {} encountered error: Database connection failed. A fast non-cryptographic hash function such as MurmurHash is a reasonable choice since the function will be called on every log event.
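
As a quick illustration, here's a minimal standalone sketch using Guava's murmur3 hash (the same hash the converter below uses): hashing the raw template rather than the formatted message yields an identical fingerprint for both log entries above, regardless of the user ID.

package com.searchpioneer;

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class FingerprintDemo {
    public static void main(String[] args) {
        // The raw SLF4J message template, before argument substitution
        var template = "User {} encountered error: Database connection failed.";

        // Hashing the template rather than the formatted message means both
        // "User 1234 ..." and "User 5678 ..." events share this fingerprint
        var fingerprint = Hashing.murmur3_32_fixed()
                .hashString(template, StandardCharsets.UTF_8)
                .toString();

        System.out.println("fingerprint: " + fingerprint);
    }
}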

If you're using Logback, one approach is to implement a custom Converter:

package com.searchpioneer;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.pattern.DynamicConverter;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class FingerprintConverter extends DynamicConverter<ILoggingEvent> {
    @Override
    public String convert(ILoggingEvent event) {
        // hash the unformatted message
        return hashMessage(event.getMessage());
    }

    private String hashMessage(String message) {
        return Hashing.murmur3_32_fixed().hashString(message, StandardCharsets.UTF_8).toString();
    }
}

and then register the converter so that it can be used in an encoder pattern. Here it's registered in code:

package com.searchpioneer;

import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.encoder.PatternLayoutEncoder;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.ConsoleAppender;
import ch.qos.logback.core.CoreConstants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.HashMap;
import java.util.Map;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        configureLogbackFingerprint();
        
        logger.info("Application started");

        try {
            processBusinessLogic();
        } catch (Exception e) {
            logger.error("An error occurred during processing", e);
        }

        logger.info("Application finished successfully");
    }

    private static void processBusinessLogic() {
        logger.debug("Starting business logic processing...");
        for (int i = 0; i < 5; i++) {
            logger.info("Processing item {}", i + 1);
        }
        
        if (Math.random() > 0.5) {
            throw new RuntimeException("Simulated processing error");
        }
        
        logger.debug("Business logic processing completed");
    }

    private static void configureLogbackFingerprint() {
        var context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.reset();

        @SuppressWarnings("unchecked")
        Map<String, String> patternRuleRegistry = (Map<String, String>) context.getObject(CoreConstants.PATTERN_RULE_REGISTRY);
        if (patternRuleRegistry == null) {
            patternRuleRegistry = new HashMap<>();
        }

        context.putObject(CoreConstants.PATTERN_RULE_REGISTRY, patternRuleRegistry);
        patternRuleRegistry.put("fingerprint", FingerprintConverter.class.getName());

        PatternLayoutEncoder encoder = new PatternLayoutEncoder();
        encoder.setContext(context);
        encoder.setPattern("%d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %msg - fingerprint:%fingerprint%n");
        encoder.start();

        var consoleAppender = new ConsoleAppender<ILoggingEvent>();
        consoleAppender.setName("console");
        consoleAppender.setEncoder(encoder);
        consoleAppender.setContext(context);
        consoleAppender.start();

        var rootLogger = context.getLogger("ROOT");
        rootLogger.addAppender(consoleAppender);
    }
}

which outputs the following logs

2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Application started - fingerprint:a99b5c06
2024-09-25 22:06:31 [main] DEBUG com.searchpioneer.Main - Starting business logic processing... - fingerprint:c8b3e191
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Processing item 1 - fingerprint:4e593e5c
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Processing item 2 - fingerprint:4e593e5c
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Processing item 3 - fingerprint:4e593e5c
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Processing item 4 - fingerprint:4e593e5c
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Processing item 5 - fingerprint:4e593e5c
2024-09-25 22:06:31 [main] ERROR com.searchpioneer.Main - An error occurred during processing - fingerprint:f87f72f2
java.lang.RuntimeException: Simulated processing error
	at com.searchpioneer.Main.processBusinessLogic(Main.java:38)
	at com.searchpioneer.Main.main(Main.java:23)
2024-09-25 22:06:31 [main] INFO  com.searchpioneer.Main - Application finished successfully - fingerprint:cfe8ff06

Notice that the fingerprint generated for each log message with the format "Processing item {}" is the same.

Aggregate logs

Once logs are fingerprinted, you can group and count similar log entries, providing a more aggregated view of what's happening in your system.

This approach helps in identifying the most pertinent recurring issues and filtering out redundant information. It's a useful operational practice to incorporate an audit of the most frequent error messages into your regular ops review.
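
For example, assuming the fingerprint is indexed in Elasticsearch as a keyword field named message_fingerprint (as in the ECS example later in this post), a terms aggregation surfaces the most frequent error fingerprints:

GET logs-*/_search
{
  "size": 0,
  "query": {
    "term": { "log.level": "ERROR" }
  },
  "aggs": {
    "top_error_fingerprints": {
      "terms": { "field": "message_fingerprint", "size": 10 }
    }
  }
}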

MessageTemplates: Persisting Variable Data for Querying

While fingerprinting helps group logs with the same message format, MessageTemplates go a step further by extracting and persisting variable data from log messages as separate field values. This makes it easier to query logs based on specific criteria, such as user IDs, statuses, query terms, etc.

MessageTemplates aren't a new idea (the website and repository were created in 2016, and the idea is older still), but in our experience they're often underutilized in production systems.

Example

Using the same log example from above:

User 1234 encountered error: Database connection failed.

With MessageTemplates, you could define the log format as:

User {UserID} encountered error: {ErrorMessage}.

Here, {UserID} and {ErrorMessage} are placeholders for variable data. By persisting this variable information as structured fields, you can easily search for logs where ErrorMessage is "Database connection failed" or where UserID is 1234.
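
Once these fields are persisted, such a lookup becomes a structured filter rather than a full-text search over the rendered message. Here's a sketch in Elasticsearch Query DSL, assuming the variables are indexed as keyword fields named message_properties.UserID and message_properties.ErrorMessage:

GET logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "message_properties.UserID": "1234" } },
        { "term": { "message_properties.ErrorMessage": "Database connection failed" } }
      ]
    }
  }
}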

Implementing MessageTemplates

Structured logging

Use a structured logging library that supports message templating to do the heavy lifting for you. Instead of logging free-form text, log structured objects with variables extracted as fields. The MessageTemplates site has a list of implementations in different languages.

In Java, the Logstash Logback Encoder provides StructuredArguments to achieve a similar outcome to MessageTemplates:

package com.searchpioneer;

import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.ConsoleAppender;
import net.logstash.logback.argument.StructuredArguments;
import net.logstash.logback.encoder.LogstashEncoder;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        configureLogbackLogstash();

        logger.info("Application started");

        try {
            processBusinessLogic();
        } catch (Exception e) {
            logger.error("An error occurred during processing", e);
        }

        logger.info("Application finished successfully");
    }

    private static void processBusinessLogic() {
        logger.debug("Starting business logic processing...");
        for (int i = 0; i < 5; i++) {
            logger.info("Processing item {}", StructuredArguments.value("message_properties.item", i + 1));
        }

        if (Math.random() > 0.5) {
            throw new RuntimeException("Simulated processing error");
        }

        logger.debug("Business logic processing completed");
    }
    
    private static void configureLogbackLogstash() {
        var context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.reset();

        var logstashEncoder = new LogstashEncoder();
        logstashEncoder.setContext(context);
        logstashEncoder.start();

        var consoleAppender = new ConsoleAppender<ILoggingEvent>();
        consoleAppender.setName("console");
        consoleAppender.setEncoder(logstashEncoder);
        consoleAppender.setContext(context);
        consoleAppender.start();

        var rootLogger = context.getLogger("ROOT");
        rootLogger.addAppender(consoleAppender);
    }
}

which outputs

{"@timestamp":"2024-09-27T14:29:28.5142659+10:00","@version":"1","message":"Application started","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000}
{"@timestamp":"2024-09-27T14:29:28.5187664+10:00","@version":"1","message":"Starting business logic processing...","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"DEBUG","level_value":10000}
{"@timestamp":"2024-09-27T14:29:28.519767+10:00","@version":"1","message":"Processing item 1","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000,"message_properties.item":1}
{"@timestamp":"2024-09-27T14:29:28.5202663+10:00","@version":"1","message":"Processing item 2","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000,"message_properties.item":2}
{"@timestamp":"2024-09-27T14:29:28.5202663+10:00","@version":"1","message":"Processing item 3","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000,"message_properties.item":3}
{"@timestamp":"2024-09-27T14:29:28.5202663+10:00","@version":"1","message":"Processing item 4","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000,"message_properties.item":4}
{"@timestamp":"2024-09-27T14:29:28.5207658+10:00","@version":"1","message":"Processing item 5","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000,"message_properties.item":5}
{"@timestamp":"2024-09-27T14:29:28.5207658+10:00","@version":"1","message":"An error occurred during processing","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"ERROR","level_value":40000,"stack_trace":"java.lang.RuntimeException: Simulated processing error\r\n\tat com.searchpioneer.Main.processBusinessLogic(Main.java:36)\r\n\tat com.searchpioneer.Main.main(Main.java:21)\r\n"}
{"@timestamp":"2024-09-27T14:29:28.521766+10:00","@version":"1","message":"Application finished successfully","logger_name":"com.searchpioneer.Main","thread_name":"main","level":"INFO","level_value":20000}

Observe that the field "message_properties.item" is included in the JSON output with the structured argument value.

Log parsing

Ensure that your logging system is set up either to recognize and parse message templates from logs, or to receive structured log messages with fields already parsed, and that it can store variable data as searchable fields.
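
If logs arrive as plain text rather than structured JSON, the parsing can happen at ingest time instead. Here's a minimal sketch of an Elasticsearch ingest pipeline using the dissect processor, with hypothetical field names matching the earlier example:

PUT _ingest/pipeline/parse-user-errors
{
  "description": "Extract the user ID and error message from plain-text log lines",
  "processors": [
    {
      "dissect": {
        "field": "message",
        "pattern": "User %{user.id} encountered error: %{error.message}"
      }
    }
  ]
}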

Indexing

In systems like Elastic Stack, make sure to index these fields to allow querying, filtering, and aggregating. The majority of the time, this means indexing as a "keyword" data type with doc_values enabled.
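
As a sketch, an index template along these lines (with a hypothetical index pattern, and field names matching the examples in this post) maps the fingerprint and extracted properties as keyword fields; doc_values are enabled by default for the keyword type:

PUT _index_template/structured-logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "mappings": {
      "properties": {
        "message_fingerprint": { "type": "keyword" },
        "message_properties": {
          "properties": {
            "item": { "type": "keyword" }
          }
        }
      }
    }
  }
}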

In a large distributed application maintained by many teams, you often need to be judicious about how many distinct fields each team is allowed to log, and you need to avoid conflicting field names for additional data. The latter can be addressed by prefixing fields with a team identifier; the former can often be constrained in the logging stack, e.g. by limiting the number of indexed fields carrying a given team's prefix.
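
One coarse guardrail, assuming Elasticsearch as the backing store, is the index-level cap on the total number of mapped fields; finer-grained per-team limits typically need enforcement in the ingest pipeline or through schema review:

PUT logs-*/_settings
{
  "index.mapping.total_fields.limit": 2000
}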

By using MessageTemplates, you'll gain more control over how you query logs, leading to more precise troubleshooting and monitoring.

Using a Common Schema

As observability matures, the need for consistency in how logs, metrics, and traces are structured becomes critical. If each team in an organization logs data in its own format, it can be challenging to correlate logs across different services or to understand what's happening at a system level.

This is where adopting a common schema such as the Elastic Common Schema (ECS) or OpenTelemetry's Semantic Conventions can provide immense benefits. Not only does it bring consistency across an organization, it also helps to unify logs, metrics, and traces into a single coherent data model, allowing them to be collectively indexed and queried.

Elastic Common Schema (ECS)

The Elastic Common Schema defines a consistent format for structuring log and event data. By using ECS, you ensure that logs from different systems and services share the same field names and structure. This makes it easier to search, visualize, and analyze data across your entire stack.

For example, ECS specifies field names like:

  - @timestamp for the event timestamp
  - log.level for the log severity
  - service.name for the name of the service emitting the event
  - trace.id for correlating an event with a distributed trace
  - error.message and error.stack_trace for error details

ECS is an open source specification and is converging with OpenTelemetry Semantic Conventions over time, making it a reasonable choice. The power of a convention lies less in its specific rules and more in the standardization and consistency it enforces: less time debating minutiae, more time extracting value.

Implementing ECS

Define logging guidelines

Establish ECS as the standard schema for all services and teams in your organization. Larger organizations often have dedicated observability teams to manage logs, metrics, and traces centrally which can help in establishing improved practices.

Configure loggers

Set up your logging libraries to automatically use ECS field names. Many log aggregation tools like Filebeat already support ECS out of the box.

For a Java application wishing to log events using ECS, Elastic provides ECS logging libraries that integrate with the most common logging frameworks.

If you're using Logback, the following example integrates ECS along with message fingerprinting and MessageTemplates, by deriving from the EcsEncoder provided by the ECS logging library:

package com.searchpioneer;

import ch.qos.logback.classic.spi.ILoggingEvent;
import co.elastic.logging.JsonUtils;
import co.elastic.logging.logback.EcsEncoder;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.google.common.hash.Hashing;
import net.logstash.logback.marker.ObjectAppendingMarker;

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class EcsWithFingerprintEncoder extends EcsEncoder {
    private static final JsonFactory factory = new JsonFactory();

    @Override
    protected void addCustomFields(ILoggingEvent event, StringBuilder builder) {
        // Fingerprint the unformatted message template, so that events sharing
        // a template share a fingerprint. Each custom field appended here ends
        // with a trailing comma.
        var fingerprint = hashMessage(event.getMessage());
        builder.append("\"message_fingerprint\":\"");
        JsonUtils.quoteAsString(fingerprint, builder);
        builder.append("\",");

        // Serialize any logstash StructuredArguments into a nested
        // "message_properties" object
        var messageArgs = event.getArgumentArray();
        if (messageArgs != null && messageArgs.length > 0) {
            builder.append("\"message_properties\":");
            var writer = new StringBuilderWriter(builder);
            try {
                JsonGenerator generator = factory.createGenerator(writer);
                generator.writeStartObject();
                for (Object o : messageArgs) {
                    if (o instanceof ObjectAppendingMarker objectAppendingMarker) {
                        objectAppendingMarker.writeTo(generator);
                    }
                }
                generator.writeEndObject();
                generator.flush();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            builder.append(",");
        }
    }

    private String hashMessage(String message) {
        return Hashing.murmur3_32_fixed().hashString(message, StandardCharsets.UTF_8).toString();
    }

    private static class StringBuilderWriter extends Writer {
        private final StringBuilder sb;

        public StringBuilderWriter(StringBuilder sb) {
            this.sb = sb;
        }

        @Override
        public void write(char[] cbuf, int off, int len) {
            sb.append(cbuf, off, len);
        }

        @Override
        public void flush() {
        }

        @Override
        public void close() {
        }
    }
}

With the encoder configured in code

package com.searchpioneer;

import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.ConsoleAppender;
import net.logstash.logback.argument.StructuredArguments;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class Main {
    private static final Logger logger = LoggerFactory.getLogger(Main.class);

    public static void main(String[] args) {
        configureLogbackEcs();

        logger.info("Application started");

        try {
            processBusinessLogic();
        } catch (Exception e) {
            logger.error("An error occurred during processing", e);
        }

        logger.info("Application finished successfully");
    }

    private static void processBusinessLogic() {
        logger.debug("Starting business logic processing...");
        for (int i = 0; i < 5; i++) {
            // A StructuredArgument, so the value surfaces under "message_properties"
            logger.info("Processing item {}", StructuredArguments.value("i", i + 1));
        }

        if (Math.random() > 0.5) {
            throw new RuntimeException("Simulated processing error");
        }

        logger.debug("Business logic processing completed");
    }

    private static void configureLogbackEcs() {
        var context = (LoggerContext) LoggerFactory.getILoggerFactory();
        context.reset();

        var ecsEncoder = new EcsWithFingerprintEncoder();
        ecsEncoder.setContext(context);
        ecsEncoder.setServiceName("java-logging-example");
        ecsEncoder.setServiceVersion("1.0.0");
        ecsEncoder.setIncludeMarkers(true);
        ecsEncoder.setIncludeOrigin(true);
        ecsEncoder.start();

        var consoleAppender = new ConsoleAppender<ILoggingEvent>();
        consoleAppender.setName("console");
        consoleAppender.setEncoder(ecsEncoder);
        consoleAppender.setContext(context);
        consoleAppender.start();

        var rootLogger = context.getLogger("ROOT");
        rootLogger.addAppender(consoleAppender);
    }
}

the output looks as follows

{"@timestamp":"2024-09-27T04:50:15.967Z","log.level": "INFO","message":"Application started","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":24},"function":"main"}},"message_fingerprint":"a99b5c06"}
{"@timestamp":"2024-09-27T04:50:15.971Z","log.level":"DEBUG","message":"Starting business logic processing...","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":36},"function":"processBusinessLogic"}},"message_fingerprint":"c8b3e191"}
{"@timestamp":"2024-09-27T04:50:15.973Z","log.level": "INFO","message":"Processing item 1","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":38},"function":"processBusinessLogic"}},"message_fingerprint":"4e593e5c","message_properties":{"i":1}}
{"@timestamp":"2024-09-27T04:50:15.978Z","log.level": "INFO","message":"Processing item 2","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":38},"function":"processBusinessLogic"}},"message_fingerprint":"4e593e5c","message_properties":{"i":2}}
{"@timestamp":"2024-09-27T04:50:15.978Z","log.level": "INFO","message":"Processing item 3","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":38},"function":"processBusinessLogic"}},"message_fingerprint":"4e593e5c","message_properties":{"i":3}}
{"@timestamp":"2024-09-27T04:50:15.979Z","log.level": "INFO","message":"Processing item 4","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":38},"function":"processBusinessLogic"}},"message_fingerprint":"4e593e5c","message_properties":{"i":4}}
{"@timestamp":"2024-09-27T04:50:15.979Z","log.level": "INFO","message":"Processing item 5","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":38},"function":"processBusinessLogic"}},"message_fingerprint":"4e593e5c","message_properties":{"i":5}}
{"@timestamp":"2024-09-27T04:50:15.979Z","log.level":"ERROR","message":"An error occurred during processing","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":29},"function":"main"}},"message_fingerprint":"f87f72f2","error.type":"java.lang.RuntimeException","error.message":"Simulated processing error","error.stack_trace":"java.lang.RuntimeException: Simulated processing error\r\n\tat com.searchpioneer.Main.processBusinessLogic(Main.java:42)\r\n\tat com.searchpioneer.Main.main(Main.java:27)\r\n"}
{"@timestamp":"2024-09-27T04:50:15.980Z","log.level": "INFO","message":"Application finished successfully","ecs.version": "1.2.0","service.name":"java-logging-example","service.version":"1.0.0","event.dataset":"java-logging-example","process.thread.name":"main","log.logger":"com.searchpioneer.Main","log":{"origin":{"file":{"name":"Main.java","line":32},"function":"main"}},"message_fingerprint":"cfe8ff06"}

Observe that all log messages conform to the ECS schema and include a message fingerprint. For log messages with StructuredArgument placeholders, those values have been emitted as separate fields under "message_properties".

By adopting a common schema like ECS, you reduce the friction involved in analyzing logs from different sources and improve overall observability.

Conclusion

Improving observability is not just about collecting more logs; it's about making those logs, along with metrics and traces, more actionable. By unifying logs, metrics, and tracing into a single pane of glass, fingerprinting similar logs, extracting variable information with MessageTemplates, and adopting a common schema like ECS, you can significantly enhance your ability to monitor and troubleshoot your system.

These practices empower your team to identify issues faster, reduce noise in log data, and gain a comprehensive view of your software's behaviour. Whether you're working with Elastic Stack, Prometheus, Jaeger, Datadog or other observability platforms, the techniques discussed here can help you achieve better insights and, ultimately, more reliable software systems.