Ken Muse

The Ultimate Debugging Hack for Developers


Developers usually want to be able to directly connect to a system, start the debugger, and step their way through the code. Being able to log into a machine and debug is the way most of us learn to troubleshoot and fix issues. But what if you can’t do that? And what if I told you there’s a better way to debug?

Doing it the hard way is easy

When we are learning to debug (or working locally), being able to step into the code makes it easier to understand what’s happening and why. We can see each step and understand the flow of the code. We can even look up and down the call stack to (hopefully) see the variables that were set and trace the origins of the data. Because it’s the first approach we learn – often in a GUI debugger – it’s the easiest path.

The problem is that this approach doesn’t scale. It requires tools to be available on the machine being debugged. This creates an inherent security risk, especially since debugging may require permissions that aren’t normally available to a typical service or web application. Having the tools installed and waiting expands the attack surface of the system. If you’re relying on an IDE, you’re also likely connecting to the machine through a remote connection or (if it’s Windows) Remote Desktop Protocol (RDP). That means you’ve also opened privileged ports that could otherwise remain locked down.

If you do connect in this way, you often lose track of the full state. How does that happen? When a debugger connects, one or more threads are suspended to enable you to step through the code. If all of the threads are suspended, you have forced the entire application to stop while you step through the code. The other activities could have been contributing to the issue being explored. If you can suspend individual threads, you may have activities continuing to run in the background and changing state while you debug. Again, this can hide the issues. In both cases, the system being debugged may no longer be processing all of the incoming data in the same way. This can lead to a degraded experience.

And all of this ignores what happens when a process spawns other executables. If you’re debugging, it can be very hard to connect to the new process and continue your work. Git is a perfect example of that: most of its commands launch other executables, making it tough to debug the full flow of the application.

Capturing the running application

There is a second approach to debugging: capture a memory dump of the process at the moment the issue occurs. By doing this, you can see the full state of the application at that moment. For mobile and desktop apps, this can provide a way to understand what happened on the device. For web applications, it provides a way to see application state without impacting end users.

This approach doesn’t require you to be actively connected to the code with a debugger. You can connect to the memory dump and review the state of the application and its variables. You can find the reasons for a specific error or exception, but you can’t see the full interactions that are occurring or continue the process with modified variables. The .NET platform provides dotnet-dump (or on Azure, Snapshot Debugger). Java provides jcmd (for heap and thread dumps). Python has a tool called pystack that can be used to analyze a process or Linux core dump. Other languages may have similar tools.
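As a lightweight illustration of the same idea in Python (a sketch of a related technique, not the pystack workflow itself), the standard library’s faulthandler module can dump the stack of every thread in a running process on demand, without attaching a debugger:

import faulthandler
import signal

# Dump every thread's stack to stderr whenever the process receives SIGUSR1
# (Unix only). This captures a point-in-time view of what the process is doing
# without suspending it under a debugger.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# ... the application keeps running; trigger a dump externally with:
#     kill -USR1 <pid>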

How to do it at scale

None of these approaches are easy to enable, span processes, or show you the full details of the activities, even when disconnected from the process. If you want an unobtrusive way to understand what’s happening across a complex system – including interactions with underlying components – then you need visibility into what’s happening. This is where distributed tracing comes in.

By instrumenting your code with tracing, you can see the full flow of a request as it moves through the system. This can help you understand where the bottlenecks are, where the errors are occurring, and how the system is behaving. This approach is common with tools such as Kubernetes. In general, almost every programming language has some level of support for semantic logging. This means that data can be captured in a structured way that can be easily analyzed. Most systems also support some level of configurability, allowing the tracing level to be increased or decreased as needed.
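As a minimal sketch of what that instrumentation can look like, here is a Python example using the OpenTelemetry SDK (assuming the opentelemetry-api and opentelemetry-sdk packages are installed; the console exporter and the service and span names are illustrative):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that writes finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")  # hypothetical service name

def process_order(order_id: str) -> None:
    # Each span records timing, status, and attributes for one unit of work,
    # and nested spans capture the flow of the request through the code.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("charge_payment"):
            pass  # call the payment provider here

process_order("42")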

What is semantic logging? It’s the idea that logs should be structured and contain enough information to be useful. Look at a traditional log output:

[RUNNER 2024-07-01 11:14:44Z INFO MessageListener] No message retrieved from session 'bdaa13ad-d5f3-4514-b7dc-7be438dbb5b4' within last 30 minutes.

There’s a lot of information there and it’s parsable, but none of the data is readily accessible or queryable. Now look at a structured log output:

{
  "source": "RUNNER",
  "timestamp": "2024-07-01 11:14:44Z",
  "level": "INFO",
  "component": "MessageListener",
  "session": "bdaa13ad-d5f3-4514-b7dc-7be438dbb5b4",
  "message": "No message retrieved from session 'bdaa13ad-d5f3-4514-b7dc-7be438dbb5b4' within last 30 minutes."
}

Notice that the fields are now accessible. If I wanted to find all logs related to a given MessageListener for that session, I could search on those particular fields. When the logs are ingested into a system that makes them searchable, this makes them more actionable, issues more discoverable, and alerts easier to create. If there’s not enough information in the logs to make an accurate determination of the issue, then more logging is required. The logging may be hidden behind “levels” to make it less verbose until it’s needed.
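To make that concrete, here is a sketch of how a log line like the one above could be emitted from Python using only the standard library (the field names mirror the example; the formatter itself is an illustrative assumption):

import json
import logging

class JsonFormatter(logging.Formatter):
    # Render each record as a single JSON object so the fields stay queryable.
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "source": "RUNNER",
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "component": record.name,
            "session": getattr(record, "session", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("MessageListener")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

session_id = "bdaa13ad-d5f3-4514-b7dc-7be438dbb5b4"
logger.info("No message retrieved from session '%s' within last 30 minutes.",
            session_id, extra={"session": session_id})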

This is the primary way that tools like Actions Runner Controller (ARC) capture the details needed to debug and correlate a problem. The ARC team may also rely on the logs of other system components to understand the broader picture of what was happening on a given Kubernetes cluster. By relying on logging and tracing, they can understand the full flow of a request and the interactions with other components. This makes it easier to identify and resolve issues.

Taking this further, if the logs can be correlated by a shared identifier, it becomes even easier to understand what’s happening at scale. This is one of the promises of OpenTelemetry. It unifies the approach to capturing and correlating information. This in turn enables highly distributed systems to be debugged and analyzed, even with thousands or millions of concurrent activities occurring. And all of this can be done on a live system without needing to attach a debugger!
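As a sketch of how that shared identifier can travel between services with OpenTelemetry in Python (the service names and the in-memory header hand-off are illustrative assumptions):

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # exporters omitted for brevity
tracer = trace.get_tracer("frontend")  # hypothetical service name

def call_backend() -> dict:
    with tracer.start_as_current_span("call_backend"):
        headers: dict = {}
        inject(headers)  # adds a W3C 'traceparent' header carrying the trace ID
        # In a real system, these headers travel on the outgoing HTTP request.
        return headers

def handle_request(headers: dict) -> None:
    # The receiving service resumes the same trace, so its spans (and any logs
    # that include the trace ID) correlate with the caller's activity.
    ctx = extract(headers)
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        trace_id = format(span.get_span_context().trace_id, "032x")
        print(f"trace_id={trace_id}")  # include this identifier in every log line

handle_request(call_backend())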

The ultimate hack

The ultimate hack for debugging is to implement observability in your code. By having a solid logging approach (and metrics, when appropriate), it becomes possible to gather detailed information about issues with minimal impact. Of course, that means making sure you take some time to understand the tools that help you to pull apart logs and turn them into actionable data. But that’s a post for another time!