Hunting Lost Treasures: Understanding and Finding Memory Leaks

Week 5 of our 2010 Application Performance Almanac

Searching for memory leaks can easily become an adventure – fighting through a jungle of objects and references. When the leak occurs in production, time is short and you have to act fast. As in a treasure hunt, we have to interpret signs and unravel mysteries to finally find the “lost” memory.

Memory leaks – together with inefficient object creation and incorrect garbage collector configuration – are the top memory problems. While they are a typical runtime problem, their analysis and resolution worry developers. In this post I will therefore focus on how to find and analyze memory problems and provide some insight into the anatomy of memory leaks.

Packing Our Equipment
What do we need for effective memory diagnosis? We need a heap analyzer for analyzing heap content and a console to collect and visualize runtime performance metrics. Then we are well-equipped for our expedition. Which tools you choose depends on the platform you are using, the money you want to spend and your personal preferences. The range goes from tools shipped with the JVM, through open source tools, to professional performance management solutions.

The Heap Dump
A heap dump gives us a snapshot of the JVM’s memory so that we can analyze its content. Heap dumps can be triggered in multiple ways. There are JVM parameters like -XX:+HeapDumpOnOutOfMemoryError which will trigger a heap dump in case of an OutOfMemoryError. Unfortunately this option is not enabled by default; I recommend always switching it on. There is nothing more frustrating than having to reproduce a problem just because you failed to collect all necessary information upfront. Alternatively you can trigger a heap dump while the JVM is running. Tools that do so use the JVM Tool Interface (JVMTI) to retrieve this information from the JVM.
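As a minimal sketch of triggering a dump from inside a running JVM, the following uses the `HotSpotDiagnosticMXBean` that HotSpot-based JVMs expose. This is an assumption about your platform: other JVMs may not provide this bean. The class name `HeapDumper`, the temporary directory and the file name are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: assumes a HotSpot-based JVM that provides HotSpotDiagnosticMXBean.
class HeapDumper {

    // Writes a dump of all reachable objects to the given file and returns its path.
    // The target file must not exist yet; its name should end in ".hprof".
    static Path dumpLiveObjects(Path target) {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            diagnostics.dumpHeap(target.toString(), true); // true = reachable objects only
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    // Convenience wrapper: dumps into a fresh temporary directory (hypothetical location).
    static Path dumpToTempDirectory() {
        try {
            Path dir = Files.createTempDirectory("heapdumps");
            return dumpLiveObjects(dir.resolve("snapshot.hprof"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpToTempDirectory());
    }
}
```

Keep in mind that the dump suspends the application while it is written, so on a production system you would call something like this only deliberately.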

The biggest issue with heap dumps is that their format is not standardized and differs between JVMs. JSR 326 is working on a standardized way to access heap information. Defining a standardized API to access heap dump information should enable the use of a single tool to work with different heap dump formats. If you cannot wait for the JSR to be implemented you have to choose a tool which supports the required formats or use tools like dynaTrace which access the heap information directly and therefore work across JVM implementations.

What We Get
The information within the heap dump may also vary based on the JVM as well as the JVM version. However there is certain information which is contained in every heap dump. We get information about the objects – their classes – as well as references on the heap. Additionally we get information about the size of an object. This size is often referred to as shallow size – the size of the object itself without any referenced data structures. Newer JVMs additionally support collecting the values of primitive fields as well as of objects like Strings or Integers. Some tools also indicate the number of survived Garbage Collection cycles by performing special heap profiling.

Depending on the size of your JVM’s heap, the amount of information can be huge. This affects the heap dump creation time as well as the processing time of the dump; needless to mention that the analysis itself gets more complex. Therefore some tools provide means to collect only the number of objects. While providing less detail, this approach has the advantage of being much faster. By creating a series of dumps over time and then comparing the object counts of the dumps, we can immediately see which objects are growing.

Naturally this will show a lot of primitive data types and a number of classes we might have never seen before because they are internal to the JDK or other libraries we are using. We skip those classes and look for objects of our own classes which grow. This already provides a good indication of a potential memory leak. If we can additionally see the allocation stack of these objects, we might be able to identify the memory leak without even having to analyze a full heap dump.
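The comparison of object counts can be sketched in a few lines. The example assumes that each snapshot has already been reduced to a map from class name to instance count; how you obtain these counts depends on your tool. `HistogramDiff` and the package prefix are hypothetical names.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: compares two class histograms (class name -> instance count).
class HistogramDiff {

    // Returns the classes below packagePrefix whose instance count grew,
    // ordered by growth with the fastest-growing class first.
    static Map<String, Long> growth(Map<String, Long> before,
                                    Map<String, Long> after,
                                    String packagePrefix) {
        List<Map.Entry<String, Long>> grown = new ArrayList<>();
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            long delta = entry.getValue() - before.getOrDefault(entry.getKey(), 0L);
            // Skip JDK and library classes; keep only our own growing classes.
            if (delta > 0 && entry.getKey().startsWith(packagePrefix)) {
                grown.add(new AbstractMap.SimpleImmutableEntry<>(entry.getKey(), delta));
            }
        }
        grown.sort(Map.Entry.<String, Long>comparingByValue().reversed());

        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : grown) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Called with two snapshots and a prefix such as `"com.example."`, it returns only the application classes that gained instances between the snapshots.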

JVM Metrics
In addition to heap dumps we will also use JMX-based memory metrics of the JVM to monitor the heap at runtime. Based on the changes of memory consumption over time we can see whether there is a memory leak at all. Monitoring memory usage of the JVM is essential in any diagnosis process. These metrics should – no, must – be collected by default during load tests and also in production. Relating these metrics to other monitoring data – like the types of requests at a certain time – will also be a good indicator for potential memory problems. While monitoring will not prevent you from running into OutOfMemoryErrors, it can help to proactively resolve performance problems.
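To give an idea of what such a metric looks like at the source, here is a small sketch that reads heap usage from the standard `MemoryMXBean`. The class name `HeapMonitor` and the one-second sampling interval are assumptions for illustration; a real monitoring setup would read the same values remotely via JMX and store them.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Sketch: reads the heap metrics the JVM publishes through its platform MXBeans.
class HeapMonitor {

    // Currently used heap in bytes.
    static long usedHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        return memory.getHeapMemoryUsage().getUsed();
    }

    // Used heap as a fraction of the maximum heap, or -1 if no maximum is defined.
    static double heapUtilization() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 5; i++) {
            System.out.printf("used heap: %d MB (utilization %.2f)%n",
                    usedHeapBytes() / (1024 * 1024), heapUtilization());
            Thread.sleep(1000); // hypothetical sampling interval
        }
    }
}
```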

I recall a customer situation where we saw sudden spikes in heap usage. While they never caused an OutOfMemoryError they still made us feel uncomfortable. We then correlated this information with other monitoring data to find out what was different when the spikes occurred. We realized that there were some bulk processing operations going on. Diagnosing this transaction we found that submitted XML was transformed into a DOM tree for further processing. As processing time depended on the amount of data, these objects potentially stayed in memory for minutes – or longer. The issues could then be fixed, tested and deployed into production without users ever being affected by them.

The only potential shortcoming of monitoring heap usage is that slowly-growing memory leaks might be more difficult to spot. This is especially true if you happen to look at the data at the wrong granularity. In order to overcome this issue I use two different charting intervals: the last 32 days for visualizing long-term trends and the last 72 hours for short-term and more fine-grained information.
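If you prefer a number over eyeballing a chart, one possible approach (my assumption here, not a feature of any particular tool) is to fit a straight line through the heap samples of the long interval and look at its slope. The sketch below assumes samples taken at equal intervals, ideally measured right after a garbage collection; `HeapTrend` is a hypothetical name.

```java
// Sketch: least-squares trend of heap usage samples taken at equal intervals.
class HeapTrend {

    // Returns the slope of the fitted line in bytes per sampling interval.
    // A clearly positive slope over a long window hints at a slow leak.
    static double slopePerInterval(long[] usedHeapSamples) {
        int n = usedHeapSamples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double meanX = (n - 1) / 2.0; // sample indices 0..n-1 serve as x values
        double meanY = 0;
        for (long sample : usedHeapSamples) {
            meanY += sample;
        }
        meanY /= n;

        double covariance = 0;
        double variance = 0;
        for (int x = 0; x < n; x++) {
            covariance += (x - meanX) * (usedHeapSamples[x] - meanY);
            variance += (x - meanX) * (x - meanX);
        }
        return covariance / variance;
    }
}
```

Short spikes like the bulk operations described above largely cancel out in such a fit, while steady growth does not.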

Besides potential memory leaks JVM metrics also help us to spot potential Garbage Collector configuration problems. Our primary metrics are the number and the time of Garbage Collections.
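Both metrics are available through the standard `GarbageCollectorMXBean`s. The following sketch simply sums them over all collectors of the running JVM; the class name `GcStats` is made up, and a monitoring solution would sample these counters periodically and chart the differences.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: sums GC activity across all collectors of the running JVM.
class GcStats {

    // Total number of collections so far; collectors reporting -1 (undefined) are skipped.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds.
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("collections: " + totalCollections()
                + ", time spent: " + totalCollectionMillis() + " ms");
    }
}
```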

Let’s Go Hunting
As I’ve already discussed in another post, memory leaks in Java are not “classical” leaks. As the Garbage Collector automatically frees up unreferenced objects, it has taken this burden away from us. However we as developers have to ensure that all references to objects are freed up if we no longer need them. While this sounds very simple it turns out to be quite difficult in reality.

Looking a bit closer at the problem we realize that it is a specific kind of reference which causes memory leaks. Objects that are referenced only by local variables become eligible for collection once we leave the method scope. So memory leaks are caused by references which exist beyond our current execution scope, like Servlet sessions, caches and any objects stored in static references.

A central concept in understanding the origins of memory leaks is Garbage Collection roots. A GC root is a reference which only has outgoing and no incoming references. Every live object on the heap is reachable from at least one GC root. If an object can no longer be reached from any GC root it is considered unreachable and is ready for Garbage Collection. There are three main types of GC roots:

  • Temporary variables on stack of threads
  • Static fields of classes
  • Native references in JNI

Garbage collection roots and other heap objects
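The reachability rule can be illustrated with a toy model. The sketch below is not how a real collector is implemented; it only assumes that the heap is given as a map from object name to the names of the objects it references. Everything that can be reached from a root is live, everything else is garbage. `Reachability` and the object names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: marks all objects reachable from a set of GC roots in a toy heap model.
class Reachability {

    // references: object name -> names of the objects it references.
    static Set<String> reachable(Map<String, List<String>> references, Set<String> roots) {
        Set<String> marked = new HashSet<>(roots);
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            List<String> targets = references.getOrDefault(current, Collections.emptyList());
            for (String target : targets) {
                if (marked.add(target)) { // true only for objects not seen before
                    pending.push(target);
                }
            }
        }
        return marked;
    }
}
```

An object that still references other objects but is itself not reachable from any root does not keep anything alive.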

A single object however will not cause a memory leak. For the heap to fill up continuously over time we have to add more and more objects over time. Collections are the critical part here as they allow us to grow continuously over time, while holding an ever-increasing number of references. So this means that most memory leaks are caused by collections which are directly or indirectly referenced by static fields.
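A deliberately simplified example of this pattern, with made-up names, looks like this. The map is held by a static field, the key is unique per request, and nothing ever removes an entry:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a typical leak: a collection referenced by a static field that only grows.
class ReportCache {

    // The static field keeps the map, and everything stored in it, reachable.
    private static final Map<String, byte[]> CACHE = new HashMap<>();

    static byte[] render(String requestId) {
        // Bug: requestId is unique per request, so entries are never reused or removed.
        return CACHE.computeIfAbsent(requestId, id -> new byte[1024]);
    }

    static int cachedEntries() {
        return CACHE.size();
    }
}
```

Every call with a new id adds another kilobyte that can never be collected. Typical fixes are bounding the cache, evicting entries, or using a key that actually repeats.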

Enough of theory; let’s look at an example. The figure below shows the reference chain of an HTTP session object – specifically its implementation in Apache Tomcat. The session object is a key in a ConcurrentHashMap which is referenced by the ThreadLocal storage of the Servlet threads. These threads are then kept within a Thread array, which is again part of a ThreadGroup. The ThreadGroup is then referenced by the Thread class itself.

Heap Root Walk of an HTTP Session Object

This shows that most memory problems can be traced back to a specific object on the heap. In this context you will often hear about the concept of dominators or the dominator tree in memory analysis.

The concept of a dominator comes from graph theory and is defined as follows: a node A dominates a node B if every path from the root to B passes through A. For memory management this means that A is a dominator of B if B can only be reached via A. A dominator tree is then a whole tree of objects where this is true for the root object and all objects below it. The image below shows an example of a dominator tree. (You might want to get a coffee now and think about this :-) ).

Dominator Tree Example

In case there are no more references to a dominator object all referenced objects will be freed up as well. Large dominator trees are therefore good candidates for memory leaks.
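For readers who think better in code than in coffee, here is a small sketch of the classic iterative way to compute dominators on a toy object graph. It is meant as an illustration under simplifying assumptions (all objects are reachable from a single root, the graph is tiny), not as the algorithm any particular heap analyzer uses. The class and object names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: iterative dominator computation; assumes every object is reachable from root.
class Dominators {

    // references: object -> objects it references. Returns object -> set of its dominators.
    static Map<String, Set<String>> dominatorSets(Map<String, List<String>> references,
                                                  String root) {
        Set<String> nodes = new LinkedHashSet<>();
        Map<String, List<String>> referrers = new HashMap<>();
        nodes.add(root);
        for (Map.Entry<String, List<String>> entry : references.entrySet()) {
            nodes.add(entry.getKey());
            for (String target : entry.getValue()) {
                nodes.add(target);
                referrers.computeIfAbsent(target, key -> new ArrayList<>()).add(entry.getKey());
            }
        }

        // Start pessimistically: every object is dominated by everything, except the root.
        Map<String, Set<String>> dominators = new HashMap<>();
        for (String node : nodes) {
            dominators.put(node, new HashSet<>(nodes));
        }
        dominators.put(root, new HashSet<>(Collections.singleton(root)));

        boolean changed = true;
        while (changed) {
            changed = false;
            for (String node : nodes) {
                if (node.equals(root)) {
                    continue;
                }
                // dom(node) = {node} plus whatever dominates ALL of its referrers
                Set<String> next = new HashSet<>(nodes);
                for (String referrer : referrers.getOrDefault(node, Collections.emptyList())) {
                    next.retainAll(dominators.get(referrer));
                }
                next.add(node);
                if (!next.equals(dominators.get(node))) {
                    dominators.put(node, next);
                    changed = true;
                }
            }
        }
        return dominators;
    }

    // Objects that would become unreachable if 'owner' were released (owner included).
    static Set<String> retainedBy(Map<String, Set<String>> dominators, String owner) {
        Set<String> retained = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : dominators.entrySet()) {
            if (entry.getValue().contains(owner)) {
                retained.add(entry.getKey());
            }
        }
        return retained;
    }
}
```

An object that is referenced from two independent branches is dominated only by their common ancestor, which is why it does not count towards the retained set of either branch.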

Post Mortem versus Runtime Analysis

When diagnosing memory leaks we can basically follow two approaches. Which one to choose depends mostly on the situation. If we have already run into an OutOfMemoryError, we can only perform a post-mortem analysis, and only if we started our JVM with the JVM argument mentioned above. While this option was added in Java 6, JVM vendors have also back-ported this functionality to older JVM versions. You should check whether your JVM version supports this feature.

The “advantage” of post-mortem memory dumps is that the leak is already contained in the dump and you need not spend a lot of time reproducing it. Especially in case of slowly-appearing memory leaks or problems which occur just in very specific situations, it can become close to impossible to reproduce the problem. Having a dump available right after the error occurred can save a lot of time (and nerves).

The biggest disadvantage – besides crashing a production system – is that you will miss a lot of additional runtime information. The dominator tree however is highly valuable to find the objects responsible for the memory leak more or less easily. This information combined with good knowledge of the source code often helps to resolve the problem.

Alternatively, continuously increasing memory consumption already indicates a memory leak. Well, this does not change the situation that the JVM would crash eventually, but we can already start to search for the leak proactively. Additionally we can prevent users from being affected by the memory leak by restarting the JVM for example.

As creating these heap dumps means that all running threads have to be suspended, it is good advice to redirect user traffic to other JVMs. Very often the collected data will be sufficient for identifying the leak. Additionally we can create a number of snapshots to identify objects growing continuously. Solutions like dynaTrace additionally allow tracking the size of specific collections including information where they have been instantiated. This information very often helps experienced developers to identify the problem without extensive heap analysis.

Size Does Matter
A central factor in heap dump analysis is the heap size. Bigger does not mean better. 64-bit JVMs represent a special challenge here. The huge number of objects results in more data to be dumped. This means that dumps take longer and more space is required for storing the dump output. At the same time analysis of dumps takes longer as well. In particular, algorithms for calculating garbage collection sizes or dynamic sizes of objects show decreasing runtime performance for bigger heaps. Some tools – at least in my experience – already have problems even opening dumps bigger than about 6 GB. The generation of heap dumps also requires memory within the JVM itself. In the worst case this can mean that the generation of a dump is no longer possible at all. The main reason lies within the implementation of the JVMTI heap dump methods.

First every object needs a unique tag. This tag is later used to analyze which objects are referenced by others. The tag is of the JNI type jlong which is 8 bytes in size. On top of that there is also the memory consumption of JVM internal structures. The size of these structures depends on the JVM implementation and can be up to 40 bytes per object. This is why we at dynaTrace specifically focus on supporting the analysis of bigger and bigger heap dumps.
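A quick back-of-the-envelope calculation shows why this matters. The two constants below are the numbers from the text; the object count is a purely hypothetical figure, and the result is an upper bound rather than a measurement.

```java
// Sketch: worst-case additional memory needed for tagging objects during a heap dump.
class DumpOverhead {

    static final long TAG_BYTES = 8;           // jlong tag per object
    static final long MAX_INTERNAL_BYTES = 40; // upper bound for JVM internal structures

    static long worstCaseBytes(long objectCount) {
        return objectCount * (TAG_BYTES + MAX_INTERNAL_BYTES);
    }

    public static void main(String[] args) {
        long objects = 100_000_000L; // hypothetical: 100 million objects on the heap
        System.out.println(worstCaseBytes(objects) + " bytes of overhead in the worst case");
    }
}
```

With 100 million objects this adds up to 4.8 billion bytes, so a heap that is already nearly full may not leave enough room to create the dump at all.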

The general advice is to work with smaller heaps. They are easier to manage and, in case of errors, easier to analyze. Memory problems also show up faster than in large JVMs. If possible it is better to work with a number of smaller JVMs instead of one huge single JVM. If however you have to work with a large JVM it is indispensable to test in advance whether it is possible to analyze a memory dump. A good test is to create a heap dump from a production-sized instance and calculate the GC size of all HTTP sessions. In case you have problems solving this simple problem, you should either upgrade your tooling or decrease your heap size. Otherwise you might end up in a situation where you have no means to diagnose a memory leak in your application.

The best memory leak is the one you do not have. So the best approach is to test for potential memory leaks during development already. The best means are long-running load tests. As our goal is less about getting performance results and more about finding potential problems, we can work with smaller test environments. It might even be enough to have the application and the load generator on the same machine. We should however ensure that we cover all major use cases. Some memory leaks only occur in special situations and are therefore hard to find in testing environments. Still, regularly capturing heap dumps during the test run and comparing them to find growing objects helps to identify potential leaks.

Comparison of Heap Dumps over Time

Memory leaks are amongst the top performance-related problems in application development. At the beginning analysis might look extremely complex. However a proper understanding of the “anatomy” of a memory leak helps to find those problems easily, as they follow common patterns. We however have to ensure that we can work with the information when we need it. This means dumps have to be generated and we must be able to analyze them. Long-term testing also does a good job in finding leaks proactively. Increasing the heap is not a solution at all. It might even make the problem worse. There are a lot of tools out there that support memory analysis, each with its strengths and weaknesses. I might be a bit biased here, but for a general overview of available functionality I recommend looking at memory diagnosis in dynaTrace. It provides a good overview of different approaches towards memory analysis.

This article is based on the performance series I did with Mirko Novakovic of codecentric. Mirko also did a great post on OutOfMemoryErrors!

More Stories By Alois Reitbauer

Alois Reitbauer is Chief Technical Strategist at Dynatrace. He has spent most of his career building monitoring tools and fine-tuning application performance. A regular conference speaker, blogger, author, and sushi maniac, Alois currently shares his professional time between Linz, Boston, and San Francisco.
