Allocation on the JVM: Down the rabbit hole
11 Jul 2016

Let's say we have a simple function that allocates an object:
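The snippet itself was lost in extraction, so here is a representative stand-in that matches the numbers below: a small wrapper object that never escapes the method.

```java
// Illustrative stand-in for the original example (not the author's exact
// code): with compressed oops, a two-int object is 12 bytes of header
// plus 8 bytes of fields, padded out to 24 bytes.
static int addPair(int x, int y) {
    Pair p = new Pair(x, y);
    return p.x + p.y;
}

static final class Pair {
    final int x, y;
    Pair(int x, int y) { this.x = x; this.y = y; }
}
```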
How much memory does this function allocate per call? The number may vary based on the version of your JVM, the hardware, and various settings, but on recent versions of OpenJDK on x86-64 it’ll be 24 bytes (…probably).
Now, more experienced readers will remember a feature of HotSpot known as escape analysis, which can remove allocations of objects that don't escape the function's scope. And indeed, we can see it in action here: the c2 compiler cleans this up into a simple addition:
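The post showed the JITted assembly at this point; roughly speaking, after escape analysis the compiled method behaves as if it had been written like this (illustrative, using the stand-in names from above):

```java
// The Pair allocation is gone entirely; only the addition remains.
static int addPair(int x, int y) {
    return x + y;
}
```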
(If you are interested in reading the JVM's JITted assembly, read this.)
However, if you try to reproduce this behavior in a profiler, you will never see this optimization. That is because escape analysis is turned off when running with most profiling agents, regardless of the setting of DoEscapeAnalysis [1]. And even if it weren't, the agent wouldn't know whether the allocation had been eliminated, because bytecode rewriting isn't capable of detecting it. (See Nitsan Wakart's detailed rundown of this interaction to learn more.)
So what are you to do if you want real-world allocation information? Here lie dragons…
Background: object allocation in c2 JITted code
A normal object allocation in compiled code looks something like this:
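The assembly itself was lost; a typical TLAB fast path on x86-64 looks roughly like this (reconstructed from similar c2 output, so treat the exact registers as illustrative):

```
mov    0x60(%r15),%rax        # load the TLAB top: our new object's address
mov    %rax,%r10
add    $0x18,%r10             # bump by the object size (24 bytes)
cmp    0x70(%r15),%r10        # would we run past the TLAB end?
jae    handle_full_buffer     # yes: take the slow allocation path
mov    %r10,0x60(%r15)        # no: publish the new top
```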
This takes advantage of what is known as a Thread Local Allocation Buffer (TLAB), a per-thread space for allocating objects that requires no synchronization. Here, we are allocating a 24-byte object pointed to by %rax. To demystify this a bit, here is some approximate pseudocode:
```
object_start = buffer_start           // the new object lives at the current top
object_end   = object_start + 24      // bump the pointer by the object size
if object_end > buffer_end            // buffer exhausted?
    goto handle_full_buffer           // slow path: refill the TLAB
buffer_start = object_end             // publish the new top
```
In compiled code, the %r15 register points to the VM-internal Thread structure for the current thread, which holds various per-thread runtime data, and the 0x60 and 0x70 offsets correspond to the top and end fields of the thread's ThreadLocalAllocBuffer, which is where objects are allocated.
Escaping the box
So we know there is some really interesting information stored at 0x60(%r15); can we just read it? Well, with a little JNI magic…
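The Java half of that magic was lost to extraction; it only needs a native method declaration (the class and method names here are my stand-ins for the original's):

```java
public class TlabReader {
    // Returns the current thread's TLAB 'top' pointer as a raw address.
    static native long tlabTop();
}
```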
Backed by a native implementation along these lines:
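This C sketch is a reconstruction under the same assumptions the post describes: %r15 still holds the thread pointer on entry, and 0x60 is this particular build's offset of the TLAB top field. Neither is portable, and the function name matches the hypothetical TlabReader class above.

```c
#include <jni.h>
#include <stdint.h>

JNIEXPORT jlong JNICALL
Java_TlabReader_tlabTop(JNIEnv *env, jclass cls) {
    // Grab %r15 before anything can clobber it: the JNI stub leaves the
    // thread pointer there, and nothing in this tiny function touches it.
    uintptr_t thread;
    __asm__ volatile("mov %%r15, %0" : "=r"(thread));
    // 0x60 is the TLAB 'top' offset on this JVM build.
    return *(jlong *)(thread + 0x60);
}
```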
For steps to generate JNI stubs and compile them, see the Makefile
For all the crazy setup work that the JNI wrappers do, it turns out they do in fact leave the current thread pointer in %r15 when calling our native function. They don't even save the old value, since by the x86-64 calling convention %r15 is a callee-save register (it is our function's job to preserve it).
Once over the initial disgust and horror, this does in fact actually (somewhat) work: we can find the number of bytes allocated by taking the difference between the pointers read before and after the code we are measuring:
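In sketch form, using the same hypothetical names as above:

```java
long before = TlabReader.tlabTop();
Object o = new Object();              // the code under measurement
long after = TlabReader.tlabTop();
// Bytes allocated, assuming the TLAB wasn't refilled in between.
System.out.println(after - before);
```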
Now, we can actually see the effects of the eliminated allocation:
```
$ java -agentpath:levil.so EvilDemo
24
0
```
It's worth remembering that this trick is subject to a very long list of limitations, as we are just assuming that nothing has changed about the TLAB in between the calls (for instance, that it wasn't retired and replaced by a fresh one).
Walking the heap
Knowing the number of bytes allocated is interesting, but on its own it is not super helpful. Wouldn't it be nice to know more about the allocated objects? Well, as long as we keep assuming that nothing about the memory layout changed during the profiling period, knowing the start and end pointers gives us the memory range the objects lie in. How do we get more interesting information out of it? JNI and JVMTI to the rescue, of course! We can get the size of an object via the JVMTI GetObjectSize function:
```c
jvmtiError
GetObjectSize(jvmtiEnv* env,
              jobject object,
              jlong* size_ptr)
```
And get an object’s class via the JNI GetObjectClass function:
```c
jclass GetObjectClass(JNIEnv *env, jobject obj);
```
Yet another round of trickery is required here, though, as a jobject isn't a pointer to the actual object but rather to the handle which contains the object pointer (so that the VM may safely move objects around while native extensions keep references to them). Fortunately for us, the dereferencing functions only do basic sanity checks:
which means passing the address of a pointer on the stack is sufficient to get past them. Now we can iterate over the range by inspecting each object, then bumping the pointer by the object's size (despite warnings to the contrary, OpenJDK's GetObjectSize seems to pretty much always return the expected size, including header and padding). Here is the full proof of concept (link):
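The full PoC is longer than fits here; its core loop is, in condensed and somewhat hypothetical form, along these lines:

```c
#include <jni.h>
#include <jvmti.h>
#include <stdio.h>
#include <stdint.h>

// Walk [start, end), assuming the range holds a contiguous run of freshly
// allocated objects (names and structure are mine; see the linked PoC).
static void walk_range(JNIEnv *env, jvmtiEnv *jvmti,
                       uintptr_t start, uintptr_t end) {
    char *p = (char *)start;
    while ((uintptr_t)p < end) {
        // The address of our local pointer looks enough like a JNI handle
        // (a pointer to an oop) to pass the sanity checks above.
        jobject handle = (jobject)&p;
        jclass klass = (*env)->GetObjectClass(env, handle);

        char *sig = NULL;
        (*jvmti)->GetClassSignature(jvmti, klass, &sig, NULL);

        jlong size = 0;
        (*jvmti)->GetObjectSize(jvmti, handle, &size);
        printf("%s: %ld\n", sig, (long)size);

        (*jvmti)->Deallocate(jvmti, (unsigned char *)sig);
        p += size;  // bump to the next object's header
    }
}
```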
And here is a simple demo that shows the result of the String concatenation optimization (also not observable via standard profilers, and fodder for a future blog post):
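The Demo source lives in the GitHub repo; given the output below, a plausible reconstruction is plain concatenation, which javac compiles into a StringBuilder chain:

```java
// Hypothetical reconstruction of the demo: javac turns the '+' into
// new StringBuilder().append(a).append(b).toString().
public class Demo {
    static String concat(String a, String b) {
        return a + b;
    }
}
```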
Which ends up something like this:
```
$ java -cp target/ -agentpath:target/ldsagent.so Demo
136
Ljava/lang/StringBuilder;: 24
[C: 48
Ljava/lang/String;: 24
[C: 40
64
[C: 40
Ljava/lang/String;: 24
```
Now we can see the improvement from the c2 compile: the StringBuilder incantation gets thrown out in favor of constructing a single character array of the correct length and handing that to the String.
The full source including this demo is on my github.
Takeaways?
- Your profiler is probably being a bit pessimistic about the amount of garbage generated - but by how much is going to depend on the situation.
- Consider this more of an interesting education lesson than a useful tool.
- You definitely shouldn’t try using this anywhere where crashing the JVM would be a problem.
- But if you are interested in testing it with non-trivial cases, try setting MinTLABSize to a big number.
- Read 👏 the 👏 source 👏 code 👏
(Standard shameless plug - if you find this stuff interesting, follow me on twitter)
[1] Specifically, when asking for the JVMTI can_access_local_variables capability (source)