Allocation on the JVM: Down the rabbit hole

11 Jul 2016

Lets say we have a simple function that allocates an object:

public static long demo(long a, long b) {
  return new Long(a+b).longValue();
}

How much memory does this function allocate per call? The number may vary based on the version of your JVM, the hardware, and various settings, but on recent versions of OpenJDK on x86-64 it’ll be 24 bytes (…probably).

Now some more experienced readers will remember a feature of the Hotspot known as Escape Analysis, which can remove some object allocations which don’t escape the function’s scope. And indeed, we can see that in action here: the c2 compiler cleans this up into a simple addition:

# {method} {0x00007f7c99001680} 'demo' '(JJ)J' in 'Test'
# parm0:    rsi:rsi   = long
# parm1:    rdx:rdx   = long
#           [sp+0x20]  (sp of caller)
sub    $0x18,%rsp
mov    %rbp,0x10(%rsp)    #*synchronization entry
                          # - Test::demo@-1 (line 14)

mov    %rsi,%rax
add    %rdx,%rax          #*ladd
                          # - Test::demo@6 (line 14)

add    $0x10,%rsp
pop    %rbp
test   %eax,0x169ae163(%rip)        # 0x00007f7cb3ac2000
                          #   {poll_return}
retq

^{(If you are interested in reading the JVM’s JITted assembly read this)}

However, if you try to reproduce this behavior in a profiler you will never see this optimization. This is because escape analysis is turned off when running with most profiling agents, regardless of the setting of DoEscapeAnalysis ¹. In fact, even if it wasn’t, the agent wouldn’t know whether or not the allocation was eliminated because bytecode rewriting isn’t capable of detecting it. (See Nitsan Wakart’s detailed rundown of this interaction to learn more)

So what is one to do you if you want to know real-world allocation information? Here lie dragons…

Background: object allocation in c2 JITted code

A normal object allocation in compiled code looks something like this:

mov    0x60(%r15),%rax
lea    0x18(%rax),%rdi
cmp    0x70(%r15),%rdi
ja     0x00007f424d19e66f
mov    %rdi,0x60(%r15)

This is taking advantage of what is known as a Thread Local Allocation Buffer, a per-thread space for allocating objects that requires no synchronization. Here, we are allocating a 24 byte object pointed to by %rax. To demystify this a bit, here is some approximate psuedocode:

object_start = buffer_start
object_end = object_start + 24
if object_end > buffer_end
  goto handle_full_buffer
buffer_start = object_end

In compiled code, the %r15 register points to the thread’s Thread class, which holds various relevant running data, and the 0x60 and 0x70 offsets point to the top and end fields of the thread’s ThreadLocalAllocBuffer, which is where objects are allocated.

Escaping the box

So we know there is some really interesting information stored at 0x60(%r15), can we just read it? Well, with a little JNI magic…

public class Evil {
  public static native long readTLABPointer();
}

Backed by

JNIEXPORT jlong JNICALL Java_Evil_readTLABPointer(JNIEnv *env, jclass klass) {
  jlong addr;
  asm("movq 0x60(%%r15), %0;":"=r"(addr)::);
  return addr;
}

^{For steps to generate JNI stubs and compile them, see the Makefile}

For all the crazy setup work that the JNI wrappers do, it turns out that they do in fact pass the current thread pointer in %r15 to our native function. In fact, they don’t even save the old value, since by x86-64 calling convention %r15 is a callee-save register.

Once over the initial disgust and horror, this does in fact actually (somewhat) work - we can find the number of bytes allocated by looking at the difference between the pointers:

public class EvilDemo {
  public EvilDemo() {}
  public static long demo(long a, long b) {
    return new Long(a+b).longValue();
  }
  public static void main(String[] args) {
    // Warmup to avoid counting classloading allocations
    demo(1L, 2L);
    
    long start = Evil.readTLABPointer();
    demo(1L, 2L);
    long end = Evil.readTLABPointer();
    System.out.println(end-start);
    // Hopefully force a c2 compile:
    for(long i = 0; i < 10000000; i++) {
      demo(i, i+1);
    }
    start = Evil.readTLABPointer();
    demo(1L,2L);
    end = Evil.readTLABPointer();
    System.out.println(end-start);
  }
}

Now, we can actually see the effects of the eliminated allocation:

$ java -agentpath:levil.so EvilDemo
24
0

Its worth remembering that this trick is subject to a very long list of limitations, as we are just assuming that nothing has changed about the TLAB in between calls.

Walking the heap

Knowing the number of bytes allocated interesting, but alone is not super helpful. Wouldn’t it be nice to know more about the allocated objects? Well, as long as we keep assuming that nothing about the memory layout has changed during the profiling period, if we know the start and end pointers we know the memory range the objects lie in. How do we get more interesting information? JNI & JVMTI to the rescue, of course! We can get the size of an object via the JVMTI GetObjectSize function:

jvmtiError
GetObjectSize(jvmtiEnv* env,
            jobject object,
            jlong* size_ptr)

And get an object’s class via the JNI GetObjectClass function:

jclass GetObjectClass(JNIEnv *env, jobject obj);

Yet another round of trickery is required here though, as jobject isn’t a pointer to the actual object, but rather to the handle which contains the object pointer (so the VM may safely move objects around while native extensions keep references to them). Fortunately for us, the dereferencing functions only do basic sanity checks:

inline oop JNIHandles::resolve_non_null(jobject handle) {
  assert(handle != NULL, "JNI handle should not be null");
  oop result = *(oop*)handle;
  assert(result != NULL, "Invalid value read from jni handle");
  assert(result != badJNIHandle, "Pointing to zapped jni handle area");
  // Don't let that private _deleted_handle object escape into the wild.
  assert(result != deleted_handle(), "Used a deleted global handle.");
  return result;
};

which means passing an address of the pointer on the stack is sufficient to bypass this. Now we can iterate over the range by inspecting objects, then bumping the pointer by the object’s size (despite warnings to the contrary, OpenJDK’s GetObjectSize seems to pretty much always return the expected size including header and buffer space). Here is the full proof of concept (link):

#include <assert.h>
#include <jvmti.h>
#include <stdio.h>

jvmtiEnv* jvmti;
char* start = 0;
short first = 1;

JNIEXPORT void JNICALL Java_is_jcdav_darkseer_DarkSeer_start(JNIEnv *env,
    jclass klass) {
  asm("movq 0x60(%%r15), %0;":"=r"(start)::);
}

JNIEXPORT void JNICALL Java_is_jcdav_darkseer_DarkSeer_end(JNIEnv *env,
    jclass klass) {
  //To avoid printing classloading allocations from the static init, skip printing
  if (first) {
    first = 0;
    return;
  }
  char* end;
  asm("movq 0x60(%%r15), %0;":"=r"(end)::);
  long allocated = (long)end - (long)start;
  printf("%ld\n", allocated);

  char* current = start;
  while (end > current) {
    jclass objKlass = (*env)->GetObjectClass(env, (jobject)&current);
    jlong size = -1;
    (*jvmti)->GetObjectSize(jvmti, (jobject)&current, &size);

    char* signature;
    char* generic_signature;
    (*jvmti)->GetClassSignature(jvmti, objKlass, &signature,
      &generic_signature);
    printf("%s: %ld\n", signature, size);
    (*jvmti)->Deallocate(jvmti, signature);
    (*jvmti)->Deallocate(jvmti, generic_signature);
    assert(size > 0 && size % 8 == 0); // this assert is not portable
    current += size;
  }
}

JNIEXPORT jint JNICALL Agent_OnLoad(JavaVM *vm, char *options, void *reserved) {
  return (*vm)->GetEnv(vm, (void**)&jvmti, JVMTI_VERSION_1_0);
}

And a simple demo that shows the result of the String concatenation optimization (also not observable via standard profilers, and fodder for a future blog post):

import is.jcdav.darkseer.DarkSeer;

public class Demo {
  public static String test(String s1, String s2) {
    return s1 + s2;
  }

  public static void main(String[] args) {
    test("hello", "world");
    DarkSeer.start();
    test("hello", "world");
    DarkSeer.end();
    for(int i = 0; i < 500000; i++) {
      test("hello", "world");
    }
    DarkSeer.start();
    test("hello", "world");
    DarkSeer.end();
  }
}

Which ends up something like this:

java -cp target/ -agentpath:target/ldsagent.so Demo
136
Ljava/lang/StringBuilder;: 24
[C: 48
Ljava/lang/String;: 24
[C: 40
64
[C: 40
Ljava/lang/String;: 24

Now we can see the improvements from the c2 compile: the StringBuilder incantation gets thrown out in favor of just constructing a single characater array of the correct length and giving that to the String.

The full source including this demo is on my github.

Takeaways?

Your profiler is probably being a bit pessimistic about the amount of garbage generated - but by how much is going to depend on the situation.
Consider this more of an interesting education lesson than a useful tool.
You definitely shouldn’t try using this anywhere where crashing the JVM would be a problem.
But if you are interested in testing it with non-trivial cases, try setting MinTLABSize to a big number.
Read 👏 the 👏 source 👏 code 👏

(Standard shameless plug - if you find this stuff interesting, follow me on twitter)

^{1. Specifically, when asking for the JVMTI can_access_local_variables capability (source)}

Jackson Davis

Allocation on the JVM: Down the rabbit hole

Background: object allocation in c2 JITted code

Escaping the box

Walking the heap