How the JVM compares your strings using the craziest x86 instruction you've never heard of

We’ve all probably seen Java’s String comparison function before. It compares strings by the first differing character, falling back to the length difference when they are identical up to the end of the shorter string:

public int compareTo(String anotherString) {
    int len1 = value.length;
    int len2 = anotherString.value.length;
    int lim = Math.min(len1, len2);
    char v1[] = value;
    char v2[] = anotherString.value;
	
    int k = 0;
    while (k < lim) {
        char c1 = v1[k];
        char c2 = v2[k];
        if (c1 != c2) {
            return c1 - c2;
        }
        k++;
    }
    return len1 - len2;
}
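To make that contract concrete, here are a few calls with the values the loop above produces. The class name CompareToExamples is just a placeholder for this sketch.

```java
class CompareToExamples {
    public static void main(String[] args) {
        // First differing character decides: 'c' - 'd'
        System.out.println("abc".compareTo("abd"));   // -1
        // Identical up to the shorter string: 3 - 5
        System.out.println("abc".compareTo("abcde")); // -2
        // Lengths are ignored once a character differs: 'b' - 'a'
        System.out.println("b".compareTo("apple"));   // 1
        System.out.println("abc".compareTo("abc"));   // 0
    }
}
```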

But did you know there is also a secret second implementation? String.compareTo is one of a few methods that are important enough to also get a special hand-rolled assembly version. On my machine, it looks something like this:

# {method} 'compare' '(Ljava/lang/String;Ljava/lang/String;)I' in 'Test'
# parm0:    rsi:rsi   = 'java/lang/String'
# parm1:    rdx:rdx   = 'java/lang/String'
#           [sp+0x20]  (sp of caller)
7fe3ed1159a0: mov    %eax,-0x14000(%rsp)
7fe3ed1159a7: push   %rbp
7fe3ed1159a8: sub    $0x10,%rsp        
7fe3ed1159ac: mov    0x10(%rsi),%rdi  
7fe3ed1159b0: mov    0x10(%rdx),%r10
7fe3ed1159b4: mov    %r10,%rsi
7fe3ed1159b7: add    $0x18,%rsi
7fe3ed1159bb: mov    0x10(%r10),%edx
7fe3ed1159bf: mov    0x10(%rdi),%ecx
7fe3ed1159c2: add    $0x18,%rdi
7fe3ed1159c6: mov    %ecx,%eax
7fe3ed1159c8: sub    %edx,%ecx
7fe3ed1159ca: push   %rcx
7fe3ed1159cb: cmovle %eax,%edx
7fe3ed1159ce: test   %edx,%edx
7fe3ed1159d0: je     0x00007fe3ed115a6f
7fe3ed1159d6: movzwl (%rdi),%eax
7fe3ed1159d9: movzwl (%rsi),%ecx
7fe3ed1159dc: sub    %ecx,%eax
7fe3ed1159de: jne    0x00007fe3ed115a72
7fe3ed1159e4: cmp    $0x1,%edx
7fe3ed1159e7: je     0x00007fe3ed115a6f
7fe3ed1159ed: cmp    %rsi,%rdi
7fe3ed1159f0: je     0x00007fe3ed115a6f
7fe3ed1159f6: mov    %edx,%eax
7fe3ed1159f8: and    $0xfffffff8,%edx
7fe3ed1159fb: je     0x00007fe3ed115a4f
7fe3ed1159fd: lea    (%rdi,%rax,2),%rdi
7fe3ed115a01: lea    (%rsi,%rax,2),%rsi
7fe3ed115a05: neg    %rax
7fe3ed115a08: vmovdqu (%rdi,%rax,2),%xmm0
7fe3ed115a0d: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
7fe3ed115a14: jb     0x00007fe3ed115a40
7fe3ed115a16: add    $0x8,%rax
7fe3ed115a1a: sub    $0x8,%rdx
7fe3ed115a1e: jne    0x00007fe3ed115a08
7fe3ed115a20: test   %rax,%rax
7fe3ed115a23: je     0x00007fe3ed115a6f
7fe3ed115a25: mov    $0x8,%edx
7fe3ed115a2a: mov    $0x8,%eax
7fe3ed115a2f: neg    %rax
7fe3ed115a32: vmovdqu (%rdi,%rax,2),%xmm0
7fe3ed115a37: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
7fe3ed115a3e: jae    0x00007fe3ed115a6f
7fe3ed115a40: add    %rax,%rcx
7fe3ed115a43: movzwl (%rdi,%rcx,2),%eax
7fe3ed115a47: movzwl (%rsi,%rcx,2),%edx
7fe3ed115a4b: sub    %edx,%eax
7fe3ed115a4d: jmp    0x00007fe3ed115a72
7fe3ed115a4f: mov    %eax,%edx
7fe3ed115a51: lea    (%rdi,%rdx,2),%rdi
7fe3ed115a55: lea    (%rsi,%rdx,2),%rsi
7fe3ed115a59: dec    %edx
7fe3ed115a5b: neg    %rdx
7fe3ed115a5e: movzwl (%rdi,%rdx,2),%eax
7fe3ed115a62: movzwl (%rsi,%rdx,2),%ecx
7fe3ed115a66: sub    %ecx,%eax
7fe3ed115a68: jne    0x00007fe3ed115a72
7fe3ed115a6a: inc    %rdx
7fe3ed115a6d: jne    0x00007fe3ed115a5e
7fe3ed115a6f: pop    %rax
7fe3ed115a70: jmp    0x00007fe3ed115a73
7fe3ed115a72: pop    %rcx
7fe3ed115a73: add    $0x10,%rsp
7fe3ed115a77: pop    %rbp
7fe3ed115a78: test   %eax,0x17ed6582(%rip)
7fe3ed115a7e: retq

The code that generates this, MacroAssembler::string_compare in macroAssembler_x86.cpp, is well-documented for the curious. It’s worth noting that there is an even fancier version for modern systems using AVX2 (with its 256-bit vector registers) that I’m not going to cover here.

PCMPESTRIwhat?

Introduced in SSE4.2, pcmpestri is a member of the pcmpxstrx family of vectorized string comparison instructions. With a control byte to specify options for their complex functionality, they are complicated enough to get their own subsection in the x86 Instruction Set Reference. Intel even provides a flow diagram for our viewing pleasure:

Now that’s really putting the C in CISC!

The option bits for the control byte are specified as follows:

-------0b 128-bit sources treated as 16 packed bytes.
-------1b 128-bit sources treated as 8 packed words.
------0-b Packed bytes/words are unsigned.
------1-b Packed bytes/words are signed.
----00--b Mode is equal any.
----01--b Mode is ranges.
----10--b Mode is equal each.
----11--b Mode is equal ordered.
---0----b IntRes1 is unmodified.
---1----b IntRes1 is negated (1’s complement).
--0-----b Negation of IntRes1 is for all 16 (8) bits.
--1-----b Negation of IntRes1 is masked by reg/mem validity.
-0------b Index of the least significant, set, bit is used
          (regardless of corresponding input element validity).
          IntRes2 is returned in least significant bits of XMM0.
-1------b Index of the most significant, set, bit is used
          (regardless of corresponding input element validity).
          Each bit of IntRes2 is expanded to byte/word.
0-------b This bit currently has no defined effect, should be 0.
1-------b This bit currently has no defined effect, should be 0.

If you want to learn more, Section 4.1 of the Instruction Set Reference covers these options in detail.
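If reading bit patterns off a table isn’t your idea of fun, a few lines of Java can do it for you. This is only a sketch that follows the table above; ControlByte and describe are names I made up for illustration, not anything from the JVM or Intel.

```java
class ControlByte {
    // Aggregation modes selected by bits 3:2 of the control byte.
    static final String[] MODES = {"equal any", "ranges", "equal each", "equal ordered"};

    // Turns an immediate control byte into a readable summary of its option bits.
    static String describe(int imm8) {
        boolean words = (imm8 & 0x01) != 0;   // bit 0: bytes vs words
        boolean signed = (imm8 & 0x02) != 0;  // bit 1: signedness
        String mode = MODES[(imm8 >> 2) & 0x3];
        boolean negate = (imm8 & 0x10) != 0;  // bit 4: negate IntRes1
        boolean masked = (imm8 & 0x20) != 0;  // bit 5: negation masked by validity
        boolean msb = (imm8 & 0x40) != 0;     // bit 6: which set bit becomes the index
        String polarity = !negate ? "not negated" : (masked ? "masked negation" : "negated");
        return (words ? "8 words" : "16 bytes") + ", "
                + (signed ? "signed" : "unsigned") + ", "
                + mode + ", " + polarity + ", "
                + (msb ? "most" : "least") + " significant index";
    }

    public static void main(String[] args) {
        // The control byte used by vpcmpestri in the listing above.
        System.out.println(describe(0x19));
        // prints "8 words, unsigned, equal each, negated, least significant index"
    }
}
```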

compareTo uses 0x19, which means doing the “equal each” (aka string comparison) operation across 8 unsigned words (thanks UTF-16!) with a negated result, so the set bits of IntRes2 mark the positions where the strings differ. This monster of an instruction takes in 4 registers of input: the 2 strings themselves as parameters, plus their lengths in %rax and %rdx (‘e’ meaning explicit length - pcmpistri & pcmpistrm instead look for terminating nulls). The result (the index generated from IntRes2) is placed in %ecx. And just in case that wasn’t enough, the pcmpxstrx instructions also repurpose the arithmetic flags:

CFlag – Reset if IntRes2 is equal to zero, set otherwise
ZFlag – Set if absolute-value of EDX is < 16 (8), reset otherwise
SFlag – Set if absolute-value of EAX is < 16 (8), reset otherwise
OFlag – IntRes2[0]
AFlag – Reset
PFlag – Reset

With all of that out of the way, let’s look at the main loop in detail, with some setup before it for context:

7fe3ed1159f6: mov    %edx,%eax
7fe3ed1159f8: and    $0xfffffff8,%edx
7fe3ed1159fd: lea    (%rdi,%rax,2),%rdi
7fe3ed115a01: lea    (%rsi,%rax,2),%rsi
7fe3ed115a05: neg    %rax
7fe3ed115a08: vmovdqu (%rdi,%rax,2),%xmm0
7fe3ed115a0d: vpcmpestri $0x19,(%rsi,%rax,2),%xmm0
7fe3ed115a14: jb     0x00007fe3ed115a40
7fe3ed115a16: add    $0x8,%rax
7fe3ed115a1a: sub    $0x8,%rdx
7fe3ed115a1e: jne    0x00007fe3ed115a08

Going in, %rax is the minimum of the strings’ lengths, and %rdx is that minimum masked by ~0x7 (so 8 times the number of loop iterations). It then bumps the pointers into the character arrays (%rsi and %rdi) by %rax characters and negates %rax, so the main loop indexes with a negative offset from the end that counts up towards zero. After loading 8 characters of the first string into %xmm0, it does the comparison against 8 characters of the second, jumping out if CFlag is set (which means the index of the differing character is in %ecx). Otherwise it adjusts the 2 length registers and checks whether this was the last iteration (which would make %rdx 0). How does a negative number make a valid length? Oops, almost forgot to mention that pcmpestri actually considers the lengths to be the absolute value:

The length of each input is interpreted as being the absolute-value of the value in
the length register.
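To pin down what one of these comparisons actually computes, here is a rough Java model of pcmpestri with control byte 0x19. It is a sketch of my reading of the Instruction Set Reference, not code from the JVM: the class and method names are invented, the return value stands in for %ecx, and a return of 8 corresponds to CFlag being clear. Lanes past a string’s (absolute-valued, capped at 8) length count as invalid; two invalid lanes compare as equal, one valid and one invalid lane as different.

```java
class PcmpestriModel {
    static final int LANES = 8; // 128 bits / 16-bit words

    // Models "equal each" with a negated result: returns the index of the
    // first lane where the inputs differ, or 8 if there is none (CFlag clear).
    // eax and edx play the role of the two length registers.
    static int equalEachNegated(char[] a, int aOff, int eax,
                                char[] b, int bOff, int edx) {
        // Lengths are taken as absolute values and saturate at the lane count.
        int len1 = (int) Math.min(Math.abs((long) eax), LANES);
        int len2 = (int) Math.min(Math.abs((long) edx), LANES);
        for (int i = 0; i < LANES; i++) {
            boolean valid1 = i < len1;
            boolean valid2 = i < len2;
            boolean equal;
            if (valid1 && valid2) {
                equal = a[aOff + i] == b[bOff + i];
            } else {
                // Both past the end: forced equal. Only one past the end: forced different.
                equal = !valid1 && !valid2;
            }
            if (!equal) {
                return i; // least significant set bit of the negated result
            }
        }
        return LANES;
    }

    public static void main(String[] args) {
        char[] x = "abcdefgh".toCharArray();
        char[] y = "abcdeXgh".toCharArray();
        // A negative length behaves exactly like its positive counterpart.
        System.out.println(equalEachNegated(x, 0, -8, y, 0, 8)); // 5
    }
}
```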

Following the main loop, there is a fallthrough case to check the remaining characters when the minimum length isn’t a multiple of 8 (it simply re-runs the comparison on the last 8 characters, overlapping some that were already checked), and then the final case of diffing the lengths when the strings are identical up to the shorter string’s length. Phew!
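Putting the pieces together, the overall control flow of the listing can be sketched in plain Java. This is my own approximation, assuming the structure I read out of the disassembly: a first-character check, a scalar loop for strings shorter than 8, blocks of 8, an overlapping tail, then the length difference. BlockedCompare and firstMismatch8 are hypothetical names, with firstMismatch8 standing in for vpcmpestri with both lengths saturated at 8.

```java
class BlockedCompare {
    static final int LANES = 8;

    // Stand-in for vpcmpestri $0x19 on two full 8-character blocks:
    // index of the first differing lane, or 8 if the blocks are equal.
    static int firstMismatch8(char[] a, int aOff, char[] b, int bOff) {
        for (int i = 0; i < LANES; i++) {
            if (a[aOff + i] != b[bOff + i]) {
                return i;
            }
        }
        return LANES;
    }

    static int compare(char[] v1, char[] v2) {
        int lenDiff = v1.length - v2.length;
        int lim = Math.min(v1.length, v2.length);
        if (lim == 0) return lenDiff;
        // The first character is checked on its own before anything else.
        if (v1[0] != v2[0]) return v1[0] - v2[0];
        if (lim == 1 || v1 == v2) return lenDiff;
        if (lim < LANES) {
            // Too short for a vector compare: scalar loop from index 1.
            for (int k = 1; k < lim; k++) {
                if (v1[k] != v2[k]) return v1[k] - v2[k];
            }
            return lenDiff;
        }
        // Main loop: whole blocks of 8 characters.
        int blocks = lim & ~(LANES - 1);
        for (int k = 0; k < blocks; k += LANES) {
            int idx = firstMismatch8(v1, k, v2, k);
            if (idx < LANES) return v1[k + idx] - v2[k + idx];
        }
        if (blocks < lim) {
            // Tail: compare the last 8 characters again, overlapping the
            // final full block instead of handling leftovers one by one.
            int tail = lim - LANES;
            int idx = firstMismatch8(v1, tail, v2, tail);
            if (idx < LANES) return v1[tail + idx] - v2[tail + idx];
        }
        return lenDiff;
    }

    public static void main(String[] args) {
        String a = "abcdefghijk", b = "abcdefghijX";
        // Both lines should print the same value.
        System.out.println(compare(a.toCharArray(), b.toCharArray()));
        System.out.println(a.compareTo(b));
    }
}
```

The overlapping tail is safe because any character it re-checks has already been found equal, so the first mismatch it reports is always in the not-yet-compared part.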

More matching fun

If this wasn’t complicated enough for you, have a quick gander at the indexOf implementations (there are 2, depending on the size of the matching string), which use control byte 0x0d, which does “equal ordered” (aka substring) matching.
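For a taste of what “equal ordered” means, here is a small Java model of a single pcmpestri with control byte 0x0d, again based on my reading of the Instruction Set Reference rather than on the JVM source. The names are invented, and I am assuming the needle sits in the register operand and the haystack in memory. The return value plays the role of %ecx: the first position in the 8-character window where the needle matches, or 8 if there is none. Note the quirk that a needle running off the end of a full window still counts as a match, so a caller has to verify the rest.

```java
class EqualOrderedModel {
    static final int LANES = 8;

    // Models "equal ordered" on 8 unsigned words: for each window position j,
    // every valid needle lane i (with i + j inside the window) must equal
    // haystack lane i + j. Invalid needle lanes always pass; a valid needle
    // lane against an invalid haystack lane always fails.
    static int equalOrderedIndex(char[] needle, int needleLen,
                                 char[] hay, int hayOff, int hayLen) {
        int n = (int) Math.min(Math.abs((long) needleLen), LANES);
        int h = (int) Math.min(Math.abs((long) hayLen), LANES);
        for (int j = 0; j < LANES; j++) {
            boolean match = true;
            for (int i = 0; i + j < LANES && match; i++) {
                if (i >= n) {
                    break; // rest of the needle is invalid: forced true
                }
                if (i + j >= h) {
                    match = false; // haystack ended before the needle did
                } else {
                    match = needle[i] == hay[hayOff + j + i];
                }
            }
            if (match) {
                return j;
            }
        }
        return LANES;
    }

    public static void main(String[] args) {
        char[] hay = "abcdefgh".toCharArray();
        System.out.println(equalOrderedIndex("cd".toCharArray(), 2, hay, 0, 8)); // 2
        // Partial match at the end of a full window: only 'h' was checked.
        System.out.println(equalOrderedIndex("hi".toCharArray(), 2, hay, 0, 8)); // 7
    }
}
```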

As always, if you are crazy enough to find weird JVM internals interesting, you should totally follow me on Twitter.