Tuesday, March 4, 2008

JNI Made Simple

I want to measure the performance of Java vector to compare it with STL. Java of course lacks a really high resolution timer, so I needed to use JNI - the Java native interface.

After reading Sun's spec here: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/jniTOC.html, I decided that there needs to be a simplified version. So here it goes. A simple, 3-step recipe.

(1) Create a DLL. Here's an example of a DLL that contains one function that does nothing:
nop.cxx:

__declspec(naked)
unsigned __int64 __cdecl nop(void) {
__asm {
ret
}
}

nop.def:
LIBRARY "nop"
EXPORTS
Java_Test_nop = nop
Notice the decorated name. Java_ is required for all, Test is the name of the class, "nop" is the name of the function. Straightforward.

(2) Compile this using Visual Studio, and copy the resulting nop.dll to the root directory of your Java project.

(3) Call it from Java:
class Test {
native static void nop();

static {
System.loadLibrary("nop");
}

public static void main(String[] args) {
nop();
}
}
That's all there is to it. Let's now look at how fast the function call is actually made. To do this, I am going to use the same framework I used for testing STL's vector performance: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html, except translated to Java.

We'll need 3 functions in the native library though - one that does nothing (this is the function we're going to test), the other will be returning the results of Pentium's RDTSC instruction that measures the clock count, and the third one will be fixing the thread's affinity to the first core, because if this is not done and thread jumps the core while executing, we won't be able to compare the results of RDTSC, since clock counters on multiple cores is not synchronized.

Here's how the native code looks like:
jnipc.cxx:
#include <windows.h>

__declspec(naked)
unsigned __int64 __cdecl rdtsc(void) {
__asm {
rdtsc
ret
}
}

__declspec(naked)
unsigned __int64 __cdecl nop(void) {
__asm {
ret
}
}

void __cdecl fixthread(void) {
SetThreadAffinityMask(GetCurrentThread(), 1);
}


jnipc.def:
LIBRARY "jnipc"
EXPORTS
Java_Test_rdtsc = rdtsc
Java_Test_nop = nop
Java_Test_fixthread = fixthread
The Java code is more involved, but less complicated :-):
import java.util.Arrays;

class Test {
native static long rdtsc();
native static void nop();
native static void fixthread();

static {
System.loadLibrary("jnipc");
}

static final int ITERATIONS = 100000;
static final int SAMPLES = 1000;
static final int BUCKETS = 20;
static final int BUCKET_SCALE = 10000;

public static void main(String[] args) {
fixthread();
long [] clocks = new long[SAMPLES];
for (int i = 0; i < SAMPLES; ++i) {
long start = rdtsc();
for (int j = 0; j < ITERATIONS; ++j)
nop();
long stop = rdtsc();
clocks[i] = stop - start;
}

Arrays.sort(clocks);
double average = 0.0;

long [] buckets = new long[BUCKETS];

double log2 = Math.log(2.0);

for (int i = 0 ; i < SAMPLES; ++i) {
average += clocks[i];
int ndx = (int)(Math.log((double)clocks[i]
/ (double)BUCKET_SCALE) / log2);
if (ndx < 0)
buckets[0]++;
else if (ndx < BUCKETS-1)
buckets[ndx + 1]++;
else
buckets[BUCKETS-1]++;
}

average /= SAMPLES;

System.out.println("Median for " +
ITERATIONS + " iterations: " +
clocks[SAMPLES / 2]);
System.out.println("Average for " +
ITERATIONS + " iterations: " +
average);
System.out.println("Median per iteration: "
+ clocks[SAMPLES / 2] / ITERATIONS);
System.out.println("Average per iteration: "
+ average / ITERATIONS);

System.out.println("Histogram:");
int prev = 0;
int curr = BUCKET_SCALE;
for (int i = 0; i < BUCKETS - 1; ++i) {
System.out.println(prev + " - " + curr
+ ": " + buckets[i]);
prev = curr;
curr *= 2;
}

System.out.println(" > " + prev + ": " +
buckets[BUCKETS - 1]);

}
}
On my T2600 Core 2 Duo laptop this ran in approximately 38 clocks per iteration.
Median for 100000 iterations: 3815903
Average for 100000 iterations: 3854832.709
Median per iteration: 38
Average per iteration: 38.54832709
Histogram:
0 - 10000: 0
10000 - 20000: 0
20000 - 40000: 0
40000 - 80000: 0
80000 - 160000: 0
160000 - 320000: 0
320000 - 640000: 0
640000 - 1280000: 0
1280000 - 2560000: 0
2560000 - 5120000: 998
5120000 - 10240000: 2
10240000 - 20480000: 0
20480000 - 40960000: 0
40960000 - 81920000: 0
81920000 - 163840000: 0
163840000 - 327680000: 0
327680000 - 655360000: 0
655360000 - 1310720000: 0
1310720000 - -1673527296: 0
> -1673527296: 0


C:\bin>java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode, sharing)
To compare it with C, I've added the following case to my perf program in http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html:
case 4:
printf("Function call\n");
HMODULE hm = LoadLibraryW(L"c:\\bin\\jnipc.dll");
void (*func)() = (void (*)())
GetProcAddress(hm, "Java_Test_nop");
for (int i = 0; i < NUM_SAMPLES; ++i) {
unsigned __int64 begin = rdtsc();
for (int j = 0; j < NUM_ITERATIONS; ++j)
func();
unsigned __int64 end = rdtsc();
clocks[cptr++] =
(unsigned int)(end - begin);
}
...and got approximately 6 clocks per call, making JNI about 6 times slower on this simple test:
Function call
Median: 60112
Median, CPI: 6
Average: 60262
Average, CPI: 6
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 998
64000 - 128000: 1
128000 - 256000: 1
256000 - 512000: 0
512000 - 1024000: 0
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0

5 comments:

Anonymous said...

Since you tested it with C++'s vector, you should also provide a test with a slightly more convenient Java container, java.util.Vector for example.

Egor Pasko said...

I am sorry, but the whole idea of the post does not make sense.

1. JNI transition is slow, a JVM makes a lot of extra work to make a JNI call (interning live references for GC, maintaining Java env, etc), so, you won't get a high-res timer

2. benchmarks do not need high-res timers, just run the same stuff many times and divide. Timer overhead will be negligible.

Sergey Solyanik said...

Egor,

The reason I use rdtsc is because it's the simplest way to measure things in CPU clocks which are more universal than time - for example, the output of this counter does not depend on CPU's clock rate.

As far as JNI being slow, the benchmark does not show it. 38 vs. 6 does not point to a lot of work being done - perhaps an attempt to marshal an empty arguments list, but not much more than that.

Egor Pasko said...

Sergey,

the output of this counter _depends_ on the clock rate of your memory bus.

solex said...

Hi!

I'm sorry to leave personal comment to professional post, but I wasn't able to find your personal email.

My last name is Solyanik too and I'm currently very interested in history of this surname.

If you will be so kind to answer to several questions, please write me a message: alex at draft.com.ua