After reading Sun's spec here: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/jniTOC.html, I decided that there needs to be a simplified version. So here it goes. A simple, 3-step recipe.
(1) Create a DLL. Here's an example of a DLL that contains one function that does nothing:
nop.cxx:Notice the decorated name. Java_ is required for all, Test is the name of the class, "nop" is the name of the function. Straightforward.
__declspec(naked)
unsigned __int64 __cdecl nop(void) {
__asm {
ret
}
}
nop.def:
LIBRARY "nop"
EXPORTS
Java_Test_nop = nop
(2) Compile this using Visual Studio, and copy the resulting nop.dll to the root directory of your Java project.
(3) Call it from Java:
class Test {That's all there is to it. Let's now look at how fast the function call is actually made. To do this, I am going to use the same framework I used for testing STL's vector performance: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html, except translated to Java.
native static void nop();
static {
System.loadLibrary("nop");
}
public static void main(String[] args) {
nop();
}
}
We'll need 3 functions in the native library though - one that does nothing (this is the function we're going to test), the other will be returning the results of Pentium's RDTSC instruction that measures the clock count, and the third one will be fixing the thread's affinity to the first core, because if this is not done and thread jumps the core while executing, we won't be able to compare the results of RDTSC, since clock counters on multiple cores is not synchronized.
Here's how the native code looks like:
jnipc.cxx:The Java code is more involved, but less complicated :-):
#include <windows.h>
__declspec(naked)
unsigned __int64 __cdecl rdtsc(void) {
__asm {
rdtsc
ret
}
}
__declspec(naked)
unsigned __int64 __cdecl nop(void) {
__asm {
ret
}
}
void __cdecl fixthread(void) {
SetThreadAffinityMask(GetCurrentThread(), 1);
}
jnipc.def:
LIBRARY "jnipc"
EXPORTS
Java_Test_rdtsc = rdtsc
Java_Test_nop = nop
Java_Test_fixthread = fixthread
import java.util.Arrays;On my T2600 Core 2 Duo laptop this ran in approximately 38 clocks per iteration.
class Test {
native static long rdtsc();
native static void nop();
native static void fixthread();
static {
System.loadLibrary("jnipc");
}
static final int ITERATIONS = 100000;
static final int SAMPLES = 1000;
static final int BUCKETS = 20;
static final int BUCKET_SCALE = 10000;
public static void main(String[] args) {
fixthread();
long [] clocks = new long[SAMPLES];
for (int i = 0; i < SAMPLES; ++i) {
long start = rdtsc();
for (int j = 0; j < ITERATIONS; ++j)
nop();
long stop = rdtsc();
clocks[i] = stop - start;
}
Arrays.sort(clocks);
double average = 0.0;
long [] buckets = new long[BUCKETS];
double log2 = Math.log(2.0);
for (int i = 0 ; i < SAMPLES; ++i) {
average += clocks[i];
int ndx = (int)(Math.log((double)clocks[i]
/ (double)BUCKET_SCALE) / log2);
if (ndx < 0)
buckets[0]++;
else if (ndx < BUCKETS-1)
buckets[ndx + 1]++;
else
buckets[BUCKETS-1]++;
}
average /= SAMPLES;
System.out.println("Median for " +
ITERATIONS + " iterations: " +
clocks[SAMPLES / 2]);
System.out.println("Average for " +
ITERATIONS + " iterations: " +
average);
System.out.println("Median per iteration: "
+ clocks[SAMPLES / 2] / ITERATIONS);
System.out.println("Average per iteration: "
+ average / ITERATIONS);
System.out.println("Histogram:");
int prev = 0;
int curr = BUCKET_SCALE;
for (int i = 0; i < BUCKETS - 1; ++i) {
System.out.println(prev + " - " + curr
+ ": " + buckets[i]);
prev = curr;
curr *= 2;
}
System.out.println(" > " + prev + ": " +
buckets[BUCKETS - 1]);
}
}
Median for 100000 iterations: 3815903To compare it with C, I've added the following case to my perf program in http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html:
Average for 100000 iterations: 3854832.709
Median per iteration: 38
Average per iteration: 38.54832709
Histogram:
0 - 10000: 0
10000 - 20000: 0
20000 - 40000: 0
40000 - 80000: 0
80000 - 160000: 0
160000 - 320000: 0
320000 - 640000: 0
640000 - 1280000: 0
1280000 - 2560000: 0
2560000 - 5120000: 998
5120000 - 10240000: 2
10240000 - 20480000: 0
20480000 - 40960000: 0
40960000 - 81920000: 0
81920000 - 163840000: 0
163840000 - 327680000: 0
327680000 - 655360000: 0
655360000 - 1310720000: 0
1310720000 - -1673527296: 0
> -1673527296: 0
C:\bin>java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode, sharing)
case 4:...and got approximately 6 clocks per call, making JNI about 6 times slower on this simple test:
printf("Function call\n");
HMODULE hm = LoadLibraryW(L"c:\\bin\\jnipc.dll");
void (*func)() = (void (*)())
GetProcAddress(hm, "Java_Test_nop");
for (int i = 0; i < NUM_SAMPLES; ++i) {
unsigned __int64 begin = rdtsc();
for (int j = 0; j < NUM_ITERATIONS; ++j)
func();
unsigned __int64 end = rdtsc();
clocks[cptr++] =
(unsigned int)(end - begin);
}
Function call
Median: 60112
Median, CPI: 6
Average: 60262
Average, CPI: 6
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 998
64000 - 128000: 1
128000 - 256000: 1
256000 - 512000: 0
512000 - 1024000: 0
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0
5 comments:
Since you tested it with C++'s vector, you should also provide a test with a slightly more convenient Java container, java.util.Vector for example.
I am sorry, but the whole idea of the post does not make sense.
1. JNI transition is slow, a JVM makes a lot of extra work to make a JNI call (interning live references for GC, maintaining Java env, etc), so, you won't get a high-res timer
2. benchmarks do not need high-res timers, just run the same stuff many times and divide. Timer overhead will be negligible.
Egor,
The reason I use rdtsc is because it's the simplest way to measure things in CPU clocks which are more universal than time - for example, the output of this counter does not depend on CPU's clock rate.
As far as JNI being slow, the benchmark does not show it. 38 vs. 6 does not point to a lot of work being done - perhaps an attempt to marshal an empty arguments list, but not much more than that.
Sergey,
the output of this counter _depends_ on the clock rate of your memory bus.
Hi!
I'm sorry to leave personal comment to professional post, but I wasn't able to find your personal email.
My last name is Solyanik too and I'm currently very interested in history of this surname.
If you will be so kind to answer to several questions, please write me a message: alex at draft.com.ua
Post a Comment