Thursday, March 20, 2008

Adam Smith on wars...

"In great empires the people who live in the capital, and in the provinces remote from the scene of action, feel, many of them, scarce any inconveniency from the war; but enjoy, at their ease, the amusement of reading in the newspapers the exploits of their own fleets and armies. To them this amusement compensates the small difference between the taxes which they pay on account of the war, and those which they had been accustomed to pay in time of peace. They are commonly dissatisfied with the return of peace, which puts an end to their amusement, and to a thousand visionary hopes of conquest and national glory from a longer continuance of the war."

-- An Inquiry into the Nature and Causes of the Wealth of Nations

Dr. Strangelove... in reverse!

President Bush rarely comments about the Democratic presidential contest, but he said that he had to speak up about Clinton's red phone ads because he found them "so confusing."

"If I answered the red phone every time it rang, I would never get any sleep," Bush said. "Sometimes it starts ringing at 9 p.m., and I am already tucked in by then."

Bush said that "there's nothing so important that it can't wait until tomorrow, or whenever I remember to check my voicemail."


http://news.yahoo.com/s/uc/20080308/cm_uc_crabox/op_475477;_ylt=AmW7uqcqRkkMlQ.hJ9SIda9xFb8C

Monday, March 10, 2008

This is a loop that never ends - it just goes on and on my friends!


for (double d = 0.0; d != 1.0; d += 0.1)
    System.out.println(d);


:-)
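
The punchline, for the record: 0.1 has no exact binary floating-point representation, so the running sum never equals exactly 1.0 (after ten additions d is 0.9999999999999999, and the loop sails right past it). A minimal sketch of two versions that do terminate:

// Compare against a tolerance instead of testing for exact equality...
for (double d = 0.0; d < 1.0 - 1e-9; d += 0.1)
    System.out.println(d);

// ...or, better, drive the loop with an integer counter.
for (int i = 0; i < 10; ++i)
    System.out.println(i * 0.1);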

Saturday, March 8, 2008

10,000 BC: a movie review

Short version: terrible.

Medium version: think 300. This movie is very similar to it in terms of both its connection to reality and its overall flow.

Long version (WARNING: plot spoiler): there's this really advanced civilization in the middle of the desert 12,000 years ago. They know metalworking (and it looks like iron, too), shipbuilding, navigation, astronomy, writing, and cloth manufacture. Basically, Atlantis, except in Africa. They build huge gold-tipped pyramids, elevated roads, and Babylonian-looking temples, and are ruled by a priest class reporting to the chief priest, AKA the "Almighty". To build all this they need, obviously, slaves, so they send roaming bands to snatch them from neighboring tribes.

The neighbors are more authentic-looking 10,000 BC-ers (at least they wear fur and appear not to use iron weapons). The only problem: their men are either clean-shaven or have neatly trimmed beards and goatees.

Anyways, one of these roving bands steals this good-looking girl, and it proves to be their undoing: her boyfriend comes to rescue her, incites the global slave revolt, and easily overcomes the few armed (iron swords!) guards that are there. In the thick of the fight the girlfriend gets shot, but then is magically revived by a magician thousands of miles away.

Oh, yes, and at the end of all this our hero's tribe gets a bag of magic beans and starts the agricultural revolution. Viva Monsanto!

Java collections perf redux

I finally got some time to run my microbenchmark for various Java collection objects.

Here's the source:

package com.solyanik.perftesting;

import java.util.Arrays;
import java.util.Vector;
import java.util.ArrayList;

/**
 * Implements a micro-benchmark using
 * high-resolution RDTSC timer.
 *
 * @author sergey@solyanik.com (Sergey Solyanik)
 */
public class MicroBenchmark {

    /*
     * RDTSC - returns number of clocks since
     * the core was last reset.
     *
     * NOTE: no synchronizing instruction is executed
     * before RDTSC, so this is not suitable for really
     * short runs as the second RDTSC can execute before
     * the instruction we're trying to measure.
     *
     * NOTE2: Different cores have different clock
     * counters, and they are not synchronized.
     */
    private native static long rdtsc();

    /**
     * A function that consists of one instruction - ret.
     */
    private native static void nop();

    /**
     * Fixes the thread to the first CPU and core.
     */
    private native static void fixthread();

    /**
     * Loads the library containing our native code.
     */
    static {
        System.loadLibrary("jnipc");
    }

    /**
     * Prints the statistics for an array of clock intervals.
     *
     * @param clocks The array of clocks
     * @param iterations Number of iterations
     * @param bucket_number How many buckets to use for the histogram
     * @param scale Scale for the histogram
     */
    private static void Analyze(String header, long [] clocks,
            int iterations, int bucket_number, int scale) {
        Arrays.sort(clocks);
        double average = 0.0;

        long [] buckets = new long[bucket_number];

        double log2 = Math.log(2.0);

        for (int i = 0; i < clocks.length; ++i) {
            average += clocks[i];
            int ndx = (int)(Math.log((double)clocks[i]
                / (double)scale) / log2);
            if (ndx < 0)
                buckets[0]++;
            else if (ndx < buckets.length - 1)
                buckets[ndx + 1]++;
            else
                buckets[buckets.length - 1]++;
        }

        average /= clocks.length;

        System.out.println(header + ":");
        System.out.println("Median for " + iterations +
            " iterations: " + clocks[clocks.length / 2]);
        System.out.println("Average for " + iterations +
            " iterations: " + average);
        System.out.println("Median per iteration: " +
            clocks[clocks.length / 2] / iterations);
        System.out.println("Average per iteration: " +
            average / iterations);

        System.out.println("Histogram:");
        int prev = 0;
        int curr = scale;
        for (int i = 0; i < buckets.length - 1; ++i) {
            System.out.println(prev + " - " + curr +
                ": " + buckets[i]);
            prev = curr;
            curr *= 2;
        }

        System.out.println(" > " + prev + ": " +
            buckets[buckets.length - 1]);
    }

    /**
     * Number of loop iterations per sample.
     */
    private static final int ITERATIONS = 10000;

    /**
     * Number of samples.
     */
    private static final int SAMPLES = 1000;

    /**
     * Number of buckets for the histogram.
     */
    private static final int BUCKETS = 20;

    /**
     * Bucket scale for the histogram.
     */
    private static final int BUCKET_SCALE = 1000;

    /**
     * Main function.
     *
     * @param args
     */
    public static void main(String[] args) {
        fixthread();
        System.out.println("Let the fun begin!");

        long [] clocks = new long[SAMPLES];

        Boolean b = new Boolean(true);

        for (int i = 0; i < clocks.length; ++i) {
            Boolean [] a = new Boolean[ITERATIONS];
            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                a[j] = b;
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Analyze("Array", clocks, ITERATIONS, BUCKETS,
            BUCKET_SCALE);

        for (int i = 0; i < clocks.length; ++i) {
            Vector<Object> v = new Vector<Object>();

            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                v.add(b);
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Analyze("Expanding Vector", clocks, ITERATIONS,
            BUCKETS, BUCKET_SCALE);

        for (int i = 0; i < clocks.length; ++i) {
            Vector<Object> v = new Vector<Object>();
            v.setSize(ITERATIONS);
            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                v.set(j, b);
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Analyze("Pre-allocated Vector", clocks, ITERATIONS,
            BUCKETS, BUCKET_SCALE);

        for (int i = 0; i < clocks.length; ++i) {
            ArrayList<Object> al = new ArrayList<Object>();
            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                al.add(b);
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Analyze("Expanding ArrayList", clocks, ITERATIONS,
            BUCKETS, BUCKET_SCALE);

        for (int i = 0; i < clocks.length; ++i) {
            ArrayList<Object> al = new ArrayList<Object>();
            al.ensureCapacity(ITERATIONS);
            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                al.add(b);
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Analyze("Pre-allocated ArrayList", clocks, ITERATIONS,
            BUCKETS, BUCKET_SCALE);
    }
}

And here's what I've got:

                          Median   Average
Array                         14        15
Expanding vector             124       133
Pre-allocated vector         115       119
Expanding ArrayList           57        66
Pre-allocated ArrayList       43        46

As a reminder, the C++ numbers were as follows:

                                   Median   Average
Array                                   2         2
Pre-allocated vector, assignment        2         2
Pre-allocated vector, push_back         8         8
Expanding vector, push_back           145       148 (*)

(*) Obviously, this number depends very much on the number of expansions: since the STL vector grows by 50%, very big vectors will have a smaller per-element cost. See my previous posts for details on the C++ testing.
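
The practical takeaway on the Java side is the same as for STL: if the final size is known up front, pre-size the collection. A minimal sketch, assuming the expected size is known in advance (the constructor-capacity form is equivalent to the ensureCapacity call used in the benchmark; the class name and size here are just illustrative):

import java.util.ArrayList;
import java.util.List;

public class PreSized {
    public static void main(String[] args) {
        int expected = 10000; // assumed to be known in advance
        // The backing array is allocated once, instead of being
        // grown and copied repeatedly as elements are added.
        List<Object> items = new ArrayList<Object>(expected);
        for (int i = 0; i < expected; ++i)
            items.add(Boolean.TRUE);
        System.out.println(items.size());
    }
}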

Synchronization in a web application

UPDATED 03/08/2008 - source code completely rewritten.

While I was writing the code for the Project Gutenberg proxy, one of the problems I needed to solve was concurrency in URL processing.

When a user requests a URL from the proxy, the proxy sends the request to the real Project Gutenberg web site, and writes the result into a file.

Then it processes the file: if it's an HTML document, it replaces the references to the original web site with links that point to the proxy; if it was a request for an LRF file, the text file fetched from Project Gutenberg gets translated.

Then there's the cache: web pages that have already been fetched and have not yet expired are served from it.

In other words, there's a lot of activity that is by and large independent - unless two requests come in for the same page. Then, of course, we do not want to process them at the same time, because they would be modifying the same state, corrupting each other in the process.

The solution is not that hard, but I thought it would be instructive to put it here.


package com.solyanik.sync;

import java.util.HashSet;

/** The general object lock class.
 * Ensures that any two objects
 * that are equal to each other as
 * in a.equals(b) cannot be locked
 * simultaneously.
 *
 * This is a complement to the
 * standard monitor functionality
 * that protects objects that are
 * equal in the == sense (i.e. the
 * same object).
 *
 * IMPORTANT NOTE: The same thread
 * trying to lock the same object
 * twice = INSTANT DEADLOCK!
 *
 * Usage:
 * // in constructor
 * ObjectLock lock = new ObjectLock();
 * ...
 * void handleRequest(final Object o) {
 *     try {
 *         lock.lock(o);
 *         ... // do processing
 *     } finally {
 *         lock.unlock(o);
 *     }
 * }
 *
 * @author sergey@solyanik.com (Sergey Solyanik)
 */
class ObjectLock {
    /**
     * Stores all currently locked objects.
     */
    private HashSet<Object> locked =
        new HashSet<Object>();

    public void lock(Object o)
            throws InterruptedException {
        synchronized (locked) {
            for ( ; ; ) {
                if (locked.contains(o)) {
                    // We are already locked.
                    locked.wait();
                } else {
                    locked.add(o);
                    break;
                }
            }
        }
    }

    public void unlock(Object o)
            throws IllegalMonitorStateException {
        synchronized (locked) {
            if (!locked.remove(o))
                throw new IllegalMonitorStateException(
                    "The object is not locked.");
            locked.notifyAll();
        }
    }
}


The code is tiny and by and large self-explanatory. We keep a hash set, and whenever we are about to lock an object, we check if it is in the hash set already. If it is not, we put it there and continue on our merry way.

If it is, we join the queue of threads waiting on the hash set (the wait monitor could actually be absolutely any object). When the object is unlocked, it is removed from the hash set, and the waiting threads are notified.

To preempt the obvious question: this assumes that the level of contention is not very high, so there are not a lot of threads pending at any given time. With a high level of contention (tens of threads), we would want to wake up only the threads that are waiting for the specific object that has just been unlocked, not all of them; a sketch of such a variant follows.
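
Here is a minimal sketch of that idea, assuming we're willing to pull in java.util.concurrent: one ReentrantLock per key, kept in a ConcurrentHashMap, so that releasing a key wakes only that key's waiters. PerKeyLock is a hypothetical name, and unlike ObjectLock this sketch never evicts locks from the map, so a real version would need some cleanup strategy.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical per-key variant: unlocking a key wakes only the
 *  threads queued on that key's lock, not everybody. */
class PerKeyLock {
    private final ConcurrentHashMap<Object, ReentrantLock> locks =
        new ConcurrentHashMap<Object, ReentrantLock>();

    public void lock(Object key) {
        ReentrantLock fresh = new ReentrantLock();
        ReentrantLock existing = locks.putIfAbsent(key, fresh);
        // If a lock is already registered for an equal key, use it;
        // otherwise ours is now in the map.
        ((existing != null) ? existing : fresh).lock();
    }

    public void unlock(Object key) {
        ReentrantLock lock = locks.get(key);
        if (lock == null || !lock.isHeldByCurrentThread())
            throw new IllegalMonitorStateException(
                "The object is not locked by this thread.");
        // Locks are never removed from the map in this sketch.
        lock.unlock();
    }
}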

How is this different from synchronized(o)? ObjectLock prevents collisions between code operating on objects that are equal in the sense of o1.equals(o2). The synchronized clause only prevents concurrent operations on the very same object (o1 == o2).
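
A quick illustration of the difference, as a minimal sketch (the URL is just a placeholder):

public class EqualsVsIdentity {
    public static void main(String[] args) throws InterruptedException {
        ObjectLock lock = new ObjectLock();
        String a = new String("http://www.gutenberg.org/");
        String b = new String("http://www.gutenberg.org/");

        System.out.println(a == b);      // false: different objects, so
                                         // synchronized (a) and synchronized (b)
                                         // would not exclude each other.
        System.out.println(a.equals(b)); // true: so lock.lock(a) and lock.lock(b)
                                         // on the same ObjectLock do exclude
                                         // each other.

        lock.lock(a);
        // A second thread calling lock.lock(b) would block here
        // until we call unlock.
        lock.unlock(a);
    }
}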

In the case of strings, by the way, an inefficient way of accomplishing the same thing as the lock class above is this:

void handle_request(String url) {
    String lock = url.toLowerCase().intern();
    synchronized (lock) {
        ... do processing ...
    }
}


Interned strings have the property that s1 == s2 iff s1.equals(s2). The problem, of course, is that every URL that has ever been processed will stay in your process's string heap forever, so it is not very efficient.

And the unit test:

package com.solyanik.sync;

import java.util.HashSet;

public class UnitTest {
    static class Executor extends Thread {
        private static HashSet<String> inuse =
            new HashSet<String>();

        private String url;
        private int id;
        private ObjectLock lock;

        public Executor(String url, int id,
                ObjectLock lock) {
            this.url = url;
            this.id = id;
            this.lock = lock;
        }

        public void run() {
            try {
                System.out.println("[" + id +
                    "] Started to process " + url);
                System.out.println("[" + id +
                    "] Locking " + url);
                lock.lock(url);
                System.out.println("[" + id +
                    "] Lock acquired " + url);
                // Yes, I know this is unnecessary.
                // But just to be absolutely sure,
                // this is a test after all...
                String internedUrl = url.intern();

                if (inuse.contains(internedUrl)) {
                    System.out.println("[" + id +
                        "] ERROR: COLLISION " + url);
                    throw new AssertionError(
                        "Bad lock implementation!");
                }

                inuse.add(internedUrl);
                Thread.sleep(5000);
                inuse.remove(internedUrl);

                System.out.println("[" + id +
                    "] Releasing " + url);
                lock.unlock(url);
                System.out.println("[" + id +
                    "] Lock released " + url);
                System.out.println("[" + id +
                    "] Thread finished ");
            } catch (InterruptedException ex) {
                System.out.println("[" + id +
                    "] Interrupted: " + ex);
            }
        }
    }

    /**
     * @param args Not used.
     */
    public static void main(String[] args) {
        String [] urls = {
            new String("http://www.google.com"),
            new String("http://www.google.com"),
            new String("http://www.google.com"),
            new String("http://www.google.com"),
            new String("http://www.google.com"),
            new String("http://www.live.com"),
            new String("http://www.live.com"),
            new String("http://www.live.com"),
            new String("http://www.yahoo.com"),
            new String("http://www.yahoo.com")
        };
        ObjectLock lock = new ObjectLock();
        for (int i = 0; i < 128; ++i) {
            (new Executor(urls[i % urls.length],
                i, lock)).start();
        }
    }
}

Tuesday, March 4, 2008

JNI Made Simple

I want to measure the performance of Java's Vector to compare it with STL. Java, of course, lacks a really high-resolution timer, so I needed to use JNI, the Java Native Interface.

After reading Sun's spec here: http://java.sun.com/j2se/1.5.0/docs/guide/jni/spec/jniTOC.html, I decided there needed to be a simplified version. So here it goes: a simple, 3-step recipe.

(1) Create a DLL. Here's an example of a DLL that contains one function that does nothing:
nop.cxx:

__declspec(naked)
unsigned __int64 __cdecl nop(void) {
    __asm {
        ret
    }
}

nop.def:
LIBRARY "nop"
EXPORTS
    Java_Test_nop = nop

Notice the decorated name: Java_ is required for all JNI entry points, Test is the name of the class, and nop is the name of the Java method. Straightforward.

(2) Compile this using Visual Studio, and copy the resulting nop.dll to the root directory of your Java project.

(3) Call it from Java:
class Test {
    native static void nop();

    static {
        System.loadLibrary("nop");
    }

    public static void main(String[] args) {
        nop();
    }
}
That's all there is to it. Let's now look at how fast the function call is actually made. To do this, I am going to use the same framework I used for testing STL's vector performance: http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html, except translated to Java.

We'll need three functions in the native library, though: one that does nothing (this is the function we're going to test), a second that returns the result of the Pentium's RDTSC instruction, which reads the clock counter, and a third that fixes the thread's affinity to the first core. If this is not done and the thread jumps cores while executing, we won't be able to compare the results of RDTSC, since the clock counters on different cores are not synchronized.

Here's what the native code looks like:
jnipc.cxx:
#include <windows.h>

__declspec(naked)
unsigned __int64 __cdecl rdtsc(void) {
    __asm {
        rdtsc
        ret
    }
}

__declspec(naked)
unsigned __int64 __cdecl nop(void) {
    __asm {
        ret
    }
}

void __cdecl fixthread(void) {
    SetThreadAffinityMask(GetCurrentThread(), 1);
}


jnipc.def:
LIBRARY "jnipc"
EXPORTS
    Java_Test_rdtsc = rdtsc
    Java_Test_nop = nop
    Java_Test_fixthread = fixthread
The Java code is more involved, but less complicated :-):
import java.util.Arrays;

class Test {
    native static long rdtsc();
    native static void nop();
    native static void fixthread();

    static {
        System.loadLibrary("jnipc");
    }

    static final int ITERATIONS = 100000;
    static final int SAMPLES = 1000;
    static final int BUCKETS = 20;
    static final int BUCKET_SCALE = 10000;

    public static void main(String[] args) {
        fixthread();
        long [] clocks = new long[SAMPLES];
        for (int i = 0; i < SAMPLES; ++i) {
            long start = rdtsc();
            for (int j = 0; j < ITERATIONS; ++j)
                nop();
            long stop = rdtsc();
            clocks[i] = stop - start;
        }

        Arrays.sort(clocks);
        double average = 0.0;

        long [] buckets = new long[BUCKETS];

        double log2 = Math.log(2.0);

        for (int i = 0; i < SAMPLES; ++i) {
            average += clocks[i];
            int ndx = (int)(Math.log((double)clocks[i]
                / (double)BUCKET_SCALE) / log2);
            if (ndx < 0)
                buckets[0]++;
            else if (ndx < BUCKETS - 1)
                buckets[ndx + 1]++;
            else
                buckets[BUCKETS - 1]++;
        }

        average /= SAMPLES;

        System.out.println("Median for " + ITERATIONS +
            " iterations: " + clocks[SAMPLES / 2]);
        System.out.println("Average for " + ITERATIONS +
            " iterations: " + average);
        System.out.println("Median per iteration: " +
            clocks[SAMPLES / 2] / ITERATIONS);
        System.out.println("Average per iteration: " +
            average / ITERATIONS);

        System.out.println("Histogram:");
        int prev = 0;
        int curr = BUCKET_SCALE;
        for (int i = 0; i < BUCKETS - 1; ++i) {
            System.out.println(prev + " - " + curr +
                ": " + buckets[i]);
            prev = curr;
            curr *= 2;
        }

        System.out.println(" > " + prev + ": " +
            buckets[BUCKETS - 1]);
    }
}
On my T2600 Core 2 Duo laptop this ran in approximately 38 clocks per iteration.
Median for 100000 iterations: 3815903
Average for 100000 iterations: 3854832.709
Median per iteration: 38
Average per iteration: 38.54832709
Histogram:
0 - 10000: 0
10000 - 20000: 0
20000 - 40000: 0
40000 - 80000: 0
80000 - 160000: 0
160000 - 320000: 0
320000 - 640000: 0
640000 - 1280000: 0
1280000 - 2560000: 0
2560000 - 5120000: 998
5120000 - 10240000: 2
10240000 - 20480000: 0
20480000 - 40960000: 0
40960000 - 81920000: 0
81920000 - 163840000: 0
163840000 - 327680000: 0
327680000 - 655360000: 0
655360000 - 1310720000: 0
1310720000 - -1673527296: 0
> -1673527296: 0
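
(The negative bucket boundary in the last two lines is just the 32-bit int curr overflowing once it is doubled past Integer.MAX_VALUE; it does not affect the measurements.)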


C:\bin>java -version
java version "1.6.0_03"
Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
Java HotSpot(TM) Client VM (build 1.6.0_03-b05, mixed mode, sharing)
To compare it with C, I've added the following case to my perf program in http://1-800-magic.blogspot.com/2008/02/stl-vector-performance.html:
case 4:
    printf("Function call\n");
    HMODULE hm = LoadLibraryW(L"c:\\bin\\jnipc.dll");
    void (*func)() = (void (*)())
        GetProcAddress(hm, "Java_Test_nop");
    for (int i = 0; i < NUM_SAMPLES; ++i) {
        unsigned __int64 begin = rdtsc();
        for (int j = 0; j < NUM_ITERATIONS; ++j)
            func();
        unsigned __int64 end = rdtsc();
        clocks[cptr++] =
            (unsigned int)(end - begin);
    }
...and got approximately 6 clocks per call, making JNI about 6 times slower on this simple test:
Function call
Median: 60112
Median, CPI: 6
Average: 60262
Average, CPI: 6
Histogram:
0 - 1000: 0
1000 - 2000: 0
2000 - 4000: 0
4000 - 8000: 0
8000 - 16000: 0
16000 - 32000: 0
32000 - 64000: 998
64000 - 128000: 1
128000 - 256000: 1
256000 - 512000: 0
512000 - 1024000: 0
1024000 - 2048000: 0
2048000 - 4096000: 0
4096000 - 8192000: 0
8192000 - 16384000: 0
16384000 - 32768000: 0
32768000 - 65536000: 0
65536000 - 131072000: 0
131072000 - 262144000: 0
> 262144000: 0