WideFinder and Java

Update: I realized there is a stupid bug in the concurrent version that I posted. ConcurrentHashMap.putIfAbsent returns null if there was no previous value - not the new value - so I was missing one increment and the Executor seemed to be eating the NullPointerException. I've updated the code below.
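
For the record, here is that contract in isolation. This is a minimal standalone sketch of the correct increment idiom; the class and key names are just for illustration, not from the WideFinder code:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class PutIfAbsentDemo {
    public static void main(String[] args) {
        ConcurrentHashMap<String, AtomicInteger> counts =
                new ConcurrentHashMap<String, AtomicInteger>();

        for (int i = 0; i < 2; i++) {
            AtomicInteger fresh = new AtomicInteger(0);
            // putIfAbsent returns the PREVIOUS value: null on the first pass
            // (our new counter went into the map), the existing counter after that.
            AtomicInteger previous = counts.putIfAbsent("key", fresh);
            AtomicInteger counter = (previous == null) ? fresh : previous;
            counter.incrementAndGet();
        }

        System.out.println(counts.get("key")); // prints 2
    }
}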

Everyone seems to be getting in on Tim Bray's WideFinder challenge; there is even a C++ implementation.
I think it was originally supposed to be a competition between all those new and exciting dynamic languages, but they have not been overly impressive thus far. So I've decided to post my various attempts at a solution in Java. Here is the basic WideFinder in Java (all import statements are omitted for brevity):


public class WideFinder {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "US-ASCII"));

        String line = null;
        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                // group(1) is the captured article path; group() would also
                // include the leading "GET " and the trailing space
                String key = m.group(1);
                Integer currentCount = counts.get(key);
                counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
            }
        }
        in.close();

        List<Entry<String, Integer>> results =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(results, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        for (int i = 0; i < 10; i++) {
            System.out.println(results.get(i));
        }
    }
}



The basic version is very similar to the Ruby one, if a little more verbose. Like all of the others, it is primarily IO bound. One important thing I found was that specifying the encoding as US-ASCII gave about a 30% improvement in the time taken. This is because Java uses Unicode strings internally, and if you read text without specifying an encoding it will use the platform default. ASCII is a good bit quicker to convert than ISO-8859-1.
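
If you want to see the decoder cost in isolation, a crude timing sketch along these lines works. The harness is my own illustration rather than the benchmark I actually ran:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class EncodingTimer {
    public static void main(String[] args) throws IOException {
        System.out.println("default:  " + time(args[0], null) + " ms");
        System.out.println("US-ASCII: " + time(args[0], "US-ASCII") + " ms");
    }

    // Read every line of the file with the given encoding (null = platform
    // default) and report the elapsed wall-clock time.
    static long time(String file, String encoding) throws IOException {
        long start = System.currentTimeMillis();
        InputStreamReader reader = (encoding == null)
                ? new InputStreamReader(new FileInputStream(file))
                : new InputStreamReader(new FileInputStream(file), encoding);
        BufferedReader in = new BufferedReader(reader);
        while (in.readLine() != null) {
            // just decode each line; the work here is the charset conversion
        }
        in.close();
        return System.currentTimeMillis() - start;
    }
}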

For a file with 1 million entries I got the following times:

real 0m3.208s
user 0m3.460s
sys 0m0.215s

Taking the regexp out:

real 0m2.267s
user 0m2.329s
sys 0m0.210s

So we can potentially save up to 1 second, almost a third of the total time. My first solution tries to keep things as close to the original as possible. For each line, instead of matching it directly, I submit a new task to an ExecutorService. To make the map safe to access from multiple threads I use a ConcurrentHashMap, with AtomicIntegers as values.


public class WideFinderConcurrent {
    public static void main(String[] args) throws IOException, InterruptedException {
        int threads = Integer.parseInt(args[1]);
        ExecutorService executorService = Executors.newFixedThreadPool(threads);

        final ConcurrentHashMap<String, AtomicInteger> counts =
                new ConcurrentHashMap<String, AtomicInteger>();
        final Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "US-ASCII"));

        String line = null;
        while ((line = in.readLine()) != null) {
            final String lineForTask = line;
            executorService.execute(new Runnable() {
                public void run() {
                    Matcher m = p.matcher(lineForTask);
                    if (m.find()) {
                        String key = m.group(1);

                        AtomicInteger currentCount = counts.get(key);
                        if (currentCount == null) {
                            currentCount = new AtomicInteger(0);
                            // putIfAbsent returns the previous value, or null
                            // if our new counter won the race
                            AtomicInteger old = counts.putIfAbsent(key, currentCount);
                            if (old != null) {
                                currentCount = old;
                            }
                        }
                        currentCount.incrementAndGet();
                    }
                }
            });
        }

        in.close();
        executorService.shutdown();
        // effectively wait forever for the queued tasks to drain
        executorService.awaitTermination(100000, TimeUnit.SECONDS);

        List<Entry<String, AtomicInteger>> results =
                new ArrayList<Map.Entry<String, AtomicInteger>>(counts.entrySet());
        Collections.sort(results, new Comparator<Entry<String, AtomicInteger>>() {
            public int compare(Entry<String, AtomicInteger> o1, Entry<String, AtomicInteger> o2) {
                int anotherVal = o1.getValue().get();
                int thisVal = o2.getValue().get();
                return (thisVal < anotherVal ? -1 : (thisVal == anotherVal ? 0 : 1));
            }
        });

        for (int i = 0; i < 10; i++) {
            System.out.println(results.get(i));
        }
    }
}


The performance for this version is pretty poor:
real 0m5.407s
user 0m11.921s
sys 0m4.023s

Most of the locking is probably uncontended, but there is still significant overhead. It is also probably not a great idea to allocate lots and lots of AtomicIntegers. With more work being done in each task the approach might be more reasonable.

The next step was to get my hands dirty with low-level threading. Many bloggers will have you believe this is impossible to get right, but somehow I've managed to write a fair bit of multithreaded code with manual synchronization without causing horrible errors. For this version I used just two threads: the main thread does the reading, and a second thread does the regexp matching and map updates. I didn't bother trying to parallelize the sort. The main thread passes lines to the regexp thread over a queue; once the file is exhausted, a special sentinel object is put on the queue to signal the end. We then wait for the regexp thread to finish and read back the map. Since locking overhead would again be a problem, lines are put on the queue in batches rather than one by one.


public class WideFinder2ThreadsBulk {
    public static void main(String[] args) throws IOException, InterruptedException {
        int batchSize = Integer.parseInt(args[1]);
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");

        LineCounter counter = new LineCounter(p);
        Thread t = new Thread(counter);
        t.start();

        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), "US-ASCII"));

        String line = null;
        List<String> batch = new ArrayList<String>(batchSize);
        while ((line = in.readLine()) != null) {
            batch.add(line);
            if (batch.size() == batchSize) {
                counter.addLines(batch);
                batch = new ArrayList<String>(batchSize);
            }
        }
        // flush the final partial batch, then signal the end of input
        counter.addLines(batch);
        counter.finished();

        t.join();
        in.close();

        List<Entry<String, Integer>> results =
                new ArrayList<Map.Entry<String, Integer>>(counter.getCounts().entrySet());
        Collections.sort(results, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        for (int i = 0; i < 10; i++) {
            System.out.println(results.get(i));
        }
    }

    static class LineCounter implements Runnable {
        // sentinel ("poison pill") marking the end of input; compared by identity
        private static final List<String> FINISHED = new ArrayList<String>();
        private final BlockingQueue<List<String>> queue =
                new ArrayBlockingQueue<List<String>>(100);
        private final Pattern pattern;
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        LineCounter(Pattern pattern) {
            this.pattern = pattern;
        }

        public void run() {
            List<String> lines = null;
            try {
                while ((lines = queue.take()) != FINISHED) {
                    for (String line : lines) {
                        Matcher m = pattern.matcher(line);
                        if (m.find()) {
                            String key = m.group(1);
                            Integer currentCount = counts.get(key);
                            counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
                        }
                    }
                }
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void addLines(List<String> batch) throws InterruptedException {
            queue.put(batch);
        }

        void finished() throws InterruptedException {
            queue.put(FINISHED);
        }

        Map<String, Integer> getCounts() {
            return counts;
        }
    }
}



Finally we get some speedup (for batches of 500 lines):
real 0m2.597s
user 0m4.137s
sys 0m0.294s

You could generalize this to multiple threads, but it likely would not be worth the effort. Really, this whole exercise probably isn't worth the effort. In fact, I feel like we have all failed one of Steve Yegge's interview questions. A commenter on Tim Bray's blog shows that Unix command line utilities are the way to go, while the rest of us have been writing big programs and trying to optimize file IO. Another - probably easy - way to do this would be to suck the data into your preferred DB and run a single simple SQL query. But that wouldn't be fun, would it?
