Wednesday, August 25, 2010

BinaryTrees For Rescue

When ArrayLists become big reverse lookups like indexOf, lastIndexOf, contains and the modifications methods can slow down your program. When order is not important, you might use a HashMap. In some other cases you have a constant key-value, that can be use in a TreeMap. If you need to have to update the index however there is no real alternative. What I'm missing in the java collections is a balanced tree, allowing me to traverse it's structure. The NavigableSet-Interface available since Java 6 only allows to traverse the elements. There are no methods to access the tree-structure. I wonder why the tree used in TreeMap wasn't made available in a separate Red-Black-Tree class.

As mentioned before trees can be very useful when search through big datasets. Imagine you have a texteditor with more than 10000 lines of text and you need a mapping from lineNumber to line and line to lineNumber. One possibility is to put all lines as keys in a map and use the index as a value, but than you have to update the map every time, you insert a new line. Using a tree, you can insert the lines at a certain position and put the treenode instead of the index into the map. Now you can lookup up the treenode for a line. Calculating the index of a treenode is not difficult:



To speed up the calculation you can cache the node-count of each subtree in the corresponding rootnode and only recalculate it, when a subtree node changes. This way a lookup of a treenode by it's current index or calculating the a treenode index should be very fast enough.

The calculation of the index can be easy adapted to similar cases. For example if you have a texteditor supporting different line heights, calculating the y-Offset of a single line is expensive. Calculating the y-offset of a line requires to sum up all height of the previous lines. Having all lines in an Arraylist means to iterate through the hole list beginning from the first line. You may think about caching the line-offset in a treemap, but than you have to clear all entries after the modified line. In worst case if the first line changes, the while cache is invalid. With a tree, you can store the sum of heights in the treenodes. If the height of one line changes, you only have to recalculate the value of parent nodes. Caluating the y-Offset is now much cheaper, just sum up the height of the lines in all left-children and the height of the line stored in the node, when following a right-child. The only difference to the index calculation is that, that a treenode adds the height instead of the constant node count 1 when including it's own value.

Comparing the speed of an arraylist with the speed of the tree, isn't an easy task. There are many things to consider:

a) if the dataset is rarely modified and the dataset isn't to big, a map can be used to improve the reverse lookup
b) the speed for an indexOf/lastIndexOf/contains lookup an the arraylist strongly depends on the speed of the equals and hashcode method. For example for integers this is very fast.
c) the number of different elements, if you have only 10 different elements in the list and they are random, the results of indexOf/lastIndexOf/contains will need much less comparisons
d) a tree needs more memory
e) for looking up an element by an index there is nothing faster than an array and same applies for creating complete copies it.

I think the main reason for not including a balanced binary tree in the java collection as a separate class, is that the arraylist in most cases is fast enough. Wether to use a tree, hashmap, arraylist or linkedlist dependends on what you want to do. There is no golden hammer, that solves all your problems.

No comments:

Post a Comment