Overview Usage Performance Notes Links | tst 0.68 - 23 Jan 2009 |
In most implementations, a ternary tree node has the following members:
struct node { char splitchar; // key letter, or 0 at end of key node *hikid; // subtree of keys with higher value than splitchar node *eqkid; // subtree matching splitchar (pointer to mapped value at end-of-key node) node *lokid; // subtree less than splitchar node *parent; // necessary for iteration (not needed for insert/find) };
This means that each node is 1 char plus three or four pointers size. On many systems, struct member alignment makes the char member consume size of one pointer as well, so we have 4 (or 5) x sizeof(pointer) per node in the tree. With several kinds of dictionaries, the node count ends up at around 0.3-0.5 times total key length (since keys share nodes). This is even more expensive on 64-bit machines.
There are several variation points in the node class:
wchar_t
strings on 32- or 64-bit systems.The parent pointer also makes it possible to recover the inserted string by walking nodes backward from a terminal node to the root. Complexity is key length, plus log(tree.size), but it means inserted keys do not have to be copied to the end node. We opt to cache keys in iterators, at no measurable extra cost in iteration.
Instead of the key, an arbitrary value can be associated with endnodes. However, it should not be allowed to increase node size, since most nodes in the tree are not endnodes. In this library we store the mapped value directly in end-node if it is <=
Larger objects are allocated on the heap, and a pointer to the copy is stored in end-node (the copy is managed by the tree).sizeof(void*)
.
We use a vector<node>
as pool allocator, and record eq-hi-lo links as vector index instead of pointers. The pool allocation essentially follows original C code insert2() principle. For us, it also simplifies reallocation, since pointers do not have to be rebound; the indices are always valid. This has the following consequences:
"eqkid"
, becomes redundant, as it is always the next index (except after terminal nodes of course).lokid
node index is also unused by end-nodes (as no char should be lower than zero), so all endnode children are linked to the hi node.(In our binary-cognizant version where zero is a regular char value, this still holds, we just change the end-node test accordingly)
In the final cut, our node struct data members appear roughly like this:
struct node { CharType splitchar; // key letter, or 0 at end of key (to make sure lokid is never allowed) CharType endflag; // zero on normal nodes, 1 at end nodes, 2 at erased nodes. node_index hikid; // subtree of keys with higher value than splitchar node_index lokid; // subtree less than splitchar node_index parent; // necessary for iteration (not needed for insert/find) };
where CharType
is defined by template Key::value_type
, and treated as an unsigned type (so 0 is the lowest value); and node_index
is a size_t
-like type used by the node storage backend (currently std::vector
).
This optimization could also be applied to C version, trimming space requirement in DDJ code to 3-word nodes.
ternary_tree 0.68 -- by rasmus ekman -- Page generated with Doxygen 1.5.6 on 23 Jan 2009 |