Overview Usage Performance Notes Links | tst 0.68 - 23 Jan 2009 |
Table of contents
|
Note: in these usage notes we will not explain the behaviour of standard set, map or unordered_* containers. Please look at documentation and examples for your standard library associative containers, and please double-check the structured_* container reference when in doubt.
The class methods are exactly the same as for (multi)set and (multi)map, and show the same behaviour (except usually quite a bit faster). The only difference is the template definition, and this only becomes apparent if you want to use a custom comparator type. Then of course there are the advanced search methods. That's the good bit, we'll get to them shortly.
// // Basic use of structured_set // #include <iostream> #include <string> #include "structured_set.hpp" typedef containers::structured_set<std::string> Set; typedef Set::iterator SetIter; typedef std::pair<SetIter, SetIter> IterPair; // Insert some keys names.insert("apps"); names.insert("applets"); names.insert("banana"); std::cout << "The set contains\n "; for (SetIter it = names.begin(); it != names.end(); ++it) std::cout << *it << ", ";
The set contains
applets, apps, banana,
A simple search:
IterPair p = names.prefix_range("app"); std::cout << "prefix_range(\"app\") returns:\n "; while (p.first != p.second) { std::cout << *p.first++ << ", "; }
prefix_range("app") returns:
applets, apps,
p.second
points to "banana".
In contrast, the standard set/map operation equal_range
would return a pair of iterators pointing to the same item:
std::cout << "equal_range(\"app\") returns "; p = names.equal_range("app"); if (p.first == p.second) std::cout << "empty range\n"; std::cout << "p.second points to " << *p.second;
This prints
equal_range("app") returns empty range
p.second points to applets
In short, this section shows that if you have used std::set
, you already have code to use structured_set
.
Currently structured containers provide two useful single-key search methods returning iterator(s), and four different searches that return a list of matches. We begin with the "single-key" methods.
We have already seen prefix_range, which returns a range of keys that share the search string as prefix. If you use a structured container to store symbols found by a parser, a typical application of prefix_range
might be to find all names of class members, or all names in a name scope. Names might be stored like many C++ parsers (eg like "::scope1::scopeN::classname<neverending_list_of_types>::membername"
), or dot-separated like javascript or many other languages: "class.innerclass.member"
.
typedef containers::structured_set<std::string> SymbolSet; SymbolSet symbols; bool is_defined_in_scope(std::string scope, std::string name) { typedef std::pair<SymbolSet::iterator, SymbolSet::iterator> Range; Range r = symbols.prefix_range(scope); SymbolSet::iterator n = symbols.find(name); return n != symbols.end() && *n >= *r.first && *n < *r.second; }
Here the glory is soiled by not having random access iterators. We cannot compare the iterators (in constant time) to find their order, instead we have to dereference and compare the keys. (and the reason we cannot just compare the scope and the name strings is not apparent from the above code sketch)
char_type
(ie, some text). It then finds the longest key that can be matched to the beginning of the char range. If a key is found, it advances the first char iterator so it is positioned after the found key. This way it's easy to use a structured container as a simple tokenizer, and for large vocabularies, comparatively efficient.
Here is a tokenizer over a large dictionary which is nearly as fast as a single ternary_tree.find() operation.
Note that we do NOT use a std::istream_iterator
for the input from file. The iterators passed to longest_match
must support distance(), that is, they must be Forward Iterators, an Input Iterator is not enough.
void longest_match_example(const char* dictfile, const char* parsefile) { typedef containers::structured_map<std::string, int, nocase_less<char> > Vocabulary; Vocabulary english; // Fill dictionary fill_wordlist(dictfile, english); // (not shown here) if (english.empty()) return; // longest_match does not work with istream_iterator, so must fill buffer std::ifstream infile(parsefile); size_t filesize = get_filesize(infile); // (not shown here) boost::scoped_array<char> bytes(new char[filesize]); infile.read(bytes.get(), filesize); char *first = bytes.get(); char *last = first + infile.gcount(); while (first != last) { Vocabulary::iterator word = english.longest_match(first, last); if (word != english.end()) std::cout << (*word).first << " "; else { // No key; try next char ++first; } } }
Note that this may return spurious results if a word is not in the dictionary. Tokenizing Shakespeare quotes over the English Scrabble dictionary may return
"there" "is" "something" "rotten" "in" "den" "mark"
partial_match_search
).hamming_search
.levenshtein_search
(which could use a more descriptive name).The functions have similar interface: Arguments are the search-string, an output iterator onto which the results are assigned, and a search configuration parameter. For the partial match, the configuration is the wildcard character used in the search string (by default '?'), for Hamming and Levenshtein searches it is the number of allowed mismatches/edits. Here are the function call prototypes:
OutputIter partial_match_search (const key_type& , OutputIter , char_type wildcard = '?');
OutputIter hamming_search (const key_type& , OutputIter , size_type dist);
OutputIter levenshtein_search (const key_type& , OutputIter , size_type dist);
search_results_list
is a wrapper class that is almost a std::vector<StructuredContainer::
iterator>
but to save space we store a single reference to the tree and a list of positions. The search_results_list
returns proper iterators from its methods.
Each search_results_list
(abbr: SRL) is thus bound to a specific container instance. Therefore we chose to construct it by a factory method on structured containers:
typedef containers::structured_map<std::string, MyStuff> StuffMap; StuffMap mystuff; // Create the search results list StuffMap::search_results_list results = mystuff.create_search_results();
To use it with the search methods, wrap the results in a std::back_iterator
, eg
mystuff.levenshtein_search(key, std::back_iterator(results), 3);
The SRL class is defined with a template parameter for the source container iterator, so it is bound to the right iterator type.
The results list class is also limited - it doesn't do all a vector could do. We define
tree_iterator
type, which is the typename of the owner container's iterator typestd::back_insert_iterator
to search methodsiterator
type, which return tree_iterators when dereferencedoperator
[] which also returns a tree_iteratorsize()
, empty()
, clear()
with the expected behaviour.
The search_results_list::iterator
can be treated as a pointer-to-pointer:
typedef StuffMap::search_results_list::iterator ResultIter; // ... for (ResultIter it = results.begin(); it != results.end(); ++it) std::cout << (**it).second << "\n";
// ... for (int i = 0; i < results.size(); ++i) std::cout << (*results[i]).first << " = " << (*results[i]).second << "\n";
and that's all there is to it. We'll put it to use in the searches coming up.
To find all dates, if they somehow end up as a string in a structured container, the search is simple:
search_results_list dates = mystrings.create_search_results();
mystrings.partial_match_search("??/\??", std::back_iterator(results));
The search will be more expensive the more wildcards are put in the beginning of the search key. For each wildcard, every node at that level will be inspected, ie up to the entire alphabet.
For fixed patterns regexes would normally be the most useful tool, but if they are stored as strings.
This search can therefore be considered expensive, since several alternative searches are performed in parallel over each node traversed. At the same time it is very powerful, and dist
values of at most 2-3 is normally sufficient. (Put in another way: dist
of 20-30% of search string length is usually plenty.)
Main advise: limit your searches
One lesson learnt from performance tests is that it is cheaper to perform several levenshtein searches with lower distance than to overshoot the distance by one. So, if you for instance just want to find any match, the way to go is to perform all searches from dist=1
up until you get enough results to be happy.
Note that a levenshtein search with dist
= 5 on a 5-letter key will return all 1-, 2-, 3-, 4- and 5-letter words in the tree, plus all 6-letter keys containing at least one char from the search key, plus all 7-letter words containing at least two letters in the same order as in search key, etc etc.
See the count of found keys for different searches in the second table in Performance Notes.
So, normally the dist
count should be kept low. It may often be optimal to perform several searches, increasing the dist
count only if no matches are found.
// // Case-insensitive structured containers // #include <functional> template<class CharT> struct nocase_less : public std::binary_function<CharT, CharT, bool> { bool operator()(const CharT& a, const CharT& b) const { return tolower(a) < tolower(b); } }; typedef containers::structured_multiset<std::string, nocase_less<char> > CaselessSet;
It will store all capitalisation variants of a key under the same key, but returns the exact inserted strings.
CaselessSet caseless; caseless.insert("NoCase"); caseless.insert("nocase"); caseless.insert("noCase"); caseless.insert("NOCASE"); std::cout << "nocase count: " << caseless.count("nocase"); for(CaselessSet::iterator it = caseless.begin(); it != caseless.end(); ++it) { std::cout << ", " << *it; }
nocase count: 4, NoCase, nocase, noCase, NOCASE
We are not doing anything here that couldn't be done with std::multiset, though the comparator would look different. But we are doing it 3-4 times faster!
The Latin-1 character table (ISO-8859-1) graciously includes even the weird A-ring letter (an A with a small ring on top, perhaps not even representable in your browser). So we're lucky. But, it doesn't store them in the local sort order. Nor does it use the Danish and Norwegian sort order: Æ (AE), Ø, Å.
So we need another custom comparator here. It is again a std::binary_function<CharT, CharT, bool>
type - and you can probably figure out how to do it by using std::locale's
facet collate::compare
for char-by-char comparison. (We'll do better in a minute though.):
#include <locale> //! Expensive single-character comparison by std::locale<> facet collate template<class CharT> struct collate_comp : public std::binary_function<CharT, CharT, bool> { collate_comp(const std::string& name) : m_locale(name.c_str()) {} bool operator()(const CharT& a, const CharT& b) const { const CharT * a0 = &a, *b0 = &b; return std::use_facet<std::collate<CharT> >(m_locale).compare(a0, a0+1, b0, b0+1) < 0; } private: std::locale m_locale; };
Provided we have this, just supply the type at the Compare template parameter to your structured container. This comparator needs to know which locale to use, so this must be passed to the container constructor:
typedef containers::structured_set<std::string, collate_comp<char> > LocalSet; // use comparator constructor, create Swedish locale LocalSet se_names(collate_comp<char>("sv_SE")); // Use a default comparator for contrast typedef containers::structured_set<std::string> DefaultSet; DefaultSet anynames;
Try it with some Swedish words:
se_names.insert("Äska"); se_names.insert("Åmål"); se_names.insert("Ödla"); se_names.insert("Adam"); anynames.insert("Äska"); anynames.insert("Åmål"); anynames.insert("Ödla"); anynames.insert("Adam"); for(LocalSet::iterator sit = se_names.begin(); sit != se_names.end(); ++sit) { std::cout << *sit << ", "; } std::cout << "not:\n" for(DefaultSet::iterator dit = anynames.begin(); dit != anynames.end(); ++dit) { std::cout << *dit << ", "; }
Adam, Åmål, Äska, Ödla, not: Äska, Ödla, Åmål, Adam,
Well, actually it doesn't for me, it prints something horrible to my console, since I didn't set the locale for the DOS console window. But the horrors are correctly sorted anyway.
Unfortunately the std::collate
comparison comes at a cost: In the STL implementation I use (STLport) there is a virtual function call that ends up in a system or library function for every character comparison (on Windows we go to CompareString API). This is a major hit on performance; now find
is about 50-100 times slower than using the default less<char>
comparator! A std::map with a similar locale-based comparator is also slowed down, but "only" about 8-10 times. So now std::map is 2-4 times faster than structured containers.
To get around this, we can create an alphabet-size sort-order table using locale, then do character comparison through table lookup. The full wchar_t
table requires 2^16 X 2 = 128 kB storage. Your system may already provide similar technology under the hood. The file examples/locale_less.hpp has a class utility::locale_less<CharT>
which does this.
A quick test of the perfomance of this table-lookup solution indicates it is about 2 times slower than ternary_tree with default comparator. This is a modest price to pay for internationalization.
The better type declaration looks almost the same as above:
#include "examples/locale_less.hpp" typedef containers::structured_set< std::string, utility::locale_less<char> > LocalSet; // use comparator constructor, create Swedish locale LocalSet se_names(utility::locale_less<char>("sv_SE")); //... etc
By the way, if you want the locale-table optimization for a std::map<string, Key>
, why not use the structured_set::key_compare type from LocalSet above! It's a simple lexicographical_compare
using the utility::locale_less
class. It's faster than raw std::collate::compare
calls anyway. Just a thought.
First off, although ternary_tree
is Associative and stores both a Key and Value, it is NOT Pair Associative (like std::set
and map
etc).
Therefore, for a ternary_tree<std::string, int>
value_type
is the Value
template parameter, ie int reference
is int&
, const_reference
is int const&
std::allocator<Value>
, ie std::allocator<int>
less<char>
(like all structured containers)
Consequently, iterator value_type
is int
, and iterator dereference returns (const) int&
More about this next.
pair<Key, Value>
), and the internals of ternary_tree, the iterators are non-standard. Various designs for proxy class were attempted, but this only makes the code deceptive, it doesn't supply expected behaviour.
Therefore the ternary tree iterator dereference does NOT return a pair, it returns the value_type
as defined in the template.
It is still necessary to get the key for an iterator. For this purpose a member method key()
is defined on the iterator. To complement this, the method value() returns a reference to the value. Here is brief reference of the type defined as iterator for ternary tree:
key_type const& key() const; const_reference value() const; node_index tree_position() const; // used by wrapper classes reference operator*(); // available with nonconst_traits only const_reference operator*() const; // ... bidirectional iterator like operator++, operator-- etc
Please be warned that the key reference returned is NOT PERSISTENT: the key is cached in iterator, so when the iterator is incremented or decremented, the key is changed internally. So don't try to keep the reference, copy it at each access.
typedef containers::ternary_tree<std::string, int> MyTst; MyTst tst; // ...fill tree MyTst::iterator it = tst.find("global"); MyTst::key_type& temp = it.key(); // WRONG! std::string temp2 = it.key(); // Better it++; // temp is changed now
The typedef to get this interface uses two wrapper layers as detailed below.
Define: When you write a container using ternary_tree for implementation.
Don't define: when you use ternary_tree directly in application code.
ternary_tree uses an internal class tst_iterator_base
with methods dereference()
, increment()
, decrement()
etc for basic iterator support. There are also a few methods useful for wrapper classes like structured_set &co.
To provide standard iterator interface we use the iterator_wrapper adapter. As noted above, the iterator base class provides some methods useful in client code (esp key()
). The typedef so far is just:
typedef iterator_wrapper< tst_iterator_base, const_traits<value_type> > const_iterator; typedef iterator_wrapper< tst_iterator_base, nonconst_traits<value_type> > iterator;
In client code, the base iterator methods has to be accessed like myTstIter.base_iter().key()
etc, which quickly gets tedious.
Therefore we also provide another wrapper layer: tst_detail
::iter_method_forward. This forwards the base iterator methods, so client code can use directly without inserting base_iter()
everywhere. Yes, it's a mess and backwards, but we'll change to use Boost.Iterator in the future.
But structured container wrappers don't need this forwarding, thus the macro to avoid the iter_method_forward
complications.
Define: If you want to experiment with fastmap.
Usage: Define before including the library header file (ternary_tree.hpp, structured_set.hpp etc)
Note that the macro is undefined at end of ternary_tree.hpp header.
We have experimented with a fastmap, ie a alphabet-size array of pointers to the node for the first letter of a key. When this macro is defined, the fastmap is installed and maintained by insertion (but not by erase!)
For ex, to look up "shortcut", the find_impl method doesn't begin at root, instead it starts at fastmap['s']
. This turns out to give insignificant improvement for ASCII dictionaries: While there is a weak improvement of lookup times, it's rarely over 3-5%, usually less than time jitter of the test runs. For unbalanced (sorted-insertion) trees, insert() is over 10% faster (improving as the tree grows).
With wchar_t
and a large alphabet, it might be useful.
It might be useful to provide fastmap of the 2 first characters. This is not tested, as it would be complex to code. Also with the future subkey_iterator type, it will be easy to build a fastmap in client code, eg a vector of subkey_iterators. So this macro will probably be removed (still it adds only half a dozen lines total).
ternary_tree 0.68 -- by rasmus ekman -- Page generated with Doxygen 1.5.6 on 23 Jan 2009 |