pygtrie

Implementation of a trie data structure.

Trie data structure, also known as radix or prefix tree, is a tree associating keys to values where all the descendants of a node have a common prefix (associated with that node).

The trie module contains pygtrie.Trie, pygtrie.CharTrie and pygtrie.StringTrie classes each implementing a mutable mapping interface, i.e. dict interface. As such, in most circumstances, pygtrie.Trie could be used as a drop-in replacement for a dict, but the prefix nature of the data structure is trie’s real strength.

The module also contains pygtrie.PrefixSet class which uses a trie to store a set of prefixes such that a key is contained in the set if it or its prefix is stored in the set.

Features

  • A full mutable mapping implementation.
  • Supports iterating over as well as deleting a subtrie.
  • Supports prefix checking as well as shortest and longest prefix look-up.
  • Extensible for any kind of user-defined keys.
  • A PrefixSet supports “all keys starting with given prefix” logic.
  • Can store any value including None.

For some simple examples see example.py file.

Installation

To install pygtrie, run:

pip install pygtrie

Or download the sources and save pygtrie.py file with your project.

Upgrading from 0.9.x

The 1.0 release introduced backwards incompatibility in naming. The module has been renamed from trie to pygtrie. Fortunately, updating scripts using pygtrie should boil down to replacing:

from pytrie import trie

with:

import pygtrie as trie

Trie classes

class Trie(*args, **kwargs)[source]

A trie implementation with dict interface plus some extensions.

Keys used with the pygtrie.Trie must be iterable, yielding hashable objects. In other words, for a given key, dict.fromkeys(key) must be valid.

In particular, strings work fine as trie keys, however when getting keys back from iterkeys() method for example, instead of strings, tuples of characters are produced. For that reason, pygtrie.CharTrie or pygtrie.StringTrie may be preferred when using pygtrie.Trie with string keys.

__delitem__(key_or_slice)[source]

Deletes value associated with given key or raises KeyError.

If argument is a key, value associated with it is deleted. If the key is also a prefix, its descendents are not affected. On the other hand, if the argument is a slice (in which case it must have only start set), the whole subtrie is removed. For example:

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo'] = 'Foo'
>>> t['foo/bar'] = 'Bar'
>>> t['foo/bar/baz'] = 'Baz'
>>> del t['foo/bar']
>>> t.keys()
['foo', 'foo/bar/baz']
>>> del t['foo':]
>>> t.keys()
[]
Parameters:

key_or_slice – A key to look for or a slice. If key is a slice, the whole subtrie will be removed.

Raises:
  • ShortKeyError – If the key has no value associated with it but is a prefix of some key with a value. This is not thrown is key_or_slice is a slice – in such cases, the whole subtrie is removed. Note that ShortKeyError is subclass of KeyError.
  • KeyError – If key has no value associated with it nor is a prefix of an existing key.
  • TypeError – If key is a slice whose stop or step are not None.
__getitem__(key_or_slice)[source]

Returns value associated with given key or raises KeyError.

When argument is a single key, value for that key is returned (or KeyError exception is thrown if the node does not exist or has no value associated with it).

When argument is a slice, it must be one with only start set in which case the access is identical to Trie.itervalues invocation with prefix argument.

Example

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo/bar'] = 'Bar'
>>> t['foo/baz'] = 'Baz'
>>> t['qux'] = 'Qux'
>>> t['foo/bar']
'Bar'
>>> list(t['foo':])
['Baz', 'Bar']
>>> t['foo']
Traceback (most recent call last):
    ...
pygtrie.ShortKeyError: 'foo'
Parameters:

key_or_slice – A key or a slice to look for.

Returns:

If a single key is passed, a value associated with given key. If a slice is passed, a generator of values in specified subtrie.

Raises:
  • ShortKeyError – If the key has no value associated with it but is a prefix of some key with a value. Note that ShortKeyError is subclass of KeyError.
  • KeyError – If key has no value associated with it nor is a prefix of an existing key.
  • TypeError – If key_or_slice is a slice but it’s stop or step are not None.
__init__(*args, **kwargs)[source]

Initialises the trie.

Arguments are interpreted the same way Trie.update interprets them.

__len__()[source]

Returns number of values in a trie.

Note that this method is expensive as it iterates over the whole trie.

__setitem__(key_or_slice, value)[source]

Sets value associated with given key.

If key_or_slice is a key, simply associate it with given value. If it is a slice (which must have start set only), it in addition clears any subtrie that might have been attached to particular key. For example:

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo/bar'] = 'Bar'
>>> t['foo/baz'] = 'Baz'
>>> t.keys()
['foo/baz', 'foo/bar']
>>> t['foo':] = 'Foo'
>>> t.keys()
['foo']
Parameters:
  • key_or_slice – A key to look for or a slice. If it is a slice, the whole subtrie (if present) will be replaced by a single node with given value set.
  • value – Value to set.
Raises:

TypeError – If key is a slice whose stop or step are not None.

clear()[source]

Removes all the values from the trie.

copy()[source]

Returns a shallow copy of the trie.

enable_sorting(enable=True)[source]

Enables sorting of child nodes when iterating and traversing.

Normally, child nodes are not sorted when iterating or traversing over the trie (just like dict elements are not sorted). This method allows sorting to be enabled (which was the behaviour prior to pygtrie 2.0 release).

For Trie class, enabling sorting of children is identical to simply sorting the list of items since Trie returns keys as tuples. However, for other implementations such as StringTrie the two may behove subtly different. For example, sorting items might produce:

root/foo-bar
root/foo/baz

even though foo comes before foo-bar.

Parameters:enable – Whether to enable sorting of child nodes.
classmethod fromkeys(keys, value=None)[source]

Creates a new trie with given keys set.

This is roughly equivalent to calling the constructor with a (key, value) for key in keys generator.

Parameters:
  • keys – An iterable of keys that should be set in the new trie.
  • value – Value to associate with given keys.
Returns:

A new trie where each key from keys has been set to the given value.

has_key(key)[source]

Indicates whether given key has value associated with it.

See Trie.has_node for more detailed documentation.

has_node(key)[source]

Returns whether given node is in the trie.

Return value is a bitwise or of HAS_VALUE and HAS_SUBTRIE constants indicating node has a value associated with it and that it is a prefix of another existing key respectively. Both of those are independent of each other and all of the four combinations are possible. For example:

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo/bar'] = 'Bar'
>>> t['foo/bar/baz'] = 'Baz'
>>> t.has_node('qux') == 0
True
>>> t.has_node('foo/bar/baz') == pygtrie.Trie.HAS_VALUE
True
>>> t.has_node('foo') == pygtrie.Trie.HAS_SUBTRIE
True
>>> t.has_node('foo/bar') == (pygtrie.Trie.HAS_VALUE |
...                           pygtrie.Trie.HAS_SUBTRIE)
True

There are two higher level methods built on top of this one which give easier interface for the information. Trie.has_key and returns whether node has a value associated with it and Trie.has_subtrie checks whether node is a prefix. Continuing previous example:

>>> t.has_key('qux'), t.has_subtrie('qux')
False, False
>>> t.has_key('foo/bar/baz'), t.has_subtrie('foo/bar/baz')
True, False
>>> t.has_key('foo'), t.has_subtrie('foo')
False, True
>>> t.has_key('foo/bar'), t.has_subtrie('foo/bar')
True, True
Parameters:key – A key to look for.
Returns:Non-zero if node exists and if it does a bit-field denoting whether it has a value associated with it and whether it has a subtrie.
has_subtrie(key)[source]

Returns whether given key is a prefix of another key in the trie.

See Trie.has_node for more detailed documentation.

items(prefix=<object object>, shallow=False)[source]

Returns a list of (key, value) pairs in given subtrie.

This is equivalent to constructing a list from generator returned by Trie.iteritems which see for more detailed documentation.

iteritems(prefix=<object object>, shallow=False)[source]

Yields all nodes with associated values with given prefix.

Only nodes with values are output. For example:

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo'] = 'Foo'
>>> t['foo/bar/baz'] = 'Baz'
>>> t['qux'] = 'Qux'
>>> t.items()
[('qux', 'Qux'), ('foo', 'Foo'), ('foo/bar/baz', 'Baz')]

Items are generated in topological order but the order of siblings is unspecified by default. In other words, in the above example, the ('qux', 'Qux') pair might have been at the end of the list. At an expense of efficiency, this can be changed via Trie.enable_sorting.

With prefix argument, only items with specified prefix are generated (i.e. only given subtrie is traversed) as demonstrated by:

>>> t.items(prefix='foo/bar')
[('foo/bar/baz', 'Baz')]

With shallow argument, if a node has value associated with it, it’s children are not traversed even if they exist which can be seen in:

>>> t.items(shallow=True)
[('qux', 'Qux'), ('foo', 'Foo')]
Parameters:
  • prefix – Prefix to limit iteration to.
  • shallow – Perform a shallow traversal, i.e. do not yield items if their prefix has been yielded.
Yields:

(key, value) tuples.

Raises:

KeyError – If prefix does not match any node.

iterkeys(prefix=<object object>, shallow=False)[source]

Yields all keys having associated values with given prefix.

This is equivalent to taking first element of tuples generated by Trie.iteritems which see for more detailed documentation.

Parameters:
  • prefix – Prefix to limit iteration to.
  • shallow – Perform a shallow traversal, i.e. do not yield keys if their prefix has been yielded.
Yields:

All the keys (with given prefix) with associated values in the trie.

Raises:

KeyError – If prefix does not match any node.

itervalues(prefix=<object object>, shallow=False)[source]

Yields all values associated with keys with given prefix.

This is equivalent to taking second element of tuples generated by Trie.iteritems which see for more detailed documentation.

Parameters:
  • prefix – Prefix to limit iteration to.
  • shallow – Perform a shallow traversal, i.e. do not yield values if their prefix has been yielded.
Yields:

All the values associated with keys (with given prefix) in the trie.

Raises:

KeyError – If prefix does not match any node.

keys(prefix=<object object>, shallow=False)[source]

Returns a list of all the keys, with given prefix, in the trie.

This is equivalent to constructing a list from generator returned by Trie.iterkeys which see for more detailed documentation.

longest_prefix(key)[source]

Finds the longest prefix of a key with a value.

This is equivalent to taking the last object yielded by Trie.prefixes with a default of (None, None) if said method yields no items. As an added bonus, the pair in that case will be a falsy value (as opposed to regular two-element tuple of None values).

Example

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo'] = 'Foo'
>>> t['foo/bar/baz'] = 'Baz'
>>> t.longest_prefix('foo/bar/baz/qux')
('foo/bar/baz', 'Baz')
>>> t.longest_prefix('does/not/exist')
(None, None)
>>> bool(t.longest_prefix('does/not/exist'))
False
Parameters:key – Key to look for.
Returns:(k, value) where k is the longest prefix of key (it may equal key) and value is a value associated with that key. If no node is found, (None, None) is returned.
pop(key, default=<object object>)[source]

Deletes value associated with given key and returns it.

Parameters:
  • key – A key to look for.
  • default – If specified, value that will be returned if given key has no value associated with it. If not specified, method will throw KeyError in such cases.
Returns:

Removed value, if key had value associated with it, or default (if given).

Raises:
  • ShortKeyError – If default has not been specified and the key has no value associated with it but is a prefix of some key with a value. Note that ShortKeyError is subclass of KeyError.
  • KeyError – If default has not been specified and key has no value associated with it nor is a prefix of an existing key.
popitem()[source]

Deletes an arbitrary value from the trie and returns it.

There is no guarantee as to which item is deleted and returned. Neither in respect to its lexicographical nor topological order.

Returns:(key, value) tuple indicating deleted key.
Raises:KeyError – If the trie is empty.
prefixes(key)[source]

Walks towards the node specified by key and yields all found items.

Example

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo'] = 'Foo'
>>> t['foo/bar/baz'] = 'Baz'
>>> list(t.prefixes('foo/bar/baz/qux'))
[('foo', 'Foo'), ('foo/bar/baz', 'Baz')]
>>> list(t.prefixes('does/not/exist'))
[]
Parameters:key – Key to look for.
Yields:(k, value) pairs denoting keys with associated values encountered on the way towards the specified key.
setdefault(key, value=None)[source]

Sets value of a given node if not set already. Also returns it.

In contrast to Trie.__setitem__, this method does not accept slice as a key.

shortest_prefix(key)[source]

Finds the shortest prefix of a key with a value.

This is equivalent to taking the first object yielded by Trie.prefixes with a default of (None, None) if said method yields no items. As an added bonus, the pair in that case will be a falsy value (as opposed to regular two-element tuple of None values).

Example

>>> import pygtrie
>>> t = pygtrie.StringTrie()
>>> t['foo'] = 'Foo'
>>> t['foo/bar/baz'] = 'Baz'
>>> t.longest_prefix('foo/bar/baz/qux')
('foo', 'Foo')
>>> t.longest_prefix('does/not/exist')
(None, None)
>>> bool(t.longest_prefix('does/not/exist'))
False
Parameters:key – Key to look for.
Returns:(k, value) where k is the shortest prefix of key (it may equal key) and value is a value associated with that key. If no node is found, (None, None) is returned.
traverse(node_factory, prefix=<object object>)[source]

Traverses the tree using node_factory object.

node_factory is a callable function which accepts (path_conv, path, children, value=...) arguments, where path_conv is a lambda converting path representation to key, path is the path to this node, children is an iterable of children nodes constructed by node_factory, optional value is the value associated with the path.

node_factory’s children argument is a generator which has a few consequences:

  • To traverse into node’s children, the generator must be iterated over. This can by accomplished by a simple “children = list(children)” statement.
  • Ignoring the argument allows node_factory to stop the traversal from going into the children of the node. In other words, whole subtrie can be removed from traversal if node_factory chooses so.
  • If children is stored as is (i.e. as a generator) when it is iterated over later on it will see state of the trie as it is during the iteration and not when traverse method was called.

Trie.traverse has two advantages over Trie.iteritems and similar methods:

  1. it allows subtries to be skipped completely when going through the list of nodes based on the property of the parent node; and
  2. it represents structure of the trie directly making it easy to convert structure into a different representation.

For example, the below snippet prints all files in current directory counting how many HTML files were found but ignores hidden files and directories (i.e. those whose names start with a dot):

import os
import pygtrie

t = pygtrie.StringTrie(separator=os.sep)

# Construct a trie with all files in current directory and all
# of its sub-directories.  Files get set a True value.
# Directories are represented implicitly by being prefixes of
# files.
for root, _, files in os.walk('.'):
    for name in files: t[os.path.join(root, name)] = True

def traverse_callback(path_conv, path, children, is_file=False):
    if path and path[-1] != '.' and path[-1][0] == '.':
        # Ignore hidden directory (but accept root node and '.')
        return 0
    elif is_file:
        print path_conv(path)
        return int(path[-1].endswith('.html'))
    else:
        # Otherwise, it's a directory.  Traverse into children.
        return sum(int(is_html) for is_html in children)

print t.traverse(traverse_callback)

As documented, ignoring the children argument causes subtrie to be omitted and not walked into.

In the next example, the trie is converted to a tree representation where child nodes include a pointer to their parent. As before, hidden files and directories are ignored:

import os
import pygtrie

t = pygtrie.StringTrie(separator=os.sep)
for root, _, files in os.walk('.'):
    for name in files: t[os.path.join(root, name)] = True

class File(object):
    def __init__(self, name):
        self.name = name
        self.parent = None

class Directory(File):
    def __init__(self, name, children):
        super(Directory, self).__init__(name)
        self._children = children
        for child in children:
            child.parent = self

def traverse_callback(path_conv, path, children, is_file=False):
    if not path or path[-1] == '.' or path[-1][0] != '.':
        if is_file:
            return File(path[-1])
        children = filter(None, children)
        return Directory(path[-1] if path else '', children)

root = t.traverse(traverse_callback)

Note: Unlike iterators, traverse method uses stack recursion which means that using it on deep tries may lead to a RuntimeError exception thrown once Python’s maximum recursion depth is reached.

Parameters:
  • node_factory – Makes opaque objects from the keys and values of the trie.
  • prefix – Prefix for node to start traversal, by default starts at root.
Returns:

Node object constructed by node_factory corresponding to the root node.

update(*args, **kwargs)[source]

Updates stored values. Works like dict.update.

values(prefix=<object object>, shallow=False)[source]

Returns a list of values in given subtrie.

This is equivalent to constructing a list from generator returned by Trie.iterivalues which see for more detailed documentation.

class CharTrie(*args, **kwargs)[source]

A variant of a pygtrie.Trie which accepts strings as keys.

The only difference between pygtrie.CharTrie and pygtrie.Trie is that when pygtrie.CharTrie returns keys back to the client (for instance in keys() method is called), those keys are returned as strings.

Canonical example where this class can be used is a dictionary of words in a natural language. For example:

>>> import pygtrie
>>> t = pygtrie.CharTrie()
>>> t['wombat'] = True
>>> t['woman'] = True
>>> t['man'] = True
>>> t['manhole'] = True
>>> t.has_subtrie('wo')
True
>>> t.has_key('man')
True
>>> t.has_subtrie('man')
True
>>> t.has_subtrie('manhole')
False
class StringTrie(*args, **kwargs)[source]

pygtrie.Trie variant accepting strings with a separator as keys.

The trie accepts strings as keys which are split into components using a separator specified during initialisation (“/” by default).

Canonical example where this class can be used is when keys are paths. For example, it could map from a path to a request handler:

import pygtrie

def handle_root(): pass
def handle_admin(): pass
def handle_admin_images(): pass

handlers = pygtrie.StringTrie()
handlers[''] = handle_root
handlers['/admin'] = handle_admin
handlers['/admin/images'] = handle_admin_images

request_path = '/admin/images/foo'

handler = handlers.longest_prefix(request_path)
__init__(*args, **kwargs)[source]

Initialises the trie.

Except for a separator named argument, all other arguments are interpreted the same way Trie.update interprets them.

Parameters:
  • *args – Passed to super class initialiser.
  • **kwargs – Passed to super class initialiser.
  • separator – A separator to use when splitting keys into paths used by the trie. “/” is used if this argument is not specified. This named argument is not specified on the function’s prototype because of Python’s limitations.

PrefixSet class

class PrefixSet(iterable=None, factory=<class 'pygtrie.Trie'>, **kwargs)[source]

A set of prefixes.

pygtrie.PrefixSet works similar to a normal set except it is said to contain a key if the key or it’s prefix is stored in the set. For instance, if “foo” is added to the set, the set contains “foo” as well as “foobar”.

The set supports addition of elements but does not support removal of elements. This is because there’s no obvious consistent and intuitive behaviour for element deletion.

__contains__(key)[source]

Checks whether set contains key or its prefix.

__init__(iterable=None, factory=<class 'pygtrie.Trie'>, **kwargs)[source]

Initialises the prefix set.

Parameters:
  • iterable – A sequence of keys to add to the set.
  • factory – A function used to create a trie used by the pygtrie.PrefixSet.
  • kwargs – Additional keyword arguments passed to the factory function.
__iter__()[source]

Return iterator over all prefixes in the set.

See PrefixSet.iter method for more info.

__len__()[source]

Returns number of keys stored in the set.

Since a key does not have to be explicitly added to the set to be an element of the set, this method does not count over all possible keys that the set contains (since that would be infinity), but only over the shortest set of prefixes of all the keys the set contains.

For example, if “foo” has been added to the set, the set contains also “foobar”, but this method will not count “foobar”.

add(key)[source]

Adds given key to the set.

If the set already contains prefix of the key being added, this operation has no effect. If the key being added is a prefix of some existing keys in the set, those keys are deleted and replaced by a single entry for the key being added.

For example, if the set contains key “foo” adding a key “foobar” does not change anything. On the other hand, if the set contains keys “foobar” and “foobaz”, adding a key “foo” will replace those two keys with a single key “foo”.

This makes a difference when iterating over the keys or counting number of keys. Counter intuitively, adding of a key can decrease size of the set.

Parameters:key – Key to add.
clear()[source]

Removes all keys from the set.

copy()[source]

Returns a copy of the prefix set.

iter(prefix=<object object>)[source]

Iterates over all keys in the set optionally starting with a prefix.

Since a key does not have to be explicitly added to the set to be an element of the set, this method does not iterate over all possible keys that the set contains, but only over the shortest set of prefixes of all the keys the set contains.

For example, if “foo” has been added to the set, the set contains also “foobar”, but this method will not iterate over “foobar”.

If prefix argument is given, method will iterate over keys with given prefix only. The keys yielded from the function if prefix is given does not have to be a subset (in mathematical sense) of the keys yielded when there is not prefix. This happens, if the set contains a prefix of the given prefix.

For example, if only “foo” has been added to the set, iter method called with no arguments will yield “foo” only. However, when called with “foobar” argument, it will yield “foobar” only.

Version History

2.1: 2017/03/23

  • The library is now Python 3 compatible.
  • Value returend by shortest_prefix and longest_prefix evaluates to false if no prefix was found. This is in addition to it being a pair of Nones of course.

2.0: 2016/07/06

  • Sorting of child nodes is disabled by default for better performance. enable_sorting method can be used to bring back old behaviour.
  • Tries of arbitrary depth can be pickled without reaching Python’s recursion limits. (N.B. The pickle format is incompatible with one from 1.2 release). _Node’s __getstate__ and __setstate__ method can be used to implement other serialisation methods such as JSON.

1.2: 2016/06/21 [pulled back from PyPi]

  • Tries can now be pickled.
  • Iterating no longer uses recursion so tries of arbitrary depth can be iterated over. The traverse method, however, still uses recursion thus cannot be used on big structures.

1.1: 2016/01/18

  • Fixed PyPi installation issues; all should work now.

1.0: 2015/12/16

  • The module has been renamed from trie to pygtrie. This could break current users but see documentation for how to quickly upgrade your scripts.
  • Added traverse method which goes through the nodes of the trie preserving structure of the tree. This is a depth-first traversal which can be used to search for elements or translate a trie into a different tree structure.
  • Minor documentation fixes.

0.9.3: 2015/05/28

  • Minor documentation fixes.

0.9.2: 2015/05/28

  • Added Sphinx configuration and updated docstrings to work better with Sphinx.

0.9.1: 2014/02/03

  • New name.

0.9: 2014/02/03

  • Initial release.