porgramming - Codefisher

In this post, I want to have a look at the different possible solutions to two rather simple and closely related problems. How to group objects into a dictionary, or count them. It is something that at least I have found I do every now and then in various forms. I want to start from the simplest good solution and work towards better and faster solutions. The code can be downloaded off gist.github.com.

Grouping

So the problem that we want to solve is that we have some items that we want to group according to some criteria. So in python that is going to be turning a list or some other iterable into a dictionary of lists. We are also going to want some function to create the value we group by, and for the sake of simplicity, we will use len() it since that is the simplest built-in function that gives a key we might group on. In your own code, you would replace that with some logic that gives you the key or tag that you want to group things by.

Or to put that all in pseudo-code – also working python 😉 that would be the following:

names = ['mark', 'henry', 'matthew', 'paul',
         'luke', 'robert', 'joseph', 'carl', 'michael']

d = {}
for name in names:
    key = len(name)
    if key not in d:
        d[key] = []
    d[key].append(name)

# result: d = {4: ['mark', 'paul', 'luke', 'carl'], 
#              5: ['henry'], 6: ['robert', 'joseph'], 7: ['matthew', 'michael']}

So that is a nice start, loop over the data, and create a key value for each. Then we do a check to see if it is already there or not. This here is the best way to check if a key is in a dictionary. The alternative is either doing try except around a key lookup, which is really ugly and slow. Or checking the return value of get() but that prevents putting None values in the dictionary.

There is a downside to this, and that is where the key has to be hashed two or three times (python dictionaries are internally a kind of hash map, that gives them their almost linear lookup time). First in the if statement, and a possible second in the assignment to an empty list() and finally in the lookup for the append. That python has to hash the value has an overhead. So how might we do better? The following is one possible better solution.

d = {}
for name in names:
	key = len(name)
	d.setdefault(key, []).append(name)

This uses the setdefault() method. This is a function, that even the developers of Python admit freely is confusingly named. The problem is any descriptive alternatives look like do_a_get_lookup_but_if_not_found_assign_the_second_argument(). So more or less, the same code as we wrote ourselves before, but since it is done by the dictionary itself, the key is only hashed once. It will be faster when we have lots of values.

This is still not the best code that we could do, there is a nicer way. It involves using a data structure called defaultdict that lives in the collections module. If you have not checked out the collections module recently, I recommend you read its docs, there are a number of very useful utilities in it. With that aside, defaultdict lets us create a dictionary-like object that is different only in that if a lookup fails, it uses the argument passed to it during creation (in our case list) to fill that key in. It lets us now write code like this:

from collections import defaultdict

d = defaultdict(list)
for name in names:
	key = len(name)
	d[key].append(name)

So now we can just look up the key and append it to it, not worrying about if it exists or not. If it does not, the defaultdict will create the value for us.

Counting

Now we have mastered grouping, counting should be simple. We just have to know that int() when called returns the value 0, so that can be passed to defaultdict. So here we have:

from collections import defaultdict

d = defaultdict(int)
for name in names:
	key = len(name)
	d[key] += 1

Here a common use case is not even to use a key, but to count just the number of times something appears. In that case, we could do the following simplified version.

from collections import defaultdict

names = ["mark", "john", "mark", "fred", "paul", "john"]

d = defaultdict(int)
for name in names:
	d[name] += 1

#result: d = {'mark': 2, 'john': 2, 'fred': 1, 'paul': 1}

This was considered common enough that there is even a built-in way to do this using Counter, so the above can be reduced to this.

from collections import Counter
names = ["mark", "john", "mark", "fred", "paul", "john"]
d = Counter(names)

Counter comes with some nice little extras, such as being able to add, or subtract results. So we could do something like the following.

from collections import Counter
boys = ["mark", "john", "mark", "fred", "paul", "john"]
girls = ["mary", "joan", "joan", "emma", "mary"]
b = Counter(boys)
g = Counter(girls)
c = b + g
#result: c = Counter({'mark': 2, 'joan': 2, 'john': 2, 'mary': 2, 'fred': 1, 'paul': 1, 'emma': 1})

But what happens if you want to use Counter but need to pass the result through some key function first? How would you do it? The solution would be to put a generator inside of it like the following.

from collections import Counter
names = ['mark', 'henry', 'matthew', 'paul',
         'luke', 'robert', 'joseph', 'carl', 'michael']
d = Counter(len(name) for name in names)

Useful key functions

Some possible common cases when grouping or counting, is you might want to do so based on some item in or attribute of the items you are grouping. So for examples, your data might be a tuple of first and last names, or a dictionaries with first and last name keys, or a class with first and last name attributes. If that is what you group or count by, there are two built in functions that can help do this, without needing to write our own functions. Those are itemgetter() and attrgetter from the operator module. Some examples might help.

from collections import defaultdict
from operator import itemgetter, attrgetter

# if names used tuples
names = [('mary', 'smith'), ('mark', 'davis')]
# the last_name function would look like
last_name = itemgetter(1)

# if names are dictionaries
names = [{'first': 'mary', 'last':'smith'), ('first': 'mark', 'last': 'davis')]
# the last_name function would look like
last_name = itemgetter('last')

# if names are classes with first and last as attributes
names = [Person('mary', 'smith'), Person('mark', 'davis')]
# the last_name function would look like
last_name = attrgetter('last')

d = defaultdict(list)
for name in names:
	key = last_name(name)
	d[key].append(name)

Bonus

When I was studying Software Engineering I got a job tutoring for the first year programming course, which was in python and had 200-300 students depending on semester (hence the need for tutors to help with questions during practicals). One the challengers some of more curious students used to ask, is how I would do certain things in one line (I ended up doing their whole first assigment in a single 1500 character line). Often really bad code, but also often rather interesting trying to reduce a problem to a single statement. I had a shot at doing it for this, and this was the solution that I came up with in a few minutes. I leave working out how it works as an exercise to the reader. I would never use it in production code.

names = ['mark', 'henry', 'matthew', 'paul',
         'luke', 'robert', 'joseph', 'carl', 'michael']
# len is our 'key' function here
d = {k: [i for x, i in v] 
     for k, v in itertools.groupby(sorted((len(x), x) for x in names), 
     key=operator.itemgetter(0))}

Tag: porgramming

Python How To: Group and Count with Dictionaries

Grouping

Counting

Useful key functions

Bonus