How to Create an Ordered Counter class in Python

After I made my post How to Group and Count with Dictionaries, I noticed there was something nice I could have added. That was how to make a counter class that also remembers the order in which it first found something. It is super easy (pun intended) to do in Python. See the following.

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    pass

This code works because of the method resolution order. You can easily find out what that is by typing help(OrderedCounter) into the Python console. I have reproduced what you would get for Counter and OrderedCounter below (this is Python 2 output; Python 3 shows builtins.dict and builtins.object instead).

Counter
 |  Method resolution order:
 |      Counter
 |      __builtin__.dict
 |      __builtin__.object

OrderedCounter
 |  Method resolution order:
 |      OrderedCounter
 |      collections.Counter
 |      collections.OrderedDict
 |      __builtin__.dict
 |      __builtin__.object

What has changed is that OrderedDict has now been inserted, just before dict, into the order in which methods will be searched for when using OrderedCounter. So any dictionary method that OrderedDict overrides will come from OrderedDict instead of dict, giving us the result we want.
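
As a quick check of the above, here is a small sketch that builds an OrderedCounter and looks at its method resolution order directly through the __mro__ attribute. The sample string and the variable names are mine, picked only for illustration. Note that since Python 3.7 a plain dict (and so a plain Counter) also keeps insertion order, so the difference in key order is mostly visible on older versions.

```python
from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    """A Counter that remembers the order in which keys were first seen."""
    pass

counts = OrderedCounter("abracadabra")

# keys come back in first-seen order
first_seen = list(counts)
print(first_seen)

# the classes Python searches, in order, when looking up a method
mro_names = [cls.__name__ for cls in OrderedCounter.__mro__]
print(mro_names)
```

OrderedDict sits between Counter and dict in that list, which is the whole trick.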

You can use this method of inheritance on any class that inherits from dict where you want to keep the keys in the order in which they were inserted. In the more general case, you can use it to create a new class that uses an alternative implementation of some interface.

For more information on this, there is an excellent post by Raymond Hettinger called Python’s super() considered super! that goes into all this in way more detail.

The property function: why Python does not have Private Methods

Now and then, I see it said that Python should have private or protected methods on its classes. This typically comes from someone used to programming in Java or C++ or the like, who misses them and insists they are essential for any programming language. But python does not have them and never will. Private methods are a solution that works in statically typed languages. It is not the solution for a highly dynamic language like Python.

tl;dr Python has a dynamic way of fixing this problem. Just scroll down to the end to see the code.

But let us look at the problem starting with an example class. Let us do something non-trivial that might be used in the real world: a class that handles colour manipulation, maybe because I am building a colour picker or doing something cool with colours. So my code might start like this.

class Colour(object):
    """A colour wrangler class"""
    def __init__(self, hexcode="ff0000"):
        self.hexcode = hexcode

I would then add a pile of other methods that will be needed, but let us focus on this. The class has this hexcode attribute that anyone can now access. So I build the rest of my project around this and access the hexcode in many places in my code, and others start doing it in theirs as well.

But then I hit a problem: storing the colour as a hex string is clumsy since I have to keep getting out the RGB components any time I want to manipulate it, and when converting to other colour spaces and back again, I get rounding errors. So I now want to store it as three floats. So I update my code.

class Colour(object):
    """A colour wrangler class"""
    def __init__(self, hexcode="ff0000"):
        self.r = float(int(hexcode[0:2], 16))
        self.g = float(int(hexcode[2:4], 16))
        self.b = float(int(hexcode[4:6], 16))

Now everyone else’s code that depended on the hexcode attribute is broken. At this point, the Java/C++ people would say, “I told you so. If only you had private members, you could have kept the attribute private and exposed it through getter and setter methods.” But that is never going to happen in Python. Python is too dynamic to check this at compile time, and run-time checking would be far too expensive. But Python does have an elegant solution to this. It is the built-in property function. This is how I would now fix my code using it.

class Colour(object):
    """A colour wrangler class"""
    def __init__(self, hexcode="ff0000"):
        self.hexcode = hexcode
    
    @property
    def hexcode(self):
        """Return the colour as a hex string"""
        return ("{:02x}" * 3).format(int(self.r), int(self.g), int(self.b))
    
    @hexcode.setter
    def hexcode(self, hexcode):
        self.r = float(int(hexcode[0:2], 16))
        self.g = float(int(hexcode[2:4], 16))
        self.b = float(int(hexcode[4:6], 16))

I can now get and set the hexcode attribute as before, and it will all get transparently handled. I have even used it as a setter in my __init__ method, so I don’t have to repeat the conversion.
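
To see that nothing changes for the calling code, here is a small sketch that repeats the class and then reads and writes hexcode exactly like a plain attribute. The sample hex values and variable names are mine, chosen only for illustration.

```python
class Colour(object):
    """A colour wrangler class (same shape as the version above)"""
    def __init__(self, hexcode="ff0000"):
        self.hexcode = hexcode  # goes through the setter below

    @property
    def hexcode(self):
        """Return the colour as a hex string"""
        return ("{:02x}" * 3).format(int(self.r), int(self.g), int(self.b))

    @hexcode.setter
    def hexcode(self, hexcode):
        self.r = float(int(hexcode[0:2], 16))
        self.g = float(int(hexcode[2:4], 16))
        self.b = float(int(hexcode[4:6], 16))

colour = Colour("336699")
before = colour.hexcode      # read like a plain attribute
colour.hexcode = "00ff00"    # assign like a plain attribute
after = (colour.r, colour.g, colour.b)
print(before, after)
```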

Note that @property is put first, before the function that acts as the getter, and then @hexcode.setter not @property.setter is used to define the setter. You can’t change the order and expect it to work. Also, note that the two functions have the same name. And the doc string from the getter will be given if you do help(Colour.hexcode).

There is also an alternative way of doing the same thing, this time not using property as a decorator.

class Colour(object):
    """A colour wrangler class"""
    def __init__(self, hexcode="ff0000"):
        self.hexcode = hexcode
    
    def gethexcode(self):
        return ("{:02x}" * 3).format(int(self.r), int(self.g), int(self.b))
    
    def sethexcode(self, hexcode):
        self.r = float(int(hexcode[0:2], 16))
        self.g = float(int(hexcode[2:4], 16))
        self.b = float(int(hexcode[4:6], 16))
    
    hexcode = property(gethexcode, sethexcode, doc="Return the colour as a hex string")

This gives the same result, except it also exposes the two functions as an alternative way of getting/setting the value. Also, for completeness, I guess I should mention there is a deleter as well, which can be passed as the third argument to property, or used as @hexcode.deleter, but that seems less useful to me.

I would say I prefer the python way of doing it. First, it means I only have to define the getter and setter functions after the fact of an API change, not as some pre-emptive move just in case it changes later. Also, it looks cleaner to me to access attributes than call methods.

You can also use this to create read-only attributes by leaving out the setter (you will get an AttributeError if you try to assign to one) or to add some validation around an attribute to prevent invalid values from being set. I could, in this case, make self.r etc. into properties as well to ensure they are in the correct range of 0 to 255.
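
A rough sketch of both ideas, cut down to a single channel to keep it short. The _r backing attribute, the is_dark property and the error message are hypothetical names I made up for this example, not part of the class above.

```python
class Colour(object):
    """Sketch: a validated channel plus a read-only derived attribute"""
    def __init__(self, r=255.0):
        self.r = r  # goes through the validating setter

    @property
    def r(self):
        return self._r

    @r.setter
    def r(self, value):
        value = float(value)
        if not 0.0 <= value <= 255.0:
            raise ValueError("r must be between 0 and 255")
        self._r = value

    @property
    def is_dark(self):
        """Read-only: there is no setter for this one"""
        return self._r < 128.0

colour = Colour(200)

try:
    colour.r = 300          # rejected by the setter
except ValueError as exc:
    validation_error = str(exc)

try:
    colour.is_dark = True   # no setter defined
except AttributeError:
    read_only = True
```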

This, however, will not stop people from messing with variables you wanted to be protected (indicated by putting an underscore before it). But you told them they are playing with fire. If they get burned, that can’t be helped. So this python solution does not cover all use cases that private and protected methods are supposed to, but it covers enough that I think the need for them is not significant in Python. Instead, Python has its own very powerful and flexible solution.

Python: Tips, Tricks and Idioms – Part 2 – Decorators and Context Managers

Two weeks ago, I wrote a post called Python: tips, tricks and idioms, where I went through many features of python. However, I want to narrow down on just a few and look at them in more depth. The first is decorators, which I did not cover, and the second is context managers, for which I only gave one example. Again all the code samples are on gist.github.com.

There is a reason that I put them together; they both have the same goal. They can both help separate what you are trying to do (the “business logic”) from some extra code added for clean-up or performance etc. (the “administrative logic”). So basically, it helps package away in a reusable way code that we don’t care too much about.

Decorators

Decorators are easy to use, even if you have never seen them. You could probably guess what is going on, even if not how or why. Take this for example:

@cache
def web_lookup(url):
    page = urlopen(url)
    try:        
        return page.read()
    finally:
        page.close()

Overlooking for now precisely what library urlopen comes from… We can assume that the results of web_lookup() will be cached so that they are not fetched every time we ask for the same URL. So simple, just know we can put @some_decorator before our function, and we can use any decorator. But how do we write one?

First, we need to understand what the decorator is doing. @cache is just syntactic sugar for the following.

web_lookup = cache(web_lookup)

So this is important. What cache is, is a function that takes another function as an argument and returns a new function that can be used just like the function could be before, but presumably adding in some extra logic. So for our first decorator, let us start with something simple, a decorator that squares the result of the function it wraps.

def square(func):
    def _square(num):
        return func(num) ** 2
    return _square

# which we can then use like this

@square 
def plus(num):
    """Adds 1 to a number"""
    return num + 1

So in this little example, every time we call plus() the number will have 1 added to it, but because of our decorator, the result will also be squared.

But there is a problem with this: plus() is no longer the plus() function that we defined, but another function wrapping it. Things like the doc string have gone missing, so help(plus) will describe the wrapper instead of our function. But in the functools library, there is a decorator to fix that, functools.wraps(). Always use functools.wraps() when writing a decorator.

from functools import wraps

def square(func):
    @wraps(func)
    def _square(num):
        return func(num) ** 2
    return _square
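
To make the difference visible, this sketch puts the two versions side by side and compares the __name__ and __doc__ of the decorated functions. The _plain and _wrapped suffixes are only there so both versions can live in one file.

```python
from functools import wraps

def square_plain(func):
    def _square(num):
        return func(num) ** 2
    return _square

def square_wrapped(func):
    @wraps(func)
    def _square(num):
        return func(num) ** 2
    return _square

@square_plain
def plus_plain(num):
    """Adds 1 to a number"""
    return num + 1

@square_wrapped
def plus_wrapped(num):
    """Adds 1 to a number"""
    return num + 1

# without wraps() we see the inner function's name and no doc string
print(plus_plain.__name__, plus_plain.__doc__)
# with wraps() the original name and doc string are copied over
print(plus_wrapped.__name__, plus_wrapped.__doc__)
```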

But did you notice? wraps() is a decorator that takes arguments. How do we write something like that? It gets a little more involved, but let us start with the code. It will be a function that now raises the number to some power.

def power(pow):
    def _power(func):
        @wraps(func)
        def _pow(num):
            return func(num) ** pow
        return _pow 
    return _power

@power(2) 
def plus(num):
    """Adds 1 to a number"""
    return num + 1

So yes, three functions, one inside the other. The result is that power() is no longer the decorator but a function that returns the decorator. See, the issue is one of scoping, which is why we have to put the functions inside each other. When _pow() is called, the value of pow comes from the scope of the power() function that contains it.
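
It may help to see the two steps written out without the @ syntax. In this sketch, plus_one, decorator and cubed_plus are names I picked for illustration.

```python
from functools import wraps

def power(pow):
    def _power(func):
        @wraps(func)
        def _pow(num):
            return func(num) ** pow
        return _pow
    return _power

@power(3)
def plus(num):
    """Adds 1 to a number"""
    return num + 1

# the same thing, spelled out by hand
def plus_one(num):
    return num + 1

decorator = power(3)              # step 1: power() returns the decorator
cubed_plus = decorator(plus_one)  # step 2: the decorator wraps the function

print(plus(1), cubed_plus(1))
```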

So now we know how to write highly reusable function decorators, or do we? There is a problem still, and that is our internal function _square() or _pow() only takes one argument, so any function it wraps can take only one argument. What we want is to be able to have any number of arguments. So that is where the star operator comes in.

Star operator

The * (star) operator can be used in a function definition to take an arbitrary number of arguments, all of which are collapsed into a single tuple. An example might help.

def join_words(*args):
    """Joins all the words into a single string"""
    return " ".join(args)

print(join_words("Hello", "world"))

The * operator can also be used for the reverse case when we have an iterable, but we want to pass its items as the arguments to a function. This gets called argument unpacking.

words = ("Hello", "world")
print(join_words(*words))

The same basic idea can also be used for keyword arguments. For this, we use the ** (double star) operator. But instead of getting a tuple of the arguments, we get a dictionary. We can also use both together. So some examples.

def print_args(*args, **kwargs):
    print(args)
    print(kwargs)

print_args("Hello", "world", count=2, letters=10)

# output:
# ('Hello', 'world')
# {'count': 2, 'letters': 10}

# or calling the function with argument unpacking.

words = ("Hello", "world")
arguments = {'count': 2, 'letters': 10}

print_args(*words, **arguments)

Better Decorators

So now we can go back and change our decorator to be truly generic. Let’s do it with the simplest one we wrote, @square.

def square(func):
    @wraps(func)
    def __square(*args, **kwargs):
        return func(*args, **kwargs) ** 2
    return __square

Now no matter what arguments the function takes, we will happily just pass them through to the function we are wrapping.
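
A short sketch to try that out with functions that take more than one argument. The add() and total() functions are made up for this example.

```python
from functools import wraps

def square(func):
    @wraps(func)
    def _square(*args, **kwargs):
        # pass everything through untouched, then square the result
        return func(*args, **kwargs) ** 2
    return _square

@square
def add(a, b=0):
    """Adds two numbers"""
    return a + b

@square
def total(*numbers):
    """Sums any amount of numbers"""
    return sum(numbers)

print(add(1, b=2))
print(total(1, 2, 3))
```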

So let us go back to our web_lookup function and write it first with caching, and then write the decorator to see the difference.

saved = {}

def web_lookup(url):
    if url in saved:
        return saved[url]
    page = urlopen(url)
    try:        
        data = page.read()
    finally:
        page.close()
    saved[url] = data
    return data

That is how it might look, and our problem here is that the caching code is mixed up with what web_lookup() is supposed to do. It makes it harder to maintain it, harder to reuse it, and harder to update the way we cache it if we have done something like this all over our code. So our very generic decorator might look like this.

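import functools

def cache(obj):
    # one cache dictionary per decorated function, also exposed as obj.saved
    saved = obj.saved = {}
    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        # build a key from the string form of the arguments
        key = str(args) + str(kwargs)
        if key not in saved:
            saved[key] = obj(*args, **kwargs)
        return saved[key]
    return memoizer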

# now our nice clean web_lookup()

@cache
def web_lookup(url):
    page = urlopen(url)
    try:
        return page.read()
    finally:
        page.close() 

So that can wrap any function with any number of arguments just by putting @cache before it. But I did not write that function myself. I just lifted it right off the Python Decorator Library, which has many examples of decorators you can use.
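
To check that the caching really happens without hitting the network, here is a sketch that uses a stand-in function and records every time its body actually runs. slow_lookup(), the calls list and the example URLs are all hypothetical.

```python
import functools

def cache(obj):
    saved = obj.saved = {}
    @functools.wraps(obj)
    def memoizer(*args, **kwargs):
        key = str(args) + str(kwargs)
        if key not in saved:
            saved[key] = obj(*args, **kwargs)
        return saved[key]
    return memoizer

calls = []  # records every time the real function body runs

@cache
def slow_lookup(url):
    """Hypothetical stand-in for web_lookup() that does no network access"""
    calls.append(url)
    return "contents of " + url

first = slow_lookup("http://example.com/a")
second = slow_lookup("http://example.com/a")  # served from the cache
third = slow_lookup("http://example.com/b")
print(calls)
```

If you are on Python 3.2 or later, the standard library also ships functools.lru_cache, which covers the same ground for functions whose arguments are hashable.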

Context Managers

In the previous post, I did a single example of using a context manager, opening a file. It looked like this:

with open('/etc/passwd', 'r') as f:
    print(f.read())

# which is equivalent to the longer

f = open('/etc/passwd', 'r')
try:
    print(f.read())
finally:
    f.close()

Admittedly the context manager is only a little shorter, and the file will be garbage collected (at least in CPython), but there are other cases where it might be a bigger problem if you get it wrong. So, for example, the threading library can also use a context manager.

import threading

lock = threading.Lock()
with lock:
    print('Critical section')

This is nice and simple, and you can see from the indent what the critical section is and be sure the lock is released.
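
For something a bit closer to real use, this sketch has a few threads bump a shared counter, with the lock taken around each update. The work() function, the thread count and the loop size are arbitrary choices of mine.

```python
import threading

lock = threading.Lock()
counter = 0

def work(times):
    global counter
    for _ in range(times):
        with lock:  # acquired here, released at the end of the block
            counter += 1

threads = [threading.Thread(target=work, args=(10000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter)
```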

If you are dealing with a file-like object that doesn’t have a context manager, the contextlib module has the closing context manager. So let us go back and improve our web_lookup() function.

from contextlib import closing

def web_lookup(url):
    with closing(urlopen(url)) as page:
        return page.read()

We can also write our own context managers. All that is needed is to use the @contextmanager decorator and have a function with a yield in it. The yield marks the point at which the context manager stops while the code within the with statement runs. So the following can be used to time how long it takes to do something.

from contextlib import contextmanager
import time

@contextmanager
def timeit():
    start = time.time()
    try:
        yield
    finally:
        print("It took", time.time() - start, "seconds")

# this might take a few seconds
with timeit():
    list(range(1000000))

The try finally, in this case, is optional, but without it, the time would not be printed if an exception was raised inside the with statement.

We can also do more complicated context managers. The following is something like what was added in Python 3.4 as contextlib.redirect_stdout. It will take whatever is printed to sys.stdout and put it in a file (or file-like object). This would be handy if, for example, we had timeit() context managers all over our code and wanted to start putting the results into a log file. Here the yield is followed by a value, which is why we can then use the with ... as syntax.

from contextlib import contextmanager
import io, sys

@contextmanager
def redirect_stdout(fileobj=None):
    if fileobj is None:
        fileobj = io.StringIO() # in python 2 use BytesIO
    oldstdout = sys.stdout
    sys.stdout = fileobj
    try:
        yield fileobj
    finally:
        sys.stdout = oldstdout

with redirect_stdout() as f:
    help(pow)
help_text = f.getvalue()

with open('some_log_file', 'w') as f:
    with redirect_stdout(f):
        help(pow)

# above can be also written as
with open('some_log_file', 'w') as f, redirect_stdout(f):
    help(pow)

The last with statement also shows off the use of compound with statements. It is just the same as putting one with inside another.

Finally, at least in passing, it is worth mentioning that any class can be turned into a context manager by adding the __enter__() and __exit__() methods. The code in either will do more or less what the code on either side of the yield statement would do.
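
As a minimal sketch of that, here is a class-based take on the timing example. The Timer class and its start and elapsed attributes are hypothetical names used only for this illustration.

```python
import time

class Timer(object):
    """Hypothetical class-based take on the timeit() example"""
    def __enter__(self):
        # runs before the with block, like the code before the yield
        self.start = time.time()
        return self  # this is what "as" binds to

    def __exit__(self, exc_type, exc_value, traceback):
        # runs after the with block, even if it raised an exception
        self.elapsed = time.time() - self.start
        return False  # do not swallow exceptions

with Timer() as timer:
    total = sum(range(1000))

print("It took", timer.elapsed, "seconds")
```

Returning a false value from __exit__() lets any exception carry on up the stack, which matches what the try finally version does.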

And that is all for this round. I hope you learned something new and interesting. Don’t forget to follow me on Twitter if you want more python tips, such as when I write about sets and dictionaries next time. In the meantime, if you are looking for more, there is an excellent book, Python Cookbook, Third edition by O’Reilly Media. I have been reading parts of it and might include a few things I learned from it in my next post. Or, if you want something simpler, try Learning Python, 5th Edition.

Python: Tips, Tricks and Idioms

My programming language of preference is python because I feel I write better code faster with it than I do with other languages. However, it also has a lot of nice tricks and idioms to do things well. Therefore, and partly as a reminder to use them, and partly because I thought this might be of general interest, I have put together this collection of some of my favourite idioms. I am also putting this on gist.github.com so that anyone that wants to contribute their things can, and I will try and keep this post up to date.

enumerate

A fairly common thing to do is loop over a list while keeping track of what index we are up to. Now we could use a count variable, but python gives us a nicer syntax for this with the enumerate() function.

students = ('James', 'Andrew', 'Mark')
for i, student in enumerate(students):
    print(i, student)
# output:
# 0 James
# 1 Andrew
# 2 Mark 

set

set is a useful little data structure, it is kind of like a list, but each value in it is unique. There are some valuable operations, besides creating a list of unique items, that we can do with it. For example, let us try different ways of validating input lists.

colours = set(['red', 'green', 'blue', 'yellow', 'orange', 'black', 'white'])

# or using the newer syntax to declare the set.
input_values = {'red', 'black', 'pizza'}

# get a set of the valid colours
valid_values = input_values.intersection(colours)

print(valid_values)
# output (order may vary): {'black', 'red'}

# get a set of the invalid colours
invalid_values = input_values.difference(colours)

print(invalid_values)
# output: {'pizza'}

# throw exception if there is something invalid 
if not input_values.issubset(colours):
    raise ValueError("Invalid colour: " + ", ".join(input_values.difference(colours)))

Control statements

with

The with statement is useful when accessing anything that supports the context management protocol. This means open() for example. It ensures that any set-up and clean-up code, such as closing files, is run without worrying about it. So, for example, to open a file:

with open('/etc/passwd', 'r') as f:
    print(f.read())

for … else

This is an interesting bit of syntax. It allows you to run some code if the loop never reached the break statement. It replaces the need to keep a tracking variable for if you broke or not. Just looking over my code, here is a pseudo version of something I was doing.

# some code

for file_name in file_list:
    if is_build_file(file_name):
        break
else: # no break
    make_build_file()

# something else here
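
Since the above is only pseudo code, here is a small runnable sketch of the same pattern. find_first_negative() is a made-up helper for illustration.

```python
def find_first_negative(numbers):
    """Hypothetical helper: return the first negative number, or None"""
    for number in numbers:
        if number < 0:
            found = number
            break
    else:  # no break: the loop ran to the end (or never ran at all)
        found = None
    return found

print(find_first_negative([3, 1, -4, 1, -5]))
print(find_first_negative([3, 1, 4]))
print(find_first_negative([]))
```

Note that the else branch also runs when the list is empty, since the loop finishes without ever reaching break.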

Conditional Expressions

Python allows for conditional expressions, so instead of writing an if .. else with just one variable assignment in each branch, you can do the following:

# make number always be odd
number = count if count % 2 else count - 1

# call function if object is not None 
name = user.name() if user is not None else 'Guest'
print("Hello", name)

This is one of the reasons I like python. The above is very readable, compared to the ternary operator that looks like a ? b : c that exists in other languages. It always confuses me.

List Comprehension

List comprehensions are supposed to replace building a list by looping and calling append. Compare the following.

numbers = [1, 2, 3, 4, 5, 6, 7]
squares = []
for num in numbers:
    squares.append(num * num)

# with a list comprehension
squares = [num * num for num in numbers]

We can also make this more complicated by adding in filtering or putting a conditional assignment in:

numbers = [1, 2, 3, 4, 5, 6, 7]

# squares of all the odd numbers
squares = [num * num for num in numbers if num % 2]

# times even numbers by 2 and odd numbers by 3
mul = [num * 3 if num % 2 else num * 2 for num in numbers]

Generator expressions

List comprehensions have one possible problem: they build the list in memory right away. If you are dealing with big data sets, that can be a big problem, but even with small lists, it is still extra overhead that might not be needed. If you are only going to loop over the results once, there is no gain in building this list. So if you can give up being able to index into the result and do other list operations, you can use a generator expression, which uses very similar syntax, but creates a lazy object that computes nothing until you ask for a value.

# generator expression for the square of all the numbers
squares = (num * num for num in numbers)

# where you would likely get a memory problem otherwise

with open('/some/number/file', 'r') as f:
    squares = (int(num) * int(num) for num in f)
    # do something with these numbers
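
A rough way to see the difference is to compare the size of the two objects. This sketch uses sys.getsizeof(), which only measures the container itself and not the numbers inside it, so treat the figures as an indication rather than an exact measurement. The range size is arbitrary.

```python
import sys

numbers = range(100000)

squares_list = [num * num for num in numbers]  # built in full, right now
squares_gen = (num * num for num in numbers)   # nothing computed yet

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# values are produced one at a time as sum() asks for them
total = sum(squares_gen)

# the price of laziness: a generator can only be consumed once
leftover = list(squares_gen)
print(leftover)
```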

Generators using yield

Generator expressions are great, but sometimes you want something with similar properties but not limited by the syntax that generator expressions use. Enter the yield statement. So, for example, the below will create a generator that is an infinite series of random numbers. So as long as we keep asking for another random number, it will happily supply one.

import random
def random_numbers(high=1000):
    while True:
        yield random.randint(0, high)
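
Because the series never ends, you cannot simply call list() on it. A sketch of two ways to take only what you need, using itertools.islice() and the built-in next(); the variable names are mine.

```python
import random
from itertools import islice

def random_numbers(high=1000):
    while True:
        yield random.randint(0, high)

# take just the first five values from the endless stream
first_five = list(islice(random_numbers(), 5))
print(first_five)

# or pull values one at a time with next()
numbers = random_numbers(high=10)
one = next(numbers)
two = next(numbers)
print(one, two)
```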

Dictionary Comprehensions

One generator use can be to build a dictionary, like in the first example below. This proved itself to be common enough that now there is even a new dictionary comprehension syntax for it. Both of these examples swap the keys and values of the dictionary.

teachers = {
    'Andy': 'English',
    'Joan': 'Maths',
    'Alice': 'Computer Science',
}
# using a generator expression
subjects = dict((subject, teacher) for teacher, subject in teachers.items())

# using a dictionary comprehension
subjects = {subject: teacher for teacher, subject in teachers.items()}

zip

If you thought that generating an infinite number of random integers was not that useful, well, here I want to use it to show another function that I like to use: zip(). zip() takes several iterables and joins the nth item of each into a tuple. So, for example:

names = ('James', 'Andrew', 'Mark')
for i, name in zip(random_numbers(), names):
    print(i, name)

# output:
# 288 James
# 884 Andrew
# 133 Mark

So basically, it prints out all the names with a random number (from our previous random number generator) next to each name. Notice that zip() will stop as soon as it reaches the end of the shortest iterable. However, if that is not desired, the itertools module has zip_longest() (izip_longest() in Python 2), which goes till the end of the longest.
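
Here is a small sketch of the two behaviours side by side, using itertools.zip_longest() from Python 3. The scores tuple and the fill value of 0 are made up for the example.

```python
from itertools import zip_longest

names = ('James', 'Andrew', 'Mark')
scores = (90, 85)  # deliberately one short

# zip() stops at the end of the shortest iterable
short = list(zip(names, scores))
print(short)

# zip_longest() keeps going and pads the gaps with fillvalue
padded = list(zip_longest(names, scores, fillvalue=0))
print(padded)
```

Just do not use it with an infinite generator like random_numbers(), or it will never stop.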

We could also do something similar to get a dict of each name mapped to a random number like this.

dict(zip(names, random_numbers()))

# output: {'James': 992, 'Andrew': 173, 'Mark': 329}

itertools

I mentioned itertools before. It is worth reading through if you have not looked at it before. Plus, at the end, there is a whole section of recipes on how to use the module to create even more interesting operators on iterables.

Collections

Python comes with a module called collections that contains several container data types. Though I only want to look at two right now, it also has three more: namedtuple(), deque (a linked-list-like structure), and OrderedDict.

defaultdict

This is a data type that I use a fair bit. One practical case is when you are appending to lists inside a dictionary. If you are using a dict() you would need to check if the key exists before appending, but with defaultdict, this is not required. So, for example.

from collections import defaultdict

order = (
    ('Mark', 'Steak'),
    ('Andrew', 'Veggie Burger'),
    ('James', 'Steak'),
    ('Mark', 'Beer'),
    ('Andrew', 'Beer'),
    ('James', 'Wine'),
)

group_order = defaultdict(list)

for name, menu_item in order:
    group_order[name].append(menu_item)

print(group_order)

# output
# defaultdict(<class 'list'>, {
#     'Mark': ['Steak', 'Beer'],
#     'Andrew': ['Veggie Burger', 'Beer'],
#     'James': ['Steak', 'Wine']
# })

We could also count them like this.

order_count = defaultdict(int)

for name, menu_item in order:
    order_count[menu_item] += 1

print(order_count)

# output
# defaultdict(<class 'int'>, {
#     'Steak': 2,
#     'Veggie Burger': 1,
#     'Beer': 2,
#     'Wine': 1
# })

Counter

But the last example is redundant because collections already contains a class for doing this, called Counter. In this case, I need to first extract the second item from each tuple, for which I can use a generator expression.

from collections import Counter

order_count = Counter(menu_item for name, menu_item in order)
print(order_count)

# output
# Counter({
#    'Steak': 2,
#    'Beer': 2,
#    'Veggie Burger': 1,
#    'Wine': 1
# })

Another, better example might be counting all the different lines that appear in a file. It becomes straightforward.

with open('/some/file', 'r') as f:
    line_count = Counter(f)
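
Once you have the counts, Counter.most_common() gives you the most frequent entries. In this sketch an io.StringIO object stands in for a real file, and the sample lines are made up; I also strip the trailing newline from each line, which the snippet above does not do.

```python
from collections import Counter
import io

# io.StringIO stands in for a real file here
fake_file = io.StringIO("apple\nbanana\napple\ncherry\napple\nbanana\n")

# strip() removes the newline so the keys are just the text
line_count = Counter(line.strip() for line in fake_file)

# the two most frequent lines, with their counts
top_two = line_count.most_common(2)
print(top_two)
```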

If you enjoyed this post or found it helpful, please leave a comment or share it on Twitter. Also, if people find this useful, I will try and do some follow-up posts explaining some things in more detail and with additional examples.

Edit: if you have found this helpful but want more, there is an excellent book, Python Cookbook, Third edition by O’Reilly Media, that has a whole lot more. If you want something simpler, try Learning Python, 5th Edition.