Python operators and how they affect pandas

Python is among the most popular programming languages today. In my opinion, there is one main reason for that: readability.

A clear illustration of how Python was designed to be readable is the following example:

if 'melon' not in ('apple', 'coconut'): print('it is missing!')

Compare this with the equivalent in JavaScript, for example:

var fruits = new Array("apple", "coconut");
if (fruits.indexOf("melon") === -1) { console.log("it is missing!"); }

I think it's clear how Python tries to make life easy for humans, even at the cost of extra complexity in the interpreter.

Example

The point about readability is not only true of code that you write and need to read later. It also applies when you write libraries for other people: Python is designed so that you can write libraries that let your users write readable code.

For example, let's think of a toy library that implements colors.

In [1]:
class ColorStep1:
    def __init__(self, red=0, green=0, blue=0):
        self.red = red
        self.green = green
        self.blue = blue

    def __str__(self):
        """Convert the color from the 3 integer values, to a string like #ffffff."""
        return f'#{self.red:02x}{self.green:02x}{self.blue:02x}'

    def _repr_html_(self):
        """Display the color as a box of its color in Jupyter."""
        return f'<span style="color: {self}">▅</span>'

blue = ColorStep1(blue=255)
blue
Out[1]:

Note: In practice it would make sense to have a single Color class with all the methods. I'll be writing it as separate ColorStepN classes, each inheriting from the previous one, to show the development step by step.

A common way to mix colors could be to simply implement a mix method.

In [2]:
class ColorStep2(ColorStep1):
    @staticmethod
    def _mix_one(color1, color2):
        """There are many ways to mix colors, here we just take the sum of the components."""
        return min(color1 + color2, 255)

    def mix(self, other):
        return ColorStep2(red=self._mix_one(self.red, other.red),
                          green=self._mix_one(self.green, other.green),
                          blue=self._mix_one(self.blue, other.blue))

red = ColorStep2(red=255)
green = ColorStep2(green=255)

# Mixing red and green to generate yellow
red.mix(green)
Out[2]:

This works well, but could we make that last line more readable? I think so. I think it would be really cool for the users of our colors library if they could simply write red + green.

As mentioned before, Python is designed to not only let us write readable code, but to write libraries that will make the code of our users readable.

Operators

A first version of our class with operators could look like:

In [3]:
class ColorStep3(ColorStep2):
    def __add__(self, other):
        return self.mix(other)

red = ColorStep3(red=255)
green = ColorStep3(green=255)

red + green
Out[3]:

You can disagree, but to me, and I bet to most Python programmers, red + green is easier to read than red.mix(green). So, we managed to let users use this syntax, with just the addition of the special __add__ method.
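To make the mechanism explicit, here is a minimal sketch of what the + operator does with a class that defines __add__. The Money class is a hypothetical toy, unrelated to the color classes, used only so the example is self-contained.

```python
import operator


class Money:
    """Hypothetical toy class, used only to illustrate __add__."""

    def __init__(self, cents):
        self.cents = cents

    def __add__(self, other):
        # Called by Python when a Money instance is on the left of +
        return Money(self.cents + other.cents)


a = Money(150)
b = Money(250)

# These three spellings all end up calling Money.__add__
total_operator = a + b
total_function = operator.add(a, b)
total_explicit = Money.__add__(a, b)
```

The operator form is just the most readable of the three spellings.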

Interacting with other types

An extra feature I would like is to be able to mix instances of my color class with colors written as strings in the form #ffffff. Let's give it a try first, and see why it fails:

In [4]:
red = ColorStep3(red=255)
green = '#00ff00'

red + green
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-a9a4578f859e> in <module>
      2 green = '#00ff00'
      3 
----> 4 red + green

<ipython-input-3-354fcff4f7c5> in __add__(self, other)
      1 class ColorStep3(ColorStep2):
      2     def __add__(self, other):
----> 3         return self.mix(other)
      4 
      5 red = ColorStep3(red=255)

<ipython-input-2-828cfc83e29c> in mix(self, other)
      6 
      7     def mix(self, other):
----> 8         return ColorStep2(red=self._mix_one(self.red, other.red),
      9                           green=self._mix_one(self.green, other.green),
     10                           blue=self._mix_one(self.blue, other.blue))

AttributeError: 'str' object has no attribute 'red'

Our implementation of the mix method assumes that it will receive an instance of the color class, since it expects to find the attributes red, green and blue.

What we will do is to create a method to convert the string representation to our class. And then we will automatically convert the other parameter if it is a string.

In [5]:
class ColorStep4(ColorStep3):
    @staticmethod
    def _parse_rgb_string(value):
        import re
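        match = re.search('^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$',
                          value.lower())
        # Fail with a clear message instead of an AttributeError on None
        if match is None:
            raise ValueError(f'Not a valid RGB color string: {value!r}')

        return ColorStep4(red=int(match.group(1), 16),
                          green=int(match.group(2), 16),
                          blue=int(match.group(3), 16))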

    def __add__(self, other):
        if isinstance(other, str):
            other = self._parse_rgb_string(other)

        return self.mix(other)

red = ColorStep4(red=255)
green = '#00ff00'

red + green
Out[5]:

This wasn't difficult. Now I can add (mix) a string to my color class. But can I add my color class to a string? Python strings are generic, and they don't know anything about the color class I just implemented.

The answer is no:

In [6]:
red = '#ff0000'
green = ColorStep4(green=255)

red + green
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-81eb708d4cba> in <module>
      2 green = ColorStep4(green=255)
      3 
----> 4 red + green

TypeError: can only concatenate str (not "ColorStep4") to str

I don't think it makes sense to modify the str type in Python to let it know about our new class (and it wouldn't be simple).

Luckily, Python provides a way to make this work easily. When the __add__ method of the left operand does not know how to handle the right operand, it returns the special value NotImplemented instead of a result. Before reporting a TypeError to the user, Python then tries something else: it looks for a __radd__ method in the class of the right operand. In this case it didn't exist, so we got the TypeError, but let's see what happens if we implement it:

In [7]:
class ColorStep5(ColorStep4):
    def __radd__(self, other):
        return self + other

red = '#ff0000'
green = ColorStep5(green=255)

red + green
Out[7]:

And voilà, it worked. :)

What happened here is the following:

  • We tried the operation add with str + color
  • Python called the __add__ method of str, which returned NotImplemented because str doesn't know how to add a color
  • Python then called the __radd__ of color, with the str instance as the other parameter
  • That worked, and Python reported the result to the user

Limitations

This is great: not only can we add our class to and from any other class, but there are many other operations we can overload. Just some random examples:

  • color + color
  • color + whatever
  • whatever + color
  • color - color
  • color * whatever
  • whatever == color
  • color > color
  • ...

The Python documentation has the full list of Python operators.
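As a rough sketch of a few of these, here is a hypothetical one-channel Gray class (not part of the colors library above) that supports subtraction, multiplication from both sides, equality and ordering. functools.total_ordering fills in the remaining comparison methods from __eq__ and __lt__.

```python
from functools import total_ordering


@total_ordering
class Gray:
    """Hypothetical one-channel color, only to illustrate more operators."""

    def __init__(self, level):
        # Clamp to the valid 0-255 range
        self.level = max(0, min(int(level), 255))

    def __sub__(self, other):           # gray - gray
        return Gray(self.level - other.level)

    def __mul__(self, factor):          # gray * number
        return Gray(self.level * factor)

    def __rmul__(self, factor):         # number * gray
        return self * factor

    def __eq__(self, other):            # gray == whatever
        if not isinstance(other, Gray):
            return NotImplemented
        return self.level == other.level

    def __lt__(self, other):            # gray < gray (and, via total_ordering, >, <=, >=)
        if not isinstance(other, Gray):
            return NotImplemented
        return self.level < other.level


dark = Gray(50)
light = Gray(200)

difference = light - dark
doubled = 2 * dark
is_lighter = light > dark
```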

However, there are some operators that can't be overloaded in the same way:

  • color and color
  • color or color
  • not color
  • color in (color1, color2)

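For and, or and not, Python only asks the operands for their truth value (through __bool__), so a class can't change what these operators do. The in operator can be customized, but only by the container (through __contains__), and its result is always converted to True or False. There was a proposal to make the boolean operators overloadable (PEP 335), which was rejected by Guido van Rossum.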

While I don't know what the implications of accepting the proposal would be for the interpreter (in terms of performance, complexity...), I do know what the rejection implies for library authors, and specifically for pandas.
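To see how little control a class has over these keywords, consider this minimal sketch. Flag is a hypothetical class that just counts how many times Python asks for its truth value.

```python
class Flag:
    """Hypothetical toy class that records when Python asks for its truth value."""

    def __init__(self, name, truthy):
        self.name = name
        self.truthy = truthy
        self.bool_calls = 0

    def __bool__(self):
        self.bool_calls += 1
        return self.truthy


yes = Flag('yes', True)
no = Flag('no', False)

# `and` / `or` return one of the operands, chosen by calling __bool__
and_result = yes and no     # yes is truthy, so the result is `no`
or_result = no or yes       # no is falsy, so the result is `yes`

# `not` always produces a plain bool
not_result = not yes
```

The only hook is __bool__, which must return True or False, so there is no way to make these keywords return a new object computed from both operands.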

Operators in pandas

pandas makes heavy use of operator overloading. See these examples:

df['distance_in_miles'] * 1.609344

df['base_price'] + df['base_price'] * df['vat_rate']

df['age'] >= 18

Now consider this other example:

df['airline'] == 'DL' and not df['first_class']

While this looks very readable, there is a problem with it. pandas can't overload the and and not operators, since Python doesn't allow it. So, they are the original operators from the Python interpreter.

The original and and not operators ask each operand for a single boolean value, and then evaluate the condition based on that. So, in this case df['airline'] == 'DL' wouldn't be evaluated to one value per row, but would have to be collapsed to a single True or False. pandas considers that conversion ambiguous and raises a ValueError rather than guessing. Either way, the result is not what a pandas user would expect, and it's inconsistent with the other operators, so this is not the syntax used by pandas.
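The difference between one truth value for the whole object and one value per row can be shown with plain Python lists. The lists are only stand-ins for two boolean columns (an assumption for illustration; a real pandas Series raises an error instead of collapsing silently).

```python
# Stand-ins for two boolean columns, one entry per row
is_delta = [True, False, True]
is_first_class = [False, False, True]

# Any non-empty list is truthy, so `and` simply returns its second operand
collapsed = is_delta and is_first_class

# What a user actually wants is one result per row
per_row = [a and not b for a, b in zip(is_delta, is_first_class)]
```

Here collapsed is just the is_first_class list itself, which says nothing about the combined condition.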

If pandas maintainers can't offer the above syntax, what are the alternatives? There are in my opinion two reasonable approaches.

The first solution is to go back to using methods, like we started with mix. This would look like:

pandas.and(df['airline'] == 'DL',
           pandas.not(df['first_class']))

This is not valid Python syntax, since and and not are reserved keywords in Python, and will result in a syntax error.

The solution recommended by PEP 8 is to append a single trailing underscore, so the final syntax would be:

pandas.and_(df['airline'] == 'DL',
            pandas.not_(df['first_class']))

I think we will all agree that this is less readable than using the and and not operators.

A second solution is to use other operators that we can overload. There are a few that don't have an immediate use for dataframes and that can be considered: in particular, the bitwise operators.

Let's have a look at the original bitwise operators:

In [8]:
binary_value_1 = 0b0010

binary_value_2 = 0b1010

result_and = binary_value_1 & binary_value_2

result_or = binary_value_1 | binary_value_2

result_not = ~ binary_value_1

print(f'binary and: {binary_value_1:04b} & {binary_value_2:04b} = {result_and:04b}')
print(f'binary or: {binary_value_1:04b} | {binary_value_2:04b} = {result_or:04b}')
print(f'binary inverse: ~ {binary_value_1:04b} = {result_not & 0b1111:04b}')
binary and: 0010 & 1010 = 0010
binary or: 0010 | 1010 = 1010
binary inverse: ~ 0010 = 1101

Python provides these operators to operate at the bit level. The & operator is like an and, but instead of operating on the value as a whole, it operates on each individual bit. We can see in the result that there is a 1 in the positions where there is a 1 in both the first value and the second. The | operator works analogously: there is a 1 where there is a 1 in the first value or in the second.

Finally, the inverse flips every bit, turning each 0 into a 1 and the other way round.
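One detail worth spelling out is the & 0b1111 in the last print call above. Python integers are signed and have no fixed width, so ~x is the negative number -x - 1, and a mask is needed to view only the lowest four bits. A small sketch:

```python
value = 0b0010

# Python ints are signed and unbounded: ~x equals -x - 1
inverted = ~value

# Keep only the lowest four bits to see the flipped pattern
masked = inverted & 0b1111

as_text = f'{masked:04b}'
```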

In pandas, initially, there was not much use for those, in the original meaning. So, they could be borrowed as the and, or and not operators for dataframes (or series).
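The same trick can be sketched in a few lines of pure Python. BoolColumn below is a hypothetical stand-in for a boolean series, written only to show the idea of overloading &, | and ~ element by element. It is not how pandas is implemented.

```python
class BoolColumn:
    """Hypothetical stand-in for a boolean series; not the pandas implementation."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):       # column & column
        return BoolColumn(a and b for a, b in zip(self.values, other.values))

    def __or__(self, other):        # column | column
        return BoolColumn(a or b for a, b in zip(self.values, other.values))

    def __invert__(self):           # ~column
        return BoolColumn(not a for a in self.values)

    def __bool__(self):
        # Refuse to collapse to a single truth value, as pandas does
        raise ValueError('The truth value of a BoolColumn is ambiguous')


is_delta = BoolColumn([True, False, True])
is_first_class = BoolColumn([False, False, True])

economy_delta = is_delta & ~is_first_class
either = is_delta | is_first_class
```

Because __bool__ raises, an accidental `is_delta and is_first_class` fails loudly instead of returning a misleading result.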

The result with the previous example would look like:

df['airline'] == 'DL' & ~ df['first_class']

This looks correct, and these are the operators supported by pandas, but this expression is not equivalent to:

df['airline'] == 'DL' and not df['first_class']

The reason is Python operator precedence, which is the order in which operators are evaluated.

See this example:

In [9]:
1 + 2 * 3
Out[9]:
7

If operators were evaluated from left to right, the previous result would be 1 + 2 = 3 and then 3 * 3 = 9. But the 2 * 3 is actually happening first.

Something similar is happening with the previous example.

We would expect this to be evaluated first:

df['airline'] == 'DL'

And once this is computed, the and is performed with the second part of the expression (the condition on not being first class). This is what would actually happen when using the Python and operator. But the bitwise & has a different precedence: it binds more tightly than ==.

So what is actually being executed is:

df['airline'] == ('DL' & (~ df['first_class']))

So, the and is not performed between the two conditions, but between DL and the second condition. This makes pandas conditions very tricky. And it's easy to get unexpected results.
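One way to check how Python groups each expression, without needing any data, is to parse them with the standard ast module. This is a small sketch; the names airline and first_class are placeholders that are parsed but never evaluated.

```python
import ast

# Parse both spellings without evaluating them (the names are never looked up)
bitwise = ast.parse("airline == 'DL' & ~first_class", mode='eval').body
keyword = ast.parse("airline == 'DL' and not first_class", mode='eval').body

# With &, the outermost node is the == comparison, and its right-hand side
# is the whole bitwise expression 'DL' & ~first_class
bitwise_is_comparison = isinstance(bitwise, ast.Compare)
right_side = bitwise.comparators[0]
right_side_is_bitand = (isinstance(right_side, ast.BinOp)
                        and isinstance(right_side.op, ast.BitAnd))

# With `and`, the outermost node is the boolean operation, and its first
# operand is the complete comparison airline == 'DL'
keyword_is_boolop = isinstance(keyword, ast.BoolOp)
first_operand_is_comparison = isinstance(keyword.values[0], ast.Compare)
```

All four checks come out true, which confirms that the two spellings are grouped differently.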

The solution for pandas is to be explicit about the order by using parentheses:

(df['airline'] == 'DL') & (~ df['first_class'])

This will ensure that the operators are evaluated in the expected order.

I understand why pandas was designed this way, and I see value in having a more compact representation of conditions. But this feels quite hacky and counter-intuitive, and I would personally prefer the syntax presented before:

pandas.and_(df['airline'] == 'DL',
            pandas.not_(df['first_class']))