New pandas workflow

Follow me for more content or contact for work opportunities:
Twitter / LinkedIn

Some exciting news. After some years of organizing sprints and maintaining open source, I've been thinking about a more efficient workflow for projects with a high volume of activity, like pandas.

An exaggerated example would be that I want to create 1,600 issues in pandas, one for each docstring of the project, listing the flaws that we are able to detect automatically. As a side note, most of our validations to detect incorrect things in docstrings, based on the numpydoc standard, are now available in numpydoc (in master). You can check the documentation to see how to use them, and the source code for the list of errors we validate.

Back to the example: the GitHub API and our validation scripts would make it very easy to create those 1,600 GitHub issues. We could create a label Docstring errors to identify them, and ask the community for help fixing them. The community responded extremely well in the past when we asked for help: 500 people joined our worldwide documentation sprint. So, things seem feasible so far.
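To make the idea concrete, here is a minimal sketch of how the issue payloads could be built. It is not the actual pandas tooling: the function name, the title format and the input shape (a dict mapping each docstring name to a list of `(code, message)` pairs) are assumptions for illustration. Each resulting dict has the `title`, `body` and `labels` fields that the GitHub REST API accepts when creating an issue.

```python
def build_issue_payloads(docstring_errors, label="Docstring errors"):
    """Build one issue payload per docstring that has detected errors.

    `docstring_errors` is a hypothetical mapping of docstring name to a
    list of (error_code, message) pairs, e.g. produced by a validation script.
    """
    payloads = []
    for name, errors in sorted(docstring_errors.items()):
        if not errors:
            # Nothing to fix for this docstring, so no issue is needed
            continue
        error_lines = [f"- {code}: {message}" for code, message in errors]
        payloads.append(
            {
                "title": f"DOC: fix docstring of {name}",
                "body": "Errors detected in the docstring:\n" + "\n".join(error_lines),
                "labels": [label],
            }
        )
    return payloads


# Illustrative sample data, not real validation output
sample = {
    "pandas.DataFrame.head": [("SS03", "Summary does not end with a period")],
    "pandas.Series.tail": [],
}
for payload in build_issue_payloads(sample):
    print(payload["title"])
```

Each payload could then be sent with an authenticated POST to the repository's issues endpoint. That step is left out here, since it needs a token and would create real issues.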

There are just two main problems to make all this work:

First, there is a small number of maintainers who would have to review, give feedback, and merge the contributions. 1,600 pull requests is surely too much for a small group of volunteers. We are in a much better position now than when 500 people contributed in a single day (it took months to deal with all the pull requests from the sprints). We are around 12 active maintainers, compared to 4 at that time. And we've been making our workflow more efficient, with the CI providing better feedback every time: more accurate, and presented in a better way, so first-time contributors can detect problems in their work without much intervention from maintainers. GitHub Actions will be key in making our workflow more efficient for code reviews (things like contributors receiving automated emails when the CI detects something that needs to be fixed in their work).

Second, how could people know which of the 1,600 issues are available, and which are already being worked on by someone else? For small projects, GitHub has an option Assignees where members of a scrum team can assign to themselves what they are working on. But this is not possible for a project the size of pandas, since only members of the organization are able to self-assign issues. And even if we wanted to add every possible contributor to the pandas GitHub organization, that would be a huge amount of work for maintainers.

The best solution should come from GitHub: an option so project admins can decide whether they want to allow any GitHub user to self-assign issues in their projects. I've been discussing this with people at GitHub, and it is something that may be added, but not immediately.

The good news is that, with the help of GitHub Actions, it is now possible to achieve the same in a slightly trickier way. We just added to pandas an action to self-assign issues. It works by just writing a comment with the keyword take on an issue. A few seconds later, the action will assign the issue to the commenter. This is possible because a few months ago GitHub added a feature that allows assigning issues to people who have commented on them. It is still not possible, even for maintainers, to assign an issue to an arbitrary user.
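The logic of such an action is small. The following sketch is not the pandas action itself, just an illustration of the decision it has to make: given a new comment, either do nothing or prepare a request to the GitHub REST API endpoint that adds assignees to an issue. The function name and the returned dict shape are made up for the example, and ignoring surrounding whitespace in the comment is an assumption.

```python
def assignment_request(repo, issue_number, commenter, comment_body, keyword="take"):
    """Return the API request that assigns the issue to the commenter.

    Returns None when the comment is not the keyword. `repo` is expected
    in the form "owner/name". The returned dict is a hypothetical shape
    that an HTTP client could consume.
    """
    # Assumption: surrounding whitespace is ignored, anything else must match
    if comment_body.strip() != keyword:
        return None
    return {
        "method": "POST",
        "url": f"https://api.github.com/repos/{repo}/issues/{issue_number}/assignees",
        "json": {"assignees": [commenter]},
    }


# Illustrative values only: the issue number and user are placeholders
print(assignment_request("pandas-dev/pandas", 1, "some-user", "take"))
print(assignment_request("pandas-dev/pandas", 1, "some-user", "can I work on this?"))
```

Matching only the exact keyword keeps the behaviour unambiguous: a question like "can I work on this?" does not assign anything.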

With this simple but powerful change, a much more efficient workflow should now be possible. The workflow could consist of:

  • People interested in contributing to pandas start by setting up the environment and learning how to make an open source contribution
  • Then they check the list of unassigned good first issues
  • Once they find one that they want to work on, they write a comment with the keyword take on it
  • The issue will disappear from the list of unassigned issues, so other people won't waste time checking whether it's available or not
  • If the person can't move forward in the end (they got busy, they are not interested anymore...) they can simply unassign themselves from the issue, and it will be in the list again
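The second step above, finding unassigned good first issues, can be sketched as a simple filter. The example assumes issues shaped like the JSON returned by the GitHub REST API, where `assignees` is a list of users and `labels` is a list of objects with a `name` field. The function name and the sample data are made up for illustration.

```python
def unassigned_with_label(issues, label="good first issue"):
    """Keep only the issues that nobody is assigned to and that carry `label`."""
    available = []
    for issue in issues:
        if issue.get("assignees"):
            # Someone already took this issue
            continue
        label_names = [item["name"] for item in issue.get("labels", [])]
        if label in label_names:
            available.append(issue)
    return available


# Illustrative sample data, not real pandas issues
issues = [
    {"number": 1, "assignees": [], "labels": [{"name": "good first issue"}]},
    {"number": 2, "assignees": [{"login": "some-user"}], "labels": [{"name": "good first issue"}]},
    {"number": 3, "assignees": [], "labels": [{"name": "Bug"}]},
]
print([issue["number"] for issue in unassigned_with_label(issues)])
```

In the GitHub web interface, the same list can be obtained with the `no:assignee` search qualifier combined with a label filter.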

This new workflow scales to the 1,600 issues or more. Before, potential contributors had a single list with all issues, assigned and not assigned. They had to check each one individually for comments claiming the issue, deal with ambiguity (do messages like "can I work on this?" mean you're working on the issue?), and possibly have some discussion, before they could know if someone else was working on the issue.

One obvious problem is people self-assigning an issue, discontinuing work on it, but not unassigning themselves. We will see how this works, but even in the worst case, unassigned issues will still be easy to find if they exist. For the assigned ones, people can check them and know immediately who to ask whether work is still going on or whether progress was made, and whether the original assignee is happy to hand over the issue to the newly interested contributor.

Implementing a bot that unassigns issues automatically after N days of inactivity could also be an option.
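As a rough sketch of what such a bot would need to decide, the function below picks the assigned issues that have been inactive for more than N days. It is only an illustration under some assumptions: issues are dicts shaped like the GitHub REST API JSON, the `updated_at` timestamp is used as a proxy for activity (it changes on any update, not only on work by the assignee), and the function name and the default of 30 days are made up.

```python
from datetime import datetime, timedelta, timezone


def stale_assigned_issues(issues, now, max_inactive_days=30):
    """Return the numbers of assigned issues with no update in the last N days.

    `now` must be a timezone-aware datetime.
    """
    cutoff = now - timedelta(days=max_inactive_days)
    stale = []
    for issue in issues:
        if not issue.get("assignees"):
            # Unassigned issues are already available to anyone
            continue
        # GitHub timestamps look like 2019-12-01T00:00:00Z
        updated = datetime.fromisoformat(issue["updated_at"].replace("Z", "+00:00"))
        if updated < cutoff:
            stale.append(issue["number"])
    return stale


# Illustrative sample data, not real pandas issues
issues = [
    {"number": 1, "assignees": [{"login": "a"}], "updated_at": "2019-12-01T00:00:00Z"},
    {"number": 2, "assignees": [{"login": "b"}], "updated_at": "2020-01-30T00:00:00Z"},
    {"number": 3, "assignees": [], "updated_at": "2019-01-01T00:00:00Z"},
]
now = datetime(2020, 1, 31, tzinfo=timezone.utc)
print(stale_assigned_issues(issues, now))
```

The bot would then remove the assignees of each stale issue through the GitHub API, ideally after leaving a comment so the assignee has a chance to respond.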
