Dataframe summit @ EuroSciPy write-up

Last week, EuroSciPy 2019 took place in Bilbao, Spain. This year we introduced the maintainers track, a room dedicated to discussions among maintainers. The idea is similar to the birds of a feather or unconference sessions at other conferences, but focused on open source maintainers and contributors. And we scheduled most of the sessions in advance, to attract interested people to join the conference. We also had a maintainers plenary session, in which 26 maintainers of popular open source scientific projects participated (my guess is that around 50 maintainers attended the conference).

Dataframe summit session

One of the sessions was a two-hour discussion on Python dataframes. Sixteen people attended it, around half of them maintainers of open source dataframe libraries. There were also pandas users and contributors, maintainers of other projects (PyPy, pytest), and people interested in getting involved. The developer of a proprietary dataframe library in Python also joined, and was able to add value to the discussion.

These were the libraries represented:

  • pandas: Flexible and powerful data analysis / manipulation library for Python
  • Dask: Parallel computing with task scheduling
  • Vaex: Out-of-Core DataFrames for Python
  • Modin: A dataframe framework that scales the pandas API with Ray and Dask
  • xframe: DataFrame library in C++

We started with personal introductions, project introductions, and what people wanted to get out of the session (many people had already proposed topics before the event, and we defined an agenda with those).

Document the ecosystem

One of the first topics discussed was how to let users know which dataframe tool is best for their job, and how the existing packages differ. The general consensus was that the pandas ecosystem page is the best place for this. There are already plans to improve that page (as well as plans and work in progress to improve the look and feel of the pandas website and documentation).

Apache Arrow

Another topic discussed early on was Apache Arrow. Arrow's mission is to provide a common memory representation for all dataframe libraries. With a shared representation, libraries don't need to reinvent the wheel, and transferring data among packages (e.g. from pandas to R) can be done without transformations, or even without copying the memory.
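
To make this concrete, here is a minimal sketch of what the shared representation enables, assuming the pyarrow package is installed (the calls below are from the pyarrow API):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Convert the pandas DataFrame to an Arrow table...
table = pa.Table.from_pandas(df)

# ...and back. Any library that understands the Arrow format
# (R, Vaex, Spark, etc.) can consume `table` without re-encoding it.
roundtrip = table.to_pandas()
```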

Vaex is already using Arrow, and pandas has plans in its roadmap to move in that direction. People were in general happy with the idea, but there were some concerns about decisions made in Arrow (mainly raised by Sylvain, from xframe):

  • The Apache Arrow C++ API and implementation not following common C++ idioms
  • Using a monorepo (including all bindings in the same repo as Arrow)
  • No clear distinction between the specification and the implementation (as there is in, for instance, Project Jupyter)

Not strictly related to Arrow, it was also mentioned that it would be useful to have dataframes for streaming data. A library named Perspective exists, which implements something similar and has Python bindings.

Interoperability

The next topic was interoperability: how dataframe libraries can interact with each other and with the rest of the ecosystem. Examples include:

  • Using the same plotting backends from different dataframe libraries (see the sketch after this list)
  • Passing to scikit-learn pandas-like dataframe objects
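
As an illustration of the first point, a hedged sketch: since pandas 0.25 the plotting backend is pluggable, so other libraries can reuse the same plotting machinery. The "hvplot" backend named here is just an example and must be installed separately:

```python
import pandas as pd

# Swap the default "matplotlib" backend for another registered one.
pd.set_option("plotting.backend", "hvplot")

df = pd.DataFrame({"x": range(10), "y": range(10)})
df.plot(x="x", y="y")  # dispatched to the selected backend
```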

There was consensus that defining a standard (and minimal) dataframe API would help. Dataframe libraries could extend this smaller API and offer users a much bigger one (like pandas does). But having a common subset of operations and methods would be very useful for third-party libraries expecting dataframe objects.
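
As a purely hypothetical illustration of the idea (none of these names come from an actual specification), such a minimal API could be expressed as a Python protocol that consumers code against:

```python
from typing import Any, Protocol, Sequence


class MinimalDataFrame(Protocol):
    """A hypothetical small surface that third-party libraries rely on."""

    @property
    def columns(self) -> Sequence[str]: ...

    def __getitem__(self, column: str) -> Any: ...

    def to_numpy(self) -> Any: ...


def prepare_features(df: MinimalDataFrame, features: Sequence[str]) -> Any:
    # A consumer (e.g. a machine learning library) only needs the minimal
    # protocol, regardless of whether `df` is pandas, Modin, Vaex, etc.
    return [df[name] for name in features]
```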

Devin from Modin is doing research at UC Berkeley on defining this API. Modin is already implemented with this design, and while it's one of the less mature participating projects (in Devin's words), its user-facing layer could potentially be reused by other projects reimplementing dataframes with a different backend. Devin has shared the design and the corresponding API in the Modin documentation.

It was noted that it could be useful to have a common test suite, if a standard dataframe API is defined. There was agreement that the pandas test suite is not appropriate for other packages.

NumPy did something similar with NEP 18, which can be used as a reference.
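
For context, NEP 18 defines the __array_function__ protocol, which lets NumPy functions dispatch to array-like objects. A minimal sketch (the MyArray class is invented for illustration; dispatch is enabled by default in NumPy >= 1.17):

```python
import numpy as np


class MyArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Intercept NumPy API calls made on MyArray instances.
        if func is np.concatenate:
            arrays = [a.data if isinstance(a, MyArray) else a for a in args[0]]
            return MyArray(np.concatenate(arrays, **kwargs))
        return NotImplemented


a, b = MyArray([1, 2]), MyArray([3, 4])
result = np.concatenate([a, b])  # handled by MyArray.__array_function__
```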

Public API improvements

At the end of the session, we discussed possible improvements to the public pandas API. Since several of the participants have reimplemented the pandas API, it was a good opportunity to see where they found inconsistencies, or where the API was making their lives difficult when using other approaches.

Indexing was the part of pandas that other maintainers were least happy about. The way .loc behaves was one of the comments; being forced to have a default index, and not being able to index by other columns, were others.
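
As a small illustration (my own, not from the session) of one .loc behavior that often surprises people, label slices include both endpoints, unlike positional slices:

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30, 40]}, index=[0, 1, 2, 3])

print(df.loc[0:2])   # 3 rows: labels 0, 1 and 2 (endpoint included)
print(df.iloc[0:2])  # 2 rows: positions 0 and 1 (endpoint excluded)
```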

Next steps

A couple of things were discussed to keep these conversations active, and to keep coordinating on shaping the dataframes of the future.

The first was to start a working group, or a mailing list (or a Discourse forum). The pandas-dev list wasn't used by the participants (except the pandas maintainers), and it didn't seem to be the appropriate place.

Another idea was to organize a bigger dataframe summit in the future. It was proposed to host it somewhere in the Caribbean (ok, it was me who proposed that, and everybody else laughed, but here I leave it again). :)

Follow me for more content or contact for work opportunities:
Twitter / LinkedIn