Subclassing a (Pandas) DataFrame
Introduction
If you are a Python programmer using the Pandas library as one of the core libraries in the products you create, then this post should interest you. I hope to make a case for subclassing a Pandas DataFrame for certain use cases that are very common in projects that use DataFrames as the primary data structure for passing around tabular data.
In a serious application, a typical pattern is to wrap the following snippet of code in some domain-specific class or function.
:::python
import pandas as pd

def load_data(students_data_file_path):
    # some code here...
    df = pd.read_csv(students_data_file_path)  # read a file with numerous columns
    return df
The dataframe returned by such a function could potentially have a large number of columns. Usually, this dataframe gets passed around to various other functions. Say we have a function that finds the mean of scores for a given list of students.
:::python
def calculate_mean(students_df):
    # some code
    return students_df['scores'].mean()  # stringy 'scores' term
Notice in the above code that the term scores is stringy. Also, we are using this stringy value to access the column in the dataframe. That is, we end up using this stringy value as an API into the dataframe we loaded. Now, imagine such stringy columns being used and reused everywhere across all the functions this dataframe gets passed to.
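To see the fragility concretely, here is a small illustration (the data is made up for the demo): a one-character typo in a stringy column name surfaces only as a runtime KeyError, possibly deep inside a distant function.

```python
import pandas as pd

# made-up data, mirroring the students file described above
df = pd.DataFrame({'student_name': ['Alice', 'Bob'], 'scores': [60, 50]})

try:
    df['score'].mean()  # typo: 'score' instead of 'scores'
except KeyError:
    print('typo caught only at runtime')

print(df['scores'].mean())  # 55.0
```

No linter or type checker can flag the `'score'` typo before the program runs; the string is opaque to tooling.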
Hope you see the problem here. We are using stringy values as an API. One way to fix this and make the reference more symbolic would be to create a related class that represents the columns available in the dataframe.
For example, we could have
:::python
# student_data_def.py
class StudentsData:
    # each of the left-hand names represents a
    # column name that is available in the loaded file
    SCORES = 'scores'
    STUDENT_NAME = 'student_name'
And therefore, our calculate_mean function now looks like this:

:::python
from student_data_def import StudentsData

def calculate_mean(students_df):
    # some code
    return students_df[StudentsData.SCORES].mean()
Now, this is better. We converted our stringy API into symbolic values with better discoverability. But the problem here is that the students_df and StudentsData types are decoupled. Wherever the dataframe is passed around, we will have to import the StudentsData class in the modules that work on this dataframe. Ideally, these two objects should reside together. We need more coupling between these two objects.
Now, if you are one of the fortunate folks using Python 3.5 and above, you should be excited about Type Hints. Type Hints, if embraced carefully, can help our editors assist us as we type out our code. Linting tools like PyLint and type checkers like MyPy can also help us catch certain kinds of coding errors even before we run our code. In the scenario we are dealing with, imagine that we wanted to annotate our calculate_mean function with its type.
Let us try that:
:::python
def calculate_mean(students_df: pd.DataFrame) -> float:
    # some code
    return students_df[StudentsData.SCORES].mean()
Huh! pd.DataFrame. This type annotation does not provide us with any more information than the name of the variable does. There is no way for one to know what columns are available in this DataFrame. All introspection capabilities are lost from such a rich data structure like the DataFrame, which may store tens of fields. A mysterious object keeps getting passed around, with no way to understand its type without running the program or looking at all the places the dataframe could have been initialized.
Summarizing our frustrations
Stringy API to access columns in the DataFrame.
If we try to get rid of the stringy API, we end up with decoupled objects (the dataframe and the class declaring the columns).
Inability to provide useful type hints.
One Proposed Solution
Subclassing DataFrames
Pandas provides a way to subclass a DataFrame using the _constructor property. Here is some nice stuff we can do with this construct.
:::python
import pandas as pd

class StudentsDF(pd.DataFrame):
    SCORES = 'scores'
    STUDENT_NAME = 'name'

    @property
    def _constructor(self):
        return StudentsDF

x = StudentsDF(data=dict(name=['Alice', 'Bob'], scores=[60, 50]), index=[100, 200])
type(x)  # __main__.StudentsDF
Now, we see that type(x) is StudentsDF. Therefore, we can pass this object around to our calculate_mean function.
:::python
def calculate_mean(df: StudentsDF):
    # some code
    return df[df.SCORES].mean()  # `df` gives access to both column labels and values
Notice that we just solved the three frustrations we had summarized above.

We pass around a custom class whose type describes more about the object.
Both the data and the column definitions (formerly, the stringy API) are coupled together tightly.
Type annotations make more sense now. We do not have to use the generic pd.DataFrame any more.
Note that, using this construct, it is also possible to create a StudentsDF instance while loading a dataframe from a file.

:::python
df = pd.read_csv('/tmp/t.csv')
students_df = StudentsDF(df)  # just pass the original df into the StudentsDF constructor
print(students_df.columns)  # Index(['name', 'scores'], dtype='object')
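One more benefit of the _constructor property is worth spelling out: pandas routes operations that produce a new dataframe through _constructor, so filtering or slicing a StudentsDF yields another StudentsDF rather than a plain pd.DataFrame. A quick check (the data here is made up):

```python
import pandas as pd

class StudentsDF(pd.DataFrame):
    SCORES = 'scores'
    STUDENT_NAME = 'name'

    @property
    def _constructor(self):
        return StudentsDF

x = StudentsDF(data=dict(name=['Alice', 'Bob'], scores=[60, 50]))

# boolean filtering builds the result via _constructor,
# so the subclass (and its column constants) survive the operation
passed = x[x[x.SCORES] > 55]
print(type(passed).__name__)    # StudentsDF
print(type(x.head(1)).__name__)  # StudentsDF
```

This means helper functions downstream keep receiving the richer type without any extra wrapping on our part.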
Automating this class generation
Using a simple script like the one shown below, we can automate the creation of some of this boilerplate code. The following code creates and prints out a class definition based on the columns read from the file. This kind of script can be used as a one-time setup tool to create a bunch of DataFrame subclasses.
:::python
import pandas as pd

class DFClassGenerator:
    CLASS_HEADER = 'class {class_name}(pd.DataFrame):'
    COLUMNS = '    {var} = "{label}"'  # we cheat and encode 4 spaces here, for demo
    CONSTRUCTOR = ("    @property\n"
                   "    def _constructor(self):\n"
                   "        return {class_name}")

    @classmethod
    def generate_class(cls, df, class_name):
        cols = [cls.COLUMNS.format(var=c.upper(), label=c)
                for c in df.columns]  # works for a single-level column index
        lines = [cls.CLASS_HEADER.format(class_name=class_name)]
        constructor = cls.CONSTRUCTOR.format(class_name=class_name)
        source_code = '\n'.join(lines + cols) + '\n\n' + constructor
        print(source_code)
        return source_code
We can invoke this function every time we want to generate a new DF type. For example, to generate our StudentsDF class we do the following.

:::python
df = pd.read_csv(student_file)
source_code = DFClassGenerator.generate_class(df, 'StudentsDF')

which prints:

:::python
class StudentsDF(pd.DataFrame):
    SCORES = 'scores'
    NAME = 'name'

    @property
    def _constructor(self):
        return StudentsDF
Then, we have the class we need.
Of course, a lot more enhancements can be built into this generator class. Some of the enhancements that can be added are as follows:

Ability to create symbolic names for index columns.
Handle multi-level (hierarchical) columns.
Provide default converter functions if a type-conversion dictionary is provided.
Build a mapping of fields to types, and automatically create data type converter functions.
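As a rough illustration of the converter idea (the mapping and helper name here are hypothetical, not part of the generator above), one could accept a type-conversion dictionary and coerce columns after loading:

```python
import pandas as pd

# hypothetical field-to-type mapping supplied by the user
TYPE_MAP = {'scores': float, 'name': str}

def apply_converters(df, type_map):
    """Coerce each mapped column that exists in df to its declared type."""
    for column, dtype in type_map.items():
        if column in df.columns:
            df[column] = df[column].astype(dtype)
    return df

# e.g. scores read from a CSV as strings become floats
df = apply_converters(
    pd.DataFrame({'name': ['Alice', 'Bob'], 'scores': ['60', '50']}),
    TYPE_MAP,
)
print(df['scores'].dtype)  # float64
```

A generator enhancement could emit such a mapping alongside the class body, keeping column names and their types in one place.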
Summary
There are a lot of scenarios in production code where it helps to annotate and pass around a rich data structure like a DataFrame with symbolic column names bound to it. Providing more meaningful type hints also adds needed documentation. Subclassing the DataFrame is one approach that helps us accomplish those goals.