New Python Dataclass

Dataclass - New for Python 3.7 and above

While the Dataclass is not brand new (Python 3.7 was released on June 27, 2018), it is worth noting. The Dataclass is a useful option for handling data.

Per the PEP 557 Abstract:

This PEP describes an addition to the standard library called Data Classes. Although they use a very different mechanism, Data Classes can be thought of as “mutable namedtuples with defaults”. Because Data Classes use normal class definition syntax, you are free to use inheritance, metaclasses, docstrings, user-defined methods, class factories, and other Python class features.

A class decorator is provided which inspects a class definition for variables with type annotations as defined in PEP 526, “Syntax for Variable Annotations”. In this document, such variables are called fields. Using these fields, the decorator adds generated method definitions to the class to support instance initialization, a repr, comparison methods, and optionally other methods as described in the Specification section. Such a class is called a Data Class, but there’s really nothing special about the class: the decorator adds generated methods to the class and returns the same class it was given.

What does this mean?

The initial example in PEP 557 walks through how it works, but the gist is that the @dataclass decorator writes the boilerplate methods of the class for you. The constructor, __init__(), and other special methods such as __repr__() and __eq__() are generated automatically. Ordering methods and __hash__() can also be generated on request through decorator options; by default, a mutable Dataclass that defines __eq__() is unhashable. As a result, data class instances can be instantiated, printed, and compared as soon as the class is defined.
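To make the generated methods concrete, here is a minimal sketch. The Quake class and its two fields are hypothetical and only loosely modeled on the earthquake data used later in this post.

```python
from dataclasses import dataclass


@dataclass
class Quake:  # hypothetical example class, not taken from PEP 557
    mag: float
    place: str = "unknown"  # a field with a default value


a = Quake(1.29, "10km SSW of Idyllwild, CA")
b = Quake(1.29, "10km SSW of Idyllwild, CA")

print(a)           # uses the generated __repr__
print(a == b)      # generated __eq__ compares the fields, so this is True
print(Quake(2.0))  # place falls back to its default
```

No __init__, __repr__, or __eq__ was written by hand; the decorator added all three.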

Where is it not appropriate to use Data Classes?

  • API compatibility with tuples or dicts is required.
  • Type validation beyond that provided by PEPs 484 and 526 is required, or value validation or conversion is required.
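The second point is easy to overlook: the annotations describe the fields, but nothing checks them at runtime. A small sketch, using a hypothetical Reading class, shows what that means in practice.

```python
from dataclasses import dataclass


@dataclass
class Reading:  # hypothetical class used only for this illustration
    mag: float
    place: str


# The annotations are not enforced, so mismatched values are accepted as-is.
r = Reading(mag="not a number", place=42)

print(type(r.mag).__name__)    # str, even though the field is annotated float
print(type(r.place).__name__)  # int, even though the field is annotated str
```

If the values must be validated or converted, that logic has to be written by hand or handled by another tool.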

Why use a Dataclass?

Below is an example that pulls USGS earthquake data from an API, uses a Dataclass to format the data as desired, and then loads it into a Pandas DataFrame. This approach standardizes and cleans up the ingestion of the JSON data, adds a derived field along the way, and hands the result to Pandas.

Code Example

Consider ingesting data from the USGS Earthquake API. (For full details on the USGS Earthquake API - LINK)

The API returns JSON, and the information is a collection of strings, integers, and floats. The full description of the data is available on the USGS website. For this example the variable of interest is time. The time is stored as an integer, for example 1596974857650, which is the number of milliseconds since the Unix epoch (1970-01-01 00:00:00 UTC). For ease of reading, it is desirable to convert it to a date and time format such as YYYY-MM-DD HH:MM:SS. To do this, the Dataclass will be used to add a new variable within the class and automatically convert the time when an instance of the class is created.
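The conversion itself is small. The sketch below is a standalone illustration; the function name to_readable_time is hypothetical, and it formats the result in UTC so the output is the same on every machine, which is an assumption rather than something the API requires.

```python
from datetime import datetime, timezone


def to_readable_time(epoch_ms: int) -> str:
    """Format a millisecond epoch timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    seconds = epoch_ms / 1000  # the API reports milliseconds, datetime wants seconds
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(to_readable_time(1596974857650))  # 2020-08-09 12:07:37
```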

Input Data

The input data looks like this (a single line, shown here through the first feature only):

{"type":"FeatureCollection","metadata":{"generated":1617631452000,"url":"https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&starttime=2014-01-01&endtime=2014-01-02","title":"USGS Earthquakes","status":200,"api":"1.10.3","count":326},"features":[{"type":"Feature","properties":{"mag":1.29,"place":"10km SSW of Idyllwild, CA","time":1388620296020,"updated":1457728844428,"tz":-480,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/ci11408890","detail":"https://earthquake.usgs.gov/fdsnws/event/1/query?eventid=ci11408890&format=geojson","felt":null,"cdi":null,"mmi":null,"alert":null,"status":"reviewed","tsunami":0,"sig":26,"net":"ci","code":"11408890","ids":",ci11408890,","sources":",ci,","types":",cap,focal-mechanism,general-link,geoserve,nearby-cities,origin,phase-data,scitech-link,","nst":39,"dmin":0.06729,"rms":0.09,"gap":51,"magType":"ml","type":"earthquake","title":"M 1.3 - 10km SSW of Idyllwild, CA"},"geometry":{"type":"Point","coordinates":[-116.7776667,33.6633333,11.008]},"id":"ci11408890"},
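A response like the one above could be retrieved along the following lines. This is only a sketch: it uses the standard library's urllib instead of requests so that it has no third-party dependency, and the helper names build_query_url and fetch_earthquakes are hypothetical. The endpoint and query parameters are the ones visible in the "url" entry of the sample metadata.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"


def build_query_url(starttime: str, endtime: str) -> str:
    """Build the GeoJSON query URL for a date range (YYYY-MM-DD strings)."""
    params = {"format": "geojson", "starttime": starttime, "endtime": endtime}
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_earthquakes(starttime: str, endtime: str) -> dict:
    """Download and decode the response (requires network access)."""
    with urlopen(build_query_url(starttime, endtime), timeout=30) as response:
        return json.load(response)


# Only the URL is built here; call fetch_earthquakes() to hit the API.
print(build_query_url("2014-01-01", "2014-01-02"))
```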

Building a Dataclass

Using Python 3 and requests to call the API, the data is retrieved and then ingested into a Dataclass. The Dataclass is as follows.

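import datetime
from dataclasses import dataclass, field

@dataclass
class EarthquakeClassEvent:
    # Annotations document the expected types; they are not enforced at
    # runtime, and JSON null values (felt, cdi, mmi, alert) arrive as None.
    mag: float
    place: str
    time: int  # milliseconds since the Unix epoch
    updated: int
    tz: int
    url: str
    detail: str
    felt: int
    cdi: float
    mmi: float
    alert: str
    status: str
    tsunami: int
    sig: int
    net: str
    code: str
    ids: str
    sources: str
    types: str
    nst: int
    dmin: float
    rms: float
    gap: int
    magType: str
    ttype: str  # the API key is "type"
    title: str
    # Derived field: excluded from __init__ and filled in by __post_init__
    readable_time: str = field(init=False)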

Note the last line, readable_time. It is not a data field from the API; it is a user-defined field, and field(init=False) excludes it from the generated __init__(), so no value is passed for it when an instance is created. To convert the raw time input to a readable time, a __post_init__(self) method is added to the class as follows:

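    # Method of EarthquakeClassEvent (indented inside the class body)
    def __post_init__(self):
        """Convert the raw millisecond timestamp into a readable
        time string and store it in readable_time."""
        ts = int(self.time) / 1000  # milliseconds -> seconds
        # fromtimestamp() without a tz argument returns local time
        self.readable_time = datetime.datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S')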
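To show where the method sits, here is a trimmed-down, runnable sketch. QuakeEvent is a hypothetical stand-in with only two of the API fields; the conversion is the same as in the method above.

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class QuakeEvent:  # hypothetical, reduced stand-in for EarthquakeClassEvent
    place: str
    time: int  # epoch milliseconds from the API
    readable_time: str = field(init=False)  # derived, not an __init__ argument

    def __post_init__(self):
        # Called automatically at the end of the generated __init__.
        ts = self.time / 1000
        self.readable_time = datetime.datetime.fromtimestamp(ts).strftime(
            "%Y-%m-%d %H:%M:%S"
        )


event = QuakeEvent("10km SSW of Idyllwild, CA", 1388620296020)
print(event.readable_time)  # local time, so the exact value depends on the machine
```

Because readable_time is declared with init=False, passing a third argument to QuakeEvent raises a TypeError.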

When data is loaded into the class the function above (part of the Dataclass itself) will convert time to the desired format and store it in readable_time. The process of appending each earthquake event as a class object to a list is shown below:

def parse_data_to_dataclass(data):
    """Build a list of EarthquakeClassEvent objects from the API response."""
    earthquake_list = []
    for feature in data['features']:
        props = feature['properties']
        # Arguments are positional, so they must follow the field order of the class
        earthquake_list.append(EarthquakeClassEvent(props['mag'],
            props['place'], props['time'], props['updated'], props['tz'],
            props['url'], props['detail'], props['felt'], props['cdi'],
            props['mmi'], props['alert'], props['status'], props['tsunami'],
            props['sig'], props['net'], props['code'], props['ids'], props['sources'],
            props['types'], props['nst'], props['dmin'], props['rms'], props['gap'],
            props['magType'], props['type'], props['title']))
    return earthquake_list

The advantage of this approach is that the ingestion of the data and the operations on the data are all in one place. This helps keep the process clean and easier to use in different applications.
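A long positional call works, but it is easy to get the order wrong. One possible alternative, sketched here with a hypothetical QuakeSummary class and from_properties helper, is to build keyword arguments from the Dataclass fields themselves. It assumes every field is an __init__ argument, so a field declared with init=False would need to be skipped.

```python
from dataclasses import dataclass, fields


@dataclass
class QuakeSummary:  # hypothetical subset of the fields used in this post
    mag: float
    place: str
    time: int
    ttype: str  # the API key is "type"


# Field names that differ from the API keys they are read from.
RENAMED_KEYS = {"ttype": "type"}


def from_properties(props: dict) -> QuakeSummary:
    """Pick out the keys the Dataclass declares and ignore everything else."""
    kwargs = {
        f.name: props.get(RENAMED_KEYS.get(f.name, f.name))
        for f in fields(QuakeSummary)
    }
    return QuakeSummary(**kwargs)


sample = {
    "mag": 1.29,
    "place": "10km SSW of Idyllwild, CA",
    "time": 1388620296020,
    "type": "earthquake",
    "tsunami": 0,  # not declared on QuakeSummary, so it is dropped
}
event = from_properties(sample)
print(event)
```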

Another useful feature is that a list of Dataclass instances can be used to load a Pandas DataFrame. Below, the list of EarthquakeClassEvent objects is converted to a Pandas DataFrame.

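import pandas as pd

# load the list of dataclass instances into a pandas DataFrame
# (earthquakes is assumed to be the list returned by parse_data_to_dataclass)
df = pd.DataFrame(earthquakes)
print(df.head(5))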

The list of Dataclass instances is loaded into the Pandas DataFrame and the first five rows are displayed.

   mag                              place           time        updated  tz  ... gap magType       ttype                                      title        readable_time
0  6.4      26 km SW of Pocito, Argentina  1611024382380  1616879123040 NaN  ...  22     mww  earthquake      M 6.4 - 26 km SW of Pocito, Argentina  2021-01-18 21:46:22
1  5.5  52 km NE of Bandar-e Lengeh, Iran  1610746264660  1616879081040 NaN  ...  35     mww  earthquake  M 5.5 - 52 km NE of Bandar-e Lengeh, Iran  2021-01-15 16:31:04
2  5.5        7 km WNW of Sivrice, Turkey  1609051052897  1615072766040 NaN  ...  26     mww  earthquake        M 5.5 - 7 km WNW of Sivrice, Turkey  2020-12-27 01:37:32
3  7.6     99 km SE of Sand Point, Alaska  1603140878950  1615823316759 NaN  ...  36     mww  earthquake     M 7.6 - 99 km SE of Sand Point, Alaska  2020-10-19 16:54:38
4  6.6  13 km E of San Pedro, Philippines  1597709028566  1603570140040 NaN  ...  14     mww  earthquake  M 6.6 - 13 km E of San Pedro, Philippines  2020-08-17 20:03:48

[5 rows x 27 columns]

Now the data is converted as desired and held in a Pandas DataFrame. The readable_time column is the last one on the right. The full code is in GitHub LINK.
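The columns come straight from the Dataclass fields. If the installed Pandas version does not accept Dataclass instances directly, or if an explicit intermediate step is preferred, dataclasses.asdict() turns each instance into a plain dictionary first. The sketch below uses a hypothetical two-field QuakeRow class and stops at the list of dictionaries so that it runs without Pandas installed.

```python
from dataclasses import dataclass, asdict


@dataclass
class QuakeRow:  # hypothetical two-field class for illustration
    mag: float
    place: str


rows = [
    QuakeRow(6.4, "26 km SW of Pocito, Argentina"),
    QuakeRow(5.5, "7 km WNW of Sivrice, Turkey"),
]

# One dictionary per instance, keyed by field name in declaration order.
records = [asdict(row) for row in rows]
print(records[0])

# With Pandas available, pd.DataFrame(records) builds one column per key.
```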

Conclusions

The “New” Python Dataclass is a useful tool that standardizes how data is ingested, transformed, and handled. It keeps the data handling in one place, and a list of Dataclass instances can easily be converted to a Pandas DataFrame. This is just a brief and simple example, but there are many examples and resources that go into much greater depth.

References