Advertisement

Your Ad could be here. I want to connect my readers to relavant ads. If you have a product targeted at developers, let's talk. [email protected]

Kedro provides an efficient way to build out data catalogs with their yaml api. It allows you to be very declaritive about loading and saving your data. For the most part you just need to tell Kedro what connector to use and its filepath. When running Kedro takes care of all of the read/write, you just reference the catalog key.

But what is happening behind the scenes

Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind the scenes structure for us so that we dont have to. For the most part kedro has connectors for about anything that you want to load, csv, parquet, sql, json, from about anywhere, http, s3, localfile system are just some of the examples.

Here is a DataSet implementation from their docs. Here you can see the barebones example straight from the docs. Parameters from the yaml catalog will get passed in


from pathlib import Path

import pandas as pd

from kedro.io import AbstractDataSet


class MyOwnDataSet(AbstractDataSet):
    def __init__(self, param1, param2, filepath, version):
        super().__init__(Path(filepath), version)
        self._param1 = param1
        self._param2 = param2

    def _load(self) -> pd.DataFrame:
        load_path = self._get_load_path()
        return pd.read_csv(load_path)

    def _save(self, df: pd.DataFrame) -> None:
        save_path = self._get_save_path()
        df.to_csv(save_path)

 	def _exists(self) -> bool:
        path = self._get_load_path()
        return path.is_file()

    def _describe(self):
        return dict(version=self._version, param1=self._param1, param2=self._param2)