Kedro Catalog

Waylon Walker

Learning in Public @_WaylonWalker

I am exploring a kedro catalog metadata hook. These are some notes on what I am thinking.

Process #

metadata will be attached to the dataset object under a .metadata attribute
metadata will be updated in the after_node_run hook
metadata will be empty until a pipeline is run with the hook enabled
optionally, a function to add metadata can be supplied
metadata will be stored in a file next to the dataset's filepath
meta
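
A minimal sketch of the process above, under a few assumptions. The class name `MetadataHook`, the `describe` argument, the `write_sidecar` helper, and the `.meta.json` suffix are all hypothetical names chosen for illustration. The sketch keeps the metadata on the hook itself rather than on the dataset object, so that it does not rely on catalog internals. The `try/except` around the import only exists so the sketch runs without kedro installed.

```python
import json
from pathlib import Path

try:
    from kedro.framework.hooks import hook_impl
except ImportError:  # fallback so the sketch still runs without kedro
    def hook_impl(func):
        return func


class MetadataHook:
    """Collect metadata for every dataset a node produces (hypothetical sketch)."""

    def __init__(self, describe=None):
        # describe is the optional, user-supplied function that builds metadata
        self.describe = describe or (lambda data: {})
        # stays empty until a pipeline is run with the hook enabled
        self.metadata = {}

    @hook_impl
    def after_node_run(self, node, outputs):
        # outputs maps dataset names to the data the node just returned
        for name, data in outputs.items():
            meta = dict(self.describe(data) or {})
            meta["created_by"] = node.name
            self.metadata[name] = meta


def write_sidecar(filepath, meta):
    """Write metadata next to the data file, e.g. sales.csv -> sales.csv.meta.json."""
    sidecar = Path(str(filepath) + ".meta.json")
    # default=str keeps non-JSON values such as datetimes from raising
    sidecar.write_text(json.dumps(meta, indent=2, default=str))
    return sidecar
```

In recent kedro versions a hook like this would typically be registered through the `HOOKS` tuple in the project's `settings.py`; check the docs for the version in use.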

Problems This Hook Should Solve #

which datasets have a column with sales in the name
which datasets were updated after last Tuesday
which pipeline node created this dataset
how many rows are in this dataset (without reloading all datasets)

Implementation Details #

metadata will be attached to each dataset as a dictionary
list/dict comprehensions can be used to make queries
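
To make the comprehension idea concrete, here is a small sketch that answers the four questions from the previous section. The `catalog_metadata` dictionary, the dataset names, and all values are made up for illustration, and the `updated_at` key is an assumption: it is not in the capture list, but something like it would be needed to answer the "updated after" question.

```python
from datetime import datetime

# Hypothetical metadata keyed by dataset name; model_input has not been run yet
catalog_metadata = {
    "raw_sales": {
        "columns": ["date", "store_id", "sales_usd"],
        "length": 1200,
        "created_by": "load_raw_sales",
        "updated_at": datetime(2021, 3, 1),
    },
    "stores": {
        "columns": ["store_id", "region"],
        "length": 40,
        "created_by": "load_stores",
        "updated_at": datetime(2021, 3, 10),
    },
    "weekly_sales": {
        "columns": ["week", "region", "total_sales"],
        "length": 104,
        "created_by": "aggregate_sales",
        "updated_at": datetime(2021, 3, 11),
    },
    "model_input": {},
}

# which datasets have a column with sales in the name
with_sales_column = [
    name
    for name, meta in catalog_metadata.items()
    if any("sales" in col for col in meta.get("columns", []))
]

# which datasets were updated after a cutoff date
cutoff = datetime(2021, 3, 9)
updated_since = [
    name
    for name, meta in catalog_metadata.items()
    if meta.get("updated_at") is not None and meta["updated_at"] > cutoff
]

# which pipeline node created each dataset (None if unknown)
created_by = {name: meta.get("created_by") for name, meta in catalog_metadata.items()}

# how many rows are in each dataset, without loading any data
row_counts = {
    name: meta["length"] for name, meta in catalog_metadata.items() if "length" in meta
}
```

Using `.get` throughout means datasets whose metadata is still empty are simply skipped instead of raising a `KeyError`.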

Metadata to Capture #

try pandas method -> try spark -> try dict/list -> none

column names
length
null count
created_by node name
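
One way the pandas -> spark -> dict/list -> none fallback could look, as a sketch rather than a finished design. The function names are hypothetical. It uses duck typing instead of importing pandas or pyspark, treats a plain dict as a single record, and skips the null count for spark because that would need an extra aggregation job; all three are assumptions.

```python
def _from_pandas(data):
    # pandas DataFrame: len() is the row count, isnull() marks missing cells
    return {
        "columns": [str(c) for c in data.columns],
        "length": len(data),
        "null_count": int(data.isnull().sum().sum()),
    }


def _from_spark(data):
    # spark DataFrame: has .columns and .count(), but no len()
    return {
        "columns": [str(c) for c in data.columns],
        "length": data.count(),
        "null_count": None,  # skipped in this sketch
    }


def _from_builtin(data):
    # a dict is treated as one record, a list must contain only dicts
    if isinstance(data, dict):
        rows = [data]
    elif isinstance(data, (list, tuple)) and all(isinstance(r, dict) for r in data):
        rows = list(data)
    else:
        raise TypeError("not a dict or a list of dicts")
    return {
        "columns": sorted({str(key) for row in rows for key in row}),
        "length": len(rows),
        "null_count": sum(value is None for row in rows for value in row.values()),
    }


def describe(data, created_by=None):
    """Try each extractor in order and return None if none of them fits."""
    for extract in (_from_pandas, _from_spark, _from_builtin):
        try:
            meta = extract(data)
        except (AttributeError, TypeError):
            continue  # wrong kind of object, try the next extractor
        meta["created_by"] = created_by
        return meta
    return None
```

For example, `describe([{"sales": 1, "region": None}], created_by="make_sales")` falls through the pandas and spark extractors and is handled by the dict/list one.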

Database? #

Is there an easy way to create an in-memory NoSQL database from a list of dictionaries?
