I am exploring a kedro catalog meta data hook, these are some notes about what I am thinking.

Process

  • metadata will be attached to the dataset object under a .metadata attribute
  • metadata will be updated after_node_run
  • metadata will be empty until a pipeline is ran with the hook on
  • optionally a function to add metadata will be added
  • metadata will be stored in a file next to the filepath
  • meta

Problems This Hook Should solve

  • what datasets have a columns with sales in the name
  • what datasets were updated after last tuesday
  • which pipeline node created this dataset
  • how many rows are in this dataset (without reloading all datasets)

implementation details

  • metadata will be attached to each dataset as a dictionary
  • list/dict comprehensions can be used to make queries

Metadata to Capture

try pandas method -> try spark -> try dict/list -> none

  • column names
  • length
  • Null count
  • created_by node name

Database?

Is there an easy way to create a nosql database in memory from a a list of dictionaries?