Kedro Catalog



I am exploring a Kedro catalog metadata hook. These are some notes on what I am thinking.

Process

  • metadata will be attached to the dataset object under a .metadata attribute
  • metadata will be updated after_node_run
  • metadata will be empty until a pipeline is run with the hook on
  • optionally a function to add metadata will be added
  • metadata will be stored in a file next to the filepath
  • meta
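
Here is a rough sketch of how the process above could look as a hook. It is not a finished implementation. The `after_node_run` hook and `hook_impl` come from Kedro, but everything else is an assumption made for illustration: the class name `CatalogMetadataHook`, the dict-style `catalog.get(name)` lookup (Kedro's catalog lookup differs between versions), the `_filepath` attribute on the dataset, and the `.meta.json` sidecar naming. The usage at the bottom runs against stand-in objects rather than a real Kedro session.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from types import SimpleNamespace

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback so the sketch runs without Kedro installed.
    def hook_impl(func):
        return func


class CatalogMetadataHook:
    """Hypothetical hook: attach metadata to each output dataset of a node."""

    @hook_impl
    def after_node_run(self, node, catalog, outputs):
        for name in outputs:
            # Assumption: the catalog supports dict-style lookup by name.
            dataset = catalog.get(name)
            if dataset is None:
                continue
            # Metadata lives on the dataset object under .metadata.
            dataset.metadata = {
                "created_by": node.name,
                "updated_at": datetime.now(timezone.utc).isoformat(),
            }
            # Assumption: file-backed datasets expose a _filepath attribute.
            filepath = getattr(dataset, "_filepath", None)
            if filepath is not None:
                # Store the metadata in a sidecar file next to the data file.
                sidecar = Path(f"{filepath}.meta.json")
                sidecar.write_text(json.dumps(dataset.metadata, indent=2))


# Usage with stand-ins for Kedro's node, dataset, and catalog objects.
with tempfile.TemporaryDirectory() as tmp:
    dataset = SimpleNamespace(_filepath=Path(tmp) / "sales.csv")
    catalog = {"sales": dataset}
    node = SimpleNamespace(name="make_sales")
    CatalogMetadataHook().after_node_run(
        node=node, catalog=catalog, outputs={"sales": [1, 2, 3]}
    )
    sidecar_contents = json.loads(
        Path(f"{dataset._filepath}.meta.json").read_text()
    )
```

Until a pipeline runs with the hook registered, no dataset would have a `.metadata` attribute, which matches the "empty until a pipeline is run" note above.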

Problems This Hook Should Solve

  • which datasets have a column with "sales" in the name
  • which datasets were updated after last Tuesday
  • which pipeline node created this dataset
  • how many rows are in this dataset (without reloading all datasets)

Implementation Details

  • metadata will be attached to each dataset as a dictionary
  • list/dict comprehensions can be used to make queries
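
To make the comprehension idea concrete, here is a small sketch of what the queries might look like. The dataset names, keys, and values in `catalog_metadata` are all made up for illustration; the only assumption carried over from the notes is that each dataset ends up with a metadata dictionary.

```python
from datetime import datetime

# Made-up metadata, shaped like the dictionaries the hook might attach.
catalog_metadata = {
    "raw_sales": {
        "columns": ["store", "sales_total"],
        "length": 1200,
        "created_by": "load_sales",
        "updated_at": datetime(2021, 3, 2),
    },
    "customers": {
        "columns": ["id", "name"],
        "length": 300,
        "created_by": "load_customers",
        "updated_at": datetime(2021, 3, 12),
    },
    "sales_summary": {
        "columns": ["region", "total_sales"],
        "length": 12,
        "created_by": "summarize_sales",
        "updated_at": datetime(2021, 3, 11),
    },
}

# Which datasets have a column with "sales" in the name?
with_sales_column = [
    name
    for name, meta in catalog_metadata.items()
    if any("sales" in column for column in meta["columns"])
]

# Which datasets were updated after a given date?
cutoff = datetime(2021, 3, 9)
updated_since = [
    name for name, meta in catalog_metadata.items() if meta["updated_at"] > cutoff
]

# Which pipeline node created each dataset?
created_by = {name: meta["created_by"] for name, meta in catalog_metadata.items()}

# How many rows are in each dataset, without reloading any data?
row_counts = {name: meta["length"] for name, meta in catalog_metadata.items()}
```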

Metadata to Capture

Try each approach in order: pandas method -> Spark -> dict/list -> none

  • column names
  • length
  • Null count
  • created_by node name
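
One way the fallback chain could be written is sketched below. The function name `capture_metadata` and the shape of the returned dictionary are hypothetical. Two choices here are my assumptions rather than anything settled in these notes: dict keys are treated as "column names", and the null count is skipped for Spark because it needs a separate aggregation.

```python
def capture_metadata(data, node_name=None):
    """Collect basic metadata, trying pandas, then Spark, then dict/list."""
    meta = {
        "columns": None,
        "length": None,
        "null_count": None,
        "created_by": node_name,
    }

    # 1. pandas DataFrame
    try:
        import pandas as pd

        if isinstance(data, pd.DataFrame):
            meta.update(
                columns=[str(column) for column in data.columns],
                length=len(data),
                null_count=int(data.isna().sum().sum()),
            )
            return meta
    except ImportError:
        pass

    # 2. Spark DataFrame (null count left as None in this sketch)
    try:
        from pyspark.sql import DataFrame as SparkDataFrame

        if isinstance(data, SparkDataFrame):
            meta.update(columns=list(data.columns), length=data.count())
            return meta
    except ImportError:
        pass

    # 3. Plain dict or list
    if isinstance(data, dict):
        meta.update(
            columns=[str(key) for key in data],
            length=len(data),
            null_count=sum(value is None for value in data.values()),
        )
    elif isinstance(data, (list, tuple)):
        meta.update(
            length=len(data),
            null_count=sum(value is None for value in data),
        )

    # 4. Anything else: every field except created_by stays None.
    return meta
```

Note that `data.count()` triggers a Spark job, so capturing length for large Spark datasets may be expensive.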

Database?

Is there an easy way to create an in-memory NoSQL database from a list of dictionaries?



