Posts tagged: kedro

All posts with the tag "kedro"

40 posts | latest post 2025-02-05
Publishing rhythm
Feb 2025 | 1 post

Kedro Basics

Learn Kedro in 5 days
- Day 0 Setup: vm, install, python, editor
- Day 1: kedro new, kedro viz
- Day 2: catalog, filter catalog, load data, fsspec
- Day 3: pipeline, nodes
- Day 4: filter pipeline, run partial pipeline
- Day 5: kedro docker, GitHub Actions
- Advanced Kedro: hooks, custom datasets, modular pipelines

What's New in Kedro 0.16.4

If we take a look at the release notes [1], I see one major feature improvement on the list: auto-discovery of hooks. ## Major features and improvements * Enabled auto-discovery of hooks implementations coming from installed plugins. This one comes as a bit of a surprise, as it was just casually mentioned in #435 [2]. Think pytest # [3] As mentioned in #435 [2], this is the model that pytest uses. Not all plugins automatically start doing things right out of the box; some require a CLI argument. simplicity # [4] It feels a bit crazy that simply installing a package will change the way that your pipeline gets executed. I do like that it requires just a bit less reaching into the framework stuff for the average user. Most folks will be able to write in the catalog and nodes without much change to the rest of the project. Implementation # [5] Reading through the docs [6], they show us that we can make our hooks automatically register by adding a kedro.hooks endpoint that points to a ...

Kedro Catalog

I am exploring a kedro catalog metadata hook; these are some notes about what I am thinking. Process # [1] - metadata will be attached to the dataset object under a .metadata attribute - metadata will be updated after_node_run - metadata will be empty until a pipeline is run with the hook on - optionally a function to add metadata will be added - metadata will be stored in a file next to the filepath - meta Problems This Hook Should solve # [2] - what datasets have a column with sales in the name - what datasets were updated after last Tuesday - which pipeline node created this dataset - how many rows are in this dataset (without reloading all datasets) implementation details # [3] - metadata will be attached to each dataset as a dictionary - list/dict comprehensions can be used to make queries Metadata to Capture # [4] try pandas method -> try spark -> try dict/list -> none - column names - length - Null count - created_by node name Database? # [5] Is there...
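The capture order in the notes above (pandas, then spark, then dict/list, then none) and the comprehension-style queries can be sketched roughly as follows. This is a minimal sketch using only the standard library: `capture_metadata` and `catalog_metadata` are hypothetical names, DataFrames are detected by duck typing rather than by importing pandas or spark, and null counts are left out for brevity.

```python
from typing import Any, Dict, Optional


def capture_metadata(data: Any, created_by: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: collect what we can about a dataset, else leave None."""
    meta: Dict[str, Any] = {"columns": None, "length": None, "created_by": created_by}

    columns = getattr(data, "columns", None)
    if columns is not None:
        # DataFrame-like object: both pandas and spark expose .columns
        meta["columns"] = list(columns)
        try:
            meta["length"] = len(data)  # works for pandas-like objects
        except TypeError:
            # spark-like objects have no len(); fall back to .count()
            count = getattr(data, "count", None)
            meta["length"] = count() if callable(count) else None
    elif isinstance(data, dict):
        meta["columns"] = list(data.keys())
        meta["length"] = len(data)  # number of keys, not rows
    elif isinstance(data, (list, tuple)):
        meta["length"] = len(data)
    return meta


# Metadata keyed by catalog entry name, as a plain dictionary.
catalog_metadata = {
    "raw_sales": capture_metadata(
        {"store_id": [1, 2], "total_sales": [10.0, 12.5]}, created_by="load_sales"
    ),
    "raw_stores": capture_metadata(
        {"store_id": [1, 2], "city": ["A", "B"]}, created_by="load_stores"
    ),
    "scores": capture_metadata([0.1, 0.2, 0.3]),
}

# Query: which datasets have a column with "sales" in the name?
with_sales = [
    name
    for name, meta in catalog_metadata.items()
    if any("sales" in col for col in (meta["columns"] or []))
]
print(with_sales)
```

Keeping the metadata as plain dictionaries means the other questions in the list (who created a dataset, how long it is) become similar one-line comprehensions.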

Gracefully adopt kedro, the catalog

Why use kedro catalog? # [1] While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me the full benefit of the catalog comes when you combine it with the pipeline and don’t even touch read/write steps at all. Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place, and stop manually writing read/write code, which can be different for each data and storage type. You just don’t need to think about it. --- - imperative loading style - organizes your data - all file locations can be quickly identified - can be dropped into kedro later --- “can be dropped into kedro later” Let’s talk a bit more about that. 2 Ways to Gracefully adopt the catalog # [2] How do I get started with the kedro catalog - add with the code api - load from yaml (recommended) 1. Adding to the catalog with the code api # [3] how to use ...

How to find things in your kedro catalog

kedro 0.16.2 just dropped last week with a long-awaited feature… catalog search! I went as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets. Being able to quickly search for what I need is so useful. The Catalog # [1] The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place makes it so easy to pick your project up and move it to a completely new environment. That sweet imperative loading style saves so much read/write overhead. I can load all my data with a single command whether it’s in amazon s3, google cloud platform, or a local file. Kick start a toy project # [2] Just like with most of these articles, I am going to create a conda environment so that I don’t break any existing projects and scaffold up a toy project to learn from. conda create -n kedro0162 py...

How Kedro handles your inputs

Passing inputs into kedro is a key concept. How it accepts a single catalog key as input is straightforward, but passing a list or dictionary of catalog entries can be a bit confusing. *args/**kwargs review # [1] Check out this post for a review of how *args **kwargs work in python. understanding python *args and **kwargs [2] python args and kwargs [3] article by @_waylonwalker [4] All Kedro inputs are catalog Entries # [5] When kedro runs your pipeline it uses the catalog to imperatively load your data, meaning that you don’t tell kedro how to load your data, you tell it where your data is and what type it is. These catalog entries are like a key-value store. You just need to give the key when setting up a node. Single Inputs # [6] These are fairly straightforward to understand. In the example below when kedro runs the pipeline it will load the input from the catalog, then pass that input to the func, then save the returned value to the out...
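To make the single/list/dictionary distinction concrete, here is a rough sketch of how catalog keys could be turned into function arguments. It does not use kedro itself: `call_with_inputs` is a hypothetical helper and the catalog is a plain dict standing in for loaded datasets. The assumption is that a single key becomes one positional argument, a list is unpacked with `*args`, and a dictionary maps argument names to catalog keys and is unpacked with `**kwargs`.

```python
from typing import Any, Callable, Dict, List, Union

Inputs = Union[None, str, List[str], Dict[str, str]]


def call_with_inputs(func: Callable[..., Any], inputs: Inputs, catalog: Dict[str, Any]) -> Any:
    """Hypothetical stand-in for a node run: look up inputs by key, then call func."""
    if inputs is None:
        return func()
    if isinstance(inputs, str):
        # single catalog key -> one positional argument
        return func(catalog[inputs])
    if isinstance(inputs, list):
        # list of catalog keys -> positional arguments, in order (*args)
        return func(*[catalog[key] for key in inputs])
    if isinstance(inputs, dict):
        # {argument_name: catalog_key} -> keyword arguments (**kwargs)
        return func(**{arg: catalog[key] for arg, key in inputs.items()})
    raise TypeError(f"unsupported inputs type: {type(inputs).__name__}")


def combine(a, b):
    return a + b


# A plain dict playing the role of the catalog's key-value store.
catalog = {"cars": [1, 2, 3], "trucks": [4, 5]}

single = call_with_inputs(len, "cars", catalog)
as_list = call_with_inputs(combine, ["cars", "trucks"], catalog)
as_dict = call_with_inputs(combine, {"b": "cars", "a": "trucks"}, catalog)
print(single, as_list, as_dict)
```

With a dictionary the order of the entries does not matter, because each catalog key is bound to a named argument of the function.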

004

🔥 #kedrotips use find-kedro to assemble your pipelines

1 min

002

** 0.3.0 just launched with _ support 🎉

1 min

Kedro Static Viz 0.3.0 is out with Hooks Support

kedro-static-viz [1] is out with support for the newly released hooks feature. This means that you can have kedro-static-viz automatically deploy a full gatsby site before_pipeline_run keeping your visualization always up to date. Even though it is a static site there is no functionality lost. The only thing that’s missing is the flask server. With kedro-static-viz [1] you can deploy your visualization to a number of static hosting providers such as GitHub pages free of charge with wicked fast performance ⚡ It’s Fast # [2] Even though it’s built on gatsbyjs the full site builds in under 2s even on slower hardware. This is because the site is already pre-rendered and stripped of any excess. It’s zipped up right into the python package and is typically used with the cli, but now can be used with python, or as a hook as well. What is kedro-viz [3] 🤔 # [4] Kedro viz is a fantastic kedro plugin that allows you to visualize your data pipeline. Kedro allows you to quickly build produc...

Brainstorming Kedro Hooks

This post is a 🧠 brainstorming work in progress. I will likely use it as a storage location/brain dump of hook ideas. What is Kedro 🤔 # [1] If you are completely unsure what kedro is be sure to check out my what is kedro [2] post after_catalog_created # [3] - filepath replacer - bucket replacer before_pipeline_run # [4] - preflight - check that data exists - run kedro_static_viz - run mypy - run interrogate - run flake8 after_pipeline_run # [5] - Great Expectations - send email - send slack before_node_run # [6] after_node_run # [7] - Great Expectations - save stats/meta data Execution Order # [8] hooks are executed in reverse order of the hooks list. hooks with tryfirst will be moved to the end of the list, so they run first. hooks with trylast will be moved to the start of the list, so they run last. - after_catalog_created - before_pipeline_run - args - run_params = {'run_id': '2020-05-23T15.24.23.958Z', 'project_path': '/mnt/c/temp/kedro0160', 'env': 'local', 'kedro_version': '...
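The execution-order note above can be illustrated with a small standard-library sketch. It assumes pluggy-style ordering: implementations are kept in a list that is called in reverse, tryfirst entries go to the end of the list (so they run first) and trylast entries go to the start (so they run last). `HookImpl`, `register`, and `call_order` are hypothetical stand-ins, not the real plugin manager.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HookImpl:
    """Hypothetical stand-in for one registered hook implementation."""
    name: str
    tryfirst: bool = False
    trylast: bool = False


def register(hooks: List[HookImpl], impl: HookImpl) -> None:
    """Insert impl so that reversed(hooks) is the order the hooks get called in."""
    if impl.trylast:
        hooks.insert(0, impl)  # start of the list -> called last
    elif impl.tryfirst:
        hooks.append(impl)  # end of the list -> called first
    else:
        # plain hooks go just before any tryfirst entries at the end
        i = len(hooks)
        while i > 0 and hooks[i - 1].tryfirst:
            i -= 1
        hooks.insert(i, impl)


def call_order(hooks: List[HookImpl]) -> List[str]:
    """Names in the order they would run: the list is walked in reverse."""
    return [impl.name for impl in reversed(hooks)]


hooks: List[HookImpl] = []
for impl in [
    HookImpl("a"),
    HookImpl("b"),
    HookImpl("first", tryfirst=True),
    HookImpl("last", trylast=True),
    HookImpl("c"),
]:
    register(hooks, impl)

print(call_order(hooks))
```

Among plain hooks the most recently registered one runs first, which is why "c" ends up ahead of "b" and "a" here.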

Create Custom Kedro Dataset

Kedro provides an efficient way to build out data catalogs with their yaml api. It allows you to be very declarative about loading and saving your data. For the most part you just need to tell Kedro what connector to use and its filepath. When running, Kedro takes care of all of the read/write; you just reference the catalog key. But what is happening behind the scenes # [1] Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind the scenes structure for us so that we don’t have to. For the most part kedro has connectors for about anything that you want to load (csv, parquet, sql, json) from about anywhere; http, s3, and the local file system are just some of the examples. Here is a barebones DataSet implementation straight from their docs. Parameters from the yaml catalog will get passed in from pathlib import Path import pandas as pd from kedro.io import AbstractDataSet class MyOwnDataSet(...

creating the kedro-preflight hook

Kedro Hooks Intro - kedro hooks are an exciting upcoming feature of kedro 0.16.0. They allow you to hook into catalog_created, pipeline_run, and node_run (nouns), with a before or after prefix. This really reminds me of React's lifecycle hooks, which let you hook into various states of React web components. This is going to make kedro so extendable by the community. I am super pumped to see what the community is able to do with this ability. What is Kedro [1] If you are completely unsure what kedro is be sure to check out my what is kedro post Docs # [2] a w...

📝 Kedro Preflight Notes

This is a very rough idea for a kedro package to prevent losing time by getting partway through a pipeline run only to realize that you don't have access to data or resources.

Must haves:
- check that inputs exist or are of a type to skip (sql)

Good to haves:
- check that all input and output databases are accessible with good credentials
- check for s3 bucket access
- check for spark install

Implementation:
    @hook_spec
    def before_pipeline_run(run_params, pipeline, catalog):

run params:
    {
        "run_id": str,
        "project_path": str,
        "env": str,
        "kedro_version": str,
        "tags": Optional[List[str]],
        "from_nodes": Optional[List[str]],
        "to_nodes": Optional[List[str]],
        "node_names": Optional[List[str]],
        "from_inputs": Optional[List[str]],
        "load_versions": Optional[List[str]],
        "pipeline_name": str,
        "extra_params": Optional[Dict[str, Any]]
    }
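The "check that inputs exist" must-have could look something like the sketch below. It is only an illustration under stated assumptions: `missing_inputs` and `preflight` are hypothetical names, the pipeline's inputs are given as a plain list of catalog keys, and each dataset is reduced to an optional local filepath (entries with no filepath, such as sql, are skipped). How those values would be pulled out of the real `pipeline` and `catalog` objects inside `before_pipeline_run` is not shown.

```python
import os
import tempfile
from typing import Dict, Iterable, List, Optional


def missing_inputs(pipeline_inputs: Iterable[str], filepaths: Dict[str, Optional[str]]) -> List[str]:
    """Return the inputs whose file is missing; inputs without a filepath are skipped."""
    missing = []
    for name in pipeline_inputs:
        path = filepaths.get(name)
        if path is None:
            continue  # e.g. a sql dataset: nothing on disk to check
        if not os.path.exists(path):
            missing.append(name)
    return missing


def preflight(pipeline_inputs: Iterable[str], filepaths: Dict[str, Optional[str]]) -> None:
    """Fail fast, before any node runs, if an input file is not there."""
    missing = missing_inputs(pipeline_inputs, filepaths)
    if missing:
        raise FileNotFoundError(f"preflight failed, missing inputs: {missing}")


# Demo with a temporary directory: one file present, one absent, one skipped.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "raw.csv")
    open(present, "w").close()
    filepaths = {
        "raw": present,
        "lookup": os.path.join(tmp, "lookup.csv"),  # never created
        "warehouse_table": None,  # no filepath, skipped
    }
    result = missing_inputs(["raw", "lookup", "warehouse_table"], filepaths)

print(result)
```

Collecting every missing input before raising means a single failed preflight reports all the problems at once instead of one per run.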

📢 Announcing find-kedro

find-kedro is a small library to enhance your kedro experience. It looks through your modules to find kedro pipelines, nodes, and iterables (lists, sets, tuples) of nodes. It then assembles them into a dictionary of pipelines, with each module creating a separate pipeline and __default__ being a combination of all pipelines. This format is compatible with the kedro _create_pipelines format. kedro is a ✨ fantastic project that allows for super-fast prototyping of data pipelines, while yielding production-ready pipelines. find-kedro enhances this experience by adding pytest-like node/pipeline discovery, eliminating the need to bubble up pipelines through modules. When working on larger pipeline projects, it is advisable to break your project down into different sub-modules which requires knowledge of building python libraries, and knowing how to import each module correctly. While this is not too difficult, in some cases, it can trip up even the most se...

Create New Kedro Project

This is a quickstart to getting a new kedro [1] pipeline up and running. After this article you should be able to understand how to get started with kedro [1]. You can learn more about this Hello World Example [2] in the docs [2] 🧹 Install Kedro [1] 🛢 Create the Example Pipeline 💨 Run the example 📉 Show the pipeline visualization Create a Virtual Environment [3] # [4] I use conda to control my virtual environments and will create a new environment called kedro_iris with the following command. Note: the latest compatible version of python is 3.7. EDIT: as of kedro 0.16.0 kedro supports up to 3.8 conda create -n kedro_iris python=3.8 -y [5] Options Activate your conda environment # [6] I try to keep my base environment as clean as possible. I have run into too many issues installing things in the base environment. Almost always it's some dependency that starts causing issues, making it even harder to realize where it's coming from as I never even installed it in base. source...

What is Kedro

What is Kedro [1] This is my original what-is-kedro article. There is a brand new one --- kedro [2] is an open-source data pipeline framework. It provides guardrails to set your project up right from the start without needing to know deeply how to set up your own python library for data pipelining. It includes great ways to manipulate catalogs and pipelines. This article will cover the 10K view of kedro [2], future articles will dive deeper into each one. Libraries # [3] Currently, kedro [2] is broken down into 3 different libraries. 💎 kedro [2] 📉 kedro-viz [4] 🏗 kedro-docker [5] kedro [2] # [6] [7] kedro [2] ...

Kedro

See all of my kedro related posts in [[ tag/kedro ]]. #kedrotips [1] # [2] I am tweeting out most of these snippets as I add them, you can find them all here #kedrotips [3]. 🗣 Heads up # [4] Below are some quick snippets/notes for when using kedro to build data pipelines. So far I am just compiling snippets. Eventually I will create several posts on kedro. These are mostly things that I use in my everyday work with kedro. Some are a bit more esoteric. Some are helpful when writing production code, some are more useful for exploration. 📚 Catalog # [5] [6] Photo by jesse orrico on Unsplash CSVLocalDataSet # [7] python import pandas as pd from kedro.io import CSVLocalDataSet iris = pd.read_csv('https://raw.githubusercontent.com/kedro-org/kedro/d3218bd89ce8d1148b1f79dfe589065f47037be6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/iris.csv') iris_data_set = CSVLocalDataSet(filepath="test.csv", load_args=None, save_args={"index": False}) iris_data_set.save(iris) reloaded_iris = iris_data_se...