[1]
Migrating from kedro 0.18.4 to the latest version involves moving off the deprecated ConfigLoader and TemplatedConfigLoader and onto the OmegaConfigLoader. Switching over does not look as bad as I originally thought.
- installing kedro 0.18.5+
- set the CONFIG_LOADER_CLASS in settings.py
- swap out import statements
- config must be yaml or json
- getting values from config must be done with bracket __getitem__ style, not with .get
- any Exceptions caught from the TemplatedConfigLoader will need to be swapped to OmegaConfigLoader exceptions, similar to #3
- templated values must lead with an _
- Globals are handled differently
- OmegaConfigLoader does not support jinja2 syntax, but rather a ${variable} syntax
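To get a feel for the ${variable} style and the leading underscore from the last bullets, here is a minimal sketch using Python's standard-library string.Template, which happens to use the same ${name} placeholder shape. It is only a stand-in for illustration, not how the OmegaConfigLoader resolves values, and the _base_path key and file path are made up.

```python
from string import Template

# Hypothetical config values; the leading underscore marks a template-only key.
config = {
    "_base_path": "data/01_raw",
    "cars_filepath": "${_base_path}/cars.csv",
}

# Keys that start with "_" act as variables for the other entries.
variables = {k: v for k, v in config.items() if k.startswith("_")}

# Substitute the variables into every non-template entry.
resolved = {
    key: Template(value).substitute(variables)
    for key, value in config.items()
    if not key.startswith("_")
}

print(resolved["cars_filepath"])
```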
Note
This post is a thought [2]. It’s a short note that I make
about someone else’s content online #thoughts
References:
[1]: /static/https://docs.kedro.org/en/stable/configuration/config_loader_migration.html
[2]: /thoughts/
Posts tagged: kedro
All posts with the tag "kedro"
40 posts
latest post 2025-02-05
Kedro rich is a very new and unstable (it’s good, just not ready) plugin for
kedro to make the command line prettier.
Install kedro rich # [1]
There is no pypi package yet, but it’s on github. You can pip install it with
the git [2] url.
pip install git+https://github.com/datajoely/kedro-rich
Kedro run # [3]
You can run your pipeline just as you normally would, except you get progress
bars and pretty prints.
kedro run
[4]
Kedro catalog # [5]
Listing out catalog entries from the command line now prints out a nice pretty
table.
kedro catalog list
[6]
Give it a star # [7]
Go to the GitHub repo [8] and give it a
star, Joel deserves it.
References:
[1]: #install-kedro-rich
[2]: /glossary/git/
[3]: #kedro-run
[4]: https://images.waylonwalker.com/kedro-rich-run.png
[5]: #kedro-catalog
[6]: https://images.waylonwalker.com/kedro-rich-catalog-list.png
[7]: #give-it-a-star
[8]: https://github.com/datajoely/kedro-rich
I keep my nodes short and sweet. They do one thing and do it well. I
turn almost every DataFrame transformation into its own node. It makes
it much easier to pull catalog entries than firing up the pipeline,
running it, and starting a debugger. For this reason many of my nodes
can be built from inline lambdas.
Examples # [1]
Here are two examples. The first one, lambda x: x, is sometimes referred
to as an identity function. This is super common to use in the early
phases of a project. It lets you follow standard layering conventions,
without skipping a layer, overthinking if you should have the layer or
not, and leaves a good placeholder to fill in later when you need it.
Many times I just want to get the data in as fast as possible, learn
about it, then go back and tidy it up.
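from kedro.pipeline import node
# identity node: passes raw_cars through unchanged
my_first_node = node(
    func=lambda x: x,
    inputs='raw_cars',
    outputs='int_cars',  # the keyword is outputs (plural)
    tags=['int'],
)
# one-line transformation: keep three columns and filter on disp
my_second_node = node(
    func=lambda cars: cars[['mpg', 'cyl', 'disp']].query('disp>200'),
    inputs='raw_cars',
    outputs='int_cars',
    tags=['pri'],
)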
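Since a lambda node is just a plain function, you can sanity check it without kedro at all. A minimal sketch, using a made-up list of dicts in place of the cars DataFrame so it runs with only the standard library:

```python
# Hypothetical stand-in for the raw_cars catalog entry.
raw_cars = [
    {"mpg": 21.0, "cyl": 6, "disp": 160.0},
    {"mpg": 18.7, "cyl": 8, "disp": 360.0},
    {"mpg": 14.3, "cyl": 8, "disp": 360.0},
]

# The identity function: passes data through unchanged.
identity = lambda x: x

# A one-line filter, similar in spirit to .query('disp>200').
big_engines = lambda cars: [car for car in cars if car["disp"] > 200]

int_cars = identity(raw_cars)
pri_cars = big_engines(int_cars)
print(len(pri_cars))
```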
Note: try not to take the idea of a one liner too far. If your
one line function wraps several lines down it probably deserv...
As you work on your kedro projects you are bound to need to add more
dependencies to the project eventually. Kedro uses a fantastic command
pip-compile under the hood to ensure that everyone is on the same version of
packages at all times, and able to easily upgrade them. It might be a bit
of a different workflow than what you have seen, let’s take a look at it.
git status # [2]
Before you start mucking around with any changes to dependencies make sure that
your git status is clean. I’d even recommend starting a new branch for this,
and if you are working on a team potentially submit this as its own PR for
clarity.
git status
git checkout main
git checkout -b add-rich-dependency
requirements.in # [3]
New requirements get added to a requirements.in file. If you need to specify
an exact version, or a minimum version you can do that, but if all versions
generally work you can leave it open.
# requirements.in
rich
Here I added the popular rich package to my requirements.in file. Since
I am ok with the latest version I am not going to pin anything, I am going to
let the pip resolver pick the latest version that does not conflict with any of
my dependencies for me.
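The three options above (exact pin, minimum version, or left open) look like this in practice. A small sketch that classifies requirement lines with a regular expression; the version numbers are placeholders and the parsing is deliberately naive compared to what pip does.

```python
import re

# Example requirements.in lines; the versions are placeholders.
lines = ["rich", "rich==10.0.0", "rich>=10.0.0"]


def classify(requirement: str) -> str:
    """Label a requirement line as pinned, minimum, or open."""
    match = re.match(r"^[A-Za-z0-9_.-]+\s*(==|>=)?", requirement)
    operator = match.group(1) if match else None
    if operator == "==":
        return "pinned"
    if operator == ">=":
        return "minimum"
    return "open"


for line in lines:
    print(f"{line!r}: {classify(line)}")
```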
build-reqs # [4]
...
I am a huge believer in practicing your craft. Professional athletes
spend most of their time honing their skills and making themselves
better. In Engineering many spend nearly 0 time practicing. I am not
saying that you need to spend all your free time practicing, but a few
minutes trying new things can go a long way in how you understand what
you are doing and make a huge impact on your long term productivity.
What is Kedro [1]
Start practicing # [2]
practice building pipelines with #kedro today
Go to your playground directory, and if you don’t have one, make one.
cd ~/playground
get pipx # [3]
Install pipx in your system python. This is one of the very few, and
possibly the only python library that deserves to be installed in your
system directory, primarily because it’s used to sandbox clis in their own
virtual environment [4] automatically for you.
pip install pipx
make a new project # [5]
From inside your playground directory, start your new kedro project.
This is quite simple and painless. So much so that if you mess this one
up doing something wild, it might be easier to make a new one than
fixing the wild one.
pipx run kedro new
# answer the questions it asks
I u...
I just installed a brand new Ubuntu 21.10 Impish Indri, and wanted a
kedro project to play with so I did what any good kedroid would do, I
went to my command line and ran
pipx run kedro new --starter spaceflights
But what I got back was not what I expected!
Fatal error from pip prevented installation. Full pip output in file:
/home/walkers/.local/pipx/logs/cmd_2022-01-01_20.42.16_pip_errors.log
Some possibly relevant errors from pip install:
ERROR: Could not find a version that satisfies the requirement kedro (from versions: none)
ERROR: No matching distribution found for kedro
Error installing kedro.
This is weird, why can’t I run kedro new with pipx? Let’s try pip.
pip install kedro
Same issue.
ERROR: Could not find a version that satisfies the requirement kedro (from versions: none)
ERROR: No matching distribution found for kedro
What is Kedro [1]
Curious what kedro is? Check out this article.
What’s up # [2]
wrong python version
The issue is that kedro only runs on up to python 3.8, and on Ubuntu
21.10 when you apt install python3 you get python 3.9 and the
standard repos don’t have an old enough version to run kedro.
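A quick way to confirm this diagnosis is to compare the running interpreter against the supported ceiling. A minimal sketch, assuming the python 3.8 limit described above, which applies to the kedro release from when this was written; is_supported is a hypothetical helper, not part of kedro or pip.

```python
import sys

# Ceiling taken from the post; adjust for the kedro version you install.
MAX_SUPPORTED = (3, 8)


def is_supported(version: tuple, max_supported: tuple = MAX_SUPPORTED) -> bool:
    """Return True if a (major, minor) version is at or below the ceiling."""
    return tuple(version[:2]) <= max_supported


current = sys.version_info[:2]
print(f"python {current[0]}.{current[1]} supported: {is_supported(current)}")
```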
How to fix this? # [3]
There’s a couple of wa...
kedro catalog create
I use kedro catalog create to boost my productivity by automatically
generating yaml catalog entries for me. It will create new yaml files for each
pipeline, fill in missing catalog entries, and respect already existing
catalog entries. It will reformat the file, and sort it based on catalog key.
https://youtu.be/_22ELT4kja4
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
Running Kedro Catalog Create # [2]
This command ensures there are catalog entries for every dataset in the pipeline
you pass in.
kedro catalog create --pipeline history_nodes
- Creates a new yaml file, if needed
- Fills in new dataset entries with the default dataset
- Keeps existing datasets untouched
- it will reformat your yaml file a bit
- default sorting will be applied
- empty newlines will be removed
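The behaviour in the list above can be sketched with plain dictionaries. This is only an illustration of the rules, not kedro’s implementation; the dataset names are made up, and using MemoryDataSet as the default reflects my understanding of what the command writes for missing entries.

```python
# Existing catalog entries (hypothetical), as they might be parsed from yaml.
existing = {
    "raw_history": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/history.csv"},
}

# Every dataset name used by the pipeline (hypothetical).
pipeline_datasets = ["raw_history", "int_history", "clean_history"]


def fill_catalog(existing: dict, datasets: list) -> dict:
    """Add a default entry for each missing dataset and sort by key."""
    catalog = dict(existing)  # existing entries stay untouched
    for name in datasets:
        catalog.setdefault(name, {"type": "MemoryDataSet"})
    return dict(sorted(catalog.items()))


catalog = fill_catalog(existing, pipeline_datasets)
for name, entry in catalog.items():
    print(name, entry)
```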
CONF_ROOT # [3]
Kedro will respect your CONF_ROOT settings when it creates a new catalog
file, or looks for existing catalog files. You can change the location of your
configuration f...
nvim conf 2021 | IDE's are slow | Waylon Walker
https://youtu.be/E18m4KkJUnI
---
Slides 👇 # [1]
welcome # [2]
Other possible titles # [3]
- Using Vim as a Team Lead
- I 💜 Tmux
- Why I stopped using @code
- Get there fast
- How I vim
It’s ok # [4]
Use a graphical IDE if it works for you.
Trick it out # [5]
vim is so well integrated into the terminal, take advantage
It wasn’t working for me anymore # [6]
dozens of instances # [7]
As a team lead I bounce between a dozen projects per day
https://pbs.twimg.com/media/FAEmRjYUcAUk2eR?format=jpg&name=large [8]
Move With Intent # [9]
Running vim inside tmux lets me move swiftly between the exact project I need.
https://twitter.com/_WaylonWalker/status/1438849269407047686/photo/1 [10]
Hub and Spoke # [11]
- direct link to specific projects
- fuzzy into all projects
- fuzzy into open projects
How I navigate tmux in 2021 [12]
Other Things That Make this Possible # [13]
- tmux
- direnv
vim adjacent things
yes, vim is ugly, make it your...
Kedro-Broken-Urls
Broken Urls # [1]
- https://github.com/josephhaaga)
- https://example.com/file.h5
- https://raw.githubusercontent.com/kedro-org/kedro/develop/static/img/pipeline_visualisation.png
- https://example.com/file.txt
- https://github.com/jmespath/jmespath.py.
- https://github.com/tsanikgr)
- https://example.com/file.csv
- https://kedro.readthedocs.io/en/latest/04_user_guide/15_hooks.html
- https://kedro.readthedocs.io/en/stable/07_extend_kedro/04_hooks.html
- https://github.com/EbookFoundation/free-programming-books/blob/master/books/free-programming-books.md#python
- https://github.com/quantumblacklabs/private-kedro/blob/develop/docs/source/04_user_guide/04_data_catalog.md
- http://example.com/api/test
- https://example.com/file.parquet
- https://kedro.readthedocs.io/en/stable/11_faq/01_faq.html#how-do-i-upgrade-kedro
- https://example.com/file.xlsx
- https://www.datacamp.com/community/tutorials/docstrings-python
- https://github.com/mmchougule)
- https://example.com/f...
Setting Parameters in kedro
Parameters are a place for you to store variables for your pipeline that can be
accessed by any node that needs it, and can be easily changed by changing your
environment. Parameters are stored in the repository in yaml files.
https://youtu.be/Jj5cQ5bqcjg
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
parameters files # [2]
You can have multiple parameters files and choose which ones to load by setting
your environment. By default kedro will give you a base and local
parameters file.
- conf/base/parameters.yml
- conf/local/parameters.yml
base # [3]
The base environment should contain all of the default values you want to run.
# /conf/base/parameters.yml
test_size: 0.2
random_state: 3
features:
- engines
- passenger_capacity
- crew
- d_check_complete
- moon_clearance_complete
- iata_approved
- company_rating
- review_scores_rating
NOTE base will always be loaded first.
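Because base is loaded first, values from another environment such as local replace the base values for the same key. A minimal sketch of that layering with plain dictionaries; the local override value is made up, and the real config loader reads yaml files rather than Python literals.

```python
# Values from conf/base/parameters.yml (shortened).
base = {"test_size": 0.2, "random_state": 3, "features": ["engines", "crew"]}

# Hypothetical override in conf/local/parameters.yml.
local = {"test_size": 0.3}

# Base first, then the environment on top: later keys win.
parameters = {**base, **local}

print(parameters)
```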
accessing parameters # [4]
Parameters can be accessed through context or throug...
Writing your first kedro Nodes
https://youtu.be/-gEwU-MrPuA
Before we jump in with anything crazy, let’s make some nodes with some vanilla
data structures.
import node # [1]
You will need to import node from kedro.pipeline to start creating nodes.
from kedro.pipeline import node
func # [2]
The func is a callable that will take the inputs and create the outputs.
inputs / outputs # [3]
Inputs and outputs can be None, a single catalog entry as a string, multiple
catalog entries as a List of strings, or a dictionary of strings where the key
is the keyword argument of the func and the value is the catalog entry to use
for that keyword.
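To make the four shapes concrete, here is a rough sketch of how each form of inputs could be turned into a function call. call_func and the catalog dict are hypothetical helpers written for this note, not kedro’s API.

```python
def call_func(func, inputs, catalog):
    """Call func with data looked up from catalog based on the shape of inputs."""
    if inputs is None:
        return func()
    if isinstance(inputs, str):
        return func(catalog[inputs])
    if isinstance(inputs, list):
        return func(*[catalog[name] for name in inputs])
    if isinstance(inputs, dict):
        # key is the keyword argument, value is the catalog entry
        return func(**{kwarg: catalog[name] for kwarg, name in inputs.items()})
    raise TypeError(f"unsupported inputs: {inputs!r}")


catalog = {"a": 2, "b": 5}

print(call_func(lambda: 100, None, catalog))
print(call_func(lambda x: x * 2, "a", catalog))
print(call_func(lambda x, y: x + y, ["a", "b"], catalog))
print(call_func(lambda left, right: left - right, {"left": "b", "right": "a"}, catalog))
```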
our first node # [4]
Sometimes in our pipelines our data is coming from an api where we already have
python functions built to pull with. That’s ok, kedro supports that with
inputs=None.
def create_range():
    return range(100)
make_range = node(
    func=create_range,
    inputs=None,
    outputs='range'
)
second node # [5]
Now we have some data to work from, let’s use that as our inpu...
Running your Kedro Pipeline from the command line
Running your kedro pipeline from the command line could not be any easier to
get started. This is a concept that you may or may not do often depending on
your workflow, but it’s good to have under your belt. I personally do this half
the time and run from ipython half the time. In production, I mostly use docker
and that is all done with this cli.
https://youtu.be/ZmccpLy-OEI
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
Kedro run # [2]
To run the whole darn project all we need to do is fire up a terminal, activate
our environment, and tell kedro to run.
kedro run
Specific Pipelines # [3]
Running a sub pipeline that we have created is as easy as telling kedro which
one we want to run.
kedro run --pipeline dp
Single Nodes # [4]
While developing a node or a small list of nodes in a larger pipeline it’s handy
to be able to run them one at a time. Besides the use case of developing a
single node I would not recommend leaning very heavily on running single nodes,
le...
kedro Virtual Environment
Avoid serious version conflict issues, and use a virtual environment [1] anytime
you are running python, here are three ways you can setup a kedro virtual
environment.
https://youtu.be/ZSxc5VVCBhM
- conda
- venv
- pipenv
conda # [2]
I prefer to use conda as my virtual environment manager as it gives me
both the interpreter and the packages I install. I don’t have to rely on the
system version of python or another tool to maintain python versions at all, I
get everything in one tool.
conda create -n my-project python=3.8 -y
conda activate my-project
python -m pip install --upgrade pip
pip install -e src
conda info --envs
- stores environment in a root directory i.e. ~/miniconda3
- conda has its own way to manage environments, environment.yml
- the python interpreter is packaged with the environment
virtualenv # [3]
Virtual env (venv) is another very respectable option that is built right into
python, and requires no additional installs or using a different dis...
Kedro Pipeline Create
Kedro pipeline create is a command that makes creating new
pipelines much easier. There is much less boilerplate that
you need to write yourself.
https://youtu.be/HtyIKqlEoNw
creating a new pipeline # [1]
The kedro cli comes with the following command to scaffold out
new pipelines. Note that it will not add it to your
pipeline_registry, to be covered later, you will need to add
it yourself.
kedro pipeline create example
results # [2]
The directory structure that it creates looks like this.
tree src/kedro_conda/pipelines
src/kedro_conda/pipelines
├── __init__.py
└── example
    ├── __init__.py
    ├── nodes.py
    ├── pipeline.py
    └── README.md
References:
[1]: #creating-a-new-pipeline
[2]: #results
Kedro Install
Kedro comes with an install command to install and manage all of your
project’s dependencies.
https://youtu.be/IWimEs-hHQg
cd into your project directory and activate env # [1]
You must start by having your kedro project either cloned down
from an existing project or created from kedro new. Then
activate your environment.
Kedro New [2]
this post covers kedro new
kedro Virtual Environment [3]
This post covers creating your virtual environment [4] for kedro
install kedro # [5]
Make sure you have kedro installed in your current
environment, if you don’t already have it.
pip install kedro==0.17.4
pip-tools # [6]
Kedro uses the pip-tools package under the hood to pin
dependencies in a very robust way to ensure that the project
will continue to work on everyone’s machine, including
production, day in and day out. No matter what happens to the
dependencies you have installed.
pip-compile # [7]
The command that kedro uses from pip-tools is pip-compile. It will look at
what yo...
Kedro Git Init
Immediately after kedro new, before you run kedro install or write your first
line of code, the first thing you should always do is
git init.
https://youtu.be/IGba3ytf_6U
git init # [2]
It’s as simple as these three commands to get started.
git init
git add .
git commit -m init
I don’t care if this project is for learning, if it will never have a remote or not, use git.
References:
[1]: /glossary/git/
[2]: #git-init
Kedro New
https://youtu.be/uqiv5LAiJe0
Kedro new is simply a wrapper around the cookiecutter templating library. The
kedro team maintains a ready made template that has everything you need for a
kedro project. They also maintain a few kedro starters, which are very similar
to the base template.
What is Kedro [1]
Unsure what kedro is? Check out yesterday’s post on What is Kedro.
pipx # [2]
I recommend using pipx when running kedro new. pipx is designed for system
level cli tools so that you do not need to maintain a virtual environment [3] or
worry about version conflicts, pipx manages the environment for you.
The kedro team does not recommend pipx in their docs as they already feel
like there is a bit of a tool overload for folks that may be less familiar with
pipx run kedro new
I like using pipx as it gives you better control over using a specific
version or always the latest version. When you run what you have on your
system, the version depends on when you last installed or upgraded.
Kedro Ne...
What is Kedro
Kedro is an unopinionated Data Engineering framework that comes with a somewhat
opinionated template. It gives the user a way to build pipelines that
automatically take care of io through the use of abstract DataSets that the
user specifies through Catalog entries. These Catalog entries are loaded,
run through a function, and saved by Nodes. The order that these Nodes are
executed is determined by the Pipeline, which is a DAG. It’s the
runner’s job to manage the execution of the Nodes.
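The moving parts in that description (catalog, nodes, a pipeline ordered as a DAG, and a runner) fit in a few lines of plain Python. This is a toy sketch to show how the ideas connect, not kedro’s implementation; every name in it is made up, and it uses graphlib from the standard library (Python 3.9+) for the ordering.

```python
from graphlib import TopologicalSorter

# The "catalog": dataset name -> data. Real catalogs load and save for you.
catalog = {"raw_numbers": [1, 2, 3, 4]}

# Each "node": (function, input dataset, output dataset).
# They are listed out of order on purpose; the DAG sorts them.
nodes = {
    "make_total": (sum, "doubled", "total"),
    "make_doubled": (lambda xs: [x * 2 for x in xs], "raw_numbers", "doubled"),
}

# Build the DAG: a node depends on whichever node produces its input.
producers = {output: name for name, (_, _, output) in nodes.items()}
graph = {
    name: {producers[inp]} if inp in producers else set()
    for name, (_, inp, _) in nodes.items()
}

# The "runner": execute nodes in dependency order.
order = list(TopologicalSorter(graph).static_order())
for name in order:
    func, inp, output = nodes[name]
    catalog[output] = func(catalog[inp])

print(order)
print(catalog["total"])
```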
https://youtu.be/Wf4rnFsaFFU
---
What is Kedro [1]
This is an updated version of my original what-is-kedro article
---
Hot Take # [2]
If you are doing a series of operations to data with python, especially if you
are using something as supported as pandas, you should be using a framework
that gives you a pipeline as a DAG and abstracts io.
Orchestrators # [3]
Like I said, kedro is unopinionated; it does not determine where or how your data
should be run. The kedro team does support the following ...
How I Kedro
https://youtu.be/bw5_FWDVRpU
Ubuntu # [1]
I recently switched over to using Ubuntu, it works well pretty much out of the
box for me. I am using gnome with a dark theme.
Gnome Terminal # [2]
I am still using the built in default gnome terminal, it just works. It does
all the things that I need it to do. It supports transparency, renders my fonts,
and allows me to highlight things well.
- One Dark Theme
dotfiles # [3]
You can find my
dotfiles [4] on
github. Feel free to read through and take anything that you
find useful. I would encourage you not to steal them, but to
integrate the parts that you want into your own dotfiles.
dotfiles are a very personal thing. They are an extension of
one’s fingertips designed for how you think and type.
zsh # [5]
I use zsh as my default shell. I like to use it as my
interactive shell. It works, and does a bit better with
things like tab completion out of the box.
starship # [6]
I use the starship prompt for my shell. It works well out of
the...
Incremental Versioned Datasets in Kedro
Kedro versioned datasets can be mixed with incremental and partitioned datasets
to do some timeseries analysis on how our dataset changes over time. Kedro is
a very extensible and composable framework, that allows us to build solutions
from the individual components that it provides. This article is a great
example of how you can combine these components in unique ways to achieve some
powerful results with very little work.
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
How does our dataset change over time?? # [2]
This was a question presented to me at work. We had some plots being produced
as the output of our pipeline and the user wanted the ability to compare
results over time. Luckily this was asked early in the project so we were able
to proactively setup versioning on the right datasets.
To enable this all we needed to do now was to add versioned: true and we will
be able to compare results over time. Yes kedro makes it that easy to setup.
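Conceptually, a versioned dataset keeps every save instead of overwriting the last one. A minimal sketch of that idea using only the standard library: each save lands in its own version-named folder, and loading picks the newest. The folder layout, file name, and helper functions are assumptions for illustration, not kedro’s code.

```python
import tempfile
from pathlib import Path


def save_version(root: Path, version: str, text: str) -> Path:
    """Write text under root/<version>/data.txt and return the file path."""
    path = root / version / "data.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


def load_latest(root: Path) -> str:
    """Read the file from the version folder that sorts last."""
    latest = sorted(p for p in root.iterdir() if p.is_dir())[-1]
    return (latest / "data.txt").read_text()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "cars.csv"
    # Timestamp-like version names sort chronologically as plain strings.
    save_version(root, "2021-01-01T00.00.00.000Z", "first run")
    save_version(root, "2021-02-01T00.00.00.000Z", "second run")
    versions = sorted(p.name for p in root.iterdir())
    latest = load_latest(root)

print(versions)
print(latest)
```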
set up a proje...