Posts tagged: data

All posts with the tag "data"

70 posts latest post 2025-06-09

Kedro Install

Kedro comes with an install command to install and manage all of your project’s dependencies. https://youtu.be/IWimEs-hHQg cd into your project directory and activate env # [1] You must start by having your kedro project either cloned down from an existing project or created from kedro new. Then activate your environment. Kedro New [2] this post covers kedro new kedro Virtual Environment [3] This post covers creating your virtual environment [4] for kedro install kedro # [5] Make sure you have kedro installed in your current environment, if you don’t already have it. pip install kedro==0.17.4 pip-tools # [6] Kedro uses the pip-tools package under the hood to pin dependencies in a very robust way to ensure that the project will continue to work on everyone’s machine, including production, day in and day out, no matter what happens to the dependencies you have installed. pip-compile # [7] The command that kedro uses from pip-tools is pip-compile. It will look at what yo...

Kedro Git Init

Immediately after kedro new, before you run kedro install or write your first line of code, the first thing you should always do is git init. https://youtu.be/IGba3ytf_6U git init # [2] It’s as simple as these three commands to get started. git init git add . git commit -m init I don’t care if this project is for learning or whether it will ever have a remote; use git. References: [1]: /glossary/git/ [2]: #git-init

Kedro New

https://youtu.be/uqiv5LAiJe0 Kedro new is simply a wrapper around the cookiecutter templating library. The kedro team maintains a ready made template that has everything you need for a kedro project. They also maintain a few kedro starters, which are very similar to the base template. What is Kedro [1] Unsure what kedro is? Check out yesterday’s post on What is Kedro. pipx # [2] I recommend using pipx when running kedro new. pipx is designed for system level cli tools so that you do not need to maintain a virtual environment [3] or worry about version conflicts; pipx manages the environment for you. The kedro team does not recommend pipx in their docs as they already feel like there is a bit of a tool overload for folks that may be less familiar with pipx kedro new I like using pipx as it gives you better control over using a specific version or always the latest version, unlike running what you have on your system, which depends on when you last installed or upgraded. Kedro Ne...

What is Kedro

Kedro is an unopinionated Data Engineering framework that comes with a somewhat opinionated template. It gives the user a way to build pipelines that automatically take care of io through the use of abstract DataSets that the user specifies through Catalog entries. These Catalog entries are loaded, run through a function, and saved by Nodes. The order in which these Nodes are executed is determined by the Pipeline, which is a DAG. It’s the runner’s job to manage the execution of the Nodes. https://youtu.be/Wf4rnFsaFFU --- What is Kedro [1] This is an updated version of my original what-is-kedro article --- Hot Take # [2] If you are doing a series of operations to data with python, especially if you are using something as supported as pandas, you should be using a framework that gives you a pipeline as a DAG and abstracts io. Orchestrators # [3] Like I said, kedro is unopinionated; it does not determine where or how your data should be run. The kedro team does support the following ...
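To make the relationship between catalog entries, nodes, the pipeline, and the runner a little more concrete, here is a minimal sketch in plain Python. It does not use kedro's API at all: `Node`, `run`, and the dict used as a catalog are hypothetical stand-ins, and the ordering relies on the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Callable, NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a node: a function plus dataset names."""

    func: Callable
    inputs: list
    output: str


def run(nodes, catalog):
    """Run nodes in dependency order, loading inputs and saving outputs by name."""
    producers = {n.output: n for n in nodes}
    # A node depends on whichever nodes produce its inputs;
    # raw inputs already in the catalog have no producer.
    graph = {
        n.output: {name for name in n.inputs if name in producers}
        for n in nodes
    }
    for output in TopologicalSorter(graph).static_order():
        n = producers[output]
        args = [catalog[name] for name in n.inputs]  # "load"
        catalog[output] = n.func(*args)  # "save"
    return catalog


# Nodes are listed out of order on purpose; the DAG decides execution order.
nodes = [
    Node(lambda xs: sum(xs) / len(xs), ["clean"], "mean"),
    Node(lambda xs: [x for x in xs if x is not None], ["raw"], "clean"),
]
catalog = run(nodes, {"raw": [1, None, 2, 3]})
print(catalog["mean"])
```

The point of the sketch is only that the functions never touch io themselves: they receive loaded data and return results, and something else decides the order and where results are stored.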

How I Kedro

https://youtu.be/bw5_FWDVRpU Ubuntu # [1] I recently switched over to using Ubuntu, it works well pretty much out of the box for me. I am using gnome with a dark theme. Gnome Terminal # [2] I am still using the built in default gnome terminal, it just works. It does all the things that I need it to do. It supports transparency, renders my fonts, and allows me to highlight things well. - One Dark Theme dotfiles # [3] You can find my dotfiles [4] on github. Feel free to read through and take anything that you find useful. I would encourage you not to steal them, but to integrate the parts that you want into your own dotfiles. dotfiles are a very personal thing. They are an extension of one’s fingertips designed for how you think and type. zsh # [5] I use zsh as my default shell. I like to use it as my interactive shell. It works, and does a bit better with things like tab completion out of the box. starship # [6] I use the starship prompt for my shell. It works well out of the...

Incremental Versioned Datasets in Kedro

Kedro versioned datasets can be mixed with incremental and partitioned datasets to do some timeseries analysis on how our dataset changes over time. Kedro is a very extensible and composable framework that allows us to build solutions from the individual components that it provides. This article is a great example of how you can combine these components in unique ways to achieve some powerful results with very little work. What is Kedro [1] 👆 Unsure what kedro is? Check out this post. How does our dataset change over time?? # [2] This was a question presented to me at work. We had some plots being produced as the output of our pipeline and the user wanted the ability to compare results over time. Luckily this was asked early in the project so we were able to proactively set up versioning on the right datasets. To enable this all we needed to do now was to add versioned: true and we will be able to compare results over time. Yes, kedro makes it that easy to set up. set up a proje...

I Started Streaming on Twitch

I recently started streaming on twitch.tv/waylonwalker [1] and it’s been a blast so far. - python - kedro - Data Science - Data Engineering - webdev - digital gardening Kedro Spaceflights # [2] It all started with kedro/issues/606 [3], Yetu called out for users of kedro to record themselves doing a walk through of their tutorials. I wanted to do this, but was really stuck on the fact that recording or editing somewhat polished video is quite time consuming for me. [4] Inspiration # [5] My introduction to twitch came from twitch.tv/theprimeagen [6]. I watched him on YouTube, and then decided to drop into a stream. It was so fun to watch him live that I started following others in the science and tech category. - twitch.tv/teej_dv [7] Brilliant neovim core dev, I learn a bunch about nvim every time I watch. - twitch.tv/cmgriffing [8] Super Chill and engaging chat. - twitch.tv/cassidoo [9] Fantastic discussion/chat. - twitch.tv/anthonywritescode [10] Building the python ...

Upcoming Stream

I'm no longer streaming. As much as I would really love to make streaming work, it's really hard for my family situation to make large blocks of time work for me. https://stackoverflow.com/questions/16720541/python-string-replace-regular-expression I am starting to stream 3 days per week, before I start work in the morning. These streams will likely be me just talking through things I am already doing. Making DAGs do 🔮Magical Things | Open Source 🐍Python | kedro plugins | # [1] Science & Technology | Every Monday • 7:00 AM - 9:00 AM CDT On Mondays I am going to be working on open source packages/plugins for kedro. - kedro-diff - test kedro-diff on pipelines with history - set up deploy pipeline - deploy to pypi 🌱 Digital Gardening | Blogging with 🐍Python | Building 🔮Markata a static site generator in python for waylonwalker.com # [2] Science & Technology | Every Wednesday • 7:00 AM - 9:00 AM CDT On Wednesday morning I will be working on my personal website and the static s...

Kedro Spaceflights - part 2 | Stream replay June 7, 2021

This was my second time ever streaming on twitch.tv/waylonwalker [1], and I completely botched my mic 2x. https://youtu.be/_7MwgKu-844 Links # [2] - Spaceflights Tutorial [3] - my spaceflights repo [4] Notes to get started # [5] pipx run kedro new cd project python -m venv .venv source .venv/bin/activate pip install kedro kedro install References: [1]: https://twitch.tv/waylonwalker [2]: #links [3]: https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html [4]: https://github.com/WaylonWalker/kedro-spaceflights [5]: #notes-to-get-started

Kedro Spaceflights - part 1 | Stream replay June 4, 2021

This was my first time ever streaming on twitch.tv/waylonwalker [1]. I am excited to get going. I have been streaming early in the morning while I am still waking up, so still a bit groggy as I go. https://youtu.be/Y07UBr9Ccjs Kedro Spaceflights # [2] It all started with kedro/issues/606 [3], Yetu called out for users of kedro to record themselves doing a walk through of their tutorials. I wanted to do this, but was really stuck on the fact that recording or editing somewhat polished video is quite time consuming for me. [4] Notes # [5] pipx run kedro new cd project python -m venv .venv source .venv/bin/activate pip install kedro kedro install References: [1]: https://twitch.tv/waylonwalker [2]: #kedro-spaceflights [3]: https://github.com/kedro-org/kedro/issues/606 [4]: https://images.waylonwalker.com/kedro-issue-606.png [5]: #notes

Comprehensive guide to creating kedro nodes

The Kedro node is an essential part of the pipeline. It defines what catalog entries get passed in, what function gets run, and the catalog entry to save the results under. does this link work? # [1] https://waylonwalker.com/what-is-kedro/ 👆 Unsure what kedro is? Check out this post. The node function # [2] The node function is the most common and recommended way to define kedro nodes. It is a function that constructs and returns Node objects for you. Creating your first kedro node # [3] from kedro.pipeline import node def identity(df): "a function that returns its input" return df my_first_node = node( func=identity, inputs='raw_cars', outputs='int_cars', tags=['int',] ) function # [4] The func passed into node can be any callable that accepts the inputs you have specified, and returns the correct output that you specify as your output. - any callable - a function you write - a function from a library - class constructor - lambda function - partial function - l...
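As a rough illustration of the "any callable" list above, the sketch below collects one example of each kind and calls them all the same way. It deliberately avoids importing kedro; the `candidates` mapping and the sample `rows` are made up for the example, and the only claim is that each of these is a callable that accepts one input and returns one output.

```python
from functools import partial


def identity(rows):
    """A function you write: returns its input unchanged."""
    return rows


# Each value is a different kind of callable that takes one input
# and returns one output (all names here are illustrative).
candidates = {
    "own function": identity,
    "library function": sorted,
    "class constructor": tuple,
    "lambda": lambda rows: rows[:2],
    "partial": partial(sorted, reverse=True),
}

rows = [3, 1, 2]
results = {kind: func(rows) for kind, func in candidates.items()}
for kind, value in results.items():
    print(f"{kind}: {value!r}")
```

Because every entry has the same calling convention, any of them could stand in wherever a single-input, single-output function is expected.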

Creating pypi-list with kedro

I had an idea come to me via twitter. Short one word name packages are becoming hard to find on pypi. Short one word readable package names that are not a play on words are easy to remember, easy to spell correctly, and quick to type out. Simple index # [1] I started with the simple index. Pypi provides a single page listing to every single package hosted on pypi via the simple-index [2] References: [1]: #simple-index [2]: https://pypi.org/simple/
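A minimal sketch of the first step, in plain Python rather than as a kedro pipeline: parse the anchor text out of the simple index page and keep short, single-word names. The filter rule (letters only, at most five characters) and the helper names are assumptions for illustration, not the post's actual criteria, and "readable" is not something this sketch tries to judge.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

SIMPLE_INDEX_URL = "https://pypi.org/simple/"


class _AnchorText(HTMLParser):
    """Collect the text of every <a> element on the page."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


def parse_simple_index(html):
    """Return the package names listed in a simple-index style HTML page."""
    parser = _AnchorText()
    parser.feed(html)
    parser.close()
    return parser.names


def short_single_word(names, max_length=5):
    """Assumed filter: letters only (no -, _, . or digits), at most max_length chars."""
    return [n for n in names if n.isalpha() and len(n) <= max_length]


def fetch_simple_index(url=SIMPLE_INDEX_URL):
    """Download the index page. It is large, so nothing calls this automatically."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


# A tiny hand-written sample in the same shape as the real page.
sample = (
    '<a href="/simple/kedro/">kedro</a>'
    '<a href="/simple/pip-tools/">pip-tools</a>'
    '<a href="/simple/rich/">rich</a>'
)
print(short_single_word(parse_simple_index(sample)))  # ['kedro', 'rich']
```

To run it against the real index, pass `fetch_simple_index()` to `parse_simple_index` instead of `sample`.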

Using Kedro In Scripts

With the latest releases of kedro 0.17.x, it is now possible to run kedro pipelines from within scripts. While I would not start a project with this technique, it will be a good tool to keep in my back pocket when I want to sprinkle in a bit of kedro goodness in existing projects. New to Kedro # [1] What is Kedro [2] If you’re just learning about kedro check out this post walking through it No More Rabbit Hole of Errors # [3] as of 0.17.2 I’ve tried to do this in kedro 0.16.x, and it turned into a rabbit hole of errors. First kedro needed a conf directory; if you tried to fake one, it would then ask for logging setup. These errors just kept coming to the point it wasn’t worth doing and I might as well use a proper template for real projects and stick to simple function calls for things that are not a kedro project. Kedro in a script # [4] To get kedro running, you will need a pipeline, catalog, and runner at a minimum. Those who have used kedro before the pipeline will look v...

Silence Kedro Logs

Kedro can have a chatty logger. While it is super nice in production to see everything that happened during a pipeline run, it can be troublesome while trying to implement a cli extension with clean output. Silence a Python log # [1] First, how does one silence a python log? Python loggers can be retrieved by the logging module’s getLogger function. Then their log level can be changed. Much of kedro’s chattiness comes from INFO level logs. I don’t want to hear about anything for my current use case unless it’s essential, i.e., a failure. In this case, I set the log levels to ERROR as most errors should stop execution anyways. python logging levels # [2] Level / Numeric value: CRITICAL 50, ERROR 40, WARNING 30, INFO 20, DEBUG 10, NOTSET 0 Get or Create a logger # [3] Getting a python logger is straightforward if we know the name of the logger. The following block will grab the logger object for the logger currently registered under the name passed in. logger = logging.getLog...
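The approach described above can be sketched with nothing but the standard library. The `silence` helper is a hypothetical name, and the sketch assumes kedro's loggers live under the `kedro` namespace (for example `kedro.io`), so that raising the level on the parent is enough for children that have no level of their own.

```python
import logging


def silence(name, level=logging.ERROR):
    """Raise the threshold of the named logger and return it."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Assumption: "kedro" is the parent of kedro's module loggers, so children
# without their own level inherit the ERROR threshold from it.
kedro_logger = silence("kedro")
child = logging.getLogger("kedro.io")

print(child.getEffectiveLevel() == logging.ERROR)  # True
print(child.isEnabledFor(logging.INFO))  # False
```

Setting the level to ERROR (40) drops INFO (20) and WARNING (30) records while still letting failures through.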

Vim Fugitive

:G :G status :G commit :G add % :Gdiff :G push :Glog Add current file and commit with diff in a split # [1] function! s:GitAdd() exe "G add %" exe "G diff --staged" exe "only" exe "G commit" endfunction :command! GitAdd :call s:GitAdd() nnoremap gic :GitAdd<CR> :on[ly] # [2] C-W o :on[ly] will make the current buffer the only one on the screen. This is super helpful as many of fugitive’s commands will open in a split by default. C-I C-O # [3] cycle through the jumplist This one has nothing to do with fugitive, but is a native vim feature that makes fugitive glorious. Before I realized how to utilize C-i and C-o, I would get completely lost when using fugitive. Digging deep into the log, opening a file from a specific commit, then no way to get back where I was in the log. C-i jump :jump[s] # [4] show the jumplist The jumplist is sorted oldest to newest :Telescope jumplist # [5] When navigating the jumplist with :Telescope jumplist, it will add a new entry to the jumpli...

Custom Kedro Logger

DRAFT - formatters: mine: format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(me)s" handlers: mine_handler: class: logging.StreamHandler level: INFO formatter: mine stream: ext://sys.stdout loggers: me: level: DEBUG handlers: [mine_handler] root: level: INFO handlers: [console, info_file_handler, error_file_handler]
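One detail worth spelling out about the draft config above: `%(me)s` is not a built-in log record attribute, so it has to be supplied on each call. The sketch below is a Python `dictConfig` equivalent of the formatter, handler, and `me` logger from the fragment. It leaves out the `root` section because the handlers it references are not shown in the draft, and the `version` and `disable_existing_loggers` keys, the message text, and the value passed for `me` are additions for the example.

```python
import logging
import logging.config

# Python equivalent of the YAML fragment; `version: 1` is required by dictConfig.
LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "mine": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(me)s"
        }
    },
    "handlers": {
        "mine_handler": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "mine",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {"me": {"level": "DEBUG", "handlers": ["mine_handler"]}},
}

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger("me")

# %(me)s is a custom field, so every call must provide it through `extra`.
logger.info("pipeline finished", extra={"me": "waylon"})
```

A call that omits `extra={"me": ...}` cannot be formatted, and the handler reports a logging error on stderr instead of printing the message.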

Zev Averbach Interview

Zev Averbach, Frustrated spreadsheet jockey to software developer at 36 Q: Tell me about your journey as a spreadsheet jockey into Data Engineering? A: First of all, it’s hilarious that I accidentally found your questions for this interview by Googling myself. 😊 I’ve always been a frustrated software user, and that frustration led me to be a “power user” (keyboard shortcuts etc) of my most used applications, as well as a “visual coder” using desktop automation like Alfred and Keyboard Maestro (Mac). Now that I’ve met data analysts and finance people that use Excel all day, I don’t think I’d claim to have been a true “spreadsheet jockey” in comparison to them. However, hitting up against the limitations of spreadsheets for running my transcription business [1] – specifically for bookkeeping – created a new frustration for me: As the business grew I was spending more and more time copying entries from Google Sheets to the Freshbooks web app for invoicing purposes. I tried to auto...

Kedro pipeline_registry.py

With the release of kedro==0.17.2 came a new module in the project template, pipeline_registry.py. Here are some notes that I learned while playing with this new module. migrating to pipeline_registry.py # [1] - create a src/<package-name>/pipeline_registry.py file - create a register_pipelines function in pipeline_registry.py that mirrors the register_pipelines method from your hooks.py module - do not bring the hook_impl decorator - remove the register_pipelines method on your ProjectHooks class You should now have something that looks like this in your src/<package-name>/pipeline_registry.py. """Project pipelines.""" from typing import Dict from kedro.pipeline import Pipeline def register_pipelines() -> Dict[str, Pipeline]: """Register the project's pipelines. Returns: A mapping from a pipeline name to a ``Pipeline`` object. """ return {"__default__": Pipeline([])} pipeline_registry only works in kedro>=0.17.2 Conflict Resolution # [2] What happens If I register p...

Minimal Kedro Pipeline

How small can a minimum kedro pipeline ready to package be? I made one within 4 files that you can pip install. It’s only a total of 35 lines of python, 8 in setup.py and 27 in mini_kedro_pipeline.py. 📝 Note this is only a composable pipeline, not a full project, it does not contain a catalog or runner. Minimal Kedro Pipeline # [1] I have everything for this post hosted in this github repo [2], you can fork it, clone it, or just follow along. Installation # [3] pip install git+https://github.com/WaylonWalker/mini-kedro-pipeline Caveats # [4] This repo represents the minimal amount of structure to build a kedro pipeline that can be shared across projects. It’s installable, and drops right into your hooks.py or run.py modules. It is not a runnable pipeline. At this point I think the config loader requires a logging config file. This is a sharable pipeline that can be used across many different projects. Usage # [5] # hooks.py import mini_kedro_pipeline as mkp class Pro...