Posts tagged: kedro

All posts with the tag "kedro"

40 posts latest post 2025-02-05

I Started Streaming on Twitch

I recently started streaming on twitch.tv/waylonwalker [1] and it’s been a blast so far. - python - kedro - Data Science - Data Engineering - webdev - digital gardening Kedro Spaceflights # [2] It all started with kedro/issues/606 [3], where Yetu called out for users of kedro to record themselves doing a walkthrough of the tutorials. I wanted to do this, but was really stuck on the fact that recording or editing somewhat polished video is quite time consuming for me. [4] Inspiration # [5] My introduction to twitch came from twitch.tv/theprimeagen [6]. I watched him on YouTube, and then decided to drop into a stream. It was so fun to watch him live that I started following others in the science and tech category. - twitch.tv/teej_dv [7] Brilliant neovim core dev, I learn a bunch about nvim every time I watch. - twitch.tv/cmgriffing [8] Super Chill and engaging chat. - twitch.tv/cassidoo [9] Fantastic discussion/chat. - twitch.tv/anthonywritescode [10] Building the python ...

Upcoming Stream

I'm no longer streaming. As much as I would really love to make streaming work, it's really hard with my family situation to make large blocks of time work for me. https://stackoverflow.com/questions/16720541/python-string-replace-regular-expression I am starting to stream 3 days per week, before I start work in the morning. These streams will likely be me just talking through things I am already doing. Making DAGs do 🔮Magical Things | Open Source 🐍Python | kedro plugins | # [1] Science & Technology | Every Monday • 7:00 AM - 9:00 AM CDT On Mondays I am going to be working on open source packages/plugins for kedro. - kedro-diff - test kedro-diff on pipelines with history - set up deploy pipeline - deploy to pypi 🌱 Digital Gardening | Blogging with 🐍Python | Building 🔮Markata a static site generator in python for waylonwalker.com # [2] Science & Technology | Every Wednesday • 7:00 AM - 9:00 AM CDT On Wednesday morning I will be working on my personal website and the static s...

Kedro Spaceflights - part 2 | Stream replay June 7, 2021

This was my second time ever streaming on twitch.tv/waylonwalker [1], and I completely botched my mic 2x. https://youtu.be/_7MwgKu-844 Links # [2] - Spaceflights Tutorial [3] - my spaceflights repo [4] Notes to get started # [5] pipx run kedro new cd project python -m venv .venv source .venv/bin/activate pip install kedro kedro install References: [1]: https://twitch.tv/waylonwalker [2]: #links [3]: https://kedro.readthedocs.io/en/stable/03_tutorial/01_spaceflights_tutorial.html [4]: https://github.com/WaylonWalker/kedro-spaceflights [5]: #notes-to-get-started

Kedro Spaceflights - part 1 | Stream replay June 4, 2021

This was my first time ever streaming on twitch.tv/waylonwalker [1]. I am excited to get going. I have been streaming early in the morning while I am still waking up, so still a bit groggy as I go. https://youtu.be/Y07UBr9Ccjs Kedro Spaceflights # [2] It all started with kedro/issues/606 [3], where Yetu called out for users of kedro to record themselves doing a walkthrough of the tutorials. I wanted to do this, but was really stuck on the fact that recording or editing somewhat polished video is quite time consuming for me. [4] Notes # [5] pipx run kedro new cd project python -m venv .venv source .venv/bin/activate pip install kedro kedro install References: [1]: https://twitch.tv/waylonwalker [2]: #kedro-spaceflights [3]: https://github.com/kedro-org/kedro/issues/606 [4]: https://images.waylonwalker.com/kedro-issue-606.png [5]: #notes

Comprehensive guide to creating kedro nodes

The Kedro node is an essential part of the pipeline. It defines what catalog entries get passed in, what function gets run, and the catalog entry to save the results under. does this link work? # [1] https://waylonwalker.com/what-is-kedro/ 👆 Unsure what kedro is? Check out this post. The node function # [2] The node function is the most common and recommended way to define kedro nodes. It is a function that constructs and returns Node objects for you. Creating your first kedro node # [3] from kedro.pipeline import node def identity(df): "a function that returns its input" return df my_first_node = node( func=identity, inputs='raw_cars', outputs='int_cars', tags=['int',] ) function # [4] The func passed into node can be any callable that accepts the inputs you have specified, and returns the output that you specify as your outputs. - any callable - a function you write - a function from a library - class constructor - lambda function - partial function - l...
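The "any callable" point can be tried out without kedro installed. This is only a rough sketch: `apply` is a hypothetical stand-in for whatever calls `func` with its loaded inputs (it is not kedro's API), and the sample data and helper names are invented for illustration.

```python
from functools import partial


def identity(data):
    """A function you write: returns its input unchanged."""
    return data


def head(data, n):
    """Return the first n items; used below to build a partial."""
    return data[:n]


class Report:
    """A class constructor is a callable too."""

    def __init__(self, data):
        self.rows = len(data)


def apply(func, *inputs):
    """Hypothetical stand-in for a runner calling func with its inputs."""
    return func(*inputs)


raw_cars = ["civic", "f150", "model3"]

results = {
    "function": apply(identity, raw_cars),
    "library": apply(sorted, raw_cars),
    "constructor": apply(Report, raw_cars).rows,
    "lambda": apply(lambda data: [car.upper() for car in data], raw_cars),
    "partial": apply(partial(head, n=2), raw_cars),
}
```

Each entry is called the same way, which is the property that lets a node accept any of them as `func`.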

Creating pypi-list with kedro

I had an idea come to me via twitter. Short, one-word package names are becoming hard to find on pypi. Short, one-word, readable package names that are not a play on words are easy to remember, easy to spell correctly, and quick to type out. Simple index # [1] I started with the simple index. Pypi provides a single page listing of every package hosted on pypi via the simple-index [2]. References: [1]: #simple-index [2]: https://pypi.org/simple/
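One way to pull candidate names out of a page shaped like the simple index is sketched below. It runs against a small hand-written sample and makes no network call. The "one word" rule used here (purely alphabetic, at most `max_length` characters) is my own assumption, not necessarily the filter used in the post, and `AnchorText` and `short_names` are hypothetical helper names.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collect the text inside every <a> tag."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.names.append(data.strip())


# Hand-written sample shaped like the simple index, not real output.
SAMPLE = """
<html><body>
<a href="/simple/kedro/">kedro</a>
<a href="/simple/kedro-diff/">kedro-diff</a>
<a href="/simple/find_kedro/">find_kedro</a>
<a href="/simple/markata/">markata</a>
<a href="/simple/pkg2/">pkg2</a>
</body></html>
"""


def short_names(html, max_length=8):
    """Keep names that are purely alphabetic and at most max_length long."""
    parser = AnchorText()
    parser.feed(html)
    return [n for n in parser.names if n.isalpha() and len(n) <= max_length]


names = short_names(SAMPLE)
```

Names with hyphens, underscores, or digits are dropped by `str.isalpha`, which is a crude but cheap proxy for "a single readable word".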

Using Kedro In Scripts

With the latest releases of kedro 0.17.x, it is now possible to run kedro pipelines from within scripts. While I would not start a project with this technique, it will be a good tool to keep in my back pocket when I want to sprinkle in a bit of kedro goodness in existing projects. New to Kedro # [1] What is Kedro [2] If you're just learning about kedro check out this post walking through it No More Rabbit Hole of Errors # [3] as of 0.17.2 I’ve tried to do this in kedro 0.16.x, and it turned into a rabbit hole of errors. First kedro needed a conf directory, if you tried to fake one in it would then ask for logging setup. These errors just kept coming to the point it wasn't worth doing and I might as well use a proper template for real projects and stick to simple function calls for things that are not a kedro project. Kedro in a script # [4] To get kedro running, you will need a pipeline, catalog, and runner at a minimum. Those who have used kedro before the pipeline will look v...

Silence Kedro Logs

Kedro can have a chatty logger. While it is super nice in production to see everything that happened during a pipeline run, it can be troublesome while trying to implement a cli extension with clean output. Silence a Python log # [1] First, how does one silence a python log? Python loggers can be retrieved by the logging module’s getLogger function. Then their log level can be changed. Much of kedro’s chattiness comes from INFO level logs. I don’t want to hear about anything for my current use case unless it’s essential, i.e., a failure. In this case, I set the log levels to ERROR as most errors should stop execution anyways. python logging levels # [2] Level (numeric value): CRITICAL (50), ERROR (40), WARNING (30), INFO (20), DEBUG (10), NOTSET (0) Get or Create a logger # [3] Getting a python logger is straightforward if we know the name of the logger. The following block will grab the logger object for the logger currently registered under the name passed in. logger = logging.getLog...
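The steps described above (get the logger by name, then raise its level) look roughly like this with the standard library alone. The logger names `"kedro"` and `"kedro.io"` are assumptions used for illustration. The point of the sketch is that setting the level on a parent logger also quiets its children, as long as they have no level of their own.

```python
import logging

# "kedro" is assumed to be the parent name of the loggers we want to quiet.
logger = logging.getLogger("kedro")
logger.setLevel(logging.ERROR)

# A child logger with no level of its own inherits the parent's effective level.
child = logging.getLogger("kedro.io")

child.info("this INFO message is filtered out")
```

Setting the level back to `logging.INFO` on the same logger restores the chatty output.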

Vim Fugitive

:G :G status :G commit :G add % :Gdiff :G push :Glog Add current file and commit with diff in a split # [1] function! s:GitAdd() exe "G add %" exe "G diff --staged" exe "only" exe "G commit" endfunction :command! GitAdd :call s:GitAdd() nnoremap gic :GitAdd<CR> :on[ly] # [2] C-W o :on[ly] will make the current buffer the only one on the screen. This is super helpful as many of fugitive's commands will open in a split by default. C-I C-O # [3] cycle through the jumplist This one has nothing to do with fugitive, but is a native vim feature that makes fugitive glorious. Before I realized how to utilize C-i and C-o, I would get completely lost when using fugitive. Digging deep into the log, opening a file from a specific commit, then having no way to get back to where I was in the log. C-i jump :jump[s] # [4] show the jumplist The jumplist is sorted oldest to newest :Telescope jumplist # [5] When navigating the jumplist with :Telescope jumplist, it will add a new entry to the jumpli...

Custom Kedro Logger

DRAFT

formatters:
  mine:
    format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(me)s"
handlers:
  mine_handler:
    class: logging.StreamHandler
    level: INFO
    formatter: mine
    stream: ext://sys.stdout
loggers:
  me:
    level: DEBUG
    handlers: [mine_handler]
root:
  level: INFO
  handlers: [console, info_file_handler, error_file_handler]
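The `me` logger from the draft config can be exercised in plain Python with `logging.config.dictConfig`. This is a sketch under a few assumptions. The `version` key is required by `dictConfig`, and `disable_existing_loggers` is my addition. The `root` section is left out because its handlers (`console`, `info_file_handler`, `error_file_handler`) are not defined in the draft. The value `"waylon"` is invented.

```python
import logging
import logging.config

CONFIG = {
    "version": 1,  # required by dictConfig
    "disable_existing_loggers": False,  # assumption: leave other loggers alone
    "formatters": {
        "mine": {
            "format": "%(asctime)s - %(name)s - %(levelname)s - %(message)s - %(me)s"
        }
    },
    "handlers": {
        "mine_handler": {
            "class": "logging.StreamHandler",
            "level": "INFO",
            "formatter": "mine",
            "stream": "ext://sys.stdout",
        }
    },
    "loggers": {"me": {"level": "DEBUG", "handlers": ["mine_handler"]}},
}

logging.config.dictConfig(CONFIG)
logger = logging.getLogger("me")

# %(me)s is not a standard LogRecord attribute, so each call has to supply it.
logger.info("pipeline finished", extra={"me": "waylon"})
```

Because the format string references `%(me)s`, a call without `extra={"me": ...}` fails to format, so a custom field like this is usually paired with a `LoggerAdapter` or filter that fills it in.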

Kedro pipeline_registry.py

With the release of kedro==0.17.2 came a new module in the project template, pipeline_registry.py. Here are some notes that I learned while playing with this new module. migrating to pipeline_registry.py # [1] - create a src/<package-name>/pipeline_registry.py file - create a register_pipelines function in pipeline_registry.py that mirrors the register_pipelines method from your hooks.py module - do not bring the hook_impl decorator - remove the register_pipelines method on your ProjectHooks class You should now have something that looks like this in your src/<package-name>/pipeline_registry.py. """Project pipelines.""" from typing import Dict from kedro.pipeline import Pipeline def register_pipelines() -> Dict[str, Pipeline]: """Register the project's pipelines. Returns: A mapping from a pipeline name to a ``Pipeline`` object. """ return {"__default__": Pipeline([])} pipeline_registry only works in kedro>=0.17.2 Conflict Resolution # [2] What happens If I register p...

Minimal Kedro Pipeline

How small can a minimal kedro pipeline ready to package be? I made one within 4 files that you can pip install. It’s only a total of 35 lines of python, 8 in setup.py and 27 in mini_kedro_pipeline.py. 📝 Note this is only a composable pipeline, not a full project, it does not contain a catalog or runner. Minimal Kedro Pipeline # [1] I have everything for this post hosted in this github repo [2], you can fork it, clone it, or just follow along. Installation # [3] pip install git+https://github.com/WaylonWalker/mini-kedro-pipeline Caveats # [4] This repo represents the minimal amount of structure to build a kedro pipeline that can be shared across projects. It's installable, and drops right into your hooks.py or run.py modules. It is not a runnable pipeline. At this point I think the config loader requires a logging config file. This is a sharable pipeline that can be used across many different projects. Usage # [5] # hooks.py import mini_kedro_project as mkp class Pro...

Kedro Dependency Management

Docs # [1] https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html?highlight=install pip-tools # [2] pip-compile # [3] requirements # [4] - requirements.in - requirements.txt References: [1]: #docs [2]: #pip-tools [3]: #pip-compile [4]: #requirements

Kedro - My Data Is Not A Table

In python data science/engineering most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask. DataFrames are the heart of most pipelines # [1] These containers for data contain many convenient methods to manipulate table-like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types. What is Kedro [2] unfamiliar with kedro, check out this post Sometimes datasets are not tables # [3] There are times when our data doesn’t fit nicely into a DataFrame. Lucky for us Kedro has pickle support out of the box. Pickle is a way to store any python object to disk. Beware that pickle files coming from an unknown source can run malicious code and are considered unsafe. For the most part though when you read and write your own pickle files they are a good tool to consider. See more about pickle [4] from python.org. Cataloging Pickle # [5] I may have a dictionary ...
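For readers who have not used pickle directly, here is the standard-library round trip on its own. It is not kedro's dataset class, only the kind of save and load that pickle support refers to. The `model_params` dictionary and the file name are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical non-tabular data: a nested dict that does not fit a DataFrame.
model_params = {
    "name": "baseline",
    "weights": [0.1, 0.2, 0.7],
    "meta": {"version": 1},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_params.pkl"

    # Save: serialize the object to bytes and write them to disk.
    path.write_bytes(pickle.dumps(model_params))

    # Load: only unpickle files you wrote yourself or otherwise trust.
    loaded = pickle.loads(path.read_bytes())
```

The loaded object is an equal but separate copy, so a downstream step can modify it without touching the original.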

Testing Data Pipelines

Lint/Format/Doc - black - flake8 - interrogate - mypy Pipeline Assertions - pipeline constructs - pipeline has expected nodes - pipeline has minimum nodes - test minimum tags - test alternate tags Catalog Assertions - test catalog follows naming structure Node Tests - test function does the correct operations on test data Great Expectations
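As one illustration of the "test catalog follows naming structure" item, a naming check can be a plain function over the catalog's entry names. The rule below (`<layer>_<name>` with layers `raw`, `int`, and `pri`) is a hypothetical convention chosen for the sketch, and `badly_named` is an invented helper. In a real project the names would come from the project's catalog rather than a hard-coded list.

```python
import re

# Hypothetical naming rule: <layer>_<name>, for example raw_cars or int_cars.
NAME_PATTERN = re.compile(r"^(raw|int|pri)_[a-z0-9_]+$")


def badly_named(catalog_names):
    """Return the entries that break the naming rule, sorted for stable output."""
    return sorted(name for name in catalog_names if not NAME_PATTERN.match(name))


# Stand-in for the list of dataset names in a catalog.
catalog_names = ["raw_cars", "int_cars", "pri_cars", "Cars", "tmp-cars"]

offenders = badly_named(catalog_names)
```

A test would then assert that `offenders` is empty, and the failure message lists exactly which entries to rename.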

reasons-to-kedro

There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources, you should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate, develop, and deploy data pipelines. What is Kedro [1] Starter Template # [2] Kedro makes it super easy to get started with their cli that utilizes cookiecutter under the hood. conda create -n my-new-project -y python=3.8 kedro new kedro install kedro run Create New Kedro Project [3] read more about how to start your first kedro project here Collaboration # [4] Kedro provides many tools that help teams collaborate on a single codebase. While writing monolithic scripts it can be easy to pin yourself in a corner where it is difficult to have multiple people making changes to the notebook/script at the same time. Kedro helps guide your team to break your project down into small pieces that different members o...

Reasons to Kedro

Reasons to Kedro # [1] - collaboration - Sharable catalog - small nodes over monolithic notebooks - catalog - easily load anything without needing to run - No need to write read/write code - pipeline - No need to keep execution order in your head - easily run a slice of a pipeline - plugins - pip install - make your own - hooks - flexible expandable cli Reasons Not to Kedro # [2] - Already utilizing another DAG framework - Data is not in a widely supported format - Micro short-lived project - Large Project / Deadline - Use a lower profile project to learn first - Team not willing to change - Need minimal dependencies - God Project - kedro owns everything?? References: [1]: #reasons-to-kedro [2]: #reasons-not-to-kedro

What's New in Kedro 0.16.6

Kedro 0.16.6 [1] is out! Let’s take a look through the release notes. Deployment Docs # [2] It is really exciting to see more deployment options coming from the kedro team. It really shows the power of the framework. The power of some of these orchestration options is incredible. - Argo [3] - Prefect [4] - Kubeflow [5] - Batch [6] - SageMaker [7] Most of them hinge on a sweet combination of the kedro cli, docker image, and the pipeline knowing your nodes' dependencies. Argo, Prefect, and Kubeflow have an interesting technique where they translate the pipeline and its dependencies from kedro to their language. Batch uses the aws cli to submit jobs, one node per job, and listens for them to complete. It will submit all nodes with completed dependencies at once, meaning that we can get some massive parallelization. I did a quick and dirty test of one of these by simulating the technique in a bash script and saw a 40 hr pipeline finish in about 1 hour. I am excited to get thi...
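The "submit all nodes with completed dependencies at once" idea can be sketched with the standard library's `graphlib` (Python 3.9+). This is not the deployment code from the kedro docs and it submits nothing to AWS. The node names are hypothetical, and the loop pretends each whole wave finishes before the next begins, which is a simplification of listening for individual jobs.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: node name -> set of nodes it depends on.
dependencies = {
    "raw_cars": set(),
    "raw_trucks": set(),
    "clean_cars": {"raw_cars"},
    "clean_trucks": {"raw_trucks"},
    "merge": {"clean_cars", "clean_trucks"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    # Every node whose dependencies are already done can run in parallel.
    ready = sorted(sorter.get_ready())
    # A real runner would submit one job per node here.
    waves.append(ready)
    # Pretend the whole wave finished, unlocking the next set of nodes.
    sorter.done(*ready)
```

Each inner list in `waves` is a group of nodes with no dependencies on each other, which is where the parallel speedup comes from.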

A brain dump of stories

I started making stories as kind of a brain dump a few times per day and posting them to [LinkedIn](https://www.linkedin.com/in/waylonwalker/). Here are the last 11 days of stories. I store all the stories on my website with the hopes of doing something with them on my own platform eventually. For now it makes it easy to make these posts. cd static/stories ls | xargs -I {} echo '![](https://waylonwalker.com/stories/{})' Stories 10-10-2020 - 10-21-2020 # [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] References: [1]: #stories-10-10-2020---10-21-2020 [2]: https://waylonwalker.com/stories/TIL-kedro-sorts-nodes.png [3]: https://waylonwalker.com/stories/disable-base-pip.png [4]: https://waylonwalker.com/stories/discovered-social-cards.png [5]: https://waylonwalker.com/stories/find-kedro-de1-contributor.png [6]: https://waylonwalker.com/stories/hacktoberfest-2020-kedro-538-tests-pass.png [7]: https://waylonwalk...