Posts tagged: data

All posts with the tag "data"

70 posts, latest post 2025-06-09

Kedro - My Data Is Not A Table

In Python data science and engineering, most of our data takes the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask.

These data containers provide many convenient methods for manipulating table-like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types.

What is Kedro

...

Testing Data Pipelines

Lint/Format/Doc: black, flake8, interrogate, mypy

Pipeline Assertions: pipeline constructs; pipeline has expected nodes; pipeline has minimum nodes; test minimum tags; test alternate tags

Catalog Assertions: test catalog follows naming structure

Node Tests: test function does the correct operations on test data

Great Expectations

reasons-to-kedro

There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources, you should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate, develop, and deploy data pipelines.

What is Kedro

Kedro makes it super easy to get started with its cli, which uses cookiecutter under the hood.

...

Reasons to Kedro

Reasons to Kedro

collaboration: sharable catalog; small nodes over monolithic notebooks

catalog: easily load anything without needing to run; no need to write read/write code

pipeline: no need to keep execution order in your head; easily run a slice of a pipeline

plugins: pip install; make your own

hooks; flexible; expandable cli

Reasons Not to Kedro

Already utilizing another DAG framework

Data is not in a widely supported format

Micro short-lived project

Large Project / Deadline (use a lower profile project to learn first)

Team not willing to change

Need minimal dependencies

God Project - kedro owns everything??

What's New in Kedro 0.16.6

Kedro 0.16.6 is out! Let’s take a look through the release notes.

It is really exciting to see more deployment options coming from the kedro team. It really shows the power of the framework. The power of some of these orchestration options is incredible.

Most of them hinge on a sweet combination of the kedro cli, a docker image, and the pipeline knowing your nodes' dependencies.
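To make "the pipeline knowing your nodes' dependencies" concrete, here is a minimal sketch of how a known dependency graph yields a valid run order. It is not kedro's implementation; the node names are hypothetical and the ordering comes from the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical nodes: each key maps to the set of nodes whose outputs it needs.
dependencies = {
    "clean_sales": {"load_sales"},
    "clean_customers": {"load_customers"},
    "join_tables": {"clean_sales", "clean_customers"},
    "train_model": {"join_tables"},
}


def execution_order(deps):
    """Return one valid run order in which every node follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

An orchestrator that has this ordering can hand each node to its own container or task, which is roughly the idea behind the deployment options mentioned above.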

...

A brain dump of stories

I started making stories as kind of a brain dump a few times per day and posting them to [LinkedIn](https://www.linkedin.com/in/waylonwalker/). Here are the last 11 days of stories.

I store all the stories on my website with the hopes of doing something with them on my own platform eventually. For now it makes it easy to make these posts.

cd static/stories
ls | xargs -I {} echo '![](https://waylonwalker.com/stories/{})'

Stories 10-10-2020 - 10-21-2020

Kedro Catalog

I am exploring a kedro catalog metadata hook; these are some notes about what I am thinking.

try pandas method -> try spark -> try dict/list -> none
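The fallback chain in that note could be sketched roughly as follows. This is only an illustration, not the hook itself: `describe` and the metadata keys are hypothetical, and duck typing stands in for real pandas and spark objects so the sketch runs without either library installed.

```python
from types import SimpleNamespace


def describe(data):
    """Return simple metadata for `data`, or None if no strategy applies."""
    # 1. pandas-style: a DataFrame exposes `shape` and `columns`.
    try:
        rows, _ = data.shape
        return {"kind": "pandas-like", "rows": rows, "columns": list(data.columns)}
    except (AttributeError, TypeError, ValueError):
        pass
    # 2. spark-style: a DataFrame exposes `count()` and `columns`.
    try:
        return {"kind": "spark-like", "rows": data.count(), "columns": list(data.columns)}
    except (AttributeError, TypeError):
        pass
    # 3. vanilla containers.
    if isinstance(data, (dict, list)):
        return {"kind": type(data).__name__, "length": len(data)}
    # 4. nothing matched.
    return None


# Stand-in objects (assumptions for the demo, not real DataFrames).
pandas_like = SimpleNamespace(shape=(2, 3), columns=["a", "b", "c"])
spark_like = SimpleNamespace(count=lambda: 5, columns=["a"])

print(describe(pandas_like))
print(describe(spark_like))
print(describe([1, 2, 3]))
print(describe(42))
```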

Is there an easy way to create a nosql database in memory from a list of dictionaries?

Gracefully adopt kedro, the catalog

While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me the full benefit of the catalog comes when you combine it with the pipeline and don't even touch read/write steps at all.

Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place, and stop manually writing read/write code, which can be different for each data and storage type. You just don’t need to think about it.
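As a rough illustration of why a catalog removes hand-written read/write code, here is a toy stand-in built only on the standard library. It is not kedro's `DataCatalog` API; the `load` and `save` helpers, the dataset names, and the entry layout are all assumptions made for the sketch.

```python
import csv
import json
import tempfile
from pathlib import Path


def _load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def _save_csv(rows, path):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def _load_json(path):
    with open(path) as f:
        return json.load(f)


def _save_json(obj, path):
    with open(path, "w") as f:
        json.dump(obj, f)


# One (loader, saver) pair per storage format.
IO = {"csv": (_load_csv, _save_csv), "json": (_load_json, _save_json)}


def load(name, catalog):
    """Load a dataset by name; the caller never sees format or path."""
    entry = catalog[name]
    loader, _ = IO[entry["type"]]
    return loader(entry["filepath"])


def save(name, data, catalog):
    """Save a dataset by name, creating parent directories as needed."""
    entry = catalog[name]
    path = Path(entry["filepath"])
    path.parent.mkdir(parents=True, exist_ok=True)
    _, saver = IO[entry["type"]]
    saver(data, path)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical catalog entries, kept in a temp directory for the demo.
    catalog = {
        "raw_cars": {"type": "csv", "filepath": f"{tmp}/raw/cars.csv"},
        "model_params": {"type": "json", "filepath": f"{tmp}/params.json"},
    }
    save("raw_cars", [{"make": "saab", "mpg": "24"}], catalog)
    save("model_params", {"depth": 3}, catalog)
    cars = load("raw_cars", catalog)
    params = load("model_params", catalog)

print(cars, params)
```

The calling code only ever mentions dataset names, so moving a file or switching its format means editing one catalog entry rather than every script that touches it.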

“can be dropped into kedro later” Let’s talk a bit more about that

...

How to find things in your kedro catalog

kedro 0.16.2 just dropped last week with a long-awaited feature… catalog search! I went as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets. Being able to quickly search for what I need is so useful.

The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place makes it so easy to pick your project up and move it to a completely new environment. That sweet imperative loading style saves so much read/write overhead. I can load all my data with a single command whether it’s in Amazon S3, Google Cloud Platform, or a local file.
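The idea behind searching a large catalog can be sketched with a plain regular-expression filter over dataset names. This does not reproduce kedro's own search; the `search` helper and the dataset names are hypothetical.

```python
import re

# Hypothetical dataset names from a large project.
dataset_names = [
    "raw_sales",
    "raw_customers",
    "int_sales_cleaned",
    "pri_sales_by_region",
    "model_params",
]


def search(names, pattern):
    """Return the names matching `pattern`, case-insensitively."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


raw_only = search(dataset_names, r"^raw_")
sales = search(dataset_names, r"sales")
print(raw_only)
print(sales)
```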

Just like with most of these articles, I am going to create a conda environment so that I don’t break any existing projects and scaffold up a toy project to learn from.

...

011

Load _ from database into **

1 min

010

load remote _ with **

1 min

009

Combine a directory of _ with **

1 min

004

🔥 #kedrotips use find-kedro to assemble your pipelines

1 min

002

** 0.3.0 just launched with _ support 🎉

1 min