Posts tagged: data

Kedro Dependency Management

Docs: https://kedro.readthedocs.io/en/stable/04_kedro_project_setup/01_dependencies.html?highlight=install
pip-tools
pip-compile
requirements
- requirements.in
- requirements.txt

Blog Data With Python

Generating an API for a blog is much simpler than one might expect with Python.
- Markdown
- Frontmatter
- Fill in the blanks: fix missing data
- Fast

Kedro - My Data Is Not A Table

In Python data science and engineering, most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask.
DataFrames are the heart of most pipelines
These containers for data have many convenient methods for manipulating table-like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types.
What is Kedro [2]
If you are unfamiliar with kedro, check out this post.
Sometimes datasets are not tables
There are times when our data doesn’t fit nicely into a DataFrame. Lucky for us, Kedro has pickle support out of the box. Pickle is a way to store almost any Python object to disk. Beware that pickle files coming from an unknown source can run malicious code and are considered unsafe. For the most part, though, when you read and write your own pickle files they are a good tool to consider. See more about pickle [4] from python.org.
Cataloging Pickle
I may have a dictionary ...
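As a rough illustration of the pickle idea above, here is a minimal sketch that uses only the standard library `pickle` module rather than Kedro's own dataset classes. The `model_config` dictionary and the helper names `save_pickle` and `load_pickle` are made up for this example.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical non-tabular object: nested settings that do not fit a DataFrame.
model_config = {
    "features": ["age", "income"],
    "thresholds": {"low": 0.2, "high": 0.8},
}


def save_pickle(obj, path):
    """Write a picklable Python object to disk."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Read an object back. Only do this with files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip the object through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_config.pkl"
    save_pickle(model_config, path)
    restored = load_pickle(path)

print(restored == model_config)  # True
```

A catalog entry of the pickle type would take over the read and write steps, so node functions would only ever see the dictionary itself.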

Testing Data Pipelines

Lint/Format/Doc
- black
- flake8
- interrogate
- mypy
Pipeline Assertions
- pipeline constructs
- pipeline has expected nodes
- pipeline has minimum nodes
- test minimum tags
- test alternate tags
Catalog Assertions
- test catalog follows naming structure
Node Tests
- test function does the correct operations on test data
Great Expectations
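A few of the pipeline and catalog assertions listed above could look roughly like the pytest-style sketch below. It does not use Kedro itself: the `Node` dataclass, the `PIPELINE` list, the `CATALOG_NAMES` list, and the `raw_`/`int_`/`pri_` naming convention are all hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a pipeline node: a name plus a set of tags."""
    name: str
    tags: set = field(default_factory=set)


# Hypothetical pipeline and catalog names, used only for illustration.
PIPELINE = [
    Node("clean_sales", {"clean"}),
    Node("join_sales_customers", {"join"}),
    Node("report_sales", {"report"}),
]
CATALOG_NAMES = ["raw_sales", "int_sales_clean", "pri_sales_customers"]

# Assumed naming convention: a layer prefix followed by snake_case.
NAME_PATTERN = re.compile(r"^(raw|int|pri)_[a-z0-9_]+$")


def test_pipeline_has_minimum_nodes():
    assert len(PIPELINE) >= 3


def test_pipeline_has_expected_nodes():
    assert {"clean_sales", "report_sales"} <= {n.name for n in PIPELINE}


def test_minimum_tags():
    all_tags = set().union(*(n.tags for n in PIPELINE))
    assert {"clean", "report"} <= all_tags


def test_catalog_follows_naming_structure():
    bad = [name for name in CATALOG_NAMES if not NAME_PATTERN.match(name)]
    assert bad == []
```

In a real project the same assertions would be pointed at the project's actual pipeline and catalog objects instead of these stand-ins.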

reasons-to-kedro

There are many reasons that you should be using kedro. If you are on a team of Data Scientists/Data Engineers processing DataFrames from many data sources, you should be considering a pipeline framework. Kedro is a great option that provides many benefits for teams to collaborate on, develop, and deploy data pipelines.
What is Kedro [1]
Starter Template
Kedro makes it super easy to get started with its CLI, which utilizes cookiecutter under the hood.
conda create -n my-new-project -y python=3.8
kedro new
kedro install
kedro run
Create New Kedro Project [3]
Read more about how to start your first kedro project here.
Collaboration
Kedro provides many tools that help teams collaborate on a single codebase. While writing monolithic scripts it can be easy to pin yourself in a corner where it is difficult to have multiple people making changes to the notebook/script at the same time. Kedro helps guide your team to break your project down into small pieces that different members o...

Reasons to Kedro

Reasons to Kedro
- collaboration
  - Sharable catalog
  - small nodes over monolithic notebooks
- catalog
  - easily load anything without needing to run
  - No need to write read/write code
- pipeline
  - No need to keep execution order in your head
  - easily run a slice of a pipeline
- plugins
  - pip install
  - make your own
- hooks
- flexible expandable cli
Reasons Not to Kedro
- Already utilizing another DAG framework
- Data is not in a widely supported format
- Micro short-lived project
- Large Project / Deadline
  - Use a lower profile project to learn first
- Team not willing to change
- Need minimal dependencies
- God Project
  - kedro owns everything??

What's New in Kedro 0.16.6

Kedro 0.16.6 [1] is out! Let’s take a look through the release notes.
Deployment Docs
It is really exciting to see more deployment options coming from the kedro team. It really shows the power of the framework. The power of some of these orchestration options is incredible.
- Argo [3]
- Prefect [4]
- Kubeflow [5]
- Batch [6]
- SageMaker [7]
Most of them hinge on a sweet combination of the kedro cli, a docker image, and the pipeline knowing your nodes’ dependencies. Argo, Prefect, and Kubeflow have an interesting technique where they translate the pipeline and its dependencies from kedro to their language. Batch uses the aws cli to submit jobs, one node per job, and listens for them to complete. It will submit all nodes with completed dependencies at once, meaning that we can get some massive parallelization. I did a quick and dirty test of one of these by simulating the technique in a bash script and saw a 40 hr pipeline finish in about 1 hour. I am excited to get thi...
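The "submit every node whose dependencies are complete" idea described for Batch can be sketched in plain Python. This is a simplified illustration, not the approach from the Kedro deployment docs: it groups nodes into waves instead of listening for individual jobs, and the function name `submission_waves` and the node names are hypothetical.

```python
def submission_waves(dependencies):
    """Group nodes into waves. Each wave holds every node whose dependencies
    have all completed, so the whole wave could be submitted at once."""
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    done = set()
    waves = []
    while remaining:
        # A node is ready when all of its dependencies are already done.
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown node in dependencies")
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return waves


# Hypothetical pipeline: node name -> names of the nodes it depends on.
deps = {
    "clean_a": [],
    "clean_b": [],
    "join": ["clean_a", "clean_b"],
    "report": ["join"],
    "plot": ["join"],
}

print(submission_waves(deps))
# [['clean_a', 'clean_b'], ['join'], ['plot', 'report']]
```

The wider each wave is, the more nodes can run in parallel, which is where the large speedup on a wide pipeline would come from.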

A brain dump of stories

I started making stories as kind of a brain dump a few times per day and posting them to [LinkedIn](https://www.linkedin.com/in/waylonwalker/). Here are the last 11 days of stories. I store all the stories on my website in the hope of doing something with them on my own platform eventually. For now it makes it easy to make these posts.
cd static/stories
ls | xargs -I {} echo '![](https://waylonwalker.com/stories/{})'
Stories 10-10-2020 - 10-21-2020 # [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] References: [1]: #stories-10-10-2020---10-21-2020 [2]: https://waylonwalker.com/stories/TIL-kedro-sorts-nodes.png [3]: https://waylonwalker.com/stories/disable-base-pip.png [4]: https://waylonwalker.com/stories/discovered-social-cards.png [5]: https://waylonwalker.com/stories/find-kedro-de1-contributor.png [6]: https://waylonwalker.com/stories/hacktoberfest-2020-kedro-538-tests-pass.png [7]: https://waylonwalk...

Kedro Basics

Learn Kedro in 5 days
Day 0 Setup
- vm
- install
- python
- editor
Day 1
- kedro new
- kedro viz
Day 2
- catalog
- filter catalog
- load data
- fsspec
Day 3
- pipeline
- nodes
Day 4
- filter pipeline
- run partial pipeline
Day 5
- kedro docker
- GitHub Actions
Advanced Kedro
- hooks
- custom datasets
- modular pipelines

What's New in Kedro 0.16.4

If we take a look at the release notes [1], I see one major feature improvement on the list: auto-discovery of hooks.
## Major features and improvements
* Enabled auto-discovery of hooks implementations coming from installed plugins.
This one comes as a bit of a surprise, as it was just casually mentioned in #435 [2].
Think pytest
As mentioned in #435 [2], this is the model that pytest uses. Not all plugins automatically start doing things right out of the box; some require a CLI argument.
Simplicity
It feels a bit crazy that simply installing a package will change the way that your pipeline gets executed. I do like that it requires just a bit less reaching into the framework for the average user. Most folks will be able to write in the catalog and nodes without much change to the rest of the project.
Implementation
Reading through the docs [6], they show us that we can make our hooks automatically register by adding a kedro.hooks endpoint that points to a ...

Kedro Catalog

I am exploring a kedro catalog metadata hook; these are some notes about what I am thinking.
Process
- metadata will be attached to the dataset object under a .metadata attribute
- metadata will be updated after_node_run
- metadata will be empty until a pipeline is run with the hook on
- optionally a function to add metadata will be added
- metadata will be stored in a file next to the filepath
- meta
Problems This Hook Should Solve
- what datasets have a column with sales in the name
- what datasets were updated after last Tuesday
- which pipeline node created this dataset
- how many rows are in this dataset (without reloading all datasets)
Implementation details
- metadata will be attached to each dataset as a dictionary
- list/dict comprehensions can be used to make queries
Metadata to Capture
try pandas method -> try spark -> try dict/list -> none
- column names
- length
- Null count
- created_by node name
Database?
Is there...
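The "try table-like first, then dict/list, else none" fallback and the comprehension-style queries from these notes might look something like the sketch below. It is only one possible reading of the notes, not the hook itself: `collect_metadata`, `catalog_metadata`, and the dataset names are hypothetical, and it uses duck typing so it runs without pandas or spark installed.

```python
def collect_metadata(data, created_by=None):
    """Try table-like attributes first, then dict/list, else leave None."""
    meta = {"columns": None, "length": None, "created_by": created_by}

    columns = getattr(data, "columns", None)
    if columns is not None:
        # pandas-like or spark-like object exposing a .columns attribute
        meta["columns"] = list(columns)
    elif isinstance(data, dict):
        # plain dict: treat the keys as column names
        meta["columns"] = list(data.keys())

    try:
        # For a plain dict this is the number of keys, not the number of rows.
        meta["length"] = len(data)
    except TypeError:
        # Object has no len(); leave the length unknown.
        pass
    return meta


# Hypothetical metadata store keyed by dataset name.
catalog_metadata = {
    "sales_raw": collect_metadata(
        {"sales_usd": [1, 2], "region": ["n", "s"]}, created_by="load_sales"
    ),
    "regions": collect_metadata(["north", "south", "east"], created_by="load_regions"),
}

# Query: which datasets have a column with "sales" in the name?
with_sales = [
    name
    for name, meta in catalog_metadata.items()
    if any("sales" in col for col in (meta["columns"] or []))
]
print(with_sales)  # ['sales_raw']
```

Because the metadata is just a dictionary per dataset, the other questions in the notes (created by which node, how many rows) reduce to similar one-line comprehensions.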

Gracefully adopt kedro, the catalog

Why use kedro catalog?
While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me the full benefit of the catalog comes when you combine it with the pipeline and don’t even touch read/write steps at all. Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place and stop manually writing read/write code, which can be different for each data and storage type. You just don’t need to think about it.
---
- imperative loading style
- organizes your data
- all file locations can be quickly identified
- can be dropped into kedro later
---
“can be dropped into kedro later” Let’s talk a bit more about that.
2 Ways to Gracefully adopt the catalog
How do I get started with the kedro catalog?
- add with the code api
- load from yaml (recommended)
1. Adding to the catalog with the code api
how to use ...

How to find things in your kedro catalog

kedro 0.16.2 just dropped last week with a long-awaited feature… catalog search! I went as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets. Being able to quickly search for what I need is so useful.
The Catalog
The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place makes it so easy to pick your project up and move it to a completely new environment. That sweet imperative loading style saves so much read/write overhead. I can load all my data with a single command whether it’s in amazon s3, google cloud platform, or a local file.
Kick start a toy project
Just like with most of these articles, I am going to create a conda environment so that I don’t break any existing projects and scaffold up a toy project to learn from. conda create -n kedro0162 py...

How Kedro handles your inputs

Passing inputs into kedro is a key concept. How it accepts a single catalog key as input is trivial and easy to understand, but passing a list or dictionary of catalog entries can be a bit confusing.
*args/**kwargs review
Check out this post for a review of how *args and **kwargs work in Python. understanding python *args and **kwargs [2] python args and kwargs [3] article by @_waylonwalker [4]
All Kedro inputs are catalog entries
When kedro runs your pipeline it uses the catalog to imperatively load your data, meaning that you don’t tell kedro how to load your data; you tell it where your data is and what type it is. These catalog entries are like a key-value store. You just need to give the key when setting up a node.
Single Inputs
These are fairly straightforward to understand. In the example below when kedro runs the pipeline it will load the input from the catalog, then pass that input to the func, then save the returned value to the out...
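To make the single, list, and dictionary cases concrete, here is a small plain-Python sketch of how a runner could turn catalog keys into function arguments. It only imitates the behavior described above and is not Kedro's implementation: `run_node`, the dict-based `catalog`, and the example functions are hypothetical.

```python
def run_node(func, inputs, catalog):
    """Load inputs from the catalog and call func with them.

    str  -> func(value)
    list -> func(*values)    (positional, in list order)
    dict -> func(**values)   (keys are the function's argument names)
    """
    if isinstance(inputs, str):
        return func(catalog[inputs])
    if isinstance(inputs, list):
        return func(*[catalog[key] for key in inputs])
    if isinstance(inputs, dict):
        return func(**{arg: catalog[key] for arg, key in inputs.items()})
    raise TypeError("inputs must be a str, list, or dict")


# Hypothetical catalog: a plain dict standing in for loaded datasets.
catalog = {"cars": [1, 2, 3], "trucks": [10, 20]}


def total(cars):
    return sum(cars)


def combine(a, b):
    return a + b


print(run_node(total, "cars", catalog))                          # 6
print(run_node(combine, ["cars", "trucks"], catalog))            # [1, 2, 3, 10, 20]
print(run_node(combine, {"a": "trucks", "b": "cars"}, catalog))  # [10, 20, 1, 2, 3]
```

The dictionary form is the one that tends to confuse: its keys name the function's parameters and its values name catalog entries, so the order in which entries are written no longer matters.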
