Posts tagged: data

All posts with the tag "data"

70 posts latest post 2025-06-09

Kedro Static Viz 0.3.0 is out with Hooks Support

kedro-static-viz 0.3.0 is out with support for the newly released hooks feature. This means you can have kedro-static-viz automatically deploy a full Gatsby site in a before_pipeline_run hook, keeping your visualization always up to date. Even though it is a static site, no functionality is lost; the only thing missing is the Flask server. With kedro-static-viz you can deploy your visualization to a number of static hosting providers, such as GitHub Pages, free of charge and with wicked fast performance ⚡ It’s fast: even though it’s built on GatsbyJS, the full site builds in under 2s, even on slower hardware. This is because the site is already pre-rendered and stripped of any excess. It’s zipped up right into the Python package and is typically used with the CLI, but it can now be used from Python or as a hook as well. What is kedro-viz? 🤔 Kedro-viz is a fantastic Kedro plugin that allows you to visualize your data pipeline. Kedro allows you to quickly build produc...

Brainstorming Kedro Hooks

This post is a 🧠 brainstorming work in progress. I will likely use it as a storage location/brain dump of hook ideas.
What is Kedro 🤔: if you are completely unsure what Kedro is, be sure to check out my what is kedro post.
after_catalog_created:
- filepath replacer
- bucket replacer
before_pipeline_run:
- preflight
- check that data exists
- run kedro_static_viz
- run mypy
- run interrogate
- run flake8
after_pipeline_run:
- Great Expectations
- send email
- send slack
before_node_run:
after_node_run:
- Great Expectations
- save stats/meta data
Execution order: hooks are executed in reverse order of the hooks list. Hooks with tryfirst are moved to the end of the list, so they run first; hooks with trylast are moved to the start of the list, so they run last.
- after_catalog_created
- before_pipeline_run
  - args
    - run_params = {'run_id': '2020-05-23T15.24.23.958Z', 'project_path': '/mnt/c/temp/kedro0160', 'env': 'local', 'kedro_version': '...

Create Custom Kedro Dataset

Kedro provides an efficient way to build out data catalogs with its YAML API. It allows you to be very declarative about loading and saving your data. For the most part you just need to tell Kedro which connector to use and its filepath. When running, Kedro takes care of all of the reading and writing; you just reference the catalog key. But what is happening behind the scenes? Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind-the-scenes structure for us so that we don't have to. Kedro has connectors for just about anything you want to load (csv, parquet, sql, json) from just about anywhere (http, s3, and the local file system are just some examples). Here is a barebones DataSet implementation straight from their docs. Parameters from the YAML catalog will get passed in.
from pathlib import Path
import pandas as pd
from kedro.io import AbstractDataSet
class MyOwnDataSet(...

creating the kedro-preflight hook

Kedro hooks are an exciting upcoming feature of Kedro 0.16.0. They allow you to hook into catalog_created, pipeline_run, and node_run (the nouns), each with a before or after prefix. This really reminds me of React's lifecycle hooks, which let you hook into the various states of React web components. This is going to make Kedro so extendable by the community. I am super pumped to see what the community is able to do with this ability. What is Kedro: if you are completely unsure what Kedro is, be sure to check out my what is kedro post. Docs: a w...

📝 Kedro Preflight Notes

This is a very rough idea for a Kedro package that prevents losing time by getting partway through a pipeline run only to realize that you don't have access to data or resources.
Must haves:
- check that inputs exist or are of a type to skip (sql)
Good to haves:
- check that all input and output databases are accessible with good credentials
- check for s3 bucket access
- check for spark install
Implementation:
@hook_spec
def before_pipeline_run(run_params, pipeline, catalog):
run_params:
{
    "run_id": str,
    "project_path": str,
    "env": str,
    "kedro_version": str,
    "tags": Optional[List[str]],
    "from_nodes": Optional[List[str]],
    "to_nodes": Optional[List[str]],
    "node_names": Optional[List[str]],
    "from_inputs": Optional[List[str]],
    "load_versions": Optional[List[str]],
    "pipeline_name": str,
    "extra_params": Optional[Dict[str, Any]]
}
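The "must have" above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PreflightHooks`, `find_missing_inputs`, the `skip` set, and the two stand-in classes are hypothetical names, and the sketch relies only on the pipeline exposing `inputs()` and the catalog exposing `exists(name)`. In a real Kedro project the hook method would also carry Kedro's hook implementation decorator, which is left out here so the example runs without Kedro installed.

```python
from typing import Any, Dict, Iterable, List, Set


class MissingInputsError(RuntimeError):
    """Raised before the run starts when required inputs are missing."""


def find_missing_inputs(pipeline: Any, catalog: Any, skip: Iterable[str] = ()) -> List[str]:
    """Return the sorted names of pipeline inputs the catalog cannot find.

    Names listed in `skip` (for example SQL datasets) are not checked.
    """
    skipped = set(skip)
    return sorted(
        name
        for name in pipeline.inputs()
        if name not in skipped and not catalog.exists(name)
    )


class PreflightHooks:
    """Hypothetical hook class: fail fast instead of partway through a run."""

    def __init__(self, skip: Iterable[str] = ()):
        self.skip = set(skip)

    # Assumption: in a Kedro project this method would be registered as a hook.
    def before_pipeline_run(self, run_params: Dict[str, Any], pipeline: Any, catalog: Any) -> None:
        missing = find_missing_inputs(pipeline, catalog, self.skip)
        if missing:
            raise MissingInputsError("missing inputs: " + ", ".join(missing))


class FakePipeline:
    """Stand-in for a pipeline: only exposes its free inputs."""

    def __init__(self, inputs: Set[str]):
        self._inputs = set(inputs)

    def inputs(self) -> Set[str]:
        return self._inputs


class FakeCatalog:
    """Stand-in for a catalog: only knows which datasets exist."""

    def __init__(self, existing: Set[str]):
        self._existing = set(existing)

    def exists(self, name: str) -> bool:
        return name in self._existing


pipeline = FakePipeline({"raw_sales", "raw_prices", "sql_orders"})
catalog = FakeCatalog({"raw_sales"})
missing = find_missing_inputs(pipeline, catalog, skip={"sql_orders"})
```

Here `missing` contains only `raw_prices`: `raw_sales` exists and `sql_orders` is skipped. Keeping the check in a plain function makes it easy to test without starting a pipeline run.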

📢 Announcing find-kedro

find-kedro is a small library to enhance your Kedro experience. It looks through your modules to find Kedro pipelines, nodes, and iterables (lists, sets, tuples) of nodes. It then assembles them into a dictionary of pipelines: each module creates a separate pipeline, and __default__ is a combination of all pipelines. This format is compatible with the Kedro _create_pipelines format. Kedro is a ✨ fantastic project that allows for super-fast prototyping of data pipelines while yielding production-ready pipelines. find-kedro enhances this experience by adding pytest-like node/pipeline discovery, eliminating the need to bubble up pipelines through modules. When working on larger pipeline projects, it is advisable to break your project down into different sub-modules, which requires knowledge of building Python libraries and knowing how to import each module correctly. While this is not too difficult, in some cases, it can trip up even the most se...

Create New Kedro Project

This is a quickstart for getting a new Kedro pipeline up and running. After this article you should understand how to get started with Kedro. You can learn more about this Hello World example in the docs.
- 🧹 Install Kedro
- 🛢 Create the example pipeline
- 💨 Run the example
- 📉 Show the pipeline visualization
Create a virtual environment: I use conda to control my virtual environments and will create a new environment called kedro_iris with the following command. Note: the latest compatible version of Python is 3.7. EDIT: as of Kedro 0.16.0, Kedro supports up to 3.8.
conda create -n kedro_iris python=3.8 -y
Activate your conda environment: I try to keep my base environment as clean as possible. I have run into too many issues installing things in the base environment. Almost always it's some dependency that starts causing issues, and it is even harder to work out where it came from because I never installed it in base myself.
source...

What is YOUR Advice for New Data Scientists

- Learn the business
- Learn Git
- Your code does not need to be amazing
- Keep learning
Learn Git: you don't have to start out as a git wizard with the cleanest possible commit history. At first, don't let yourself get too wrapped up in it; the most important part is that you make commits. You will find needs for more advanced stuff later.
git add .
git commit -m "FEAT added new function to calculate revenue by product family"
git push
Get comfortable with this, then learn how to branch, rebase, stash, etc.
Your code does not need to be amazing: get the job done. Keep it in small, bite-size pieces. Write readable function definitions and variable names. You will thank yourself for naming things well later. Readability counts more than performance in most cases of data science. If it gets the job done, try not to worry too much about things like performance. A few extra seconds to clean a dataset or build a model is not worth hours of your time. As you go you will have c...

What is Kedro

This is my original what-is-kedro article; there is a brand new one. Kedro is an open-source data pipeline framework. It provides guardrails to set your project up right from the start without needing to know deeply how to set up your own Python library for data pipelining. It includes great ways to manipulate catalogs and pipelines. This article covers the 10K view of Kedro; future articles will dive deeper into each one. Libraries: currently, Kedro is broken down into 3 different libraries: 💎 kedro, 📉 kedro-viz, and 🏗 kedro-docker. kedro: kedro ...

Kedro

See all of my kedro related posts in [[ tag/kedro ]]. #kedrotips: I am tweeting out most of these snippets as I add them; you can find them all under #kedrotips. 🗣 Heads up: below are some quick snippets/notes for using Kedro to build data pipelines. So far I am just compiling snippets. Eventually I will create several posts on Kedro. These are mostly things that I use in my everyday work with Kedro. Some are a bit more esoteric. Some are helpful when writing production code, and some are more useful for exploration. 📚 Catalog (photo by jesse orrico on Unsplash). CSVLocalDataSet:
import pandas as pd
iris = pd.read_csv('https://raw.githubusercontent.com/kedro-org/kedro/d3218bd89ce8d1148b1f79dfe589065f47037be6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/iris.csv')
iris_data_set = CSVLocalDataSet(filepath="test.csv", load_args=None, save_args={"index": False})
iris_data_set.save(iris)
reloaded_iris = iris_data_se...

Filtering Pandas

query: good for method chaining, i.e. adding more methods or filters without assigning a new variable.
# is
skus.query('AVAILABILITY == " AVAILABLE"')
# is not
skus.query('AVAILABILITY != " AVAILABLE"')
masking: general purpose; this is probably the most common method you see in training/examples.
# is
skus[skus['AVAILABILITY'] == 'AVAILABLE']
# is not (the parentheses matter: ~ binds tighter than ==)
skus[~(skus['AVAILABILITY'] == 'AVAILABLE')]
isin: can match against multiple strings.
# is in
df[df.AVAILABILITY.isin(['AVAILABLE', 'AVL'])]
# is not in
df[~df.AVAILABILITY.isin(['AVAILABLE', 'AVL'])]
contains: good for partial matches.
# contains
df[df.AVAILABILITY.str.contains('AVA')]
# not contains
df[~df.AVAILABILITY.str.contains('AVA')]
Masks: anything that we put inside of square brackets can be set as a variable and then passed in.
service_mask = skus['AVAILABILITY'] == 'AVAILABLE'
name_mask = skus['NAME'] == 'Dell chromebook 11'
Operators:
& - and
~ - not
| - or
AVAILABLE and ...
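To see the filters above side by side, here is a small self-contained sketch. The `skus` rows are made-up sample data (an assumption, not the dataset from the post); the column names follow the snippets above.

```python
import pandas as pd

# Made-up sample data for illustration only.
skus = pd.DataFrame(
    {
        "NAME": ["Dell chromebook 11", "Dell chromebook 11", "HP stream 14", "Acer spin 3"],
        "AVAILABILITY": ["AVAILABLE", "DISCONTINUED", "AVL", "AVAILABLE"],
    }
)

# query: reads well in a method chain
available = skus.query('AVAILABILITY == "AVAILABLE"')

# masking: negate the whole comparison, not the column
not_available = skus[~(skus["AVAILABILITY"] == "AVAILABLE")]

# isin: match any of several exact values
either = skus[skus["AVAILABILITY"].isin(["AVAILABLE", "AVL"])]

# contains: partial string match
partial = skus[skus["AVAILABILITY"].str.contains("AVA")]

# named masks combined with the & and | operators
service_mask = skus["AVAILABILITY"] == "AVAILABLE"
name_mask = skus["NAME"] == "Dell chromebook 11"
available_dells = skus[service_mask & name_mask]
available_or_dell = skus[service_mask | name_mask]
```

Naming the masks keeps the final filter readable, and the same mask can be reused or negated with `~` without repeating the comparison.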

Clean up Your Data Science with Named Tuples

If you are a regular listener of TalkPython or PythonBytes, you have heard Michael Kennedy talk about named tuples many times, but what are they and how do they fit into my data science workflow? Example: as you graduate your scripts into modules and libraries, you might start to notice that you need to pass a lot of data around to all of the functions that you have created. For example, say you are running some analysis utilizing sales, inventory, and pricing data. You may need to calculate total revenue or inventory on hand. You may need to pass these data sets into various models to drive production or pricing based on predicted volumes. Load data: here we set up functions that can load data from the sales database. Assume that we also have similar functions, get_inventory and get_pricing.
from sqlalchemy import create_engine
def get_engine():
    engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
    return engine
def get_sales():
    ''' gets sales history from the sales database '''
    engine = ge...
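As a rough illustration of where this is heading, a named tuple can bundle the three data sets so that functions take one argument instead of three. This is a minimal sketch: the `Data` container, `load_data`, `total_revenue`, and the sample values are all hypothetical, and plain lists and dicts stand in for the database results.

```python
from collections import namedtuple

# Hypothetical container; the field names follow the sales/inventory/pricing example.
Data = namedtuple("Data", ["sales", "inventory", "pricing"])


def load_data():
    """Stand-in loader; real code would call get_sales and friends."""
    sales = [{"sku": "A1", "units": 3}, {"sku": "B2", "units": 5}]
    inventory = {"A1": 10, "B2": 0}
    pricing = {"A1": 2.5, "B2": 4.0}
    return Data(sales=sales, inventory=inventory, pricing=pricing)


def total_revenue(data):
    """Multiply units sold by unit price, looked up by sku."""
    return sum(row["units"] * data.pricing[row["sku"]] for row in data.sales)


def in_stock(data):
    """Return the skus that still have inventory on hand."""
    return [sku for sku, on_hand in data.inventory.items() if on_hand > 0]


data = load_data()
revenue = total_revenue(data)
stocked = in_stock(data)
```

Fields are accessed by name (`data.pricing`), which reads better than positional indexing, and the tuple still unpacks like any other tuple when that is convenient.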

Background Tasks in Python for Data Science

This post is intended as an extension/update of background tasks in python. I started using background the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started. Before we get into it, I want to make a shout out to Kenneth Reitz for making this so easy. Kenneth is a python God for all that he has given to the community in so many w...

Generating Readme Tables From Pandas

I commonly need to paste the first few lines of a dataset into a markdown file. I use two handy packages to do this, tabulate and pyperclip. Let's say I already have a Pandas DataFrame in memory as df. All I need to do to convert the first 5 rows to markdown and copy them to the clipboard is the following.
from tabulate import tabulate
import pyperclip
md = tabulate(df.head(), df.columns, tablefmt='pipe')
pyperclip.copy(md)
This is a super handy snippet that I use a lot. Folks really appreciate it when they can see a sample of the data without opening the entire file.
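To make the result concrete, here is a dependency-free sketch of the kind of pipe-delimited table that ends up on the clipboard. It is not tabulate's implementation and its padding may differ from tabulate's output; `to_markdown_table` and the sample rows are hypothetical.

```python
def to_markdown_table(headers, rows):
    """Build a pipe-delimited markdown table from headers and row tuples."""
    # Convert everything to text so column widths can be measured.
    cells = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]

    def fmt(row):
        # Left-justify each cell to its column width.
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    # The separator row marks the line above it as the header.
    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(cells[0]), separator] + [fmt(r) for r in cells[1:]])


md = to_markdown_table(["sku", "units"], [("A1", 3), ("B2", 12)])
print(md)
```

Pasting `md` into a README renders as a two-column table with a header row.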

Stepping Up My SQL Game

In 2018 I transitioned from a Product Engineering (Mechanical) role to a Data Scientist role. I entered this space with strong subject matter expertise in our products and our data, in munging data in Python, and in data visualization in Python. My SQL skills were lacking, to say the least. I had learned what I needed to know to get data from our relational databases and then used pandas for any further analysis. Just run something like the following and you have data.
SELECT * FROM Table WHERE col_1 = 'col_1_filter'
This technique works great for small data sets that you only need to query once. There is no shame in pulling in a big dataset and munging it in pandas to get some results and make decisions. The problem comes when your dataset becomes too big or you need to run the query on a frequent basis. Doing the aggregations on the server runs much quicker, as it reduces the time spent in io. My longest running steps are currently io related. Reducing these steps have im...
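The difference between pulling everything and aggregating on the server can be sketched with an in-memory SQLite database. The `sales` table, its columns, and the rows are made up for illustration; a real database server would sit across a network, which is where the io savings come from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, units INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", 2, 5.0), ("widget", 1, 5.0), ("gadget", 4, 2.5)],
)

# Approach 1: pull every row, then aggregate on the client.
all_rows = conn.execute("SELECT product, units, price FROM sales").fetchall()
client_totals = {}
for product, units, price in all_rows:
    client_totals[product] = client_totals.get(product, 0.0) + units * price

# Approach 2: let the database aggregate and return one row per product.
server_rows = conn.execute(
    "SELECT product, SUM(units * price) FROM sales GROUP BY product"
).fetchall()
server_totals = dict(server_rows)

conn.close()
```

Both approaches produce the same totals, but the second one transfers one row per product instead of one row per sale, which matters more and more as the table grows.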

background tasks in python

I have tried most of the different methods in the past and found copying and pasting the ThreadPoolExecutor example or the ProcessPoolExecutor example from the standard library documentation to be the most reliable. Since this is often something that I stuff in the back of a utility module of a library, it is not something that I write often enough to be familiar with, which makes it both hard to write and hard to read and debug. If you are looking for a good overview of the different concurrency options, Raymond Hettinger has a great talk about the differences between the various methods, when to use them, and why. Recently a new Python library was released to make running tasks in the background very simple. The background project by Kenneth Reitz is a high-level implementation of Python 3’s ThreadPoolExecutor. I have been playing around with this project over the last week and I will say that this is definitely the simplest way to run background tasks in pyth...
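For reference, the standard library pattern mentioned above looks roughly like this. It is a minimal sketch using only `concurrent.futures`, not the background library's own API; `slow_square` is a hypothetical stand-in for an io-bound task such as a database query or an HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an io-bound task."""
    time.sleep(0.1)
    return x * x


# submit() returns immediately with a Future; the work runs in worker threads.
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(slow_square, n) for n in range(4)]
    # result() blocks until that particular task has finished.
    results = [future.result() for future in futures]

print(results)
```

With four workers the four tasks sleep concurrently rather than one after another, and collecting results from the futures in submission order keeps the output order predictable.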