Posts tagged: data

All posts with the tag "data"

70 posts latest post 2025-06-09
Publishing rhythm
Jun 2025 | 1 post
sqlite-utils now supports plugins sqlite-utils 3.34 is out with a major new feature: support for plugins. sqlite-utils is my combination Python library and command-line tool for manipulating SQLite databases. It recently celebrated... Simon Willison’s Weblog · simonwillison.net [1] As the title states, sqlite-utils now supports plugins. I dug in just a bit, and Simon implemented this completely with entry points, no framework or library at all. Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://simonwillison.net/2023/Jul/24/sqlite-utils-plugins/ [2]: /thoughts/
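For anyone who has not used entry points before, here is a minimal sketch of how a tool can discover plugins through them using only the standard library. This is a generic illustration, not a copy of how sqlite-utils does it; the function name `discover_plugins` and the group name `my_tool` are hypothetical.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {entry point name: loaded object} for one entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a .select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict of group name -> entry points.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


if __name__ == "__main__":
    # "my_tool" is a made-up group name; an installed plugin package would
    # register itself under that group in its packaging metadata.
    for name, plugin in discover_plugins("my_tool").items():
        print(name, plugin)
```

A plugin package only has to declare an entry point in the agreed group; the host tool never needs to know the plugin's name in advance.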
External Link duckdb.org [1] Harlequin is a pretty sweet example of what textual can be used to create. It's a terminal-based SQL IDE for DuckDB. Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://duckdb.org/docs/guides/sql_editors/harlequin [2]: /thoughts/
[1] To persist data in duckdb you need to first make a connection to a DuckDB database. con = duckdb.connect('file.db') Then work off of the connection con rather than duckdb. con.sql('CREATE TABLE test(i INTEGER)') con.sql('INSERT INTO test VALUES (42)') # query the table con.table('test').show() # explicitly close the connection con.close() Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://duckdb.org/docs/api/python/overview.html [2]: /thoughts/
Redirecting… duckdb.org [1] duckdb can just query any pandas dataframe that is in memory. I tried running it against a list of objects and got this error. Great error message that gives me supported types right in the message. Make sure that "posts" is either a pandas.DataFrame, duckdb.DuckDBPyRelation, pyarrow Table, Dataset, RecordBatchReader, Scanner, or NumPy ndarrays with supported format Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://duckdb.org/docs/guides/python/sql_on_pandas [2]: /thoughts/
Full-text search - Datasette documentation docs.datasette.io [1] Enable full-text search in sqlite using sqlite-utils. $ sqlite-utils enable-fts mydatabase.db items name description Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://docs.datasette.io/en/latest/full_text_search.html#enabling-full-text-search-for-a-sqlite-table [2]: /thoughts/
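To get a feel for what full-text search buys you, here is a rough standalone sketch using Python's built-in sqlite3 module. It assumes your SQLite build includes the FTS5 extension, and it is not what `enable-fts` does internally; the table name `items_fts`, the helper names, and the sample rows are made up for illustration.

```python
import sqlite3


def build_search_index(rows):
    """Create an in-memory items table plus an FTS5 index over it."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE items (name TEXT, description TEXT)")
    con.executemany("INSERT INTO items VALUES (?, ?)", rows)
    # A standalone FTS5 table holding a searchable copy of both columns.
    con.execute("CREATE VIRTUAL TABLE items_fts USING fts5(name, description)")
    con.execute(
        "INSERT INTO items_fts (rowid, name, description) "
        "SELECT rowid, name, description FROM items"
    )
    return con


def search(con, query):
    """Return the names of items matching the full-text query, best first."""
    sql = "SELECT name FROM items_fts WHERE items_fts MATCH ? ORDER BY rank"
    return [name for (name,) in con.execute(sql, (query,))]


if __name__ == "__main__":
    con = build_search_index(
        [("Cleo", "a friendly dog"), ("Pancakes", "a sleepy cat")]
    )
    print(search(con, "dog"))
```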
sqlite-utils command-line tool - sqlite-utils sqlite-utils.datasette.io [1] I want to like jq, but I think Simon is selling me on SQLite. Maybe it's just me, but this looks readable, hackable, editable, memorizable. Every time I try jq, it's 5 minutes of fussing with it just to get the most basic thing to work. I know enough SQL out of the gate to make this work off the top of my head. curl https://thoughts.waylonwalker.com/posts/ | sqlite-utils memory - 'select title, message from stdin where stdin.tags like "%python%"' | jq Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database [2]: /thoughts/
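The same idea, loading JSON records into a throwaway in-memory table and querying them with SQL, can be sketched with nothing but the standard library. This is only an approximation of what `sqlite-utils memory` offers, assuming the input is a JSON array of flat objects; `query_json` and the sample posts are hypothetical.

```python
import json
import sqlite3


def query_json(json_text, sql, table="stdin"):
    """Load a JSON array of flat objects into memory and run SQL against it."""
    records = json.loads(json_text)
    # Use the union of all keys so records with missing keys still fit.
    columns = sorted({key for record in records for key in record})
    con = sqlite3.connect(":memory:")
    con.row_factory = sqlite3.Row
    column_defs = ", ".join(f'"{column}"' for column in columns)
    con.execute(f'CREATE TABLE "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in columns)
    con.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(record.get(column) for column in columns) for record in records],
    )
    return [dict(row) for row in con.execute(sql)]


if __name__ == "__main__":
    posts = json.dumps(
        [
            {"title": "A", "message": "m1", "tags": "python,data"},
            {"title": "B", "message": "m2", "tags": "vim"},
        ]
    )
    print(query_json(posts, "select title from stdin where tags like '%python%'"))
```

Nested values such as lists would need to be serialized before inserting; this sketch skips that.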
sqlite-utils command-line tool - sqlite-utils sqlite-utils.datasette.io [1] Insert JSON directly into SQLite with sqlite-utils. echo '{"name": "Cleo", "age": 4}' | sqlite-utils insert dogs.db dogs - Note This post is a thought [2]. It’s a short note that I make about someone else’s content online #thoughts References: [1]: https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-json-data [2]: /thoughts/
Kedro rich is a very new and unstable (it’s good, just not ready) plugin for kedro to make the command line prettier. Install kedro rich # [1] There is no pypi package yet, but it’s on github. You can pip install it with the git [2] url. pip install git+https://github.com/datajoely/kedro-rich Kedro run # [3] You can run your pipeline just as you normally would, except you get progress bars and pretty prints. kedro run [4] Kedro catalog # [5] Listing out catalog entries from the command line now prints out a nice pretty table. kedro catalog list [6] Give it a star # [7] Go to the GitHub repo [8] and give it a star; Joel deserves it. References: [1]: #install-kedro-rich [2]: /glossary/git/ [3]: #kedro-run [4]: https://images.waylonwalker.com/kedro-rich-run.png [5]: #kedro-catalog [6]: https://images.waylonwalker.com/kedro-rich-catalog-list.png [7]: #give-it-a-star [8]: https://github.com/datajoely/kedro-rich
I keep my nodes short and sweet. They do one thing and do it well. I turn almost every DataFrame transformation into its own node. It makes it much easier to pull catalog entries than to fire up the pipeline, run it, and start a debugger. For this reason many of my nodes can be built from inline lambdas. Examples # [1] Here are two examples. The first one, lambda x: x, is sometimes referred to as an identity function. This is super common to use in the early phases of a project. It lets you follow standard layering conventions without skipping a layer or overthinking whether you should have the layer or not, and it leaves a good placeholder to fill in later when you need it. Many times I just want to get the data in as fast as possible, learn about it, then go back and tidy it up. from kedro.pipeline import node my_first_node = node( func=lambda x: x, inputs='raw_cars', outputs='int_cars', tags=['int',] ) my_second_node = node( func=lambda cars: cars[['mpg', 'cyl', 'disp',]].query('disp>200'), inputs='raw_cars', outputs='int_cars', tags=['pri',] ) Note: try not to take the idea of a one liner too far. If your one line function wraps several lines down it probably deserv...
As you work on your kedro projects you are bound to need to add more dependencies to the project eventually. Kedro uses a fantastic command pip-compile under the hood to ensure that everyone is on the same version of packages at all times, and able to easily upgrade them. It might be a bit of a different workflow than what you have seen, so let’s take a look at it. git status # [2] Before you start mucking around with any changes to dependencies make sure that your git status is clean. I’d even recommend starting a new branch for this, and if you are working on a team potentially submit this as its own PR for clarity. git status git checkout main git checkout -b add-rich-dependency requirements.in # [3] New requirements get added to a requirements.in file. If you need to specify an exact version, or a minimum version you can do that, but if all versions generally work you can leave it open. # requirements.in rich Here I added the popular rich package to my requirements.in file. Since I am ok with the latest version I am not going to pin anything, I am going to let the pip resolver pick the latest version that does not conflict with any of my dependencies for me. build-reqs # [4] ...
I am a huge believer in practicing your craft. Professional athletes spend most of their time honing their skills and making themselves better. In engineering, many spend nearly zero time practicing. I am not saying that you need to spend all your free time practicing, but a few minutes trying new things can go a long way in how you understand what you are doing and make a huge impact on your long term productivity. What is Kedro [1] Start practicing # [2] practice building pipelines with #kedro today Go to your playground directory, and if you don’t have one, make one. cd ~/playground get pipx # [3] Install pipx in your system python. This is one of the very few, and possibly the only python library that deserves to be installed in your system directory, primarily because it's used to sandbox CLIs in their own virtual environment [4] automatically for you. pip install pipx make a new project # [5] From inside your playground directory, start your new kedro project. This is quite simple and painless. So much so that if you mess this one up doing something wild, it might be easier to make a new one than fixing the wild one. pipx run kedro new # answer the questions it asks I u...
I just installed a brand new Ubuntu 21.10 Impish Indri, and wanted a kedro project to play with so I did what any good kedroid would do, I went to my command line and ran pipx run kedro new --starter spaceflights But what I got back was not what I expected! Fatal error from pip prevented installation. Full pip output in file: /home/walkers/.local/pipx/logs/cmd_2022-01-01_20.42.16_pip_errors.log Some possibly relevant errors from pip install: ERROR: Could not find a version that satisfies the requirement kedro (from versions: none) ERROR: No matching distribution found for kedro Error installing kedro. This is weird. Why can't I run kedro new with pipx? Let's try pip. pip install kedro Same issue. ERROR: Could not find a version that satisfies the requirement kedro (from versions: none) ERROR: No matching distribution found for kedro What is Kedro [1] Curious what kedro is? Check out this article. What’s up # [2] wrong python version The issue is that kedro only runs on up to python 3.8, and on Ubuntu 21.10 when you apt install python3 you get python 3.9 and the standard repos don’t have an old enough version to run kedro. How to fix this? # [3] There's a couple of wa...
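Before blaming pip, it can help to check which interpreter you are actually on. Here is a small sketch of that check. The `(3, 8)` ceiling is taken from this post and applied to kedro as it was at the time; other releases will have different bounds, and `python_is_supported` is a made-up helper name.

```python
import sys


def python_is_supported(version_info=None, max_supported=(3, 8)):
    """True when the interpreter's major.minor is at or below max_supported."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) <= tuple(max_supported)


if __name__ == "__main__":
    # The ceiling here is an assumption copied from the post, not a kedro API.
    if not python_is_supported():
        running = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python {running} is newer than the assumed 3.8 ceiling")
```

"from versions: none" in the pip output is the hint: pip found the package but no release that accepts the running Python version.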

kedro catalog create

I use kedro catalog create to boost my productivity by automatically generating yaml catalog entries for me. It will create new yaml files for each pipeline, fill in missing catalog entries, and respect already existing catalog entries. It will reformat the file, and sort it based on catalog key. https://youtu.be/_22ELT4kja4 What is Kedro [1] 👆 Unsure what kedro is? Check out this post. Running Kedro Catalog Create # [2] This command ensures there are catalog entries for every dataset in the pipeline you pass in. kedro catalog create --pipeline history_nodes - Creates a new yaml file, if needed - Fills in new dataset entries with the default dataset - Keeps existing datasets untouched - it will reformat your yaml file a bit - default sorting will be applied - empty newlines will be removed CONF_ROOT # [3] Kedro will respect your CONF_ROOT settings when it creates a new catalog file, or looks for existing catalog files. You can change the location of your configuration f...
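The fill-in behavior described above (add missing entries, leave existing ones alone, sort by key) can be sketched with plain dictionaries. This is not kedro's implementation; `fill_catalog`, the sample entries, and the `MemoryDataSet` placeholder used as the default entry are all assumptions for illustration.

```python
def fill_catalog(existing, dataset_names, default_entry=None):
    """Add a default entry for each missing dataset and sort by catalog key."""
    if default_entry is None:
        # Assumed placeholder for "the default dataset"; adjust as needed.
        default_entry = {"type": "MemoryDataSet"}
    catalog = dict(existing)
    for name in dataset_names:
        # setdefault leaves entries that already exist untouched.
        catalog.setdefault(name, dict(default_entry))
    return dict(sorted(catalog.items()))


if __name__ == "__main__":
    existing = {
        "raw_cars": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/cars.csv"}
    }
    for key, entry in fill_catalog(existing, ["raw_cars", "int_cars"]).items():
        print(key, entry)
```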

nvim conf 2021 | IDE's are slow | Waylon Walker

https://youtu.be/E18m4KkJUnI --- Slides 👇 # [1] welcome # [2] Other possible titles # [3] - Using Vim as a Team Lead - I 💜 Tmux - Why I stopped using @code - Get there fast - How I vim It’s ok # [4] Use a graphical IDE if it works for you. Trick it out # [5] vim is so well integrated into the terminal, take advantage It wasn’t working for me anymore # [6] dozens of instances # [7] As a team lead I bounce between a dozen projects per day https://pbs.twimg.com/media/FAEmRjYUcAUk2eR?format=jpg&name=large [8] Move With Intent # [9] Running vim inside tmux lets me move swiftly to the exact project I need. https://twitter.com/_WaylonWalker/status/1438849269407047686/photo/1 Hub and Spoke # [11] - direct link to specific projects - fuzzy into all projects - fuzzy into open projects How I navigate tmux in 2021 [12]#hub-and-spoke Other Things That Make this Possible # [13] - tmux - direnv vim adjacent things yes, vim is ugly, make it your...

Kedro-Broken-Urls

Broken Urls # [1] - https://github.com/josephhaaga) [ ] https://example.com/file.h5 - https://raw.githubusercontent.com/kedro-org/kedro/develop/static/img/pipeline_visualisation.png - https://example.com/file.txt - https://github.com/jmespath/jmespath.py. - https://github.com/tsanikgr) - https://example.com/file.csv - https://kedro.readthedocs.io/en/latest/04_user_guide/15_hooks.html - https://kedro.readthedocs.io/en/stable/07_extend_kedro/04_hooks.html - https://github.com/EbookFoundation/free-programming-books/blob/master/books/free-programming-books.md#python - https://github.com/quantumblacklabs/private-kedro/blob/develop/docs/source/04_user_guide/04_data_catalog.md - http://example.com/api/test - https://example.com/file.parquet - https://kedro.readthedocs.io/en/stable/11_faq/01_faq.html#how-do-i-upgrade-kedro - https://example.com/file.xlsx - https://www.datacamp.com/community/tutorials/docstrings-python - https://github.com/mmchougule) - https://example.com/f...
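Several entries in the list above end in a stray `)` or `.`, which may mean the link checker captured trailing punctuation along with the URL. A minimal sketch of cleaning that up, assuming that really is the cause; `clean_url` is a hypothetical helper, not part of any kedro tooling.

```python
def clean_url(url):
    """Strip trailing punctuation that probably belongs to the prose."""
    url = url.rstrip(".,;")
    # Only drop a closing paren when it has no matching opening paren,
    # so URLs that legitimately end in ")" are left alone.
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1].rstrip(".,;")
    return url


if __name__ == "__main__":
    for raw in [
        "https://github.com/josephhaaga)",
        "https://github.com/jmespath/jmespath.py.",
        "https://example.com/file.csv",
    ]:
        print(raw, "->", clean_url(raw))
```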

Setting Parameters in kedro

Parameters are a place for you to store variables for your pipeline that can be accessed by any node that needs them, and can be easily changed by changing your environment. Parameters are stored in the repository in yaml files. https://youtu.be/Jj5cQ5bqcjg What is Kedro [1] 👆 Unsure what kedro is? Check out this post. parameters files # [2] You can have multiple parameters files and choose which ones to load by setting your environment. By default kedro will give you a base and local parameters file. - conf/base/parameters.yml - conf/local/parameters.yml base # [3] The base environment should contain all of the default values you want to run. # /conf/base/parameters.yml test_size: 0.2 random_state: 3 features: - engines - passenger_capacity - crew - d_check_complete - moon_clearance_complete - iata_approved - company_rating - review_scores_rating NOTE base will always be loaded first. accessing parameters # [4] Parameters can be accessed through context or throug...
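Because base is loaded first, anything set in a later environment wins. A rough sketch of that layering with plain dictionaries, assuming a simple top-level override; this is not kedro's config loader, and the `local` override value and `load_parameters` name are made up.

```python
def load_parameters(*environments):
    """Merge parameter dicts in order; later environments override earlier ones."""
    merged = {}
    for params in environments:
        merged.update(params)
    return merged


if __name__ == "__main__":
    base = {"test_size": 0.2, "random_state": 3}  # values from the base file above
    local = {"test_size": 0.3}  # hypothetical local override
    print(load_parameters(base, local))
```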

Writing your first kedro Nodes

https://youtu.be/-gEwU-MrPuA Before we jump in with anything crazy, let’s make some nodes with some vanilla data structures. import node # [1] You will need to import node from kedro.pipeline to start creating nodes. from kedro.pipeline import node func # [2] The func is a callable that will take the inputs and create the outputs. inputs / outputs # [3] Inputs and outputs can be None, a single catalog entry as a string, multiple catalog entries as a List of strings, or a dictionary of strings where the key is the keyword argument of the func and the value is the catalog entry to use for that keyword. our first node # [4] Sometimes in our pipelines our data is coming from an api where we already have python functions built to pull with. That's ok, kedro supports that with inputs=None. def create_range(): return range(100) make_range = node( func=create_range, inputs=None, outputs='range' ) second node # [5] Now we have some data to work from, let's use that as our inpu...
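To make the four shapes of inputs concrete, here is a toy sketch of how each one could map onto a function call. It uses a plain dict as a stand-in for the catalog and is not how kedro runs nodes internally; `call_node` is a hypothetical helper.

```python
def call_node(func, inputs, catalog):
    """Call func with data looked up from a dict standing in for the catalog."""
    if inputs is None:
        # No inputs: the function produces data on its own.
        return func()
    if isinstance(inputs, str):
        # One catalog entry becomes the single positional argument.
        return func(catalog[inputs])
    if isinstance(inputs, dict):
        # Keys are keyword argument names, values are catalog entries.
        return func(**{kwarg: catalog[name] for kwarg, name in inputs.items()})
    # A list of catalog entries becomes positional arguments, in order.
    return func(*[catalog[name] for name in inputs])


def create_range():
    return range(100)


if __name__ == "__main__":
    catalog = {"range": call_node(create_range, None, {})}
    print(call_node(sum, "range", catalog))
```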

Running your Kedro Pipeline from the command line

Running your kedro pipeline from the command line could not be any easier to get started. This is something that you may or may not do often depending on your workflow, but it's good to have under your belt. I personally do this half the time and run from ipython half the time. In production, I mostly use docker and that is all done with this cli. https://youtu.be/ZmccpLy-OEI What is Kedro [1] 👆 Unsure what kedro is? Check out this post. Kedro run # [2] To run the whole darn project all we need to do is fire up a terminal, activate our environment, and tell kedro to run. kedro run Specific Pipelines # [3] Running a sub pipeline that we have created is as easy as telling kedro which one we want to run. kedro run --pipeline dp Single Nodes # [4] While developing a node or a small list of nodes in a larger pipeline it's handy to be able to run them one at a time. Besides the use case of developing a single node I would not recommend leaning very heavily on running single nodes, le...

kedro Virtual Environment

Avoid serious version conflict issues by using a virtual environment [1] anytime you are running python. Here are three ways you can set up a kedro virtual environment. https://youtu.be/ZSxc5VVCBhM - conda - venv - pipenv conda # [2] I prefer to use conda as my virtual environment manager of choice as it gives me both the interpreter and the packages I install. I don’t have to rely on the system version of python or another tool to maintain python versions at all; I get everything in one tool. conda create -n my-project python=3.8 -y conda activate my-project python -m pip install --upgrade pip pip install -e src conda info --envs - stores environment in a root directory i.e. ~/miniconda3 - conda can use its own way to manage environments environment.yml - the python interpreter is packaged with the environment virtualenv # [3] Virtual env (venv) is another very respectable option that is built right into python, and requires no additional installs or using a different dis...
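Since venv ships with python, you can even create an environment from a script. A minimal sketch using the standard library's `venv.create`; the `.venv` directory name and the `make_env` helper are my own choices for illustration, and pip is left out so the example runs without network access or ensurepip.

```python
import venv
from pathlib import Path


def make_env(project_dir):
    """Create a bare virtual environment at <project_dir>/.venv."""
    env_dir = Path(project_dir) / ".venv"
    # with_pip=False keeps this fast and offline; pass True to bootstrap pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    print(make_env("."))
```

Day to day, `python -m venv .venv` from the shell does the same job.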

Kedro Pipeline Create

Kedro pipeline create is a command that makes creating new pipelines much easier. There is much less boilerplate that you need to write yourself. https://youtu.be/HtyIKqlEoNw creating a new pipeline # [1] The kedro cli comes with the following command to scaffold out new pipelines. Note that it will not add the new pipeline to your pipeline_registry (to be covered later); you will need to add it yourself. kedro pipeline create example results # [2] The directory structure that it creates looks like this. tree src/kedro_conda/pipelines src/kedro_conda/pipelines ├── __init__.py └── example ├── __init__.py ├── nodes.py ├── pipeline.py └── README.md References: [1]: #creating-a-new-pipeline [2]: #results