sqlite-utils now supports plugins
sqlite-utils 3.34 is out with a major new feature: support for plugins. sqlite-utils is my combination Python library and command-line tool for manipulating SQLite databases. It recently celebrated...
Simon Willison’s Weblog · simonwillison.net [1]
As the title states, sqlite-utils now supports plugins. I dug in just a bit, and Simon implemented this completely with entrypoints, no framework or library at all.
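For a feel of what entry points give you, here is a minimal sketch that lists them using only the standard library. The helper name discover_plugins is hypothetical, and the group name "sqlite_utils" is my assumption about where plugins register, so check the sqlite-utils plugin docs before relying on it.

```python
from importlib.metadata import entry_points


def discover_plugins(group):
    """Return {name: "module:attr"} for every entry point registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns a selectable collection
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    # "sqlite_utils" is an assumed group name, not confirmed by this post
    for name, target in discover_plugins("sqlite_utils").items():
        print(name, "->", target)
```

Any installed package that declares an entry point in that group shows up here without the host tool knowing about it in advance.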
Note
This post is a thought [2]. It’s a short note that I make
about someone else’s content online #thoughts
References:
[1]: https://simonwillison.net/2023/Jul/24/sqlite-utils-plugins/
[2]: /thoughts/
Posts tagged: data
All posts with the tag "data"
70 posts
latest post 2025-06-09
External Link
duckdb.org [1]
Harlequin is a pretty sweet example of what Textual can be used to create. It’s a terminal-based SQL IDE for DuckDB.
References:
[1]: https://duckdb.org/docs/guides/sql_editors/harlequin
[1]
To persist data in DuckDB, you first need to open a connection to a DuckDB database file.
import duckdb

con = duckdb.connect('file.db')
Then work off of the connection con rather than duckdb.
con.sql('CREATE TABLE test(i INTEGER)')
con.sql('INSERT INTO test VALUES (42)')
# query the table
con.table('test').show()
# explicitly close the connection
con.close()
References:
[1]: /static/https://duckdb.org/docs/api/python/overview.html
duckdb.org [1]
duckdb can just query any pandas dataframe that is in memory.
I tried running it against a list of objects and got this error. Great error message that gives me supported types right in the message.
Make sure that "posts" is either a pandas.DataFrame, duckdb.DuckDBPyRelation, pyarrow Table, Dataset, RecordBatchReader, Scanner, or NumPy ndarrays with supported format
References:
[1]: https://duckdb.org/docs/guides/python/sql_on_pandas
Full-text search - Datasette documentation
docs.datasette.io [1]
Enable full-text search in sqlite using sqlite-utils.
$ sqlite-utils enable-fts mydatabase.db items name description
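To see roughly what a full-text index buys you, here is a minimal sketch using Python's built-in sqlite3 module. It assumes your SQLite build includes the FTS5 extension, and the table name items_fts and the sample rows are made up for illustration. This is a hand-rolled index, not a reproduction of what sqlite-utils generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO items (name, description) VALUES (?, ?)",
    [("duck", "a waterfowl that quacks"), ("cat", "a small feline")],
)

# Hypothetical FTS5 table covering the two text columns
conn.execute("CREATE VIRTUAL TABLE items_fts USING fts5(name, description)")
conn.execute(
    "INSERT INTO items_fts (rowid, name, description) "
    "SELECT id, name, description FROM items"
)


def search(term):
    """Return the names of items whose indexed text matches term."""
    rows = conn.execute(
        "SELECT items.name FROM items_fts "
        "JOIN items ON items.id = items_fts.rowid "
        "WHERE items_fts MATCH ?",
        (term,),
    )
    return [row[0] for row in rows]


print(search("quacks"))
```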
References:
[1]: https://docs.datasette.io/en/latest/full_text_search.html#enabling-full-text-search-for-a-sqlite-table
sqlite-utils command-line tool - sqlite-utils
sqlite-utils.datasette.io [1]
I want to like jq, but I think Simon is selling me on SQLite. Maybe it’s just me, but this looks readable, hackable, editable, and memorizable. Every time I try jq, it’s five minutes of fussing just to get the most basic thing to work. I know enough SQL out of the gate to make this work off the top of my head.
curl https://thoughts.waylonwalker.com/posts/ | sqlite-utils memory - 'select title, message from stdin where stdin.tags like "%python%"' | jq
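The same idea can be sketched in plain Python with the standard library: load a JSON array into an in-memory table named stdin, then query it with SQL. This is a rough analogue under my own assumptions (a non-empty array of flat objects, nested values stored as JSON text), not how sqlite-utils is implemented. The helper query_json and the sample posts are hypothetical.

```python
import json
import sqlite3


def _to_sql(value):
    # Store nested lists and dicts as JSON text so LIKE can search them
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value


def query_json(json_text, sql):
    """Load a JSON array of objects into an in-memory table called stdin, then run sql."""
    rows = json.loads(json_text)
    columns = sorted({key for row in rows for key in row})
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    column_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(f"CREATE TABLE stdin ({column_sql})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO stdin VALUES ({placeholders})",
        [[_to_sql(row.get(c)) for c in columns] for row in rows],
    )
    return [dict(r) for r in conn.execute(sql)]


# Made-up sample data standing in for the curl response
posts = json.dumps(
    [
        {"title": "DuckDB", "message": "fast", "tags": ["python", "data"]},
        {"title": "tmux", "message": "panes", "tags": ["cli"]},
    ]
)
result = query_json(
    posts, "select title, message from stdin where tags like '%python%'"
)
print(result)
```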
References:
[1]: https://sqlite-utils.datasette.io/en/stable/cli.html#querying-data-directly-using-an-in-memory-database
sqlite-utils command-line tool - sqlite-utils
sqlite-utils.datasette.io [1]
Insert JSON (a single object or an array of objects) directly into SQLite with sqlite-utils.
echo '{"name": "Cleo", "age": 4}' | sqlite-utils insert dogs.db dogs -
References:
[1]: https://sqlite-utils.datasette.io/en/stable/cli.html#inserting-json-data
Kedro rich is a very new and unstable (it’s good, just not ready) plugin for
kedro to make the command line prettier.
Install kedro rich # [1]
There is no PyPI package yet, but it’s on GitHub. You can pip install it with
the git [2] url.
pip install git+https://github.com/datajoely/kedro-rich
Kedro run # [3]
You can run your pipeline just as you normally would, except you get progress
bars and pretty prints.
kedro run
[4]
Kedro catalog # [5]
Listing catalog entries from the command line now prints a nice, pretty
table.
kedro catalog list
[6]
Give it a star # [7]
Go to the GitHub repo [8] and give it a
star, Joel deserves it.
References:
[1]: #install-kedro-rich
[2]: /glossary/git/
[3]: #kedro-run
[4]: https://images.waylonwalker.com/kedro-rich-run.png
[5]: #kedro-catalog
[6]: https://images.waylonwalker.com/kedro-rich-catalog-list.png
[7]: #give-it-a-star
[8]: https://github.com/datajoely/kedro-rich
I keep my nodes short and sweet. They do one thing and do it well. I
turn almost every DataFrame transformation into its own node. That makes
it much easier to pull catalog entries than firing up the pipeline,
running it, and starting a debugger. For this reason, many of my nodes
can be built from inline lambdas.
Examples # [1]
Here are two examples. The first one, lambda x: x, is sometimes referred
to as an identity function. It is super common in the early
phases of a project. It lets you follow standard layering conventions
without skipping a layer or overthinking whether you should have the layer,
and it leaves a good placeholder to fill in later when you need it.
Many times I just want to get the data in as fast as possible, learn
about it, then go back and tidy it up.
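from kedro.pipeline import node

# identity node: passes the raw data straight through to the next layer
my_first_node = node(
    func=lambda x: x,
    inputs='raw_cars',
    outputs='int_cars',
    tags=['int'],
)

# one-line transformation: keep three columns and filter on displacement
my_second_node = node(
    func=lambda cars: cars[['mpg', 'cyl', 'disp']].query('disp>200'),
    inputs='raw_cars',
    outputs='int_cars',
    tags=['pri'],
)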
Note: try not to take the idea of a one liner too far. If your
one line function wraps several lines down it probably deserv...
As you work on your kedro projects you are bound to need to add more
dependencies to the project eventually. Kedro uses a fantastic command
pip-compile under the hood to ensure that everyone is on the same version of
packages at all times and can upgrade them easily. This workflow might be a bit
different from what you have seen, so let’s take a look at it.
git status # [2]
Before you start mucking around with any changes to dependencies, make sure that
your git status is clean. I’d even recommend starting a new branch for this,
and if you are working on a team, potentially submitting it as its own PR for
clarity.
git status
git checkout main
git checkout -b add-rich-dependency
requirements.in # [3]
New requirements get added to a requirements.in file. If you need to specify
an exact version, or a minimum version you can do that, but if all versions
generally work you can leave it open.
# requirements.in
rich
Here I added the popular rich package to my requirements.in file. Since
I am ok with the latest version, I am not going to pin anything. I am going to
let the pip resolver pick the latest version that does not conflict with any of
my other dependencies.
build-reqs # [4]
...
I am a huge believer in practicing your craft. Professional athletes
spend most of their time honing their skills and making themselves
better. In engineering, many spend nearly zero time practicing. I am not
saying that you need to spend all your free time practicing, but a few
minutes trying new things can go a long way in how you understand what
you are doing and make a huge impact on your long-term productivity.
What is Kedro [1]
Start practicing # [2]
practice building pipelines with #kedro today
Go to your playground directory, and if you don’t have one, make one.
cd ~/playground
get pipx # [3]
Install pipx in your system python. This is one of the very few, and
possibly the only, python library that deserves to be installed in your
system directory, primarily because it’s used to sandbox CLIs in their own
virtual environment [4] automatically for you.
pip install pipx
make a new project # [5]
From inside your playground directory, start your new kedro project.
This is quite simple and painless. So much so that if you mess this one
up doing something wild, it might be easier to make a new one than
to fix the wild one.
pipx run kedro new
# answer the questions it asks
I u...
I just installed a brand new Ubuntu 21.10 Impish Indri, and wanted a
kedro project to play with so I did what any good kedroid would do, I
went to my command line and ran
pipx run kedro new --starter spaceflights
But what I got back was not what I expected!
Fatal error from pip prevented installation. Full pip output in file:
/home/walkers/.local/pipx/logs/cmd_2022-01-01_20.42.16_pip_errors.log
Some possibly relevant errors from pip install:
ERROR: Could not find a version that satisfies the requirement kedro (from versions: none)
ERROR: No matching distribution found for kedro
Error installing kedro.
This is weird. Why can’t I run kedro new with pipx? Let’s try pip.
pip install kedro
Same issue.
ERROR: Could not find a version that satisfies the requirement kedro (from versions: none)
ERROR: No matching distribution found for kedro
What is Kedro [1]
Curious what kedro is? Check out this article.
What’s up # [2]
wrong python version
The issue is that kedro only runs on up to python 3.8, and on Ubuntu
21.10 when you apt install python3 you get python 3.9 and the
standard repos don’t have an old enough version to run kedro.
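A quick way to confirm you are hitting this is to compare the running interpreter against the ceiling. This is a minimal sketch: the (3, 8) ceiling comes from this post and newer kedro releases may differ, and is_supported is a hypothetical helper, not part of kedro or pip.

```python
import sys

# Ceiling taken from this post; treat it as an assumption for other kedro versions
MAX_SUPPORTED = (3, 8)


def is_supported(version_info=sys.version_info, max_supported=MAX_SUPPORTED):
    """Return True when the major.minor version is at or below the ceiling."""
    return tuple(version_info[:2]) <= max_supported


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    print(current, "supported" if is_supported() else "too new")
```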
How to fix this? # [3]
Theres a couple of wa...
kedro catalog create
I use kedro catalog create to boost my productivity by automatically
generating yaml catalog entries for me. It will create new yaml files for each
pipeline, fill in missing catalog entries, and respect already existing
catalog entries. It will reformat the file and sort it based on catalog key.
https://youtu.be/_22ELT4kja4
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
Running Kedro Catalog Create # [2]
This command ensures there is a catalog entry for every dataset in the
pipeline you pass in.
kedro catalog create --pipeline history_nodes
- Creates a new yaml file, if needed
- Fills in new dataset entries with the default dataset
- Keeps existing datasets untouched
- Reformats your yaml file a bit
  - default sorting will be applied
  - empty newlines will be removed
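The behavior in that list can be sketched with plain dictionaries. This is only an illustration of the rules, not kedro's code: fill_catalog is a hypothetical helper, and the "MemoryDataSet" default type and the sample entries are my assumptions.

```python
def fill_catalog(existing, pipeline_datasets, default_type="MemoryDataSet"):
    """Add a default entry for each missing dataset, keep existing ones, sort by key."""
    catalog = dict(existing)
    for name in pipeline_datasets:
        # setdefault leaves entries that already exist untouched
        catalog.setdefault(name, {"type": default_type})
    return dict(sorted(catalog.items()))


# Made-up catalog with one hand-written entry
existing = {
    "raw_cars": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/cars.csv"}
}
result = fill_catalog(existing, ["raw_cars", "int_cars"])
print(result)
```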
CONF_ROOT # [3]
Kedro will respect your CONF_ROOT settings when it creates a new catalog
file, or looks for existing catalog files. You can change the location of your
configuration f...
nvim conf 2021 | IDE's are slow | Waylon Walker
https://youtu.be/E18m4KkJUnI
---
Slides 👇 # [1]
welcome # [2]
Other possible titles # [3]
- Using Vim as a Team Lead
- I 💜 Tmux
- Why I stopped using @code
- Get there fast
- How I vim
It’s ok # [4]
Use a graphical IDE if it works for you.
Trick it out # [5]
vim is so well integrated into the terminal, take advantage
It wasn’t working for me anymore # [6]
dozens of instances # [7]
As a team lead I bounce between a dozen projects per day
https://pbs.twimg.com/media/FAEmRjYUcAUk2eR?format=jpg&name=large [8]
Move With Intent # [9]
Running vim inside tmux lets me move swiftly between the exact project I need.
https://twitter.com/_WaylonWalker/status/1438849269407047686/photo/1
Hub and Spoke # [11]
- direct link to specific projects
- fuzzy into all projects
- fuzzy into open projects
How I navigate tmux in 2021 [12]#hub-and-spoke
Other Things That Make this Possible # [13]
- tmux
- direnv
vim adjacent things
yes, vim is ugly, make it your...
Kedro-Broken-Urls
Broken Urls # [1]
- https://github.com/josephhaaga)
- https://example.com/file.h5
- https://raw.githubusercontent.com/kedro-org/kedro/develop/static/img/pipeline_visualisation.png
- https://example.com/file.txt
- https://github.com/jmespath/jmespath.py.
- https://github.com/tsanikgr)
- https://example.com/file.csv
- https://kedro.readthedocs.io/en/latest/04_user_guide/15_hooks.html
- https://kedro.readthedocs.io/en/stable/07_extend_kedro/04_hooks.html
- https://github.com/EbookFoundation/free-programming-books/blob/master/books/free-programming-books.md#python
- https://github.com/quantumblacklabs/private-kedro/blob/develop/docs/source/04_user_guide/04_data_catalog.md
- http://example.com/api/test
- https://example.com/file.parquet
- https://kedro.readthedocs.io/en/stable/11_faq/01_faq.html#how-do-i-upgrade-kedro
- https://example.com/file.xlsx
- https://www.datacamp.com/community/tutorials/docstrings-python
- https://github.com/mmchougule)
- https://example.com/f...
Setting Parameters in kedro
Parameters are a place for you to store variables for your pipeline that can be
accessed by any node that needs it, and can be easily changed by changing your
environment. Parameters are stored in the repository in yaml files.
https://youtu.be/Jj5cQ5bqcjg
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
parameters files # [2]
You can have multiple parameters files and choose which ones to load by setting
your environment. By default kedro will give you a base and local
parameters file.
- conf/base/parameters.yml
- conf/local/parameters.yml
base # [3]
The base environment should contain all of the default values you want to run.
# /conf/base/parameters.yml
test_size: 0.2
random_state: 3
features:
- engines
- passenger_capacity
- crew
- d_check_complete
- moon_clearance_complete
- iata_approved
- company_rating
- review_scores_rating
NOTE base will always be loaded first.
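The layering can be pictured as a simple dictionary merge, where whatever loads later wins. This is a minimal sketch of the idea, assuming top-level keys from a later environment replace the base values. The local override and the load_parameters helper are hypothetical, not kedro's config loader.

```python
# Values from conf/base/parameters.yml above
base = {"test_size": 0.2, "random_state": 3}

# Hypothetical override that might live in conf/local/parameters.yml
local = {"test_size": 0.3}


def load_parameters(*layers):
    """Merge parameter layers in order; later layers override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


params = load_parameters(base, local)
print(params)
```

Because base goes in first, anything you leave out of local keeps its default.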
accessing parameters # [4]
Parameters can be accessed through context or throug...
Writing your first kedro Nodes
https://youtu.be/-gEwU-MrPuA
Before we jump in with anything crazy, let’s make some nodes with some vanilla
data structures.
import node # [1]
You will need to import node from kedro.pipeline to start creating nodes.
from kedro.pipeline import node
func # [2]
The func is a callable that will take the inputs and create the outputs.
inputs / outputs # [3]
Inputs and outputs can be None, a single catalog entry as a string, multiple
catalog entries as a list of strings, or a dictionary of strings where the key
is the keyword argument of the func and the value is the catalog entry to use
for that keyword.
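To make those four shapes concrete, here is a toy runner that resolves inputs against a plain dict standing in for the catalog. This is a sketch of the idea only: run_node is a hypothetical helper, not kedro's implementation, and the way it handles dictionary outputs (picking keys out of a returned dict) is my assumption.

```python
def run_node(func, inputs, outputs, catalog):
    """Call func with data pulled from catalog, then store what it returns."""
    if inputs is None:
        result = func()
    elif isinstance(inputs, str):
        result = func(catalog[inputs])
    elif isinstance(inputs, dict):
        # key is the keyword argument, value is the catalog entry
        result = func(**{arg: catalog[name] for arg, name in inputs.items()})
    else:
        result = func(*[catalog[name] for name in inputs])

    if outputs is None:
        return catalog
    if isinstance(outputs, str):
        catalog[outputs] = result
    elif isinstance(outputs, dict):
        # assumption: func returned a dict, keyed the same way as outputs
        for key, name in outputs.items():
            catalog[name] = result[key]
    else:
        for name, value in zip(outputs, result):
            catalog[name] = value
    return catalog


catalog = {"factor": 2}
run_node(lambda: list(range(5)), None, "range", catalog)
run_node(
    lambda data, factor: [x * factor for x in data],
    {"data": "range", "factor": "factor"},
    "scaled",
    catalog,
)
run_node(
    lambda a, b: (len(a), sum(b)),
    ["range", "scaled"],
    ["n_range", "total_scaled"],
    catalog,
)
print(catalog)
```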
our first node # [4]
Sometimes in our pipelines the data comes from an API where we already have
python functions built to pull it. That’s ok, kedro supports that with
inputs=None.
def create_range():
    return range(100)

make_range = node(
    func=create_range,
    inputs=None,
    outputs='range'
)
second node # [5]
Now we have some data to work from, lets use that as our inpu...
Running your Kedro Pipeline from the command line
Getting started with running your kedro pipeline from the command line could
not be any easier. You may or may not do this often depending on
your workflow, but it’s good to have under your belt. I personally do this half
the time and run from ipython half the time. In production, I mostly use docker,
and that is all done with this cli.
https://youtu.be/ZmccpLy-OEI
What is Kedro [1]
👆 Unsure what kedro is? Check out this post.
Kedro run # [2]
To run the whole darn project all we need to do is fire up a terminal, activate
our environment, and tell kedro to run.
kedro run
Specific Pipelines # [3]
Running a sub pipeline that we have created is as easy as telling kedro which
one we want to run.
kedro run --pipeline dp
Single Nodes # [4]
While developing a node or a small list of nodes in a larger pipeline, it’s handy
to be able to run them one at a time. Besides the use case of developing a
single node, I would not recommend leaning very heavily on running single nodes,
le...
kedro Virtual Environment
Avoid serious version conflict issues by using a virtual environment [1] anytime
you are running python. Here are three ways you can set up a kedro virtual
environment.
https://youtu.be/ZSxc5VVCBhM
- conda
- venv
- pipenv
conda # [2]
I prefer conda as my virtual environment manager, as it gives me
both the interpreter and the packages I install. I don’t have to rely on the
system version of python or another tool to maintain python versions at all. I
get everything in one tool.
conda create -n my-project python=3.8 -y
conda activate my-project
python -m pip install --upgrade pip
pip install -e src
conda info --envs
- stores environment in a root directory i.e. ~/miniconda3
- conda can manage environments its own way, with an environment.yml file
- the python interpreter is packaged with the environment
virtualenv # [3]
Virtual env (venv) is another very respectable option that is built right into
python, and requires no additional installs or using a different dis...
Kedro Pipeline Create
Kedro pipeline create is a command that makes creating new
pipelines much easier. There is much less boilerplate that
you need to write yourself.
https://youtu.be/HtyIKqlEoNw
creating a new pipeline # [1]
The kedro cli comes with the following command to scaffold out
new pipelines. Note that it will not add the new pipeline to your
pipeline_registry (to be covered later). You will need to add
it yourself.
kedro pipeline create example
results # [2]
The directory structure that it creates looks like this.
tree src/kedro_conda/pipelines
src/kedro_conda/pipelines
├── __init__.py
└── example
    ├── __init__.py
    ├── nodes.py
    ├── pipeline.py
    └── README.md
References:
[1]: #creating-a-new-pipeline
[2]: #results