The ability to query s3 buckets so seamlessly looks like a pleasure to work with if you have a use case for it. The Kedro catalog takes care of this for me most of the time, but I wonder if there are some cross-project searching use cases I might find for this.
Note
This post is a thought [1]. It’s a short note that I make
about someone else’s content online #thoughts
References:
[1]: /thoughts/
Posts tagged: data
All posts with the tag "data"
70 posts
latest post 2025-06-09
hotel_bookings.csv
Discover what actually works in AI. Join millions of builders, researchers, and labs evaluating agents, models, and frontier technology through crowdsourced benchmarks, competitions, and hackathons.
kaggle.com [1]
Nice dataset to use for example and test projects. I’m currently using it to play with duckdb.
References:
[1]: https://www.kaggle.com/datasets/ahmedsafwatgb20/hotel-bookingscsv?resource=download
[1]
Migrating from kedro 0.18.4 to the latest version involves moving from the deprecated config loaders to the OmegaConfigLoader. Switching over does not look as bad as I originally thought.
- install kedro 0.18.5+
- set the CONFIG_LOADER_CLASS in settings.py
- swap out import statements
- config must be yaml or json
- getting values from config must be done with bracket (__getitem__) style, e.g. conf_loader["catalog"], not with .get
- any exceptions caught from the TemplatedConfigLoader will need to be swapped to OmegaConfigLoader exceptions, similar to the import swap in #3
- templated values must lead with an _
- globals are handled differently
- OmegaConfigLoader does not support jinja2 syntax, but rather a ${variable} syntax
References:
[1]: https://docs.kedro.org/en/stable/configuration/config_loader_migration.html
Today I learned how to VACUUM a sqlite database and cut its size about in half.
It’s a database that I have had running for quite a while, and it has some decent
traffic on it.
Why is it important to do a VACUUM? In short, it’s because the database file gets
fragmented as data is updated. On delete, rows are removed from the
database and their pages are marked as available for reuse, but the space is
not returned to the filesystem.
To VACUUM a database, run the following sql command. You can do it right from
the sqlite shell by running sqlite3.
You will need about double the current size of the database as free space to
do the VACUUM: you need space for a full copy, journaling or write ahead
logs, and the existing database.
VACUUM;
The docs are fantastic for vacuum [1].
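To see the effect for yourself, here is a minimal Python sketch using only the standard library sqlite3 module. The table name, row count, and payload size are made up for illustration. It fills a throwaway database, deletes every other row, and compares the file size before and after VACUUM.

```python
import os
import sqlite3
import tempfile


def vacuum_demo(rows: int = 5000) -> tuple[int, int]:
    """Return (size_before, size_after) in bytes around a VACUUM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT)")
        conn.executemany(
            "INSERT INTO post (message) VALUES (?)",
            (("x" * 500,) for _ in range(rows)),
        )
        conn.commit()

        # Delete every other row: the pages stay allocated, only half full.
        conn.execute("DELETE FROM post WHERE id % 2 = 0")
        conn.commit()
        size_before = os.path.getsize(path)

        # VACUUM must run outside a transaction, hence the commit above.
        conn.execute("VACUUM")
        conn.close()
        size_after = os.path.getsize(path)
    return size_before, size_after


if __name__ == "__main__":
    before, after = vacuum_demo()
    print(f"before: {before} bytes, after: {after} bytes")
```

The exact numbers depend on page size and your data, so treat the output as a rough indication rather than a benchmark.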
References:
[1]: https://www.sqlite.org/lang_vacuum.html
First I need to fetch my thoughts from the api and put them in a local sqlite database using sqlite-utils.
fthoughts () {
    # fetch thoughts
    curl 'https://thoughts.waylonwalker.com/posts/waylonwalker/?page_size=9999999999' | sqlite-utils insert ~/.config/thoughts/database2.db post --pk=id --alter --ignore -
}
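For anyone curious what the `--pk=id --ignore` part amounts to, here is a rough standard library sketch in Python. It assumes the api returns a JSON array of objects with id, title, message, and tags keys. That payload shape and the `insert_posts` name are my assumptions, and the sketch does not cover what `--alter` does.

```python
import json
import sqlite3


def insert_posts(conn: sqlite3.Connection, payload: str) -> int:
    """Insert posts from a JSON array, skipping ids that already exist.

    Returns the number of rows that were actually inserted.
    """
    posts = json.loads(payload)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS post ("
        "id INTEGER PRIMARY KEY, title TEXT, message TEXT, tags TEXT)"
    )
    before = conn.total_changes
    # INSERT OR IGNORE silently skips rows whose primary key is already present.
    conn.executemany(
        "INSERT OR IGNORE INTO post (id, title, message, tags) "
        "VALUES (:id, :title, :message, :tags)",
        posts,
    )
    conn.commit()
    return conn.total_changes - before
```

Running it twice with the same payload inserts nothing the second time, which is what makes re-fetching the whole feed safe.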
Now that I have my posts in a local sqlite database, I can use sqlite-utils to enable full text search on the post table and populate the index, using the title, message, and tags columns as the searchable columns.
sthoughts () {
    # search thoughts
    # sqlite-utils enable-fts ~/.config/thoughts/database2.db post title message tags
    # sqlite-utils populate-fts ~/.config/thoughts/database2.db post title message tags
    sqlite-utils search ~/.config/thoughts/database2.db post "$*" | ~/git/thoughts/format_thought.py | bat --style=plain --color=always --language=markdown
}
alias st=sthoughts
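Under the hood this is SQLite full text search. A minimal sketch of the same enable, populate, and search flow with the standard library sqlite3 module might look like the following. It assumes an FTS5-enabled SQLite build and a post table with an integer id primary key. The `post_fts` table name and both function names are illustrative, not necessarily what sqlite-utils generates.

```python
import sqlite3


def build_search_index(conn: sqlite3.Connection) -> None:
    """Create an FTS5 index over post and fill it from the existing rows."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS post_fts USING fts5("
        "title, message, tags, content='post', content_rowid='id')"
    )
    # 'rebuild' re-reads every row from the content table into the index.
    conn.execute("INSERT INTO post_fts(post_fts) VALUES ('rebuild')")
    conn.commit()


def search_posts(conn: sqlite3.Connection, query: str) -> list[tuple[int, str]]:
    """Return (id, title) pairs matching the query, best match first."""
    rows = conn.execute(
        "SELECT rowid, title FROM post_fts WHERE post_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()
```

Note that MATCH uses the FTS5 query syntax, so raw input containing punctuation may need quoting before it is passed in.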
Now I am ready to search my thoughts. Thoughts is a tiny blog format that I created mostly for leaving my own personal comments on web pages, so most of them link to some other online content, and their title is based on the author's title.
Open source, not open contribution with Ben Johnson (Changelog Interviews #433)
This week we're talking with Ben Johnson. Ben is known for his work on BoltDB, his work in open source, and as a freelance Go developer. Late January when Ben open sourced his newest project Litest...
Changelog · changelog.com [1]
Ben Johnson was on the Changelog a few years back covering his work on litestream, and talks about why he chose to go open source, but not open contribution.
You should have a good reason to move off of sqlite.
References:
[1]: https://changelog.com/podcast/433
Very inspiring talk. TL;DR: you probably don’t need a database server. sqlite will probably be faster, simpler to maintain, and make your application simpler to test.
I recently set up minio object storage in my homelab [1] for litestream sqlite
backups. The litestream quickstart made it easy to get everything up and
running on localhost, but I hit a wall when DNS was involved to pull it from a
different machine.
Here is what I got to work
First I had to configure the Key ID and Secret Access Key generated in the
minio ui.
❯ aws configure
AWS Access Key ID [****************VZnD]:
AWS Secret Access Key [****************xAm8]:
Default region name [us-east-1]:
Default output format [None]:
Then set the s3 signature_version to s3v4.
aws configure set default.s3.signature_version s3v4
Now when I have minio running on https://my-minio-endpoint.com I can use the
aws cli to access the bucket.
Note that https://my-minio-endpoint.com resolves to the bucket endpoint
(default port 9000), not the ui (default port 9001).
aws --endpoint-url https://my-minio-endpoint.com s3 ls my_bucket
Now Configuring Litestream
Litestream also accepts the endpoint argument via config. I could not get it
to work just with the ui.
Note that the aws configure step above is not required for litestream, only for the
aws cli.
dbs:
  - path: /path/to/database.db
    replicas:
      -...
GitHub - benbjohnson/litestream: Streaming replication for SQLite.
Streaming replication for SQLite. Contribute to benbjohnson/litestream development by creating an account on GitHub.
GitHub · github.com [1]
`litestream` is a sick cli tool for streaming replicas of sqlite. It automatically does daily snapshots, and streams all of the writes to the replica live.
install # [2]
Install is fast using installer, no compilation, just copy the binary and run.
curl https://i.wayl.one/benbjohnson/litestream
References:
[1]: https://github.com/benbjohnson/litestream
[2]: #install
why-is-postgres-default
Serious question.
No one ever got fired for choosing PostgreSQL
But, why. It’s the most loved db, right? Right? Maybe it’s time to rethink
it.
Don’t get me wrong, if I need a relational db as a service, PostgreSQL is going
to be my first choice, but why do I need to run a separate application for it?
Tutorials use sqlite
Why is that? Because there is nothing else to stand up. Nothing else to
maintain. And you probably already have it installed on just about anything
that has a battery.
SQLite runs in memory
Don’t need, or maybe don’t want, to persist state? Run it in memory. This is a
nice feature for running tests.
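As a quick illustration, here is a minimal Python sketch of a test that runs against an in-memory database using the standard library sqlite3 module. The post table and the `add_post` helper are hypothetical stand-ins for real application code.

```python
import sqlite3


def make_test_db() -> sqlite3.Connection:
    """Create a fresh in-memory database; it disappears when the connection closes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    return conn


def add_post(conn: sqlite3.Connection, title: str) -> int:
    """Insert a post and return its new id."""
    cur = conn.execute("INSERT INTO post (title) VALUES (?)", (title,))
    conn.commit()
    return cur.lastrowid


def test_add_post() -> None:
    conn = make_test_db()
    assert add_post(conn, "hello") == 1
    assert conn.execute("SELECT title FROM post").fetchall() == [("hello",)]
    conn.close()
```

Every test gets its own empty database with nothing to start, seed, or tear down.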
Less exposure
SQLite is a file on your filesystem. It’s not a web service. It’s not a cloud
service. Not that postgres is insecure, but it is one more endpoint that you
have to think about securing.
This means it is probably also cheaper 🤑
SQLite is easy to replicate
Want to run your new feature with prod data? Pull a replica or...
Why I Built Litestream - Litestream
Despite an exponential increase in computing power, our applications require more machines than ever because of architectural decisions made 25 years ago. You can eliminate much of your complexity ...
litestream.io [1]
As applications scale to the edge, to put compute as close to the user as possible, database queries back to the master node get slower and slower. Enter sqlite replication: put the database with the application code and replicate from master.
References:
[1]: https://litestream.io/blog/why-i-built-litestream/
I'm All-In on Server-Side SQLite
Ben Johnson has joined Fly.io
Fly · fly.io [1]
SQLite is the next big database trend. With more horizontally scaled, close-to-user, read-heavy applications, having your database in the same application stack makes a lot of sense. Tools like litestream are going to enable global distribution in an impressive way.
References:
[1]: https://fly.io/blog/all-in-on-sqlite-litestream/
LiteFS Cloud: Distributed SQLite with Managed Backups
Documentation and guides from the team at Fly.io.
Fly · fly.io [1]
Fly.io’s solution to sqlite managed backups. I definitely want to look into this a bit, but more so the tech under the hood, litestream.
References:
[1]: https://fly.io/blog/litefs-cloud/
[1]
sqlite FTS5 has several built-in tokenizers: unicode61 (the default), ascii, porter, and trigram.
These can be used with sqlite-utils.
sqlite-utils enable-fts --tokenize porter database.db post title message tags
And with the python api.
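from sqlite_utils import Database

db = Database("database.db")
# Index the three text columns; triggers keep the index in sync on writes.
db["post"].enable_fts(
    ["title", "message", "tags"], create_triggers=True, tokenize="trigram"
)
# `search` is the query string to look up.
posts = list(db["post"].search(search))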
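The difference between tokenizers is easiest to see side by side. Here is a minimal sketch with the standard library sqlite3 module, assuming your SQLite build includes FTS5. The `fts_matches` helper and the sample messages are invented for illustration. It compares the porter stemmer with the default unicode61 tokenizer.

```python
import sqlite3


def fts_matches(tokenize: str, query: str, docs: list[str]) -> list[str]:
    """Index docs with the given FTS5 tokenizer and return the ones matching query."""
    conn = sqlite3.connect(":memory:")
    # The tokenizer name is interpolated for brevity; only pass trusted values.
    conn.execute(
        f"CREATE VIRTUAL TABLE post_fts USING fts5(message, tokenize='{tokenize}')"
    )
    conn.executemany("INSERT INTO post_fts (message) VALUES (?)", [(d,) for d in docs])
    rows = conn.execute(
        "SELECT message FROM post_fts WHERE post_fts MATCH ? ORDER BY rowid",
        (query,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


docs = ["streaming sqlite replicas", "a stream of writes"]
# porter stems "streaming" down to "stream", so both messages match.
print(fts_matches("porter", "stream", docs))
# unicode61 only matches the exact token "stream".
print(fts_matches("unicode61", "stream", docs))
```

Stemming is handy for prose-heavy notes like these, while trigram is the one to reach for when you want substring matches.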
References:
[1]: https://www.sqlite.org/fts5.html
GitHub - simonw/datasette-render-markdown: Datasette plugin for rendering Markdown
Datasette plugin for rendering Markdown. Contribute to simonw/datasette-render-markdown development by creating an account on GitHub.
GitHub · github.com [1]
datasette really does everything doesn’t it!
References:
[1]: https://github.com/simonw/datasette-render-markdown
`ValueError: Constraint must have a name` in alembic 1.10.0 · Issue #1195 · sqlalchemy/alembic
Describe the bug ValueError: Constraint must have a name in alembic 1.10.0. Expected behavior Migration succeeds. To Reproduce Please try to provide a Minimal, Complete, and Verifiable example, wit...
GitHub · github.com [1]
After a nasty time with alembic upgrades, thoughts is about to get a new users table. This may have come from incorrectly setting up alembic for sqlite from the start, but I was able to fix it using the solution in this GitHub issue.
alembic sqlite ValueError: Constraint must have a name
The change I needed to make to get my migration to run.
+ batch_op.create_foreign_key('fk_post_author_id_user', 'user', ['author_id'], ['id'])
References:
[1]: https://github.com/sqlalchemy/alembic/issues/1195
Since I started using alembic, I have just been generating a new revision, checking its content, and deleting it if it’s empty. Today I learned there is an alembic check command that reports whether there are operations that still need a migration.
❯ alembic check
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
No new upgrade operations detected.
DuckDB vs. MotherDuck: When to Move to the Cloud | Kestra
DuckDB is fast and free. MotherDuck adds cloud storage, collaboration, and scale. Here
kestra.io [1]
duckdb is a new in-process database that has been making the rounds in analytics for its high performance in those applications.
MotherDuck is a centralized service that brings managed storage, data sharing, and an IDE to duckdb.
References:
[1]: https://kestra.io/blogs/2023-07-28-duckdb-vs-motherduck
s3-tree
pypi.org [1]
Super useful way to show a tree view of an s3 bucket’s structure!
pip install s3-tree
s3-tree bucketname
References:
[1]: https://pypi.org/project/s3-tree/
GitHub - kndndrj/nvim-dbee: Interactive database client for neovim
Interactive database client for neovim. Contribute to kndndrj/nvim-dbee development by creating an account on GitHub.
GitHub · github.com [1]
A neovim database client that I need to check out.
References:
[1]: https://github.com/kndndrj/nvim-dbee