Posts tagged: data

Taming file zoos: Data science with DuckDB database files - Al... www.youtube.com

- The ability to query s3 buckets so seamlessly looks like a pleasure to work with, if you have a use case for it. The Kedro catalog takes care of this most of the time for me, but I wonder if there are some cross-project searching use cases I might find for this.

hotel_bookings.csv www.kaggle.com

[1] A nice dataset to use for example / test projects. I’m currently using it to play with duckdb. References: [1]: https://www.kaggle.com/datasets/ahmedsafwatgb20/hotel-bookingscsv?resource=download

Migration guide for config loaders — kedro 0.19.11 documentation docs.kedro.org

[1] Migrating from kedro 0.18.4 to the latest version involves moving off the deprecated ConfigLoader and TemplatedConfigLoader and onto the OmegaConfigLoader. Switching over does not look as bad as I originally thought.

- install kedro 0.18.5+
- set the CONFIG_LOADER_CLASS in settings.py
- swap out import statements
- config must be yaml or json
- getting values from config must be done with bracket (__getitem__) style, not with .get
- any exceptions caught from the templated config loader will need to be swapped to the OmegaConfigLoader exceptions, similar to the import swap above
- templated values must lead with an _
- globals are handled differently
- OmegaConfigLoader does not support jinja2 syntax, but rather a ${variable} syntax

References: [1]: /static/https://docs.kedro.org/en/stable/configuration/config_loader_migration.html

Today I learned how to VACUUM a sqlite database and cut its size about in half. It’s a database that I have had running for quite a while, and it has some decent traffic on it.

Why is it important to do a VACUUM? In short, it’s because the database file gets fragmented as data is updated. On delete, rows are removed from the database and their pages are marked as available for reuse, but the space is not returned to the filesystem.

You will need about double the current size of the database as free space to do the VACUUM: you need room for a full copy, journaling or write-ahead logs, and the existing database.

To VACUUM a database, run the following sql command. You can do it right from the sqlite shell, which you open by running sqlite3.

VACUUM;

The docs are fantastic for vacuum.
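To see the effect without touching a real database, here is a minimal sketch using Python's built-in sqlite3 module. The table name, row count, and `vacuum_demo` helper are made up for illustration; it fills a throwaway database, deletes half the rows, and compares the file size before and after VACUUM.

```python
import os
import sqlite3
import tempfile


def vacuum_demo(rows: int = 5000) -> tuple[int, int]:
    """Return (size_before, size_after) in bytes around a VACUUM."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # throwaway database file
        con = sqlite3.connect(path)
        con.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT)")
        con.executemany(
            "INSERT INTO post (message) VALUES (?)",
            (("x" * 200,) for _ in range(rows)),
        )
        con.commit()

        # Deleting rows frees space inside the file, but the file does not shrink.
        con.execute("DELETE FROM post WHERE id % 2 = 0")
        con.commit()
        size_before = os.path.getsize(path)

        # VACUUM must run outside a transaction, hence the commit above.
        con.execute("VACUUM")
        con.close()
        size_after = os.path.getsize(path)
    return size_before, size_after


if __name__ == "__main__":
    before, after = vacuum_demo()
    print(f"before: {before} bytes, after: {after} bytes")
```

The exact numbers depend on page size and row layout, so treat the output as a rough indication rather than a benchmark.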

searching my thoughts locally

First I need to fetch my thoughts from the api, and put them in a local sqlite database using sqlite-utils.

fthoughts () {
    # fetch thoughts
    curl 'https://thoughts.waylonwalker.com/posts/waylonwalker/?page_size=9999999999' | sqlite-utils insert ~/.config/thoughts/database2.db post --pk=id --alter --ignore -
}

Now that I have my posts in a local sqlite database I can use sqlite-utils to enable full text search and populate the full text search index on the post table, using the title, message, and tags columns.

sthoughts () {
    # search thoughts
    # sqlite-utils enable-fts ~/.config/thoughts/database2.db post title message tags
    # sqlite-utils populate-fts ~/.config/thoughts/database2.db post title message tags
    sqlite-utils search ~/.config/thoughts/database2.db post "$*" | ~/git/thoughts/format_thought.py | bat --style=plain --color=always --language=markdown
}
alias st=sthoughts

Now I am ready to search my thoughts, which is a tiny blog format that I created mostly for leaving my own personal comments on web pages, so most of them have a link to some other online content, and their title is based on the author’s title. [1] [2] References: [1]: https://vhs.charm.s...
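For a sense of what the insert step with --pk=id and --ignore amounts to, here is a rough sketch with Python's built-in sqlite3 module. It is not how sqlite-utils is implemented. The JSON shape (a list of objects with id, title, message, and tags) and the `insert_posts` helper are assumptions for illustration.

```python
import json
import sqlite3


def insert_posts(con: sqlite3.Connection, payload: str) -> int:
    """Insert posts from a JSON array, skipping ids that already exist.

    Returns the number of rows actually inserted.
    """
    posts = json.loads(payload)  # assumed shape: [{"id": ..., "title": ..., ...}]
    con.execute(
        "CREATE TABLE IF NOT EXISTS post ("
        "id INTEGER PRIMARY KEY, title TEXT, message TEXT, tags TEXT)"
    )
    before = con.total_changes
    # INSERT OR IGNORE leaves existing primary keys untouched, like --ignore.
    con.executemany(
        "INSERT OR IGNORE INTO post (id, title, message, tags) "
        "VALUES (:id, :title, :message, :tags)",
        posts,
    )
    con.commit()
    return con.total_changes - before


if __name__ == "__main__":
    sample = json.dumps(
        [
            {"id": 1, "title": "litestream", "message": "sqlite replicas", "tags": "data"},
            {"id": 2, "title": "duckdb", "message": "query s3", "tags": "data"},
        ]
    )
    con = sqlite3.connect(":memory:")
    print(insert_posts(con, sample))  # first run inserts both rows
    print(insert_posts(con, sample))  # second run inserts nothing new
```

Running the fetch repeatedly is then safe: posts already stored locally are left alone and only new ids are added.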

Open source, not open contribution with Ben Johnson (Changelog... changelog.com

Open source, not open contribution with Ben Johnson (Changelog Interviews #433) This week we're talking with Ben Johnson. Ben is known for his work on BoltDB, his work in open source, and as a freelance Go developer. Late January when Ben open sourced his newest project Litest... Changelog · changelog.com [1] Ben Johnson was on the Changelog a few years back covering his work on litestream, and talks about why he chose to go open source, but not open contribution. You should have a good reason to move off of sqlite. References: [1]: https://changelog.com/podcast/433

DjangoCon Europe 2023 | Use SQLite in production - YouTube www.youtube.com

- Very inspiring talk. TLDR: you probably don’t need a database server. sqlite will probably be faster, simpler to maintain, and make your application simpler to test.

I recently set up minio object storage in my homelab for litestream sqlite backups. The litestream quickstart made it easy to get everything up and running on localhost, but I hit a wall when dns got involved in order to reach it from a different machine.

Here is what I got to work

First I had to configure the Key ID and Secret Access Key generated in the minio ui.

❯ aws configure
AWS Access Key ID [****************VZnD]:
AWS Secret Access Key [****************xAm8]:
Default region name [us-east-1]:
Default output format [None]:

Then set the s3 signature_version to s3v4.

aws configure set default.s3.signature_version s3v4

Now that I have minio running at https://my-minio-endpoint.com I can use the aws cli to access the bucket.

Note that https://my-minio-endpoint.com resolves to the bucket endpoint (default port 9000), not the ui (default port 9001).

aws --endpoint-url https://my-minio-endpoint.com s3 ls my_bucket

Now Configuring Litestream

Litestream also accepts the endpoint argument via config. I could not get it to work just with the ui.

Note that the aws configure step above is not required for litestream, only for the aws cli.

dbs:
  - path: /path/to/database.db
    replicas:
      - url: s3://my_bucket/
        endpoint: https://my-minio-endpoint.com
        region: us-east-1
        access-key-id: ****************VZnD
        secret-access-key: ************************************xAm8
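If you would rather not type the keys into the file by hand, one option is to render the same config from environment variables. This is only a sketch: the `render_litestream_config` helper and the MINIO_ACCESS_KEY_ID / MINIO_SECRET_ACCESS_KEY variable names are made up for illustration, and the YAML keys simply mirror the config above.

```python
import json
import os
from typing import Mapping


def render_litestream_config(
    db_path: str,
    bucket: str,
    endpoint: str,
    region: str = "us-east-1",
    env: Mapping[str, str] = os.environ,
) -> str:
    """Build litestream.yml text, reading the keys from the environment."""
    key_id = env["MINIO_ACCESS_KEY_ID"]  # hypothetical variable name
    secret = env["MINIO_SECRET_ACCESS_KEY"]  # hypothetical variable name
    url = f"s3://{bucket}/"
    q = json.dumps  # JSON strings are valid double-quoted YAML scalars
    lines = [
        "dbs:",
        f"  - path: {q(db_path)}",
        "    replicas:",
        f"      - url: {q(url)}",
        f"        endpoint: {q(endpoint)}",
        f"        region: {q(region)}",
        f"        access-key-id: {q(key_id)}",
        f"        secret-access-key: {q(secret)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    fake_env = {"MINIO_ACCESS_KEY_ID": "example-id", "MINIO_SECRET_ACCESS_KEY": "example-secret"}
    print(
        render_litestream_config(
            "/path/to/database.db",
            "my_bucket",
            "https://my-minio-endpoint.com",
            env=fake_env,
        )
    )
```

Write the returned text to litestream.yml (or /etc/litestream.yml) and keep that file out of version control, since it still contains the secret.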

Now run a litestream replication.

litestream replicate -config litestream.yml
# or put the config in /etc/litestream.yml and just run replicate
litestream replicate

benbjohnson/litestream: Streaming replication for SQLite. github.com

[1] `litestream` is a sick cli tool for streaming replicas of sqlite. It automatically does daily snapshots, and streams all of the writes to the replica live.

install

Install is fast using installer, no compilation, just copy the binary and run.

curl https://i.wayl.one/benbjohnson/litestream

References: [1]: https://github.com/benbjohnson/litestream

why-is-postgres-default

Serious question.

No one ever got fired for choosing PostgreSQL

But why? It’s the most loved db, right? Right? Maybe it’s time to rethink it. Don’t get me wrong, if I need a relational db as a service, PostgreSQL is going to be my first choice, but why do I need to run a separate application for it?

Tutorials use sqlite

Why is that? Because there is nothing else to stand up. Nothing else to maintain. And you probably already have it installed on just about anything that has a battery.

SQLite runs in memory

Don’t need, or maybe don’t want, to persist state? Run it in memory. This is a nice feature for running tests.

Less exposure

SQLite is a file on your filesystem. It’s not a web service. It’s not a cloud service. Not that postgres is insecure, but it is one more endpoint that you have to think about securing. This means it is probably also cheaper 🤑

SQLite is easy to replicate

Want to run your new feature with prod data? Pull a replica or...
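On the in-memory point, here is a minimal sketch of what that looks like in a test, using Python's built-in sqlite3 module. The post table and helper names are invented for the example.

```python
import sqlite3


def make_test_db() -> sqlite3.Connection:
    """Create a fresh in-memory database with a small example schema."""
    con = sqlite3.connect(":memory:")  # nothing is written to disk
    con.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
    return con


def count_posts(con: sqlite3.Connection) -> int:
    return con.execute("SELECT COUNT(*) FROM post").fetchone()[0]


def test_posts_start_empty_and_grow() -> None:
    con = make_test_db()
    assert count_posts(con) == 0
    con.execute("INSERT INTO post (title) VALUES (?)", ("why-is-postgres-default",))
    assert count_posts(con) == 1
    con.close()
    # Every new in-memory connection starts from a clean slate.
    assert count_posts(make_test_db()) == 0


if __name__ == "__main__":
    test_posts_start_empty_and_grow()
    print("ok")
```

Each test gets its own database with no server to start and nothing to clean up afterwards.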

Why I Built Litestream - Litestream litestream.io

Why I Built Litestream - Litestream Despite an exponential increase in computing power, our applications require more machines than ever because of architectural decisions made 25 years ago. You can eliminate much of your complexity ... litestream.io [1] As applications scale to the edge, to put compute as close to the user as possible, database queries back to the master node get slower and slower. Enter sqlite replication: put the database with the application code and replicate from master. References: [1]: https://litestream.io/blog/why-i-built-litestream/

I'm All-In on Server-Side SQLite · The Fly Blog fly.io

I'm All-In on Server-Side SQLite Ben Johnson has joined Fly.io Fly · fly.io [1] SQLite is the next big database trend. With more horizontal scaling and close-to-user, read-heavy applications, having your database in the same application stack makes a lot of sense. Tools like litestream are going to enable global distribution in an impressive way. References: [1]: https://fly.io/blog/all-in-on-sqlite-litestream/

LiteFS Cloud: Distributed SQLite with Managed Backups · The Fl... fly.io

LiteFS Cloud: Distributed SQLite with Managed Backups Documentation and guides from the team at Fly.io. Fly · fly.io [1] Fly.io’s solution to sqlite managed backups. I definitely want to look into this a bit, but more so the tech under the hood, litestream. References: [1]: https://fly.io/blog/litefs-cloud/

SQLite FTS5 Extension www.sqlite.org

[1] sqlite FTS5 ships with several built-in tokenizers: unicode61 (the default), ascii, porter, and trigram. These can be used with sqlite-utils.

sqlite-utils enable-fts --tokenize porter database.db post title message tags

And with the python api.

from sqlite_utils import Database

search = "duckdb"  # example query string
db = Database('database.db')
db["post"].enable_fts(
    ["title", "message", "tags"], create_triggers=True, tokenize="trigram"
)
posts = list(db["post"].search(search))

References: [1]: /static/https://www.sqlite.org/fts5.html
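To get a feel for what the tokenizer choice changes, here is a small sketch using only Python's built-in sqlite3 module, assuming your Python's sqlite was compiled with FTS5 (most are). The sample rows and the `search_posts` helper are made up for illustration. The porter tokenizer stems words, so a query for "search" also finds "searching", while the ascii tokenizer only matches whole tokens.

```python
import sqlite3

# Made-up sample rows: (title, message, tags)
POSTS = [
    ("searching my thoughts locally", "full text search with sqlite-utils", "data"),
    ("litestream", "streaming replicas of sqlite", "data"),
]

ALLOWED_TOKENIZERS = {"unicode61", "ascii", "porter", "trigram"}


def search_posts(query: str, tokenize: str = "porter") -> list[str]:
    """Return titles of sample posts matching query under the given tokenizer."""
    if tokenize not in ALLOWED_TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {tokenize}")
    con = sqlite3.connect(":memory:")
    # The tokenizer is part of the table definition, so it cannot be a bound parameter.
    con.execute(
        "CREATE VIRTUAL TABLE post_fts "
        f"USING fts5(title, message, tags, tokenize='{tokenize}')"
    )
    con.executemany(
        "INSERT INTO post_fts (title, message, tags) VALUES (?, ?, ?)", POSTS
    )
    rows = con.execute(
        "SELECT title FROM post_fts WHERE post_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    con.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(search_posts("replica", tokenize="porter"))
    print(search_posts("replica", tokenize="ascii"))
```

Stemming is handy for prose like blog posts, while trigram is the one to reach for when you want substring matches.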

simonw/datasette-render-markdown: Datasette plugin for renderi... github.com

[1] datasette really does everything, doesn’t it! References: [1]: https://github.com/simonw/datasette-render-markdown

`ValueError: Constraint must have a name` in alembic 1.10.0 · ... github.com

`ValueError: Constraint must have a name` in alembic 1.10.0 · Issue #1195 · sqlalchemy/alembic [1]

After a nasty time with alembic upgrades, thoughts is about to get a new users table. This may have come from incorrectly setting up alembic for sqlite from the start, but I was able to fix the problem with this GitHub issue.

alembic sqlite ValueError: Constraint must have a name

The change I needed to make to get my migration to run:

+ batch_op.create_foreign_key('fk_post_author_id_user', 'user', ['author_id'], ['id'])

References: [1]: https://github.com/sqlalchemy/alembic/issues/1195

Use Alembic Check to check for possible upgrades

Since I started using alembic I have been just running a new revision, checking its content, and deleting it if it’s empty. Today I learned there is an alembic check command to check whether there are new upgrade operations that need a revision.

❯ alembic check
INFO [alembic.runtime.migration] Context impl SQLiteImpl.
INFO [alembic.runtime.migration] Will assume non-transactional DDL.
No new upgrade operations detected.

DuckDB vs. MotherDuck — should you switch to the cloud version... kestra.io

DuckDB vs. MotherDuck: When to Move to the Cloud | Kestra DuckDB is fast and free. MotherDuck adds cloud storage, collaboration, and scale. Here kestra.io [1] duckdb is a new in-process database that has been making the rounds in analytics for its high performance in those applications. MotherDuck is a centralized server that brings managed storage, data sharing, and an IDE to duckdb. References: [1]: https://kestra.io/blogs/2023-07-28-duckdb-vs-motherduck

s3-tree · PyPI pypi.org

s3-tree list s3 objects in tree-like format. PyPI · pypi.org [1] Super useful way to show a tree view of an s3 bucket’s structure!

pip install s3-tree
s3-tree bucketname

References: [1]: https://pypi.org/project/s3-tree/

kndndrj/nvim-dbee: Interactive database client for neovim github.com

[1] A neovim database client that I need to check out. References: [1]: https://github.com/kndndrj/nvim-dbee
