-
sqlite vacuum
Today I learned how to VACUUM a sqlite database and cut its size in about half. It's a database that I have had running for quite a while and has some decent traffic on it. Why is it important to do a VACUUM? In short, because the database file gets fragmented as data is updated and deleted. On delete, rows are removed and their pages are marked as available for reuse inside the database file, but the space is never given back to the filesystem. To VACUUM a database, run the following sql command. You can do it right from…
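The excerpt cuts off before the snippet, but the command itself is one word. A minimal sketch, assuming a database file named app.db, run straight from Python's built-in sqlite3 module:

```python
import sqlite3

con = sqlite3.connect("app.db")
con.isolation_level = None  # autocommit; VACUUM cannot run inside a transaction
con.execute("VACUUM")       # rebuilds the file, reclaiming free pages
con.close()
```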
-
Set up minio bucket entrypoint
I recently set up minio object storage in my homelab for litestream sqlite backups. The litestream quickstart made it easy to get everything up and running on localhost, but I hit a wall when DNS was involved to pull it from a different machine. Here is what I got to work. First I had to configure the Key ID and Secret Access Key generated in the minio UI, then set the s3 signature_version to s3v4. Now when I have minio running on https://my-minio-endpoint.com I can use the aws cli to access…
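The post uses the aws cli, but the same configuration can be sketched in Python with boto3; the endpoint and credentials below are placeholders, and the key detail is signature_version="s3v4":

```python
import boto3
from botocore.client import Config

# placeholder endpoint and credentials from the minio UI
s3 = boto3.client(
    "s3",
    endpoint_url="https://my-minio-endpoint.com",
    aws_access_key_id="MY_KEY_ID",
    aws_secret_access_key="MY_SECRET_ACCESS_KEY",
    config=Config(signature_version="s3v4"),  # the setting that made DNS access work
)
print(s3.list_buckets())
```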
-
Why Is Postgres the Default?
Serious question. No one ever got fired for choosing PostgreSQL. But why? It's the most loved db, right? Right? Maybe it's time to rethink it. Don't get me wrong, if I need a relational db as a service, PostgreSQL is going to be my first choice, but why do I need to run a separate application for it? Tutorials use sqlite. Why is that? Because there is nothing else to stand up. Nothing else to maintain. And you probably already have it installed on just about anything that has a battery. SQLite ru…
-
JUT | Read Notebooks in the Terminal
Trying to read a .ipynb file without starting a jupyter server? jut has you covered. https://youtu.be/t8AvImnwor0 (watch the video version of this post on YouTube). jut is packaged and available on PyPI, so installing is as easy as pip installing it. This is my first time including snippets of the video in the article like this, let me know what you think! What are all the commands available for jut? Take a look at the help of the cli…
-
Minimal Kedro Pipeline
How small can a minimal kedro pipeline, ready to package, be? I made one in 4 files that you can pip install. It's only a total of 35 lines of python, 8 in one file and 27 in the other. 📝 Note: this is only a composable pipeline, not a full project; it does not contain a catalog or runner. I have everything for this post hosted in this GitHub repo; you can fork it, clone it, or just follow along. Installation caveats: this repo represents the minimal amount of structure to build a ked…
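A sketch of what the pipeline-side file might look like; the node and dataset names here are made up, not the ones from the repo:

```python
from kedro.pipeline import Pipeline, node

def double(x):
    # trivial stand-in; a real node would do real work
    return x * 2

def create_pipeline(**kwargs):
    return Pipeline([node(double, inputs="raw", outputs="doubled")])
```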
-
Kedro - My Data Is Not A Table
In python data science/engineering, most of our data is in the form of some sort of table, typically a DataFrame from a library like pandas, spark, or dask. DataFrames are the heart of most pipelines. These containers come with many convenient methods for manipulating table-like data structures. Sometimes we leverage other data types, namely vanilla types like lists and dicts, or even numpy data types. Unfamiliar with kedro? Check out this post: [[ what-is-kedro ]]. Sometimes datasets are not t…
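For example, a node can take and return plain python objects just as easily as DataFrames; this sketch uses made-up dataset names:

```python
from kedro.pipeline import Pipeline, node

def summarize(scores: list) -> dict:
    # vanilla types in, vanilla types out; no table required
    return {"count": len(scores), "mean": sum(scores) / len(scores)}

pipeline = Pipeline([node(summarize, inputs="scores", outputs="score_summary")])
```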
-
Gracefully adopt kedro, the catalog
Why use the kedro catalog? While using the catalog alone will not reap all of the benefits of the framework, it does get you and your project ready for the full framework eventually. For me the full benefit of the catalog comes when you combine it with the pipeline and don't even touch read/write steps at all. Taking a step into kedro by adopting the catalog first will give you a way to organize all of your data loads in one place, and stop manually writing read/write code, which can be different fo…
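A sketch of the catalog used standalone, with made-up keys and filepaths; every load and save goes through one object instead of scattered read/write calls:

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog({"cars": CSVDataSet(filepath="data/cars.csv")})

cars = catalog.load("cars")   # read, no pd.read_csv in your code
catalog.save("cars", cars)    # write, no .to_csv either
```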
-
How to find things in your kedro catalog
kedro 0.16.2 just dropped last week with a long-awaited feature... catalog search! I had gone as far as monkey patching this into each of my projects. I jump between a few really big projects that have tons of datasets, and being able to quickly search for what I need is so useful. The kedro data catalog is a key component of the kedro framework. It handles all data loading and saving for you. It is configurable and hackable. Having all your data connections listed in one place…
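A sketch of the search, assuming the regex_search argument to catalog.list (dataset names and paths are made up):

```python
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet

catalog = DataCatalog({
    "raw_cars": CSVDataSet(filepath="data/raw_cars.csv"),
    "clean_cars": CSVDataSet(filepath="data/clean_cars.csv"),
})

print(catalog.list())                    # every dataset name
print(catalog.list(regex_search="raw"))  # only names matching "raw"
```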
-
How Kedro handles your inputs
Passing inputs into kedro is a key concept. A single catalog key as input is trivial and easily makes sense, but passing a list or dictionary of catalog entries can be a bit confusing. For a review of how *args/**kwargs work in python, check out this post: [[ python-args-kwargs ]], a python args and kwargs article by @_waylonwalker. All kedro inputs are catalog entries. When kedro runs your pipeline it uses the catalog to imperatively load your data, mea…
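A sketch of the three input shapes, with made-up node functions and catalog keys:

```python
from kedro.pipeline import node

def train(features): ...
def join(left, right): ...
def report(df, params): ...

# single key: loaded and passed as the one positional argument
node(train, inputs="features", outputs="model")

# list: loaded and unpacked positionally, in order
node(join, inputs=["left_df", "right_df"], outputs="joined")

# dict: argument name -> catalog key, passed as keyword arguments
node(report, inputs={"df": "joined", "params": "parameters"}, outputs="report_table")
```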
-
Create Custom Kedro Dataset
Kedro provides an efficient way to build out data catalogs with their yaml api. It allows you to be very declarative about loading and saving your data. For the most part you just need to tell Kedro what connector to use and its filepath. When running, Kedro takes care of all of the read/write; you just reference the catalog key. But what is happening behind the scenes? Under the hood there is an AbstractDataSet that each connector inherits from. It sets up a lot of the behind-the-scenes structure for us so…
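A minimal sketch of a custom connector built on that base class; TextDataSet is an invented example, not one that ships with kedro:

```python
from pathlib import Path
from kedro.io import AbstractDataSet

class TextDataSet(AbstractDataSet):
    """Toy connector that reads and writes a plain text file."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> str:
        return self._filepath.read_text()

    def _save(self, data: str) -> None:
        self._filepath.write_text(data)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```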
-
What is YOUR Advice for New Data Scientists
Learn the business. Learn Git. Your code does not need to be amazing. Keep learning. Learn Git: you don't have to start out as a git wizard with the cleanest possible commit history. At first don't let yourself get too wrapped up in it; the most important part is that you make commits. You will find needs for more advanced stuff later. Get comfortable with this, then learn the more advanced commands. Your code does not need to be amazing: get the job done. Keep it in small bite-size pieces. Make readable…
-
Filtering Pandas
query: good for method chaining, i.e. adding more methods or filters without assigning a new variable. masking: general purpose, this is probably the most common method you see in training/examples. isin: capable of including multiple strings to match. contains: good for partial matches. Masks: anything that we put inside of square brackets can be set as a variable then passed in. Operators: & is and, ~ is not, | is or. For example AVAILABLE & NAME (available and name), AVAILABLE | NAME (available or name), AVAILABLE & ~NAME (available and not name).
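All four styles side by side on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["ada", "bob", "cam"],
    "available": [True, False, True],
})

df.query("available")                # query: chains without intermediate variables
df[df["available"]]                  # mask: the general-purpose square-bracket style
df[df["name"].isin(["ada", "cam"])]  # isin: several exact matches at once
df[df["name"].str.contains("a")]     # contains: partial matches

# masks saved as variables, combined with the operators above
AVAILABLE = df["available"]
NAME = df["name"] == "ada"
df[AVAILABLE & NAME]   # available and name
df[AVAILABLE | NAME]   # available or name
df[AVAILABLE & ~NAME]  # available and not name
```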
-
Clean up Your Data Science with Named Tuples
If you are a regular listener of TalkPython or PythonBytes you have heard Michael Kennedy talk about Named Tuples many times, but what are they and how do they fit into a data science workflow? Example: as you graduate your scripts into modules and libraries, you might start to notice that you need to pass a lot of data around to all of the functions that you have created. For example, if you are running some analysis utilizing several datasets, you may need to calculate total revenue, inventory…
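A sketch of the idea; the field names here are invented, since the excerpt lost the originals:

```python
from collections import namedtuple
import pandas as pd

# one bundle instead of three loose arguments
Data = namedtuple("Data", ["sales", "inventory", "customers"])

def total_revenue(data: Data) -> float:
    # readable attribute access instead of positional unpacking
    return (data.sales["price"] * data.sales["quantity"]).sum()

data = Data(
    sales=pd.DataFrame({"price": [9.99, 19.99], "quantity": [3, 1]}),
    inventory=pd.DataFrame(),
    customers=pd.DataFrame(),
)
print(total_revenue(data))
```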
-
Background Tasks in Python for Data Science
This post is intended as an extension/update from background tasks in python. I started using the background library the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started.
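A sketch of what that looks like, assuming the library is Kenneth Reitz's background package, with a pretend slow scoring function:

```python
import background

background.n = 4  # size of the worker pool

@background.task
def score(row):
    # stand-in for a slow model call or IO-bound fetch
    return row * 2

# calls return immediately; the work runs on the pool
futures = [score(i) for i in range(10)]
results = [f.result() for f in futures]
```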