Posts tagged: python

What is YOUR Advice for New Data Scientists

- Learn the business
- Learn Git [1]
- Your code does not need to be amazing
- Keep Learning

Learn Git # [2]

You don't have to start out as a git wizard with the cleanest possible commit history. At first, don't let yourself get too wrapped up in it; the most important part is that you make commits. You will find needs for more advanced stuff later.

    git add .
    git commit -m "FEAT added new function to calculate revenue by product family"
    git push

Get comfortable with this, then learn how to branch, rebase, stash, etc.

Your code does not need to be amazing # [3]

Get the job done. Keep it in small bite-size pieces. Make readable function definitions and variable names. You will thank yourself for naming things well later. Readability counts more than performance in most cases of data science. If it gets the job done, try not to worry too much about things like performance. A few extra seconds to clean a dataset or build a model is not worth hours of your time. As you go you will have c...

What is Kedro

What is Kedro [1]

This is my original what-is-kedro article. There is a brand new one.

kedro [2] is an open-source data pipeline framework. It provides guardrails to set your project up right from the start without needing to know deeply how to set up your own python library for data pipelining. It includes great ways to manipulate catalogs and pipelines. This article will cover the 10K view of kedro [2]; future articles will dive deeper into each one.

Libraries # [3]

Currently, kedro [2] is broken down into 3 different libraries.

- 💎 kedro [2]
- 📉 kedro-viz [4]
- 🏗 kedro-docker [5]

kedro [2] # [6]

[7] kedro [2] ...

Long variable names are good

🏷️ Long variable names are a good thing. Self-documenting code is more important than poorly documented code. Simply adding a few characters to your variable names can go a long way.

Containers are plural # [1]

Aliases are welcome # [2]

Scope is important

References: [1]: #containers-are-plural [2]: #aliases-are-welcome
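A small sketch of what the headings above might look like in practice. The data and every name here are made up for illustration and do not come from the original post.

```python
# Hypothetical data, for illustration only.
prices_by_product = {"widget": 2.50, "gadget": 10.00}
units_sold_by_product = {"widget": 4, "gadget": 1}


def calculate_total_revenue(prices_by_product, units_sold_by_product):
    """Sum price * units over every product that has sales."""
    total_revenue = 0.0
    # Containers are plural; the loop variables are the singular form.
    for product, units_sold in units_sold_by_product.items():
        total_revenue += prices_by_product[product] * units_sold
    return total_revenue


# Scope matters: a one-letter name is fine inside a tiny comprehension...
product_names = sorted(p for p in prices_by_product)
# ...while longer-lived names spell out what they hold.
total_revenue = calculate_total_revenue(prices_by_product, units_sold_by_product)
```

The longer names cost a few keystrokes, but the function reads almost like the sentence in its docstring.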

simple click 2

simple click

CLI tools are super handy and easy to add to your python libraries to supercharge them. Even if your library is not a CLI tool, there are a number of things that a CLI can do for your library.

Example Ideas # [1]

Things a CLI can do to enhance your library.

- 🆚 print version
- 🕶 print readme
- 📝 print changelog
- 📃 print config
- ✏ change config
- 👩‍🎓 run a tutorial
- 🏗 scaffold a project with cookiecutter

🖱 Click [2] # [3]

Click [2] is the most popular CLI framework for python. There are others, some old and some newcomers that may take the crown. For now, Click [2] is the gold standard if you want to make a powerful CLI quickly. If you are dependency conscious and don't need a lot of tooling, use argparse [4].

Project Structure # [5]

    .
    ├── setup.py
    └── simple_click
        ├── cli.py
        └── __init__.py

❯ cli.py # [6]

    # simple_click/cli.py
    import click

    __version__ = "1.0.0"


    @click.group()
    def cli():
        pass


    @cli.command()
    def version():
        """prints project version"""
        click.echo(__...
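For the dependency-conscious route mentioned above, here is a rough sketch of the same "print version" idea using only the standard library's argparse. The program name `simple-cli` and the function names are assumptions made for this example, not part of the original post.

```python
import argparse

__version__ = "1.0.0"  # placeholder version, mirroring the excerpt above


def build_parser():
    """Build a parser with a single `version` subcommand."""
    parser = argparse.ArgumentParser(prog="simple-cli")  # hypothetical name
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("version", help="prints project version")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and run the chosen command."""
    args = build_parser().parse_args(argv)
    if args.command == "version":
        print(__version__)
        return __version__


# Passing the arguments explicitly keeps the example easy to try out.
result = main(["version"])
```

In a real package `main` would be wired up as a console script entry point rather than called with a hard-coded argument list.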

SqlAlchemy Models

Make a connection # [1]

    from sqlalchemy import create_engine

    def get_engine():
        return create_engine("sqlite:///mode_examples.sqlite")

Make a session # [2]

    from sqlalchemy.orm import sessionmaker

    def get_session():
        con = get_engine()
        Base.bind = con
        # pass the engine explicitly so the tables are created on it
        Base.metadata.create_all(con)
        Session = sessionmaker(bind=con)
        session = Session()
        return session

Make a Base Class # [3]

    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()
    Base.metadata.bind = get_engine()

Make your First Model # [4]

    from sqlalchemy import Column, Text

    class User(Base):
        __tablename__ = "users"
        username = Column('username', Text())
        firstname = Column('firstname', Text())
        lastname = Column('lastname', Text())

Make your own Base Class to inherit From # [5]

    class MyBaseHelper:
        def to_dict(self):
            return {k: v for k, v in self.__dict__.items() if k[0] != "_"}

        def update(self, **attrs):
            for key, value in attrs.items():
                if hasattr(self, key):
                    setattr(self, key, value)

Use the Custom Base Class # [6]

    class User(Ba...
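To see what the helper methods do without a database, here is a minimal sketch that mixes `MyBaseHelper` into a plain class. `PlainUser` and its values are hypothetical stand-ins for a real model.

```python
class MyBaseHelper:
    def to_dict(self):
        # Skip private attributes (names starting with an underscore).
        return {k: v for k, v in self.__dict__.items() if not k.startswith("_")}

    def update(self, **attrs):
        # Only overwrite attributes that already exist on the object.
        for key, value in attrs.items():
            if hasattr(self, key):
                setattr(self, key, value)


class PlainUser(MyBaseHelper):
    """Hypothetical stand-in for a model; no database involved."""

    def __init__(self, username, firstname, lastname):
        self.username = username
        self.firstname = firstname
        self.lastname = lastname
        self._internal = "hidden"


user = PlainUser("jdoe", "Jane", "Doe")
user.update(firstname="J.", nickname="ignored")  # nickname is not an attribute
user_dict = user.to_dict()
```

Unknown keys are silently dropped by `update`, which is convenient for form input but can hide typos in key names.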

Building Cli apps in Python

Packages # [1]

Click [2] # [3]

Inputs # [4]

Click primarily takes two forms of input: options and arguments. I think of options as keyword arguments and arguments as regular positional arguments.

Option # [5]

- typically aliased with a shorthand ('-v', '--verbose')

---

From the Docs [6]

To get the Python argument name, the chosen name is converted to lower case, up to two dashes are removed as the prefix, and other dashes are converted to underscores.

    @click.command()
    @click.option('-s', '--string-to-echo')
    def echo(string_to_echo):
        click.echo(string_to_echo)

    @click.command()
    @click.option('-s', '--string-to-echo', 'string')
    def echo(string):
        click.echo(string)

---

Argument # [7]

- positional
- required
- no help text supplied by click

Yaspin [8] # [9]

Click Help Colors [11] # [12]

Colorama [15] # [16]

Colorama Example [17]

Click DidYouMean [18] # [19]

References: [1]: #packages [2]: https://click.pal...
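The naming rule quoted from the docs can be mimicked in a few lines of plain Python. This is only an illustration of the quoted sentence, not Click's actual implementation, and `python_argument_name` is a made-up helper.

```python
def python_argument_name(option_name):
    """Approximate the quoted rule for turning an option into an argument name."""
    name = option_name.lower()
    # Remove up to two leading dashes.
    for _ in range(2):
        if name.startswith("-"):
            name = name[1:]
    # Any remaining dashes become underscores.
    return name.replace("-", "_")


long_name = python_argument_name("--string-to-echo")
short_name = python_argument_name("-s")
```

This is why the first `echo` example above receives its value as `string_to_echo`.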

Kedro

See all of my kedro related posts in [[ tag/kedro ]].

#kedrotips [1] # [2]

I am tweeting out most of these snippets as I add them; you can find them all here: #kedrotips [3].

🗣 Heads up # [4]

Below are some quick snippets/notes for when using kedro to build data pipelines. So far I am just compiling snippets. Eventually I will create several posts on kedro. These are mostly things that I use in my everyday work with kedro. Some are a bit more esoteric. Some are helpful when writing production code; some are more useful for exploration.

📚 Catalog # [5]

[6] Photo by jesse orrico on Unsplash

CSVLocalDataSet # [7]

    import pandas as pd
    from kedro.io import CSVLocalDataSet

    iris = pd.read_csv('https://raw.githubusercontent.com/kedro-org/kedro/d3218bd89ce8d1148b1f79dfe589065f47037be6/kedro/template/%7B%7B%20cookiecutter.repo_name%20%7D%7D/data/01_raw/iris.csv')
    iris_data_set = CSVLocalDataSet(filepath="test.csv", load_args=None, save_args={"index": False})
    iris_data_set.save(iris)
    reloaded_iris = iris_data_se...

📝 Packages to Investigate Notes

- jmespath
- Tabnine

Bulwark # [1]

github: https://github.com/zaxr/bulwark

I definitely want to try this out with kedro.

Bulwark is a package for convenient property-based testing of pandas dataframes, supported for Python 3.5+.

Example # [2]

    import bulwark.decorators as dc

    @dc.IsShape((-1, 10))
    @dc.IsMonotonic(strict=True)
    @dc.HasNoNans()
    def compute(df):
        # complex operations to determine result
        ...
        return result_df

References: [1]: #bulwark [2]: #example

Debugging Python

Using pdb # [1] References: [1]: #using-pdb

Just Use Pathlib

Pathlib is an amazing cross-platform path tool.

Import # [1]

    from pathlib import Path

Create path object # [2]

Current Directory

    cwd = Path('.').absolute()

User's Home Directory

    home = Path.home()

Module directory

    module_path = Path(__file__).parent

Others

Let's create a path relative to our current module.

    data_path = Path(__file__).parent / 'data'

Check if files exist # [3]

Make Directories # [4]

    data_path.mkdir(parents=True, exist_ok=True)

rename files # [5]

    Path(data_path / 'example.csv').rename(data_path / 'real.csv')

List files # [6]

Glob Files # [7]

    data_path.glob('*.csv')

recursively

    data_path.rglob('*.csv')

Write # [8]

    Path(data_path / 'meta.txt').write_text(f'created on {datetime.datetime.today()}')

References: [1]: #import [2]: #create-path-object [3]: #check-if-files-exist [4]: #make-directories [5]: #rename-files [6]: #list-files [7]: #glob-files [8]: #write
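The snippets above can be strung together into one runnable sketch. It works in a temporary directory so nothing on disk is left behind. Using `exists()` for the "check if files exist" step and `iterdir()` for "list files" is an assumption about what those sections cover, since the excerpt shows no code for them.

```python
import datetime
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Make directories
    data_path = Path(tmp) / "data"
    data_path.mkdir(parents=True, exist_ok=True)

    # Write a file, then check that it exists
    csv_path = data_path / "example.csv"
    csv_path.write_text("a,b\n1,2\n")
    file_existed = csv_path.exists()

    # Rename within the same directory
    csv_path.rename(data_path / "real.csv")

    # Write a small metadata file
    (data_path / "meta.txt").write_text(f"created on {datetime.datetime.today()}")

    # Glob for csv files and list everything in the directory
    csv_names = sorted(p.name for p in data_path.glob("*.csv"))
    all_names = sorted(p.name for p in data_path.iterdir())
```

Passing a full path to `rename` keeps the file next to its siblings; a bare filename would be resolved against the current working directory instead.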

Filtering Pandas

query # [1]

Good for method chaining, i.e. adding more methods or filters without assigning a new variable.

    # is
    skus.query('AVAILABILITY == "AVAILABLE"')
    # is not
    skus.query('AVAILABILITY != "AVAILABLE"')

masking # [2]

General purpose; this is probably the most common method you see in training/examples.

    # is
    skus[skus['AVAILABILITY'] == 'AVAILABLE']
    # is not (the parentheses matter: ~ binds tighter than ==)
    skus[~(skus['AVAILABILITY'] == 'AVAILABLE')]

isin # [3]

Capable of matching against multiple strings.

    # is in
    df[df.AVAILABILITY.isin(['AVAILABLE', 'AVL'])]
    # is not in
    df[~df.AVAILABILITY.isin(['AVAILABLE', 'AVL'])]

contains # [4]

Good for partial matches.

    # contains
    df[df.AVAILABILITY.str.contains('AVA')]
    # not contains
    df[~df.AVAILABILITY.str.contains('AVA')]

MASKS # [5]

Anything that we put inside of square brackets can be set as a variable and then passed in.

    service_mask = skus['AVAILABILITY'] == 'AVAILABLE'
    name_mask = skus['NAME'] == 'Dell chromebook 11'

Operators # [6]

- & - and
- ~ - not
- | - or

AVAILABLE and ...
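A short sketch of combining the two masks with each operator. The three-row `skus` frame is invented for this example; only the column names and the mask definitions come from the excerpt.

```python
import pandas as pd

# Hypothetical data, just enough rows to exercise each operator.
skus = pd.DataFrame(
    {
        "NAME": ["Dell chromebook 11", "Dell chromebook 11", "Other laptop"],
        "AVAILABILITY": ["AVAILABLE", "SOLD OUT", "AVAILABLE"],
    }
)

service_mask = skus["AVAILABILITY"] == "AVAILABLE"
name_mask = skus["NAME"] == "Dell chromebook 11"

available_chromebooks = skus[service_mask & name_mask]  # and
unavailable = skus[~service_mask]                        # not
available_or_chromebook = skus[service_mask | name_mask]  # or
```

Naming the masks first keeps each filter readable and lets the same mask be reused across several selections.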

Pyspark

I have been using pyspark since March 2019; here are my thoughts.

Making good documentation in python

Tools Sphinx # [1] Portray # [2] I just started using portray and it is amazingly simple to use! Methodology References: [1]: #sphinx [2]: #portray

Quick Progress Bars in python using TQDM

tqdm is one of my favorite general purpose utility libraries in python. It allows me to see the progress of multipart processes as they happen. I really like this when I am developing something that takes some amount of time and I am unsure of its performance. It allows me to be patient when the process is going well and will finish in sufficient time, and allows me to 💥 kill it and find a way to make it perform better if it will not finish in sufficient time.

[1] for more gifs like these follow me on twitter @waylonwalker [2]

Add a simple Progress bar!

    from tqdm import tqdm
    from time import sleep

    for i in tqdm(range(10)):
        sleep(1)

convenience

TQDM also has a convenience function called trange that wraps the range function with a tqdm progress bar automatically.

    from tqdm import trange
    from time import sleep

    for i in trange(10):
        sleep(1)

notebook support

There is also notebook support. If you are bouncing between ipython and jupyter I recommend importing from the auto ...

Clean up Your Data Science with Named Tuples

If you are a regular listener of TalkPython [1] or PythonBytes you have heard Michael Kennedy talk about Named Tuples many times, but what are they and how do they fit into my data science workflow?

Example # [2]

As you graduate your scripts into modules and libraries you might start to notice that you need to pass a lot of data around to all of the functions that you have created. For example, say you are running some analysis utilizing sales, inventory, and pricing data. You may need to calculate total revenue or inventory on hand. You may need to pass these data sets into various models to drive production or pricing based on predicted volumes.

Load data # [3]

Here we set up functions that can load data from the sales database. Assume that we also have similar functions get_inventory and get_pricing.

    from sqlalchemy import create_engine

    def get_engine():
        engine = create_engine('postgresql://scott:tiger@localhost:5432/mydatabase')
        return engine

    def get_sales():
        '''
        gets sales history from the sales database
        '''
        engine = ge...
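As a rough sketch of where this is heading, a named tuple can bundle the three data sets so functions take one argument instead of three. Plain lists stand in for the database tables here, and `SalesData` with its field names is an assumption made for this example.

```python
from typing import List, NamedTuple


class SalesData(NamedTuple):
    """Hypothetical container for the data sets passed between functions."""

    units_sold: List[int]
    inventory_on_hand: List[int]
    unit_prices: List[float]


def total_revenue(data: SalesData) -> float:
    # Revenue is units sold times unit price, summed over products.
    return sum(units * price for units, price in zip(data.units_sold, data.unit_prices))


def total_inventory(data: SalesData) -> int:
    return sum(data.inventory_on_hand)


data = SalesData(units_sold=[3, 2], inventory_on_hand=[10, 5], unit_prices=[1.5, 4.0])
revenue = total_revenue(data)
inventory = total_inventory(data)
```

Fields are accessed by name, so adding another data set later does not break the signature of any function that already accepts `SalesData`.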

Background Tasks in Python for Data Science

This post is intended as an extension/update from background tasks in python [1]. I started using background the week that Kenneth Reitz released it. It takes away so much boilerplate from running background tasks that I use it in more places than I probably should. After taking a look at that post today, I wanted to put a better data science example in here to help folks get started.

Before we get into it, I want to give a shout out to Kenneth Reitz for making this so easy. Kenneth is a python God for all that he has given to the community in so many w...
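For a feel of the pattern, here is a sketch that loads several data sets in the background using only the standard library. It deliberately uses `concurrent.futures` as a stand-in and is not the API of the background package discussed in the post; `load_dataset` and the data set names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_dataset(name):
    """Hypothetical slow loader; the sleep stands in for a database query."""
    time.sleep(0.1)
    return {"name": name, "rows": len(name)}


dataset_names = ["sales", "inventory", "pricing"]

with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit every load at once so they run concurrently...
    futures = {name: executor.submit(load_dataset, name) for name in dataset_names}
    # ...then block only when the results are actually needed.
    datasets = {name: future.result() for name, future in futures.items()}
```

The three loads overlap, so the total wait is close to the slowest single load rather than the sum of all of them.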

📝 Bash Notes

Bash is super powerful.

File System Full # [1]

Show Remaining Space on Drives

    df -h

Show largest files in current directory

    du . -h --max-depth=1

Move files then symlink them

    mkdir /mnt/mounted_drive
    mv ~/bigdir /mnt/mounted_drive
    ln -s /mnt/mounted_drive/bigdir ~/bigdir

Fuzzy One Liners # [2]

    a() {source activate "$(conda info --envs | fzf | awk '{print $

edit in vim

    vf() { fzf | xargs -r -I % $EDITOR % ;}

cat a file

    vf() { fzf | xargs -r -I % $EDITOR % ;}

bash execute

    bf() { bash "$(fzf)"; }

git [3] add

    gadd() { git status -s | fzf -m | awk '{print $2}' | xargs git add && git status -s; }

git reset

    greset() { git status -s | fzf -m | awk '{print $2}' | xargs git reset && git status -s; }

Kill a process

    fkill() { kill $(ps aux | fzf | awk '{print($2)}'); }

Finding things # [4]

Files # [5]

fd-find [6] is amazing for finding files; it even respects your .gitignore file 😲. Install with apt install fd-find.

    fd md
    ag -g python
    find . -name "*.md"

++Vanilla Bonus Content # [7] ** sh...

Autoreload in Ipython

I have used %autoreload for several years now with great success and 🔥 rapid reloads. It allows me to move super fast when developing libraries and modules. They have made some great updates this year that allow class modules to be updated automatically.

What I like about autoreload # [1]

- 🔥 Blazing Fast
- 💥 Keeps me in the comfort of my text editor
- 👏 Allows me to use Jupyter when I need
- 👟 Extremely Reliable

One of the biggest benefits that I find is that it shortens the distance between my module/library code and test code inside of a terminal/notebook.

Now I primarily use jupyter notebooks for the presentation aspect. I develop code from the comfort of my editor with all of the tools I have set up, and run the functions in a notebook to get the output. From there I might do some aggregations or plots, but the 🥩 meat of development is done outside of jupyter.

Enabling Autoreload # [2]

📐 config

This is a sh...

Python Tips

Dictionaries # [1] Unpacking # [2] - **kwargs - func(**input) - locals().update(d) # [3] References: [1]: #dictionaries [2]: #unpacking [3]: #heading
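A brief sketch of the first two unpacking forms listed above. The function names and the `user` dictionary are invented for the example.

```python
def describe_user(username, firstname="", lastname=""):
    """Takes ordinary keyword arguments."""
    return f"{username}: {firstname} {lastname}".strip()


def collect_keys(**kwargs):
    # **kwargs gathers any keyword arguments into a dict.
    return sorted(kwargs)


user = {"username": "jdoe", "firstname": "Jane", "lastname": "Doe"}

# func(**input): each key in the dict becomes a keyword argument.
description = describe_user(**user)

# Unpacked dicts can be mixed with explicit keyword arguments.
collected_keys = collect_keys(**user, active=True)
```

The dictionary keys must match the function's parameter names, otherwise the call raises a `TypeError`.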
