📢 Announcing find-kedro

Waylon Walker

#

kedro is a ✨ fantastic project that allows for super-fast prototyping of data pipelines, while yielding production-ready pipelines. find-kedro enhances this experience by adding a pytest like node/pipeline discovery eliminating the need to bubble up pipelines through modules.

When working on larger pipeline projects, it is advisable to break your project down into different sub-modules which requires knowledge of building python libraries, and knowing how to import each module correctly. While this is not too difficult, in some cases, it can trip up even the most senior engineers, losing precious feature development time to debugging a library.

#

find-kedro is deployed to pypi and can easily be pip installed.






        
pip install find-kedro

#

The recommended usage of find-kedro is to implement it directly into your projects run.py module






        
from kedro.context import KedroContext
from find_kedro import find_kedro

class ProjectContext(KedroContext):
    def _get_pipelines(self) -> Pipeline:
        return find_kedro()

Creating nodes

find-kedro will not execute any functions. It will simply look for variables that match the pattern and identify if they are a kedro.pipeline.Pipeline, kedro.pipeline.nodes.Node, or a list of kedro.pipeline.nodes. Node's. If so, it will collect them into the dictionary of pipelines.

There are typically three ways that pipelines are constructed with find-kedro; lists, single-nodes, and pipelines.

Lists

Any pattern matched list will be flattened and collected into the pipeline. Nodes can be created all at once in the list definition.






        
# my-proj/pipelinies/data_engineering/pipeline
from kedro.pipeline import node
from .nodes import split_data

pipeline = [
    node(
        split_data,
        ["example_iris_data", "params:example_test_data_ratio"],
        dict(
            train_x="example_train_x",
            train_y="example_train_y",
            test_x="example_test_x",
            test_y="example_test_y",
        ),
    )
]

It is also convenient many times to keep the node definition close to the function definition. Many times I define the list at the top of the file, then append to it as I go.






        
# my-proj/pipelinies/data_engineering/pipeline
from kedro.pipeline import node
from .nodes import split_data

nodes = []
nodes.append(
    node(
        split_data,
        ["example_iris_data", "params:example_test_data_ratio"],
        dict(
            train_x="example_train_x",
            train_y="example_train_y",
            test_x="example_test_x",
            test_y="example_test_y",
        ),
    )
)

Nodes

All pattern matched kedro.pipeline.node.Node objects will get collected into the pipeline.






        
# my-proj/pipelinies/data_engineering/pipeline
from kedro.pipeline import node
from .nodes import split_data

split_node = node(
        split_data,
        ["example_iris_data", "params:example_test_data_ratio"],
        dict(
            train_x="example_train_x",
            train_y="example_train_y",
            test_x="example_test_x",
            test_y="example_test_y",
        ),
    )

Pipeline

All pattern matched kedro.pipeline.Pipeline objects will get collected into the pipeline.






        
# my-project/pipelinies/data_engineering/pipeline
from kedro.pipeline import node, Pipeline
from .nodes import split_data

split_node = Pipeline(
    [
        node(
            split_data,
            ["example_iris_data", "params:example_test_data_ratio"],
            dict(
                train_x="example_train_x",
                train_y="example_train_y",
                test_x="example_test_x",
                test_y="example_test_y",
            ),
        )
    ]
)

Fully Qualified imports

When using fully qualified imports from my_proj.pipelines.data_science.nodes import split_data instead of relative imports from .nodes split_data you will need to make sure that your project is installed, in your current path, or you set the directory

The CLI provides a handy interface to search your project for nodes






        
Usage: find-kedro [OPTIONS]

Options:
  --file-patterns TEXT       glob-style file patterns for Python node module
                             discovery

  --patterns TEXT            prefixes or glob names for Python pipeline, node,
                             or list object discovery

  -d, --directory DIRECTORY  Path to save the static site to
  --version                  Prints version and exits
  -v, --verbose              Prints extra information for debugging
  --help                     Show this message and exit.

Example ran with a slightly modified default kedro new project.






        
❯ find-kedro
{
  "__default__": [
    "split_data([example_iris_data,params:example_test_data_ratio]) -> [example_test_x,example_test_y,example_train_x,example_train_y]",
    "train_model([example_train_x,example_train_y,parameters]) -> [example_model]",
    "predict([example_model,example_test_x]) -> [example_predictions]",
    "report_accuracy([example_predictions,example_test_y]) -> None"
  ],
  "src.default_kedro_159.pipelines.data_engineering.pipeline": [
    "split_data([example_iris_data,params:example_test_data_ratio]) -> [example_test_x,example_test_y,example_train_x,example_train_y]"
  ],
  "src.default_kedro_159.pipelines.data_science.pipeline": [
    "train_model([example_train_x,example_train_y,parameters]) -> [example_model]",
    "predict([example_model,example_test_x]) -> [example_predictions]",
    "report_accuracy([example_predictions,example_test_y]) -> None"
  ]
}

📢 Announcing find-kedro

Tags

#

#

#

Creating nodes

Lists

Nodes

Pipeline

Fully Qualified imports

#

#

#

#

Recent Posts

Recent Thoughts

Recent Stars