From Code to Production: Leveraging Kedro and MLflow for ML Pipelines

Introduction

Nachiketa Hebbar
Nov 5, 2024

In today’s fast-evolving landscape of machine learning, the need for efficient MLOps (Machine Learning Operations) has become crucial. As data science projects grow in complexity, building reliable, reproducible pipelines is as important as developing accurate models.

In this blog, we’ll explore two powerful tools, Kedro and MLflow, that bring order and reliability to machine learning pipelines. Kedro provides a robust framework for structuring and modularizing ML pipelines, while MLflow enables experiment tracking, model versioning, and streamlined deployment. Together, these tools help transform basic scripts into production-grade workflows.

Using a movie recommendation system as our example, we’ll demonstrate how Kedro and MLflow can be combined to create a complete, scalable ML pipeline. While we’ll focus on movie recommendations, this approach is highly adaptable, making it suitable for practically any machine learning project involving data engineering, feature engineering, model training, and evaluation.

Let’s dive in and learn how to build production-ready machine learning pipelines with Kedro and MLflow!

The Problem that tools like Kedro and MLflow address

Building machine learning pipelines that are robust, scalable, and easily maintainable poses significant challenges for data teams. As projects grow in complexity, managing the different stages of a pipeline — from data ingestion to model deployment — becomes increasingly difficult. Some of the primary issues that arise in ML pipeline management include:

  • Lack of Reproducibility: Without a structured approach, replicating results can be challenging, especially when the pipeline involves multiple data sources, parameters, and model versions. Reproducibility is critical for ensuring that models perform consistently in production.
  • Experiment Tracking and Versioning: For iterative processes like model training, tracking each experiment’s parameters, metrics, and results is essential but often overlooked. Without robust experiment tracking, teams struggle to compare models or identify which version performed best.
  • Modularization and Code Maintenance: Pipelines are often built with ad-hoc scripts that make it difficult to modularize code. This lack of modularity leads to challenges in maintaining and scaling the codebase, especially when multiple team members are involved.
  • Deployment Readiness: Transitioning a pipeline from a development environment to production involves logging and versioning models and preparing them for deployment. Without tools to streamline this process, deployment becomes time-consuming and error-prone.

Kedro and MLflow address these challenges directly by bringing software engineering best practices into data science projects. Together, they provide a structured, traceable, and production-ready framework for managing ML workflows.

What Kedro and MLflow Do and Why Use Them Together

Kedro is an open-source Python framework that brings software engineering practices to data science projects. It structures data pipelines with modular code, making projects easier to maintain and scale. Its pipeline framework breaks down workflows into manageable nodes for data cleaning, feature engineering, and more.

MLflow is a tool for tracking experiments, managing models, and streamlining deployment. It logs parameters, metrics, and artifacts during model training, enabling easy comparison and model management.

Combined, Kedro and MLflow provide an end-to-end solution for managing ML workflows — from data processing to deploying a production-ready model.

Environment Setup

To work with Kedro and MLflow, setting up the environment properly is crucial. Follow these steps to prepare:

Initialize a New Kedro Project:
Create a new Kedro project to structure your machine learning workflow:

python -m venv myenv
source myenv/bin/activate # On Windows: .\myenv\Scripts\activate
pip install kedro mlflow kedro-mlflow
kedro new --starter=pandas-iris --directory=my_kedro_project

This will initialize your Kedro project and create files and folders for you, with boilerplate code defined for a basic ML pipeline.
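The exact layout varies with the Kedro version and starter you use, but the generated project typically looks roughly like this:

my_kedro_project/
├── conf/
│   ├── base/               # catalog.yml, parameters.yml, shared config
│   └── local/              # credentials and machine-specific config
├── data/
│   ├── 01_raw/
│   ├── 02_intermediate/
│   └── 06_models/
├── notebooks/
└── src/
    └── my_kedro_project/   # pipeline code (nodes, pipeline registry)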

Integrate MLflow with Kedro:
Configure MLflow in your Kedro project by updating or creating the `mlflow.yml` file in `conf/base/`:

mlflow:
  tracking:
    uri: "http://localhost:5000"  # Use your MLflow server URI
    experiment:
      name: "movie_recommendation_experiment"

Now you can launch the MLflow tracking server using the command:

mlflow ui

Open `http://localhost:5000` in your browser to monitor logged runs, parameters, and metrics.

With the environment ready, we can now build the movie recommendation pipeline step by step using Kedro and see how MLflow tracks and logs each stage.

Setting up your Data and Modelling Pipeline with Kedro

This tutorial is intended to work for any machine learning project that involves data engineering and model training steps. I am going to implement these steps for a movie recommender system project, whose pipeline contains the following stages:

Data Engineering + Modelling Pipeline

Now Kedro can help us visualize this pipeline, automate its logging, and make it easy to reproduce with different settings.

To do so, the first step is to modularize your code into separate steps. This means creating a pipeline.py file that calls functions from different Python scripts, each responsible for a different task in your pipeline, from data cleaning to model training.

For example, here is how a snippet of my code looked after I structured my pipeline.py:

Code Snippet from pipeline.py (without Kedro)
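As a rough sketch (the module names, function signatures, and parameter values here are illustrative assumptions, not the project’s actual code), a plain pipeline.py of this kind might look like this:

# pipeline.py (plain Python, no Kedro) -- illustrative sketch
# The imported modules and function signatures are assumptions for illustration.
from data_cleaning import process_kafka
from feature_engineering import aggregate_ratings_minutes, normalize_minutes
from training import train_and_evaluate

# Each step reads the previous step's output from disk and writes its own file
process_kafka(
    input_csv_path='input_data/raw_kafka_stream.csv',
    output_csv_path='output_data/processed_kafka_stream.csv'
)
aggregate_ratings_minutes(
    input_csv_path='output_data/processed_kafka_stream.csv',
    output_csv_path='output_data/grouped_data.csv'
)
normalize_minutes(
    input_csv_path='output_data/grouped_data.csv',
    output_csv_path='output_data/normalized_data.csv'
)
train_and_evaluate(
    input_csv_path='output_data/normalized_data.csv',
    test_size=0.2,
    random_state=42
)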

Now this pipeline.py is just a bunch of Python functions being called one after the other. Nothing special, right?

Exactly. But here is where we will start using Kedro to make this pipeline better. In Kedro, each step in the pipeline is defined as a Node object. So you need to wrap each function in a node.

Here is an example of what a Python function in your module looks like before and after getting wrapped in a node:

A data cleaning function before:

# Function to extract user data from the Kafka movie stream
process_kafka(
    input_csv_path='../input_data/raw_kafka_stream.csv',
    output_csv_path='../output_data/processed_kafka_stream.csv'
)

The data cleaning function wrapped in the node object after:

from kedro.pipeline import node

node(
    func=process_kafka,
    inputs="params:input_csv_path",
    outputs="processed_data_path",
    name="process_kafka"
)

The node takes the Python function as an argument, along with its inputs and outputs. Notice that the input and output CSV file paths no longer need to be hardcoded. Inside your Kedro project, a file called conf/base/catalog.yml is auto-created for you. The catalog.yml is essentially your project’s data registry, containing the paths to all data sources used in the project.

For example, here is how I added the paths to my raw and processed CSV files in the catalog:

input_csv_path:
  type: pandas.CSVDataset
  filepath: data/01_raw/kafka_stream.csv

processed_data_path:
  type: pandas.CSVDataset
  filepath: data/02_intermediate/processed_kafka_stream.csv

Make sure to add all such file paths for data sources used in your project.

Defining Your Kedro Pipeline

Once you define your nodes, you have to add them sequentially onto your main pipeline.

You can do this by instantiating the Pipeline class from Kedro and passing it your nodes as arguments, in your src/pipeline.py file, which is also auto-created for you by Kedro.

Here is how the create_pipeline function is defined in my movie recommendation pipeline project:

# src/pipeline.py

from kedro.pipeline import Pipeline, node

# The node functions are assumed to live in the project's own nodes module
from .nodes import (
    process_kafka,
    aggregate_ratings_minutes,
    normalize_minutes,
    train_and_evaluate,
)


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=process_kafka,
            inputs="params:input_csv_path",
            outputs="processed_data_path",
            name="process_kafka"
        ),
        node(
            func=aggregate_ratings_minutes,
            inputs="processed_data_path",
            outputs="grouped_data_path",
            name="aggregate_ratings"
        ),
        node(
            func=normalize_minutes,
            inputs="grouped_data_path",
            outputs="normalized_data_path",
            name="normalize_minutes"
        ),
        node(
            func=train_and_evaluate,
            inputs=["normalized_data_path", "params:test_size", "params:random_state"],
            outputs=["model_rmse", "model_output_path"],
            name="train_and_evaluate"
        ),
    ])

What the individual functions in each node do doesn’t matter. Treat them as normal data cleaning, feature engineering, or model training functions that could be used in any ML project.
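The params: entries referenced in the nodes above come from conf/base/parameters.yml. Here is a minimal sketch of what that file might contain; the values are purely illustrative:

# conf/base/parameters.yml (values are illustrative)
input_csv_path: data/01_raw/kafka_stream.csv
test_size: 0.2
random_state: 42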

Running Your Pipeline

Once you have defined all the nodes in your src/pipeline.py, go to your terminal, cd into your project directory, and run the following command:

kedro run

Once you execute this command, you will see each step of your pipeline run, with logs printed as each node executes. If your pipeline crashes at any node, you can also rerun from a given node using the command:

kedro run --from-nodes "name_of_your_node_function"

Additionally you can visualize your pipeline using the command:

kedro viz

This opens an interactive dashboard that visualizes your pipeline. For example, here is how the visualized pipeline for the movie recommendation project looks:

Kedro Pipeline Visualization

You can see each step of the pipeline visualized, which makes it easy for stakeholders to collaborate. You can also use the dashboard’s filters to show only the nodes, datasets, or parameters.

By now your pipeline is modular, easily reproducible, and easy for stakeholders to collaborate on.

However, to make your pipeline production-ready, here is what’s missing: there is no easy way to manage the hundreds of experiments you could potentially run with different parameters, the trained ML models are not versioned or tracked, and there is no clear history of changes.

Let’s Bring in the Big Guns: Integrating MLflow

With a few easy steps, you can integrate MLflow into your project.

Step 1: Initialize your mlflow configuration files

Inside your project terminal run the following command:

kedro mlflow init

This will create an mlflow.yml file in your conf/local folder, containing default parameters for running MLflow experiment tracking in your project.

Step 2: Configure the MLflow YAML file

Configure MLflow in your Kedro project by updating the `mlflow.yml` file created in the previous step:

mlflow:
  tracking:
    uri: "http://localhost:5000"  # Use your MLflow server URI
    experiment:
      name: "movie_recommendation_experiment"

This lets you set your experiment name and the tracking URI (including the port) on which the MLflow server runs.

Step 3: Store your model artifacts with each run.

In catalog.yml, where you define the path where your model is stored, indicate that the file should be tracked by MLflow on each run. You can do this by changing the dataset type to an MLflow artifact:

model_output_path:
  type: kedro_mlflow.io.artifacts.MlflowArtifactDataset
  dataset:
    type: pickle.PickleDataset
    filepath: data/06_models/nachi_model.pkl

Step 4: Save your model metrics and parameters with each run

To do so, go into your training scripts, identify the important parameters and metrics that you want to log, and log them to MLflow as follows:

# training.py
import mlflow

mlflow.log_params({"Test Size": test_size})
mlflow.log_params({"Random State": random_state})
mlflow.log_metric("RMSE", rmse)

These metrics would typically be your mean squared error, accuracy, or something else, depending on the model quality metrics you choose.
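For context, here is a rough sketch of what a train_and_evaluate node function might look like with these logging calls in place. The model choice, target column, and RMSE computation are assumptions for illustration; substitute your own training logic:

# training.py -- illustrative sketch, not the project's actual training code
import mlflow
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate(data: pd.DataFrame, test_size: float, random_state: int):
    # "rating" as the target column is a placeholder
    X = data.drop(columns=["rating"])
    y = data["rating"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    model = RandomForestRegressor(random_state=random_state)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

    # Log this run's parameters and metrics to MLflow
    mlflow.log_params({"Test Size": test_size, "Random State": random_state})
    mlflow.log_metric("RMSE", rmse)

    # Returned values map to the node's outputs: model_rmse and model_output_path
    return rmse, model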

Believe it or not, that’s it! With these four steps, your models are stored as artifacts, they are versioned, and your metrics and parameters for each run are tracked.
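If you later want to pull a logged model back for inference, you can download the pickled artifact from a run. A minimal sketch, assuming a recent MLflow version and that the artifact was logged under the filename from the catalog entry above (the run ID is a placeholder you copy from the MLflow UI):

import pickle
import mlflow

# Download the pickled model that was logged as an artifact of a given run
local_path = mlflow.artifacts.download_artifacts(
    run_id="<your_run_id>",           # placeholder: copy this from the MLflow UI
    artifact_path="nachi_model.pkl",  # assumed artifact name, taken from the catalog filepath
)

with open(local_path, "rb") as f:
    model = pickle.load(f)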

MLflow offers many other features, which you are free to explore in its documentation. The best part is that you can use any of Kedro’s or MLflow’s individual features alongside this plugin.

Now, to run your experiment and view the MLflow dashboard, first run your pipeline as before using the kedro run command.

Then, to open the MLflow UI, run the following command in your terminal:

kedro mlflow ui

This will open an MLflow dashboard with the information from your runs logged in the Experiments tab. Here is how a typical dashboard looks:

MLflow dashboard

You can see all the runs logged, including failed and successful runs.

You can also filter runs according to desired values of metrics. For example, to get all the runs where your root mean square error was less than 4, you can search for something like this in your dashboard:

All runs with RMSE less than 4
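In the MLflow UI’s search box, this filter is written against the metric key you logged earlier, along the lines of:

metrics.RMSE < 4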

When you deploy tens or hundreds of models into production on a monthly or even yearly basis, such tools become extremely important for organizing and tracking your experiments. Clicking on any run also shows that run’s detailed metrics, parameters, and artifacts.

Strengths and Limitations

Strengths:

  • Modular, Reproducible Structure: Kedro makes it easy to build and maintain structured pipelines that are modular and easy to reproduce.
  • Robust Experiment Tracking: MLFlow logs parameters, metrics, and artifacts, simplifying experiment tracking and model versioning.
  • Improved Collaboration: The combination of Kedro and MLflow supports team-based work by providing clear, standardized workflows and experiment traceability.
  • Scalability: This setup scales well for larger teams and complex projects, making long-term maintenance easier.
  • Visualization Tools: Kedro’s kedro viz and MLflow's UI help visualize pipelines and track experiment details effectively.

Limitations:

  • Learning Curve: As you might have figured out while setting up this project, integrating Kedro and MLflow can be challenging for teams new to MLOps practices. The initial learning curve is steep, both in the setup steps required and in understanding the different nomenclature used by Kedro and MLflow.
  • Customization Constraints: Kedro’s structured approach might feel restrictive for highly tailored workflows requiring more flexibility.

Conclusion

Kedro and MLflow together elevate your ML projects from basic scripts to production-level workflows. Kedro modularizes and standardizes your code, making collaboration and scaling easy. MLflow tracks your experiments, manages models, and logs valuable metrics, simplifying experiment management. With this combination, your pipelines are not only well-structured but also monitored and reproducible, ready to scale and iterate seamlessly!

As you move forward, feel free to dive deeper into the advanced capabilities of MLflow, such as the model registry, deployment tools, and integration with other MLOps platforms. The journey to building effective, scalable, and monitored machine learning pipelines begins here, so take these tools, experiment, and make your projects shine!

References:

  1. Kedro documentation
  2. MLflow documentation
  3. Kedro + MLflow guide
