A Data Science Life Cycle in 2024: What You Must Know

Some data science life cycles narrowly focus on just the data, modelling, and assessment steps. Others are more comprehensive and start with business understanding and end with deployment.

The one we’ll walk through is even more extensive, adding operations. It also emphasizes agility more than other life cycles do.

This life cycle has five steps:

  1. Problem Definition
  2. Data Investigation and Cleaning
  3. Minimal Viable Model
  4. Deployment and Enhancements
  5. Data Science Ops

Problem Definition

Just like any good business or IT-focused life cycle, a good data science life cycle starts with “why”. If you’re asking “Why start with why?”, the short answer is that a clear purpose aligns the team and keeps the project focused on business value.

Generally, the project lead or product manager manages this phase. Regardless, this initial phase should:

  • State clearly the problem to be solved and why
  • Motivate everyone involved to push toward this why
  • Define the potential value of the forthcoming project
  • Identify the project risks including ethical considerations
  • Identify the key stakeholders
  • Align the stakeholders with the data science team
  • Research related high-level information
  • Assess the resources (people and infrastructure) you’ll likely need
  • Develop and communicate a high-level, flexible project plan
  • Identify the type of problem being solved
  • Get buy-in for the project

Data Investigation and Cleaning

Once you have the data, start exploring it. Your data scientists or business/data analysts will lead several activities such as:

  • Document the data quality
  • Clean the data
  • Combine various data sets to create new views
  • Load the data into the target location (often to a cloud platform)
  • Visualize the data
  • Present initial findings to stakeholders and solicit feedback
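Several of these activities can be sketched with pandas. The sample data, column names, and cleaning rules below are illustrative assumptions, not a prescription:

```python
import pandas as pd

# Hypothetical orders data; the columns and values are invented for illustration.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount": [120.0, None, None, 75.5, -10.0],
    "region": ["east", "west", "west", None, "east"],
})

# Document the data quality before touching anything.
quality_report = {
    "rows": len(df),
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_per_column": df.isna().sum().to_dict(),
}

# Clean: drop duplicates, remove invalid amounts, fill missing categories.
clean = df.drop_duplicates().copy()
clean = clean[clean["amount"].notna() & (clean["amount"] >= 0)]
clean["region"] = clean["region"].fillna("unknown")

print(quality_report)
print(clean)
```

The quality report doubles as the documentation artifact you present to stakeholders, and the cleaning steps stay reproducible because they live in code rather than in ad hoc spreadsheet edits.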

Minimal Viable Model

What is a Minimal Viable Model?

“The minimal viable model is the version of a new model which allows a team to collect the maximum amount of validated learning about the model’s effectiveness with the least effort”.

  • Minimal: The model is narrowly focused. It is not the best possible model, but it is sufficient to make a measurable impact on a subset of the overall problem.
  • Collect the maximum amount of validated learning about the model’s effectiveness: Develop a hypothesis and test it. This validated learning confirms or refutes your team’s initial hypotheses. It has two main parts:
    • Is the model technically performing better than the baseline?
    • Is the model able to make a meaningful impact to the underlying business problem?
  • Least effort: Full-fledged deployments are typically costly and time-consuming. Therefore, find the simplest way to get the model out.
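The validated-learning loop can be sketched on a hypothetical churn problem where the only feature is months of inactivity. The minimal viable model here is a single learned threshold, tested against a majority-class baseline; all names and data are invented:

```python
import random

random.seed(0)
# Hypothetical churn data: feature = months inactive, label = churned (1) or not (0).
data = [(m, 1 if m > 6 else 0) for m in (random.randint(0, 12) for _ in range(200))]
train, test = data[:150], data[150:]

# Baseline: always predict the majority class from the training set.
majority = max({0, 1}, key=lambda c: sum(1 for _, y in train if y == c))
baseline_acc = sum(1 for _, y in test if y == majority) / len(test)

# Minimal viable model: a single inactivity threshold learned by brute force.
best_t = max(range(13), key=lambda t: sum(1 for x, y in train if (x > t) == y))
model_acc = sum(1 for x, y in test if (x > best_t) == y) / len(test)

# Validated learning, part 1: is the model technically better than the baseline?
print(best_t, baseline_acc, model_acc)
```

Part 2 of the validated learning, business impact, can’t be computed offline; it comes from exposing the model to a slice of real users, which is why the deployment should be as lightweight as possible.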

Deployment and Enhancements

Many data science life cycles include “Deployment” or a similar term. This step creates the delivery mechanism you need to get the model out to the users or to another system.

Typically, the more “engineering-focused” team members such as data engineers, cloud engineers, machine learning engineers, application developers, and quality assurance engineers execute this phase.
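A minimal sketch of that delivery mechanism, assuming a pickled model artifact exposed behind a JSON request handler (in production this would sit behind an HTTP endpoint; the model class and field names here are invented):

```python
import io
import json
import pickle

# Hypothetical trained model: a threshold rule, serialized like a real artifact.
class ChurnModel:
    def __init__(self, threshold: int):
        self.threshold = threshold

    def predict(self, months_inactive: int) -> int:
        return int(months_inactive > self.threshold)

# Engineers package the artifact produced by the minimal-viable-model step...
artifact = io.BytesIO()
pickle.dump(ChurnModel(threshold=6), artifact)
artifact.seek(0)

# ...and load it behind a stable serving interface.
model = pickle.load(artifact)

def handle_request(body: str) -> str:
    """Accept a JSON request body and return a JSON prediction."""
    payload = json.loads(body)
    return json.dumps({"churn": model.predict(payload["months_inactive"])})

print(handle_request('{"months_inactive": 9}'))
```

Keeping the serving interface this thin lets the engineering-focused team swap in improved model artifacts later without changing the contract the consuming users or systems depend on.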

Data Science Ops

Most other data science life cycles end with a Deployment phase, or even earlier with an Assessment phase.

However, as data science matures into mainstream operations, companies need to take a stronger product focus that includes plans to maintain deployed systems long-term. This management has three major, overlapping facets.

Software Management

A productized data science solution ultimately sits as part of a broader software system. And like all software systems, the solution needs to be maintained. Common practices include:

  • Maintaining the various system environments
  • Managing access control
  • Triggering alert notifications for serious incidents
  • Executing test scripts with every new deployment
  • Meeting service level agreements (SLAs)
  • Implementing security patches
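The practice of executing test scripts with every new deployment can be sketched as a post-deploy smoke test; here `predict` is a hypothetical stand-in for a call to the deployed model service:

```python
# Hypothetical stand-in for a request to the deployed model endpoint.
def predict(months_inactive: int) -> int:
    return int(months_inactive > 6)

def run_smoke_tests() -> bool:
    """Quick checks a CI/CD pipeline might run after every deployment."""
    checks = [
        predict(0) == 0,       # known-negative case still predicts no churn
        predict(12) == 1,      # known-positive case still predicts churn
        predict(6) in (0, 1),  # output stays within the valid label set
    ]
    return all(checks)

print(run_smoke_tests())
```

When a check fails, the same pipeline that runs these tests would trigger the alert notifications listed above rather than silently promoting the release.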

Model and Data Management

Data science product operations have additional considerations beyond standard software product maintenance:

  • Monitor the Data: The data comes from the “real world”, which is beyond your control and presents unique challenges. Therefore, validate that incoming data sets are in the expected format and that the data falls within acceptable ranges.
  • Monitor Model Performance: Software functionality tends to be binary: it works or it doesn’t. Models, however, are probabilistic, so you often can’t say definitively whether a model “is working”. You can, however, get a good feel by monitoring model performance for unacceptable swings in core metrics such as standard deviation or mean absolute percentage error (MAPE).
  • Run A/B Tests: Models can drift until they perform worse than random noise. They can also (nearly) always be improved. Therefore, during operations, continue to routinely hold out small portions of your population as a control group to test performance against the running model. Occasionally, develop and deploy new test models to measure their performance against the incumbent production model.
  • Ensure Proper Model Governance: Regulations in certain industries require companies to be able to explain why a model made certain decisions. And even if you’re not in one of these regulated industries, you will want to be able to trace the specific set of data and the specific model used to evaluate specific outcomes.
  • Ensure Proper Model Governance: Regulations in certain industries require companies to be able to explain why a model made certain decisions. And even if you’re not in one of these regulated industries, you will want to be able to trace the specific set of data and the specific model used to evaluate specific outcomes.
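The first two monitoring practices can be sketched as batch checks. The schema, acceptable ranges, and alert threshold below are assumptions for illustration only:

```python
# Assumed schema for incoming records: field name -> (min, max) acceptable range.
EXPECTED_RANGES = {"months_inactive": (0, 120), "amount": (0.0, 1e6)}

def validate_record(record: dict) -> list:
    """Return a list of data-quality issues for one incoming record."""
    issues = []
    for field, (lo, hi) in EXPECTED_RANGES.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not (lo <= record[field] <= hi):
            issues.append(f"{field} out of range: {record[field]}")
    return issues

def mape(actuals, predictions):
    """Mean absolute percentage error over paired observations."""
    return sum(abs((a - p) / a) for a, p in zip(actuals, predictions)) / len(actuals)

# A monitoring job would run checks like these on each incoming batch:
bad = validate_record({"months_inactive": 999, "amount": 50.0})
error = mape([100, 200, 400], [110, 180, 400])

ALERT_THRESHOLD = 0.15  # assumed acceptable swing in the core metric
needs_alert = error > ALERT_THRESHOLD
print(bad, round(error, 4), needs_alert)
```

In practice these checks would feed the same alerting channels as the software incidents above, so that a data or model regression is treated with the same urgency as an outage.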
