Predicting Diabetes Disease with EvalML (AutoML)

Published in Heartbeat · 9 min read · Dec 16, 2022

According to the World Health Organization (WHO), 8.5% of adults aged 18 years and older had diabetes in 2014. In 2019, diabetes was the direct cause of 1.5 million deaths, and 48% of those deaths occurred before the age of 70. Diabetes was also responsible for another 460,000 deaths from kidney disease, and raised blood glucose causes around 20% of cardiovascular deaths.

Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use its insulin.

In this article, we will use the EvalML library to search for the best pipeline, predict which patients have diabetes, and explore some of the library’s other capabilities.

With that objective in mind, we will focus on the diabetes-related health metrics in the dataset: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. These are the independent features; analyzing them helps identify areas of concern and predict the dependent variable, Outcome.

What is EvalML (AutoML)?

EvalML is an open-source Python library developed by Alteryx that makes automated machine learning (AutoML) and model understanding easier.

EvalML helps with data preprocessing, visualization, and automated machine learning. It also provides several modification possibilities to improve prediction outcomes.

Use cases of EvalML

EvalML has several useful real-life use cases, such as:

Data checking.
Model understanding.
Detecting target leakage, where information about the target is accidentally passed to the model during training.
Checking for columns that aren’t relevant for modeling.
Class imbalance.
Redundant features such as highly null columns, constant columns, etc.
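To make a few of the checks in this list concrete, here is a minimal, dependency-free sketch of how highly null columns, constant columns, and class imbalance might be flagged. It is not EvalML’s implementation: the function name simple_data_checks, both thresholds, and the toy table are assumptions made purely for illustration.

```python
from collections import Counter


def simple_data_checks(columns, target, null_threshold=0.95, imbalance_threshold=0.10):
    """Return a list of human-readable warnings for a column-oriented table.

    columns: dict mapping column name -> list of values (None marks a missing value)
    target:  list of class labels
    The thresholds are illustrative assumptions, not EvalML defaults.
    """
    warnings = []
    for name, values in columns.items():
        # Share of missing values in this column
        null_share = sum(v is None for v in values) / len(values)
        if null_share >= null_threshold:
            warnings.append(f"{name}: highly null ({null_share:.0%} missing)")
        # A column with exactly one distinct non-missing value carries no signal
        present = {v for v in values if v is not None}
        if len(present) == 1:
            warnings.append(f"{name}: constant column")
    # Flag any class that makes up less than imbalance_threshold of the rows
    for label, count in Counter(target).items():
        share = count / len(target)
        if share < imbalance_threshold:
            warnings.append(f"target class {label}: only {share:.0%} of rows")
    return warnings


# Hypothetical toy table, not taken from the diabetes dataset.
columns = {
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115, 197, 125],
    "Notes": [None] * 10,
    "Clinic": ["A"] * 10,
}
target = [1, 0, 1, 0, 1, 0, 0, 0, 1, 1]

for warning in simple_data_checks(columns, target):
    print(warning)
```

EvalML ships its own data checks for these situations, so in practice we would rely on the library rather than hand-rolled rules like these.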

How to install EvalML?

Run one of the commands below to install EvalML on your computer. Note that you need Python 3.8 or higher.

Install via pip.

pip install evalml

Install via conda (Anaconda needs to be installed on your computer).

conda install -c conda-forge evalml

We also need to install graphviz for plotting utilities in EvalML.

pip install graphviz

Let’s get started

Firstly, let us import the necessary libraries and read our dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Dataset

We obtained the dataset from Kaggle; it is used to predict whether or not a person is diabetic. Keep in mind that this is a relatively small dataset and only serves as an example.

df = pd.read_csv("../Downloads/base/diabetes_data.csv")
df.head()

Let’s get more information about the dataset. The .shape attribute gives the dimensions of the DataFrame, which consists of 768 rows and 9 columns, and the .info() method summarizes each column’s data type and non-null count.

df.shape
df.info()

Next, we check for null values and unique values. The .isnull() method marks each missing value with a Boolean, and chaining .any() reports, per column, whether any value is missing. The for loop counts the number of distinct values in each column.

# Check each column for null values
df.isnull().any()

# Count the distinct values in each column
unique_values = {}
for coln in df.columns:
    unique_values[coln] = df[coln].nunique()
pd.DataFrame(unique_values, index=["Unique Values"]).T

The .describe() method gives summary statistics for the numeric columns only, and .T transposes the resulting DataFrame.

df.describe().T

Exploratory Data Analysis (EDA)

EDA aims to uncover patterns, review assumptions, and test hypotheses on data before formal modeling, typically with the help of graphical representations and visualizations.

A correlation heatmap is an important EDA tool because it lets the analyst quickly visualize the relationships between the variables in a dataset, spot potential problems or inconsistencies in the data, and identify which variables matter most for predicting the outcome of interest. It does this by showing the strength and direction of the correlation between each pair of variables: the larger the absolute correlation, the stronger the relationship.

Summary data on the primary features and hidden patterns in the data may help a doctor identify areas of concern, and addressing these can improve the accuracy of a diabetes diagnosis.

sns.heatmap(df.corr(),cmap = 'YlOrRd', annot = True)
plt.title("Correlation Of Features")
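Each cell in the heatmap is a Pearson correlation coefficient, which is what df.corr() computes by default. As a rough sketch of the arithmetic behind a single cell, the snippet below computes it by hand. The helper name pearson and the toy values are assumptions for illustration and are not rows from the Kaggle dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Variation of each variable on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values: higher glucose readings tend to go with outcome 1.
glucose = [85, 90, 110, 140, 160, 180]
outcome = [0, 0, 0, 1, 1, 1]

r = pearson(glucose, outcome)
print(round(r, 3))
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one.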

The Outcome column is a discrete binary variable with two categories: Diabetic, denoted by 1, with 268 instances, and Not Diabetic, denoted by 0, with 500 instances.

sns.countplot(x='Outcome', data=df, palette='YlOrRd')
plt.title("Count of Diabetes Outcome")
plt.show()
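The class counts above also tell us how imbalanced the target is. The short calculation below uses only the two counts reported in this article (500 and 268) and derives the class shares and the accuracy of a naive majority-class baseline; the variable names are our own.

```python
# Class counts reported in the article for the Outcome column.
counts = {0: 500, 1: 268}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
majority_label = max(counts, key=counts.get)

for label, share in shares.items():
    print(f"Outcome={label}: {counts[label]} rows ({share:.1%})")

# A model that always predicts the majority class would reach this accuracy,
# which is a useful baseline when reading the scores later on.
baseline_accuracy = shares[majority_label]
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Roughly one patient in three is diabetic, so accuracy alone can be misleading here, which is one reason to look at objectives such as AUC, F1, precision, and recall later on.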

Using EvalML

Now let’s use the EvalML library to search for the best model and train it on our dataset.

Firstly, we need to import EvalML.

import evalml

Data Modelling with EvalML

Let’s check our dataset.

We use the .head() method to see what our data looks like; by default, it displays only the first 5 rows.

df.head()

Feature Engineering

Let’s split our data into features and target. The Outcome column is the target y that we want to predict, and the Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age columns are the features x used to train the machine learning model.

x = df.iloc[:,:-1]
x.head()

y = df.iloc[:,-1]
y.head()

Train/Test the Dataset

The EvalML library will handle all preprocessing procedures and data splitting for us. The test_size is 20% by default, which means 20% of the data for testing and 80% for training, but we can set a test_size of our choice. The problem_type argument specifies what type of problem the dataset represents; EvalML also has a helper function called detect_problem_type that can infer the problem type from the target.

Note: Internally, EvalML relies on Woodwork (another Alteryx project) for data typing. Depending on the EvalML version, the data is converted from a DataFrame to a Woodwork DataTable or kept as a pandas DataFrame with Woodwork typing information attached.

X_train, X_test, y_train, y_test = evalml.preprocessing.split_data(x, y,problem_type='binary')
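For classification problems it is common to stratify the split so that the training and test sets keep roughly the same share of diabetic patients. How EvalML does this internally is not covered in this article, so the sketch below only illustrates the general idea. The function stratified_split and its seed parameter are hypothetical names, and the labels are synthetic values that reuse the class counts reported above.

```python
import random


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) that keep class proportions similar."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Labels with the same class counts as the article's dataset (500 zeros, 268 ones).
labels = [0] * 500 + [1] * 268
train_idx, test_idx = stratified_split(labels, test_size=0.2)

test_positive_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(test_positive_share, 3))
```

With this approach, the share of positive cases in the test set stays close to the roughly 35% seen in the full dataset.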

EvalML can tackle a variety of problems.

We can use ProblemTypes.all_problem_types to list all the problem types that EvalML supports.

evalml.problem_types.ProblemTypes.all_problem_types

Now that EvalML has split the data for us, it is time to start our automated machine learning! AutoMLSearch iterates over several pipelines to determine which combination of preprocessing steps and estimators produces the best-performing pipeline. For a typical machine learning problem we would have to define these preprocessing steps explicitly, and everything from standardization to one-hot encoding would be a decision to make by hand.

We must specify X_train, y_train, and problem_type. Once the necessary arguments have been passed to AutoMLSearch(), the work is carried out by the .search() method, and by default the pipelines are ranked by log loss.

from evalml.automl import AutoMLSearch
automl = AutoMLSearch(X_train=X_train, y_train=y_train, problem_type='binary')
automl.search()
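Since log loss decides the ranking by default, it helps to see what it measures. The snippet below is a minimal sketch of the standard binary log loss formula, written by hand rather than taken from EvalML; the function name binary_log_loss and the example probabilities are assumptions for illustration.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]  # predicted probability of class 1
uninformative = [0.5, 0.5, 0.5, 0.5]

print(binary_log_loss(y_true, confident_and_right))
print(binary_log_loss(y_true, uninformative))  # equals ln(2), about 0.693
```

Lower is better: confident, correct probabilities are rewarded, while confident mistakes are penalized heavily.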

We can look at the pipelines that were evaluated during the search through the .rankings attribute. We’ll also select the best-performing pipeline, visualize it, and score it. Note that AutoMLSearch identifies the best pipeline found during the search; in recent EvalML versions, .best_pipeline returns it already fitted on the training data, which is why we can score it directly on the test set.

automl.rankings

We can use the built-in .graph() method to see which components make up the best pipeline and which parameters were used. Inspecting .best_pipeline directly gives the same information, but as plain text rather than a diagram.

best_pipeline = automl.best_pipeline
best_pipeline.graph()

If we wish to score the pipeline on specific objectives, we pass a list of objective names to the objectives argument.

best_pipeline.score(X_test, y_test, objectives=["auc","f1","Precision","Recall"])
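To make the precision, recall, and F1 objectives concrete, here is a small hand-written sketch that computes them for the positive class from true and predicted labels. The function name and the toy labels are assumptions for illustration; the formulas are the standard definitions, not EvalML code.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when there are no positive predictions or labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 diabetic patients, 6 non-diabetic patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here two of the three positive predictions are correct (precision 2/3) and two of the four diabetic patients are found (recall 1/2), which gives an F1 of 4/7.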

Suppose we want to tune the search for the auc score. All we need to do is pass some additional arguments to AutoMLSearch(), such as objective, additional_objectives, max_batches, and optimize_thresholds. The objective is set to auto by default, but we can specify which one to use, for example auc, f1, precision, or recall. You can run evalml.objectives.get_all_objective_names() to view all available objectives.

Meanwhile, the additional_objectives argument lists extra objectives that are also computed and reported for each pipeline alongside the auc score.

automl_auc = AutoMLSearch(X_train=X_train, y_train=y_train,
                          problem_type='binary',
                          objective='auc',
                          additional_objectives=['f1', 'precision'],
                          max_batches=1,
                          optimize_thresholds=True)

automl_auc.search()
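AUC can be read as the probability that a randomly chosen diabetic patient receives a higher predicted score than a randomly chosen non-diabetic patient. The sketch below computes it directly from that definition; it is a hand-written illustration with made-up scores, not the routine EvalML uses.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```

Because AUC depends only on how the scores are ordered, it does not change with the classification threshold, unlike F1, precision, and recall.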

If we look at the rankings, we can see how the pipelines score when they are ranked by AUC, with the best pipeline at the top.

automl_auc.rankings

Now let’s score the best pipeline on the auc objective and use .graph() to see which pipeline is optimal and which parameters were used.

# Retrieve the best pipeline found by the AUC-driven search
best_pipeline_auc = automl_auc.best_pipeline
best_pipeline_auc.score(X_test, y_test, objectives=["auc"])

best_pipeline_auc.graph()

To view the confusion matrix and the binary objective vs. threshold plot, we import graph_confusion_matrix and graph_binary_objective_vs_threshold from evalml.model_understanding.metrics and evalml.model_understanding.visualizations, respectively.

from evalml.model_understanding.metrics import graph_confusion_matrix

y_pred2 = best_pipeline_auc.predict(X_test)
graph_confusion_matrix(y_test, y_pred2)
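If the plot is hard to read at first, it may help to see how the four cells are counted. The sketch below builds a 2x2 confusion matrix by hand using a common layout, with rows for the true label and columns for the predicted label. The function name and the toy labels are assumptions for illustration.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]]: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for eight patients.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_matrix_2x2(y_true, y_pred)
print(matrix)  # [[4, 1], [1, 2]]
```

In a medical setting, the false negatives in the bottom-left cell (diabetic patients predicted as healthy) usually deserve the most attention.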

from evalml.model_understanding.visualizations import graph_binary_objective_vs_threshold
graph_binary_objective_vs_threshold(best_pipeline_auc, X_test, y_test, "f1", steps=100)
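The idea behind this plot, and behind the optimize_thresholds option, is that the cutoff used to turn probabilities into labels does not have to be 0.5. The following is a minimal sketch of such a sweep over thresholds, scoring each one with F1. The function name f1_at_threshold and the probabilities are hypothetical, and this is not how EvalML implements its threshold optimization.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities at or above the threshold are labeled positive."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical predicted probabilities of diabetes for eight patients.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.22, 0.35, 0.62, 0.31, 0.45, 0.71, 0.93]

# Evaluate F1 at thresholds 0.1, 0.2, ..., 0.9 and keep the best one
thresholds = [step / 10 for step in range(1, 10)]
scores = {t: f1_at_threshold(y_true, y_prob, t) for t in thresholds}
best_threshold = max(scores, key=scores.get)

for t, score in scores.items():
    print(f"threshold={t:.1f}  f1={score:.3f}")
print("best threshold:", best_threshold)
```

In this toy example, a threshold of 0.3 gives a higher F1 than the usual 0.5 because it recovers all four diabetic patients at the cost of two false positives.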

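Now let’s save the model by pickling it and load it back so we can test it against the test data.

from evalml.pipelines import PipelineBase

# Save the fitted pipeline to disk, then load it back
best_pipeline_auc.save("model.pkl")
final_model = PipelineBase.load("model.pkl")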

Finally, let us check the outcome using the .predict_proba() and .predict() methods. .predict_proba() returns the predicted probability of each class for every patient, that is, the probability of being diabetic (1) or not diabetic (0).

.predict() gives us the predicted class itself.

final_model.predict_proba(X_test)

final_model.predict(X_test)
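The two outputs are closely related: a hard prediction is a probability compared against a threshold. The sketch below shows that relationship on made-up numbers. The helper to_labels and the 0.5 default are assumptions for illustration, and a pipeline searched with optimize_thresholds=True may use a different cutoff.

```python
def to_labels(probabilities, threshold=0.5):
    """Turn [P(class 0), P(class 1)] rows into hard 0/1 predictions."""
    return [1 if p_diabetic >= threshold else 0 for _, p_diabetic in probabilities]


# Hypothetical predict_proba-style output: one row per patient,
# with the probability of class 0 first and class 1 second.
probabilities = [[0.82, 0.18], [0.35, 0.65], [0.51, 0.49], [0.07, 0.93]]

print(to_labels(probabilities))                 # [0, 1, 0, 1]
print(to_labels(probabilities, threshold=0.4))  # [0, 1, 1, 1]
```

Lowering the threshold flags the borderline third patient as diabetic, which trades more false alarms for fewer missed cases.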

Conclusion

In this article, we covered the foundations of EvalML and how to make predictions with it. We also discussed how to tune the search toward a specific objective. Nevertheless, there is still a lot to learn and understand. EvalML can also be used for time series analysis, regression, and NLP.

EvalML enables users to join diverse tables/data sources, build transformed and aggregated features, and then utilize these features to search for the best machine learning models when used in conjunction with Alteryx’s current products, Featuretools and Compose.

Refer to the following sites to learn more about EvalML:

I hope you find this article useful!
