Journey in Data Science

How Feature Engineering Can Greatly Improve Model Predictions: The Case of Medical Insurance Cost (With Code)

Photo by Martha Dominguez de Gouveia on Unsplash

When tackling a data science problem, many learners are tempted to chase complex and sophisticated machine learning algorithms. They concentrate their efforts on trying different algorithms and tuning hyperparameters, and leave it to the modeling algorithms to work out the hidden logic behind the data. While many machine learning models are very good at picking up patterns, sometimes a smart tweak of the features, based on critical structures discovered during exploratory data analysis, can bring a big breakthrough in model performance, no matter which algorithm is used. That is the case for this relatively simple but tricky dataset.

The Dataset

The dataset, which is cited in Brett Lantz’s book “Machine Learning with R”, has only 7 columns: age, sex, bmi, smoker or not, number of children, region, and finally the medical insurance charges, which is the dependent variable to be predicted. The dataset is clean, with no missing values, and contains 1,338 records. The dataset is available on GitHub here.

# import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# loading dataset
dataset = pd.read_csv('insurance.csv')
dataset

dataset.describe(include='all')

The Three-Tier Charging Structure

sns.pairplot(dataset[['age', 'bmi', 'children', 'charges']], diag_kind='kde')

The pairplot of the numerical features reveals some interesting things. First, the scatterplot of charges against age shows a three-tier structure, with most data points concentrated in three roughly parallel, upward-sloping clusters. Second, the scatterplot of charges against bmi also has a strange shape. Before exploring further, let’s first look at how charges are distributed across the categorical features.

sns.catplot(x='smoker', y='charges', hue='sex',
            col='region', aspect=.7,
            kind='box', data=dataset)

The boxplots of charges by sex and smoking status in each region reveal that, although the regions differ somewhat in their typical values, they all follow the same pattern: smokers have much larger charges, and while most non-smokers have lower charges, there is a notable number of outlying cases.

Now we go back to the scatterplots to see how smoking affects health insurance costs.

sns.scatterplot(data=dataset, x='age', y='charges', hue='smoker')

Perhaps as expected, the scatterplot of charges against age with smokers and non-smokers separated shows that the high charges tier consists of smokers, the low charges tier of non-smokers, and the middle tier has a mixture of both groups.

The separation of smokers and non-smokers in the charges against bmi scatterplot reveals something even more interesting: the cluster of smokers forms a butterfly shape. It is tempting to think that smokers and non-smokers form two linear regression lines with very different slopes. But if we look closer at the smokers cluster, instead of a gradual increase of healthcare charges with bmi, it looks more like a ruptured fault: before it the charges vary around 20k, and after it they jump a step to around 40k. And the demarcation line, bmi=30, is exactly the threshold that defines obesity. So it will be useful to make obese a new feature, and further divide the smokers and non-smokers groups into obese and not obese.

df=dataset.copy()

# create the obese feature (1 if bmi is strictly above 30, else 0)
df['obese'] = (df['bmi'] > 30).astype(int)

# encode the features of sex and smoker into numbers
df['sex'] = df['sex'].map({'female': 0, 'male': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

#further derive features marking obese smokers, not-obese smokers, obese-non-smoker and not-obese-non-smoker
df['obese_smoker']=df.apply( lambda r: 'not-obese-non-smoker' if ((r['smoker']==0) & (r['obese']==0)) 
                            else ('obese-smoker' if ((r['smoker']==1) & (r['obese']==1)) 
                                  else ('obese-non-smoker' if ((r['smoker']==0) & (r['obese']==1)) 
                                        else 'not-obese-smoker')), axis=1)
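
The same four-way label can also be derived without a row-wise `apply`. The block below is a minimal sketch of that alternative using `np.select`; the `toy` dataframe and its values are made up for illustration and are not part of the insurance data.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the encoded df (values are made up for illustration).
toy = pd.DataFrame({
    'bmi':    [22.0, 35.1, 28.4, 31.7],
    'smoker': [0, 0, 1, 1],
})
toy['obese'] = (toy['bmi'] > 30).astype(int)  # same threshold as above

# Each condition pairs with the label at the same position in `labels`.
conditions = [
    (toy['smoker'] == 0) & (toy['obese'] == 0),
    (toy['smoker'] == 1) & (toy['obese'] == 1),
    (toy['smoker'] == 0) & (toy['obese'] == 1),
]
labels = ['not-obese-non-smoker', 'obese-smoker', 'obese-non-smoker']

# Rows matching none of the conditions fall through to the default label.
toy['obese_smoker'] = np.select(conditions, labels, default='not-obese-smoker')
print(toy)
```

On a dataset of this size the speed difference hardly matters; the main benefit is that the conditions and their labels sit side by side and are easier to check.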




sns.scatterplot(data=df, x='age', y='charges', hue='obese_smoker')

Further division of the smokers group shows that obese smokers almost exclusively form the high charges tier, and most of the not-obese smokers occupy the middle tier. On the other hand, obesity does not have the same effect for non-smokers: there appear to be roughly as many obese as not-obese non-smokers in both the lower and middle tiers. The straight regression lines of these groups also do not fit the clusters very well, indicating that within each cluster age does not have an exactly linear relation with charges; upward-bending curves may fit better.

sns.lmplot(data=df, x='age', y='charges', hue='obese_smoker')

And in the charges against bmi plot, the separation of obese and not-obese smokers clearly shows a pair of regression lines that step up at the obesity threshold.

sns.lmplot(data=df, x='bmi', y='charges', hue='obese_smoker')
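
The tiers seen in the plots can also be checked numerically. The helper below is a small sketch, not part of the original analysis: `tier_summary` is a hypothetical name, and the `toy` rows are made up so that the block runs on its own. With the real data it would be called as `tier_summary(df)` before the one-hot encoding step, while the `obese_smoker` column still holds the group labels.

```python
import pandas as pd

def tier_summary(frame):
    """Count, mean and median of charges for each obese/smoker group."""
    return (frame.groupby('obese_smoker')['charges']
                 .agg(['count', 'mean', 'median'])
                 .sort_values('mean'))

# Made-up rows for illustration only; with the real data, call tier_summary(df).
toy = pd.DataFrame({
    'obese_smoker': ['obese-smoker', 'obese-smoker',
                     'not-obese-smoker', 'not-obese-smoker',
                     'obese-non-smoker', 'not-obese-non-smoker'],
    'charges': [41000.0, 39000.0, 21000.0, 19000.0, 8000.0, 6000.0],
})
summary = tier_summary(toy)
print(summary)
```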

But for the non-smokers, no other variable produces clear-cut subgroups, with the possible exception that non-smokers with no children tend to concentrate in the lower part of the low charges tier. The outlying points in the middle tier apparently cannot be explained by the features available in the dataset; they are perhaps related to other factors such as genetic diseases or pre-existing conditions.

# copy the slice so that adding a column does not trigger SettingWithCopyWarning
non_smoker = df.loc[df.smoker == 0].copy()
non_smoker['has_children'] = (non_smoker['children'] > 0).astype(int)
sns.scatterplot(data=non_smoker, x='age', y='charges', hue='has_children')

The final step of preprocessing is to use one-hot encoding to transform the remaining categorical features ‘region’ and ‘obese_smoker’ into separate columns of binary numerical values. I keep all the features in the dataframe for now, and later select different subsets of them to train the same pair of models, to illustrate the impact of feature engineering on model performance.

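# one-hot encoding of region and obese_smoker features
# dtype=int keeps the new columns numeric (recent pandas versions default to booleans)
df = pd.get_dummies(df, columns=['region'], dtype=int)
df = pd.get_dummies(df, columns=['obese_smoker'], dtype=int)
df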

Making Models

For convenience of coding, I implement both models with the Keras API from TensorFlow. The first, a linear model, is a single layer with one linear neuron, which is equivalent to multiple linear regression. The second model is a deep neural network (dnn) with two hidden dense layers of 64 and 128 neurons, both activated by rectified linear units (relu), which makes it more capable of capturing complex patterns in the data. Both models have a normalization layer to normalize the values of each feature, and use mean absolute error (mae) both as the loss to minimize and as the evaluation metric. As the code will be used repeatedly for different sets of features, I put all of it into a function.

def make_models(df):
    # separation of train and test datasets
    train_dataset = df.sample(frac=0.8, random_state=99)
    test_dataset = df.drop(train_dataset.index)

    train_labels = train_dataset.pop('charges')
    test_labels = test_dataset.pop('charges')

    # setting normalization layer
    normalizer = tf.keras.layers.Normalization(axis=-1)
    normalizer.adapt(np.array(train_dataset))

    # setting up single layer of neural network as linear model
    linear_model = tf.keras.Sequential([
        normalizer,
        layers.Dense(units=1)
    ])

    linear_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=10),
        loss='mean_absolute_error', metrics=['MeanAbsoluteError'])
    
    # setting up deep neural network model
    dnn_model = keras.Sequential([
      normalizer,
      layers.Dense(64, activation='relu'),
      layers.Dense(128, activation='relu'),
      layers.Dense(1)
      ])

    dnn_model.compile(loss='mean_absolute_error', metrics=['MeanAbsoluteError'], optimizer=tf.keras.optimizers.Adam(0.001))

    def train_and_evaluate_model(model):
        # training model using train set
        history = model.fit(
            train_dataset,
            train_labels,
            epochs=100,
            verbose=0,
            # Calculate validation results on 20% of the training data.
            validation_split = 0.2)
        
        # model evaluation and prediction using test set
        loss, mae = model.evaluate(test_dataset, test_labels, verbose=2)
        
        test_predictions = model.predict(test_dataset).flatten()

        # plot predictions
        a = plt.axes(aspect='equal')
        plt.scatter(test_labels, test_predictions)
        plt.xlabel('True values (charges)')
        plt.ylabel('Predictions (charges)')
        lims = [0, 50000]
        plt.xlim(lims)
        plt.ylim(lims)
        _ = plt.plot(lims,lims)
        plt.show()
        
    print('linear model')
    train_and_evaluate_model(linear_model)
    print('dnn model')
    train_and_evaluate_model(dnn_model)
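
To make the two shared ingredients of the function more concrete, here is a plain NumPy sketch of roughly what the normalization layer and the mean absolute error compute. It is an illustration under simplifying assumptions, not the Keras implementation; `standardize` and `mean_absolute_error` are hypothetical helper names and the numbers are made up.

```python
import numpy as np

def standardize(train, other):
    """Scale both arrays with the mean and std of the training columns only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    return (train - mean) / std, (other - mean) / std

def mean_absolute_error(y_true, y_pred):
    """Average size of the prediction errors, in the units of the target."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

# Made-up numbers for illustration only.
train = np.array([[20.0, 25.0], [40.0, 30.0], [60.0, 35.0]])
test = np.array([[40.0, 30.0]])
train_scaled, test_scaled = standardize(train, test)

mae = mean_absolute_error([1000.0, 2000.0, 3000.0], [1100.0, 1900.0, 3300.0])
print(train_scaled)
print(test_scaled)
print(mae)
```

Because the statistics come from the training rows only, the test rows are scaled with the same constants, which mirrors calling `normalizer.adapt` on `train_dataset` alone. An mae expressed in the units of charges is also what makes the scores below directly comparable with the mean charges.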

Experimenting with Different Feature Sets for Model Training

The Original Features Set

We begin with the original set of features, encoded. The resulting linear and dnn models have maes of around 3,450 and 3,150 respectively. Compared with the mean charges of 13,270, they are decent but not great. The plots of predicted charges against true values show that both models do a good job of predicting cases with lower charges, but when the charges go higher than around 15,000, which is about the end of the low charges tier, both models have difficulty grasping the tiered structure, resulting in groups whose predicted charges are consistently higher or lower than the true values. The dnn model does better at the higher end and has a better score.

df_original_features=df[['age','sex','bmi','children','smoker','charges','region_northeast','region_northwest','region_southeast','region_southwest']]
make_models(df_original_features)

linear model
9/9 - 0s - loss: 3458.0896 - mean_absolute_error: 3458.0896

dnn model
9/9 - 0s - loss: 3153.5691 - mean_absolute_error: 3153.5691

Add Obese Feature

Then we add the binary feature ‘obese’ to the feature set. The linear model does not get better. But the dnn model benefits from this additional information, which brings the overestimated group closer to their true charges, and its mae is reduced to around 2,300.

df_add_obese=df[['age','sex','bmi','children','smoker','charges','region_northeast','region_northwest','region_southeast','region_southwest', 'obese']]
make_models(df_add_obese)

linear model
9/9 - 0s - loss: 3449.2312 - mean_absolute_error: 3449.2312

dnn model
9/9 - 0s - loss: 2293.2112 - mean_absolute_error: 2293.2112

Add Feature Cross of Obese and Smoker

Apparently, even with the features ‘obese’ and ‘smoker’ present, the algorithms need some help to fully realize the importance of dividing the data into the ‘obese smoker’, ‘not obese smoker’, ‘obese non-smoker’ and ‘not obese non-smoker’ groups. When we add these four features, both models make a big breakthrough in accuracy. The mae of the linear model falls from around 3,450 to 1,640, and that of the dnn model from around 2,300 to 1,260. Both models get the majority of predictions close to the true values, with the exception of the underestimated group. This group is likely the non-smokers who get mid-tier charges, which cannot be explained by the features available.

make_models(df)

linear model
9/9 - 0s - loss: 1641.2939 - mean_absolute_error: 1641.2939

dnn model
9/9 - 0s - loss: 1258.0736 - mean_absolute_error: 1258.0736
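
Why a feature cross helps a linear model so much can be illustrated on synthetic data. The sketch below is not the Keras setup used in this article: it fits ordinary least squares with `np.linalg.lstsq` (which minimizes squared error rather than mae), and the charges are made up so that obesity matters only for smokers.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
smoker = rng.integers(0, 2, size=n)
obese = rng.integers(0, 2, size=n)

# Made-up charges: obesity only matters for smokers (a pure interaction).
charges = 5000 + 15000 * smoker + 20000 * smoker * obese + rng.normal(0, 500, size=n)

def fit_mae(columns):
    """Least-squares fit with an intercept; returns the in-sample MAE."""
    X = np.column_stack([np.ones(n)] + columns)
    coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
    return np.mean(np.abs(charges - X @ coef))

mae_additive = fit_mae([smoker, obese])
mae_crossed = fit_mae([smoker, obese, smoker * obese])
print(f'additive: {mae_additive:.0f}, with cross: {mae_crossed:.0f}')
```

An additive model can only give obesity one fixed effect for everyone, so it splits the difference between smokers and non-smokers. The product column lets the fit assign the jump to obese smokers alone, which is the role the derived group features play here.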

Drop Features

With the four derived features at hand, the original features ‘obese’ and ‘smoker’ may have become redundant. How about dropping them from the feature set? The performance of the linear model remains more or less the same, but surprisingly the mae of the dnn model increases a bit, to around 1,320.

df_drop_obsoletes=df.drop(['smoker','obese'], axis=1)
make_models(df_drop_obsoletes)

linear model
9/9 - 0s - loss: 1645.9388 - mean_absolute_error: 1645.9388

dnn model
9/9 - 0s - loss: 1321.8497 - mean_absolute_error: 1321.8497

When doing one-hot encoding, many people advise dropping one column per encoded feature to avoid the problem of multicollinearity in regression. Doing so, in addition to dropping the ‘obese’ and ‘smoker’ features, does not make a notable difference to the linear model, and the performance of the dnn model worsens slightly further, to around 1,340 mae.

df_drop_more=df.drop(['smoker','obese','region_northeast','obese_smoker_not-obese-non-smoker'], axis=1)
make_models(df_drop_more)

linear model
9/9 - 0s - loss: 1644.4882 - mean_absolute_error: 1644.4882

dnn model
9/9 - 0s - loss: 1342.9359 - mean_absolute_error: 1342.9359
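
The multicollinearity concern can be seen directly on a toy example. The sketch below uses made-up rows and the `drop_first` option of `pd.get_dummies`, which is one common way to drop a column per encoded feature; it is an alternative to dropping the columns by name as done above.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({'region': ['northeast', 'northwest', 'southeast', 'southwest']})

full = pd.get_dummies(toy, columns=['region'], dtype=int)
reduced = pd.get_dummies(toy, columns=['region'], drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, so the columns together
# duplicate the information carried by an intercept term.
print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

With `drop_first=True` the first category in sorted order is dropped, which for the regions is the same `region_northeast` column removed above.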

While the features for obese and not-obese smokers lead to a big leap in model performance, the two non-smoker groups apparently do not bring any benefit. Dropping them along with the ‘obese’ and ‘smoker’ features again does not matter much for the linear model, while the mae of the dnn model stays at around 1,350. It seems that the dnn model favors more information rather than less, even though some of it may be redundant.

df_drop_more2=df.drop(['obese','smoker','obese_smoker_obese-non-smoker','obese_smoker_not-obese-non-smoker'], axis=1)
make_models(df_drop_more2)

linear model
9/9 - 0s - loss: 1653.7396 - mean_absolute_error: 1653.7396

dnn model
9/9 - 0s - loss: 1347.0553 - mean_absolute_error: 1347.0553

Age Squared

The final issue follows from the observation that the charges against age clusters have a slightly concave-upward shape rather than a strictly linear one. One way to fit this shape is to square the values of age. It can be done in two ways: either by replacing the original values of the ‘age’ feature with the squared ones, or by adding a new feature holding the squared values. The trials show that both approaches have a similar effect: the mae of the linear model can be further reduced to around 1,330, while the mae of the dnn model remains at around 1,260. Seemingly, the relu functions of the neurons in the dnn model already capture the shape of the curves.

df_age_replaced_by_squared=df.copy()
df_age_replaced_by_squared['age']=df_age_replaced_by_squared['age'].transform(lambda x: x*x)
make_models(df_age_replaced_by_squared)

linear model
9/9 - 0s - loss: 1331.4283 - mean_absolute_error: 1331.4283

dnn model
9/9 - 0s - loss: 1270.3137 - mean_absolute_error: 1270.3137

df_add_age_squared=df.copy()
df_add_age_squared['age_squared']=df_add_age_squared['age'].transform(lambda x: x*x)
make_models(df_add_age_squared)

linear model
9/9 - 0s - loss: 1331.0626 - mean_absolute_error: 1331.0626

dnn model
9/9 - 0s - loss: 1260.3253 - mean_absolute_error: 1260.3253
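
The effect of squaring age on a linear fit can be reproduced on synthetic data. The sketch below is an illustration only: the charges are made up to curve upward with age, and the fit uses `np.polyfit` (least squares) rather than the Keras linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.integers(18, 65, size=300).astype(float)

# Made-up charges that curve upward with age.
charges = 2000 + 3 * age**2 + rng.normal(0, 100, size=age.size)

def linear_fit_mae(feature):
    """Fit charges = a * feature + b by least squares; return in-sample MAE."""
    a, b = np.polyfit(feature, charges, deg=1)
    return np.mean(np.abs(charges - (a * feature + b)))

mae_age = linear_fit_mae(age)
mae_age_squared = linear_fit_mae(age**2)
print(f'age: {mae_age:.0f}, age squared: {mae_age_squared:.0f}')
```

A straight line in age overshoots in the middle of the range and undershoots at both ends, while a straight line in age squared can follow the curve, leaving mostly noise in the residuals.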

Conclusion: A Little Tweak in Features Can Have a Big Effect

The above trials show that, by identifying obesity as a critical factor in this dataset and deriving grouping features accordingly, the prediction accuracy of both simple and sophisticated models can be greatly improved. And with the values of the ‘age’ feature squared, the performance of a simple multiple linear regression model can be brought close to that of a deep learning model. These are simple tweaks to the features, involving no fancy coding or rocket science algorithms, just observing the data and acting accordingly.
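
For a side-by-side view, the short sketch below expresses some of the test-set maes printed above as a share of the mean charges of 13,270 quoted earlier. The numbers are copied from the outputs in this article; `as_share_of_mean` is a hypothetical helper name, and the "add age squared" row uses the run that adds the squared feature.

```python
mean_charges = 13270  # mean of charges quoted earlier in the article

# Test-set MAE values copied from the outputs printed above.
reported_mae = {
    'original features':  {'linear': 3458.0896, 'dnn': 3153.5691},
    'add obese':          {'linear': 3449.2312, 'dnn': 2293.2112},
    'add obese x smoker': {'linear': 1641.2939, 'dnn': 1258.0736},
    'add age squared':    {'linear': 1331.0626, 'dnn': 1260.3253},
}

def as_share_of_mean(mae, mean=mean_charges):
    """Express an MAE as a percentage of the mean charges."""
    return 100 * mae / mean

for name, scores in reported_mae.items():
    shares = ', '.join(f'{model}: {as_share_of_mean(mae):.1f}%'
                       for model, mae in scores.items())
    print(f'{name:<20} {shares}')
```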