Problem

We have medical insurance data for a set of patients. We will use the independent variables to build a machine learning model that estimates the insurance charges. Since the medical charge is a numeric value, this is a regression problem.

Charges is the dependent variable; the independent variables are age, sex, bmi, children, smoker, and region.

Load Libraries and Data

library(dplyr) # data manipulation verbs
library(caret) # model training and evaluation

insurance <- readRDS("insurance.rds") # previously prepared data set

Exploratory Data Analysis (EDA)

The exploratory work was already done in the data analysis section, so here we only take a quick look at the data.

knitr::kable(head(insurance))
age  sex     bmi     children  smoker  region     charges
19   female  27.900  0         yes     southwest  16884.924
39   male    33.770  1         no      southeast   1725.552
28   male    33.000  3         no      southeast   4449.462
33   male    22.705  0         no      northwest  21984.471
32   male    28.880  0         no      northwest   3866.855
31   female  25.740  0         no      southeast   3756.622
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 39 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Algorithm

Since this is a regression problem, we will use the multiple linear regression algorithm to build the medical insurance predictive model.

Simple Linear Regression

Simple linear regression is a basic method for predicting a quantitative value and for studying the relationship between two continuous variables, say \(X\) and \(Y\). Mathematically, simple linear regression can be written as:

\[Y = a + bX + e\]

where \(Y\) is the dependent variable, \(X\) is the independent variable, \(a\) is the intercept, \(b\) is the slope of \(X\), and \(e\) is the error term.

The main task of the linear regression method is to find the best-fitting straight line through the \(X\) and \(Y\) points.
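
As a quick illustration, here is a minimal sketch of simple linear regression with base R's lm(), using age as the single predictor of charges (it assumes the insurance data frame loaded above):

# Fit charges as a linear function of age alone
simple_fit <- lm(charges ~ age, data = insurance)
coef(simple_fit) # intercept a and slope b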

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.

Multiple linear regression uses several predictors. With \(p\) predictors, the equation looks like:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + e\]

where:

  • \(Y\) is the response or dependent variable,

  • \(\beta_0\) is the intercept,

  • \(x_1\) and \(x_2\) are the predictors or independent variables,

  • \(\beta_1\) and \(\beta_2\) are the coefficients for \(x_1\) and \(x_2\) respectively, and

  • \(e\) is the error term.
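
For instance, a minimal sketch of the same idea with two predictors from the insurance data, age and bmi:

# Fit charges on two predictors; coef() returns beta0, beta1, beta2
multi_fit <- lm(charges ~ age + bmi, data = insurance)
coef(multi_fit)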

Splitting the Data

In machine learning we need training data to fit the model and testing data to measure its performance. We will use 75% of the data for training and the remainder for testing.

set.seed(101) # Set Seed so that same sample can be reproduced in future also
 
sample <- sample.int(n = nrow(insurance), size = floor(.75 * nrow(insurance)), replace = FALSE)
train_df <- insurance[sample, ] # 75% data for the training
test_df  <- insurance[-sample, ] # remaining data  for testing
dim(train_df)
## [1] 1003    7
dim(test_df)
## [1] 335   7

We have 1003 rows in the training set and 335 rows in the testing set.
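
As an aside, caret provides createDataPartition(), which splits while keeping the distribution of the outcome balanced between the two sets; a sketch with the same 75/25 ratio:

set.seed(101)
# Stratified split on charges: list = FALSE returns a plain index vector
idx <- createDataPartition(insurance$charges, p = 0.75, list = FALSE)
train_alt <- insurance[idx, ]
test_alt  <- insurance[-idx, ]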

Train the Model

We will use the linear regression model from the caret package for our insurance problem. Before training, we set aside the true charges of the test set and drop that column from the predictors.

knitr::kable(head(test_df))
     age  sex     bmi     children  smoker  region     charges
2    39   male    33.770  1         no      southeast   1725.552
4    33   male    22.705  0         no      northwest  21984.471
6    31   female  25.740  0         no      southeast   3756.622
8    37   female  27.740  3         no      northwest   7281.506
15   27   male    42.130  0         yes     southeast  39611.758
26   59   female  27.720  3         no      southeast  14001.134
real_charges <- select(test_df, charges) # true charges, kept for evaluation
test_df <- select(test_df, -charges)     # predictors only
ins_model <- train(charges ~ .
                        ,train_df
                        ,method = "lm"
                        )
ins_model
## Linear Regression 
## 
## 1003 samples
##    6 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1003, 1003, 1003, 1003, 1003, 1003, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6364.451  0.7248959  4416.144
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
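
Note the Resampling line above: by default caret estimates performance with 25 bootstrap resamples. If cross-validation is preferred, it can be requested through trainControl(); a sketch using 10-fold CV:

# 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)
ins_model_cv <- train(charges ~ ., train_df, method = "lm", trControl = ctrl)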
summary(ins_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28647  -3039  -1072   1814  29678 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -10886.35    1183.76  -9.196   <2e-16 ***
## age                227.74      14.42  15.798   <2e-16 ***
## sexmale            274.18     396.51   0.691   0.4894    
## bmi                347.64      34.23  10.155   <2e-16 ***
## children           425.81     169.09   2.518   0.0119 *  
## smokeryes        23553.52     492.65  47.810   <2e-16 ***
## regionnorthwest   -930.97     574.91  -1.619   0.1057    
## regionsoutheast  -1312.03     562.47  -2.333   0.0199 *  
## regionsouthwest  -1258.10     575.88  -2.185   0.0291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared:  0.7337, Adjusted R-squared:  0.7315 
## F-statistic: 342.3 on 8 and 994 DF,  p-value: < 2.2e-16

The output of summary() lists eight predictor coefficients (plus the intercept) even though we passed only six features. This happens because caret applies a technique called dummy coding to each factor-type variable, as illustrated below.
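
Base R's model.matrix() makes the dummy coding visible; for example, the four-level region factor expands into three indicator columns, with the first level (northeast) absorbed into the intercept as the reference:

# Each non-reference level of region becomes its own 0/1 column
head(model.matrix(~ region, data = insurance))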

  • The Residuals section provides summary statistics for the errors in our predictions. A residual equals the true value minus the predicted value.

  • The p-value, denoted Pr(>|t|), estimates how likely it is that the observed relationship between an independent variable and the dependent variable is due to chance alone. Using a significance level of 0.05 as the threshold, independent variables with p-values below this level are considered statistically significant.

  • The Multiple R-squared value measures how well the model as a whole explains the values of the dependent variable; the closer the value is to 1.0, the better the model explains the data.

cor(predict(ins_model,test_df), real_charges)
##        charges
## [1,] 0.8055643
real_charges$prediction <- predict(ins_model,test_df)
knitr::kable(head(real_charges))
     charges    prediction
2     1725.552   9123.411
4    21984.471   3865.569
6     3756.622   3809.918
8     7281.506   7530.165
15   39611.758  32424.453
26   14001.134  12152.521

Improve Model

BMI may have little impact on medical expenditure for individuals in the normal weight range, but it may be strongly related to higher costs for obese people, i.e., BMI >= 30.

We can model this relationship by creating a binary obesity indicator variable: 1 if BMI >= 30 and 0 if BMI < 30.

train_df$bmi30 <- ifelse(train_df$bmi >= 30, 1, 0)
test_df$bmi30  <- ifelse(test_df$bmi >= 30, 1, 0)

In linear regression, the relationship between an independent variable and the dependent variable is assumed to be linear, but this may not always hold.

  • The effect of age on medical expenditure may not be constant across age groups; treatment may become disproportionately expensive for the oldest population.

  • To give the model a non-linear age effect, we add the square of age (see the caveat after the code below).

  • When two features have a combined effect, this is known as an interaction.

We suspect that the obesity indicator (bmi30) and the smoker indicator interact with each other. We will add bmi30*smoker to the model (in R's formula syntax this expands to bmi30 + smoker + bmi30:smoker) and test our hypothesis.

new_model <- train(charges ~ age+ age^2+ children + bmi+ bmi30*smoker + region 
                        ,train_df
                        ,method = "lm"
                        )
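
One caveat: in R's formula syntax, ^ means interaction crossing, so age^2 expands to just age and no squared term is actually added (notice that the summary below has no age-squared coefficient). To genuinely include the quadratic term, it must be wrapped in I(); a corrected sketch:

# I() protects arithmetic inside a formula, creating a true age-squared term
new_model_sq <- train(charges ~ age + I(age^2) + children + bmi + bmi30*smoker + region
                        ,train_df
                        ,method = "lm"
                        )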

Compare the Two Models

summary(new_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37254  -1952  -1143    -38  30576 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4451.53    1172.52  -3.797 0.000156 ***
## age                 239.93      11.17  21.488  < 2e-16 ***
## children            531.72     130.97   4.060  5.3e-05 ***
## bmi                 140.07      42.84   3.269 0.001115 ** 
## bmi30             -1022.20     529.17  -1.932 0.053683 .  
## smokeryes         12727.41     573.17  22.205  < 2e-16 ***
## regionnorthwest    -658.96     445.12  -1.480 0.139082    
## regionsoutheast   -1065.91     435.82  -2.446 0.014628 *  
## regionsouthwest   -1501.03     445.86  -3.367 0.000790 ***
## `bmi30:smokeryes` 19357.92     764.94  25.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4837 on 993 degrees of freedom
## Multiple R-squared:  0.8406, Adjusted R-squared:  0.8391 
## F-statistic: 581.8 on 9 and 993 DF,  p-value: < 2.2e-16
summary(ins_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28647  -3039  -1072   1814  29678 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -10886.35    1183.76  -9.196   <2e-16 ***
## age                227.74      14.42  15.798   <2e-16 ***
## sexmale            274.18     396.51   0.691   0.4894    
## bmi                347.64      34.23  10.155   <2e-16 ***
## children           425.81     169.09   2.518   0.0119 *  
## smokeryes        23553.52     492.65  47.810   <2e-16 ***
## regionnorthwest   -930.97     574.91  -1.619   0.1057    
## regionsoutheast  -1312.03     562.47  -2.333   0.0199 *  
## regionsouthwest  -1258.10     575.88  -2.185   0.0291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared:  0.7337, Adjusted R-squared:  0.7315 
## F-statistic: 342.3 on 8 and 994 DF,  p-value: < 2.2e-16

The multiple linear regression equation of the improved model is:

\[charges = -4451.53 + 239.93\,age + 531.72\,children + 140.07\,bmi - 1022.20\,bmi30 + 12727.41\,smokeryes - 658.96\,regionnorthwest - 1065.91\,regionsoutheast - 1501.03\,regionsouthwest + 19357.92\,bmi30 \times smokeryes\]
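
Rather than copying these numbers by hand, the estimates can be pulled from the fitted object:

# The underlying lm fit is stored in finalModel
coef(new_model$finalModel)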

cor(predict(new_model,test_df), real_charges)
##        charges prediction
## [1,] 0.8729587   0.923047
real_charges$prediction_new <- predict(new_model,test_df)
knitr::kable(head(real_charges,10))
     charges    prediction  prediction_new
2     1725.552   9123.4114        8079.622
4    21984.471   3865.5690        5987.571
6     3756.622   3809.9177        5525.875
8     7281.506   7530.1648        9247.715
15   39611.758  32424.4532       37925.052
26   14001.134  12152.5206       14116.453
28   12268.632  12954.1099       12717.822
36    1625.434   -115.4669        2309.169
41    3046.062   3826.7110        5032.727
54   37742.576  31797.3293       39005.883
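
Beyond eyeballing the predictions, caret's postResample() computes RMSE, R-squared, and MAE on the held-out data, giving a direct numeric comparison of the two models:

# Test-set metrics for the original and the improved model
postResample(pred = predict(ins_model, test_df), obs = real_charges$charges)
postResample(pred = predict(new_model, test_df), obs = real_charges$charges)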

Save Model

saveRDS(new_model,"Final_model.rds")
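
Loading the model back later is symmetric; a quick sketch, assuming the .rds file is in the working directory:

# Restore the trained model and predict on new data
final_model <- readRDS("Final_model.rds")
head(predict(final_model, test_df))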

Model Deployment

We can deploy the model as a web service in R using the Shiny framework from RStudio.
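
A minimal sketch of such a Shiny service is below; the widget names and default values are illustrative assumptions, and the app expects Final_model.rds from the previous step:

library(shiny)

model <- readRDS("Final_model.rds") # trained caret model saved above

ui <- fluidPage(
  titlePanel("Insurance Charges Estimator"),
  numericInput("age", "Age", value = 30, min = 18, max = 100),
  numericInput("bmi", "BMI", value = 25, min = 10, max = 60),
  numericInput("children", "Number of children", value = 0, min = 0, max = 10),
  selectInput("smoker", "Smoker", choices = c("no", "yes")),
  selectInput("region", "Region",
              choices = c("northeast", "northwest", "southeast", "southwest")),
  textOutput("estimate")
)

server <- function(input, output) {
  output$estimate <- renderText({
    # One-row data frame with the same columns the model was trained on
    newdata <- data.frame(age = input$age, bmi = input$bmi,
                          children = input$children, smoker = input$smoker,
                          region = input$region)
    newdata$bmi30 <- ifelse(newdata$bmi >= 30, 1, 0) # derived obesity flag
    paste("Estimated charges:", round(predict(model, newdata), 2))
  })
}

shinyApp(ui, server)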