Problem

We have medical insurance data for a set of patients. We will use the independent variables to build a machine learning model that estimates the insurance charges. Since the medical charge is a numeric value, this is a regression problem.

Charges is the dependent variable; the independent variables are age, sex, bmi, children, smoker, and region.

Load Libraries and Data

library(dplyr) # data manipulation verbs
library(caret) # model training and evaluation

insurance <- readRDS("insurance.rds") # previously prepared data set

Exploratory Data Analysis (EDA)

The exploratory work was already done in the data analysis section, so here we only take a quick look at the data.

knitr::kable(head(insurance))
age  sex     bmi     children  smoker  region     charges
19   female  27.900  0         yes     southwest  16884.924
39   male    33.770  1         no      southeast   1725.552
28   male    33.000  3         no      southeast   4449.462
33   male    22.705  0         no      northwest  21984.471
32   male    28.880  0         no      northwest   3866.855
31   female  25.740  0         no      southeast   3756.622
str(insurance)
## 'data.frame':    1338 obs. of  7 variables:
##  $ age     : int  19 39 28 33 32 31 46 37 37 60 ...
##  $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
##  $ bmi     : num  27.9 33.8 33 22.7 28.9 ...
##  $ children: int  0 1 3 0 0 0 1 3 2 0 ...
##  $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
##  $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
##  $ charges : num  16885 1726 4449 21984 3867 ...

Algorithm

Since this is a regression problem, we will use the multiple linear regression algorithm to build the medical insurance predictive model.

Simple Linear Regression

Simple linear regression is a basic method for predicting a quantitative value and for studying the relationship between two continuous variables, say \(X\) and \(Y\). Mathematically, simple linear regression can be written as:

\[Y = a + bX + e\]

where \(Y\) is the dependent variable, \(X\) is the independent variable, \(a\) is the intercept, \(b\) is the slope of \(X\), and \(e\) is the error term.

The main task of the linear regression method is to find the best-fitting straight line through the \(X\) and \(Y\) points.
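
As a quick illustration, here is a minimal sketch of simple linear regression with base R's lm(), using age as the single predictor of charges (it assumes the insurance data frame loaded above):

# Fit charges as a linear function of age alone
simple_fit <- lm(charges ~ age, data = insurance)
coef(simple_fit) # intercept a and slope b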

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.

Multiple linear regression uses several predictors. With \(p\) predictors, the equation looks like:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + e\]

where:

  • \(Y\) is the response or dependent variable,

  • \(\beta_0\) is the intercept,

  • \(x_1\) and \(x_2\) are the predictors or independent variables,

  • \(\beta_1\) and \(\beta_2\) are the coefficients for \(x_1\) and \(x_2\) respectively, and

  • \(e\) is the error term.
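
For instance, a minimal sketch of the same idea with two predictors from the insurance data, age and bmi:

# Fit charges on two predictors; coef() returns beta0, beta1, beta2
multi_fit <- lm(charges ~ age + bmi, data = insurance)
coef(multi_fit)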

Splitting the Data

In machine learning we need training data to fit the model and testing data to measure its performance. We will use 75% of the data for training and the remainder for testing.

set.seed(101) # Set Seed so that same sample can be reproduced in future also
 
sample <- sample.int(n = nrow(insurance), size = floor(.75 * nrow(insurance)), replace = FALSE)
train_df <- insurance[sample, ] # 75% data for the training
test_df  <- insurance[-sample, ] # remaining data  for testing
dim(train_df)
## [1] 1003    7
dim(test_df)
## [1] 335   7

We have 1003 rows in the training set and 335 rows in the testing set.
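
As an aside, caret provides createDataPartition(), which splits while keeping the distribution of the outcome balanced between the two sets; a sketch with the same 75/25 ratio:

set.seed(101)
# Stratified split on charges: list = FALSE returns a plain index vector
idx <- createDataPartition(insurance$charges, p = 0.75, list = FALSE)
train_alt <- insurance[idx, ]
test_alt  <- insurance[-idx, ]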

Train the Model

We will use the linear regression model from the caret package for our insurance problem. Before training, we set aside the true charges of the test set and drop that column from the predictors.

knitr::kable(head(test_df))
     age  sex     bmi     children  smoker  region     charges
2    39   male    33.770  1         no      southeast   1725.552
4    33   male    22.705  0         no      northwest  21984.471
6    31   female  25.740  0         no      southeast   3756.622
8    37   female  27.740  3         no      northwest   7281.506
15   27   male    42.130  0         yes     southeast  39611.758
26   59   female  27.720  3         no      southeast  14001.134
real_charges <- select(test_df, charges) # true charges, kept for evaluation
test_df <- select(test_df, -charges)     # predictors only
ins_model <- train(charges ~ .
                        ,train_df
                        ,method = "lm"
                        )
ins_model
## Linear Regression 
## 
## 1003 samples
##    6 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 1003, 1003, 1003, 1003, 1003, 1003, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   6364.451  0.7248959  4416.144
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
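
Note the Resampling line above: by default caret estimates performance with 25 bootstrap resamples. If cross-validation is preferred, it can be requested through trainControl(); a sketch using 10-fold CV:

# 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)
ins_model_cv <- train(charges ~ ., train_df, method = "lm", trControl = ctrl)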
summary(ins_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28647  -3039  -1072   1814  29678 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -10886.35    1183.76  -9.196   <2e-16 ***
## age                227.74      14.42  15.798   <2e-16 ***
## sexmale            274.18     396.51   0.691   0.4894    
## bmi                347.64      34.23  10.155   <2e-16 ***
## children           425.81     169.09   2.518   0.0119 *  
## smokeryes        23553.52     492.65  47.810   <2e-16 ***
## regionnorthwest   -930.97     574.91  -1.619   0.1057    
## regionsoutheast  -1312.03     562.47  -2.333   0.0199 *  
## regionsouthwest  -1258.10     575.88  -2.185   0.0291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared:  0.7337, Adjusted R-squared:  0.7315 
## F-statistic: 342.3 on 8 and 994 DF,  p-value: < 2.2e-16

The output of summary() lists eight predictor coefficients (plus the intercept) even though we passed only six features. This happens because caret applies a technique called dummy coding to each factor-type variable, as illustrated below.
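
Base R's model.matrix() makes the dummy coding visible; for example, the four-level region factor expands into three indicator columns, with the first level (northeast) absorbed into the intercept as the reference:

# Each non-reference level of region becomes its own 0/1 column
head(model.matrix(~ region, data = insurance))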

  • The Residuals section provides summary statistics for the errors in our predictions. A residual equals the true value minus the predicted value.

  • The p-value, denoted Pr(>|t|), estimates how likely it is that the observed relationship between an independent variable and the dependent variable is due to chance alone. Using a significance level of 0.05 as the threshold, independent variables with p-values below this level are considered statistically significant.

  • The Multiple R-squared value measures how well the model as a whole explains the values of the dependent variable; the closer the value is to 1.0, the better the model explains the data.

cor(predict(ins_model,test_df), real_charges)
##        charges
## [1,] 0.8055643
real_charges$prediction <- predict(ins_model,test_df)
knitr::kable(head(real_charges))
     charges    prediction
2     1725.552   9123.411
4    21984.471   3865.569
6     3756.622   3809.918
8     7281.506   7530.165
15   39611.758  32424.453
26   14001.134  12152.521

Improve Model

BMI may have little impact on medical expenditure for individuals in the normal weight range, but it may be strongly related to higher costs for obese people, i.e., BMI >= 30.

We can model this relationship by creating a binary obesity indicator variable: 1 if BMI >= 30 and 0 if BMI < 30.

train_df$bmi30 <- ifelse(train_df$bmi >= 30, 1, 0)
test_df$bmi30  <- ifelse(test_df$bmi >= 30, 1, 0)

In linear regression, the relationship between an independent variable and the dependent variable is assumed to be linear, but this may not always hold.

  • The effect of age on medical expenditure may not be constant across age groups; treatment may become disproportionately expensive for the oldest population.

  • To give the model a non-linear age effect, we add the square of age (see the caveat after the code below).

  • When two features have a combined effect, this is known as an interaction.

We suspect that the obesity indicator (bmi30) and the smoker indicator interact with each other. We will add bmi30*smoker to the model (in R's formula syntax this expands to bmi30 + smoker + bmi30:smoker) and test our hypothesis.

new_model <- train(charges ~ age+ age^2+ children + bmi+ bmi30*smoker + region 
                        ,train_df
                        ,method = "lm"
                        )
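
One caveat: in R's formula syntax, ^ means interaction crossing, so age^2 expands to just age and no squared term is actually added (notice that the summary below has no age-squared coefficient). To genuinely include the quadratic term, it must be wrapped in I(); a corrected sketch:

# I() protects arithmetic inside a formula, creating a true age-squared term
new_model_sq <- train(charges ~ age + I(age^2) + children + bmi + bmi30*smoker + region
                        ,train_df
                        ,method = "lm"
                        )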

Compare the Two Models

summary(new_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37254  -1952  -1143    -38  30576 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4451.53    1172.52  -3.797 0.000156 ***
## age                 239.93      11.17  21.488  < 2e-16 ***
## children            531.72     130.97   4.060  5.3e-05 ***
## bmi                 140.07      42.84   3.269 0.001115 ** 
## bmi30             -1022.20     529.17  -1.932 0.053683 .  
## smokeryes         12727.41     573.17  22.205  < 2e-16 ***
## regionnorthwest    -658.96     445.12  -1.480 0.139082    
## regionsoutheast   -1065.91     435.82  -2.446 0.014628 *  
## regionsouthwest   -1501.03     445.86  -3.367 0.000790 ***
## `bmi30:smokeryes` 19357.92     764.94  25.307  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4837 on 993 degrees of freedom
## Multiple R-squared:  0.8406, Adjusted R-squared:  0.8391 
## F-statistic: 581.8 on 9 and 993 DF,  p-value: < 2.2e-16
summary(ins_model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28647  -3039  -1072   1814  29678 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -10886.35    1183.76  -9.196   <2e-16 ***
## age                227.74      14.42  15.798   <2e-16 ***
## sexmale            274.18     396.51   0.691   0.4894    
## bmi                347.64      34.23  10.155   <2e-16 ***
## children           425.81     169.09   2.518   0.0119 *  
## smokeryes        23553.52     492.65  47.810   <2e-16 ***
## regionnorthwest   -930.97     574.91  -1.619   0.1057    
## regionsoutheast  -1312.03     562.47  -2.333   0.0199 *  
## regionsouthwest  -1258.10     575.88  -2.185   0.0291 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6249 on 994 degrees of freedom
## Multiple R-squared:  0.7337, Adjusted R-squared:  0.7315 
## F-statistic: 342.3 on 8 and 994 DF,  p-value: < 2.2e-16

The multiple linear regression equation of the improved model is:

\[charges = -4451.53 + 239.93\,age + 531.72\,children + 140.07\,bmi - 1022.20\,bmi30 + 12727.41\,smokeryes - 658.96\,regionnorthwest - 1065.91\,regionsoutheast - 1501.03\,regionsouthwest + 19357.92\,bmi30 \times smokeryes\]
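
Rather than copying these numbers by hand, the estimates can be pulled from the fitted object:

# The underlying lm fit is stored in finalModel
coef(new_model$finalModel)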

cor(predict(new_model,test_df), real_charges)
##        charges prediction
## [1,] 0.8729587   0.923047
real_charges$prediction_new <- predict(new_model,test_df)
knitr::kable(head(real_charges,10))
     charges    prediction  prediction_new
2     1725.552   9123.4114        8079.622
4    21984.471   3865.5690        5987.571
6     3756.622   3809.9177        5525.875
8     7281.506   7530.1648        9247.715
15   39611.758  32424.4532       37925.052
26   14001.134  12152.5206       14116.453
28   12268.632  12954.1099       12717.822
36    1625.434   -115.4669        2309.169
41    3046.062   3826.7110        5032.727
54   37742.576  31797.3293       39005.883
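
Beyond eyeballing the predictions, caret's postResample() computes RMSE, R-squared, and MAE on the held-out data, giving a direct numeric comparison of the two models:

# Test-set metrics for the original and the improved model
postResample(pred = predict(ins_model, test_df), obs = real_charges$charges)
postResample(pred = predict(new_model, test_df), obs = real_charges$charges)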

Save Model

saveRDS(new_model,"Final_model.rds")
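
Loading the model back later is symmetric; a quick sketch, assuming the .rds file is in the working directory:

# Restore the trained model and predict on new data
final_model <- readRDS("Final_model.rds")
head(predict(final_model, test_df))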

Model Deployment

We can deploy the model as a web service in R using the Shiny framework from RStudio.
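
A minimal sketch of such a Shiny service is below; the widget names and default values are illustrative assumptions, and the app expects Final_model.rds from the previous step:

library(shiny)

model <- readRDS("Final_model.rds") # trained caret model saved above

ui <- fluidPage(
  titlePanel("Insurance Charges Estimator"),
  numericInput("age", "Age", value = 30, min = 18, max = 100),
  numericInput("bmi", "BMI", value = 25, min = 10, max = 60),
  numericInput("children", "Number of children", value = 0, min = 0, max = 10),
  selectInput("smoker", "Smoker", choices = c("no", "yes")),
  selectInput("region", "Region",
              choices = c("northeast", "northwest", "southeast", "southwest")),
  textOutput("estimate")
)

server <- function(input, output) {
  output$estimate <- renderText({
    # One-row data frame with the same columns the model was trained on
    newdata <- data.frame(age = input$age, bmi = input$bmi,
                          children = input$children, smoker = input$smoker,
                          region = input$region)
    newdata$bmi30 <- ifelse(newdata$bmi >= 30, 1, 0) # derived obesity flag
    paste("Estimated charges:", round(predict(model, newdata), 2))
  })
}

shinyApp(ui, server)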