Setting up system

#list.files() # prints the files present in the working directory

Getting help in R

#?function
#help(function or Robject)

This is a simulated dataset containing information about credit card customers. It poses a regression problem: we need to analyse the data, find some insights, and create a machine learning model.

Loading Library and Data

library(dplyr) #used for data manipulation/wrangling
library(ggplot2) #used for data visualization

# load the data
credit <- read.csv("credit.csv") 

Data Manipulation and Analysis

head(credit) # shows the first 6 rows by default

Description of dataset

*.ID - Identification

*.Income - Income in $10,000’s

*.Limit - Credit limit

*.Rating - Credit rating

*.Cards - Number of credit cards

*.Age - Age in years

*.Education - Number of years of education

*.Gender - A factor with levels Male and Female

*.Student - A factor with levels No & Yes indicating whether an individual was a student

*.Married - A factor with levels No & Yes indicating whether an individual was married

*.Ethnicity - A factor with levels African American, Asian and Caucasian indicating ethnic group

*.Balance - Average credit card balance in $.

We need to analyse the data and create a machine learning model to predict each customer's average credit card balance (the Balance column).

tail(credit) # shows the last 6 rows by default

View() shows the whole dataset in a new tab.

glimpse(credit)
## Observations: 400
## Variables: 12
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 2...
## $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300...
## $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 58...
## $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3...
## $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, ...
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9,...
## $ Gender    <fct>  Male, Female,  Male, Female,  Male,  Male, Female, ...
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No...
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian...
## $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, ...
#View(credit)
dim(credit) #prints dimension of the dataframe (Row, Column)
## [1] 400  12
#rownames(credit) #prints rows names

names(credit) #prints columns header
##  [1] "ID"        "Income"    "Limit"     "Rating"    "Cards"    
##  [6] "Age"       "Education" "Gender"    "Student"   "Married"  
## [11] "Ethnicity" "Balance"
#colnames(credit) #prints columns header
str(credit) #internal structure of the data
## 'data.frame':    400 obs. of  12 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...
library(knitr)
kable(summary(credit)) #basic statistics 
| ID | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 10.35 | Min. : 855 | Min. : 93.0 | Min. :1.000 | Min. :23.00 | Min. : 5.00 | Male :193 | No :360 | No :155 | African American: 99 | Min. : 0.00 |
| 1st Qu.:100.8 | 1st Qu.: 21.01 | 1st Qu.: 3088 | 1st Qu.:247.2 | 1st Qu.:2.000 | 1st Qu.:41.75 | 1st Qu.:11.00 | Female:207 | Yes: 40 | Yes:245 | Asian :102 | 1st Qu.: 68.75 |
| Median :200.5 | Median : 33.12 | Median : 4622 | Median :344.0 | Median :3.000 | Median :56.00 | Median :14.00 | NA | NA | NA | Caucasian :199 | Median : 459.50 |
| Mean :200.5 | Mean : 45.22 | Mean : 4736 | Mean :354.9 | Mean :2.958 | Mean :55.67 | Mean :13.45 | NA | NA | NA | NA | Mean : 520.01 |
| 3rd Qu.:300.2 | 3rd Qu.: 57.47 | 3rd Qu.: 5873 | 3rd Qu.:437.2 | 3rd Qu.:4.000 | 3rd Qu.:70.00 | 3rd Qu.:16.00 | NA | NA | NA | NA | 3rd Qu.: 863.00 |
| Max. :400.0 | Max. :186.63 | Max. :13913 | Max. :982.0 | Max. :9.000 | Max. :98.00 | Max. :20.00 | NA | NA | NA | NA | Max. :1999.00 |

Explore one variable

In this section we will work with a single variable (column) and apply some basic data manipulation functions from dplyr.

income<-select(credit,Income) #selection of the column
length(income) #number of columns in the data frame (1), not the number of rows
## [1] 1
dim(income)
## [1] 400   1
head(income)
names(income) <-"Dollar"
head(income)
income<-rename(income,Income = Dollar)
income
min(income) #minimum of the data
## [1] 10.354
max(income) #maximum of the data
## [1] 186.634

Missing values

head(is.na(income)) #gives logical value 
##      Income
## [1,]  FALSE
## [2,]  FALSE
## [3,]  FALSE
## [4,]  FALSE
## [5,]  FALSE
## [6,]  FALSE
sum(is.na(income))
## [1] 0
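
To check every column at once instead of one column at a time, a per-column count can be sketched as below. The data frame `toy` is hypothetical and made up for illustration; with the real data you would pass `credit` instead.

```r
# Hypothetical toy data frame with a few missing values (not from credit.csv)
toy <- data.frame(
  Income  = c(14.9, NA, 104.6, 148.9),
  Age     = c(34, 82, NA, NA),
  Student = c("No", "Yes", "No", "No")
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# complete.cases() flags rows that have no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```
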
arrange(income,desc(Income)) #sorts rows by Income in decreasing order; arrange() sorts in increasing order by default

Distribution

ggplot(income)+geom_histogram(aes(x=Income))

ggplot(income)+geom_density(aes(x=Income))

ggplot(income)+geom_boxplot(aes(x="",y=Income))

summary(income)
##      Income      
##  Min.   : 10.35  
##  1st Qu.: 21.01  
##  Median : 33.12  
##  Mean   : 45.22  
##  3rd Qu.: 57.47  
##  Max.   :186.63

Explore multiple variables

In this section we will work with multiple variables, analyse the data, and create some visuals using ggplot2 and base R.

plot(credit)

Distribution of data

The density plot shows the distribution of the data. Skewness and kurtosis describe the shape of that distribution: its asymmetry and the weight of its tails.

ggplot(credit,aes(x=Balance))+geom_density()

library(moments)
skewness(credit$Balance)
## [1] 0.5824006
kurtosis(credit$Balance)
## [1] 2.463755

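Skewness measures a dataset's symmetry, or lack of it; a perfectly symmetrical dataset has a skewness of 0. The positive value here (0.58) indicates a longer right tail. Kurtosis is often described as the degree of peakedness of a distribution, but it is really a measure of how heavy the tails are. kurtosis() from moments reports Pearson's kurtosis, for which a normal distribution scores 3, so the value of 2.46 suggests tails slightly lighter than normal.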
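
To make these two measures concrete, here is a minimal sketch that computes them from central moments in base R. The vector `x` and the helper `central_moment()` are hypothetical names introduced for illustration. These are the standard moment-based definitions, and as far as we know they match what skewness() and kurtosis() from moments compute, but treat that as an assumption and compare the numbers yourself.

```r
# Hypothetical right-skewed sample, for illustration only
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# k-th central moment, dividing by n (population form)
central_moment <- function(x, k) mean((x - mean(x))^k)

m2 <- central_moment(x, 2)
m3 <- central_moment(x, 3)
m4 <- central_moment(x, 4)

skew <- m3 / m2^(3 / 2)  # 0 for a perfectly symmetric sample
kurt <- m4 / m2^2        # about 3 for a normal distribution (not excess kurtosis)

print(c(skewness = skew, kurtosis = kurt))
```

A positive `skew` means the long tail is on the right, which is the case for this sample because of the single large value.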

Scatterplot

A scatterplot plots two variables against each other and is used to show the relationship (correlation) between them.

db1 <- select(credit,Income,Limit)
ggplot(db1)+geom_point(aes(x=Income,y=Limit))+ggtitle("Scatterplot Income vs Limit")

We can show more than two variables in one plot by mapping a third variable to color or shape.

ggplot(data = credit)+geom_point(aes(Age,Balance,color= Gender))+ labs(title="Age vs Bank Balance")+theme_minimal()

ggplot(data = credit)+geom_point(aes(Income,Balance,color= Married))+ labs(title="Income vs Balance")+theme_minimal()

cat("Stats of Age:",summary(credit$Age))
## Stats of Age: 23 41.75 56 55.6675 70 98
cat("\nStats of Balance:",summary(credit$Balance))
## 
## Stats of Balance: 0 68.75 459.5 520.015 863 1999

Correlation

Correlation is the degree and type of linear relationship between two quantities. Its value lies between -1 and 1. A value of 0 indicates no linear relationship between the two variables, -1 indicates a perfect negative correlation, and +1 a perfect positive correlation.

cls <- sapply(credit,class)
cls
##        ID    Income     Limit    Rating     Cards       Age Education 
## "integer" "numeric" "integer" "integer" "integer" "integer" "integer" 
##    Gender   Student   Married Ethnicity   Balance 
##  "factor"  "factor"  "factor"  "factor" "integer"
credit%>% select(which(cls!="factor"))
num_credit<-credit%>% select(which(cls!="factor"))
num_credit<- as.data.frame(num_credit)
cor(num_credit)
##                     ID      Income       Limit      Rating       Cards
## ID         1.000000000  0.03720258  0.02417249  0.02198547 -0.03630425
## Income     0.037202580  1.00000000  0.79208834  0.79137763 -0.01827261
## Limit      0.024172487  0.79208834  1.00000000  0.99687974  0.01023133
## Rating     0.021985470  0.79137763  0.99687974  1.00000000  0.05323903
## Cards     -0.036304251 -0.01827261  0.01023133  0.05323903  1.00000000
## Age        0.058603022  0.17533840  0.10088792  0.10316500  0.04294829
## Education -0.001415034 -0.02769198 -0.02354853 -0.03013563 -0.05108422
## Balance    0.006064108  0.46365646  0.86169727  0.86362516  0.08645635
##                   Age    Education      Balance
## ID        0.058603022 -0.001415034  0.006064108
## Income    0.175338403 -0.027691982  0.463656457
## Limit     0.100887922 -0.023548534  0.861697267
## Rating    0.103164996 -0.030135627  0.863625161
## Cards     0.042948288 -0.051084217  0.086456347
## Age       1.000000000  0.003619285  0.001835119
## Education 0.003619285  1.000000000 -0.008061576
## Balance   0.001835119 -0.008061576  1.000000000
plot(num_credit)
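
To see what each entry of the correlation matrix means, the sketch below computes the Pearson correlation by hand and compares it with cor(). The vectors `x` and `y` are hypothetical values made up for illustration, not taken from the credit data.

```r
# Hypothetical paired measurements, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation: co-variation of x and y, scaled by their individual variation
dx <- x - mean(x)
dy <- y - mean(y)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_builtin <- cor(x, y)  # the default method is "pearson"

print(c(manual = r_manual, builtin = r_builtin))
```
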

In the plots below we use scatterplots to show the relationship between two variables. Limit and Rating show a very strong positive relationship (a correlation of about 0.997 in the matrix above).

g<- ggplot(num_credit)
g+ geom_point(aes(x=Limit, y= Rating))+labs(title="Limit vs Rating Scatterplot",x="Limit",y="Rating")+theme_minimal()

g+ geom_point(aes(x=Income,y=Balance))+labs(title="Income vs Balance Scatterplot")+theme_classic()

g+ geom_point(aes(x=Rating,y=Balance))+labs(title="Rating vs Balance Scatterplot")+ theme_light()

Boxplot

A boxplot shows the quartiles of the data and highlights outliers.

ggplot(credit)+geom_boxplot(aes(x= Gender,y=Balance))
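
The numbers behind a boxplot can be sketched in base R as follows. The vector `balance` is hypothetical and made up for illustration. The 1.5 * IQR rule used here is the usual convention for boxplot whiskers and is what the geom_boxplot() documentation describes.

```r
# Hypothetical sample with one extreme value, for illustration only
balance <- c(0, 50, 120, 300, 450, 500, 620, 800, 950, 3000)

# The box spans the first to third quartile, with a line at the median
q <- quantile(balance, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: points more than 1.5 * IQR beyond the box are drawn as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- balance[balance < lower_fence | balance > upper_fence]
print(q)
print(outliers)
```
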

Lineplot

A line plot is mostly used for time series data, but here we will use it to show how many people of each age are in the dataset.

age_df<-as.data.frame(table(credit$Age))
names(age_df)<-c("Age","Count")
ggplot(age_df,aes(Age,Count))+geom_line(aes(group=1))+geom_point()+scale_x_discrete(breaks=c(23,33,43,53,63,73,83,98))+theme_minimal()

Countplot

This plot shows the count of the data.

ggplot(credit)+geom_bar(aes(x=Gender,fill=Gender))+labs(title="Countplot of Gender")+
  theme_minimal()+theme(legend.position = "none")

credit$Cards<- as.factor(credit$Cards) #treat the number of cards as a category
#after_stat() replaces the deprecated ..count.. notation (ggplot2 3.3.0 or later)
ggplot(credit,aes(x=Cards))+geom_bar(aes(y = after_stat(count/sum(count)),fill=Cards))+
  geom_text(aes(y = after_stat(count/sum(count)), label = after_stat(scales::percent(count/sum(count)))), stat = "count", vjust = -0.25)+
  labs(title= "Cards count",y="Percentage")+
  theme_minimal()+theme(legend.position = "none")
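
The percentages on the bar labels can also be checked as plain numbers. This is a minimal base R sketch; the vector `cards` is hypothetical and stands in for `credit$Cards`.

```r
# Hypothetical card counts for ten customers, for illustration only
cards <- c(2, 3, 4, 3, 2, 4, 2, 2, 5, 3)

counts <- table(cards)        # number of customers per card count
shares <- prop.table(counts)  # counts divided by their total

print(round(100 * shares, 1)) # percentages, comparable to the bar labels
```
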

aggregate(Income~Ethnicity,credit,mean)

Multiple plots

In this section we will analyse the data and combine several panels in one plot to show our insights.

aggregate(Balance~Student+Gender,credit,mean)

The aggregate() function splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
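
As a small illustration of that split-and-summarise behaviour, the sketch below runs aggregate() on a hypothetical data frame `toy` (made up for illustration, not from credit.csv) and checks one group mean by hand.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  Student = c("No", "No", "Yes", "Yes", "No", "Yes"),
  Gender  = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Balance = c(100, 200, 300, 400, 500, 600)
)

# One row per Student x Gender combination, holding the mean Balance of that group
group_means <- aggregate(Balance ~ Student + Gender, data = toy, FUN = mean)
print(group_means)

# The same number for a single group, computed by subsetting manually
manual <- mean(toy$Balance[toy$Student == "No" & toy$Gender == "Male"])
print(manual)
```

The formula reads as "Balance grouped by Student and Gender"; replacing `mean` with another function such as `median` changes the summary that is computed.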

ggplot(data= aggregate(Balance~Student+Gender,credit,mean),aes(x=Gender,y=Balance))+geom_col(aes(fill=Gender))+facet_grid(~Student)

In this plot, male customers show a higher average balance than female customers.

ggplot(data=aggregate(Balance~Gender+Married,credit,mean),aes(Gender,Balance))+
  geom_col(aes(fill=Gender))+ggtitle("Barplot Gender vs Balance on basis of Marital Status") +facet_grid(~Married)+theme_minimal()+theme(legend.position = "none")

This plot shows the relationship between balance, gender, and marital status. In this dataset, married women appear to have a higher average balance than married men.

ggplot(data=aggregate(Balance~Ethnicity+Gender,credit,mean),aes(Ethnicity,Balance))+
  geom_col(aes(fill=Ethnicity))+ggtitle("Barplot Ethnicity vs Balance on basis of Gender") +facet_grid(~Gender)+theme_bw()+theme(legend.position = "none")

That covers the data analysis. I showed a basic data analysis process, and you can find many more insights in this dataset. Now we will move on to reporting and look at how to make a report in R and RStudio.