Credit_analysis

Setting up system

#list.files() # prints the files present in the working directory

Getting help in R

#?function
#help(function or Robject)

This is a simulated dataset containing information about customers. It poses a regression problem: we need to analyse the data, find some insights, and create a machine learning model.

Loading Library and Data

library(dplyr) #used for data manipulation/wrangling
library(ggplot2) #used for data visualization

# load the data
credit <- read.csv("credit.csv")
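The factor columns shown later by glimpse() and str() depend on how the file is read: since R 4.0.0, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is set. A minimal sketch, assuming a small inline CSV (the `demo_csv` text is a hypothetical stand-in for credit.csv, and `demo` is an illustrative name):

```r
# Hypothetical three-row excerpt standing in for credit.csv (illustration only).
demo_csv <- "ID,Income,Gender,Student
1,14.891, Male,No
2,106.025,Female,Yes
3,104.593, Male,No"

# stringsAsFactors = TRUE converts text columns to factors,
# which matches the <fct> columns shown in this document's output.
demo <- read.csv(text = demo_csv, stringsAsFactors = TRUE)
str(demo)

# The Gender level " Male" carries a leading space (see the str() output);
# trimming the level labels is optional but avoids surprises when filtering.
levels(demo$Gender) <- trimws(levels(demo$Gender))
levels(demo$Gender)
```

The same arguments should work with the real file, for example read.csv("credit.csv", stringsAsFactors = TRUE).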

Data Manipulation and Analysis

head(credit) # shows the first 6 rows

Description of dataset

* ID - Identification
* Income - Income in $10,000’s
* Limit - Credit limit
* Rating - Credit rating
* Cards - Number of credit cards
* Age - Age in years
* Education - Number of years of education
* Gender - A factor with levels Male and Female
* Student - A factor with levels No and Yes indicating whether the individual was a student
* Married - A factor with levels No and Yes indicating whether the individual was married
* Ethnicity - A factor with levels African American, Asian, and Caucasian indicating the individual's ethnicity
* Balance - Average credit card balance in $

We need to analyse the data and create a machine learning model to predict a customer's average credit card balance (Balance).

tail(credit) # shows the last 6 rows by default

View() shows the whole dataset in a new tab.

glimpse(credit)

## Observations: 400
## Variables: 12
## $ ID        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Income    <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 2...
## $ Limit     <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300...
## $ Rating    <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 58...
## $ Cards     <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3...
## $ Age       <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, ...
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9,...
## $ Gender    <fct>  Male, Female,  Male, Female,  Male,  Male, Female, ...
## $ Student   <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No...
## $ Married   <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian...
## $ Balance   <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, ...

#View(credit)

dim(credit) #prints dimension of the dataframe (Row, Column)

## [1] 400  12

#rownames(credit) #prints the row names

names(credit) #prints the column names

##  [1] "ID"        "Income"    "Limit"     "Rating"    "Cards"    
##  [6] "Age"       "Education" "Gender"    "Student"   "Married"  
## [11] "Ethnicity" "Balance"

#colnames(credit) #prints the column names

str(credit) #internal structure of the data

## 'data.frame':    400 obs. of  12 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...

library(knitr)
kable(summary(credit)) #basic statistics

ID	Income	Limit	Rating	Cards	Age	Education	Gender	Student	Married	Ethnicity	Balance
Min. : 1.0	Min. : 10.35	Min. : 855	Min. : 93.0	Min. :1.000	Min. :23.00	Min. : 5.00	Male :193	No :360	No :155	African American: 99	Min. : 0.00
1st Qu.:100.8	1st Qu.: 21.01	1st Qu.: 3088	1st Qu.:247.2	1st Qu.:2.000	1st Qu.:41.75	1st Qu.:11.00	Female:207	Yes: 40	Yes:245	Asian :102	1st Qu.: 68.75
Median :200.5	Median : 33.12	Median : 4622	Median :344.0	Median :3.000	Median :56.00	Median :14.00	NA	NA	NA	Caucasian :199	Median : 459.50
Mean :200.5	Mean : 45.22	Mean : 4736	Mean :354.9	Mean :2.958	Mean :55.67	Mean :13.45	NA	NA	NA	NA	Mean : 520.01
3rd Qu.:300.2	3rd Qu.: 57.47	3rd Qu.: 5873	3rd Qu.:437.2	3rd Qu.:4.000	3rd Qu.:70.00	3rd Qu.:16.00	NA	NA	NA	NA	3rd Qu.: 863.00
Max. :400.0	Max. :186.63	Max. :13913	Max. :982.0	Max. :9.000	Max. :98.00	Max. :20.00	NA	NA	NA	NA	Max. :1999.00

Explore one variable

In this section we will work with a single variable (one column of the data) and apply some basic data manipulation functions from dplyr.

income<-select(credit,Income) #selection of the column

length(income) # length() of a data frame counts its columns, so this returns 1

## [1] 1

dim(income)

## [1] 400   1
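The two results above differ because `income` is a one-column data frame, not a vector. A small sketch of the distinction, assuming a toy data frame (`income_demo` is an illustrative name holding the first three Income values from the glimpse() output):

```r
# Toy one-column data frame standing in for `income`.
income_demo <- data.frame(Income = c(14.891, 106.025, 104.593))

length(income_demo)         # number of columns in the data frame
nrow(income_demo)           # number of rows
length(income_demo$Income)  # length of the underlying numeric vector
```

Extracting the column with `$` (or dplyr's pull()) gives a plain vector, for which length() reports the number of observations.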

head(income)

names(income) <-"Dollar"
head(income)

income<-rename(income,Income = Dollar)
income

min(income) #minimum of the data

## [1] 10.354

max(income) #maximum of the data

## [1] 186.634

Missing values

head(is.na(income)) #gives logical value

##      Income
## [1,]  FALSE
## [2,]  FALSE
## [3,]  FALSE
## [4,]  FALSE
## [5,]  FALSE
## [6,]  FALSE

sum(is.na(income))

## [1] 0
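The same check can be run on every column at once. A minimal sketch, assuming a toy data frame with deliberately missing values (the `toy` data are made up; the credit data itself shows no missing Income values):

```r
# Toy data frame with deliberately missing values (illustration only).
toy <- data.frame(
  Income  = c(14.9, NA, 104.6, 148.9),
  Balance = c(333, 903, NA, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_per_column <- colSums(is.na(toy))
na_per_column

# complete.cases() flags rows that contain no NA at all.
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

On the real data, colSums(is.na(credit)) would give one count per column.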

arrange(income,desc(Income)) #sorts rows by Income in descending order; drop desc() for ascending order

Distribution

ggplot(income)+geom_histogram(aes(x=Income))

ggplot(income)+geom_density(aes(x=Income))

ggplot(income)+geom_boxplot(aes(x="",y=Income))

summary(income)

##      Income      
##  Min.   : 10.35  
##  1st Qu.: 21.01  
##  Median : 33.12  
##  Mean   : 45.22  
##  3rd Qu.: 57.47  
##  Max.   :186.63

Explore multiple variables

In this section we will work with multiple variables, analyse the data, and create some visuals using ggplot2 and base R.

plot(credit)

Distribution of data

The density plot shows how the values of a variable are distributed. Skewness and kurtosis describe the shape of that distribution rather than its spread.

ggplot(credit,aes(x=Balance))+geom_density()

library(moments)
skewness(credit$Balance)

## [1] 0.5824006

kurtosis(credit$Balance)

## [1] 2.463755

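* Skewness measures a dataset’s symmetry, or lack of it; a perfectly symmetrical dataset has a skewness of 0. The value of 0.58 here indicates a longer right tail.
* Kurtosis is often called the degree of peakedness of a distribution, but it is actually a measure of the tails. The kurtosis() function from moments returns about 3 for normally distributed data, so 2.46 suggests somewhat lighter tails than a normal distribution.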
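Both statistics can be written in terms of central moments. The sketch below uses base R only and, as far as I can tell, follows the same moment-based definitions as the moments package; the sample `x` and the helper `central_moment()` are illustrative assumptions, not part of the original analysis:

```r
# Illustrative right-skewed sample (made up, not taken from credit.csv).
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# k-th central moment, dividing by n (hypothetical helper for this sketch).
central_moment <- function(v, k) mean((v - mean(v))^k)

# Skewness: third central moment scaled by the variance to the power 3/2.
skew <- central_moment(x, 3) / central_moment(x, 2)^1.5

# Kurtosis (Pearson's form, not excess): fourth central moment over squared variance.
kurt <- central_moment(x, 4) / central_moment(x, 2)^2

skew
kurt
```

A positive `skew` corresponds to a longer right tail, as in the Balance density plot.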

Scatterplot

A scatterplot plots two variables against each other and is also used to show the relation or correlation between them.

db1 <- select(credit,Income,Limit)

ggplot(db1)+geom_point(aes(x=Income,y=Limit))+ggtitle("Scatterplot Income vs Limit")

We can plot more than two variables by mapping a third variable to color or shape.

ggplot(data = credit)+geom_point(aes(Age,Balance,color= Gender))+ labs(title="Age vs Bank Balance")+theme_minimal()

ggplot(data = credit)+geom_point(aes(Income,Balance,color= Married))+ labs(title="Income vs Bank Balance")+theme_minimal()

cat("Stats of Age:",summary(credit$Age))

## Stats of Age: 23 41.75 56 55.6675 70 98

cat("\nStats of Balance:",summary(credit$Balance))

## 
## Stats of Balance: 0 68.75 459.5 520.015 863 1999

Correlation

Correlation is the degree and type of relationship between two or more quantities. Its value lies between -1 and 1: 0 indicates no linear relationship between two variables, -1 indicates a perfect negative correlation, and +1 indicates a perfect positive correlation.
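To make the definition concrete, the Pearson coefficient can be computed by hand and compared with cor(). This is a sketch on made-up values (`x`, `y`, and `r_manual` are illustrative names):

```r
# Illustrative paired values (made up, not from credit.csv).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Pearson correlation: co-variation of x and y scaled by the variation of each.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_manual
cor(x, y)   # built-in equivalent; the default method is "pearson"
cor(x, -y)  # reversing the direction of y flips the sign
```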

cls <- sapply(credit,class)
cls

##        ID    Income     Limit    Rating     Cards       Age Education 
## "integer" "numeric" "integer" "integer" "integer" "integer" "integer" 
##    Gender   Student   Married Ethnicity   Balance 
##  "factor"  "factor"  "factor"  "factor" "integer"

credit%>% select(which(cls!="factor"))

num_credit<-credit%>% select(which(cls!="factor"))

num_credit<- as.data.frame(num_credit)

cor(num_credit)

##                     ID      Income       Limit      Rating       Cards
## ID         1.000000000  0.03720258  0.02417249  0.02198547 -0.03630425
## Income     0.037202580  1.00000000  0.79208834  0.79137763 -0.01827261
## Limit      0.024172487  0.79208834  1.00000000  0.99687974  0.01023133
## Rating     0.021985470  0.79137763  0.99687974  1.00000000  0.05323903
## Cards     -0.036304251 -0.01827261  0.01023133  0.05323903  1.00000000
## Age        0.058603022  0.17533840  0.10088792  0.10316500  0.04294829
## Education -0.001415034 -0.02769198 -0.02354853 -0.03013563 -0.05108422
## Balance    0.006064108  0.46365646  0.86169727  0.86362516  0.08645635
##                   Age    Education      Balance
## ID        0.058603022 -0.001415034  0.006064108
## Income    0.175338403 -0.027691982  0.463656457
## Limit     0.100887922 -0.023548534  0.861697267
## Rating    0.103164996 -0.030135627  0.863625161
## Cards     0.042948288 -0.051084217  0.086456347
## Age       1.000000000  0.003619285  0.001835119
## Education 0.003619285  1.000000000 -0.008061576
## Balance   0.001835119 -0.008061576  1.000000000
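Since Balance is the variable we want to predict, it can help to pull out just its column of the matrix and rank the other variables by strength of correlation. A minimal sketch, assuming a toy data frame built from the first five rows shown in the glimpse() output (`toy`, `with_balance`, and `ranked` are illustrative names):

```r
# First five rows of Limit, Age, and Balance as shown by glimpse().
toy <- data.frame(
  Limit   = c(3606, 6645, 7075, 9504, 4897),
  Age     = c(34, 82, 71, 36, 68),
  Balance = c(333, 903, 580, 964, 331)
)

cor_matrix <- cor(toy)

# Take the Balance column, drop Balance itself, and sort by absolute strength.
with_balance <- cor_matrix[, "Balance"]
with_balance <- with_balance[names(with_balance) != "Balance"]
ranked <- with_balance[order(abs(with_balance), decreasing = TRUE)]
ranked
```

Applied to cor(num_credit), the same steps would rank all numeric columns against Balance.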

plot(num_credit)

In the plots below we use scatterplots to show the relation between pairs of variables. Limit and Rating show a very strong positive relation (a correlation of about 0.997 in the matrix above).

g<- ggplot(num_credit)
g+ geom_point(aes(x=Limit, y= Rating))+labs(title="Limit vs Rating Scatterplot",x="Limit",y="Rating")+theme_minimal()

g+ geom_point(aes(x=Income,y=Balance))+labs(title="Income vs Balance Scatterplot")+theme_classic()

g+ geom_point(aes(x=Rating,y=Balance))+labs(title="Rating vs Balance Scatterplot")+ theme_light()

Boxplot

A boxplot shows the quartiles of the data and marks outliers.

ggplot(credit)+geom_boxplot(aes(x= Gender,y=Balance))
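The numbers behind a boxplot can be reproduced in base R. The sketch below assumes the common 1.5 * IQR rule for flagging outliers (which, to my knowledge, is the default in geom_boxplot()); the `balance` vector is made up for illustration:

```r
# Illustrative balances with one unusually large value (not from credit.csv).
balance <- c(0, 50, 120, 300, 460, 520, 860, 900, 4000)

# Quartiles: the box edges (25%, 75%) and the median line (50%).
q <- quantile(balance, c(0.25, 0.5, 0.75))
iqr <- IQR(balance)

# Points more than 1.5 * IQR beyond the box are drawn individually as outliers.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- balance[balance < lower_fence | balance > upper_fence]
outliers
```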

Lineplot

A line plot is mostly used for time series data, but here we use it to show how many people of each age are in the dataset.

age_df<-as.data.frame(table(credit$Age))
names(age_df)<-c("Age","Count")
ggplot(age_df,aes(Age,Count))+geom_line(aes(group=1))+geom_point()+scale_x_discrete(breaks=c(23,33,43,53,63,73,83,98))+theme_minimal()

Countplot

This plot shows the number of observations in each category.

ggplot(credit)+geom_bar(aes(x=Gender,fill=Gender))+labs(title="Countplot of Gender")+
  theme_minimal()+theme(legend.position = "none")

credit$Cards<- as.factor(credit$Cards)
ggplot(credit,aes(x=Cards))+geom_bar(aes(y = after_stat(count/sum(count)),fill=Cards))+
  geom_text(aes(y = after_stat(count/sum(count)), label = after_stat(scales::percent(count/sum(count)))), stat = "count", vjust = -0.25)+
  labs(title= "Cards count",y="Percentage")+
  theme_minimal()+theme(legend.position = "none")
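The percentages drawn on the bars can also be computed directly, which is a handy cross-check on the plot. A minimal sketch using the first ten Cards values shown in the glimpse() output (`cards`, `counts`, and `shares` are illustrative names):

```r
# First ten Cards values as shown by glimpse().
cards <- c(2, 3, 4, 3, 2, 4, 2, 2, 5, 3)

counts <- table(cards)        # number of customers for each card count
shares <- prop.table(counts)  # the same counts as proportions summing to 1
round(100 * shares, 1)        # percentages, comparable to the bar labels
```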

aggregate(Income~Ethnicity,credit,mean)

Multiple plots

In this section we will analyse the data and combine several panels in one figure to show our insights.

aggregate(Balance~Student+Gender,credit,mean)

* The aggregate() function splits the data into subsets, computes summary statistics for each, and returns the result in a convenient form.
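A small worked example makes the split-and-summarise behaviour easier to see. The `toy` data below are made up and only reuse the column names of the credit data:

```r
# Toy data (made-up values) with the same column names as the credit data.
toy <- data.frame(
  Student = c("No", "No", "Yes", "Yes", "No", "Yes"),
  Gender  = c("Male", "Female", "Male", "Female", "Male", "Female"),
  Balance = c(300, 500, 900, 700, 500, 1100)
)

# One row per Student x Gender combination, holding the mean Balance of that group.
group_means <- aggregate(Balance ~ Student + Gender, data = toy, FUN = mean)
group_means
```

The formula reads as "Balance grouped by Student and Gender"; with dplyr, group_by() followed by summarise() would give an equivalent table.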

ggplot(data= aggregate(Balance~Student+Gender,credit,mean),aes(x=Gender,y=Balance))+geom_col(aes(fill=Gender))+facet_grid(~Student)

In this plot, males have a higher average balance than females.

ggplot(data=aggregate(Balance~Gender+Married,credit,mean),aes(Gender,Balance))+
  geom_col(aes(fill=Gender))+ggtitle("Barplot Gender vs Balance on basis of Marital Status") +facet_grid(~Married)+theme_minimal()+theme(legend.position = "none")

This plot shows the relation between balance, gender, and marital status. It seems that married women have a higher average balance than married men in this dataset.

ggplot(data=aggregate(Balance~Ethnicity+Gender,credit,mean),aes(Ethnicity,Balance))+
  geom_col(aes(fill=Ethnicity))+ggtitle("Barplot Ethnicity vs Balance on basis of Gender") +facet_grid(~Gender)+theme_bw()+theme(legend.position = "none")

This was all about the data analysis. I showed a basic data analysis process, and you can find many more insights in this dataset. Now we will move on to reporting and see how to build a report with R and RStudio.