Data Sources

Sensor
Text
Image
Videos
Survey

Data Types

Basic structured data types: 1. Numeric 2. Categorical

Type	Example
Continuous	Wind speed
Discrete	Count of the occurrences of an event
Categorical	Type of TV screen
Binary	Yes/No, 0/1
Ordinal	1, 2, 3, 4, 5
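As a rough sketch of how these types map onto R's own classes, the example below builds one vector per type. The variable names and values are made up for illustration and are not taken from the mpg data.

```r
# Hypothetical example values, one vector per data type.
wind_speed   <- c(3.2, 7.5, 12.1)                            # continuous  -> numeric (double)
event_count  <- c(0L, 2L, 5L)                                # discrete    -> integer
screen_type  <- factor(c("LCD", "LED", "Plasma"))            # categorical -> factor
is_defective <- c(TRUE, FALSE, TRUE)                         # binary      -> logical
rating       <- factor(c(1, 5, 3), levels = 1:5, ordered = TRUE)  # ordinal -> ordered factor

class(wind_speed)    # "numeric"
class(event_count)   # "integer"
class(screen_type)   # "factor"
class(is_defective)  # "logical"
class(rating)        # "ordered" "factor"

# Ordered factors support comparisons that plain factors do not.
rating[1] < rating[2]  # TRUE, because level 1 is below level 5
```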

Loading Data

mpg <- read.csv("mpg.csv", stringsAsFactors = TRUE)  # since R 4.0.0 text columns are not converted to factors by default

Fuel economy data from 1999 and 2008 for 38 popular models of car

Description

This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008 - this was used as a proxy for the popularity of the car.

Data Exploration

str(mpg)

## 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: Factor w/ 15 levels "audi","chevrolet",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ model       : Factor w/ 38 levels "4runner 4wd",..: 2 2 2 2 2 2 2 3 3 3 ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : Factor w/ 10 levels "auto(av)","auto(l3)",..: 4 9 10 1 4 9 1 9 4 10 ...
##  $ drv         : Factor w/ 3 levels "4","f","r": 2 2 2 2 2 2 2 1 1 1 ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : Factor w/ 5 levels "c","d","e","p",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ class       : Factor w/ 7 levels "2seater","compact",..: 2 2 2 2 2 2 2 2 2 2 ...

The str() function shows the structure of the data frame: the number of observations and variables, and the type and first few values of each column.

Chop the Head and Tail

head(mpg)

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact

head() gives the first 6 rows of the dataframe by default.

tail(mpg)

tail() gives the last 6 rows of the dataframe by default.
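Both functions take an n argument to change the number of rows returned. A small sketch on a made-up data frame (df is hypothetical):

```r
# Hypothetical 10-row data frame used only to illustrate the n argument.
df <- data.frame(id = 1:10, value = (1:10)^2)

head(df, n = 3)   # first 3 rows (id 1 to 3)
tail(df, n = 2)   # last 2 rows (id 9 and 10)
head(df, n = -8)  # a negative n drops the last 8 rows, leaving the first 2
```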

Central Tendency

Estimating where most of the values lie is a basic first step in exploring data.

Mean

mean(mpg$cyl)  # $ selects a column of the data frame

## [1] 5.888889

Median

median(mpg$cyl)

## [1] 6

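Mode

Base R has no built-in function for the statistical mode (mode() returns the storage mode of an object). A helper function is defined under Exploring Binary and Categorical Data.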

Estimates of Variability

Variance

var(mpg$displ)

## [1] 1.669158

Standard Deviation (SD)

sd(mpg$cyl)

## [1] 1.611534
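To make the relationship between the two measures concrete, here is a minimal sketch that computes the sample variance by hand on a made-up vector (x is hypothetical) and checks it against var() and sd(). Note that var() divides by n - 1, not n.

```r
# Hypothetical sample used only for illustration.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)  # sample variance, n - 1 denominator

all.equal(manual_var, var(x))       # TRUE
all.equal(sqrt(manual_var), sd(x))  # TRUE: the SD is the square root of the variance
```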

Range

range(mpg$displ)

## [1] 1.6 7.0

Percentile

quantile(mpg$displ)

##   0%  25%  50%  75% 100% 
##  1.6  2.4  3.3  4.6  7.0
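quantile() also accepts a probs argument for arbitrary percentiles, and IQR() gives the spread of the middle half of the data. A short sketch on a made-up vector (x is hypothetical):

```r
# Hypothetical values used only for illustration.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

quantile(x, probs = c(0.1, 0.9))  # 10th and 90th percentiles
quantile(x, probs = 0.5)          # the 50th percentile equals median(x)
IQR(x)                            # 75th percentile minus 25th percentile
```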

Summary

summary(mpg)

##      manufacturer                 model         displ            year     
##  dodge     :37    caravan 2wd        : 11   Min.   :1.600   Min.   :1999  
##  toyota    :34    ram 1500 pickup 4wd: 10   1st Qu.:2.400   1st Qu.:1999  
##  volkswagen:27    civic              :  9   Median :3.300   Median :2004  
##  ford      :25    dakota pickup 4wd  :  9   Mean   :3.472   Mean   :2004  
##  chevrolet :19    jetta              :  9   3rd Qu.:4.600   3rd Qu.:2008  
##  audi      :18    mustang            :  9   Max.   :7.000   Max.   :2008  
##  (Other)   :74    (Other)            :177                                 
##       cyl               trans    drv          cty             hwy       
##  Min.   :4.000   auto(l4)  :83   4:103   Min.   : 9.00   Min.   :12.00  
##  1st Qu.:4.000   manual(m5):58   f:106   1st Qu.:14.00   1st Qu.:18.00  
##  Median :6.000   auto(l5)  :39   r: 25   Median :17.00   Median :24.00  
##  Mean   :5.889   manual(m6):19           Mean   :16.86   Mean   :23.44  
##  3rd Qu.:8.000   auto(s6)  :16           3rd Qu.:19.00   3rd Qu.:27.00  
##  Max.   :8.000   auto(l6)  : 6           Max.   :35.00   Max.   :44.00  
##                  (Other)   :13                                          
##  fl             class   
##  c:  1   2seater   : 5  
##  d:  5   compact   :47  
##  e:  8   midsize   :41  
##  p: 52   minivan   :11  
##  r:168   pickup    :33  
##          subcompact:35  
##          suv       :62

The summary() function gives basic statistics for the numeric columns and counts for the categorical columns.

Explore Data Distribution

Loading Packages

library(ggplot2)

ggplot2 is a visualization package from the tidyverse. The plots in this section use base R graphics; ggplot2 itself is used in the Data visualisation section.

Boxplot

boxplot(mpg$displ)

A boxplot summarizes a distribution by its percentiles. The top and bottom of the box are the 75th and 25th percentiles, respectively. The median is shown by the horizontal line in the box. The dashed lines, referred to as whiskers, extend from the top and bottom of the box to indicate the range for the bulk of the data.
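The numbers behind a boxplot can be inspected with boxplot.stats(). The sketch below uses a made-up vector (x is hypothetical). With R's default settings, the whiskers reach the most extreme points lying within 1.5 times the box height of the box, and anything further out is drawn as a separate point.

```r
# Hypothetical values with one obvious outlier (40).
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)

stats <- boxplot.stats(x)
stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points beyond the whiskers, plotted individually
```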

Histogram

hist(mpg$displ)

A histogram divides the range of values into bins along the x-axis and shows the count of observations in each bin on the y-axis.
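To see the bins and counts that a histogram draws, hist() can be called with plot = FALSE. The sketch below uses made-up data, and the breaks are chosen by hand for illustration:

```r
# Hypothetical values used only for illustration.
x <- c(1.2, 1.8, 2.1, 2.4, 2.5, 3.3, 3.9, 4.6, 5.7)

h <- hist(x, breaks = c(1, 2, 3, 4, 5, 6), plot = FALSE)
h$breaks  # bin edges on the x-axis
h$counts  # number of observations in each bin: 2 3 2 1 1
```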

DensityPlot

hist(mpg$displ, freq = FALSE)                    # y-axis shows density instead of counts
lines(density(mpg$displ), col = "red", lwd = 2)  # overlay a kernel density estimate

Exploring Binary and Categorical Data

Mode

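Mode <- function(x) {
  # which.max() returns the position of the largest count in the frequency table
  which.max(table(x))
}

Mode(mpg$manufacturer)

## dodge 
##     3

The result is labelled with the most frequent value (dodge). The number 3 is the position of that value in the frequency table, not its count.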
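A variant that returns the most frequent value itself, and keeps ties, could look like the sketch below. The helper name mode_value and the sample vectors are hypothetical.

```r
# Hypothetical helper: return the most frequent value(s) as character strings.
mode_value <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]  # keeps ties instead of picking the first
}

mode_value(c("f", "4", "f", "r", "4", "f"))  # "f"
mode_value(c("a", "b", "a", "b"))            # "a" "b" (a tie)
```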

Bar Chart

barplot(table(mpg[["drv"]]))

Correlation

cor(mpg[c("displ", "cyl", "cty", "hwy")])  # select the numeric columns by name

##            displ        cyl        cty        hwy
## displ  1.0000000  0.9302271 -0.7985240 -0.7660200
## cyl    0.9302271  1.0000000 -0.8057714 -0.7619124
## cty   -0.7985240 -0.8057714  1.0000000  0.9559159
## hwy   -0.7660200 -0.7619124  0.9559159  1.0000000
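Each entry of the matrix is a Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. A minimal sketch on made-up vectors (x and y are hypothetical):

```r
# Hypothetical paired measurements used only for illustration.
x <- c(1.8, 2.0, 2.8, 3.1, 4.2, 5.3)
y <- c(29, 31, 26, 27, 23, 20)

manual_r <- cov(x, y) / (sd(x) * sd(y))  # Pearson's r from its definition

all.equal(manual_r, cor(x, y))  # TRUE
manual_r < 0                    # TRUE: y tends to fall as x rises
```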

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor(mpg[c("displ", "cyl", "cty", "hwy")]))  # visualize the correlation matrix

Scatterplot

plot(mpg$displ,mpg$cty,xlab= "displ", ylab="cty")

The standard way to visualize the relationship between two measured data variables is with a scatterplot.

Exploring Two or More Variables (Multivariate)

Categorical and Numeric Data

boxplot(displ ~ drv, data = mpg)
title("Boxplot of Drive Type and Displacement")

Visualizing Multiple Variables

plot(displ ~ cty, data = mpg)  # cty on the x-axis, displ on the y-axis
title("Plot of Displacement and City Miles per Gallon")

Basics of Data Transformation and Visualization

Prerequisites

In this section we will learn how to transform data using the dplyr package from the tidyverse.

library(dplyr)

Filter rows with filter()

filter() allows you to subset observations based on their values.

tail(filter(mpg,drv=="f", year == 2008))

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
44	volkswagen	jetta	2.5	2008	5	manual(m5)	f	21	29	r	compact
45	volkswagen	new beetle	2.5	2008	5	manual(m5)	f	20	28	r	subcompact
46	volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
47	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
48	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
49	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize
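Conditions separated by commas are combined with AND. The sketch below shows this alongside OR (|) and %in% on a made-up miniature of the data (toy_mpg is hypothetical):

```r
library(dplyr)

# Hypothetical miniature of the mpg data, used only for illustration.
toy_mpg <- data.frame(
  model = c("a4", "jetta", "mustang", "civic"),
  drv   = c("f", "f", "r", "f"),
  year  = c(1999, 2008, 2008, 2008),
  hwy   = c(29, 29, 26, 36)
)

# Comma-separated conditions must all hold (AND).
filter(toy_mpg, drv == "f", year == 2008)     # jetta, civic

# Use | for OR and %in% to match any of several values.
filter(toy_mpg, drv == "r" | hwy > 30)        # mustang, civic
filter(toy_mpg, model %in% c("a4", "civic"))  # a4, civic
```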

Arrange rows with arrange()

arrange() changes the order of the rows, sorting by the given columns in turn.

head(arrange(mpg,drv,displ))

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p	compact
audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact
audi	a4 quattro	2.0	2008	4	auto(s6)	4	19	27	p	compact
subaru	impreza awd	2.2	1999	4	auto(l4)	4	21	26	r	subcompact
subaru	impreza awd	2.2	1999	4	manual(m5)	4	19	26	r	subcompact
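Sorting is ascending by default; wrapping a column in desc() reverses it. A small sketch on made-up data (toy_mpg is hypothetical):

```r
library(dplyr)

# Hypothetical miniature of the mpg data, used only for illustration.
toy_mpg <- data.frame(
  model = c("a4", "jetta", "mustang", "civic"),
  drv   = c("f", "f", "r", "f"),
  hwy   = c(29, 30, 26, 36)
)

arrange(toy_mpg, hwy)             # ascending hwy: mustang, a4, jetta, civic
arrange(toy_mpg, drv, desc(hwy))  # by drv, then highest hwy first: civic, jetta, a4, mustang
```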

Select columns with select()

select() lets you keep a subset of the columns by name.

head(select(mpg,class,trans,year,displ))

class	trans	year	displ
compact	auto(l5)	1999	1.8
compact	manual(m5)	1999	1.8
compact	manual(m6)	2008	2.0
compact	auto(av)	2008	2.0
compact	auto(l5)	1999	2.8
compact	manual(m5)	1999	2.8

If you put "-" before a column name, that column is dropped and the remaining columns are kept.
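A brief sketch of keeping and dropping columns, on a made-up data frame (toy_mpg is hypothetical):

```r
library(dplyr)

# Hypothetical miniature of the mpg data, used only for illustration.
toy_mpg <- data.frame(
  model = c("a4", "jetta"),
  year  = c(1999, 2008),
  cty   = c(18, 21),
  hwy   = c(29, 29)
)

names(select(toy_mpg, model, hwy))    # "model" "hwy"
names(select(toy_mpg, -year))         # "model" "cty" "hwy"
names(select(toy_mpg, -c(cty, hwy)))  # "model" "year"
```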

Add new variables with mutate()

mutate() adds new columns at the end of the data frame, computed from the values of existing columns.

head(mutate(mpg,dmpg = hwy - cty))

manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	dmpg
audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	11
audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	8
audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	11
audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	9
audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	10
audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact	8

Here dmpg is the difference between highway miles per gallon and city miles per gallon.
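mutate() returns a new data frame and leaves its input untouched, so the result has to be assigned to be kept. The sketch below, on made-up data (toy_mpg, toy_mpg2 and dmpg_pct are hypothetical names), also shows that a column created earlier in the same call can be reused:

```r
library(dplyr)

# Hypothetical miniature of the mpg data, used only for illustration.
toy_mpg <- data.frame(cty = c(18, 21, 16), hwy = c(29, 29, 26))

toy_mpg2 <- mutate(
  toy_mpg,
  dmpg     = hwy - cty,  # highway minus city mileage
  dmpg_pct = dmpg / cty  # reuses dmpg, defined one line above
)

names(toy_mpg)  # "cty" "hwy" (the original is unchanged)
toy_mpg2$dmpg   # 11 8 10
```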

Grouped summaries with summarise()

summarise() collapses the data frame to a single row.

summarise(mpg,avg_hwy=mean(hwy))

avg_hwy
23.44017

When summarise() is used with group_by(), the unit of analysis is changed from the complete dataset to individual groups.

grouped_mpg<- group_by(mpg,year,cyl)
summarise(grouped_mpg,avg_hwy = mean(hwy))

## # A tibble: 7 x 3
## # Groups:   year [?]
##    year   cyl avg_hwy
##   <int> <int>   <dbl>
## 1  1999     4    28.4
## 2  1999     6    22.3
## 3  1999     8    17.0
## 4  2008     4    29.3
## 5  2008     5    28.8
## 6  2008     6    23.5
## 7  2008     8    18.0
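The same kind of grouped summary can be written as a pipeline, with n() adding the group sizes. This is a sketch on made-up data (toy_mpg and by_year_cyl are hypothetical names). It assumes dplyr 1.0.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical miniature of the mpg data, used only for illustration.
toy_mpg <- data.frame(
  year = c(1999, 1999, 2008, 2008, 2008),
  cyl  = c(4, 6, 4, 4, 6),
  hwy  = c(29, 26, 31, 29, 27)
)

by_year_cyl <- toy_mpg %>%
  group_by(year, cyl) %>%
  summarise(
    n       = n(),        # number of rows in each group
    avg_hwy = mean(hwy),  # mean highway mileage per group
    .groups = "drop"      # return an ungrouped result
  )

by_year_cyl  # 4 rows: one per (year, cyl) combination
```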

Data visualisation

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Prerequisites

In this section we will use ggplot2, one of the core members of the tidyverse.

ScatterPlot

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Barplot

ggplot(data = mpg) + geom_bar(aes(x=class))

ggplot(data = mpg) + geom_bar(aes(x=class, fill = class))

DensityPlot and Histogram

ggplot(data = mpg) + geom_histogram(aes(x= displ))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
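The message above goes away once a bin width is set explicitly. A minimal sketch, assuming the copy of mpg bundled with ggplot2; the value 0.5 is an arbitrary choice for illustration:

```r
library(ggplot2)

# binwidth is in the units of the x variable (litres of displacement here).
p <- ggplot(data = ggplot2::mpg) +
  geom_histogram(aes(x = displ), binwidth = 0.5)
p
```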

ggplot(data = mpg) + geom_density(aes(x=displ))

Boxplot

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

