First we use read.csv() function to read in the US retail sales dataset.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# read in the dataset.
retail <- read.csv("Module 1/US_Retail_sales_2.csv")
There are 181 observations and 5 variables. It is a monthly data.
monthly_sale_ts <- ts(retail$SALES...millions., start = c(2001,1),end = c(2015, 12), frequency = 12)
autoplot(monthly_sale_ts, main = "Monthly Retail Sales", ylab = "Sales in Millions")
We first use window() function to extract the data for the last four years, then we can use ggseasonal or seasonal to create seasonal plot.
last_four_yrs <- window(monthly_sale_ts, start = c(2012, 1), end = c(2015, 12))
ggseasonplot(last_four_yrs, main = "Seasonal plot of last 4 years", xlab = "Month", ylab = "Sales in Millions")
seasonplot(last_four_yrs, main = "Seasonal plot of last 4 years", xlab = "Month", ylab = "Sales in Millions", year.labels = T)
The sales achieves the peak on December each year and the sales in January is the lowest of each year.
To make a scatter plot, we can use function plot(Y ~ X). Please remember to add the main title and labels to the axis.
injury <- read.csv("Module 1/Rail_safety.csv")
plot(injury$Injuries ~ injury$Train.miles, main = "Scatter plot of injury and train miles", xlab = "Train miles", ylab = "Number of Injuries")
To make the time series plots, we need first creat time series objects in R through function ts(). As the it is the annual data, the frequency should be 1 (1 observation per year).
injury_ts <- ts(injury$Injuries, start = 1990, end = 2009, frequency = 1)
miles_ts <- ts(injury$Train.miles, start = 1990, end = 2009, frequency = 1)
injury_per_mile_ts <- ts(injury$Injuries.per.T.M, start = 1990, end = 2009, frequency = 1)
Then we can use autoplot() function to create time series plots. Please remember to add the titles and labels to axis.
autoplot(injury_ts, main = "Time Series Plot of Injuries", xlab = "Year", ylab = "Number of Injuries")
autoplot(miles_ts, main = "Time Series Plot of Miles", xlab = "Year", ylab = "Number of Miles")
autoplot(injury_per_mile_ts, main = "Time Series Plot of Injuries per Mile", xlab = "Year", ylab = "Number of Injuries per Miles")
Yes, the number of injuries is increasing over the time, and it achieved the peak in the year of 2007.
Please note that the build-in function mad() is calculate the Median Absolute Deviation. Here we are calculate the Mean Absolute Deviation.
MAD_func <- function(x){
return(mean(abs(x - mean(x))))
}
# Injury
mean(injury$Injuries)
median(injury$Injuries)
MAD_func(injury$Injuries)
sd(injury$Injuries)
# Miles
mean(injury$Train.miles)
median(injury$Train.miles)
MAD_func(injury$Train.miles)
sd(injury$Train.miles)
# Injury per miles
mean(injury$Injuries.per.T.M)
median(injury$Injuries.per.T.M)
MAD_func(injury$Injuries.per.T.M)
sd(injury$Injuries.per.T.M)
Injuries: mean: 726.55, median: 629.5, MAD: 233.26, S: 303.1238
Miles: mean: 83.95, median: 83, MAD: 7.85, S: 9.10451
Injuries per mile: mean: 843.4324, median: 776.9231, MAD: 197.4803, S: 265.6596
Again, we can use plot(Y ~ X) function to draw a scatter plot.
electricity <- read.csv("Module 1/Electricity.csv")
plot(electricity$Forecast ~ electricity$Actual, main = "Electricity Consumption", xlab = "Actual Consumption", ylab = "Predicted Consumption")
We can use cor() function to calculate the linear correlation coefficient. As the correlation coefficient is 0.95, there is a strong positive linear correlation between the predicted and the actual electricity consumption.
cor(electricity$Actual, electricity$Forecast)
## [1] 0.9547757
prediction_1 <- c(13.1, 15.4, 11.6, 20.7, 9.1, 11.9, 19.5, 14.2, 14.7, 7.4, 18.1, 19.3, 22.3, 11.4, 10.5, 21.1, 15.6, 22.0, 14.8, 16.8)
prediction_2 <- c(18.0, 13.8, 7.8, 22.1, 11.5, 13.1, 15.0, 14.7, 12.0, 6.3, 24.0, 16.4, 23.4, 15.2, 5.5, 21.2, 20.8, 20.2, 15.9, 16.7)
actual <- c(12.5, 16.3, 11.6, 21.4, 10.3, 12.9, 18.2, 14.9, 12.2, 8.0, 20.0, 18.1, 21.6, 12.7, 9.8, 20.2, 14.8, 19.7, 12.6, 16.6)
We can use scale() function to get the Z-scores for all the observations, and then extract the 5th.
scale(actual)[5]
## [1] -1.196853
We can also manually calculate the Z-score by \(\frac{obs - mean}{sd}\).
(actual[5] - mean(actual)) / sd(actual)
## [1] -1.196853
sqrt(mean( (actual - prediction_1)^2 ))
## [1] 1.269449
sqrt(mean( (actual - prediction_2)^2 ))
## [1] 2.838486
Model 1 is better, as it has a lower RMSE, which indicates a more accurate prediction.
As we are asked to construct a 95% confidence prediction itnerval, the corresponding Z score we are going to use is qnorm(1 - 5%/2), 1.96. We can simply apply the following fomula for the upper and lower bound: \[ \text{Lower Bound} = \text{Point Estimation} - Z * \text{RMSE} \\ \text{Upper Bound} = \text{Point Estimation} + Z * \text{RMSE} \]
19.5 - 1.96 * 1.269449
## [1] 17.01188
19.5 + 1.96 * 1.269449
## [1] 21.98812
The 95% confidence prediction interval is [ 17.01, 21.99].