The average monthly temperatures for Boulder, Colorado, from January 1991 to December 2015 are given in Boulder_2.xlsx (Source: U.S. Department of Commerce, National Oceanic and Atmospheric Administration). Plot the time series, and create a seasonal plot for the last four years of the series. Comment on your results.
1.Read in the data
Set working directory, and download the dataset in your working directory.
temperature <- read.csv("Boulder_2.csv")
str(temperature)
## 'data.frame': 300 obs. of 3 variables:
## $ Year : int 1991 1991 1991 1991 1991 1991 1991 1991 1991 1991 ...
## $ Month : chr "JAN" "FEB" "MAR" "APR" ...
## $ Temperature: num 29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 ...
tail(temperature)
## Year Month Temperature
## 295 2015 JUL 70.3
## 296 2015 AUG 70.5
## 297 2015 SEP 69.4
## 298 2015 OCT 56.2
## 299 2015 NOV 40.8
## 300 2015 DEC 33.1
2. Create time series object
Before using the ts() function, remember to library the package you want to use.
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
temp.ts <- ts(temperature$Temperature, start = c(1991,1), end = c(2015,12), frequency = 12)
temp.ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1991 29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 36.8 35.3
## 1992 35.9 40.6 43.3 54.3 59.1 62.9 68.3 66.3 64.4 54.1 34.1 29.2
## 1993 28.3 30.6 42.4 47.6 57.5 64.5 69.5 67.3 58.8 48.7 35.6 35.4
## 1994 35.5 31.9 43.9 47.6 60.8 70.0 71.2 70.9 65.0 50.6 36.6 36.1
## 1995 34.5 38.3 42.1 45.1 50.9 62.4 70.5 74.0 60.4 50.5 45.0 36.3
## 1996 29.7 37.7 37.9 50.4 58.9 66.9 71.5 69.5 60.8 53.1 40.6 36.5
## 1997 31.3 32.8 45.5 42.8 57.4 66.5 71.4 68.7 64.0 52.7 37.9 33.9
## 1998 36.5 36.4 38.7 46.5 58.8 62.1 72.8 70.7 67.1 50.4 44.0 32.2
## 1999 36.4 42.1 46.0 44.5 55.6 64.8 73.5 69.3 58.5 51.9 48.0 36.9
## 2000 36.4 41.0 42.9 51.2 61.0 67.4 74.7 73.0 63.1 49.6 31.4 31.2
## 2001 33.0 32.3 40.8 50.7 58.4 68.9 75.0 71.8 65.0 53.8 43.9 35.0
## 2002 33.1 36.0 37.5 52.9 56.2 70.5 76.8 71.3 64.0 45.9 40.3 36.6
## 2003 40.2 31.6 43.7 50.6 57.4 62.8 75.9 72.9 60.5 57.4 39.0 36.4
## 2004 35.4 33.6 48.2 49.2 59.9 62.7 69.2 66.4 62.9 51.9 39.7 36.5
## 2005 35.5 37.9 42.0 48.4 57.6 65.4 75.1 69.7 66.5 52.6 45.0 33.3
## 2006 40.7 33.7 39.4 53.9 61.0 71.6 74.4 71.6 58.4 51.0 43.4 35.3
## 2007 27.2 34.6 47.6 47.9 58.0 67.7 74.8 73.7 64.5 55.2 44.9 30.2
## 2008 31.6 36.1 40.8 47.8 57.1 66.1 75.0 69.6 60.9 46.0 46.0 31.1
## 2009 38.2 39.7 44.3 47.3 59.3 63.0 69.6 69.6 63.1 44.5 43.8 26.7
## 2010 33.0 30.1 42.7 48.8 53.9 66.9 72.5 72.4 66.6 55.0 39.8 37.2
## 2011 33.1 32.0 45.2 48.9 53.7 67.6 73.5 75.1 63.4 52.9 42.3 32.2
## 2012 38.9 32.3 50.8 54.3 60.2 74.2 74.8 73.2 66.0 50.8 46.1 33.7
## 2013 33.0 32.1 40.5 43.8 57.7 69.9 72.2 72.2 65.1 47.8 43.2 31.5
## 2014 34.6 32.0 43.6 49.8 56.6 66.2 72.2 69.2 63.8 55.3 38.3 33.9
## 2015 36.5 36.6 46.1 50.1 52.4 68.3 70.3 70.5 69.4 56.2 40.8 33.1
3. Make time series plot
Remember to add main title to your plot and add labels to X-axis and Y-axis.
autoplot(temp.ts,main = "Everage tempurature", xlab = "Time", ylab = "°F")
4. Create seasonal plot for the last your years
First we need to extract the last four years data. We can use window() function to extract a part of time series object. The first argument is the time series object from which you want to extract. The second and third argument is the start date and end date.
temp.last4.ts <- window(temp.ts,c(2012,1),c(2015,12))
Then we can use seasonplot() or ggseasonplot() to create a season plot. We can set argument year.labels = T to enable labels for each year, and we can use col to specify each year’s color. Here we also set year.label.left to T to plot labels on the left side of the season plot.
seasonplot(temp.last4.ts,year.labels = T, year.labels.left = T,col = 1:4, main = "Season Plot for Average Tempurature", ylab = "°F", xlab = "Month")
ggseasonplot(temp.last4.ts,year.labels = T, year.labels.left = T, main = "Season Plot for Average Tempurature", ylab = "°F", xlab = "Month")
For the temperature data (Boulder_2.xlsx) in Exercise 2.1, compute the summary statistics (mean, median, MAD, and S) overall and for each month. Comment on your results. Does it make sense to compute summary statistics across all values, rather than month by month? Explain why or why not.
1. Read in the dataset
We’ve already read in the data set in Problem 1.
2. Calculate overall summary statistics
mean
mean(temperature$Temperature)
## [1] 51.58367
median
median(temperature$Temperature)
## [1] 50.55
MAD
We need to calculate deviation first, and then calculate the mean absolute deviation.
temp_deviation <- temperature$Temperature - mean(temperature$Temperature)
mean(abs(temp_deviation))
## [1] 12.43013
Standard deviation
sd(temperature$Temperature)
## [1] 14.12608
3. Summary Statistics by Month
We can use apply() function to help us calculate the monthly summary statistics.
First, we need to convert the time-series object to a matrix. We make each row as a year and each column as a month, and rename the rows and columns by function colnames() and rownames().
temp_mat <- matrix(temp.ts,nrow = 25,ncol = 12, byrow = T)
colnames(temp_mat) <- 1:12
rownames(temp_mat) <- 1991:2015
temp_mat
## 1 2 3 4 5 6 7 8 9 10 11 12
## 1991 29.9 40.9 42.8 47.8 58.2 66.6 70.5 69.2 61.7 52.1 36.8 35.3
## 1992 35.9 40.6 43.3 54.3 59.1 62.9 68.3 66.3 64.4 54.1 34.1 29.2
## 1993 28.3 30.6 42.4 47.6 57.5 64.5 69.5 67.3 58.8 48.7 35.6 35.4
## 1994 35.5 31.9 43.9 47.6 60.8 70.0 71.2 70.9 65.0 50.6 36.6 36.1
## 1995 34.5 38.3 42.1 45.1 50.9 62.4 70.5 74.0 60.4 50.5 45.0 36.3
## 1996 29.7 37.7 37.9 50.4 58.9 66.9 71.5 69.5 60.8 53.1 40.6 36.5
## 1997 31.3 32.8 45.5 42.8 57.4 66.5 71.4 68.7 64.0 52.7 37.9 33.9
## 1998 36.5 36.4 38.7 46.5 58.8 62.1 72.8 70.7 67.1 50.4 44.0 32.2
## 1999 36.4 42.1 46.0 44.5 55.6 64.8 73.5 69.3 58.5 51.9 48.0 36.9
## 2000 36.4 41.0 42.9 51.2 61.0 67.4 74.7 73.0 63.1 49.6 31.4 31.2
## 2001 33.0 32.3 40.8 50.7 58.4 68.9 75.0 71.8 65.0 53.8 43.9 35.0
## 2002 33.1 36.0 37.5 52.9 56.2 70.5 76.8 71.3 64.0 45.9 40.3 36.6
## 2003 40.2 31.6 43.7 50.6 57.4 62.8 75.9 72.9 60.5 57.4 39.0 36.4
## 2004 35.4 33.6 48.2 49.2 59.9 62.7 69.2 66.4 62.9 51.9 39.7 36.5
## 2005 35.5 37.9 42.0 48.4 57.6 65.4 75.1 69.7 66.5 52.6 45.0 33.3
## 2006 40.7 33.7 39.4 53.9 61.0 71.6 74.4 71.6 58.4 51.0 43.4 35.3
## 2007 27.2 34.6 47.6 47.9 58.0 67.7 74.8 73.7 64.5 55.2 44.9 30.2
## 2008 31.6 36.1 40.8 47.8 57.1 66.1 75.0 69.6 60.9 46.0 46.0 31.1
## 2009 38.2 39.7 44.3 47.3 59.3 63.0 69.6 69.6 63.1 44.5 43.8 26.7
## 2010 33.0 30.1 42.7 48.8 53.9 66.9 72.5 72.4 66.6 55.0 39.8 37.2
## 2011 33.1 32.0 45.2 48.9 53.7 67.6 73.5 75.1 63.4 52.9 42.3 32.2
## 2012 38.9 32.3 50.8 54.3 60.2 74.2 74.8 73.2 66.0 50.8 46.1 33.7
## 2013 33.0 32.1 40.5 43.8 57.7 69.9 72.2 72.2 65.1 47.8 43.2 31.5
## 2014 34.6 32.0 43.6 49.8 56.6 66.2 72.2 69.2 63.8 55.3 38.3 33.9
## 2015 36.5 36.6 46.1 50.1 52.4 68.3 70.3 70.5 69.4 56.2 40.8 33.1
To get the monthly summary statistics, we want to calculate each column’s mean, median, MAD and sd. So we can use apply() function. The first argument of apply() is a matrix or data.frame. The second argument “MARGIN” is to indicate whether we want to calculate each column’s or each row’s data. MARGIN = 1, means calculating by each row. MARGIN = 2, means calculating by each column.
Here, for we want monthly summary statistics, we will set MARGIN = 2. The third argument is the function you want to use, for example, mean(), median(), sd()
apply(temp_mat, 2, mean)
## 1 2 3 4 5 6 7 8 9 10 11
## 34.336 35.316 43.148 48.888 57.504 66.636 72.608 70.724 63.356 51.600 41.060
## 12
## 33.828
apply(temp_mat, 2, median)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 34.6 34.6 42.9 48.8 57.7 66.6 72.5 70.7 63.8 51.9 40.8 33.9
apply(temp_mat, 2, sd)
## 1 2 3 4 5 6 7 8
## 3.467453 3.649621 3.211843 3.080357 2.604080 3.131145 2.354060 2.308910
## 9 10 11 12
## 2.822983 3.268792 4.159928 2.739605
To get MAD, we need to create a new function by ourselves.
MAD_func <- function(vec){
d_vec <- vec - mean(vec)
MAD <- mean(abs(d_vec))
return(MAD)
}
apply(temp_mat, 2, MAD_func)
## 1 2 3 4 5 6 7 8 9 10
## 2.75968 3.16064 2.43392 2.37152 1.90752 2.42144 2.00032 1.87296 2.22528 2.54400
## 11 12
## 3.43040 2.21664
Suppose you have two forecasting models: Model 1 and Model 2. The prediction of Model 1 is:
prediction_1 <- c(21.6 , 21.4 , 21.9 , 22.8 , 19.4 , 17.8 , 13.8 , 24.5 , 22.5 , 18.1 , 15.5 , 16.5 , 16.9 , 15.7 , 9.7)
The prediction of Model 2 is:
prediction_2 <- c(20.2 , 21.3 , 23.3 , 20.9 , 21.4 , 21.1 , 13.5 , 23.2 , 20.8 , 20.9 , 18.9 , 16 , 18 , 15.6 , 11.8)
The actual values are:
actual <- c(21 , 21 , 22.8 , 21.4 , 18.7 , 18.1 , 14.3 , 24.4 , 22.8 , 19.2 , 17.8 , 16.4 , 17.3 , 15.2 , 10.4)
Please calculate the RMSE of Model 1 and Model 2
Which model is better? Why?
a. Calculate RMSE
prediction_1 <- c(21.6 , 21.4 , 21.9 , 22.8 , 19.4 , 17.8 , 13.8 , 24.5 , 22.5 , 18.1 , 15.5 , 16.5 , 16.9 , 15.7 , 9.7)
prediction_2 <- c(20.2 , 21.3 , 23.3 , 20.9 , 21.4 , 21.1 , 13.5 , 23.2 , 20.8 , 20.9 , 18.9 , 16 , 18 , 15.6 , 11.8)
actual <- c(21 , 21 , 22.8 , 21.4 , 18.7 , 18.1 , 14.3 , 24.4 , 22.8 , 19.2 , 17.8 , 16.4 , 17.3 , 15.2 , 10.4)
prediction_error_1 <- actual - prediction_1
prediction_error_2 <- actual - prediction_2
RMSE_1 <- sqrt(mean(prediction_error_1^2))
RMSE_2 <- sqrt(mean(prediction_error_2^2))
c(RMSE_1, RMSE_2)
## [1] 0.8805301 1.4252485
b. Which one is better?
Model 1 is better. Model 1 has a smaller RMSE, and a smaller RMSE means the prediction is more accurate.
We have a table of the average temperatures (in °F) on the day (January 20) of the president’s inauguration in Washington, DC (Inauguration.csv).
Summarize the data numerically (mean, median, MAD and SD) and graphically.
Create a 95 percent prediction interval for the inaugurations of President Obama in 2009 and in 2013. Please use the mean and standard deviation of all past years, and based on normal distribution to make the prediction interval.
The actual values were 28º and 40°. Does that come as a surprise, given the width of the prediction interval?
a. Summary
DC_temp <- read.csv("Inauguration.csv")
str(DC_temp)
## 'data.frame': 20 obs. of 2 variables:
## $ Year : int 1937 1941 1945 1949 1953 1957 1961 1965 1969 1973 ...
## $ Temperature: int 33 29 35 38 49 44 22 38 35 42 ...
summary(DC_temp$Temperature)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.0 32.0 35.0 35.9 40.5 55.0
sd(DC_temp$Temperature)
## [1] 10.57753
DC_temp_ts <- ts(DC_temp$Temperature, start = 1937, frequency = 0.25)
autoplot(DC_temp_ts, main = "Average temperatures (in °F)", xlab = "Year", ylab = "°F")
b. Prediction Interval
Exclude the year of 2009 and 2013. Then calculate the mean and SD.
DC_temp_past <- DC_temp$Temperature[-c(19,20)]
mean(DC_temp_past)
## [1] 36.11111
sd(DC_temp_past)
## [1] 10.96995
To make a 95% percent prediction interval, we need the Z-score corresponding to 97.5%. 1 - (1 - 95%)/2 = 97.5%
qnorm(0.975)
## [1] 1.959964
mean(DC_temp_past) - 1.96 * sd(DC_temp_past)
## [1] 14.61
mean(DC_temp_past) + 1.96 * sd(DC_temp_past)
## [1] 57.61222