Mr. Perera who lives in Soratha Mawatha - Wijerama wants to sell his house. He wants to decide a price for his house to list it in the market. He believes that the size of the house is one likely determinant of price. He asked from 200 homes in the neighbourhood “what price should you ask for your home when you put it on the market?” and square feet. Collected data are available to you in the file house.csv
.
Fit a suitable simple linear regression model for the data and carry out complete diagnostic tests for the model to check the validity of the regression assumptions.
Comment on each and every output received in part (1). You should clearly states corresponding hypotheses as appropriate.
What price should you suggest for Mr. Perera’s home? Square feet of Mr Perera’s House is 2000.
library(tidyverse)
library(broom)
house <- read_csv("house.csv")
attach(house)
glimpse(house)
Rows: 200
Columns: 2
$ price <dbl> 68.92162, 63.97395, 69.51606, 68.86582, 69.24784, 68.54808,…
$ houseSize <dbl> 2503.489, 2362.630, 2522.676, 2491.663, 2511.292, 2476.032,…
summary(house)
price houseSize
Min. :59.89 Min. :2208
1st Qu.:65.79 1st Qu.:2437
Median :67.28 Median :2504
Mean :67.45 Mean :2502
3rd Qu.:69.27 3rd Qu.:2579
Max. :74.02 Max. :2769
qplot(data=house, x=price, geom=c("histogram"))+
geom_histogram(color="black", fill="#d95f02")
Interpretation:
qplot(data=house, x=houseSize, geom=c("histogram"))+
geom_histogram(color="black", fill="#1b9e77")
Interpretation:
qplot(data=house, y=price, x=houseSize) +
xlab("House size (sqft)") +
ylab("Price") +
labs(title="Price vs House size (sqft)")
cor(house$price, house$houseSize)
[1] 0.9331807
Interpretation:
slrfit <- lm(price ~ houseSize, data=house)
slrfit
Call:
lm(formula = price ~ houseSize, data = house)
Coefficients:
(Intercept) houseSize
5.15772 0.02489
qplot(data=house, y=price, x=houseSize) +
xlab("House size (sqft)") +
ylab("Price") +
labs(title="Price vs House size (sqft)") +
geom_abline(intercept = 5.15772, slope = 0.02489, colour="forestgreen", lwd=2)
Write the fitted regression model:
slrfit_values <- augment(slrfit)
slrfit_values
# A tibble: 200 x 8
price houseSize .fitted .resid .std.resid .hat .sigma .cooksd
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 68.9 2503. 67.5 1.45 1.48 0.00500 0.975 0.00552
2 64.0 2363. 64.0 0.00442 0.00455 0.0145 0.981 0.000000152
3 69.5 2523. 68.0 1.56 1.60 0.00520 0.974 0.00670
4 68.9 2492. 67.2 1.68 1.73 0.00506 0.973 0.00757
5 69.2 2511. 67.7 1.58 1.62 0.00504 0.974 0.00662
6 68.5 2476. 66.8 1.76 1.80 0.00534 0.973 0.00869
7 64.8 2471. 66.7 -1.86 -1.91 0.00547 0.972 0.0100
8 71.6 2634. 70.7 0.840 0.864 0.0134 0.979 0.00508
9 70.2 2581. 69.4 0.808 0.830 0.00799 0.979 0.00277
10 65.9 2424. 65.5 0.414 0.425 0.00796 0.980 0.000725
# … with 190 more rows
Write the code to plot residual vs fitted values
qplot(data=slrfit_values, y=.resid, x=.fitted)
qplot(data=slrfit_values, x=.resid, geom=c("histogram"))+
geom_histogram(color="black", fill="lightblue")
ggplot(slrfit_values,
aes(sample=.resid))+
stat_qq() + stat_qq_line()+labs(x="Theoretical Quantiles", y="Sample Quantiles")
shapiro.test(slrfit_values$.resid)
Shapiro-Wilk normality test
data: slrfit_values$.resid
W = 0.99755, p-value = 0.9897
H0:
H1:
Decision:
Conclusion:
summary(slrfit)
Call:
lm(formula = price ~ houseSize, data = house)
Residuals:
Min 1Q Median 3Q Max
-2.67119 -0.63937 0.00189 0.67506 2.79721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.1577228 1.7063810 3.023 0.00284 **
houseSize 0.0248925 0.0006813 36.535 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9781 on 198 degrees of freedom
Multiple R-squared: 0.8708, Adjusted R-squared: 0.8702
F-statistic: 1335 on 1 and 198 DF, p-value: < 2.2e-16
\(R^2 = 87.02\%\)
Interpretation:
Interpretation:
Interpretation:
Interpretation:
Interpretation:
newhouseSize <- data.frame(houseSize = c(2000, 2500, 2300, 2400, 2760))
predict(slrfit, newdata=newhouseSize , interval="predict")
fit lwr upr
1 54.94275 52.89457 56.99092
2 67.38900 65.45527 69.32274
3 62.41050 60.45773 64.36326
4 64.89975 62.96113 66.83837
5 73.86105 71.89660 75.82551