In this notebook, we cover scatter plots, correlation analysis, and simple linear regression analysis.
A scatter plot can be created with the following code:
plot(datax, datay, main="Title", xlab="xaxis", ylab="yaxis", pch=19)
Here, pch sets the point symbol; the available symbols are listed at the following link:
http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r
We will illustrate one example here:
df <- read.csv('Data/NHANES.csv')
plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=19)
With solid points, overlapping observations are hard to tell apart, so we can use a point with a hollow center instead:
plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=1, col='blue')
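Another common way to reveal overlaps is to draw semi-transparent points, so that dense regions appear darker. The sketch below is only an illustration: it uses simulated data rather than the NHANES file, and the variable names and the opacity of 0.3 are assumptions to tune for your own data.

```r
# Simulated stand-in for the weight/height columns (not the NHANES data).
set.seed(1)
weight <- rnorm(2000, mean = 70, sd = 15)
height <- 120 + 0.55 * weight + rnorm(2000, sd = 10)

# adjustcolor() returns the colour with its opacity scaled by alpha.f.
pt_col <- adjustcolor("blue", alpha.f = 0.3)

# Solid points (pch = 19) drawn with transparency:
# regions where many points overlap look darker.
plot(weight, height, main = "Weight vs Height",
     xlab = "Weight", ylab = "Height", pch = 19, col = pt_col)
```

Lower values of alpha.f make individual points fainter, which tends to help with larger data sets.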
The correlation coefficient can be computed with the following code:
cor(data1, data2, use = "complete.obs")
Here, use = "complete.obs" excludes observations that have a missing value in either variable.
cor(df$Weight, df$Height, use = "complete.obs")
## [1] 0.7489522
A correlation coefficient of about 0.75 indicates a fairly strong positive linear relationship between height and weight.
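To see what use = "complete.obs" does, the small made-up example below compares it with the default behaviour and with dropping incomplete pairs by hand. The vectors x and y are hypothetical and serve only as an illustration.

```r
# Two short vectors, each with one missing value (made-up numbers).
x <- c(1, 2, 3, 4, NA, 6)
y <- c(2.1, 3.9, 6.2, NA, 10.1, 12.0)

# Default (use = "everything"): any NA makes the result NA.
r_default <- cor(x, y)

# Keep only the pairs where both values are present.
r_complete <- cor(x, y, use = "complete.obs")

# The same thing done by hand with complete.cases().
ok <- complete.cases(x, y)
r_manual <- cor(x[ok], y[ok])

print(c(default = r_default, complete = r_complete, manual = r_manual))
```

The last two values agree, because "complete.obs" simply drops every pair in which either value is missing before computing the coefficient.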
A simple linear regression can be fitted and summarized with the following R code:
summary(lm(datay ~ datax, data = df))
Here, datay is the response (y-axis) variable, datax is the predictor (x-axis) variable, and df is the name of the data frame.
summary(lm(Height~Weight, data=df))
##
## Call:
## lm(formula = Height ~ Weight, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -79.942  -7.064   2.090   9.137  35.019
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.220e+02  3.846e-01   317.2   <2e-16 ***
## Weight      5.482e-01  4.942e-03   110.9   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 9632 degrees of freedom
## (366 observations deleted due to missingness)
## Multiple R-squared: 0.5609, Adjusted R-squared: 0.5609
## F-statistic: 1.231e+04 on 1 and 9632 DF, p-value: < 2.2e-16
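The output shows the estimated intercept and slope together with their standard errors and p-values. Here the fitted line is approximately Height = 122.0 + 0.548 × Weight, and the R-squared of 0.56 means that weight explains about 56% of the variance in height. Readers who want to learn more about the remaining quantities should consult a statistics textbook.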
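If the individual numbers are needed rather than the printed summary, they can be pulled out of the fitted model object. The sketch below uses R's built-in women data set (height and weight of 15 women) as a stand-in, so it does not assume the NHANES file is available; the object names fit, s, and preds are arbitrary.

```r
# Fit once and keep the model object for later use.
fit <- lm(height ~ weight, data = women)
s <- summary(fit)

print(coef(fit))        # named vector: intercept and slope
print(s$coefficients)   # estimates, std. errors, t values, p-values
print(s$r.squared)      # multiple R-squared

# With a single predictor, R-squared equals the squared correlation.
r <- cor(women$weight, women$height)
r2_matches <- isTRUE(all.equal(s$r.squared, r^2))
print(r2_matches)

# Predicted heights for two new weights.
preds <- predict(fit, newdata = data.frame(weight = c(120, 150)))
print(preds)
```

The same relationship holds for the output above: 0.7489522 squared is about 0.5609, which matches the reported Multiple R-squared.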
We can add the regression line to a scatter plot with the following code:
abline(lm(datay ~ datax, data = df))
called right after the scatter plot code:
plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=1, col='blue')
abline(lm(Height~Weight, data=df), col='red')
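Fitting the model once and storing it avoids repeating the lm() call for the plot and for the summary. The following is a minimal sketch with simulated data; the names sim, fit, and b are illustrative and do not come from the notebook.

```r
# Simulated data standing in for the Weight/Height columns.
set.seed(2)
sim <- data.frame(Weight = runif(200, min = 40, max = 110))
sim$Height <- 122 + 0.55 * sim$Weight + rnorm(200, sd = 10)

# Fit the model once and reuse the object.
fit <- lm(Height ~ Weight, data = sim)

plot(sim$Weight, sim$Height, main = "Weight vs Height",
     xlab = "Weight", ylab = "Height", pch = 1, col = "blue")
abline(fit, col = "red", lwd = 2)

# abline(fit) draws intercept + slope * x, so passing the two
# coefficients explicitly gives the same line (dashed here).
b <- unname(coef(fit))
abline(a = b[1], b = b[2], col = "black", lty = 2)
```

Note that abline() only makes sense for a model with a single predictor, since it draws one straight line from an intercept and a slope.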
We can include multiple predictors in the linear regression by adding them to the right-hand side of the formula:
summary(lm(datay ~ datax1 + datax2, data = df))
If there are more than two predictors, add + datax3 after + datax2, and so on.
summary(lm(Height~Weight+Age, data=df))
##
## Call:
## lm(formula = Height ~ Weight + Age, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -74.096  -7.441   1.442   8.969  36.975
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.205e+02  3.850e-01  312.90   <2e-16 ***
## Weight      4.990e-01  5.467e-03   91.28   <2e-16 ***
## Age         1.345e-01  6.915e-03   19.46   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.12 on 9631 degrees of freedom
## (366 observations deleted due to missingness)
## Multiple R-squared: 0.5775, Adjusted R-squared: 0.5774
## F-statistic: 6583 on 2 and 9631 DF, p-value: < 2.2e-16
Both coefficients have p-values below 2e-16, so both weight and age are statistically significant predictors of height in this model.
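One way to check whether an added predictor improves the fit is to compare the two nested models directly. The sketch below uses the built-in mtcars data set as a stand-in (an assumption made so the example runs without the NHANES file); the object names are arbitrary.

```r
# Model with one predictor, and the same model with a second one added.
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared penalizes extra predictors, so it is the fairer
# number to compare across models of different size.
adj_r2 <- c(one_predictor  = summary(fit1)$adj.r.squared,
            two_predictors = summary(fit2)$adj.r.squared)
print(adj_r2)

# F-test for whether the larger model fits significantly better.
cmp <- anova(fit1, fit2)
print(cmp)
```

For this comparison both models must be fitted to the same rows. If a newly added variable has missing values in rows where the others do not, one option is to remove incomplete rows first, for example with na.omit() on the relevant columns.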
©2021 by Daiki Tagami. All rights reserved.