In this notebook, we will cover scatter plots, correlation analysis, and simple and multiple linear regression analysis.

Scatter plot

A scatter plot can be created using the following code:

plot(datax, datay, main="Title", xlab="xaxis", ylab="yaxis", pch=19)

Here, datax and datay are the x-axis and y-axis data, and pch sets the point symbol. A list of the available symbols can be found at the following link:

http://www.sthda.com/english/wiki/r-plot-pch-symbols-the-different-point-shapes-available-in-r

We will illustrate one example here:

df <- read.csv('Data/NHANES.csv')
plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=19)

With solid points, overlapping observations are difficult to tell apart, so we can use a point with a hollow center instead:

plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=1, col='blue')
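Another common way to reveal overlap is to keep solid points but make them semi-transparent, so that dense regions appear darker. The following is a minimal sketch under stated assumptions: the data are simulated (not taken from NHANES), and the names x, y, and faint_blue are made up for illustration.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
x <- rnorm(2000, mean = 70, sd = 15)
y <- 120 + 0.5 * x + rnorm(2000, sd = 10)

# adjustcolor() returns the given colour with the requested opacity
faint_blue <- adjustcolor("blue", alpha.f = 0.2)

# Solid points (pch = 19) drawn at 20% opacity: overlaps look darker
plot(x, y, main = "Simulated x vs y", xlab = "x", ylab = "y",
     pch = 19, col = faint_blue)
```

The opacity of 0.2 is an arbitrary choice; smaller values suit larger data sets.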

Correlation analysis

The correlation coefficient (Pearson's, by default) can be computed using the following code:

cor(data1, data2, use = "complete.obs")

The argument use = "complete.obs" tells cor to drop every observation in which either value is missing before computing the correlation.

cor(df$Weight, df$Height, use = "complete.obs")
## [1] 0.7489522

The correlation coefficient is about 0.75, which indicates a strong positive linear relationship between height and weight.
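To see what use = "complete.obs" changes, the sketch below compares the two calls on a tiny pair of vectors. The vectors x and y are hypothetical toy values, not NHANES data.

```r
# Toy vectors with one missing value each (hypothetical values)
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 6, NA, 10)

# With the default setting, any missing value makes the result NA
r_default <- cor(x, y)

# With complete.obs, only the pairs (1,2), (2,4), (3,6) are used
r_complete <- cor(x, y, use = "complete.obs")

print(r_default)
print(r_complete)
```

Because the three complete pairs lie exactly on a straight line, the second result is 1, while the first is NA.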

Simple linear regression analysis

A simple linear regression can be fitted and summarized using the following R code:

summary(lm(datay ~ datax, data = df))

Here, datay is the response (y-axis) variable, datax is the predictor (x-axis) variable, and df is the data frame that contains both columns.

summary(lm(Height~Weight, data=df))
## 
## Call:
## lm(formula = Height ~ Weight, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.942  -7.064   2.090   9.137  35.019 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.220e+02  3.846e-01   317.2   <2e-16 ***
## Weight      5.482e-01  4.942e-03   110.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.37 on 9632 degrees of freedom
##   (366 observations deleted due to missingness)
## Multiple R-squared:  0.5609, Adjusted R-squared:  0.5609 
## F-statistic: 1.231e+04 on 1 and 9632 DF,  p-value: < 2.2e-16

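The output reports the intercept, the slope for Weight, and their p-values. The fitted line is Height = 122.0 + 0.5482 × Weight, and the R-squared of 0.5609 equals the square of the correlation coefficient computed earlier (0.7489522² ≈ 0.5609). Readers who want to learn more about interpreting this output should consult a statistics textbook.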
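The individual numbers in the summary can also be pulled out of the fitted model and used for prediction. The sketch below assumes a small made-up data frame called toy (its values are hypothetical, not from NHANES) so that it runs without the CSV file.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Weight = c(50, 60, 70, 80, 90),
  Height = c(150, 158, 163, 171, 175)
)

# Store the fitted model so it can be reused
fit <- lm(Height ~ Weight, data = toy)

# Named vector holding the intercept and the slope
print(coef(fit))

# In simple linear regression, R-squared is the squared correlation
r2 <- summary(fit)$r.squared
print(all.equal(r2, cor(toy$Weight, toy$Height)^2))

# Predicted Height for a new Weight value
pred <- predict(fit, newdata = data.frame(Weight = 75))
print(pred)
```

For these toy values the slope is 0.63 and the intercept is 119.3, so the prediction at Weight = 75 is 166.55.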

Plot regression line

We can add a regression line to an existing scatter plot using the following code:

abline(lm(datay ~ datax, data = df))

Calling it right after the scatter plot code gives:

plot(df$Weight, df$Height, main='Weight vs Height', xlab='Weight', ylab='Height', pch=1, col='blue')
abline(lm(Height~Weight, data=df), col='red')
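When the model is needed more than once, it can be stored in a variable and passed to abline, which reads the intercept and slope from it. The sketch below uses a made-up data frame toy (hypothetical values, not NHANES) and also draws the same line from the coefficients directly to show that the two calls agree.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  Weight = c(50, 60, 70, 80, 90),
  Height = c(150, 158, 163, 171, 175)
)

# Fit once and reuse the model object
fit <- lm(Height ~ Weight, data = toy)

plot(toy$Weight, toy$Height, main = "Weight vs Height (toy data)",
     xlab = "Weight", ylab = "Height", pch = 1, col = "blue")

# abline() takes the intercept and slope from the fitted model
abline(fit, col = "red")

# Equivalent call with explicit intercept (a) and slope (b)
a <- coef(fit)[["(Intercept)"]]
b <- coef(fit)[["Weight"]]
abline(a = a, b = b, col = "darkgreen", lty = 2)
```

The dashed green line lies exactly on top of the red one.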

Multiple linear regression analysis

We can include more than one predictor in the regression by adding further variables to the right-hand side of the formula, separated by +:

summary(lm(datay ~ datax1 + datax2, data = df))

If there are more than two predictors, add + datax3 after + datax2, and so on.

summary(lm(Height~Weight+Age, data=df))
## 
## Call:
## lm(formula = Height ~ Weight + Age, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.096  -7.441   1.442   8.969  36.975 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.205e+02  3.850e-01  312.90   <2e-16 ***
## Weight      4.990e-01  5.467e-03   91.28   <2e-16 ***
## Age         1.345e-01  6.915e-03   19.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.12 on 9631 degrees of freedom
##   (366 observations deleted due to missingness)
## Multiple R-squared:  0.5775, Adjusted R-squared:  0.5774 
## F-statistic:  6583 on 2 and 9631 DF,  p-value: < 2.2e-16

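Both Weight and Age have p-values below 2e-16, so each is a statistically significant predictor of height when the other is held fixed. Adding Age raises the R-squared only slightly, from 0.5609 to 0.5775.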
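The coefficient table printed by summary can also be extracted as a matrix, which is convenient when only the estimates and p-values are needed. The sketch below is a minimal example on simulated data: the data frame toy, its coefficients (120, 0.5, 0.1), and the noise level are all assumptions chosen for illustration, not NHANES values.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)  # make the simulation reproducible
n <- 200
toy <- data.frame(
  Weight = rnorm(n, mean = 70, sd = 15),
  Age    = runif(n, min = 18, max = 80)
)
toy$Height <- 120 + 0.5 * toy$Weight + 0.1 * toy$Age + rnorm(n, sd = 5)

# Fit the model with two predictors
fit <- lm(Height ~ Weight + Age, data = toy)

# Matrix with one row per term and four columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit))

# Keep only the estimates and the p-values
print(coef_table[, c("Estimate", "Pr(>|t|)")])
```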

©2021 by Daiki Tagami. All rights reserved.