---
title: "Quantitative Research Skills 2022, Assignment 2"
author: "Rong Guang"
date: "Sep 27 2022"
output:
  html_document:
    theme: flatly
    highlight: haddock
    toc: yes
    toc_depth: 2
    number_sections: no
  word_document:
    toc: yes
    toc_depth: '2'
subtitle: Factor Analysis
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Assignment 2: Factor Analysis
Begin working with this when you have completed the **Hands-On Exercise 2**.
*******************************************************************************
**Your task:**
Practice with 2-3 sets of variables (one at a time) in the USHS data, trying different numbers of factors (2-3), comparing orthogonal and oblique rotations, and hence going through ALL the phases required to do FA (as instructed and shown in the **Hands-On-Exercise 2**).
It should be possible to analyse these particular sets of variables in the USHS data with FA:
(Some of them might be easier to interpret than others).
k21_1:k21_9, # various things in life (obs: originally wrong scale! "3" -> NA!), 9 variables
k22_1:k22_10, # how often (felt various things), 10 variables
k23:k34, # General Health Questionnaire (1-4), 12 variables
k46_1:k46_11, # on how many days/week eat various food/(11)skip lunch (0,...,7=every day), 11 variables
k95_1:k95_18, # feelings of studies (1:disagree,6:agree), 18 variables
**OBS:** If you use the k21* variables, remember to wrangle them first ("3" -> NA!) See Hands-On-Exercise 1b!
**Note:** With FA, the variables can have different directions (like the items 2 & 3 in k22*), even different scales.
*PERHAPS you could also try to combine variables across different measures: some items of k21 together with some items from k95, for instance!* (I have not tried to analyse that kind of combinations, but it should be perfectly possible, at least if the items would be somehow substantially related to each other. TRY IT!!)
What is also required in this **Assignment 2** are your **interpretations** (you don't have to be a substance expert; I just encourage you to practice, as those phenomena should be easy enough to understand without special expertise). Interpreting your analyses is very important, and this course is a safe place to practice that part, too.
**So TRY to interpret the factors you find and give them describing NAMES.**
Finally, show that you can work with the new **factor score variables** in the original data: visualise those scores with histograms, box plots, scatter plots, using also some other variables of the USHS. You will find a lot of good R code for producing such visualisations in the earlier exercises.
*******************************************************************************
# Analysing the USHS data with Factor Analysis (step by step)
Use as many R code chunks and text areas as you need - it's now all up to you!
*Please choose only your BEST FACTOR ANALYSIS and show it here step by step!*
Start by reading in the data and converting the factors (gender and age), just like in the exercise.
## Environment and data preparation
**1. Package loading**
```{r}
library(tidyverse)   # a collection of useful packages
library(dplyr)       # data wrangling
library(psych)       # factor analysis
library(patchwork)   # picture layout
library(finalfit)    # well-tabulated descriptive and inferential statistics
library(naniar)      # visualising missing data; install.packages("naniar")
library(broom)       # tidy tables
library(ggplot2)     # plotting graphs
library(tidyr)       # reshaping data to longer or wider format
library(GPArotation) # matrix rotation
```
**2. Data wrangling <br />**
**2.1 Load USHS data**
```{r}
# read raw data
USHS <- read.csv("daF3224e.csv", sep = ";")
# select a reduced set of variables into a new data set
USHSpart <- USHS %>% 
  select(
    fsd_id,
    k1,           # age
    k2,           # gender (1=male, 2=female, 3=not known/unspecified), NA (missing)
    bv1,          # university (from register), see label info! (40 different codes)
    bv3,          # field of study (from register), various codes
    bv6,          # higher educ institution (0=uni applied sciences, 1=university)
    k7,           # learning disability
    k8_1:k8_4,    # current well-being
    k9_1:k9_30,   # symptoms (past month), skip 31 ("other") due to missingness
    k11a, k11b,   # height (cm) a=male, b=female
    k12a, k12b,   # weight (kg) a=male, b=female
    k13,          # what do you think of your weight? (1=under, 5=over)
    k14:k20,      # various questions (mostly yes/no)
    k21_1:k21_9,  # various things in life (obs: originally wrong scale! "3" -> NA!)
    k22_1:k22_10, # how often (felt various things) (obs: 2 & 3 should be inversed! see quest.)
    k23:k34,      # General Health Questionnaire (1-4, directions already OK)
    k95_1:k95_18) # (end of select!)
```
**2.2 Variable recreation and generation<br />**
  In this section, I first generated height, weight, age and gender variables. Since height and weight are better interpreted with reference to each other, I then combined the pair into a Body Mass Index (BMI) variable using the well-accepted formula:
$$
BMI=\frac{Weight (kg)}{Height (m)^{2}}
$$
Next, I generated a new categorical variable by cutting BMI into underweight (<18.5), normal weight (18.5-25), overweight (25-30) and obese (>30). Previous studies have found that BMI is a good gauge of the risk of diseases associated with excess body fat: the higher the BMI, the higher the risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it is interesting to explore its relation to Finnish students' general health, dietary status and feelings about their studies.
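As a quick arithmetic sanity check of the formula, with hypothetical values (not USHS data):

```{r}
# Sanity check of the BMI formula with hypothetical values (not USHS data):
70 / (1.75^2) # 70 kg, 175 cm -> about 22.86, i.e. "normal weight"
```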
```{r}
# Recreate height, weight, age and gender variables
USHSv2 <- USHSpart %>% 
  mutate(height = if_else(is.na(k11a), k11b, k11a), # merge the two sex-specific height variables
         weight = if_else(is.na(k12a), k12b, k12a), # merge the two sex-specific weight variables
         gender = k2 %>% factor() %>% fct_recode("Male" = "1",
                                                 "Female" = "2",
                                                 "Not known" = "3"), # label gender
         age = k1 %>% 
           factor() %>% 
           fct_recode("19-21" = "1", "22-24" = "2", "25-27" = "3",
                      "28-30" = "4", "31-33" = "5", "33+" = "6")) # label age groups
# Generate BMI (body mass index) and BMI.factor (4 categories)
USHSv2 <- USHSv2 %>% 
  mutate(BMI = weight/(height/100)^2) # the BMI formula: kg / m^2
USHSv2 <- USHSv2 %>% 
  mutate(BMI.factor = BMI %>% # cut BMI into labelled ranges (underweight: <18.5;
           # normal weight: 18.5-25; overweight: 25-30; obese: >30)
           cut(breaks = c(1, 18.5, 25, 30, 100), include.lowest = T) %>% 
           fct_recode("Underweight" = "[1,18.5]",
                      "Normal weight" = "(18.5,25]",
                      "Overweight" = "(25,30]",
                      "Obese" = "(30,100]") %>% 
           ff_label("BMI ranges"))
USHSv2[, c(42:45)] <- list(NULL) # drop k11a~k12b; height, weight and gender now cover them
USHSv2 %>% 
  select(height, weight, age, gender, BMI, BMI.factor) %>% 
  head() %>% 
  knitr::kable() # examine the new variables
```
**2.3 Data cleansing<br />**
  In items k21_1 to k21_9, the choice "difficult to say" was coded as 3, which is misleading: these are ordinal variables, so a 3 would otherwise indicate a specific response strength rather than a non-answer. In this section, this was corrected by converting 3 to NA.
```{r}
# inspect the unique values in k21_1 through k21_9
USHSv2 %>% 
  select(starts_with("k21")) %>% 
  apply(2, function(x) unique(x)) %>% 
  knitr::kable()
```
  Finding: the value 3 (which is to be converted to NA) is still present.
```{r}
# convert the value 3 in k21_1 to k21_9 to NA
USHSv2 <- USHSv2 %>% 
  mutate(across(starts_with("k21"), ~replace(., . == 3, NA)))
# inspect the unique values in k21_1 through k21_9 in the updated data set,
# i.e. check that all 3s were converted to NA
USHSv2 %>% 
  select(starts_with("k21")) %>% 
  apply(2, function(x) unique(x)) %>% 
  knitr::kable()
```
  Finding: 3s were successfully replaced.
**2.4 Data inspection<br />**
  In this section, the missing values of all variables were inspected to find out whether there was a pattern.
```{r}
#inspect the NAs in each column
#To get a clear view, separate into 4 pictures
ncol(USHSv2)#get the number of columns
USHSv2 %>% select(1:25) %>% vis_miss()
USHSv2 %>% select(26:50) %>% vis_miss()
USHSv2 %>% select(51:75) %>% vis_miss()
USHSv2 %>% select(76:104) %>% vis_miss()
```
  Findings: variables k1, k7, and k9_1~k9_30 suffered from quite a number of NAs, warranting a quantitative inspection. In addition, several horizontal black lines (around 9) appeared in the graph, suggesting that those respondents tended to give longer consecutive runs of NAs than others.
```{r}
# a closer look at the identified high-NA variables
USHSv2 %>% select(k1, k7, k9_1:k9_30) %>% vis_miss()
# calculate the percentage of NAs for these variables
USHSv2 %>% 
  select(k1, k7, k9_1:k9_30) %>% 
  apply(2, function(x) sum(is.na(x))/nrow(USHSv2)) %>% 
  tidy()
```
  Findings: k1 (6.7% NAs), k7 (4.1%), k9_1 (3.1%), k9_3 (3.0%), k9_5 (4.3%), k9_8 (6.5%), k9_10 (5.0%), k9_13 (5.1%), k9_24 (6.3%), k9_25 (6.5%) and k9_26 (5.9%) were acceptable in terms of NA proportion. The other k9 variables had NA proportions above 7%, some reaching 12%, suggesting possible bias such as a floor effect. Analyses involving k9 should therefore be conducted with extra care.
**2.5 Variable selection <br />**
  In this section, the general health variables k23~k34 were selected into a new data set, USHSghq, for factor analysis #1 (general health). Variables k95_1~k95_18 were selected into a new data set, USHSstu, for factor analysis #2 (study burnout).
```{r}
# k23~k34 were selected into a new data set USHSghq for factor analysis #1
USHSghq <- USHSv2 %>% select(k23:k34)
# inspect the unique values of USHSghq to tell if data wrangling is needed
USHSghq %>% apply(2, function(x) unique(x))
# k95_1~k95_18 were selected into a new data set USHSstu for factor analysis #2
USHSstu <- USHSv2 %>% select(k95_1:k95_18) # USHSstu = USHS study burnout
# inspect the unique values of USHSstu to tell if data wrangling is needed
USHSstu %>% apply(2, function(x) unique(x))
```
## Analysis 1: Factor analysis of general health questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSghq.
```{r}
#inspect the unique values in k23~k34
USHSghq %>%
apply(., 2, function(x)unique(x))%>%
knitr::kable()
```
  Findings: all variables contain the values 1, 2, 3, 4 and NA.
```{r}
# visualise the number of responses per value for each item
# There are two response-choice patterns in k23:k34. Select the
# pattern-one items and convert them to long format
longv1 <- USHSv2 %>% 
  select(k24, k27, k28, k31, k32, k33) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score")
# plot bar charts of the counts of each pattern-one choice by item
p1 <- longv1 %>% 
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=not at all; 2=no more than usual;
       3=rather more than usual; 4=much more than usual") + # pattern-one choices
  theme(legend.position = "none")
# keep visualising the other items: select the pattern-two items
# and convert them to long format
longv2 <- USHSv2 %>% 
  select(k23, k25, k26, k29, k30, k34) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score")
# plot bar charts of the counts of each pattern-two choice by item
p2 <- longv2 %>% 
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=more so than usual; 2=same as usual;
       3=less so than usual; 4=much less than usual") + # pattern-two choices
  theme(legend.position = "none")
# display the charts in one graph
p1/p2
```
  Findings: "Not at all" and "No more than usual" were the most frequent choices for items k24, k27, k28, k31, k32 and k33, indicating that most respondents were in a stable and healthy state. Finland is a developed country, so this is no surprise.
  "Same as usual" was the most frequent choice for items k23, k25, k26, k29, k30 and k34, again indicating that most respondents were in a stable and healthy state, consistent with the items above.
```{r}
# examine whether missing values depend on any demographic variables
longv3 <- USHSv2 %>% # convert health-related variables to long format
  select(fsd_id, gender, BMI.factor, k23:k34) %>% 
  pivot_longer(k23:k34, names_to = "item", values_to = "score") %>% 
  filter(is.na(score))
# stacked bar chart of the share of NAs by gender for each item
p1 <- longv3 %>% 
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_fill(reverse = T)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        plot.title = element_text(vjust = 1),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Percentage of NAs by gender",
       y = "Percentage of NAs", x = "", subtitle = "uncorrected")
# stacked bar chart of the number of NAs by gender for each item
p2 <- longv3 %>% 
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_stack(reverse = T)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by gender",
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
# create a contingency table of item by gender
Igender <- longv3 %>% count(item, gender)
# correct for the effect of sample size by dividing the male NA counts
# by the male sample size (n = 1068) and the female NA counts by the
# female sample size (n = 2022)
Igender_cor <- Igender %>% 
  mutate(score = case_when(gender == "Male" ~ n/1068,
                           gender == "Female" ~ n/2022,
                           is.na(gender) ~ n/20)) %>% 
  filter(!is.na(gender))
# plot the corrected data
p3 <- Igender_cor %>% 
  ggplot(aes(x = item, y = score, fill = gender)) +
  geom_col(position = position_fill()) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by gender",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
# display the uncorrected (left) and corrected (right) charts
p1/p2|p3
```
  Findings: based on the raw data, far more NA answers came from females than from males (see the pictures on the left). However, we should take into account that there were roughly twice as many female respondents as male. Hence, the counts were corrected by dividing the number of NAs in each subgroup by that subgroup's sample size (male and female), producing the sample-size-corrected graph on the right. It shows that although the proportion of females giving NA answers is still larger than that of males, the difference becomes much less pronounced. Still, attention should be paid to items k23 ("Have you recently been able to concentrate on your tasks?") and k25 ("Have you recently felt that you are playing a useful part in things?"), for which around 80% of the NAs came from females, as well as item k27 ("Have you recently felt constantly under strain?"), for which around 70% of the NAs came from males, indicating that bias such as a floor effect might exist in these items.
```{r}
# stacked bar chart of the share of NAs by BMI range for each item
p4 <- longv3 %>% 
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_fill(reverse = T)) +
  ylab("Percent of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(vjust = 1)) +
  labs(title = "Percent of NAs by BMI", x = "", subtitle = "uncorrected")
# stacked bar chart of the number of NAs by BMI range for each item
p5 <- longv3 %>% 
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_stack(reverse = T)) +
  ylab("Number of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
# corrected bar chart of the share of NAs by BMI range for each item:
# create a contingency table of item by BMI range
longv3 %>% count(BMI.factor)
IBMI <- longv3 %>% count(item, BMI.factor)
IBMI <- IBMI %>% rename(score = n) # rename the count variable
# correct for the effect of sample size by dividing the NA count of the
# normal-weight respondents by that group's size (n = 118), and likewise
# for the other levels
IBMI_cor <- IBMI %>% 
  mutate(score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
                                     BMI.factor == "Overweight" ~ score/39,
                                     BMI.factor == "Obese" ~ score/25,
                                     BMI.factor == "Underweight" ~ score/1,
                                     is.na(BMI.factor) ~ score/31))
# plot the corrected bar chart
p6 <- IBMI_cor %>% 
  ggplot(aes(x = item, y = score_corrected, fill = BMI.factor)) +
  geom_col(position = position_fill(reverse = T)) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by BMI",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "BMI"))
# display the charts
p4/p5|p6
```
  Respondents with normal weight gave most NA answers (see pictures on the left). After sample size correction (see picture on right), the NA answers from different BMI groups became more equally distributed across each subcategory. However, the observation was inconclusive due to a large proportion lacking BMI information (NAs).
```{r}
#descriptive statistics for general health questions.
describe(USHSghq, IQR = T)
```
  Findings: the mean and median of each item were very close to each other. Standard deviations varied more, indicating that some items have different response patterns from others.
```{r}
# convert health-related variables to long format
longv4<- USHSghq %>%
pivot_longer(everything(), names_to = "Item", values_to = "Score")
# draw histogram of the scores for each item
longv4 %>% ggplot(aes(x = Score, fill = Score)) +
geom_histogram(binwidth=0.9, fill = "cornsilk3") +
facet_wrap(~Item)+
labs(caption="Score is each respondent's choice for each item")+
theme_bw()
```
  Findings: the assumption of multivariate normality was severely violated (at least half of the variables were non-normal); see the histograms above. This indicated that maximum likelihood might have to be excluded as the factoring method, given its poor validity for non-normal data.
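Eyeballing histograms is informal; a formal multivariate check such as Mardia's skewness/kurtosis test from the `psych` package could complement it. A minimal sketch on simulated Likert-style data (the simulated `fake` data frame stands in for the USHS items, which would be passed as `mardia(USHSghq)`):

```{r}
# Hedged sketch: formal multivariate-normality check with psych::mardia().
# Simulated 1-4 Likert-style responses stand in for the USHS items here.
library(psych)
set.seed(123)
fake <- as.data.frame(matrix(sample(1:4, 200 * 12, replace = TRUE,
                                    prob = c(.5, .3, .15, .05)),
                             ncol = 12))
mardia(fake, plot = FALSE) # reports Mardia's skew/kurtosis with p-values
```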
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined using several well-recognised criteria for the factorability of a correlation matrix.
```{r}
# correlation matrix to examine factorability
c_matrix <- cor.plot(USHSghq, method = "spearman")
```
  It was observed that all of the 12 items correlated at least 0.4 with at least one other item, suggesting good factorability.
```{r}
#perform KMO test for factorability
KMO(c_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .93, indicating marvelous adequacy according to Kaiser.
```{r}
#perform bartlett test for factorability
cortest.bartlett(c_matrix, nrow(USHSghq))
```
  Bartlett’s test of sphericity was significant (χ² (66) = 15570.3, p < .05), so the null hypothesis that our matrix equals an identity matrix was rejected, again indicating good factorability.
**1.3 Estimating the factor structure**
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# parallel analysis and scree plot
fa.parallel(USHSghq, fa = "fa", fm = "ml", plot = T)
# very simple structure (VSS), Velicer MAP and BIC
nfactors(c_matrix, n = 3, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSghq))
```
  The parallel analysis and scree plot based on the correlation matrix (four factors with eigenvalues above those of the simulated data) suggested a four-factor structure, while the VSS favoured a unidimensional interpretation: the VSS analysis pointed to an optimal number of one factor, which was further consolidated by the Velicer MAP and SABIC, both also suggesting a single dimension.
**1.4 Presetting parameters**
  I loaded the variables on four factors based on the result of the parallel analysis. A three-factor solution was also tested for exercise purposes, albeit with no statistical support for doing so. Oblimin rotation was adopted because dimensions of health are usually highly correlated, and the Hands-On Exercise, which examined a two-factor solution, found a correlation of 0.75 between the factors. I assumed that some latent construct, "general health", influences how people answer these questions, and hence principal axis factoring (PAF) was adopted instead of principal component analysis (PCA; besides, PCA is orthogonal by nature). Since the assumption of multivariate normality was severely violated, maximum likelihood factoring was not adopted either.
**1.5 Factor analysis**
```{r}
# examine the 4-factor solution
fa4 <- fa(USHSghq, nfactors = 4, rotate = "oblimin", fm = "pa", scores = "regression")
fa4
# draw a diagram of the factor model (measurement model)
fa.diagram(fa4, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the loadings
print(fa4$loadings, cutoff = 0.3, sort = TRUE)
# find out how much variance was explained
prop.table(fa4$values)
```
```{r}
# examine the 3-factor solution
fa3 <- fa(USHSghq, nfactors = 3, rotate = "oblimin", fm = "pa", scores = "regression")
# draw a diagram of the factor model (measurement model)
fa.diagram(fa3, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the loadings
print(fa3$loadings, cutoff = 0.3, sort = TRUE)
prop.table(fa3$values)
```
  The three- and four-factor solutions were each examined using oblimin rotation of the factor loading matrix. Correlation coefficients among the factors in both solutions ranged from 0.48 to 0.68, consolidating the appropriateness of oblimin rotation. Although the numbers of primary loadings were sufficient and roughly equal in both solutions, the four-factor solution, which explained 81.40% of the variance, was preferred because of: (a) the Hands-On Exercise finding that a two-factor solution did not provide sufficient granularity for interpreting the structure; (b) the 'levelling off' of eigenvalues on the scree plot after four factors; and (c) the weaker primary loadings and the difficulty of interpreting the three-factor structure from a medical perspective.
**1.6 Conclusion of the factor analysis**
  In the four-factor structure I finally adopted, items k31, k32, k33 and k28 comprised the first factor, capturing "social well-being"; items k25, k26, k23 and k30 comprised the second factor, capturing "social engagement"; items k34 and k29 comprised the third factor, capturing "mental well-being"; and items k27 and k24 comprised the fourth factor, capturing "emotional well-being".
**1.7 Factor score analysis**
*1.7.1 Generating factor score variables*
```{r}
# generate factor scores ("healthScores" = scores from the factor analysis of general health)
healthScores <- as.data.frame(factor.scores(USHSghq, fa4)$scores) %>% 
  mutate(fsd_id = USHSv2$fsd_id) %>% 
  rename(Social_wellbeing = PA1,
         Social_engagement = PA2,
         Emotional_wellbeing = PA3,
         Mental_wellbeing = PA4)
# merge the healthScores data set with USHSv2
USHSv2 <- left_join(USHSv2, healthScores)
# find out the number of columns
ncol(USHSv2)
# inspect the factor score variables
USHSv2 %>% select(1, 105:108) %>% head()
```
*1.7.2 Inspection of factor scores*
```{r}
# inspect the normality of factor scores using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_wrap(~Factor) +
theme_bw()+
theme(legend.position = "none")
```
   At face value, the factor scores for emotional well-being and social engagement seemed normally distributed, while those for mental and social well-being did not. See the histograms above.
```{r}
# inspect the normality of factor scores by gender using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_grid(Factor~gender)+
theme_bw()+
theme(legend.position = "none")
```
   At face value, when grouped by gender, the factor score distributions of the dimensions stayed roughly the same as in the ungrouped data. See the histograms above.
```{r}
# inspect the normality of factor scores by BMI.factor using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_grid(BMI.factor~Factor)+
theme_bw()+
theme(legend.position = "none")
```
   At face value, when grouped by BMI range, the factor score distributions of the dimensions stayed roughly the same as in the ungrouped data. See the histograms above.
```{r}
# Q-Q plot to inspect the normality of the social well-being factor scores
hp1 <- USHSv2 %>% 
  ggplot(aes(sample = Social_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Social_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the social engagement factor scores
hp2 <- USHSv2 %>% 
  ggplot(aes(sample = Social_engagement)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Social_engagement") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the mental well-being factor scores
hp3 <- USHSv2 %>% 
  ggplot(aes(sample = Mental_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Mental_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the emotional well-being factor scores
hp4 <- USHSv2 %>% 
  ggplot(aes(sample = Emotional_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Emotional_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
hp1 + hp2 + hp3 + hp4
```
   The Q-Q plots showed that the emotional well-being scores were roughly normally distributed, but the other three dimensions were considerably skewed.
```{r}
#KS test for factor scores
USHSv2 %>%
select(Social_wellbeing:Emotional_wellbeing) %>%
apply(., 2, function(x)ks.test(x, "pnorm"))
```
   The KS test for normality was adopted considering the large sample size. The results indicated that none of the four dimensions was normally distributed.
   In summary, the histograms and Q-Q plots gave conclusions partially consistent with the KS normality test for the four dimensions; the disagreement concerned whether the emotional well-being score is normally distributed. To reduce type I error in the subsequent hypothesis testing, I made the conservative decision to treat emotional well-being as not normally distributed (so a non-parametric test was adopted).
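One caveat of `ks.test(x, "pnorm")` is that it compares against a standard normal with fixed mean 0 and sd 1. The Shapiro-Wilk test avoids specifying those parameters (it accepts 3 to 5000 observations). A minimal sketch, with simulated data standing in for the factor scores:

```{r}
# Hedged sketch: Shapiro-Wilk as an alternative normality test that does
# not require specifying the mean/sd. Simulated vectors stand in for the
# factor score variables here.
set.seed(42)
normal_x <- rnorm(1000) # genuinely normal draws
skewed_x <- rexp(1000)  # clearly skewed, non-normal draws
shapiro.test(normal_x)$p.value # typically well above .05
shapiro.test(skewed_x)$p.value # essentially zero
```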
```{r}
# Correlation among the factor scores of 4 dimensions.
hp5 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Social_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp6 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Social_engagement))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp7 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp8 <- USHSv2 %>%
ggplot(aes(x = Social_wellbeing, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp9 <- USHSv2 %>%
ggplot(aes(x = Social_wellbeing, y = Social_engagement))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp10 <- USHSv2 %>%
ggplot(aes(x = Social_engagement, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
#display the graphs
hp5/hp6|hp7/hp8|hp9/hp10
```
  The scatter plots above showed weak to moderate correlations between each pair of dimensions. This matches the real-world expectation that the different aspects of well-being influence one another. The strongest correlation was between social engagement and mental well-being, which is intuitive: the more one engages in social activity, the more mental well-being one is likely to enjoy. These two could plausibly have a causal relationship that the other pairs lack. This further supported the soundness of our factor structure.
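  To complement the scatter plots, the pairwise correlations can be computed numerically; a minimal sketch assuming the same `USHSv2` columns:
```{r eval=FALSE}
# Sketch: correlation matrix of the four factor scores
USHSv2 %>%
select(Social_wellbeing:Emotional_wellbeing) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)
```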
*1.6.3 Hypothesis testing*
  Previous studies have found that BMI is a good gauge of risk for diseases associated with excess body fat: the higher the BMI, the higher the risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it would be interesting to explore its relation to Finnish students' general health. Hence I tested the following hypotheses:
  a. Students in different BMI ranges have different levels of social well-being.
  b. Students in different BMI ranges have different levels of social engagement.
  c. Students in different BMI ranges have different levels of mental well-being.
  d. Students in different BMI ranges have different levels of emotional well-being.
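  For reference, `summary_factorlist()` with `cont_nonpara = 1` reports a non-parametric group comparison (a Kruskal-Wallis test in finalfit); the underlying test can also be run directly. A minimal sketch for hypothesis (a), assuming the same data objects:
```{r eval=FALSE}
# Sketch: Kruskal-Wallis rank-sum test of social well-being across BMI groups
kruskal.test(Social_wellbeing ~ BMI.factor, data = USHSv2)
```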
```{r}
# Test if student with different BMI ranges have different social well-being scores.
het0 <- USHSv2 %>%
summary_factorlist(dependent = "Social_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het0, align =c("l", "l", "r", "r", "r"))
```
  I found that students in different BMI ranges differed in their sense of social well-being. Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of social well-being. I found obese students differed from normal-weight (*p* = 0.00057) and overweight students (*p* = 0.02298) in this regard. However, after correcting the p values with the Bonferroni method, only the difference between normal-weight and obese students remained significant (*p* = 0.0034).
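  Since the factor scores were treated as non-normal, a non-parametric follow-up is arguably more consistent with the omnibus test than pairwise t-tests. A hedged sketch (not the analysis run above, so the p values would differ from those reported):
```{r eval=FALSE}
# Sketch: pairwise Wilcoxon rank-sum tests with Bonferroni correction
pairwise.wilcox.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```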
```{r}
# Test if students in different BMI ranges have different levels of social engagement.
het1 <- USHSv2 %>%
summary_factorlist(dependent = "Social_engagement",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het1, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of social engagement (*p* = 0.022). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of social engagement. Obese students were found to differ from normal-weight students (*p* = 0.0065) in this regard. After correcting the p values with the Bonferroni method, the difference remained significant (*p* = 0.039).
```{r}
# Test if students in different BMI ranges have different sense of mental well-being.
het2 <- USHSv2 %>%
summary_factorlist(dependent = "Mental_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het2, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of mental well-being (*p* = 0.017). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of mental well-being. Obese students were found to differ from normal-weight (*p* = 0.001) and overweight students (*p* = 0.028) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal-weight students remained significant (*p* = 0.0063).
```{r}
# Test if students in different BMI ranges have different sense of emotional well-being.
het3 <- USHSv2 %>%
summary_factorlist(dependent = "Emotional_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het3, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of emotional well-being (*p* = 0.011). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison
pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of emotional well-being. Normal-weight students were found to differ from obese (*p* = 0.0038) and overweight students (*p* = 0.0479) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal-weight students remained significant (*p* = 0.023).
In summary, obese and normal-weight students differed in their sense of emotional well-being, mental well-being, social engagement and social well-being. Note that this was a data-driven analysis and the *p* values were not corrected for the number of hypotheses tested.
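  If one did want to correct across hypotheses, `p.adjust()` does this in one call. A sketch using the omnibus p values quoted above (the social well-being p value, reported only in its table, would be added to the vector):
```{r eval=FALSE}
# Sketch: correcting the omnibus p values for the number of hypotheses
p_raw <- c(social_engagement = 0.022, mental_wellbeing = 0.017,
emotional_wellbeing = 0.011)
p.adjust(p_raw, method = "bonferroni")
```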
## Analysis 2: Factor analysis of study burnout questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSstu.
```{r}
#have a look at the unique values of items k95_1~k95_18
USHSstu %>%
apply(., 2, function(x)unique(x))%>%
knitr::kable()
```
```{r}
#adapting the variable names for better displaying effect
USHSstu <- USHSstu %>% rename(k95_01 = k95_1, k95_02 = k95_2, k95_03 = k95_3, k95_04 = k95_4, k95_05 = k95_5, k95_06 = k95_6, k95_07 = k95_7, k95_08 = k95_8, k95_09 = k95_9)
#generate two subsets of data containing negative-wording items (01~09) and postive-wording items (10~18), respectively
slongv1 <- USHSstu %>% select (k95_01:k95_09) %>%
pivot_longer(everything(), names_to = "item", values_to = "score")
slongv2 <- USHSstu %>% select (k95_10:k95_18) %>%
pivot_longer(everything(), names_to = "item", values_to = "score")
```
```{r}
#visualize the number of values for negative-wording items
slongv1 %>%
ggplot (aes(x= factor(score), fill = score))+
geom_bar() +
facet_wrap(~item)+
theme_get()+
xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+
ggtitle("Distribution of choices for negative-wording items under k95")+
theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold"))
```
  The findings: "totally disagree" was the most frequent choice, followed by "disagree", indicating most students were not experiencing study burnout in the dimensions these items cover. Moreover, the data were mostly skewed, which suggested care should be taken in adopting maximum likelihood factoring.
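  The visual impression of skew can be quantified per item; a minimal sketch using `psych::skew()` (the psych package is already in use elsewhere in this document):
```{r eval=FALSE}
# Sketch: item-level skewness; values far from 0 flag items that strain ML factoring
USHSstu %>%
summarise(across(everything(), ~ psych::skew(.x, na.rm = TRUE)))
```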
```{r}
#visualize the number of values for positive-wording items
slongv2 %>%
ggplot (aes(x= factor(score), fill = score))+
geom_bar() +
facet_wrap(~item)+
theme_get()+
xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+
ggtitle("Distribution of choices for postive-wording items under k95")+
theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold"))
```
  The findings: "partly agree" was the most frequent choice, followed by "agree" and "partly disagree", indicating most students were experiencing little to no study burnout in the dimensions these items cover. The possibility of a modesty bias towards weak agreement when faced with positive descriptions should also be considered (this could explain why such a pattern was not present in the negatively worded items 1~9). Moreover, the data were mostly normally distributed, suggesting positive wording might be a good strategy for eliciting a gradient of state changes.
```{r}
#Examining if missing values are dependent on any demographic variables
slongv3 <- USHSv2 %>% # convert health-related variables to long format
select (fsd_id, gender, BMI.factor, k95_1:k95_18) %>%
pivot_longer(k95_1:k95_18, names_to = "item", values_to = "score") %>%
filter(is.na(score))
# plot stacked bar chart of the percentage of NAs by gender for each item
sp1 <- slongv3 %>%
ggplot(aes(x = item, fill = gender))+
geom_bar(position = position_fill(reverse = T))+
theme(legend.position = "none", axis.text = element_text(size = 6),
plot.title = element_text(vjust = 1),
axis.title.y = element_text(size = 9), axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Percentage of NAs by gender",
y = "Percentage of NAs", x ="", subtitle = "uncorrected")
# plot stacked bar chart of the number of NAs by gender for each item
sp2 <- slongv3 %>%
ggplot(aes(x = item, fill = gender))+
geom_bar(position = position_stack(reverse = T))+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text(size = 9),
axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Number of NAs by gender",
y = "Number of NAs", x = "Item", subtitle = "uncorrected")
#create contingency table for item&gender
sIgender <- slongv3 %>% count(item, gender)
#correct for the effect of sample size by dividing the male NA count by the
#male sample size (n = 1068), and the female NA count by the female sample size (n = 2022)
sIgender_cor <- sIgender %>% #sIgender means study-burnout item/gender
#sIgender_cor means study-burnout item/gender corrected
mutate (score = case_when(gender == "Male" ~ n/1068,
gender == "Female" ~ n/2022,
is.na(gender) ~ n/20)) %>% filter (!is.na(gender))
#plotting the corrected data
sp3 <- sIgender_cor %>%
ggplot(aes(x = item, y = score, fill = gender))+
geom_col(position = position_fill())+
theme(axis.text = element_text(size = 5),
axis.text.x = element_text (angle = 45))+
labs(title = "Percent of NAs by gender",
subtitle ="corrected for sample size",
y = "Corrected number of NAs", x = "Item") +
guides(fill = guide_legend(title = "Gender"))
sp1/sp2|sp3
```
  The findings: far more NA answers came from females than from males (see the plots on the left). However, after correcting the count in each subgroup (male and female) by its sample size, the distribution of NAs became almost uniform across gender (see the plot on the right).
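  Hard-coding the subgroup sizes (n = 1068, 2022, 20) is fragile if the data change; a hedged alternative sketch derives them from the data instead:
```{r eval=FALSE}
# Sketch: derive subgroup sizes from USHSv2 instead of hard-coding them
gender_n <- USHSv2 %>% count(gender, name = "n_group")
sIgender_cor <- slongv3 %>%
count(item, gender) %>%
left_join(gender_n, by = "gender") %>% # attach each subgroup's sample size
mutate(score = n / n_group) %>%        # NA count corrected for sample size
filter(!is.na(gender))
```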
```{r}
#same idea with last chunk, with categorical variable switched to BMI ranges
sp4 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_fill(reverse = T))+
ylab("Percent of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
plot.title = element_text (vjust = 1),
axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Percent of NAs by BMI", x= "", subtitle = "uncorrected")
#same idea with last chunk, with categorical variable switched to BMI ranges
sp5 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_stack(reverse = T))+
ylab("Number of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
axis.text.x = element_text (angle = 45, vjust = 0.7))+
labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
#check the number of NA rows in each BMI group
slongv3 %>% count(BMI.factor)
sIBMI <- slongv3 %>% count(item, BMI.factor)#create contingency table for item & BMI
sIBMI <- sIBMI %>% rename(score = n)
sIBMI_cor <- sIBMI %>% #correct for the effect of sample size by dividing the NA
#count of normal-weight respondents by the sample size
#of that group (n = 118); the same idea applies to the other levels
mutate (score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
BMI.factor == "Overweight" ~ score/39,
BMI.factor == "Obese" ~ score/25,
BMI.factor == "Underweight" ~ score/20,
is.na(BMI.factor) ~ score/31))
#same idea with last chunk, with categorical variable switched to BMI ranges
sp6 <- sIBMI_cor %>%
ggplot(aes(x = item, y = score_corrected, fill = BMI.factor))+
geom_col(position = position_fill(reverse = T))+
theme(axis.text = element_text(size = 5),
axis.text.x = element_text(angle = 45, vjust= 0.5))+
labs(title = "Percent of NAs by BMI",
subtitle ="corrected for sample size",
y = "Corrected number of NAs", x = "Item") +
guides(fill = guide_legend(title = "BMI"))
sp4/sp5|sp6
```
  Respondents with normal weight gave the most NA answers (see the plots on the left). After the sample-size correction (see the plot on the right), the NA answers became more evenly distributed across the BMI subcategories. However, the observation was inconclusive because a large proportion of respondents lacked BMI information. Also note that in the bar graph on the right, underweight respondents seemed to comprise the majority of NAs for a couple of items, but I decided to ignore this since the number of underweight respondents was quite small (n = 20), leading to overestimation by the correction.
```{r}
#descriptive statistics for study burnout items.
describe(USHSstu, IQR = T)
```
  The findings: the median score of each item sat closer to those of its own wording group (positive vs negative), and IQRs were consistent within some subgroups of items but varied across subgroups, indicating different response patterns for some items. Means varied less across all items, but given the non-normality of half of the item distributions, the mean and standard deviation are less informative here than the median and IQR.
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined. Several well-recognized criteria for the factorability of a correlation were used.
```{r}
#correlation to examine the factorability
sc_matrix <- cor(USHSstu, use = "pairwise.complete.obs") # sc_matrix means study-burnout correlation matrix
corPlot(sc_matrix) # visualize the correlations (corPlot supersedes the deprecated cor.plot)
```
  It was observed that all of the items correlated at least 0.5 with at least one other item, suggesting good factorability.
```{r}
#KMO test for factorability
KMO(sc_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .94, indicating marvelous adequacy according to Kaiser (1975).
```{r}
#Bartlett test for factorability
cortest.bartlett(sc_matrix, nrow(USHSstu))
```
  Bartlett’s test of sphericity was significant (χ2 (66) = 36358.71, *p* < .05); the null hypothesis that our matrix equals an identity matrix was therefore rejected, again suggesting good factorability.
**1.3 Estimating factor structure <br />**
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# Parallel analysis and scree plot
fa.parallel(USHSstu, fa = "fa", fm = "ml", plot = T)
# very simple structure and BIC
nfactors(sc_matrix, n = 5, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSstu))
```
  Parallel analysis and the scree plot (showing 4 factors with eigenvalues above those of the simulated data), based on the correlation matrix, suggested a 4-factor structure, while the VSS criterion favoured a 2-factor interpretation, its complexity measure peaking at the second factor. This was echoed by the BIC, whose minimum was achieved with 2 factors, whereas another VSS criterion recommended a 3-factor solution, inconsistent with the others.
**1.4 Presetting parameters <br />**
  Based on the factor structure exploration, solutions for two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Both rotations were tried since I could not presume the relationship between the dimensions of study burnout. Histograms showed more than half of the items under k95 were normally distributed, so maximum likelihood factoring was considered first in view of its suitability for normal data, although other factoring methods might also be adopted in case ML did not give interpretable results.
**1.5 Factor analysis <br />**
*1.5.1 Examining 2-factor structure*
```{r}
# a) orthogonal rotation
sfa2_var <- fa(USHSstu, nfactors = 2, #sfa2_var means study factor analysis 2 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa2_var
fa.diagram(sfa2_var, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa2_obl <- fa(USHSstu, nfactors = 2, #sfa2_obl means study factor analysis 2 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa2_obl
fa.diagram(sfa2_obl, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Oblimin rotation")
```
  The correlation between the 2 factors was weak (r = -0.35), indicating oblimin rotation might theoretically be a proper choice. However, examining the items under each factor, I found the most plausible interpretation of this 2-factor solution is the different directions of wording: factor 1 contains items 1~9, which were negatively worded, while items 10~18 under factor 2 were all positively worded. These should of course be negatively correlated on face value; on the other hand, the factoring solution was not convincing because it did not capture meaningful dimensions of study burnout.
```{r}
#results by varimax rotation
print(sfa2_var$loadings,cutoff = 0.3, sort=TRUE)
prop.table(sfa2_var$values) #how much variance explained
#results by oblimin rotation
print(sfa2_obl$loadings,cutoff = 0.3, sort=TRUE)
prop.table(sfa2_obl$values) #how much variance explained
```
  Cross-loading was very pronounced under both rotation methods (especially varimax). This indicated we should either remove some items or seek other factor solutions, echoing the fact that this two-factor structure did not provide any meaningful interpretation of the dimensions of study burnout (it simply separated positively from negatively worded items).
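  One way to probe whether the two-factor split is purely a wording artifact is to reverse-score the positively worded items before refactoring. This is a sketch of that check, not an analysis run here (responses are on a 1~6 scale, so a reversed score is 7 minus the response):
```{r eval=FALSE}
# Sketch: reverse-score positively worded items so all items point the same way
USHSstu_rev <- USHSstu %>%
mutate(across(k95_10:k95_18, ~ 7 - .x))
# If a 2-factor structure persists after reversal, it is not a pure wording artifact
fa(USHSstu_rev, nfactors = 2, rotate = "oblimin", fm = "ml")
```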
*1.5.2 Examining 3-factor structure*
```{r}
# a) orthogonal rotation
sfa3_var <- fa(USHSstu, nfactors = 3, #sfa3_var means study factor analysis 3 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa3_var
fa.diagram(sfa3_var, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa3_obl <- fa(USHSstu, nfactors = 3, #sfa3_obl means study factor analysis 3 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa3_obl
fa.diagram(sfa3_obl, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Oblimin rotation")
```
  The correlations among the 3 factors were weak to moderate (0.18~0.55), which did not mathematically favour either rotation method. The two methods produced factoring results with little difference. The interpretation of this 3-factor structure was easier and more enlightening than for the 2-factor solution. By qualitative analysis, I assumed items 3, 5, 2, 6 and 8 captured the dimension of exhaustion (example item: I feel a lack of motivation in my studies and often think of giving up); items 7, 1, 9, 4 and 3 captured the sense of pressure (example: I often have feelings of inadequacy in my studies.); the other 9 items together captured the sense of accomplishment (example: I find my studies full of meaning and purpose).
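  The claim that the two rotations differ little can be checked numerically; a sketch using `psych::factor.congruence()` (coefficients near 1 indicate near-identical factors) together with the `Phi` matrix of the oblique solution:
```{r eval=FALSE}
# Sketch: congruence between the varimax and oblimin 3-factor loading patterns
factor.congruence(sfa3_var, sfa3_obl)
# Inter-factor correlations of the oblique solution
sfa3_obl$Phi
```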
```{r}
#results by varimax rotation
print(sfa3_var, digits = 2, sort = T)
print(sfa3_var$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_var)
prop.table(sfa3_var$values) #how much variance explained
#results by oblimin rotation
print(sfa3_obl, digits = 2, sort = T)
print(sfa3_obl$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_obl)
prop.table(sfa3_obl$values) #how much variance explained