---
title: "Quantitative Research Skills 2022, Assignment 2"
author: "Rong Guang"
date: "Sep 27 2022"
output:
  html_document:
    theme: flatly
    highlight: haddock
    toc: yes
    toc_depth: 2
    number_sections: no
  word_document:
    toc: yes
    toc_depth: '2'
subtitle: Factor Analysis
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Assignment 2: Factor Analysis
Begin working with this when you have completed the **Hands-On Exercise 2**.
*******************************************************************************
**Your task:**
Practice with 2-3 sets of variables (one at a time) in the USHS data, trying different numbers of factors (2-3), comparing orthogonal and oblique rotations, and hence going through ALL the phases required to do FA (as instructed and shown in the **Hands-On-Exercise 2**).
It should be possible to analyse these particular sets of variables in the USHS data with FA:
(Some of them might be easier to interpret than others).
k21_1:k21_9, # various things in life (obs: originally wrong scale! "3" -> NA!), 9 variables
k22_1:k22_10, # how often (felt various things), 10 variables
k23:k34, # General Health Questionnaire (1-4), 12 variables
k46_1:k46_11, # on how many days/week eat various food/(11)skip lunch (0,...,7=every day), 11 variables
k95_1:k95_18, # feelings of studies (1:disagree,6:agree), 18 variables
**OBS:** If you use the k21* variables, remember to wrangle them first ("3" -> NA!) See Hands-On-Exercise 1b!
**Note:** With FA, the variables can have different directions (like the items 2 & 3 in k22*), even different scales.
*PERHAPS you could also try to combine variables across different measures: some items of k21 together with some items from k95, for instance!* (I have not tried to analyse that kind of combinations, but it should be perfectly possible, at least if the items would be somehow substantially related to each other. TRY IT!!)
What is also required in this **Assignment 2** are your **interpretations** (you don't have to be a substance expert; I just encourage you to practice, as those phenomena should be easy enough to understand without special expertise). Interpreting your analyses is very important, and this course is a safe place to practice that part, too.
**So TRY to interpret the factors you find and give them describing NAMES.**
Finally, show that you can work with the new **factor score variables** in the original data: visualise those scores with histograms, box plots, scatter plots, using also some other variables of the USHS. You will find a lot of good R code for producing such visualisations in the earlier exercises.
*******************************************************************************
# Analysing the USHS data with Factor Analysis (step by step)
Use as many R code chunks and text areas as you need - it's now all up to you!
*Please choose only your BEST FACTOR ANALYSIS and show it here step by step!*
Start by reading in the data and converting the factors (gender and age), just like in the exercise.
## Environment and data preparation
**1. Package loading**
```{r}
library(tidyverse)   # a collection of useful packages
library(dplyr)       # data wrangling
library(psych)       # factor analysis
library(patchwork)   # picture layout
library(finalfit)    # well-tabulated descriptive and inferential statistics
library(naniar)      # visualising missing data; install.packages("naniar")
library(broom)       # tidy tables
library(ggplot2)     # plotting graphs
library(tidyr)       # reshaping data to longer or wider format
library(GPArotation) # matrix rotation
```
**2. Data wrangling <br />**
**2.1 Load USHS data**
```{r}
# read raw data
USHS <- read.csv("daF3224e.csv", sep = ";")
# select a reduced set of variables into a new data set
USHSpart <- USHS %>% 
  select(
    fsd_id,
    k1,           # age
    k2,           # gender (1=male, 2=female, 3=not known/unspecified), NA (missing)
    bv1,          # university (from register), see label info! (40 different codes)
    bv3,          # field of study (from register), various codes
    bv6,          # higher educ institution (0=uni applied sciences, 1=university)
    k7,           # learning disability
    k8_1:k8_4,    # current well-being
    k9_1:k9_30,   # symptoms (past month), skip 31 ("other") due to missingness
    k11a, k11b,   # height (cm) a=male, b=female
    k12a, k12b,   # weight (kg) a=male, b=female
    k13,          # what do you think of your weight? (1=under, 5=over)
    k14:k20,      # various questions (mostly yes/no)
    k21_1:k21_9,  # various things in life (obs: originally wrong scale! "3" -> NA!)
    k22_1:k22_10, # how often (felt various things) (obs: 2 & 3 should be inversed! see quest.)
    k23:k34,      # General Health Questionnaire (1-4, directions already OK)
    k95_1:k95_18) # (end of select!)
```
**2.2 Variable recreation and generation<br />**
  In this section, I first generated height, weight, age and gender variables. Since height and weight are better interpreted with reference to each other, I then combined the pair into a Body Mass Index (BMI) variable using the well-accepted formula:
$$
BMI=\frac{Weight (kg)}{Height (m)^{2}}
$$
Next, I generated a new categorical variable by cutting BMI into underweight (<18.5), normal weight (18.5-25), overweight (25-30) and obese (>30). Previous studies have found that BMI is a good gauge of the risk of diseases associated with excess body fat: the higher the BMI, the higher the risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it is interesting to explore its relation to Finnish students' general health, dietary status and feelings about their studies.
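As a quick arithmetic sanity check of the formula, with hypothetical values (not USHS data):

```{r}
# Sanity check of the BMI formula with hypothetical values (not USHS data):
70 / (1.75^2) # 70 kg, 175 cm -> about 22.86, i.e. "normal weight"
```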
```{r}
# Recreate height, weight, age and gender variables
USHSv2 <- USHSpart %>% 
  mutate(height = if_else(is.na(k11a), k11b, k11a), # merge the two sex-specific height variables
         weight = if_else(is.na(k12a), k12b, k12a), # merge the two sex-specific weight variables
         gender = k2 %>% factor() %>% fct_recode("Male" = "1",
                                                 "Female" = "2",
                                                 "Not known" = "3"), # label gender
         age = k1 %>% 
           factor() %>% 
           fct_recode("19-21" = "1", "22-24" = "2", "25-27" = "3",
                      "28-30" = "4", "31-33" = "5", "33+" = "6")) # label age groups
# Generate BMI (body mass index) and BMI.factor (4 categories)
USHSv2 <- USHSv2 %>% 
  mutate(BMI = weight/(height/100)^2) # the BMI formula: kg / m^2
USHSv2 <- USHSv2 %>% 
  mutate(BMI.factor = BMI %>% # cut BMI into labelled ranges (underweight: <18.5;
           # normal weight: 18.5-25; overweight: 25-30; obese: >30)
           cut(breaks = c(1, 18.5, 25, 30, 100), include.lowest = T) %>% 
           fct_recode("Underweight" = "[1,18.5]",
                      "Normal weight" = "(18.5,25]",
                      "Overweight" = "(25,30]",
                      "Obese" = "(30,100]") %>% 
           ff_label("BMI ranges"))
USHSv2[, c(42:45)] <- list(NULL) # drop k11a~k12b; height, weight and gender now cover them
USHSv2 %>% 
  select(height, weight, age, gender, BMI, BMI.factor) %>% 
  head() %>% 
  knitr::kable() # examine the new variables
```
**2.3 Data cleansing<br />**
  In items k21_1 to k21_9, the choice "difficult to say" was coded as 3, which is misleading: these are ordinal variables, so a 3 would otherwise indicate a specific response strength rather than a non-answer. In this section, this was corrected by converting 3 to NA.
```{r}
# inspect the unique values in k21_1 through k21_9
USHSv2 %>% 
  select(starts_with("k21")) %>% 
  apply(2, function(x) unique(x)) %>% 
  knitr::kable()
```
  Finding: the value 3 (which is to be converted to NA) is still present.
```{r}
# convert the value 3 in k21_1 to k21_9 to NA
USHSv2 <- USHSv2 %>% 
  mutate(across(starts_with("k21"), ~replace(., . == 3, NA)))
# inspect the unique values in k21_1 through k21_9 in the updated data set,
# i.e. check that all 3s were converted to NA
USHSv2 %>% 
  select(starts_with("k21")) %>% 
  apply(2, function(x) unique(x)) %>% 
  knitr::kable()
```
  Finding: 3s were successfully replaced.
**2.4 Data inspection<br />**
  In this section, the missing values of all variables were inspected to find out whether there was a pattern.
```{r}
#inspect the NAs in each column
#To get a clear view, separate into 4 pictures
ncol(USHSv2)#get the number of columns
USHSv2 %>% select(1:25) %>% vis_miss()
USHSv2 %>% select(26:50) %>% vis_miss()
USHSv2 %>% select(51:75) %>% vis_miss()
USHSv2 %>% select(76:104) %>% vis_miss()
```
  Findings: variables k1, k7, and k9_1~k9_30 suffered from quite a number of NAs, warranting a quantitative inspection. In addition, several horizontal black lines (around 9) appeared in the graph, suggesting that those respondents tended to give longer consecutive runs of NAs than others.
```{r}
# a closer look at the identified high-NA variables
USHSv2 %>% select(k1, k7, k9_1:k9_30) %>% vis_miss()
# calculate the percentage of NAs for these variables
USHSv2 %>% 
  select(k1, k7, k9_1:k9_30) %>% 
  apply(2, function(x) sum(is.na(x))/nrow(USHSv2)) %>% 
  tidy()
```
  Findings: k1 (6.7% NAs), k7 (4.1%), k9_1 (3.1%), k9_3 (3.0%), k9_5 (4.3%), k9_8 (6.5%), k9_10 (5.0%), k9_13 (5.1%), k9_24 (6.3%), k9_25 (6.5%) and k9_26 (5.9%) were acceptable in terms of NA proportion. The other k9 variables had NA proportions above 7%, some reaching 12%, suggesting possible bias such as a floor effect. Analyses involving k9 should therefore be conducted with extra care.
**2.5 Variable selection <br />**
  In this section, the general health variables k23~k34 were selected into a new data set, USHSghq, for factor analysis #1 (general health). Variables k95_1~k95_18 were selected into a new data set, USHSstu, for factor analysis #2 (study burnout).
```{r}
# k23~k34 were selected into a new data set USHSghq for factor analysis #1
USHSghq <- USHSv2 %>% select(k23:k34)
# inspect the unique values of USHSghq to tell if data wrangling is needed
USHSghq %>% apply(2, function(x) unique(x))
# k95_1~k95_18 were selected into a new data set USHSstu for factor analysis #2
USHSstu <- USHSv2 %>% select(k95_1:k95_18) # USHSstu = USHS study burnout
# inspect the unique values of USHSstu to tell if data wrangling is needed
USHSstu %>% apply(2, function(x) unique(x))
```
## Analysis 1: Factor analysis of general health questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSghq.
```{r}
#inspect the unique values in k23~k34
USHSghq %>%
apply(., 2, function(x)unique(x))%>%
knitr::kable()
```
  Findings: all variables contain the values 1, 2, 3, 4 and NA.
```{r}
# visualise the number of responses per value for each item
# There are two response-choice patterns in k23:k34. Select the
# pattern-one items and convert them to long format
longv1 <- USHSv2 %>% 
  select(k24, k27, k28, k31, k32, k33) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score")
# plot bar charts of the counts of each pattern-one choice by item
p1 <- longv1 %>% 
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=not at all; 2=no more than usual;
       3=rather more than usual; 4=much more than usual") + # pattern-one choices
  theme(legend.position = "none")
# keep visualising the other items: select the pattern-two items
# and convert them to long format
longv2 <- USHSv2 %>% 
  select(k23, k25, k26, k29, k30, k34) %>% 
  pivot_longer(everything(), names_to = "item", values_to = "score")
# plot bar charts of the counts of each pattern-two choice by item
p2 <- longv2 %>% 
  ggplot(aes(x = factor(score), fill = score)) +
  geom_bar() +
  facet_wrap(~item) +
  theme_minimal() +
  xlab("1=more so than usual; 2=same as usual;
       3=less so than usual; 4=much less than usual") + # pattern-two choices
  theme(legend.position = "none")
# display the charts in one graph
p1/p2
```
  Findings: "Not at all" and "No more than usual" were the most frequent choices for items k24, k27, k28, k31, k32 and k33, indicating that most respondents were in a stable and healthy state. Finland is a developed country, so this is no surprise.
  "Same as usual" was the most frequent choice for items k23, k25, k26, k29, k30 and k34, again indicating that most respondents were in a stable and healthy state, consistent with the items above.
```{r}
# examine whether missing values depend on any demographic variables
longv3 <- USHSv2 %>% # convert health-related variables to long format
  select(fsd_id, gender, BMI.factor, k23:k34) %>% 
  pivot_longer(k23:k34, names_to = "item", values_to = "score") %>% 
  filter(is.na(score))
# stacked bar chart of the share of NAs by gender for each item
p1 <- longv3 %>% 
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_fill(reverse = T)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        plot.title = element_text(vjust = 1),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Percentage of NAs by gender",
       y = "Percentage of NAs", x = "", subtitle = "uncorrected")
# stacked bar chart of the number of NAs by gender for each item
p2 <- longv3 %>% 
  ggplot(aes(x = item, fill = gender)) +
  geom_bar(position = position_stack(reverse = T)) +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by gender",
       y = "Number of NAs", x = "Item", subtitle = "uncorrected")
# create a contingency table of item by gender
Igender <- longv3 %>% count(item, gender)
# correct for the effect of sample size by dividing the male NA counts
# by the male sample size (n = 1068) and the female NA counts by the
# female sample size (n = 2022)
Igender_cor <- Igender %>% 
  mutate(score = case_when(gender == "Male" ~ n/1068,
                           gender == "Female" ~ n/2022,
                           is.na(gender) ~ n/20)) %>% 
  filter(!is.na(gender))
# plot the corrected data
p3 <- Igender_cor %>% 
  ggplot(aes(x = item, y = score, fill = gender)) +
  geom_col(position = position_fill()) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by gender",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "Gender"))
# display the uncorrected (left) and corrected (right) charts
p1/p2|p3
```
  Findings: based on the raw data, far more NA answers came from females than from males (see the pictures on the left). However, we should take into account that there were roughly twice as many female respondents as male. Hence, the counts were corrected by dividing the number of NAs in each subgroup by that subgroup's sample size (male and female), producing the sample-size-corrected graph on the right. It shows that although the proportion of females giving NA answers is still larger than that of males, the difference becomes much less pronounced. Still, attention should be paid to items k23 ("Have you recently been able to concentrate on your tasks?") and k25 ("Have you recently felt that you are playing a useful part in things?"), for which around 80% of the NAs came from females, as well as item k27 ("Have you recently felt constantly under strain?"), for which around 70% of the NAs came from males, indicating that bias such as a floor effect might exist in these items.
```{r}
# stacked bar chart of the share of NAs by BMI range for each item
p4 <- longv3 %>% 
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_fill(reverse = T)) +
  ylab("Percent of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9),
        plot.title = element_text(vjust = 1)) +
  labs(title = "Percent of NAs by BMI", x = "", subtitle = "uncorrected")
# stacked bar chart of the number of NAs by BMI range for each item
p5 <- longv3 %>% 
  ggplot(aes(x = item, fill = BMI.factor)) +
  geom_bar(position = position_stack(reverse = T)) +
  ylab("Number of NAs") +
  theme(legend.position = "none", axis.text = element_text(size = 6),
        axis.title.y = element_text(size = 9)) +
  labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
# corrected bar chart of the share of NAs by BMI range for each item:
# create a contingency table of item by BMI range
longv3 %>% count(BMI.factor)
IBMI <- longv3 %>% count(item, BMI.factor)
IBMI <- IBMI %>% rename(score = n) # rename the count variable
# correct for the effect of sample size by dividing the NA count of the
# normal-weight respondents by that group's size (n = 118), and likewise
# for the other levels
IBMI_cor <- IBMI %>% 
  mutate(score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
                                     BMI.factor == "Overweight" ~ score/39,
                                     BMI.factor == "Obese" ~ score/25,
                                     BMI.factor == "Underweight" ~ score/1,
                                     is.na(BMI.factor) ~ score/31))
# plot the corrected bar chart
p6 <- IBMI_cor %>% 
  ggplot(aes(x = item, y = score_corrected, fill = BMI.factor)) +
  geom_col(position = position_fill(reverse = T)) +
  theme(axis.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45)) +
  labs(title = "Percent of NAs by BMI",
       subtitle = "corrected for sample size",
       y = "Corrected number of NAs", x = "Item") +
  guides(fill = guide_legend(title = "BMI"))
# display the charts
p4/p5|p6
```
  Respondents with normal weight gave most NA answers (see pictures on the left). After sample size correction (see picture on right), the NA answers from different BMI groups became more equally distributed across each subcategory. However, the observation was inconclusive due to a large proportion lacking BMI information (NAs).
```{r}
#descriptive statistics for general health questions.
describe(USHSghq, IQR = T)
```
  Findings: the mean and median of each item were very close to each other. Standard deviations varied more, indicating that some items have different response patterns from others.
```{r}
# convert health-related variables to long format
longv4<- USHSghq %>%
pivot_longer(everything(), names_to = "Item", values_to = "Score")
# draw histogram of the scores for each item
longv4 %>% ggplot(aes(x = Score, fill = Score)) +
geom_histogram(binwidth=0.9, fill = "cornsilk3") +
facet_wrap(~Item)+
labs(caption="Score is each respondent's choice for each item")+
theme_bw()
```
  Findings: the assumption of multivariate normality was severely violated (at least half of the variables were non-normal); see the histograms above. This indicated that maximum likelihood might have to be excluded as the factoring method, given its poor validity for non-normal data.
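Eyeballing histograms is informal; a formal multivariate check such as Mardia's skewness/kurtosis test from the `psych` package could complement it. A minimal sketch on simulated Likert-style data (the simulated `fake` data frame stands in for the USHS items, which would be passed as `mardia(USHSghq)`):

```{r}
# Hedged sketch: formal multivariate-normality check with psych::mardia().
# Simulated 1-4 Likert-style responses stand in for the USHS items here.
library(psych)
set.seed(123)
fake <- as.data.frame(matrix(sample(1:4, 200 * 12, replace = TRUE,
                                    prob = c(.5, .3, .15, .05)),
                             ncol = 12))
mardia(fake, plot = FALSE) # reports Mardia's skew/kurtosis with p-values
```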
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined using several well-recognised criteria for the factorability of a correlation matrix.
```{r}
# correlation matrix to examine factorability
c_matrix <- cor.plot(USHSghq, method = "spearman")
```
  It was observed that all of the 12 items correlated at least 0.4 with at least one other item, suggesting good factorability.
```{r}
#perform KMO test for factorability
KMO(c_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .93, indicating marvelous adequacy according to Kaiser.
```{r}
#perform bartlett test for factorability
cortest.bartlett(c_matrix, nrow(USHSghq))
```
  Bartlett’s test of sphericity was significant (χ² (66) = 15570.3, p < .05), so the null hypothesis that our matrix equals an identity matrix was rejected, again indicating good factorability.
**1.3 Estimating the factor structure**
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# parallel analysis and scree plot
fa.parallel(USHSghq, fa = "fa", fm = "ml", plot = T)
# very simple structure (VSS), Velicer MAP and BIC
nfactors(c_matrix, n = 3, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSghq))
```
  The parallel analysis and scree plot based on the correlation matrix (four factors with eigenvalues above those of the simulated data) suggested a four-factor structure, while the VSS favoured a unidimensional interpretation: the VSS analysis pointed to an optimal number of one factor, which was further consolidated by the Velicer MAP and SABIC, both also suggesting a single dimension.
**1.4 Presetting parameters**
  I loaded the variables on four factors based on the result of the parallel analysis. A three-factor solution was also tested for exercise purposes, albeit with no statistical support for doing so. Oblimin rotation was adopted because dimensions of health are usually highly correlated, and the Hands-On Exercise, which examined a two-factor solution, found a correlation of 0.75 between the factors. I assumed that some latent construct, "general health", influences how people answer these questions, and hence principal axis factoring (PAF) was adopted instead of principal component analysis (PCA; besides, PCA is orthogonal by nature). Since the assumption of multivariate normality was severely violated, maximum likelihood factoring was not adopted either.
**1.5 Factor analysis**
```{r}
# examine the 4-factor solution
fa4 <- fa(USHSghq, nfactors = 4, rotate = "oblimin", fm = "pa", scores = "regression")
fa4
# draw a diagram of the factor model (measurement model)
fa.diagram(fa4, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the loadings
print(fa4$loadings, cutoff = 0.3, sort = TRUE)
# find out how much variance was explained
prop.table(fa4$values)
```
```{r}
# examine the 3-factor solution
fa3 <- fa(USHSghq, nfactors = 3, rotate = "oblimin", fm = "pa", scores = "regression")
# draw a diagram of the factor model (measurement model)
fa.diagram(fa3, cut = 0, digits = 2, main = "Factor Analysis, Oblimin rotation")
# print the loadings
print(fa3$loadings, cutoff = 0.3, sort = TRUE)
prop.table(fa3$values)
```
  The three- and four-factor solutions were each examined using oblimin rotation of the factor loading matrix. Correlation coefficients among the factors in both solutions ranged from 0.48 to 0.68, consolidating the appropriateness of oblimin rotation. Although the numbers of primary loadings were sufficient and roughly equal in both solutions, the four-factor solution, which explained 81.40% of the variance, was preferred because of: (a) the Hands-On Exercise finding that a two-factor solution did not provide sufficient granularity for interpreting the structure; (b) the 'levelling off' of eigenvalues on the scree plot after four factors; and (c) the weaker primary loadings and the difficulty of interpreting the three-factor structure from a medical perspective.
**1.6 Conclusion of the factor analysis**
  In the four-factor structure I finally adopted, items k31, k32, k33 and k28 comprised the first factor, capturing "social well-being"; items k25, k26, k23 and k30 comprised the second factor, capturing "social engagement"; items k34 and k29 comprised the third factor, capturing "mental well-being"; and items k27 and k24 comprised the fourth factor, capturing "emotional well-being".
**1.7 Factor score analysis**
*1.7.1 Generating factor score variables*
```{r}
# generate factor scores ("healthScores" = scores from the factor analysis of general health)
healthScores <- as.data.frame(factor.scores(USHSghq, fa4)$scores) %>% 
  mutate(fsd_id = USHSv2$fsd_id) %>% 
  rename(Social_wellbeing = PA1,
         Social_engagement = PA2,
         Emotional_wellbeing = PA3,
         Mental_wellbeing = PA4)
# merge the healthScores data set with USHSv2
USHSv2 <- left_join(USHSv2, healthScores)
# find out the number of columns
ncol(USHSv2)
# inspect the factor score variables
USHSv2 %>% select(1, 105:108) %>% head()
```
*1.7.2 Inspection of factor scores*
```{r}
# inspect the normality of factor scores using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_wrap(~Factor) +
theme_bw()+
theme(legend.position = "none")
```
   At face value, the factor scores for emotional well-being and social engagement seemed normally distributed, while those for mental and social well-being did not. See the histograms above.
```{r}
# inspect the normality of factor scores by gender using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_grid(Factor~gender)+
theme_bw()+
theme(legend.position = "none")
```
   At face value, when grouped by gender, the factor score distributions of the dimensions stayed roughly the same as in the ungrouped data. See the histograms above.
```{r}
# inspect the normality of factor scores by BMI.factor using histogram
USHSv2 %>% select (fsd_id, 100:108) %>%
pivot_longer(c(7:10), names_to = "Factor", values_to = "FactorScore") %>%
ggplot(aes(x = FactorScore, fill = Factor)) +
geom_histogram() +
facet_grid(BMI.factor~Factor)+
theme_bw()+
theme(legend.position = "none")
```
   At face value, when grouped by BMI range, the factor score distributions of the dimensions stayed roughly the same as in the ungrouped data. See the histograms above.
```{r}
# Q-Q plot to inspect the normality of the social well-being factor scores
hp1 <- USHSv2 %>% 
  ggplot(aes(sample = Social_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Social_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the social engagement factor scores
hp2 <- USHSv2 %>% 
  ggplot(aes(sample = Social_engagement)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Social_engagement") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the mental well-being factor scores
hp3 <- USHSv2 %>% 
  ggplot(aes(sample = Mental_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Mental_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
# Q-Q plot for the emotional well-being factor scores
hp4 <- USHSv2 %>% 
  ggplot(aes(sample = Emotional_wellbeing)) +
  geom_qq(colour = "red", shape = 1, size = 2) +
  geom_qq_line(colour = "blue") +
  labs(title = "Emotional_wellbeing") +
  theme(plot.title = element_text(hjust = 0.5))
hp1 + hp2 + hp3 + hp4
```
   The Q-Q plots showed that the emotional well-being scores were roughly normally distributed, but the other three dimensions were considerably skewed.
```{r}
#KS test for factor scores
USHSv2 %>%
select(Social_wellbeing:Emotional_wellbeing) %>%
apply(., 2, function(x)ks.test(x, "pnorm"))
```
   The KS test for normality was adopted considering the large sample size. The results indicated that none of the four dimensions was normally distributed.
   In summary, the histograms and Q-Q plots gave conclusions partially consistent with the KS normality test for the four dimensions; the disagreement concerned whether the emotional well-being score is normally distributed. To reduce type I error in the subsequent hypothesis testing, I made the conservative decision to treat emotional well-being as not normally distributed (so a non-parametric test was adopted).
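One caveat of `ks.test(x, "pnorm")` is that it compares against a standard normal with fixed mean 0 and sd 1. The Shapiro-Wilk test avoids specifying those parameters (it accepts 3 to 5000 observations). A minimal sketch, with simulated data standing in for the factor scores:

```{r}
# Hedged sketch: Shapiro-Wilk as an alternative normality test that does
# not require specifying the mean/sd. Simulated vectors stand in for the
# factor score variables here.
set.seed(42)
normal_x <- rnorm(1000) # genuinely normal draws
skewed_x <- rexp(1000)  # clearly skewed, non-normal draws
shapiro.test(normal_x)$p.value # typically well above .05
shapiro.test(skewed_x)$p.value # essentially zero
```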
```{r}
# Correlation among the factor scores of 4 dimensions.
hp5 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Social_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp6 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Social_engagement))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp7 <- USHSv2 %>%
ggplot(aes(x = Emotional_wellbeing, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp8 <- USHSv2 %>%
ggplot(aes(x = Social_wellbeing, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp9 <- USHSv2 %>%
ggplot(aes(x = Social_wellbeing, y = Social_engagement))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
# continue
hp10 <- USHSv2 %>%
ggplot(aes(x = Social_engagement, y = Mental_wellbeing))+
geom_point(shape = 1, colour = "blue", size =0.5)+
geom_smooth(method = "lm", colour = "red")
#display the graphs
hp5/hp6|hp7/hp8|hp9/hp10
```
  The scatter plots above showed weak to moderate correlations between each pair of dimensions. This matches the real-world expectation that the different aspects of well-being influence one another. The strongest correlation was between social engagement and mental well-being, which is intuitive: the more one engages in social activity, the more mental well-being one is likely to enjoy. These two could plausibly have a causal relationship that the other pairs lack. This further supported the soundness of our factor structure.
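  To complement the scatter plots, the pairwise correlations can be computed numerically; a minimal sketch assuming the same `USHSv2` columns:
```{r eval=FALSE}
# Sketch: correlation matrix of the four factor scores
USHSv2 %>%
select(Social_wellbeing:Emotional_wellbeing) %>%
cor(use = "pairwise.complete.obs") %>%
round(2)
```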
*1.6.3 Hypothesis testing*
  Previous studies have found that BMI is a good gauge of risk for diseases associated with excess body fat: the higher the BMI, the higher the risk of conditions such as heart disease, high blood pressure, type 2 diabetes, gallstones, breathing problems, and certain cancers. As such, it would be interesting to explore its relation to Finnish students' general health. Hence I tested the following hypotheses:
  a. Students in different BMI ranges have different levels of social well-being.
  b. Students in different BMI ranges have different levels of social engagement.
  c. Students in different BMI ranges have different levels of mental well-being.
  d. Students in different BMI ranges have different levels of emotional well-being.
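  For reference, `summary_factorlist()` with `cont_nonpara = 1` reports a non-parametric group comparison (a Kruskal-Wallis test in finalfit); the underlying test can also be run directly. A minimal sketch for hypothesis (a), assuming the same data objects:
```{r eval=FALSE}
# Sketch: Kruskal-Wallis rank-sum test of social well-being across BMI groups
kruskal.test(Social_wellbeing ~ BMI.factor, data = USHSv2)
```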
```{r}
# Test if student with different BMI ranges have different social well-being scores.
het0 <- USHSv2 %>%
summary_factorlist(dependent = "Social_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het0, align =c("l", "l", "r", "r", "r"))
```
  I found that students in different BMI ranges differed in their sense of social well-being. Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of social well-being. I found obese students differed from normal-weight (*p* = 0.00057) and overweight students (*p* = 0.02298) in this regard. However, after correcting the p values with the Bonferroni method, only the difference between normal-weight and obese students remained significant (*p* = 0.0034).
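  Since the factor scores were treated as non-normal, a non-parametric follow-up is arguably more consistent with the omnibus test than pairwise t-tests. A hedged sketch (not the analysis run above, so the p values would differ from those reported):
```{r eval=FALSE}
# Sketch: pairwise Wilcoxon rank-sum tests with Bonferroni correction
pairwise.wilcox.test(USHSv2$Social_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```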
```{r}
# Test if students in different BMI ranges have different levels of social engagement.
het1 <- USHSv2 %>%
summary_factorlist(dependent = "Social_engagement",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het1, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of social engagement (*p* = 0.022). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Social_engagement, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of social engagement. Obese students were found to differ from normal-weight students (*p* = 0.0065) in this regard. After correcting the p values with the Bonferroni method, the difference remained significant (*p* = 0.039).
```{r}
# Test if students in different BMI ranges have different sense of mental well-being.
het2 <- USHSv2 %>%
summary_factorlist(dependent = "Mental_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het2, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of mental well-being (*p* = 0.017). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison without correction
pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Mental_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of mental well-being. Obese students were found to differ from normal-weight (*p* = 0.001) and overweight students (*p* = 0.028) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal-weight students remained significant (*p* = 0.0063).
```{r}
# Test if students in different BMI ranges have different sense of emotional well-being.
het3 <- USHSv2 %>%
summary_factorlist(dependent = "Emotional_wellbeing",
explanatory = "BMI.factor",
cont_nonpara = 1,
p = T)
knitr::kable(het3, align =c("l", "l", "r", "r", "r"))
```
  It was found that students in different BMI ranges differed in their sense of emotional well-being (*p* = 0.011). Pairwise comparisons were performed to locate the source of the difference.
```{r}
#multiple comparison
pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "none")
# multiple comparison with corrected p value from bonferroni method
pairwise.t.test(USHSv2$Emotional_wellbeing, USHSv2$BMI.factor,
p.adjust.method = "bonferroni")
```
  Multiple comparisons were done to test which BMI range(s) accounted for the difference in sense of emotional well-being. Normal-weight students were found to differ from obese (*p* = 0.0038) and overweight students (*p* = 0.0479) in this regard. After correcting the p values with the Bonferroni method, only the difference between obese and normal-weight students remained significant (*p* = 0.023).
In summary, obese and normal-weight students differed in their sense of emotional well-being, mental well-being, social engagement and social well-being. Note that this was a data-driven analysis and the *p* values were not corrected for the number of hypotheses tested.
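  If one did want to correct across hypotheses, `p.adjust()` does this in one call. A sketch using the omnibus p values quoted above (the social well-being p value, reported only in its table, would be added to the vector):
```{r eval=FALSE}
# Sketch: correcting the omnibus p values for the number of hypotheses
p_raw <- c(social_engagement = 0.022, mental_wellbeing = 0.017,
emotional_wellbeing = 0.011)
p.adjust(p_raw, method = "bonferroni")
```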
## Analysis 2: Factor analysis of study burnout questionnaire
**1. Analysis <br />**
**1.1 Descriptive statistics for variables <br />**
  In this section, descriptive statistics were computed for each variable of USHSstu.
```{r}
#have a look at the unique values of items k95_1~k95_18
USHSstu %>%
apply(., 2, function(x)unique(x))%>%
knitr::kable()
```
```{r}
#adapting the variable names for better displaying effect
USHSstu <- USHSstu %>% rename(k95_01 = k95_1, k95_02 = k95_2, k95_03 = k95_3, k95_04 = k95_4, k95_05 = k95_5, k95_06 = k95_6, k95_07 = k95_7, k95_08 = k95_8, k95_09 = k95_9)
#generate two subsets of data containing negative-wording items (01~09) and postive-wording items (10~18), respectively
slongv1 <- USHSstu %>% select (k95_01:k95_09) %>%
pivot_longer(everything(), names_to = "item", values_to = "score")
slongv2 <- USHSstu %>% select (k95_10:k95_18) %>%
pivot_longer(everything(), names_to = "item", values_to = "score")
```
```{r}
#visualize the number of values for negative-wording items
slongv1 %>%
ggplot (aes(x= factor(score), fill = score))+
geom_bar() +
facet_wrap(~item)+
theme_get()+
xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+
ggtitle("Distribution of choices for negative-wording items under k95")+
theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold"))
```
  The findings: "totally disagree" was the most frequent choice, followed by "disagree", indicating most students were not experiencing study burnout in the dimensions these items cover. Moreover, the data were mostly skewed, which suggested care should be taken in adopting maximum likelihood factoring.
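  The visual impression of skew can be quantified per item; a minimal sketch using `psych::skew()` (the psych package is already in use elsewhere in this document):
```{r eval=FALSE}
# Sketch: item-level skewness; values far from 0 flag items that strain ML factoring
USHSstu %>%
summarise(across(everything(), ~ psych::skew(.x, na.rm = TRUE)))
```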
```{r}
#visualize the number of values for positive-wording items
slongv2 %>%
ggplot (aes(x= factor(score), fill = score))+
geom_bar() +
facet_wrap(~item)+
theme_get()+
xlab("1=Totally disagree; 2=Disagree;3=Partly disagree; 4=Partly agree; 5=Agree; 6=Totally agree")+
ggtitle("Distribution of choices for postive-wording items under k95")+
theme(legend.position ="none", axis.title.x = element_text (size = 9, face = "bold"), axis.text.x = element_text (size = 6, vjust = 2), strip.text = element_text(size = 8, face = "bold", vjust = 0.5), axis.ticks = element_blank(), plot.title = element_text (face = "bold"))
```
  The findings: "partly agree" was the most frequent choice, followed by "agree" and "partly disagree", indicating most students were experiencing little to no study burnout in the dimensions these items cover. The possibility of a modesty bias towards weak agreement when faced with positive descriptions should also be considered (this could explain why such a pattern was not present in the negatively worded items 1~9). Moreover, the data were mostly normally distributed, suggesting positive wording might be a good strategy for eliciting a gradient of state changes.
```{r}
#Examining if missing values are dependent on any demographic variables
slongv3 <- USHSv2 %>% # convert health-related variables to long format
select (fsd_id, gender, BMI.factor, k95_1:k95_18) %>%
pivot_longer(k95_1:k95_18, names_to = "item", values_to = "score") %>%
filter(is.na(score))
# plot stacked bar chart of the percentage of NAs by gender for each item
sp1 <- slongv3 %>%
ggplot(aes(x = item, fill = gender))+
geom_bar(position = position_fill(reverse = T))+
theme(legend.position = "none", axis.text = element_text(size = 6),
plot.title = element_text(vjust = 1),
axis.title.y = element_text(size = 9), axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Percentage of NAs by gender",
y = "Percentage of NAs", x ="", subtitle = "uncorrected")
# plot stacked bar chart of the number of NAs by gender for each item
sp2 <- slongv3 %>%
ggplot(aes(x = item, fill = gender))+
geom_bar(position = position_stack(reverse = T))+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text(size = 9),
axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Number of NAs by gender",
y = "Number of NAs", x = "Item", subtitle = "uncorrected")
#create contingency table for item&gender
sIgender <- slongv3 %>% count(item, gender)
#correct for the effect of sample size by dividing the male NA count by the
#male sample size (n = 1068), and the female NA count by the female sample size (n = 2022)
sIgender_cor <- sIgender %>% #sIgender means study-burnout item/gender
#sIgender_cor means study-burnout item/gender corrected
mutate (score = case_when(gender == "Male" ~ n/1068,
gender == "Female" ~ n/2022,
is.na(gender) ~ n/20)) %>% filter (!is.na(gender))
#plotting the corrected data
sp3 <- sIgender_cor %>%
ggplot(aes(x = item, y = score, fill = gender))+
geom_col(position = position_fill())+
theme(axis.text = element_text(size = 5),
axis.text.x = element_text (angle = 45))+
labs(title = "Percent of NAs by gender",
subtitle ="corrected for sample size",
y = "Corrected number of NAs", x = "Item") +
guides(fill = guide_legend(title = "Gender"))
sp1/sp2|sp3
```
  The findings: far more NA answers came from females than from males (see the plots on the left). However, after correcting the count in each subgroup (male and female) by its sample size, the distribution of NAs became almost uniform across gender (see the plot on the right).
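  Hard-coding the subgroup sizes (n = 1068, 2022, 20) is fragile if the data change; a hedged alternative sketch derives them from the data instead:
```{r eval=FALSE}
# Sketch: derive subgroup sizes from USHSv2 instead of hard-coding them
gender_n <- USHSv2 %>% count(gender, name = "n_group")
sIgender_cor <- slongv3 %>%
count(item, gender) %>%
left_join(gender_n, by = "gender") %>% # attach each subgroup's sample size
mutate(score = n / n_group) %>%        # NA count corrected for sample size
filter(!is.na(gender))
```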
```{r}
#same idea with last chunk, with categorical variable switched to BMI ranges
sp4 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_fill(reverse = T))+
ylab("Percent of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
plot.title = element_text (vjust = 1),
axis.text.x = element_text(angle =45, vjust = 0.7))+
labs(title = "Percent of NAs by BMI", x= "", subtitle = "uncorrected")
#same idea with last chunk, with categorical variable switched to BMI ranges
sp5 <- slongv3 %>% ggplot(aes(x = item, fill = BMI.factor))+
geom_bar(position = position_stack(reverse = T))+
ylab("Number of NAs")+
theme(legend.position = "none", axis.text = element_text(size = 6),
axis.title.y = element_text (size = 9),
axis.text.x = element_text (angle = 45, vjust = 0.7))+
labs(title = "Number of NAs by BMI", x = "Item", subtitle = "uncorrected")
#check the number of NA rows in each BMI group
slongv3 %>% count(BMI.factor)
sIBMI <- slongv3 %>% count(item, BMI.factor)#create contingency table for item & BMI
sIBMI <- sIBMI %>% rename(score = n)
sIBMI_cor <- sIBMI %>% #correct for the effect of sample size by dividing the NA
#count of normal-weight respondents by the sample size
#of that group (n = 118); the same idea applies to the other levels
mutate (score_corrected = case_when(BMI.factor == "Normal weight" ~ score/118,
BMI.factor == "Overweight" ~ score/39,
BMI.factor == "Obese" ~ score/25,
BMI.factor == "Underweight" ~ score/20,
is.na(BMI.factor) ~ score/31))
#same idea with last chunk, with categorical variable switched to BMI ranges
sp6 <- sIBMI_cor %>%
ggplot(aes(x = item, y = score_corrected, fill = BMI.factor))+
geom_col(position = position_fill(reverse = T))+
theme(axis.text = element_text(size = 5),
axis.text.x = element_text(angle = 45, vjust= 0.5))+
labs(title = "Percent of NAs by BMI",
subtitle ="corrected for sample size",
y = "Corrected number of NAs", x = "Item") +
guides(fill = guide_legend(title = "BMI"))
sp4/sp5|sp6
```
  Respondents with normal weight gave the most NA answers (see the plots on the left). After the sample-size correction (see the plot on the right), the NA answers became more evenly distributed across the BMI subcategories. However, the observation was inconclusive because a large proportion of respondents lacked BMI information. Also note that in the bar graph on the right, underweight respondents seemed to comprise the majority of NAs for a couple of items, but I decided to ignore this since the number of underweight respondents was quite small (n = 20), leading to overestimation by the correction.
```{r}
#descriptive statistics for study burnout items.
describe(USHSstu, IQR = T)
```
  The findings: the median score of each item sat closer to those of its own wording group (positive vs negative), and IQRs were consistent within some subgroups of items but varied across subgroups, indicating different response patterns for some items. Means varied less across all items, but given the non-normality of half of the item distributions, the mean and standard deviation are less informative here than the median and IQR.
**1.2 Factorability of the items <br />**
  Next, the factorability of the items was examined. Several well-recognized criteria for the factorability of a correlation were used.
```{r}
#correlation to examine the factorability
sc_matrix <- cor(USHSstu, use = "pairwise.complete.obs") # sc_matrix means study-burnout correlation matrix
corPlot(sc_matrix) # visualize the correlations (corPlot supersedes the deprecated cor.plot)
```
  It was observed that all of the items correlated at least 0.5 with at least one other item, suggesting good factorability.
```{r}
#KMO test for factorability
KMO(sc_matrix)
```
  The Kaiser-Meyer-Olkin measure of sampling adequacy was .94, indicating marvelous adequacy according to Kaiser (1975).
```{r}
#Bartlett test for factorability
cortest.bartlett(sc_matrix, nrow(USHSstu))
```
  Bartlett’s test of sphericity was significant (χ2 (66) = 36358.71, *p* < .05); the null hypothesis that our matrix equals an identity matrix was therefore rejected, again suggesting good factorability.
**1.3 Estimating factor structure <br />**
  Next, the number of factors was estimated. Several well-recognized criteria for exploring factor structure solution were used.
```{r}
# Parallel analysis and scree plot
fa.parallel(USHSstu, fa = "fa", fm = "ml", plot = T)
# very simple structure and BIC
nfactors(sc_matrix, n = 5, rotate = "oblimin", fm = "pa", n.obs = nrow(USHSstu))
```
  Parallel analysis and the scree plot (showing 4 factors with eigenvalues above those of the simulated data), based on the correlation matrix, suggested a 4-factor structure, while the VSS criterion favoured a 2-factor interpretation, its complexity measure peaking at the second factor. This was echoed by the BIC, whose minimum was achieved with 2 factors, whereas another VSS criterion recommended a 3-factor solution, inconsistent with the others.
**1.4 Presetting parameters <br />**
  Based on the factor structure exploration, solutions for two, three and four factors were each examined using varimax and oblimin rotations of the factor loading matrix. Both rotations were tried since I could not presume the relationship between the dimensions of study burnout. Histograms showed more than half of the items under k95 were normally distributed, so maximum likelihood factoring was considered first in view of its suitability for normal data, although other factoring methods might also be adopted in case ML did not give interpretable results.
**1.5 Factor analysis <br />**
*1.5.1 Examining 2-factor structure*
```{r}
# a) orthogonal rotation
sfa2_var <- fa(USHSstu, nfactors = 2, #sfa2_var means study factor analysis 2 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa2_var
fa.diagram(sfa2_var, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa2_obl <- fa(USHSstu, nfactors = 2, #sfa2_obl means study factor analysis 2 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa2_obl
fa.diagram(sfa2_obl, cut = 0, digits =2,
main = "Factor Analysis, 2-factor structure, Oblimin rotation")
```
  The correlation between the 2 factors was weak (r = -0.35), indicating oblimin rotation might theoretically be a proper choice. However, examining the items under each factor, I found the most plausible interpretation of this 2-factor solution is the different directions of wording: factor 1 contains items 1~9, which were negatively worded, while items 10~18 under factor 2 were all positively worded. These should of course be negatively correlated on face value; on the other hand, the factoring solution was not convincing because it did not capture meaningful dimensions of study burnout.
```{r}
#results by varimax rotation
print(sfa2_var$loadings,cutoff = 0.3, sort=TRUE)
prop.table(sfa2_var$values) #how much variance explained
#results by oblimin rotation
print(sfa2_obl$loadings,cutoff = 0.3, sort=TRUE)
prop.table(sfa2_obl$values) #how much variance explained
```
  Cross-loading was very pronounced under both rotation methods (especially varimax). This indicated we should either remove some items or seek other factor solutions, echoing the fact that this two-factor structure did not provide any meaningful interpretation of the dimensions of study burnout (it simply separated positively from negatively worded items).
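  One way to probe whether the two-factor split is purely a wording artifact is to reverse-score the positively worded items before refactoring. This is a sketch of that check, not an analysis run here (responses are on a 1~6 scale, so a reversed score is 7 minus the response):
```{r eval=FALSE}
# Sketch: reverse-score positively worded items so all items point the same way
USHSstu_rev <- USHSstu %>%
mutate(across(k95_10:k95_18, ~ 7 - .x))
# If a 2-factor structure persists after reversal, it is not a pure wording artifact
fa(USHSstu_rev, nfactors = 2, rotate = "oblimin", fm = "ml")
```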
*1.5.2 Examining 3-factor structure*
```{r}
# a) orthogonal rotation
sfa3_var <- fa(USHSstu, nfactors = 3, #sfa3_var means study factor analysis 3 factors by varimax
rotate = "varimax", fm = "ml", scores = "regression")
sfa3_var
fa.diagram(sfa3_var, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Orthogonal rotation")
#b) oblimin rotation
sfa3_obl <- fa(USHSstu, nfactors = 3, #sfa3_obl means study factor analysis 3 factors by oblimin
rotate = "oblimin", fm = "ml", scores = "regression")
sfa3_obl
fa.diagram(sfa3_obl, cut = 0, digits =2,
main = "Factor Analysis, 3-factor structure, Oblimin rotation")
```
  The correlations among the 3 factors were weak to moderate (0.18~0.55), which did not mathematically favour either rotation method. The two methods produced factoring results with little difference. The interpretation of this 3-factor structure was easier and more enlightening than for the 2-factor solution. By qualitative analysis, I assumed items 3, 5, 2, 6 and 8 captured the dimension of exhaustion (example item: I feel a lack of motivation in my studies and often think of giving up); items 7, 1, 9, 4 and 3 captured the sense of pressure (example: I often have feelings of inadequacy in my studies.); the other 9 items together captured the sense of accomplishment (example: I find my studies full of meaning and purpose).
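  The claim that the two rotations differ little can be checked numerically; a sketch using `psych::factor.congruence()` (coefficients near 1 indicate near-identical factors) together with the `Phi` matrix of the oblique solution:
```{r eval=FALSE}
# Sketch: congruence between the varimax and oblimin 3-factor loading patterns
factor.congruence(sfa3_var, sfa3_obl)
# Inter-factor correlations of the oblique solution
sfa3_obl$Phi
```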
```{r}
#results by varimax rotation
print(sfa3_var, digits = 2, sort = T)
print(sfa3_var$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_var)
prop.table(sfa3_var$values) #how much variance explained
#results by oblimin rotation
print(sfa3_obl, digits = 2, sort = T)
print(sfa3_obl$loadings,cutoff = 0.3, sort=TRUE)
summary(sfa3_obl)
prop.table(sfa3_obl$values) #how much variance explained