appendix.rmd

# Appendix

## Texts


### Donald Barthelme

<a href="https://en.wikipedia.org/wiki/Donald_Barthelme"><img src="img/Barthelme.jpg" style="display:block; margin: 0 auto;" width=500></a>

> <p style="text-align:center">"I have to admit we are mired in the most exquisite mysterious muck. <br> This muck heaves and palpitates. It is multi-directional and has a mayor."</p>

> <p style="text-align:center">"You may not be interested in absurdity, but absurdity is interested in you."</p>


#### The First Thing the Baby Did Wrong

This short story is essentially a how-to on parenting.

[link](http://jessamyn.com/barth/baby.html)


#### The Balloon

This story is about a balloon that can represent whatever you want it to.

[link](http://www.uni.edu/oloughli/elit11/Balloon.rtf)


#### Some of Us Had Been Threatening Our Friend Colby

A brief work about etiquette and how to act in society.

[link](http://jessamyn.com/barth/colby.html)


### Raymond Carver

<a href="https://en.wikipedia.org/wiki/Raymond_Carver"><img src="img/Raymond_Carver.jpg" style="display:block; margin: 0 auto;" width=500></a>

> <p style="text-align:center">"It ought to make us feel ashamed when we talk like we know what we're talking about when we talk about love."</p>

> <p style="text-align:center">"That's all we have, finally, the words, and they had better be the right ones." </p>


#### What We Talk About When We Talk About Love

The text we use is actually *Beginners*, or the unedited version. A drink is required in order to read it with the proper context. Probably several. No. Definitely several.

[link](http://www.newyorker.com/magazine/2007/12/24/beginners)


### Billy Dee Shakespeare

<img src="img/Billy_Dee_Shakespeare.jpg" style="display:block; margin: 0 auto;" width='40%'>


> <p style="text-align:center">"It works every time."</p>


These old works have pretty much no relevance today, and are mostly forgotten by everyone except humanities faculty. The analysis of them depicted in this document is pretty much definitive, and leaves little else to say regarding them, so don't bother reading them if you haven't already.


## R

Up until even a couple years ago, R was *terrible* at text.  You really only had base R for basic processing and a couple packages that were not straightforward to use.  There was little for scraping the web.  Nowadays, I would say it's probably easier to deal with text in R than it is elsewhere, including Python. Packages like <span class="pack">rvest</span>, <span class="pack">stringr</span>/<span class="pack">stringi</span>, and <span class="pack">tidytext</span> and more make it almost easy enough to jump right in.

One can peruse the Natural Language Processing task view to start getting a sense of what all is available in R.

[NLP task view](https://www.r-pkg.org/ctv/NaturalLanguageProcessing)


The one drawback with R is that most of the dealing with text is slow and/or memory intensive.  The Shakespeare texts are only a few dozen and not very long works, and yet your basic LDA might still take a minute or so.  Most text analysis situations might have thousands to millions of texts, such that the corpus itself may be too much to hold in memory, and thus R, at least on a standard computing device or with the usual methods, might not be viable for your needs.

## Python

While R has done a lot to catch up, more advanced text analysis techniques are developed in Python (if not lower level languages), and so the state of the art may be found there.  Furthermore, much of text analysis is a high volume affair, and that means it will likely be done much more efficiently in the Python environment if so, though one still might need a high performance computing environment.  Here are some of the popular modules in Python. 

- <span class="pack">nltk</span>
- <span class="pack">textblob</span> (the tidytext for Python)
- <span class="pack">gensim</span> (topic modeling)
- <span class="pack">spaCy</span>


## A Faster LDA

We noted in the Shakespeare start to finish example that there are faster alternatives than the standard LDA in <span class="pack">topicmodels</span>.  In particular, the powerful <span class="pack">text2vec</span> package contains a faster and less memory intensive implementation of LDA and dealing with text generally.  Both of which are very important if you're wanting to use R for text analysis.  The other nice thing is that it works with <span class="pack">LDAvis</span> for visualization.

For the following, we'll use one of the partially cleaned document term matrix for the Shakespeare texts.  One of the things to get used to is that <span class="pack">text2vec</span> uses the newer R6 classes of R objects, hence the `$` approach you see to using specific methods.

```{r text2vec_demo}
library(text2vec)
load('data/shakes_dtm_stemmed.RData')
# load('data/shakes_words_df.RData') # non-stemmed

# convert to the sparse matrix representation using Matrix package
shakes_dtm = as(shakes_dtm, 'CsparseMatrix')

# setup the model
lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)

# fit the model
doc_topic_distr =   lda_model$fit_transform(x = shakes_dtm, 
                                            n_iter = 1000,
                                            convergence_tol = 0.0001, 
                                            n_check_convergence = 25,
                                            progressbar = FALSE)

lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 1)
which.max(doc_topic_distr['Hamlet', ])

# top-words could be sorted by “relevance” which also takes into account
# frequency of word in the corpus (0 < lambda < 1)
lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 0.2)

# ldavis not shown
# lda_model$plot()
```

Given that most text analysis can be very time consuming for a model, consider any approach that might give you more efficiency.