In this MP, you will implement the PLSA algorithm discussed in lectures 9.7 and 9.8 of the Text Mining Coursera course.
You are not required to implement the PLSA algorithm with a background model (we will run tests assuming the background model has not been implemented). You are provided with two data sets in the data
folder: test.txt
and dblp.txt
. You can assume that each line is a separate document.
The test.txt
data set contains 1000 lines. Each line is a document. The first 100 documents are labeled with the topic the document belongs to, either 0 (“Seattle”) or 1 (“Chicago”). Each of the first 100 document is generated by sampling completely from the topic that is labelled (i.e., generated from one topic only). The rest 900 documents are generated from a mixture model of the topic “Seattle” and “Chicago”. Run your code to test if your EM algorithm returns reasonable results.
The DBLP.txt
data set is made up of research paper abstracts from DBLP. Each line is a document. Make sure to test your code on the simpler data set test.txt
before running it on DBLP.txt
.
You are provided with a code skeleton plsa.py
. Do not change anything in the def __init__()
function. But feel free to change anything in the main()
function to test your code.
Some suggested tips:
- Using matrices is strongly encouraged (writing nested
for-loops
for calculation is painful) - Python libraries
numpy
andscipy
are useful in matrix based calculations
The original PLSA model does not contain a background model. This MP also is based on the original PLSA model, you do not have to worry about the background model. However, lectures are all about PLSA with a background model, so you should not attempt to map the formulas on lecture slides directly to the code. Instead, you would need to adjust the formulas on slides by assuming that there is zero probability that the background model would be chosen. That is, you should set λB to zero. If you do this, you will see that the E-step would essentially only compute a distribution over k topics for z=1, ..., k, given each word, i.e., p(z=i|w). The M-step would also be simpler as p(Z=B|w) is also zero for all words (due to the fact that λB=0). If you are still confused, please take a look at Section 3 of Chase Geigle's note on EM [2] and the note below.
The main data structures involved in the implementation of this EM algorithm are three matrices:
-
T (topics by words): this is the set of parameters characterizing topic content that we denoted by θi's. Each element is the probability of a particular word in a particular topic.
-
D (documents by topics): this is the set of parameters modeling the coverage of topics in each document, which we denoted by pij's. Each element is the probability of a particular topic is covered in a particular document.
-
Z (hidden variables): For every document, we need one Z which represents the probability that each word in the document has been generated from a particular topic, so for any document, this is a "word-by-topic" matrix, encoding p(Z|w) for a particular document. Z is the matrix that we compute in the E-step (based on matrices T and D, which represent our parameters). Note that we need to compute a different Z for each document, so we need to allocate a matrix Z for every document. If we do so, the M-step is simply to use all these Z matrices together with word counts in each document to re-estimate all the parameters, i.e., updating matrices T and D based on Z. Thus at a high level, this is what's happening in the algorithm:
- T and D are initialized.
- E-step computes all Z's based on T and D.
- M-step uses all Z's to update T and D.
- We iterate until the likelihood doesn't change much when we would use T and D as our output. Note that Zs are also very useful (can you imagine some applications of Zs?).
[1] Cheng’s note on the EM algorithm
[2] Chase Geigle’s note on the EM algorithm, which includes a derivation of the EM algorithm (see section 4), and
[3] Qiaozhu Mei’s note on the EM algorithm for PLSA, which includes a different derivation of the EM algorithm.
THIS MP IS CODING AND DEBUGGING HEAVY! PLEASE DON'T LEAVE IT TILL THE LAST MINUTE!