
Discovering higher-level Patterns - Fundamentals #27

Open · HyunkuKwon opened this issue Jan 12, 2021 · 23 comments
Comments

@HyunkuKwon (Collaborator)

Post questions here for one or more of our fundamentals readings:

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.

Blei, David. 2012. “Probabilistic Topic Models”. Communications of the ACM 55(4):77-84.

@RobertoBarrosoLuque

I have a two-pronged question. First, what is the state of the research on automating labels for topic modeling? That is, is there any way to create labels for topics without requiring a human to generate them? Second, are there topic modeling algorithms that don't require the number of topics a priori, but instead infer it from the dataset itself?
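
For reference on the second question: one family of models that infers the number of topics from the data is the hierarchical Dirichlet process (HDP). A minimal sketch with gensim's HdpModel, using a toy placeholder corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# toy placeholder corpus: each document is a list of tokens
docs = [["gene", "dna", "cell"], ["court", "law", "judge"],
        ["gene", "cell", "protein"], ["law", "judge", "trial"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# unlike LDA, HDP does not take a fixed K; it infers the number of
# topics (up to a truncation level) from the corpus itself
hdp = HdpModel(corpus=corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=5))
```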

@hesongrun

The idea of applying the EM algorithm to extract latent topics from a large corpus is really clever! It taps into the fact that many texts share common themes, which can be extracted from the co-occurrence of words within documents. I have three questions:

  1. The EM algorithm is not guaranteed to reach a global optimum, so different runs with different initial guesses may yield different results. How should we report this in academic writing? Are there systematic ways to tune the model so that the resulting topics are transparent and highly interpretable?
  2. My second question is about determining the optimal number of topics. From this week's lecture, I learned several ways to determine the optimal K: for example, relying on an information criterion like BIC, or using cross-validation to check the model's likelihood on held-out data. What if these measures are not consistent with each other? Which should we trust the most?
  3. My third question is about labeling the topics. How do we justify our labeling in general? Using the top words may be reasonable, but I was thinking about some kind of tf-idf approach: the words most diagnostic of a topic may be those that do not occur commonly across documents (see the sketch after this list).
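
On question 3, one tf-idf-like heuristic is to re-rank each topic's top words by lift, p(word | topic) / p(word), which down-weights words that are common everywhere. A rough numpy sketch with made-up numbers (topic_word and word_freq would come from a fitted model and corpus counts):

```python
import numpy as np

# made-up inputs: topic_word[k, v] = p(word v | topic k) from a fitted LDA,
# word_freq[v] = empirical p(word v) across the whole corpus
topic_word = np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.2, 0.7]])
word_freq = np.array([0.45, 0.25, 0.30])

# lift down-weights corpus-wide common words, so the top-ranked words
# are those most diagnostic of each topic rather than merely frequent
lift = topic_word / word_freq
for k, row in enumerate(lift):
    print(f"topic {k}: word ids by diagnosticity {np.argsort(row)[::-1]}")
```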

@chiayunc commented Feb 19, 2021

I am curious about LDA's performance on a corpus with a relatively small vocabulary. In the example, the corpus considered covers very diverse topics. If we are looking at a highly concentrated field of corpora, where the vocabulary is small but is articulated/mediated in different ways (hence different foci/topics), would LDA be an ideal way to perform topic modeling? Or would topic modeling not be an ideal way to find the nuances between documents in this case?

@jinfei1125

Hi, though this was introduced in the lecture and this week's fundamentals reading, I still don't quite understand the following generative process for topic modeling:

Step 1: Randomly choose a distribution over topics.
Step 2: For each word in the document
a. Randomly choose a topic from the distribution over topics in step #1.
b. Randomly choose a word from the corresponding distribution over the vocabulary.

Can you give some further explanation? How do we "choose a distribution over topics," "choose a topic from the distribution over topics," and "choose a word from the corresponding distribution over the vocabulary"?
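
One way to make these steps concrete is to simulate them. A minimal numpy sketch of the generative story, with two made-up topics over a three-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "brain", "data"]

# two fixed topics: each topic is a distribution over the vocabulary
topics = np.array([[0.7, 0.2, 0.1],   # topic 0: mostly "gene"
                   [0.1, 0.2, 0.7]])  # topic 1: mostly "data"

# Step 1: draw this document's distribution over topics from a Dirichlet
theta = rng.dirichlet(alpha=[0.5, 0.5])

doc = []
for _ in range(10):                   # Step 2: for each word position...
    z = rng.choice(2, p=theta)        # 2a: pick a topic from theta
    w = rng.choice(3, p=topics[z])    # 2b: pick a word from that topic
    doc.append(vocab[w])
print(theta, doc)
```

LDA is this process run in reverse: given only the documents, it infers the theta and topic distributions most likely to have generated them.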

@k-partha

  1. Are there any metrics to define the optimal number of topics for LDA? (See the coherence sketch after this list.)
  2. How does LDA hold up in a world where the bag-of-words model is considered unrealistic compared to more recent NLP approaches that go beyond even n-grams? Are there input transformations considered to improve LDA analyses (including low-level n-grams in the mix, converting words to vector-space representations, etc.)?
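
On the first point, one widely used heuristic is topic coherence: fit LDA for several values of K and keep the K with the highest coherence. A hedged sketch with gensim's CoherenceModel (toy placeholder documents):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [["gene", "dna", "cell"], ["court", "law", "judge"],
        ["gene", "cell", "protein"], ["law", "judge", "trial"],
        ["dna", "protein", "cell"], ["trial", "court", "law"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# fit LDA at several K and compare c_v coherence; higher is better,
# though coherence should still be sanity-checked by reading the topics
for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())
```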

@toecn commented Feb 19, 2021

How should we think about integrating metadata with the data itself when constructing topic models? What can metadata help us do, validate, or expand in terms of analysis?

@Raychanan

The EM algorithm has a strong connection with the Naive Bayes method. Does this mean it can no longer be considered a good algorithm? I ask because Naive Bayes is no longer considered an advanced algorithm in applied work.

I have another question: the EM algorithm seems highly similar to classification based on topic modeling. What is the difference between them, and when should the EM algorithm be used instead of other classification methods?

@romanticmonkey commented Feb 19, 2021

Are there studies in the existing literature that discovered new population segmentations from text data through clustering methods?

@jcvotava

I have a (perhaps embarrassingly) simple question about LDA and topic modeling: what is the relationship between the number of clusters/topics formed (according to the various unsupervised metrics mentioned in lecture, like the silhouette formula) and the number of documents? For example, imagine that Journal X ran 80 very similar papers on farm equipment, 10 papers on pasta recipes, 5 papers on neurobiology, and 5 papers on analytic philosophy. Despite organically having 4 clear topics, would any of these algorithms artificially push for more or fewer topics, or is the number of documents already de-weighted in the formula? What approach would be appropriate in this instance, or in an even more extreme case where very distinct topics had very few associated documents?
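
One way to probe this empirically is to simulate well-separated but badly imbalanced clusters and watch how the silhouette score behaves as k varies. A rough sklearn sketch mirroring the 80/10/5/5 example:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# four well-separated clusters with very unequal sizes, as in the example
X, _ = make_blobs(n_samples=[80, 10, 5, 5], cluster_std=0.5, random_state=0)

# silhouette is averaged over points, so the 80-document cluster dominates;
# scanning k shows whether the small topics still earn their own cluster
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```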

@jacyanthis

What do you think of seeded LDA? When is it useful, and when is it not?
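
For context, one implementation is the third-party guidedlda package, where seed words bias the sampler's initialization toward particular topics. A sketch assuming its documented interface (the document-term matrix here is tiny and made up):

```python
import numpy as np
import guidedlda  # third-party package implementing seeded/guided LDA

vocab = ["market", "economy", "court", "judge", "music"]
word2id = {w: i for i, w in enumerate(vocab)}
# tiny made-up document-term count matrix (rows = docs, columns = vocab)
X = np.array([[3, 2, 0, 0, 1],
              [0, 0, 4, 3, 0],
              [1, 1, 1, 1, 2]])

# steer topic 0 toward economic terms and topic 1 toward legal terms
seed_topics = {word2id["market"]: 0, word2id["economy"]: 0,
               word2id["court"]: 1, word2id["judge"]: 1}

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7)
# seed_confidence sets how strongly seed words stick to their seeded topic
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```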

@MOTOKU666

Can you introduce some more ways to incorporate metadata into topic models? The paper briefly mentions models of linguistic structure, models that account for distances between corpora, and models of named entities, and notes that general-purpose methods for incorporating metadata include Dirichlet-multinomial regression models and supervised topic models. For example, how are distances between corpora accounted for?

@zshibing1

Is it possible to use unsupervised methods on corpora that have rigid structures (e.g., policy documents) but contain relatively few words (less than a million)?

@ming-cui

The authors indicate that LDA is the simplest kind of topic model, yet I have read sociology papers using LDA published in leading journals, so LDA seems powerful and capable on its own. Are there other topic modeling techniques we should take a look at?

@Rui-echo-Pan

LDA is useful for comparing topics across different texts; what should we use to analyze how topics change over a long time span? Could we build such an analysis on LDA?
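
One LDA-based option for long-run trends is the dynamic topic model, which chains LDA across time slices so each topic's word distribution can drift. gensim ships an implementation as LdaSeqModel; a rough sketch with placeholder data:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

# placeholder corpus ordered by time: first two docs are period 1, next two period 2
docs = [["gene", "cell"], ["gene", "protein"],
        ["law", "court"], ["court", "judge"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# time_slice gives the number of documents in each period; the model then
# tracks how every topic's word distribution evolves across periods
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[2, 2], num_topics=2)
print(ldaseq.print_topics(time=0))
```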

@william-wei-zhu

How do we identify optimal values for the tuning parameters governing document-topic sparsity and topic-word sparsity?
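
In gensim's LDA these correspond to the Dirichlet priors alpha (document-topic) and eta (topic-word). One pragmatic option, rather than a manual grid search, is to let the model learn asymmetric priors from the data; a hedged sketch with a toy corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["gene", "cell", "dna"], ["court", "law", "judge"],
        ["gene", "protein"], ["law", "trial"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# smaller alpha -> sparser document-topic mixtures; smaller eta -> sparser
# topic-word distributions. "auto" asks gensim to learn both from the data.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", random_state=0)
print(lda.alpha, lda.eta)
```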

@Bin-ary-Li

Is there any benchmark that compares different non-parametric clustering methods? I wonder if there is any consensus or common practice on when to use which method, and on what kind of data, when applying clustering methods.
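
I'm not aware of a single canonical benchmark, but a quick way to build intuition is to run several clustering algorithms on the same data and compare them against known labels. A small sklearn sketch on deliberately non-convex clusters:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

# two interleaved half-moons: a shape where centroid-based methods struggle
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN needs no cluster count at all, which speaks to the
# "non-parametric" part of the question
for name, algo in [("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
                   ("agglomerative", AgglomerativeClustering(n_clusters=2)),
                   ("dbscan", DBSCAN(eps=0.3))]:
    print(name, round(adjusted_rand_score(y, algo.fit_predict(X)), 3))
```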

@xxicheng

Could you please explain more about metadata? How does it affect topic models?

@sabinahartnett

It seems like topic models require a LOT of back-and-forth between researcher and machine to optimize. Is there an automated way to do this, or a good default number of topics to start from?

@theoevans1

What kinds of considerations should be taken into account in deciding between classification and clustering for a research question? Is there ever reason to use both methods together?

@egemenpamukcu

I am also interested in best practices for determining the number of topics for LDA. I would also like to hear more about approaches to comparing clusters generated by an unsupervised learning method with predetermined classes (classified by an expert, or with 'natural' categories). What would be some interesting applications of such a mixed approach?

@lilygrier

Similarly to @zshibing1, I'm wondering about the feasibility of performing topic modeling on documents with rigid structures. Specifically, I'm thinking about executive orders or legislative bills, which tend to focus heavily on procedures and logistics and look similar regardless of policy context. Is there something analogous to TF-IDF in topic modeling that distinguishes between very similar documents?
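
One common move in this spirit is to prune the procedural boilerplate from the vocabulary before fitting the model, e.g. dropping words that appear in nearly every document, which is roughly what tf-idf would down-weight anyway. A sketch with gensim's filter_extremes (thresholds are illustrative):

```python
from gensim.corpora import Dictionary

# toy executive orders: shared procedural language plus one substantive term each
docs = [["order", "section", "agency", "climate"],
        ["order", "section", "agency", "immigration"],
        ["order", "section", "agency", "tariff"]]
dictionary = Dictionary(docs)

# drop boilerplate appearing in more than 80% of documents; keep the rest
dictionary.filter_extremes(no_below=1, no_above=0.8)
print(dictionary.token2id)  # only the substantive policy terms survive
```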

@mingtao-gao

One of the main goals of topic modeling is "discovering and exploiting the hidden thematic structure in large archives of text." However, if the umbrella topic of the corpus is selected before modeling, how do we mitigate selection bias?

@dtanoglidis

About LDA and its assumptions: there is a discussion about relaxing the assumptions made by topic modeling algorithms, and I was wondering the following: are all topics equally distinct? To rephrase: imagine we have two texts, each composed of two topics; say Airbnb reviews contain discussions about the location and about the listing itself. In the first text the discussion is polarized, with no overlap between the topics, while in the second the two are intertwined. Is there a way LDA can distinguish between the two?
