
Discovering higher-level Patterns - Fundamentals #27

Open · HyunkuKwon opened this issue Jan 12, 2021 · 23 comments
Comments

@HyunkuKwon (Collaborator)

Post questions here for one or more of our fundamentals readings:

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.

Blei, David. 2012. “Probabilistic Topic Models”. Communications of the ACM 55(4):77-84.

@RobertoBarrosoLuque

I have a two-pronged question. First, what is the state of the research on automating labels for topic modeling? That is, is there any way to create labels for topics without requiring a human to generate them? Second, are there topic modeling algorithms that don't require the number of topics a priori, but instead infer it from the dataset itself?
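
For reference on the second question: one family of models that infers the number of topics from the data is the hierarchical Dirichlet process (HDP). A minimal sketch with gensim's HdpModel, using a toy placeholder corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# toy placeholder corpus: each document is a list of tokens
docs = [["gene", "dna", "cell"], ["court", "law", "judge"],
        ["gene", "cell", "protein"], ["law", "judge", "trial"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# unlike LDA, HDP does not take a fixed K; it infers the number of
# topics (up to a truncation level) from the corpus itself
hdp = HdpModel(corpus=corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=5))
```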

@hesongrun

The idea of applying the EM algorithm to extract latent topics from a large corpus is really clever! It taps into the fact that many texts share common themes, which can be extracted from the co-occurrence of words within documents. I have three questions:

  1. The EM algorithm is not guaranteed to reach a global optimum, so different runs with different initial guesses may yield different results. How should we report this in academic writing? Are there systematic ways to tune the model so that the resulting topics are transparent and highly interpretable?
  2. My second question is about determining the optimal number of topics. From this week's lecture, I learned several ways to determine the optimal K: for example, relying on an information criterion like BIC, or using cross-validation to check the model's likelihood on held-out data. What if these measures are not consistent with each other? Which should we trust the most?
  3. My third question is about labeling the topics. How do we justify our labeling in general? Using the top words may be reasonable, but I was thinking about some kind of tf-idf approach: the words most diagnostic of a topic may be those that do not occur commonly across documents (see the sketch after this list).
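
On question 3, one tf-idf-like heuristic is to re-rank each topic's top words by lift, p(word | topic) / p(word), which down-weights words that are common everywhere. A rough numpy sketch with made-up numbers (topic_word and word_freq would come from a fitted model and corpus counts):

```python
import numpy as np

# made-up inputs: topic_word[k, v] = p(word v | topic k) from a fitted LDA,
# word_freq[v] = empirical p(word v) across the whole corpus
topic_word = np.array([[0.5, 0.3, 0.2],
                       [0.1, 0.2, 0.7]])
word_freq = np.array([0.45, 0.25, 0.30])

# lift down-weights corpus-wide common words, so the top-ranked words
# are those most diagnostic of each topic rather than merely frequent
lift = topic_word / word_freq
for k, row in enumerate(lift):
    print(f"topic {k}: word ids by diagnosticity {np.argsort(row)[::-1]}")
```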

@chiayunc commented Feb 19, 2021

I am curious about LDA's performance on a corpus with a relatively small vocabulary. In the example, the corpus considered covers very diverse topics. If we are looking at a highly concentrated field of corpora, where the vocabulary is small but is articulated/mediated in different ways (hence different foci/topics), would LDA be an ideal way to perform topic modeling? Or would topic modeling not be an ideal way to find the nuances between documents in this case?

@jinfei1125

Hi, though this was introduced in the lecture and this week's fundamentals reading, I still don't quite understand the following generative process for topic modeling:

Step 1: Randomly choose a distribution over topics.
Step 2: For each word in the document
a. Randomly choose a topic from the distribution over topics in step #1.
b. Randomly choose a word from the corresponding distribution over the vocabulary.

Can you give some further explanation? How do we "choose a distribution over topics," "choose a topic from the distribution over topics," and "choose a word from the corresponding distribution over the vocabulary"?
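
One way to make these steps concrete is to simulate them. A minimal numpy sketch of the generative story, with two made-up topics over a three-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "brain", "data"]

# two fixed topics: each topic is a distribution over the vocabulary
topics = np.array([[0.7, 0.2, 0.1],   # topic 0: mostly "gene"
                   [0.1, 0.2, 0.7]])  # topic 1: mostly "data"

# Step 1: draw this document's distribution over topics from a Dirichlet
theta = rng.dirichlet(alpha=[0.5, 0.5])

doc = []
for _ in range(10):                   # Step 2: for each word position...
    z = rng.choice(2, p=theta)        # 2a: pick a topic from theta
    w = rng.choice(3, p=topics[z])    # 2b: pick a word from that topic
    doc.append(vocab[w])
print(theta, doc)
```

LDA is this process run in reverse: given only the documents, it infers the theta and topic distributions most likely to have generated them.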

@k-partha

  1. Are there any metrics to define the optimal number of topics for LDA? (See the coherence sketch after this list.)
  2. How does LDA hold up in a world where the bag-of-words model is considered unrealistic compared to more recent NLP approaches that go beyond even n-grams? Are there input transformations considered to improve LDA analyses (including low-level n-grams in the mix, converting words to vector-space representations, etc.)?
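
On the first point, one widely used heuristic is topic coherence: fit LDA for several values of K and keep the K with the highest coherence. A hedged sketch with gensim's CoherenceModel (toy placeholder documents):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [["gene", "dna", "cell"], ["court", "law", "judge"],
        ["gene", "cell", "protein"], ["law", "judge", "trial"],
        ["dna", "protein", "cell"], ["trial", "court", "law"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# fit LDA at several K and compare c_v coherence; higher is better,
# though coherence should still be sanity-checked by reading the topics
for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
    print(k, cm.get_coherence())
```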

@toecn commented Feb 19, 2021

How should we think about integrating metadata with the data itself when constructing topic models? What can metadata help us do, validate, or expand in terms of analysis?

@Raychanan

The EM algorithm has a strong connection with the Naive Bayes method. Does this mean it can no longer be considered a good algorithm? I ask because Naive Bayes is no longer considered an advanced algorithm in applied work.

I have another question: the EM algorithm seems highly similar to classification based on topic modeling. What is the difference between them, and when should the EM algorithm be used instead of other classification methods?

@romanticmonkey commented Feb 19, 2021

Are there studies in the existing literature that discovered new population segmentations from text data through clustering methods?

@jcvotava

I have a (perhaps embarrassingly) simple question about LDA and topic modeling: what is the relationship between the number of clusters/topics formed (according to the various unsupervised metrics mentioned in lecture, like the silhouette formula) and the number of documents? For example, imagine that Journal X ran 80 very similar papers on farm equipment, 10 papers on pasta recipes, 5 papers on neurobiology, and 5 papers on analytic philosophy. Despite organically having 4 clear topics, would any of these algorithms artificially push for more or fewer topics, or is the number of documents already de-weighted in the formula? What approach would be appropriate in this instance, or in an even more extreme case where very distinct topics had very few associated documents?
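
One way to probe this empirically is to simulate well-separated but badly imbalanced clusters and watch how the silhouette score behaves as k varies. A rough sklearn sketch mirroring the 80/10/5/5 example:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# four well-separated clusters with very unequal sizes, as in the example
X, _ = make_blobs(n_samples=[80, 10, 5, 5], cluster_std=0.5, random_state=0)

# silhouette is averaged over points, so the 80-document cluster dominates;
# scanning k shows whether the small topics still earn their own cluster
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```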

@jacyanthis

What do you think of seeded LDA? When is it useful, and when is it not?
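
For context, one implementation is the third-party guidedlda package, where seed words bias the sampler's initialization toward particular topics. A sketch assuming its documented interface (the document-term matrix here is tiny and made up):

```python
import numpy as np
import guidedlda  # third-party package implementing seeded/guided LDA

vocab = ["market", "economy", "court", "judge", "music"]
word2id = {w: i for i, w in enumerate(vocab)}
# tiny made-up document-term count matrix (rows = docs, columns = vocab)
X = np.array([[3, 2, 0, 0, 1],
              [0, 0, 4, 3, 0],
              [1, 1, 1, 1, 2]])

# steer topic 0 toward economic terms and topic 1 toward legal terms
seed_topics = {word2id["market"]: 0, word2id["economy"]: 0,
               word2id["court"]: 1, word2id["judge"]: 1}

model = guidedlda.GuidedLDA(n_topics=2, n_iter=100, random_state=7)
# seed_confidence sets how strongly seed words stick to their seeded topic
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)
```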

@MOTOKU666

Can you introduce some more ways to incorporate metadata into topic models? The paper briefly mentions models of linguistic structure, models that account for distances between corpora, and models of named entities, and notes that general-purpose methods for incorporating metadata include Dirichlet-multinomial regression models and supervised topic models. For example, how are distances between corpora accounted for?

@zshibing1

Is it possible to use unsupervised methods on corpora that have rigid structures (e.g., policy documents) but contain relatively few words (less than a million)?

@ming-cui

The authors indicate that LDA is the simplest kind of topic model, yet I have read sociology papers using LDA published in leading journals, so LDA seems powerful and capable on its own. Are there other topic modeling techniques we should take a look at?

@Rui-echo-Pan

LDA is useful for comparing topics across different texts; what should we use to analyze how topics change over a long time span? Could we build such an analysis on LDA?
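
One LDA-based option for long-run trends is the dynamic topic model, which chains LDA across time slices so each topic's word distribution can drift. gensim ships an implementation as LdaSeqModel; a rough sketch with placeholder data:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

# placeholder corpus ordered by time: first two docs are period 1, next two period 2
docs = [["gene", "cell"], ["gene", "protein"],
        ["law", "court"], ["court", "judge"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# time_slice gives the number of documents in each period; the model then
# tracks how every topic's word distribution evolves across periods
ldaseq = LdaSeqModel(corpus=corpus, id2word=dictionary,
                     time_slice=[2, 2], num_topics=2)
print(ldaseq.print_topics(time=0))
```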

@william-wei-zhu

How do we identify optimal values for the tuning parameters governing document-topic sparsity and topic-word sparsity?
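
In gensim's LDA these correspond to the Dirichlet priors alpha (document-topic) and eta (topic-word). One pragmatic option, rather than a manual grid search, is to let the model learn asymmetric priors from the data; a hedged sketch with a toy corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["gene", "cell", "dna"], ["court", "law", "judge"],
        ["gene", "protein"], ["law", "trial"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# smaller alpha -> sparser document-topic mixtures; smaller eta -> sparser
# topic-word distributions. "auto" asks gensim to learn both from the data.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", random_state=0)
print(lda.alpha, lda.eta)
```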

@Bin-ary-Li

Is there any benchmark that compares different non-parametric clustering methods? I wonder if there is any consensus or common practice on when to use which method, and on what kind of data, when applying clustering methods.
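
I'm not aware of a single canonical benchmark, but a quick way to build intuition is to run several clustering algorithms on the same data and compare them against known labels. A small sklearn sketch on deliberately non-convex clusters:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import adjusted_rand_score

# two interleaved half-moons: a shape where centroid-based methods struggle
X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

# DBSCAN needs no cluster count at all, which speaks to the
# "non-parametric" part of the question
for name, algo in [("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
                   ("agglomerative", AgglomerativeClustering(n_clusters=2)),
                   ("dbscan", DBSCAN(eps=0.3))]:
    print(name, round(adjusted_rand_score(y, algo.fit_predict(X)), 3))
```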

@xxicheng

Could you please explain more about metadata? How does it affect topic models?

@sabinahartnett

It seems like topic models require a LOT of back-and-forth between researcher and machine to optimize. Is there an automated way to do this, or a good default number of topics to start from?

@theoevans1

What kinds of considerations should be taken into account in deciding between classification and clustering for a research question? Is there ever reason to use both methods together?

@egemenpamukcu

I am also interested in best practices for determining the number of topics for LDA. I would also like to hear more about approaches to comparing clusters generated by an unsupervised learning method with predetermined classes (classified by an expert, or with 'natural' categories). What would be some interesting applications of such a mixed approach?

@lilygrier

Similarly to @zshibing1, I'm wondering about the feasibility of performing topic modeling on documents with rigid structures. Specifically, I'm thinking about executive orders or legislative bills, which tend to focus heavily on procedures and logistics and look similar regardless of policy context. Is there something analogous to TF-IDF in topic modeling that distinguishes between very similar documents?
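
One common move in this spirit is to prune the procedural boilerplate from the vocabulary before fitting the model, e.g. dropping words that appear in nearly every document, which is roughly what tf-idf would down-weight anyway. A sketch with gensim's filter_extremes (thresholds are illustrative):

```python
from gensim.corpora import Dictionary

# toy executive orders: shared procedural language plus one substantive term each
docs = [["order", "section", "agency", "climate"],
        ["order", "section", "agency", "immigration"],
        ["order", "section", "agency", "tariff"]]
dictionary = Dictionary(docs)

# drop boilerplate appearing in more than 80% of documents; keep the rest
dictionary.filter_extremes(no_below=1, no_above=0.8)
print(dictionary.token2id)  # only the substantive policy terms survive
```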

@mingtao-gao

One of the main goals of topic modeling is "discovering and exploiting the hidden thematic structure in large archives of text." However, if the umbrella topic of the corpus is selected before modeling, how do we mitigate selection bias?

@dtanoglidis

About LDA and its assumptions: there is a discussion about relaxing the assumptions made by topic modeling algorithms, and I was wondering the following: are all topics equally distinct? To rephrase: imagine we have two texts, each composed of two topics; say Airbnb reviews contain discussions about the location and about the listing itself. In the first text the discussion is polarized, with no overlap between the topics, while in the second the two are intertwined. Is there a way LDA can distinguish between the two?
