Discovering higher-level Patterns - Fundamentals #27
I have a two-pronged question. First, what is the state of the research on automating labels for topic modeling? That is, is there any way to create labels for topics without requiring a human to generate them? Second, are there other topic modeling algorithms that don't require the number of topics a priori, but rather infer the number of topics from the dataset itself?
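On the second question, one family of answers is Bayesian nonparametrics: hierarchical Dirichlet process (HDP) topic models infer the number of topics from the data rather than fixing it a priori. A minimal sketch, assuming gensim is installed (the two-document corpus is invented for illustration):

```python
from gensim.corpora import Dictionary
from gensim.models import HdpModel

# Toy corpus: two tokenized "documents" (hypothetical data).
texts = [["stocks", "market", "trade", "price"],
         ["genes", "dna", "protein", "cell"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# HDP places no fixed cap on the number of topics; it grows them as needed.
hdp = HdpModel(corpus, id2word=dictionary)
print(hdp.print_topics(num_topics=5, num_words=3))
```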
The idea of applying the EM algorithm to extract latent topics from a large corpus is really clever! It taps into the fact that many texts share common themes, which can be extracted from the collocation of words within a document. I have three questions:
I am curious about LDA's performance on a corpus with a relatively small vocabulary. In the example, the corpus covers very diverse topics. If we are looking at a highly concentrated field, where the vocabulary is small but is articulated/mediated in different ways (and hence with different foci/topics), would LDA be an ideal way to perform topic modeling? Or would topic modeling not be an ideal way to find the nuances between documents in this case?
Hi, although this was introduced in the lecture and this week's fundamentals reading, I still don't quite understand the generative process for topic modeling. Can you give some further explanation? How do we "choose a distribution over topics," how do we "choose a topic from the distribution over topics," and how do we "choose a word from the corresponding distribution over the vocabulary"?
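For what it's worth, the three "choose" steps are just draws from probability distributions. A minimal sketch of LDA's generative story in numpy, with toy sizes assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, doc_len = 6, 2, 10                 # vocabulary size, topics, words per document
topics = rng.dirichlet(np.ones(V), K)    # each topic is a distribution over the vocabulary

theta = rng.dirichlet(np.ones(K))        # 1. choose a distribution over topics (per document)
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)           # 2. choose a topic from that distribution
    w = rng.choice(V, p=topics[z])       # 3. choose a word from that topic's word distribution
    doc.append(w)
print(doc)                               # word ids of the generated document
```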
How should we think about integrating metadata and data when constructing topic models? What can metadata help us do, validate, or expand in terms of analysis?
The EM algorithm has a strong connection to the Naive Bayes method. Does this mean it can no longer be considered a good algorithm, given that Naive Bayes is no longer regarded as an advanced algorithm in applied work? I have another question: the EM algorithm seems to have a high degree of similarity to classification based on topic modeling, so I am curious what the difference between them is. When should the EM algorithm be used instead of other classification methods?
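On the clustering-versus-classification part of the question, the core difference is whether labels are available: EM-based mixture models are fit without labels, while a classifier needs labeled training data. A minimal sketch with scikit-learn and synthetic data (GaussianMixture is fit with EM):

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# EM-based mixture model: discovers groups without ever seeing y.
em_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Classifier: requires the labels y to train at all.
clf_preds = LogisticRegression().fit(X, y).predict(X)
```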
Are there studies in the existing literature that have discovered new population segmentations in text data through clustering methods?
I have a (perhaps embarrassingly) simple question about LDA and topic modeling: what is the relationship between the number of clusters/topics formed (according to the various unsupervised metrics mentioned in lecture, like the silhouette formula) and the number of documents? For example, imagine that Journal X ran 80 very similar papers on farm equipment, 10 papers on pasta recipes, 5 papers on neurobiology, and 5 papers on analytic philosophy. Despite organically having 4 clear topics, would the construction of any of these algorithms artificially push for more or fewer topics, or is the number of documents already de-weighted in the formula? What kind of approach would be appropriate in this instance, or in an even more extreme case where very, very distinct topics had very few associated documents?
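One way to see the size effect is that the silhouette score averages the per-point coefficient over all points, so an 80-document cluster carries 80 times the weight of a 5-document one; a sufficiently distinct small cluster can still survive. A rough sketch with scikit-learn, using synthetic points in place of documents to mirror the imbalanced journal:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Imbalanced "journal": 80 + 10 + 5 + 5 items drawn from 4 organic clusters.
X, _ = make_blobs(n_samples=[80, 10, 5, 5], random_state=0)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # pick k with the highest score
```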
What do you think of seeded LDA? When is it useful, and when is it not?
Can you introduce some more ways to incorporate metadata into topic models? The paper briefly mentions models of linguistic structure, models that account for distances between corpora, and models of named entities; general-purpose methods for incorporating metadata into topic models include Dirichlet-multinomial regression models and supervised topic models. For example, how is the distance between corpora accounted for?
Is it possible to use unsupervised methods on corpora that have rigid structures (e.g., policy documents) but contain relatively few words (less than a million)?
The authors indicate that LDA is the simplest kind of topic model, yet I have read sociology papers using LDA published in leading journals, so LDA seems quite powerful and capable. Are there any other topic modeling techniques we should take a look at?
LDA is useful for comparing topics across different texts; what should we use to analyze how topics change over a long time span? Could we build such an analysis on top of LDA?
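One simple LDA-based option is to average the document-topic proportions by time period; dynamic topic models (Blei and Lafferty) extend LDA to model topic drift directly. A rough sketch of the simple version, assuming scikit-learn and pandas, with invented documents and years:

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["war peace treaty", "trade market tariff",
        "war army treaty border", "market stocks trade growth"]
years = [1990, 1990, 2000, 2000]   # hypothetical publication years

X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)

# Average topic share per year: the trend of each topic over time.
print(pd.DataFrame(theta).groupby(years).mean())
```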
How do we identify the optimal values for the tuning parameters "document-topic sparsity" and "topic-word sparsity"?
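In gensim's LdaModel these correspond to the alpha (document-topic) and eta (topic-word) priors, and passing "auto" asks gensim to learn them from the data during training rather than hand-tuning them. A minimal sketch with an invented toy corpus:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["ball", "game", "team"], ["vote", "law", "court"],
         ["game", "score", "team"], ["law", "senate", "vote"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# alpha="auto" and eta="auto" learn asymmetric priors during training.
lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=10)
print(lda.alpha)   # learned document-topic prior (one value per topic)
```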
Is there any benchmark that compares different non-parametric clustering methods? I wonder if there is any consensus or common practice on "when to use / what to use / use on what" when applying clustering methods to data.
Could you please explain more about metadata? How does it affect the topic model?
It seems like topic models require a LOT of back and forth between researcher and machine to optimize. Is there an automated way to do this, or a best out-of-the-box number of topics to start from?
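One common semi-automated loop is to fit models over a grid of topic counts and keep the one with the best coherence; it does not remove human judgment, but it narrows the search. A rough sketch assuming gensim, with an invented toy corpus (u_mass coherence is used here because it needs only the corpus; c_v is a common choice on real data):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

texts = [["ball", "game", "team"], ["vote", "law", "court"],
         ["game", "score", "team"], ["law", "senate", "vote"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

scores = {}
for k in range(2, 5):
    lda = LdaModel(corpus, id2word=dictionary, num_topics=k, passes=10,
                   random_state=0)
    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                        coherence="u_mass")
    scores[k] = cm.get_coherence()
print(max(scores, key=scores.get), scores)   # best k by coherence
```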
What kinds of considerations should be taken into account in deciding between classification and clustering for a research question? Is there ever reason to use both methods together?
I am also interested in best practices for determining the number of topics for LDA. I would also like to hear more about approaches for comparing the clusters generated by an unsupervised learning method against predetermined classes (assigned by an expert, or arising from 'natural' categories). What would be some interesting applications of such a mixed approach?
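For comparing cluster output against predetermined classes, standard measures include the adjusted Rand index and normalized mutual information, both of which are invariant to label permutation. A minimal sketch with scikit-learn and invented labels:

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

expert = [0, 0, 1, 1, 2, 2]     # hypothetical expert-assigned classes
clusters = [1, 1, 0, 0, 0, 2]   # hypothetical labels from a clustering run

print(adjusted_rand_score(expert, clusters))            # 1.0 = perfect agreement
print(normalized_mutual_info_score(expert, clusters))
```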
Similarly to @zshibing1, I'm wondering about the feasibility of performing topic modeling on documents with rigid structures. Specifically, I'm thinking about executive orders or legislative bills, which tend to focus heavily on procedures and logistics and look similar regardless of policy context. Is there something analogous to TF-IDF in topic modeling that distinguishes between very similar documents?
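One practical trick that plays the role of IDF is to drop terms that appear in a large share of documents before fitting the topic model, so shared procedural boilerplate cannot dominate. A rough sketch with scikit-learn's CountVectorizer, using invented bill-like snippets (max_df=0.5 removes any term appearing in more than half the documents):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["be it enacted that the tariff schedule shall increase",
        "be it enacted that the clean water fund shall expand"]

# max_df drops terms in >50% of documents (e.g., "enacted"),
# analogous to down-weighting high-document-frequency terms in TF-IDF.
X = CountVectorizer(max_df=0.5, stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
```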
One of the main goals of topic modeling is "discovering and exploiting the hidden thematic structure in large archives of text." However, the umbrella scope of topics is shaped by choices made before modeling, so how do we mitigate selection bias in this case?
About LDA and its assumptions: there is a discussion of relaxing the assumptions made by topic modeling algorithms. I was wondering the following: are all topics equally distinct? To rephrase: imagine we have two texts, each composed of two topics; let's say that Airbnb reviews contain discussions about the location and about the listing itself. In the first text the discussion is polarized, without any overlap between the topics, while in the second the two are intertwined. Is there a way LDA can distinguish between the two?
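Two caveats worth noting here: bag-of-words LDA ignores word order, so it cannot see whether the two topics alternate sentence by sentence within a review; what it does expose is each document's topic mixture, whose entropy separates a review dominated by one topic from an evenly blended one. A minimal sketch with invented mixtures:

```python
import numpy as np
from scipy.stats import entropy

theta_polarized = np.array([0.97, 0.03])   # review dominated by one topic
theta_blended = np.array([0.55, 0.45])     # location and listing intertwined

print(entropy(theta_polarized))  # near 0: concentrated on one topic
print(entropy(theta_blended))    # near log(2): an even mix of the two
```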
Post questions here for one or more of our fundamentals readings:
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. "Flat Clustering" and "Hierarchical Clustering." Chapters 16 and 17 in Introduction to Information Retrieval.
Blei, David M. 2012. "Probabilistic Topic Models." Communications of the ACM 55(4): 77-84.