Experiments in identifying someone's interests/knowledge using word embedding & topic modeling
- Hacker News headlines (and comments?): all HN headlines with 10+ comments, from users who've made 10+ comments
- scraped content from the extracted URLs
- use sense2vec
- Wikipedia
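
The Hacker News filter above can be read more than one way. A minimal sketch of one possible reading: keep a headline only if it has at least 10 comments written by users who have themselves made at least 10 comments. The input shapes (`stories` as an id-to-title dict, `comments` as `(story_id, username)` pairs) and the function name are hypothetical, chosen for illustration; they are not an HN API.

```python
from collections import Counter


def select_headlines(stories, comments, min_story_comments=10, min_user_comments=10):
    """Filter headlines by comment activity.

    stories:  dict mapping story_id -> headline (hypothetical shape)
    comments: iterable of (story_id, username) pairs (hypothetical shape)
    """
    comments = list(comments)

    # Users with enough comments overall count as "active".
    per_user = Counter(user for _, user in comments)
    active = {user for user, n in per_user.items() if n >= min_user_comments}

    # Count, per story, only the comments written by active users.
    per_story = Counter(sid for sid, user in comments if user in active)

    return {
        sid: title
        for sid, title in stories.items()
        if per_story[sid] >= min_story_comments
    }
```

Both thresholds are parameters so the cutoff can be tuned if 10 turns out to be too strict or too loose for the amount of data available.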
words2map - word2vec + t-SNE + HDBSCAN
words2map blogpost: Making Sense of Everything with words2map
Visualizing Clusters of Clickbait Headlines Using Spark, Word2vec, and Plotly - built on top of words2map
fastText paper on word representations: Enriching Word Vectors with Subword Information
fastText paper on text classification: Bag of Tricks for Efficient Text Classification
lda2vec - mixes LDA and word2vec
lda2vec paper: Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec
entity2vec - finding related info sources (to expand training data?)
entity2vec paper: Fast and Space-efficient Entity Linking in Queries
topicvec - a different way of mixing LDA with word vectors
topicvec paper: Generative Topic Embedding: a Continuous Representation of Documents
LFTM - a third attempt at mixing LDA and word2vec
LFTM paper: Improving Topic Models with Latent Feature Word Representations
Word Mover's Distance - proposed future work at end of topicvec paper (unsure if it can be used here)
WMD paper: From Word Embeddings To Document Distances
tweet2vec - predicts hashtags via character-level vectors? (probably not useful for this)
tweet2vec paper: Tweet2Vec: Character-Based Distributed Representations for Social Media
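
To help judge whether Word Mover's Distance (listed above) fits this project, here is a minimal sketch of the relaxed variant described in the WMD paper. In this lower bound, each word sends all of its weight to its nearest word in the other document, and the larger of the two one-sided costs is used. The 2-D vectors in `TOY_VECTORS` are made up for illustration; real use would swap in pretrained word2vec or fastText vectors. The function names are likewise hypothetical.

```python
import math

# Hypothetical 2-D toy embeddings, invented for this example only.
TOY_VECTORS = {
    "python": (1.0, 0.0),
    "rust": (0.9, 0.1),
    "compiler": (0.8, 0.3),
    "startup": (0.0, 1.0),
    "funding": (0.1, 0.9),
}


def nbow(tokens):
    """Normalized bag-of-words weights over in-vocabulary tokens."""
    known = [t for t in tokens if t in TOY_VECTORS]
    if not known:
        raise ValueError("no in-vocabulary tokens")
    return {t: known.count(t) / len(known) for t in set(known)}


def one_sided_cost(src, dst):
    """Cost when each source word moves all its weight to its nearest target word."""
    return sum(
        weight * min(math.dist(TOY_VECTORS[s], TOY_VECTORS[t]) for t in dst)
        for s, weight in src.items()
    )


def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD: the larger (tighter) of the two one-sided lower bounds."""
    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))
```

Comparing a user's headlines against a topic's headlines this way would give a rough document-to-document distance without any topic model. The exact WMD needs an optimal transport solver, which this sketch deliberately leaves out.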