Discovering Higher-level Patterns - Challenge #51

Open
jamesallenevans opened this issue Feb 6, 2021 · 14 comments

@jamesallenevans

First, write down three intuitions you have about broad content patterns you will discover in your data. Place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Second, describe the dataset(s) on which you will build an unsupervised model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, or (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise unsupervised strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

@jinfei1125

jinfei1125 commented Feb 18, 2021

Intuitions:
(1) + People's financial concerns are relatively stable over time from 2015 to 2019
(2) Top words in different clusters may overlap
(3) * Housing, cars, retirement, student loans, and taxes are the top five topics
Data:
Posts from subreddit Personal Finance
5 csv files of 1000 posts per year from 2015 to 2019: Download Here

@toecn

toecn commented Feb 19, 2021

People speaking about Latin American politicians that ran for president (2005-2015):

  • Politicians that were successful were embedded in a more diverse array of topics.
  • Left-wing politicians’ topics are mainly nationalist and worker-oriented; right-wing politicians’ topics revolve around security and the economy.
  • Politicians in the center of the political spectrum emerged over a tensioned space of topics because there was a political opportunity to create a “center”.

Corpus del Español: This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries. It was web-scraped in 2015.

Class dataset: Corpus del Español ("SPAN").

@dtanoglidis

Comparison of Airbnb reviews from different places

  • (*) There are strong differences in the way people review their Airbnb experience, and these differences are location-specific.
  • (+) Reviews in popular tourist destinations focus mainly on the place/region. Reviews of listings in larger cities/business destinations focus on the listing (apartment/house) itself.
  • The topics in reviews can mainly be classified in three categories: those related to location, people (description of the locals/hosts) and the amenities of the listing.

Dataset/Download: Inside Airbnb (http://insideairbnb.com/get-the-data.html)

@k-partha

k-partha commented Feb 19, 2021

Dataset: Twitter 'likes' of persons with differing self-identified personalities.

  • (*) Implicit thematic interests of differing personality types significantly vary and show larger divergences as contrast in a personality trait increases.
  • (+) We will see unexpected synergies linking the interests of (seemingly) opposite personality types. These synergies will correspond to Jungian cognitive functions.
  • Interests will significantly change in response to personality trait 'openness'.

Data
TAs please note that the data might have to be reduced in size (while keeping the share of 'type' proportional) in order to be workable for quick analysis.
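
One way to shrink the data while keeping each type's share proportional is a stratified subsample. The following is only a minimal sketch: the record layout, the `type` field name, the labels `type_a`/`type_b`, and the helper `stratified_sample` are all hypothetical and not taken from the actual dataset.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Downsample records while keeping each group's share roughly constant."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one record per group so no type disappears entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy records standing in for the real data (hypothetical labels and text).
records = (
    [{"type": "type_a", "text": f"like {i}"} for i in range(60)]
    + [{"type": "type_b", "text": f"like {i}"} for i in range(40)]
)
subset = stratified_sample(records, key="type", fraction=0.1)
```

Sampling within each group, rather than from the pooled records, is what keeps the 60/40 split of the toy data intact in the reduced set.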

@jcvotava

Subject: Topics and rhetoric change in the Marx-Engels Collected Works (MECW), 1835-1895
Intuitions:

  1. (+) So-called "Late Marx" (esp. MECW Volumes 28-37) and "Young Marx" (esp. Volumes 1-5) have substantial similarity in topics and rhetoric, with perhaps some changes in vocabulary
  2. (*) The content of unpublished MECW materials (letters and poetry, see MECW 1-2, 38-50) differs significantly from published works
  3. Chapters in Capital Vols. I, II, and III are relatively idiosyncratic to each book, with topics more likely to belong within books than to cross books (MECW 35-37).

Data:
Link to a scraped copy of the MECW, Vol 1-49 (386 MB, saved as a .csv). Documents are saved in a single data table, organized by "document" number (i.e. volume number of the MECW) and "subdocument" (i.e. individual texts, letters, chapters, etc. published within each volume). The data needs tokenization, mild cleaning, etc. A key to understanding what's in each volume can be accessed here.
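
The tokenization and mild cleaning mentioned above might look roughly like the sketch below. It assumes the table has columns named `document`, `subdocument`, and `text`; only the first two names come from the description, `text` is a guess, and the inline CSV is a toy stand-in for the real file.

```python
import csv
import io
import re
from collections import defaultdict

# Alphabetic tokens, optionally with one internal apostrophe (e.g. "what's").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and drop punctuation and digits."""
    return TOKEN_RE.findall(text.lower())


def load_tokens(handle):
    """Map (document, subdocument) to the list of tokens in that text."""
    tokens = defaultdict(list)
    for row in csv.DictReader(handle):
        key = (row["document"], row["subdocument"])
        tokens[key].extend(tokenize(row["text"]))
    return dict(tokens)


# Toy rows, not actual MECW content.
sample_csv = io.StringIO(
    "document,subdocument,text\n"
    '1,1,"First text: a Letter!"\n'
    '1,2,"Second text, dated 1844."\n'
)
corpus = load_tokens(sample_csv)
```

Keying on the (document, subdocument) pair keeps individual texts separate, so they can later be grouped by volume or compared one by one.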

@jacyanthis

Intuitions

  1. AI discussants avoid discussing ethics and performance (i.e. AI that is economically efficient and productive) together.
  2. AI ethics topics have become increasingly common over time.*
  3. After AI milestones (e.g. AlphaGo’s victory over world Go champion Lee Sedol in March 2016), ethical frames grow in salience relative to performance frames.+

Dataset
News on the Web (NOW) Davies corpus. There are at least 70,000 articles from 2010 to 2020 that include "artificial intelligence."

@romanticmonkey

Intuitions:
(1) * Film critics speak differently from the general audience.
(2) The cluster of speech patterns of film critics in, e.g., 2015 would be more similar to that of the general audience in 2016.
(3) + The cluster of speech patterns of the general audience in, e.g., 2015 would be more similar to that of the film critics in 2016.

Data:
Rotten Tomatoes critics reviews (https://www.kaggle.com/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset)
Amazon Movie & TV reviews (need to filter TV) (https://nijianmo.github.io/amazon/index.html)

@MOTOKU666

* The understanding of immigrants' benefit differs between academia and the general public
+ We may see a convergence in opinion during the Trump period
The difference in understanding might be quite large at all times

Data:
Data for constructing a comparable opinion set may not be available.

@william-wei-zhu
Copy link

Data: Music lyrics dataset.
https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres

Intuition:
(1) The lyrics cluster by genre
(2) The lyrics cluster by artist
(3) The lyrics cluster by album

@xxicheng

xxicheng commented Feb 19, 2021

Intuitions:
*1. Feminism has always been the theme of Gilmore Girls.
2. Rory's speech style and content change when she graduates from high school and enters college.
3. The frequency of her conversations with different characters also changes.

Dataset: http://www.crazy-internet-people.com/site/gilmoregirls/scripts.html

@Bin-ary-Li

Intuitions

  • (*) The length of the catalog description of an artwork correlates with its auction price.
  • Semantic content (e.g. sentiment) of the catalog description might impact the auction price.
  • (+) The number of specialized noun terms (e.g. other artists' names, art theory jargon, etc.) correlates with the auction price of artworks.

Dataset:
50k artwork entries scraped from Sotheby's website: link

@theoevans1

Intuitions:
I expect more romantic topics in fanfiction than in source material.*
I expect thematic material will be more consistent over time than the sources the fanfiction is based upon.
I expect clusters to be similar across fanfiction stories for different shows.+

Data: Davies TV Corpus and fanfiction scraped using AO3 scraper script (https://github.com/radiolarian/AO3Scraper)

@egemenpamukcu

Intuitions:

  • (+) In structured debates, the topics of the winning teams' speeches will be more closely related to the debate prompt than the losing teams'.
  • (*) Winners of different debates on a similar topic (like climate change) would have roughly the same set of topics in their speeches (topics other than the ones that are directly related to the prompt).
  • Clustering would roughly approximate the winners and losers for debates with similar prompts.

I didn't collect the data on this because this is unrelated to my project, but it can be scraped from Munk Debates and Intelligence Squared websites.

@lilygrier

Intuitions:

  1. (*) Legislation clusters based on topic rather than position (i.e., pro- and anti-environmental regulation will appear in the same cluster).
  2. The styles of legislative writing may have changed over time, and clustering might pick up on this and put bills written around the same time together.
  3. (+) It may be possible to determine bills that actually passed based on clustering. (My instinct is that the language used in the bill has less to do with whether it passed than who was in the House and Senate at the time, but could be!)

Data:
Billsum dataset. Use the us_train_data_final_offical json file in the drive. NOTE: This dataset does not include information about whether the bills passed or not, so that would have to be verified with outside data.
