Exploring Semantic Spaces - Challenge #53

jamesallenevans · 2021-02-20T16:00:38Z

First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise. Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).

jinfei1125 · 2021-02-26T07:29:06Z

Intuitions:
(1) + The words most similar to "finance" in the Personal Finance subreddit corpus differ from the words most similar to "finance" in the wallstreetbets corpus: when people talk about finance, they are talking about different things.
(2) * The Wallstreetbets corpus is different from the Personal Finance corpus.

Data:
Posts from subreddit Personal Finance: Download
Posts from subreddit Wallstreetbets: Download
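
One way to make intuition (1) checkable is to compare the nearest neighbors of "finance" in two separately trained embedding spaces. The snippet below is only a minimal sketch under stated assumptions: the vectors are tiny made-up placeholders rather than trained embeddings, and the helper names (`nearest_neighbors`, `neighbor_overlap`) are hypothetical. Because the two spaces would be trained independently, it compares neighbor sets rather than raw vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, query, k=2):
    """Return the k words closest to `query` by cosine similarity."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), word)
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


def neighbor_overlap(emb_a, emb_b, query, k=2):
    """Jaccard overlap of the query's top-k neighbor sets in two spaces."""
    a = set(nearest_neighbors(emb_a, query, k))
    b = set(nearest_neighbors(emb_b, query, k))
    return len(a & b) / len(a | b)


# Toy, made-up vectors standing in for per-subreddit embeddings (assumption).
personal_finance = {
    "finance": [1.0, 0.0, 0.0],
    "budget": [0.9, 0.1, 0.0],
    "savings": [0.8, 0.2, 0.0],
    "options": [0.0, 1.0, 0.0],
    "yolo": [0.0, 0.0, 1.0],
}
wallstreetbets = {
    "finance": [0.0, 1.0, 0.0],
    "budget": [1.0, 0.0, 0.0],
    "savings": [0.9, 0.0, 0.1],
    "options": [0.1, 0.9, 0.0],
    "yolo": [0.0, 0.8, 0.2],
}

pf_neighbors = nearest_neighbors(personal_finance, "finance")
wsb_neighbors = nearest_neighbors(wallstreetbets, "finance")
overlap = neighbor_overlap(personal_finance, wallstreetbets, "finance")
```

An overlap near 0 would be consistent with the two communities using "finance" in different contexts; an overlap near 1 would count against the intuition.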

k-partha · 2021-02-26T09:09:14Z

Dataset: Twitter likes from profiles with personality labels.
TAs please note: The data has not been cleaned for non-English profiles.

Intuitions:

(*) We would see the likes of users with high openness and low agreeableness (labelled 'intuitive', 'thinking' = 1) differ significantly in semantic content from those with the opposite traits ( 'intuitive', 'thinking' = 0).
(+) The semantic content of likes from the first set of users will align more strongly with 'technology' and 'science' than the second set's. We would see a stronger alignment with less abstract themes (think cooking, animals, sports?) for the second set of users.

romanticmonkey · 2021-02-26T09:10:33Z

Intuition:
(1) * News articles are clustered by categories.
(2) + Fake and real news articles have different linguistic features, and therefore will be differentiable in the semantic space.

Dataset: This dataset is from Perez-Rosas et al. (2017) on fake news detection. It's a small dataset, but it's labeled for both authenticity and news category. However, the fake news pieces were created by AMT, so they might read a bit off compared to actual fake news. (https://data.world/romanticmonkey/perez-rosasfakenews)

Notes: label = 1 for fake news, label = 0 for real news

jacyanthis · 2021-02-26T13:29:02Z

Intuitions

(*) The keywords of artificial intelligence (e.g. artificial intelligence, machine learning, deep learning, algorithms) have become less associated with profit (e.g. money, business, market) and more associated with society (e.g. impact, ethics, racism) over time.
(+) The keywords of artificial intelligence (e.g. artificial intelligence, machine learning, deep learning, algorithms) have become less associated with technology (e.g. blockchain, science, math, software) and more associated with society (e.g. impact, ethics, racism) over time.

Dataset
News on the Web (NOW) Davies corpus. There are at least 70,000 articles from 2010 to 2020 that include "artificial intelligence."
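
A shift in association over time, as in the two intuitions above, could be summarized as a relative association score computed per period: the mean cosine similarity of a keyword to one word set minus its mean similarity to another. This is only a sketch under assumptions: the two-dimensional vectors and period labels are invented placeholders, and `relative_association` is a hypothetical helper, not a method named in the post.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(embeddings, target, attribute_words):
    """Average cosine similarity of `target` to the attribute words present."""
    sims = [
        cosine(embeddings[target], embeddings[w])
        for w in attribute_words
        if w in embeddings
    ]
    if not sims:
        raise ValueError("none of the attribute words are in the vocabulary")
    return sum(sims) / len(sims)


def relative_association(embeddings, target, set_a, set_b):
    """Positive when `target` sits closer to set_a than to set_b."""
    return (mean_similarity(embeddings, target, set_a)
            - mean_similarity(embeddings, target, set_b))


# Toy, made-up embeddings for two periods (assumption, not NOW corpus output).
embeddings_by_period = {
    "2010-2014": {
        "algorithms": [1.0, 0.0],
        "market": [0.9, 0.1],
        "ethics": [0.1, 0.9],
    },
    "2015-2020": {
        "algorithms": [0.0, 1.0],
        "market": [0.9, 0.1],
        "ethics": [0.1, 0.9],
    },
}

society_words = ["ethics"]
profit_words = ["market"]

scores = {
    period: relative_association(emb, "algorithms", society_words, profit_words)
    for period, emb in embeddings_by_period.items()
}
```

With real data, each period would need its own embedding model (or aligned models), and a score rising from negative to positive across periods would be consistent with the keyword moving from profit toward society.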

MOTOKU666 · 2021-02-26T14:23:58Z

intuition:
(1) * Issues concerning immigrants are increasingly related to economic and security issues.
(2) + Pre- and post-9/11 news have different topical orientations toward immigrants, refugees, and specific cultures.

Dataset: COCA News

Bin-ary-Li · 2021-02-26T15:16:08Z

Intuition:

(*) Words about artwork format and dimensions (e.g. "height", "width", "canvas", numbers) will have closer semantic embeddings.
(+) Words commonly seen in the description of expensive artworks have closer semantic representation than words common in the least expensive ones.

Dataset: Sothebys art entry

xxicheng · 2021-02-26T15:44:21Z

Intuitions:
*1. Feminism has always been the theme of Gilmore Girls.
2. Rory's speech style and content change when she graduates from high school and enters college.
3. The frequency of her conversations with different characters also changes.

Dataset: http://www.crazy-internet-people.com/site/gilmoregirls/scripts.html

theoevans1 · 2021-02-26T16:00:01Z

Intuitions:
I expect more romantic content in fanfiction than in source material.*
I expect more positive word associations with marginalized groups in fanfiction than in source material.+

Data: Scraping data from Archive of Our Own (http://archiveofourown.org/) using this script (https://github.com/radiolarian/AO3Scraper), along with the Davies TV Corpus

hesongrun · 2021-02-26T16:00:08Z

Intuitions:

There will be a surge of research in the relevant area after each Nobel Prize in Economics is announced.
Economists also follow real-world events closely. For example, at present, we expect to see a lot of Covid-related research.

Dataset: web of science Econ journal article abstracts.

RobertoBarrosoLuque · 2021-02-26T16:14:38Z

Two intuitions:
1. We can deduce the relationship between presidents by using the difference between embedding vectors of certain keywords. ++
2. The valence of words that are most similar (based on embeddings and cosine) to climate and regulation is determined by a president's party affiliation. **
Dataset:
Presidential speeches corpus from https://millercenter.org/the-presidency/presidential-speeches
Can be scraped with: https://github.com/RobertoBarrosoLuque/ContentAnalysisPresidentialRhetoric/blob/main/ScrapeSpeeches/scrape_miller.py

william-wei-zhu · 2021-02-26T16:22:57Z

hypothesis: we can predict events of companies (e.g. IPO, growth, bankruptcy, CEO firing) by their glassdoor company reviews.

Data: Glassdoor company review data

toecn · 2021-02-26T16:44:18Z

People speaking about Latin American politicians that ran for president (2005-2015):

Politicians that were successful were embedded in a more diverse array of topics.
Left-wing politicians’ topics are mainly nationalist and worker-oriented; right-wing politicians’ topics revolve around security and the economy.
Politicians in the center of the political spectrum emerged over a tension-filled space of topics because there was a political opportunity to create a “center”.

Corpus del Español: This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries. It was web-scraped in 2015.

Class dataset: Corpus del Español ("SPAN").

jcvotava · 2021-02-26T16:58:49Z

Subject: Topics and rhetoric change in the Marx-Engels Collected Works (MECW), 1835-1895
Intuitions:

(+) So-called "Late Marx" (esp. MECW Volumes 28-37) and "Young Marx" (esp. Volumes 1-5) have substantial similarity in topics and rhetoric, with perhaps some changes in vocabulary
(*) The content of unpublished MECW materials (letters and poetry, see MECW 1-2, 38-50) differs significantly from published works
Chapters in Capital Vols. I, II, and III are relatively idiosyncratic to each book, with topics more likely to belong within books than to cross books (MECW 35-37).

Data:
Link to a scraped copy of the MECW, Vol. 1-49 (386 MB, saved as a .csv). Documents are saved in a single data table, organized by "document" number (i.e. volume number of the MECW) and "subdocument" (i.e. individual texts, letters, chapters, etc. published within each volume). The data needs tokenization, mild cleaning, etc. A key to understanding what's in each volume can be accessed here.

dtmlinh · 2021-02-26T17:01:12Z

Hypothesis: There are overlapping topics between presidential speeches and executive orders, and that overlap differs by president.

Data: Presidential speeches corpus from https://millercenter.org/the-presidency/presidential-speeches and Executive orders from https://www.federalregister.gov/presidential-documents/executive-orders

egemenpamukcu · 2021-02-26T17:37:55Z

Intuitions:

(+) In structured debates, the winning teams' arguments will be centered around the debate topic.
(*) Winners of different debates on a similar topic (like climate change) would be more closely aligned to each other than the losers.

I didn't collect the data on this because this is unrelated to my project, but it can be scraped from Munk Debates and Intelligence Squared websites.

lilygrier · 2021-02-26T19:20:53Z

Intuitions:
(*) Republican presidents' rhetoric on energy policy will be more aligned with talking about jobs (so as to defend the coal industry), whereas Democratic presidents' rhetoric on energy policy will have more to do with climate change and the environment.
(+) Language used in energy-related bills will be similar to language used in health-related bills.
Corpus: Presidential speeches corpus from the Miller Center
Can be scraped with: this code
Contact me at [email protected] if you want the corpus of executive orders!
