
Deep Classification, Embedding & Text Generation - Challenge #54

jamesallenevans opened this issue Feb 27, 2021 · 15 comments

@jamesallenevans

First, write down two intuitions you have about broad content patterns you will discover about your data as encoded within a pre-trained or fine-tuned deep contextual (e.g., BERT) embedding. These can be the same as those from last week, or they can evolve based on last week's explorations and the novel possibilities that emerge from dynamic, contextual embeddings (e.g., they could be about text generation from a tuned model). As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported).

Second, describe the dataset(s) you would like to fine-tune or embed within a pre-trained contextual embedding model to explore these intuitions. Note that this need not be large text: you could simply encode a few texts in a pretrained contextual embedding and explore their position relative to one another and the semantics of the model. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, or (d) an invitation for a TA to contact you about it.

Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others.)

@jamesallenevans jamesallenevans changed the title Deep Classification, Embedding & Text Generation Deep Classification, Embedding & Text Generation - Challenge Feb 27, 2021
@jacyanthis

Intuitions

  1. (*) Humans rate GPT-3 output as more creative than human output for a school-essay-style prompt (but not higher quality overall).
  2. (+) BERT output is a better input for BERT text generation than human-edited text meant to convey the same content as the BERT output.

Dataset
Pre-trained GPT-3, BERT, and human writers (e.g., students, MTurk workers)

@jcvotava

jcvotava commented Mar 5, 2021

Intuitions:

  1. "Young" Marx texts display robust differences along a number of key dimensions as compared to "Mature" Marx texts. (+)
  2. Vectorization can help parse out the different usages of words that Marx uses in multiple (often confusing) ways: 'value,' 'labor,' and 'history,' for example. (*)

Data:
Google Drive folder: contains two pre-trained models (the first using Bhargav's code for the labelling task, the second GPT-2). The data consists of sentences sampled from Early Marx (Volume 3), Late Marx (Volume 27), and Ambiguous (Late??) Marx (Volume 35). Both data CSVs in the folder share the same schema, but one is a tiny subset of the other.
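The "parse out different usages" intuition is often tested by clustering the contextual vector of each occurrence of the target word. A minimal sketch, with simulated vectors standing in for real per-occurrence hidden states of "value" (in practice each row would come from the fine-tuned BERT/GPT-2 model, and the cluster labels and locations below are toy assumptions):

```python
# Sketch: separate two usages of one word (e.g., Marx's "value") by
# clustering the contextual vector of each occurrence. Two noisy toy
# clusters simulate the per-occurrence hidden states.
import numpy as np

rng = np.random.default_rng(0)
econ_use = rng.normal(loc=+1.0, scale=0.3, size=(20, 8))   # e.g., exchange-value contexts
moral_use = rng.normal(loc=-1.0, scale=0.3, size=(20, 8))  # e.g., normative contexts
X = np.vstack([econ_use, moral_use])


def kmeans(X: np.ndarray, k: int = 2, iters: int = 20) -> np.ndarray:
    """Tiny Lloyd's k-means with farthest-point initialization.

    Returns a cluster label per occurrence vector.
    """
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=-1), axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = X[labels == c].mean(axis=0)
    return labels


labels = kmeans(X)
# With well-separated usages, the two halves of X land in different clusters;
# inspecting the sentences in each cluster reveals the word senses.
```

Reading a handful of sentences from each cluster is usually enough to see whether the split tracks the economic vs. normative senses.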

@MOTOKU666

Intuitions:
Following last week's questions:
(1) (*) Coverage of immigrants is increasingly tied to economic and security issues.
(2) (+) Pre- and post-9/11 news have different topical intentions towards immigrants, refugees, and specific cultures.
I have the following intuitions for this week:
(1) (*) Republican and Democratic media may have different evaluative criteria towards immigrants/refugees.
(2) (+) It is possible to identify the target audiences that different media focus on for immigrant issues.

Dataset: COCA News (I have the data for 2002 and could start from this): https://drive.google.com/file/d/1rzcTmYxeT5zLRG1UJnyce2izIIM324zi/view?usp=sharing

@jinfei1125

Intuitions:
(1) (+) The words most similar to "finance" in the subreddit Personal Finance corpus differ from those most similar to "finance" in the subreddit Wallstreetbets corpus: when people talk about finance, they are talking about different things.
(2) (*) The Wallstreetbets corpus differs from the Personal Finance corpus.

Data:
Posts from subreddit Personal Finance: Download
Posts from subreddit Wallstreetbets: Download

@xxicheng

xxicheng commented Mar 5, 2021

*1. Feminism has always been a theme of Gilmore Girls.
+2. Rory's speech style and content change when she graduates from high school and enters college.

Dataset:
http://www.crazy-internet-people.com/site/gilmoregirls/scripts.html

@hesongrun

Intuitions:
(1) (*) Given the low signal-to-noise ratio in the financial market, incorporating BERT embeddings may not be able to improve our ability to forecast the stock market returns.
(2) (+) BERT embeddings can beat human traders in digesting textual information.

Data:
Financial news, BERT, and stock returns from CRSP

@romanticmonkey

Intuition:

  1. (*) Fake news and real news occupy different semantic spaces, with fake news making stranger associations between keywords. (from the week 7 homework)
  2. (+) Political partisanship can be differentiated by news source.

Dataset : https://data.world/romanticmonkey/syrianwarfakenews

@theoevans1

Intuitions:
I expect romantic language to be generated by a model trained on fanfiction, even if the input suggests a different genre.*
I expect more positive word associations with marginalized groups in fanfiction than in the source material.+

Data: Scraping data from Archive of Our Own (http://archiveofourown.org/) using this script (https://github.com/radiolarian/AO3Scraper), along with the Davies TV Corpus

@william-wei-zhu

  1. (*) I expect that a large percentage of negative employee reviews predicts poor company performance.
  2. (+) In certain industries, a large percentage of negative reviews is associated with good corporate performance, while positive reviews are associated with poor performance.

data: Glassdoor company review database.

@RobertoBarrosoLuque

RobertoBarrosoLuque commented Mar 5, 2021

Two intuitions:

  • (+) We can deduce the relationship between presidents by using the difference between embedding vectors of certain keywords.
  • (*) The valence of the words most similar (based on embeddings and cosine similarity) to "climate" and "regulation" is determined by a president's party affiliation.

Dataset:
Presidential speeches corpus from https://millercenter.org/the-presidency/presidential-speeches
Can be scraped with: https://github.com/RobertoBarrosoLuque/ContentAnalysisPresidentialRhetoric/blob/main/ScrapeSpeeches/scrape_miller.py
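The cosine-based comparison behind these intuitions can be sketched as below. The vocabulary, the random vectors, and the two-space setup are toy assumptions; real vectors would come from embeddings trained (or fine-tuned) separately on each party's speeches.

```python
# Sketch: compare the nearest neighbours of a keyword (e.g., "climate")
# under two embedding spaces, one per sub-corpus. Toy vocabulary and
# random vectors stand in for real per-corpus embeddings.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["climate", "regulation", "hoax", "crisis", "jobs", "science", "tax"]
space_a = {w: rng.standard_normal(16) for w in vocab}  # e.g., one party's corpus
space_b = {w: rng.standard_normal(16) for w in vocab}  # e.g., the other party's corpus


def neighbours(space: dict, query: str, k: int = 3) -> list:
    """Top-k words by cosine similarity to `query`, excluding the query itself."""
    q = space[query]
    sims = {
        w: float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))
        for w, v in space.items()
        if w != query
    }
    return sorted(sims, key=sims.get, reverse=True)[:k]


# Divergent neighbour sets across the two spaces would support the
# partisanship intuition; heavy overlap would undercut it.
overlap = set(neighbours(space_a, "climate")) & set(neighbours(space_b, "climate"))
print(overlap)
```

The same `neighbours()` helper works for the keyword-difference idea too: subtract the two corpora's vectors for a keyword and inspect what the difference vector is closest to.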

@egemenpamukcu

Intuitions:

(+) In structured debates, the winning teams' arguments will be centered around the debate topic.
(*) Winners of different debates on a similar topic (like climate change) will be more closely aligned with each other than the losers are.

I haven't collected data for this because it is unrelated to my project, but it can be scraped from the Munk Debates and Intelligence Squared websites.

@Bin-ary-Li

Intuitions

(*) BERT provides unprecedented performance on this dataset compared to any other model used so far.
(+) For this dataset, BERT embeddings might not be as informative as LSTM embeddings.

Dataset:
Sothebys Dataset (shared in previous weeks)

@lilygrier

Intuitions:

  1. (+) When trained on presidential speeches, GPT-3 generates text that readers perceive as convincing (this would require human ratings).
  2. (*) Word embeddings could help us predict whether an executive order later gets revoked (above and beyond the president's party affiliation in relation to successive presidents').

Datasets:
Presidential speeches corpus from https://millercenter.org/the-presidency/presidential-speeches
Can be scraped with: https://github.com/RobertoBarrosoLuque/ContentAnalysisPresidentialRhetoric/blob/main/ScrapeSpeeches/scrape_miller.py
Email me at [email protected] for the dataset of executive orders!

@k-partha

k-partha commented Mar 5, 2021

Intuitions:

  1. (*) Embeddings from high-openness users would contain more abstract concepts than those from low-openness users.
  2. (+) Embeddings from highly conscientious users would contain more action verbs than those from low-conscientiousness users.

Dataset: Twitter personality dataset (shared last week)

@toecn

toecn commented Mar 5, 2021

People speaking about Latin American politicians that ran for president (2005-2015):

  • (*) Politicians who were successful were embedded in a more diverse array of topics.
  • (+) Left-wing politicians were embedded mainly in nationalist, worker-oriented sections of the spectrum; right-wing politicians' topics revolve around security and the economy.

Corpus del Español: This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries. It was web-scraped in 2015.

Class dataset: Corpus del Español ("SPAN").
