Exploring Semantic Spaces - Challenge #53
Comments
Intuitions: Data: |
Dataset: Twitter likes from profiles with personality labels. Intuitions:
|
Intuition: Dataset: This dataset is from Perez-Rosas et al. (2017) on fake news detection. It is small, but it is labeled for both authenticity and news category. However, the fake news pieces were written by AMT workers, so they may read a bit off compared to actual fake news. (https://data.world/romanticmonkey/perez-rosasfakenews) Notes: label = 1 for fake news, label = 0 for real news |
Intuitions
Dataset |
intuition: Dataset: COCA News |
Intuition:
Dataset: Sothebys art entry |
Intuitions: Dataset: http://www.crazy-internet-people.com/site/gilmoregirls/scripts.html |
Intuitions: Data: Scraping data from Archive of Our Own (http://archiveofourown.org/) using this script (https://github.com/radiolarian/AO3Scraper), along with the Davies TV Corpus |
Intuitions:
Dataset: Web of Science economics journal article abstracts. |
|
Hypothesis: We can predict company events (e.g. IPO, growth, bankruptcy, CEO firing) from their Glassdoor company reviews. Data: Glassdoor company review data |
People speaking about Latin American politicians who ran for president (2005-2015):
Corpus del Español: This corpus contains about two billion words of Spanish, taken from about two million web pages from 21 different Spanish-speaking countries. It was web-scraped in 2015. Class dataset: Corpus del Español ("SPAN"). |
Subject: Topics and rhetoric change in the Marx-Engels Collected Works (MECW), 1835-1895 (+) So-called "Late Marx" (esp. MECW Volumes 28-37) and "Young Marx" (esp. Volumes 1-5) have substantial similarity in topics and rhetoric, with perhaps some changes in vocabulary Data: |
Hypothesis: There are overlapping topics between presidential speeches and executive orders, and that overlap differs by president. Data: Presidential speeches corpus from https://millercenter.org/the-presidency/presidential-speeches and Executive orders from https://www.federalregister.gov/presidential-documents/executive-orders |
Intuitions: (+) In structured debates, the winning teams' arguments will be centered around the debate topic. I didn't collect data on this because it is unrelated to my project, but it can be scraped from the Munk Debates and Intelligence Squared websites. |
Intuitions: |
First, write down two intuitions you have about broad content patterns you will discover in your data. These can be the same as those from last week...or they can evolve based on last week's explorations and the novel possibilities that emerge from continuous, high-dimensional embeddings. As before, place an asterisk next to the one you expect most firmly, and a plus next to the one that, if true, would be the biggest or most important surprise to others (especially the research community to whom you might communicate it, if robustly supported). Note that these expectations become the basis of abduction--to condition your surprise.

Second, describe the dataset(s) on which you will build an embedding model to explore these intuitions. Then place (a) a link to the data, (b) a script to download and clean it, (c) a reference to a class dataset, (d) an invitation for a TA to contact you about it, OR (e) a brief explanation why the data cannot be made available. Please do NOT spend time/space explaining the precise embedding or analysis strategy you will use to explore your intuitions.

(Then upvote the 5 most interesting, relevant and challenging challenge responses from others).