Sampling, Crowd-Sourcing & Reliability - Challenge #49
Research Question: What is the time trend of people's anxiety about personal finance and disposable income? Dataset: 'Hot' articles in the Personal Finance subreddit (could be made available to class this week for evaluation). Data note: the sample size is only 926 and the time period runs from 2021-01-03 to 2021-01-28. I tried to expand the time period using the PRAW package but haven't figured out how; since there are only 926 hot or new articles, I suspect the number of articles under each subreddit listing (hot, new, top, etc.) is capped at about 1,000. I would appreciate any advice on scraping the whole subreddit (one possible workaround is sketched below). Measurement: The distribution of words; the frequency of words such as housing, investment, debt, budgeting, tax, and so on. Bias: Generalization bias, since most Reddit users are young people.
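A minimal sketch of one workaround, assuming the third-party psaw package (a wrapper around the Pushshift archive, which is not subject to Reddit's ~1,000-item listing cap); the date range, field list, and limit here are illustrative, and Pushshift's coverage can vary:

```python
import datetime as dt
from psaw import PushshiftAPI  # pip install psaw

api = PushshiftAPI()

# Pushshift indexes historical submissions, so we can page through
# an arbitrary date range instead of Reddit's capped listings.
start = int(dt.datetime(2020, 1, 1).timestamp())
end = int(dt.datetime(2021, 1, 28).timestamp())

submissions = api.search_submissions(
    subreddit="personalfinance",
    after=start,
    before=end,
    filter=["id", "title", "selftext", "created_utc", "score"],
    limit=10000,  # illustrative cap; raise or drop as needed
)

rows = [s.d_ for s in submissions]  # .d_ holds each result's raw field dict
```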
Question: How do historical events as narrated in history textbooks and popular synthetic histories relate to specialist historical research conducted by academic historians?
Question: Can we predict the 'winner' of a debate in the eyes of an audience? Corpus: Intelligence Squared and Munk Debates debate transcripts (not available as a corpus yet), plus audience votes on the debated statement both before and after the debate (to train the algorithm). Measurement: Debates with a declared winner (measured by the difference in audience votes) can be used to measure accuracy; use of vocabulary, grammar, and positivity/negativity in language can be introduced as predictors (a baseline sketch follows below). Bias: It would be tough to obtain a large corpus of debate transcripts, so there is a risk of overfitting the training data. The sample would represent only debate platforms that share transcript text. Moreover, results of academic debates may not generalize well to debates on political and social issues, where partisanship can be a factor.
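A minimal baseline sketch for the prediction task, assuming hypothetical variables: `transcripts`, a list of per-side transcript strings, and `won`, a binary label for whether that side gained audience votes. With a small corpus, cross-validated scores are the honest metric:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# tf-idf over unigrams and bigrams captures vocabulary choice;
# richer grammar/sentiment features could be appended later.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# 5-fold cross-validation guards (partially) against overfitting
# the small training corpus noted in the bias discussion.
scores = cross_val_score(pipe, transcripts, won, cv=5)
print(scores.mean())
```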
Research question: Over the past century, has it become easier for kids from middle-class families to engage in high culture? Corpus: Biographies of major American orchestra musicians on the Stokowski website, and the musicians' Wikipedia pages (if available). Measurement: The distribution of parental occupations. Bias: A biography is more likely to mention the parents' occupations if the parents were musicians, too.
Question: How much can machines learn to predict stock returns from Chinese textual data? Corpus: Company announcements from Juchao.com, the official information disclosure platform designated by the China Securities Regulatory Commission (not available at present; I am still cleaning the data). Measurement: We rely on a bag-of-words representation. There are three steps in constructing document sentiment: (1) screen for sentiment-charged words (sketched below); (2) get term weights via supervised topic modeling; (3) aggregate word-level sentiment into article-level sentiment. Finally, we will build a trading strategy over stock portfolios by linking article sentiment to stock returns. Bias: (Possibly) the unbalanced nature of stock announcements: some stocks may make more announcements than others, so the resulting strategy would be dominated by those stocks, and announcements may also be distributed unevenly across time.
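A minimal sketch of the screening step (1), assuming two hypothetical, aligned variables: `docs`, the announcement texts, and `ret`, the post-announcement stock returns. Words whose documents disproportionately precede positive (or negative) returns are treated as sentiment-charged:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Chinese text needs word segmentation first (e.g., with jieba) so that
# tokens arrive space-separated; that preprocessing is assumed here.
vec = CountVectorizer(min_df=20, binary=True)
X = vec.fit_transform(docs)                 # document-term occurrence matrix
pos = (np.asarray(ret) > 0).astype(float)   # 1 if the aligned return was positive

# For each word j, f_j = share of documents containing j that precede a
# positive return; words with f_j far from 0.5 are "sentiment charged".
doc_freq = np.asarray(X.sum(axis=0)).ravel()
f = np.asarray(X.T @ pos).ravel() / doc_freq
charged = np.argsort(np.abs(f - 0.5))[::-1][:200]
print(vec.get_feature_names_out()[charged[:20]])
```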
Question: Are new CEOs appointed from outside more likely to change company culture than new CEOs promoted from within? Data: Glassdoor company review data. Bias: Glassdoor only contains enough reviews for large companies; it is difficult to collect sufficient reviews for smaller companies.
Question: How do fanfiction texts differ from their source material in terms of diversity and inclusion? Dataset: Davies TV and Movie Corpora compared to stories from https://www.fanfiction.net/ and http://archiveofourown.org/ Measurement: Use of inclusive language (terminology related to gender/LGBT identities, race, disability, etc.), and the words used to describe characters belonging to different groups (one possible length-normalized measure is sketched below). Bias: The classification of which words count as inclusive, and the classification of character-describing words as positive or negative.
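A minimal sketch of a length-normalized lexicon rate; the term list is purely illustrative, and curating it is exactly the bias flagged above:

```python
import re

# Stand-in lexicon -- a real study would curate this carefully.
INCLUSIVE_TERMS = {"nonbinary", "transgender", "genderqueer",
                   "bisexual", "wheelchair", "deaf", "autistic"}

def inclusive_rate(text: str, lexicon: set) -> float:
    """Share of tokens drawn from the lexicon, normalized by text length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(token in lexicon for token in tokens)
    return hits / max(len(tokens), 1)

# Usage: compare the rate for a source script against the mean rate
# over fanfiction texts written for the same fandom.
```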
Research question: How does brand-related user-generated content (UGC) differ across theoretically categorized social media platforms? In other words, do conceptual categorizations of social media platforms in fact influence brand-related UGC and consumer engagement? Source: Social media content scraped via public APIs, across differently categorized platforms including Facebook (relationship media), Twitter (self-media), Instagram (creative outlets), and Reddit (collaboration platforms). Measurement: Use sentiment analysis and topic modeling to compare differences between UGC from the different channels (see the topic-model sketch below). Bias: Sampling bias, because the data collected from the different platforms cannot be fully inclusive, and the results may also depend on the time of data collection.
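A minimal topic-model sketch for the cross-platform comparison, assuming a hypothetical `posts_by_platform` dict mapping each platform name to its list of scraped post texts:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Pool all posts, remembering each post's platform label.
texts, labels = [], []
for platform, posts in posts_by_platform.items():
    texts.extend(posts)
    labels.extend([platform] * len(posts))

vec = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
X = vec.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=20, random_state=0)
theta = lda.fit_transform(X)          # per-post topic proportions

# First-pass comparison: mean topic mixture per platform.
labels = np.array(labels)
for platform in posts_by_platform:
    print(platform, theta[labels == platform].mean(axis=0).round(3))
```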
Research Question: Have (non-professional) movie and TV reviewers changed their focus of discussion over time? Source: Amazon Movies and TV reviews, 1996-2018 (https://nijianmo.github.io/amazon/index.html) Measurement: N-gram features signifying the focus of the content (e.g., synopsis, quality, plot), computed by year: top-frequency words, tf-idf, and topic modeling. Bias: (1) the selection of films: the collection might not cover all genres, and each year might have different proportions of genres; (2) a biased user population: this dataset covers only Amazon movie reviewers, not those active on IMDB, RottenTomatoes, etc.
Research Question: How does science respond to the COVID-19 pandemic? In particular, how do the diffusion dynamics of COVID-19-related knowledge change over time, and what are the relationships between publications? Source: CORD-19 dataset. Measurement: topic modeling, cluster analysis, word2vec (a sketch follows below), network analysis (maybe a random diffusion model such as IRN). Bias: (1) the CORD-19 dataset does not cover all publications, and collecting papers from Semantic Scholar does not represent the population of publications; (2) data on some marginal topics might be very sparse.
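A minimal word2vec sketch with gensim, assuming a hypothetical `sentences` variable holding tokenized abstracts from the CORD-19 metadata file (preprocessing not shown):

```python
from gensim.models import Word2Vec

# sentences: e.g. [["coronavirus", "binds", "ace2", ...], ...]
model = Word2Vec(sentences, vector_size=200, window=5,
                 min_count=10, workers=4, epochs=5)

# Nearest neighbours as a quick sanity check on the learned space;
# retraining per time slice would expose how neighbourhoods drift.
print(model.wv.most_similar("transmission", topn=10))
```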
Research Question: How have the social and economic themes associated with the discourse on cryptocurrencies and decentralized economies evolved over the past year? Source: Reddit forums: https://www.reddit.com/r/CryptoCurrency/, https://www.reddit.com/r/Bitcoin/ Measurement: Topic modeling and n-gram frequencies (an n-gram-trend sketch follows below). Bias: Discourse on Reddit disproportionately represents the views of younger white males; other social media could supply discourse markers from more inclusive/different demographics.
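A minimal sketch of n-gram frequency trends, assuming a hypothetical DataFrame `df` with one row per post and columns `created_utc` (epoch seconds) and `text`; the tracked terms are illustrative and assume they survive the `min_df` cutoff:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df["month"] = pd.to_datetime(df["created_utc"], unit="s").dt.to_period("M")

vec = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
X = vec.fit_transform(df["text"])

# Dense conversion is fine at sketch scale; stay sparse for large corpora.
counts = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
counts["month"] = df["month"].values

# Mean per-post frequency of a few theme markers, month by month.
print(counts.groupby("month")[["inflation", "mining", "regulation"]].mean())
```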
Research Question: What transformations have taken place in left-wing ideology and discourse as a function of time, in particular across the First International vs. the Second International vs. their modern inheritors (e.g., the Frankfurt School)? Source: Marxists.org archives - https://www.marxists.org/archive/index.htm Measurement: Topic modeling, n-gram frequency, changes in the usage of words (part-of-speech changes and/or changes in co-occurring words). Bias: The corpus is a "convenience" sample in that it represents texts which have already been digitized and uploaded to one particular archive, so there may be latent bias inherent in the act of curation itself. (One mitigating factor is that these texts probably tend to be the most popular/impactful texts of the discourse, an assertion which could probably be supported empirically.)
RQ: How does the Latino paradox (Latino/Hispanic Americans are generally healthier than others even though they have relatively poor education and SES) look these days, especially for immigrants? Would we see a difference in their diet, habits, and behaviors?
Research Question: How have trends in American concerns about the COVID-19 pandemic changed? Datasets: Measurement: The frequency of words. Bias: Twitter users in the United States represent only a fraction of the total U.S. population, and the characteristics of Twitter users do not necessarily reflect those of all Americans.
Research Question: What is the relationship between news media coverage and presidential speeches on the topic of climate change? Does this relationship vary across news sources and over time? Dataset: The NOW Corpus (a sample of it) and a US presidential speeches corpus (https://millercenter.org/the-presidency/presidential-speeches) Measurement: frequency of language for vs. against climate-change mitigation, positive vs. negative coverage, and clustering of coverage sentiment by news source. Bias: News articles in the NOW Corpus are heavily skewed toward pro-mitigation coverage, so drawing a balanced sample from this corpus is tricky.
Research question: What is the relationship between news coverage, presidential rhetoric, and approval ratings? Dataset: NOW Corpus, US presidential speeches corpus. Measurement: Sentiment of news coverage toward presidents (a crude lexicon-based baseline is sketched below). Bias: Sentiment models are still works in progress, and designing an objective entity-targeted sentiment algorithm might result in biased sentiment scores.
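A crude baseline sketch for entity-targeted sentiment using NLTK's VADER: score only the sentences that mention the president by name. Note that VADER is tuned for social media rather than news prose, which is one concrete form of the model bias noted above. `sentences` and the name string are hypothetical stand-ins:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

president = "Roosevelt"  # stand-in; substitute the president of interest
target = [s for s in sentences if president in s]

# Compound score in [-1, 1] per mentioning sentence, then averaged.
scores = [sia.polarity_scores(s)["compound"] for s in target]
print(sum(scores) / len(scores))
```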
Research question: What is the relationship between presidential speeches related to climate change/energy policy and climate legislation introduced during various presidencies? Working with @RobertoBarrosoLuque and @dtmlinh, hence the similarities in topics (but different angles). Dataset: I downloaded this corpus of presidential speeches. It includes one folder for each president and a .txt file for each speech, so it requires some wrangling to get into a single file and is probably a little clunky for today's exercises. For legislation, I've used this corpus of congressional bills, which is easy to download and could be used in class. Measurement: Topic modeling, human perceptions of the extent to which rhetoric prioritizes fighting climate change, and n-gram frequencies and collocations (e.g., energy security/independence vs. renewable energy; a collocation sketch follows below). Bias: I'm especially concerned about sampling bias. If I were to choose representative samples of a president's climate change rhetoric by hand, it would be impossible for me not to choose texts that confirm my beliefs about that president's policies (especially for presidents I perceive as anti-climate). The details here might also be more subtle, as no president is going to say "let's destroy the earth"; it might come down to emphasizing jobs in the coal industry and the US energy economy more than the harms from fossil fuels. Sampling only explicitly climate-focused speeches may not pick up on these things, and I'm not sure how to account for this.
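A minimal collocation sketch with NLTK, assuming a hypothetical `tokens` list of lowercased word tokens from one president's energy/climate speeches:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)        # ignore rare pairings
measures = BigramAssocMeasures()

# Top bigrams by pointwise mutual information -- e.g., does
# "energy independence" outrank "renewable energy" for this president?
print(finder.nbest(measures.pmi, 20))
```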
RQ: 1) Has political speech by politicians on Twitter evolved into more successful forms? 2a) In what ways are successful speeches similar and different? 2b) In what ways are unsuccessful speeches similar and different? Dataset: Twitter speech by presidential candidates in Colombia (2010-2018). Measurement: Biases: the measures of engagement might mean something different across contexts, and electoral outcomes can be influenced by many other variables.
Question: What language/themes are propagated by extremist-affiliated blogs, social media accounts, and other news outlets? And how might these trends seep into and be reflected in online civic discourse? Dataset: Social media platforms via public APIs (incl. Twitter, YouTube, FB) Measurement: topic models, word embeddings, sentiment, and unique word frequencies Biases: confounding factors, multiple word definitions, user populations
Question: Do different social science fields exhibit different political ideologies, such as right vs. left or liberal vs. conservative? And are there any time trends that seem to occur across different social science fields? Dataset: The academic paper section of COCA, or other corpora that can be found online. Measurement: If we can identify sets of words associated with different political ideologies, word frequency might be a good measure. Biases: If we want to identify any causal relation, we might need to take a broader view of, say, changes in news or other media as well.
First, pose a research question you would like to answer (in one artfully worded sentence... ending with a question mark). This could be the same question you posed for the first week's assignment, or a new one that captures where your project is moving (hopefully toward your final project).

Second, in a single-sentence list, describe all of the datasets and selections (e.g., REDDIT comments and responses from r/DonaldTrump through Jan. 8, when banned). This could also be the same as articulated in the first or second week; parenthetically note whether it could be made available to class this week for evaluation (not required... but if you offer it, you might get some free work done!).

Third, in a single sentence, describe one measurement you will use to assess your question with your dataset/corpora.

Fourth, describe one or more biases resulting from your sample or your measurement that you would like your analysis to overcome. Please do NOT spend time/space explaining how you will de-bias or counter-bias your sample or your measure.

(Then upvote the 5 most interesting, relevant, and challenging challenge responses from others.)