Counting Words & Phrases - Challenge #48
On February 4, 2014, Satya Nadella (Chicago Booth alum) became the new CEO of Microsoft. Under Nadella's leadership, Microsoft underwent a cultural transformation that also translated into improved company performance. In his book, Hit Refresh, Nadella articulates two major changes in Microsoft's company culture since he became CEO:
I hypothesize that Microsoft's Glassdoor employee reviews reflect these two cultural changes. Please help me test it. You can download the csv file of Microsoft's Glassdoor review from 2008 to 2018 here. (In the Github page, click 'Download' or 'View raw'. Then right click at blank spaces in the page to 'Save As...') The review data is already grouped by year. |
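One way to start on a test like this is to track how often a set of culture-related keywords appears per year, normalized by the number of tokens in that year's reviews. The following is only a minimal sketch: the keyword set, the `(year, text)` row layout, and the sample reviews are all hypothetical placeholders, since neither the CSV's column names nor the two cultural changes are spelled out here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_rate_by_year(rows, keywords):
    """Return {year: keyword hits per 1,000 tokens}.

    rows is an iterable of (year, review_text) pairs; this layout is an
    assumption, not the actual structure of the Glassdoor CSV.
    """
    keywords = set(keywords)
    hits = Counter()
    totals = Counter()
    for year, text in rows:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += sum(1 for token in tokens if token in keywords)
    return {year: 1000 * hits[year] / totals[year]
            for year in sorted(totals) if totals[year]}


# Made-up reviews and made-up keywords, purely for illustration.
sample_rows = [
    (2012, "Politics everywhere and teams compete with each other"),
    (2017, "Teams collaborate and people learn from each other"),
]
rates = keyword_rate_by_year(sample_rows, {"collaborate", "learn"})
print(rates)
```

Normalizing by token count matters because the number and length of reviews will likely differ across years, so raw counts alone could mislead.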
Hunch: While newspaper mentions of "artificial intelligence" have increased throughout the 2010s, mentions of "artificial intelligence" alongside ethics keywords (e.g. bias, fairness, racism, transparency) increased later, such as after 2016. Reasoning: I have heard this from a number of old-timers in the field of AI, and it makes intuitive sense that when a new technology first gets discussed, people are mostly just excited about its potential and novelty. When it becomes more commonplace and well-known (e.g. AI in 2016), people start to have more complex, nuanced discussions and worries about the technology. Corpus: News on the Web, available in the Davies corpora Dropbox (but it may take too long for people to download and decipher during class) |
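A rough sketch of the count this hunch calls for: among articles that mention "artificial intelligence", what share per year also contains at least one ethics keyword? The ethics terms come from the hunch itself; the `(year, text)` input layout and the toy articles are assumptions for illustration and do not reflect the NOW corpus format.

```python
import re
from collections import Counter

# Ethics keywords listed in the hunch above.
ETHICS_TERMS = frozenset({"bias", "fairness", "racism", "transparency"})


def cooccurrence_share(articles, phrase="artificial intelligence",
                       terms=ETHICS_TERMS):
    """Return {year: share of phrase-mentioning articles that also
    contain at least one of the given terms}.

    articles is an iterable of (year, text) pairs (an assumed layout).
    """
    mentions = Counter()
    with_terms = Counter()
    for year, text in articles:
        lowered = text.lower()
        if phrase not in lowered:
            continue  # only articles that mention the phrase are counted
        mentions[year] += 1
        tokens = set(re.findall(r"[a-z]+", lowered))
        if tokens & terms:
            with_terms[year] += 1
    return {year: with_terms[year] / mentions[year]
            for year in sorted(mentions)}


# Toy articles, invented for this example.
toy_articles = [
    (2012, "Artificial intelligence beats humans at a quiz show."),
    (2012, "Stock markets rallied today."),
    (2018, "Critics warn of bias in artificial intelligence hiring tools."),
    (2018, "Artificial intelligence startup raises funding."),
]
shares = cooccurrence_share(toy_articles)
print(shares)
```

Reporting a share rather than a raw count separates growth in ethics framing from the overall growth in AI coverage.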
Hypothesis: in the provenance of an art piece (say a painting), if one or more of its past holders are among the most prolific collectors, the auctioned price of that artwork might be higher than a similar piece. Reasoning: It is not unimaginable that the art world, like many other markets of luxury goods, is controlled by a small group of powerful players. They have a huge impact on the cultural (aesthetic) and the monetary appreciation of the work. When it comes to identifying those big aesthetic powers, the size of the collection might be an important factor. Corpus: Getty Provenance Index, Sotheby's auction records. |
Hunch: I suspect that fanfiction works use more inclusive language (such as LGBT+ terminology, gender-inclusive language, etc.) than the text of their source materials. Reasoning: Qualitative research has discussed ways in which fan communities incorporate themes of queerness or diversity into fan works, even when such themes are not depicted in the original source material. I believe that many of these works aim to fill gaps in on-screen representation, and that this will be reflected in the language used in the stories. Corpora: The TV Corpus and a corpus of text from fanfiction websites (I won't have this available for class tomorrow unfortunately) |
Hunch: Self-identified extroverts and introverts differ markedly in aggregate word usage, in ways that reflect their personality preferences. Reasoning: Psychological preferences and personality differences should be reflected in the interests people choose, and should also act as causal precursors to differences in language and word use when people express themselves. Corpora: Personality forum threads specific to extroverts/introverts online: www.personalitycafe.com/forums/ |
Hunch: Conservative media may use more abstract (as opposed to concrete) language than liberal media to describe the situation during the pandemic. Reasoning: Conservative media have tried to support the government by downplaying the seriousness of the pandemic, and have therefore refrained from giving concrete descriptions and information about COVID and the pandemic situation. Corpora: Google news, etc. |
Hypothesis: It has become harder for children from low-income families to engage in high culture over the past century. Reasoning: An article from the New York Times points out that class differences in child-rearing are on the rise in the US. Corpora: Biographies of musicians in major American orchestras on the Stokowski website (https://www.stokowski.org/), and musicians’ Wikipedia pages (if available). |
Hypothesis: Reviews of Airbnb listings in European cities focus more on the characteristics of the place (as opposed to the listing itself) than reviews of Airbnb listings in US cities. Reasoning: Most people visit European cities for tourism/sightseeing; even if they visit these cities for work, sightseeing is an important part of the visit. Furthermore, European cities are more "compact", and where you live (even temporarily) has a stronger correlation with what you are going to do or explore. US cities, on the other hand, are more spread out, and the interesting sites (for visitors) are usually in the central area, which is (probably) far from the neighborhood where the Airbnb is located. Corpora: Airbnb reviews (http://insideairbnb.com/) from European and US cities. |
Hypothesis: Movie reviews are becoming shorter and less polite. Reasoning: The internet speeds everything up, including the way people form their speech. I assume that people are spending less time writing a movie review, so that (1) they write shorter reviews and (2) their negative sentiments are presented in a more straightforward way, and are therefore less "polite" (fewer euphemisms). Corpora: Timestamped Amazon movie reviews (https://snap.stanford.edu/data/web-Movies.html) |
Hunch: During 2017-2020, words that might paint a bad image of China (such as Xinjiang, Uyghur, Hong Kong, and Wuhan) will have higher counts in right-wing media than in left-wing media. Reasoning: During the Trump era, right-wing media tried to align with Trump's rhetoric on social and diplomatic issues; I specifically want to investigate China-related issues. By comparing different issues (e.g. Xinjiang and Wuhan have carried very different meanings over the past 4 years), we can see how the word counts differ across media and get a sense of how ideology and political concerns influence media reporting tactics. Corpora: COCA, NOW, and other course-provided corpora. |
Hunch: The number of people seeking financial advice on Reddit can reflect important economic events, such as a pandemic or a financial crisis. Ultimately, we could use these data to predict or warn of potential financial crises. Reasoning: Although the number of online posts tends to increase as the internet plays a more and more important role in people's daily lives, I assume there is a relatively faster change when special economic events happen. These events can be systemic (such as a financial crisis), with an impact that can be observed before the event, or disruptive (such as a pandemic), with an impact that can only be observed after the event occurs. Corpora: |
Hunch: Works of horror fiction are less popular the more they emphasize fantastical/supernatural components (such as ghosts, monsters, etc.) and, conversely, more popular the more they emphasize mundane life or words associated with ordinary drama. Discussion: As a horror fan, I find that anecdotal evidence from recent years suggests that, for big-budget films just as much as amateur short stories, the scariest stories are usually those which do not mention or dwell on explicit fantasy components. As a proxy for this hunch, we can look at the somewhat popular (14.5 million subscribers) Reddit forum r/nosleep, where amateur writers post short horror stories. If my hunch is correct, then stories which more frequently use words associated with fantasy should tend to receive fewer upvotes. Corpora: Reddit page: https://www.reddit.com/r/nosleep/ |
Hypothesis: For professional videos in finance or economics, the more "big numbers" and fancy or emotional words used in the video title, the more likely the video covers the material at an elementary level. Reasoning: Many videos of this type generally function as an introduction to a certain area. They therefore need strong or exaggerated words to attract viewers and gain views, and may then gain attention from both beginners and pros. Corpora: Video titles and views from famous YouTube channels in Math, Economics, Finance, History. People's comments on the videos. Likes and dislikes, etc. |
Hunch: After the pandemic began, the number of negative expressions about China among the population in developed countries first rose sharply and then eased slightly. Reasoning: 1. Considering the impact that COVID-19 has had on ordinary people, along with various factors including political propaganda and media coverage, it is likely that many people's attitudes toward China have shifted considerably since the pandemic began. 2. I have seen a lot of reports and pieces claiming that, after COVID-19, there has been a big change in attitudes towards China at the national level in many regions. But I want to verify whether this is true from what people have spontaneously posted on the Internet. Corpora: |
Hunch: Google search trends can indicate the introduction of a formerly 'niche' concept or discussion into the 'mainstream'. Reasoning: Individuals participate in largely echo-chamber-like social networks (both on and offline), which can spiral into each group having its own functional 'base knowledge' (shared across the group). When this knowledge becomes relevant outside of the group, those who are seeking to understand it will likely turn to Google for background information and further research. Corpora: Google Trends interface: https://trends.google.com/trends/?geo=US |
Hunch: The distribution of words in company news and announcements can predict stock returns. Reasoning: (1) There is a significant 'sentiment' component in stock market returns. Textual information reflects this latent sentiment. If we can filter out the noise, it is likely to provide strong signals of future stock returns. (2) Even in an efficient market setting, rational investors have limited attention. It is impossible to keep track of thousands of stocks at the same time. In this sense, textual analysis of thousands of stocks simultaneously can help us pick out the information that investors have missed. Corpora:
|
Hunch: During the pandemic, campaigns with the keywords of "Covid" would increase; campaigns with more emotional words may receive more donations. Reasoning: Corpora: |
Hypothesis: The counts of "discredited" (shi xin) are higher than those of "credited" (shou xin) in China's social credit system policy documents. The pragmatic nature of the social credit system suggests that the system focuses more on the issue of discredited persons or entities failing to pay outstanding debts and fines on time than on promoting normative values such as honesty. Corpora are the policy and legal documents on China's social credit system from 2002 onwards. |
Hunch: Weather has an effect on how people express themselves and their moods on social media. Reasoning: During sunny days people post relatively more positive experiences and thoughts compared to times with cloudy/rainy weather. Corpora: Tweets sent from a single city in a specific time frame when both cloudy/rainy and sunny weather can be observed (e.g. Chicago in April 2019). Additionally, a climate database with information on hourly weather patterns; meteoblue.com's archives seem to have hourly regional data on "Total Cloud Cover", which can be downloaded in CSV format. |
Hunch: Political rhetoric between major parties in the USA has had periods of divergence and convergence; over the past decade we have seen a steady rise in divergence. Reasoning: Superficial analysis of public speeches made by the last 3 US presidents shows a wide difference between the overall rhetoric and key messages each party communicates. Corpora: Corpus of inauguration speeches, corpus of other presidential speeches. |
Hunch: The distribution of news topics has become more skewed over time: instead of being relatively diverse and evenly distributed, fewer topics (or types of topics) now occupy a larger portion of the news bandwidth. Reasoning: The news/media industry is consolidating, and there is also an echo-chamber effect within news/media. Additionally, media companies want readers, and readers mostly want to read top headlines. Corpora: NOW corpus |
Hunch: Media sources deemed to be disreputable (i.e., fake news) may mirror credible news sources more closely today than they have in the past. |
Hunch: In coverage of combating climate change, right-wing media prefer solution- and progress-oriented words such as "green energy", while left-wing media focus more on mitigation. |
Hunch: Children coming from poor families have lower linguistic diversity in word choice, sentence structures, content, etc. |
Articulate a one-sentence computational linguistics hunch or hypothesis regarding the distribution of words, phrases or parsed statements within your corpus relative to some variable (e.g., time, city size, number of likes), between your corpora, or between your corpus and some linguistic baseline (e.g., all current Wikipedia articles; a sample of 2020 news articles; French tweets from 2016 Paris). This need not be critical to your final project...but it could lead there. Next, in a short (2-5 sentence) paragraph, describe why you reason this hunch or hypothesis might be correct. Finally, list the corpus or corpora on which you will test it, and mention whether it could be made available to class this week for evaluation (not required...but if you offer it, you might get some free work done!) Please do NOT spend time/space explaining how you will explore your hunch or validate your hypothesis with the mentioned corpus. (Then upvote the 5 most interesting, relevant and challenging challenge responses from others).