- Background
- Objective
- Tools and Packages
- Data Collection & Processing
- Data Visualization
- Results
- Future Work
I read an article claiming that Country songs reference substances more than any other genre. I had always assumed Hip-Hop would win this by a landslide, so I decided to use my resources to conduct my own research!
An analysis of song lyrics from five genres (Pop, Hip-Hop, Country, Rock, R&B) to see which genre references substances the most.
I used a combination of Python and R for this project. I find Python better suited to API calls and web scraping, so I used it to download the lyrics. I prefer R for plotting and data exploration, so I switched languages once the lyrics were downloaded.
R packages:
- tidyverse: Data manipulation & analysis
- tidyjson: Structuring .json data into tidy data frames
- rjson: Conversion of .json objects into R objects
- tidytext: Text mining using tidy data principles
- furrr: Parallel versions of purrr's mapping functions, built on futures
- gsubfn: String manipulation
- plyr: Split/Apply/Combine strategies for data
Python modules:
- lyricsgenius: Client for Genius API
- pandas: Data manipulation and analysis
- dask: Parallel processing
- os: Operating system interfaces
- json: Working with json files
Method | Notes |
---|---|
search_artist | 150 * 50 songs downloaded in ~ 20 minutes |
save_lyrics | Downloaded .json files to drive |
dask.compute | Multiprocessing for search_artist |
Data Cleaning
- Change all lyrics to lower case
- Tokenization of words
- Changing plural mentions to singular - ex. "girls" to "girl"
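The cleaning steps above can be sketched in a few lines of Python. This is only an illustration and not the project's actual code, which did this step in R. The `NOT_PLURAL` exception list and the strip-the-trailing-"s" rule are assumptions made for the example; a real pipeline would likely use a proper lemmatizer.

```python
import re

# Hypothetical exception list: common words ending in "s" that are not plurals.
NOT_PLURAL = {"this", "is", "was", "his", "us", "yes", "always"}

def singularize(word):
    """Naively strip a trailing 's' ('girls' -> 'girl'); not a real lemmatizer."""
    if word in NOT_PLURAL or len(word) <= 3 or "'" in word or word.endswith("ss"):
        return word
    return word[:-1] if word.endswith("s") else word

def clean_lyrics(text):
    """Lowercase the text, split it into word tokens, then singularize each token."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [singularize(t) for t in tokens]

print(clean_lyrics("Girls love GIRLS"))  # ['girl', 'love', 'girl']
```

The naive rule will still mangle some words (for example, names ending in "s"), so the output is worth spot-checking.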
Once all the lyrics were downloaded and filtered, Hip Hop had the most words of any genre, with over 20,000:
"Love" and "Yeah" were top words in all genres.
I thought it would be cool to see which genre references love the most. Dividing the number of mentions of the word "love" by the total number of words in each song gave me a percentage I could compare across genres.
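As a rough sketch of that calculation (mentions divided by total words, times 100), assuming the lyrics are already tokenized into a list of lowercase words. The function name `word_share` is made up for this example.

```python
def word_share(tokens, target="love"):
    """Return the percentage of tokens that are exactly `target`."""
    if not tokens:
        return 0.0  # avoid dividing by zero for empty lyrics
    mentions = sum(1 for t in tokens if t == target)
    return 100.0 * mentions / len(tokens)

song = ["i", "love", "you", "love"]
print(word_share(song))  # 50.0
```

The same function works on a single song or on all of a genre's tokens pooled together; which one you pass in changes what the percentage means.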
R&B makes the most references to love (duh), with Pop in second place. Hip Hop mentions it the least of all genres.
I counted a word as a "swear word" if it appeared in this list: "fuck", "shit", "bitch", "damn", "cunt", "slut", "whore", "ho", "piss", "bollocks" (for the British artists!), "dick", "cock".
First, I found the most common swear words per genre:
For this category, I expected Hip Hop to be at the top, by a lot (it was). I wasn't too sure what the rest of the rankings would look like for the other genres, and I was a little surprised by the results:
43% of Hip Hop lyrics are swear words! R&B is in second place, with a shocking 36% difference.
I didn't bother looking through the most common swear words to figure out who would reference them the most - safe to say Hip Hop wins this round.
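For reference, here is a minimal Python sketch of the two swear-word measurements: the most common swear words, and the share of all words that are swear words. It uses the word list given earlier and assumes exact matches on already-tokenized lyrics, so variants such as "-ing" forms are not counted. The original analysis was done in R and may have matched differently.

```python
from collections import Counter

SWEAR_WORDS = {
    "fuck", "shit", "bitch", "damn", "cunt", "slut",
    "whore", "ho", "piss", "bollocks", "dick", "cock",
}

def swear_stats(tokens):
    """Return (swear words by frequency, percentage of tokens that are swear words)."""
    hits = Counter(t for t in tokens if t in SWEAR_WORDS)
    share = 100.0 * sum(hits.values()) / len(tokens) if tokens else 0.0
    return hits.most_common(), share

ranking, share = swear_stats(["damn", "that", "shit", "is", "damn", "good"])
print(ranking)  # [('damn', 2), ('shit', 1)]
print(share)    # 50.0
```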
I grouped "Substances" into 7 categories: Marijuana (weed), Alcohol, Heroin, Meth, Pills, Cocaine, Ecstasy (including LSD, shrooms, molly). Hip Hop was in first place in terms of substance mentions, but I was shocked to see what was in second place:
Country music! I expected Pop/R&B to be in second, but apparently that isn't the case. Of all substances, Alcohol was the most commonly referenced, with Marijuana in 2nd place.
I decided to compare references to these two substances across genres:
Country music references alcohol the most, and by a wide margin over the other genres. Hip Hop is in second place.
Hip Hop references marijuana the most, far more than other genres!
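The category counts can be sketched as a keyword lookup. The seven category names come from the grouping above, but the keyword sets here are illustrative assumptions (only "weed", "LSD", "shrooms" and "molly" are named in this write-up); the real lists live in the project code.

```python
from collections import Counter

# Hypothetical keyword sets; the actual lists in the project are likely longer.
SUBSTANCE_KEYWORDS = {
    "Marijuana": {"marijuana", "weed"},
    "Alcohol": {"alcohol", "beer", "whiskey"},
    "Heroin": {"heroin"},
    "Meth": {"meth"},
    "Pills": {"pills", "pill"},
    "Cocaine": {"cocaine"},
    "Ecstasy": {"ecstasy", "lsd", "shrooms", "molly"},
}

# Invert to keyword -> category for constant-time lookup per token.
KEYWORD_TO_CATEGORY = {
    word: category
    for category, words in SUBSTANCE_KEYWORDS.items()
    for word in words
}

def count_substances(tokens):
    """Count substance mentions per category in a list of lowercase tokens."""
    return Counter(
        KEYWORD_TO_CATEGORY[t] for t in tokens if t in KEYWORD_TO_CATEGORY
    )

counts = count_substances(["beer", "weed", "beer", "love"])
print(counts["Alcohol"], counts["Marijuana"])  # 2 1
```

Single-word matching like this misses slang and multi-word phrases, so the counts are a lower bound at best.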
To capture mentions of violence, I gathered words related to aggression (as many as I could think of; the full list is in the code) and filtered each genre for mentions. Hip Hop was once again first in this category, with Pop narrowly beating out Rock for second place.
Lastly, I thought it would be cool to add a sentiment score to see how the genres stacked up against each other. I expected Hip-Hop to be far in the negatives due to the quantity of violence/substance/swear words present.
As you can see, that is the case! In fact, all genres are in the negatives, with the exception of R&B music, which is in the low positives (makes sense because they're talking about love so much).
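This write-up does not say how the sentiment score was computed. One common approach is a lexicon lookup, where each word carries a positive or negative weight and a song's score is the sum of its weights. The sketch below assumes that approach; `TOY_LEXICON` and its weights are invented for illustration and are not a real sentiment lexicon.

```python
# Invented weights for illustration only; a real analysis would use a published lexicon.
TOY_LEXICON = {"love": 3, "happy": 3, "good": 2, "damn": -2, "hate": -3, "kill": -3}

def sentiment_score(tokens, lexicon=TOY_LEXICON):
    """Sum the lexicon weights of all tokens; unknown words count as 0."""
    return sum(lexicon.get(t, 0) for t in tokens)

def mean_sentiment(songs, lexicon=TOY_LEXICON):
    """Average the per-song scores for a group of songs (e.g. a genre or an artist)."""
    if not songs:
        return 0.0
    return sum(sentiment_score(s, lexicon) for s in songs) / len(songs)

print(sentiment_score(["i", "love", "you"]))        # 3
print(mean_sentiment([["love"], ["hate", "kill"]]))  # -1.5
```

Under this scheme, a genre heavy on swear words and violent terms lands in the negatives simply because those words carry negative weights.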
I decided to do the same thing by artist, to see if any were far more negative than others.
- Eminem
- 2Pac
- Lil Wayne
- DMX
- JAY-Z
- Whitney Houston
- Mary J. Blige
- Celine Dion
- Stevie Wonder
- Janet Jackson
A larger collection of lyrics would be a major benefit: it would allow the same metrics to be analyzed across more genres and languages, making the findings more accurate.