Introduction

The perception and use of profanity depend on the situation and differ from person to person. While swearing is relatively common and sometimes even welcomed in a relaxed environment, such language is mostly frowned upon in the workplace and in public spaces. This is why we are unlikely to encounter it while reading the news. Still, heated debates and unexpected events can bring out obscene language even in more formal settings. To better understand the circumstances in which profanity appears in the media, we explore Quotebank, a large and heterogeneous dataset of quotes extracted from media articles. We analyze the distribution of obscene quotations through time and examine their presence with respect to the attributes of the speakers who uttered them and the media outlets that featured those speakers.

What is Profanity?

The dictionary offers a simple definition of the word ‘profanity’: a type of language that includes dirty words and ideas.

The naive approach to detecting profanity is to use a hard-coded list of curse words. However, this approach has a glaring issue: it completely ignores the context surrounding these words.

“Finally! A pair of great tits has moved into my birdhouse!”

Would you classify this statement as profanity?

It seems that Twitter would, as the user who posted it was banned from the platform, even though a quick look at the Wikipedia article for the Great Tit can easily explain the meaning of this statement.

Identifying profanity in text has proved to be a rather difficult task which, if not done carefully, can result in a high rate of false positives, as demonstrated by the Scunthorpe problem. That’s why, in combination with simple word-list-based methods, we used a pre-trained machine learning model to identify profane quotations and regular expressions to identify censorship.
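To make the Scunthorpe problem concrete, here is a minimal sketch of the word-list and regular-expression ideas. The tiny word list, the function names and the censorship pattern are all our own assumptions for illustration; they are not the lists or expressions used in the actual analysis.

```python
import re

# Hypothetical, deliberately mild word list (an assumption, not the real one).
WORD_LIST = {"hell", "damn", "ass"}


def naive_match(text):
    """Substring matching: flags any text that merely contains a listed word."""
    lowered = text.lower()
    return sorted(word for word in WORD_LIST if word in lowered)


def token_match(text):
    """Whole-word matching: flags only words that appear as separate tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(set(tokens) & WORD_LIST)


# Assumed censorship pattern: letters, then at least two masking symbols,
# then optional letters, as in "f**k" or "s#@t".
CENSORED = re.compile(r"\b[A-Za-z]+[*#@$%]{2,}[A-Za-z]*")


def find_censored(text):
    """Return all tokens that look like censored words."""
    return CENSORED.findall(text)


sentence = "Shell companies hired a classic assistant"
print(naive_match(sentence))   # substring hits inside innocent words
print(token_match(sentence))   # no whole-word hits
print(find_censored("What the f**k and s#@t"))
```

Whole-word matching removes the Scunthorpe-style false positives, but it still cannot tell a bird from an insult, which is why context-aware models are needed on top of word lists.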

1. Profanity in Quotebank

People curse a lot, just not in the media.

Quotebank is a dataset of ~110M quotes collected from different media articles. By observing the distribution of the obscene quotes identified by our models we can already reach our first conclusion:

Obscene quotes are not common in media articles.


Considering the temporal factor, the frequency distribution of quotes through time is mostly uniform. The sharp drops are caused by missing data, and the slight drop in March and April of 2020 can be explained by COVID-19 circumstances.

2. Zooming in

Isolating the profane quotes out of Quotebank leaves us with 1,146,168 quotes, out of which 68% have censored profanities. Let’s take a deep dive into this small subset of Quotebank.

WARNING: Curse words ahead!

People’s Choice Award for favorite curse word goes to…

Hell!
Not quite what you expected, huh? The word 'Hell' has around 130K occurrences in Quotebank, meaning it appears in around 11% of all profane quotes which our model identified. The image below shows all major obscene words found in Quotebank, where a word's boldness indicates its frequency of occurrence. Notice that less obscene words are more frequent, which is understandable as they are more likely to be published in media articles.
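As a quick sanity check, the headline figures quoted so far can be recomputed from one another. The snippet below is only a back-of-the-envelope sketch: the total and the 'Hell' count are the rounded values from the text, and it assumes each occurrence of 'Hell' falls in a different quote.

```python
TOTAL_QUOTES = 110_000_000    # "~110M" quotes in Quotebank (rounded)
PROFANE_QUOTES = 1_146_168    # quotes flagged as profane
HELL_OCCURRENCES = 130_000    # "around 130K" occurrences (rounded)
CENSORED_FRACTION = 0.68      # share of profane quotes that are censored

# Share of all quotes that are profane.
profane_share = PROFANE_QUOTES / TOTAL_QUOTES

# Share of profane quotes containing 'Hell', assuming one occurrence per quote.
hell_share = HELL_OCCURRENCES / PROFANE_QUOTES

# Approximate number of censored profane quotes.
censored_quotes = round(CENSORED_FRACTION * PROFANE_QUOTES)

print(f"{profane_share:.2%}")  # 1.04%
print(f"{hell_share:.1%}")     # 11.3%
print(censored_quotes)         # 779394
```

Roughly one quote in a hundred is profane, which matches the claim that obscene quotes are not common in media articles.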

Obscene words

Does your mum or dad curse more?

It seems that around 76% of all profane quotes we found were spoken by male speakers. On the other hand, if we look at the relative frequency of profane quotes among all quotes spoken by each gender, female speakers appear more prone to using profane language. Our analysis also shows that media articles are heavily biased towards featuring male speakers.
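The two statements above are not contradictory: one is a share of all profane quotes, the other a rate within each gender. The sketch below illustrates the difference with made-up counts. The numbers are hypothetical and chosen only to reproduce the 76% share; they are not Quotebank figures.

```python
def shares_and_rates(counts):
    """counts maps group -> (profane_quotes, all_quotes).

    share: the group's fraction of all profane quotes.
    rate:  the fraction of the group's own quotes that are profane.
    """
    total_profane = sum(profane for profane, _ in counts.values())
    return {
        group: {"share": profane / total_profane, "rate": profane / total}
        for group, (profane, total) in counts.items()
    }


# Hypothetical counts, for illustration only.
example = {"male": (76, 8000), "female": (24, 2000)}
result = shares_and_rates(example)

for group, values in result.items():
    print(group, f"share={values['share']:.0%}", f"rate={values['rate']:.2%}")
```

In this toy example male speakers account for 76% of profane quotes simply because they are quoted four times as often, while the rate of profanity is higher among female speakers.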



These donut plots don’t really do justice to other genders beyond the binary ones, so let’s take a look at the average profanity for some of the gender identities registered in the dataset.

Wow! The genderfluids are killing it.

On a serious note, it seems that speakers associated with less common genders tend to use more profane vocabulary. Psychological studies[1] suggest that the use of profanity is related to aggressive behavior. Our best guess is that people who associate themselves with uncommon genders still don’t feel accepted in today’s society, causing them to be more hostile and therefore more prone to use profane language. We decided to further explore this hypothesis by computing average aggression scores for each gender category.

One can argue that this plot supports our hypothesis, but due to large confidence intervals, we can’t make any significant claims. We still encourage readers to respect and be nice to people associated with uncommon genders.
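Why are the intervals so wide? Categories with few quotes give noisy averages. The sketch below uses a percentile bootstrap to show the effect. The method, the function name and the scores are our assumptions for illustration; the intervals in the plot may have been computed differently.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical aggression scores: same values, very different sample sizes.
small_group = [0.1, 0.9, 0.2, 0.8, 0.5]
large_group = small_group * 100

small_low, small_high = bootstrap_ci(small_group)
large_low, large_high = bootstrap_ci(large_group)

print("small group width:", small_high - small_low)
print("large group width:", large_high - large_low)
```

Both groups have the same mean, but the interval for the small group is much wider, so differences between rare categories are hard to distinguish from noise.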

The Mic or the Stripping pole

If we had to guess, we would say that quotes by rappers would contain the most obscene language. On the contrary, our analysis shows that the most vulgar occupation of Quotebank speakers is stripping. Wait… strippers get quoted in media articles? Apparently they do: we have found over 23,000 quotes spoken by strippers. That being said, MCs take first place if we only consider speakers with more than 500 quotes. Intuition, check!

F#@k it’s Monday!

We have also grouped quotes by the day of the week on which the articles were posted and computed the average profanity score of each group. Surprise, surprise - Monday takes the lead with the highest average (we all know why). In addition, this paper[1] suggests that people become more hostile when exposed to profanity in media. Maybe that’s another reason why people are so grumpy on Mondays, besides, you know, having to go to work again.
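The grouping itself is simple. Here is a minimal sketch, assuming each quote comes as a (publication date, profanity score) pair. The data layout, the function name and the scores are hypothetical.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_score_by_weekday(quotes):
    """quotes: iterable of (publication_date, profanity_score) pairs."""
    buckets = defaultdict(list)
    for day, score in quotes:
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        buckets[WEEKDAYS[day.weekday()]].append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}


# Hypothetical scores, for illustration only.
sample = [
    (date(2020, 1, 6), 0.4),   # a Monday
    (date(2020, 1, 13), 0.2),  # a Monday
    (date(2020, 1, 7), 0.1),   # a Tuesday
]
averages = mean_score_by_weekday(sample)
print(max(averages, key=averages.get))  # day with the highest average score
```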

3. Individual speakers

We will now turn our attention to individual speakers. We have prepared a simple search engine which lets you look up your favorite speakers and see their average profanity score. We have only included speakers with more than 5000 quotes so that the average profanity metric remains reasonable. After filtering, there were still over 1000 speakers left, so feel free to play around.


The return of the MCs

Observing the general distribution of the most quoted speakers, we can see some familiar names. The profane quotes attributed to Pope Francis were a bit concerning, but a deeper analysis shows that they mostly either contain the word ‘hell’ or are false positives. Taking a look at the speakers with the highest percentage of profane quotations, we notice that most of them are rappers. The MCs are back!

A quick trip back to the strip club: most stripper quotes belong to Cardi B, who worked as an exotic dancer before rising to fame as a rapper. This further explains why strippers beat rappers as the most vulgar occupation.

4. Media outlets

Media outlets play a significant role in keeping everyone updated about events around the world. Before jumping into the general analysis of profanity in media outlets, please take a moment to use the search bar below to find your favorite media outlet. Then proceed to the general analysis and compare your media outlet with the rest.


You are what you read

The interesting thing about quotations is that media outlets can’t alter them without losing credibility. Therefore, even if a media outlet has a policy against profanity, it cannot rewrite a quotation. The best it can do is censor the quotation or not publish it at all.

Quotebank data suggest that popular media outlets with a high number of quotations don’t publish many profanities. Indeed, only 1-2% of the quotations published by each popular outlet were considered profane by our model. This observation, combined with relatively high censorship rates, implies a fairly strict policy against profanity among popular outlets.

On the other hand, the media outlets with the highest censorship rates are mostly oriented around fashion or the entertainment industry (especially hip-hop culture). Outlets with the highest profanity rates are also mostly hip-hop related! All evidence points to a single conclusion: hip-hoppers really got no chill…

Conclusion

Our journey through profanity has taken a weird and funny course. We have presented our most interesting insights, but Quotebank is a very heterogeneous dataset, so there is still room for a lot of research.

We can summarize our data story through the following points:

  • Obscene quotes are not common in media articles
  • Respect non-binary genders
  • Take it easy on Mondays
  • Consider removing Cardi B from your Spotify
  • Don’t let your kids read Hip-Hop magazines

References

[1] Sarah M. Coyne, Laura A. Stockdale, David A. Nelson, Ashley Fraser. Profanity in Media Associated With Attitudes and Behavior Regarding Profanity Use and Aggression. Pediatrics, November 2011; 128 (5): 867–872. doi:10.1542/peds.2011-1062