When Scientists Used Machine Learning to Spot Bad Lines in 700 Bollywood Films

CDs of famous Bollywood films lined up on a shelf. Photo: ryanready/Flickr, CC BY 2.0


  • Three researchers used machine learning to analyse subtitles from 70 years of Bollywood films and quantified the persistence of various prejudices in their stories.
  • They used simple counting, WEAT, diachronic word embedding tests and cloze tests to unearth characters’ attitudes towards male children, fair skin, dowry, caste, etc.
  • While social scientists have been studying Bollywood for many years, the researchers said their study stands out by its size and by its ability to quantify various biases.

Hyderabad: There’s a line in the 2007 Bollywood film Jab We Met, starring Kareena Kapoor and Shahid Kapoor, that goes: “Akeli ladki khuli tijori ki tarah hoti hai” – Hindi for ‘A woman alone is like an open vault’. Another one, from the film Kambakkht Ishq (2009) starring Kareena Kapoor and Akshay Kumar, goes: “Marriage ke pehle ladkiyan sex objects hoti hai, aur marriage ke baad they object to sex!” (‘Women are sex objects before marriage, and they object to sex after marriage!’).

Bollywood is no stranger to such sexist and misogynistic dialogues. But a new analysis of 700 films produced by the Bombay film industry suggests some of them may be making less frequent appearances on screen.

Bollywood and India’s regional movie industries together constitute the world’s largest movie industry by the number of feature films produced. These movies reach large audiences worldwide and earn colossal revenues. Jab We Met, for example, raked in Rs 114 crore, and Kambakkht Ishq, Rs 129 crore.

In a new study, published online this month, researchers from the Rochester Institute of Technology and Carnegie Mellon University used natural-language processing (NLP) techniques to analyse subtitles from 70 years of Bollywood films and identified colourism, sexism and religious and geographical prejudices in each film.

The study, conducted by machine-learning researchers Kunal Khadilkar, Ashique KhudaBukhsh and Tom Mitchell, also compared results obtained from their analysis of Bollywood movies with several movies from Hollywood and other parts of the world for insights on how Bollywood fares on the global stage in terms of fair representation.

The results suggest that while Bollywood movies have been hosting offensive lines for a long time, things may be improving.

Why bother?

According to the study, Bollywood movies have a target audience of at least 1.2 billion people in 90 countries, so the views and stereotypes they contain can become influential, and quickly.

Khadilkar and KhudaBukhsh told The Wire Science that the study was born at Carnegie Mellon University, where Khadilkar was a graduate student and KhudaBukhsh a visiting faculty member. They were both very interested in Bollywood affairs and had followed comments on social media about the industry’s issues with representation, so they decided to undertake a large study that used their expertise in machine learning.

According to KhudaBukhsh, they “wanted to see how popular entertainment captures social norms”, since “any improvement or degradation in Bollywood content can affect lots and lots of people”.

They began by collecting and organising the English subtitles of the 100 top-grossing Bollywood movies in each decade from 1950 through 2020.

To compare the trends they observed in Bollywood movies with the global picture, they also collected the subtitles of the 100 top-grossing Hollywood movies in each of the same decades and those of 150 movies nominated for the ‘Best International Feature Film Award’ at the Academy Awards – or Oscars – since 1970.

According to Khadilkar, this corpus itself was “unique” – reportedly the first of its kind.

Then, the duo classified the 70 years into three categories: “old” (1950-1969), “mid” (1970-1999) and “new” (2000-2020). “Our choice of separation points in the timeline is guided by the global emergence of counter-culture in the late 60s and early 70s and the rapid rise of multiplex culture in Indian cinema,” they wrote in their paper, adding that the periods could be sliced in other ways as well.

To find and quantify bias, Khadilkar and KhudaBukhsh used four methods.

First, they counted the number of times male and female pronouns occurred in the Bollywood films’ corpus, to check for gender bias.
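The counting step can be pictured with a few lines of Python. This is only a minimal sketch, not the researchers’ code: the pronoun lists, the function names and the two sample subtitle lines are assumptions made up for illustration.

```python
import re
from collections import Counter

# Assumed pronoun lists; the study's exact lists are not given in this article.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def pronoun_counts(subtitle_lines):
    """Count male and female pronouns across an iterable of subtitle lines."""
    counts = Counter(male=0, female=0)
    for line in subtitle_lines:
        for token in re.findall(r"[a-z']+", line.lower()):
            if token in MALE:
                counts["male"] += 1
            elif token in FEMALE:
                counts["female"] += 1
    return counts

def male_share(counts):
    """Fraction of gendered pronouns that are male, or None if there are none."""
    total = counts["male"] + counts["female"]
    return counts["male"] / total if total else None

# Made-up sample lines, not taken from any film.
sample = ["He said his brother would come.", "She told him to wait."]
counts = pronoun_counts(sample)
print(counts["male"], counts["female"], male_share(counts))
```

A share well above 0.5 across a whole corpus would point to the kind of skew towards male pronouns that the study reports.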

Second, they used the Word Embedding Association Test (WEAT), a common measure that quantifies relationships between words. WEAT involves mapping words in a dataset as vectors in a “high-dimensional” space, and then quantifying the distance between these vectors as a measure of the relationship between them. The result is reported as a score, commonly called the ‘WEAT score’, which goes from -1 to 1.

In the current study, 0 meant ‘no bias’, a positive score meant a bias towards men and a negative score meant a bias towards women.
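For readers who want to see the mechanics, here is a rough sketch of one widely used way to compute a WEAT-style score, the effect-size formulation. It is an assumption that this matches the study’s procedure; in particular, under this formulation the value can fall outside the -1 to 1 range quoted above, so the study’s normalisation may differ. The two-dimensional vectors are invented toy values, where real embeddings would have hundreds of dimensions.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """How much closer vector w sits to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_effect_size(X, Y, A, B):
    """Difference in mean association of target sets X and Y, scaled by the
    sample standard deviation of the associations over X and Y combined."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy vectors, invented for illustration only.
A = [(1.0, 0.1), (0.9, 0.2)]   # attribute set A, e.g. male terms
B = [(0.1, 1.0), (0.2, 0.9)]   # attribute set B, e.g. female terms
X = [(0.8, 0.3), (0.7, 0.4)]   # target words that lean towards A
Y = [(0.3, 0.8), (0.4, 0.7)]   # target words that lean towards B

score = weat_effect_size(X, Y, A, B)
print(score > 0)
```

A positive value here means the words in X sit closer to the A terms than the words in Y do; swapping X and Y flips the sign, and a value near 0 means no measurable difference.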

Third, they checked whether the context in which certain words were being used in movies had changed. They did this using diachronic word embedding, in which they mapped the nearest “neighbours” of a particular word in a dynamic dataset in a similar high-dimensional space as in the previous method. For example, by tracking the nearest neighbours of the word “beautiful”, they could say whether the context in which “beautiful” was being used in movies had changed.
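The nearest-neighbour comparison can be sketched as follows, assuming one set of word vectors has already been learnt separately for each period. The vectors below are invented two-dimensional toy values and the function name is hypothetical; the words echo the “dowry” example discussed later in this article, but the numbers are not from the study.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to the vector for `word`."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

# Toy embeddings for two periods; every number here is made up.
old = {"dowry": (1.0, 0.0), "money": (0.9, 0.1), "debt": (0.8, 0.2),
       "guts": (0.1, 0.9), "refusal": (0.0, 1.0)}
new = {"dowry": (0.1, 1.0), "money": (0.9, 0.1), "debt": (0.8, 0.2),
       "guts": (0.1, 0.9), "refusal": (0.0, 1.0)}

print(nearest_neighbours("dowry", old))
print(nearest_neighbours("dowry", new))
```

Comparing the two neighbour lists for the same word is what signals a shift in context. Because only the neighbours within each period are compared, this sketch does not need the two vector spaces to be aligned with each other.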

Finally, Khadilkar and KhudaBukhsh used cloze tests – fill-in-the-blanks tasks that a program performs after ‘learning’ a large number of movie dialogues. How it filled the blanks could be used to deduce what biases it may have learnt along with the dialogues.

For example, after training their program, the researchers posed the following prompts: “A woman should be ______ by occupation” and “A man should be ______ by occupation”. If the program had said “a cleaner” and “an engineer”, the duo could infer that that’s what the dialogues had ‘taught’ the program.
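The idea that the filled-in word reflects the training text can be shown with a deliberately simple stand-in. The sketch below is not a language model and not the researchers’ method: it merely counts which word most often fills the blank in a tiny, made-up corpus. The template, both corpora and the function name are assumptions chosen for illustration.

```python
import re
from collections import Counter

def fill_blank(template, corpus, top=1):
    """Return the word(s) that most often fill the '____' slot of `template`
    in `corpus`. A frequency count standing in for a trained model."""
    before, after = template.lower().split("____")
    pattern = re.compile(re.escape(before) + r"(\w+)" + re.escape(after))
    counts = Counter()
    for line in corpus:
        counts.update(pattern.findall(line.lower()))
    return [word for word, _ in counts.most_common(top)]

# Two made-up "training sets" that differ in what they say about tea.
corpus_a = ["The tea should be hot.", "The tea should be hot, please.",
            "The tea should be sweet."]
corpus_b = ["The tea should be iced.", "The tea should be iced today.",
            "The tea should be hot."]

template = "The tea should be ____"
print(fill_blank(template, corpus_a))
print(fill_blank(template, corpus_b))
```

The same prompt yields different answers depending on the text the program has seen, which is the property the cloze tests rely on. A real test would pose the prompt to a language model trained on the subtitles rather than count exact matches.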

In addition, the researchers also checked whether dialogues harboured biases about families’ preference for male children, fair skin and the dowry system, the occupations of Hindu versus Muslim characters, and representation of characters from different parts of India.

Radhika Mamidi, a computational linguist at the International Institute of Information Technology, Hyderabad, commended the paper for using “correct and sound methodology”. But she also said “the data may not be representative enough” simply because “there are many more stereotypes portrayed in the movies which may not have found a place in the paper.”

Gender bias in Bollywood films

The authors found that considerable gender bias continues to persist in Bollywood movies, although Pritha Chakrabarti, a cultural studies researcher who has worked previously on representation in Bollywood [1], pointed out an important caveat. As an illustration, she quoted a line from the movie Jab We Met that the study finds to be an example of misogyny: “A girl who is alone is like an open treasure” (translation from the paper).

According to her, “the movie is actually trying to challenge the dominant notions of patriarchy”, and “rather than preaching the bias, the movie is trying to expose the bias”.

This said, the researchers were quickly able to find the imprints of Bollywood’s representation problem in their data.

For example, by counting the occurrences of male versus female pronouns in their corpus, they concluded that both Bollywood and Hollywood movies are skewed towards the use of the male pronoun. Many of us may have already intuited this, but according to the study’s authors, their work’s value lies in, among other things, the size of their analysis and the quantification of various biases.

The WEAT scores also suggested that Bollywood movies had a bigger bias towards men than Hollywood movies did – especially in the romance genre. But Hollywood movies scored worse on action films.

The results from the cloze tests indicated reason for some optimism: for a given occupation, the representation of women in both Bollywood and Hollywood movies has been improving over time – and more substantially in Bollywood. The authors attribute this to “the continual fight for gender equality in India”.

Chakrabarti said that this improvement could be a result of the “post-Nirbhaya moment” – referring to the 2012 Delhi gangrape incident. This incident, in her words, “inaugurated the stage in India about a more egalitarian view of women and their position in the society in popular culture”.

She also said that while there have been attempts before to change the way women are represented and addressed on the silver screen, they have still portrayed women as people in need of being “saved” or “respected”, but “never as an equal”.

The NLP researchers also found that there has been a considerable shift in the sex ratios of children born in movies. They wrote that the birth of a child is an important plot point in many Bollywood films, with one in every 10 films using the trope. And they found that whereas 74% of children born in the ‘old’ group were male, a relatively better 55% in the ‘new’ group were male.

Skin colour, dowry and caste

Bollywood also continues to have a pronounced bias towards fair skin (as do all other major regional filmmaking sectors in the country). The researchers performed a cloze test in which they asked the program to complete the sentence “A beautiful woman should have _______ skin”. Before training, the most common response was “soft”. But after being fed a diet of Bollywood dialogues, it started to say “fair”.

And diachronic word embedding tests found that this fair-skin bias has been consistent across old, mid and new Bollywood movies.

However, the same tests also revealed that attitudes towards the dowry system had improved in the ‘new’ group. While “money” and “debt” were the words most commonly associated with “dowry” in the ‘old’ set, words such as “guts” and “refusal” featured closer to “dowry” in newer ones.

The unevenness of improvements continued into the religious representation portion of the analysis.

The words most commonly associated with “Hindu” in older movies were “worshipped”, “loyal” and “righteous” – versus “industrialist”, “wealthy” and “respected” in the newer lot. And the words most commonly associated with “Muslim” changed from “urdu”, “sage”, “saint” and “scholar” to “shameless” and “traitor” between the same lots.

When the researchers compared the representation of various religious communities in movies with their numbers in the Indian population (per Census data), they observed that Muslims have been consistently underrepresented in Bollywood.

For example, only about 6% of all surnames indicated a Muslim character, while the Census data from the same time indicated that Muslims made up around 10% of the national population. The trend has reportedly continued into new Bollywood movies as well – 8% on screen, 14% off screen.
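One simple way to put the two pairs of figures above on the same footing is to divide the on-screen share by the population share, so that 1.0 would mean parity. This ratio is an illustrative calculation added here, not a metric taken from the paper, and the function name is hypothetical.

```python
def representation_ratio(on_screen_pct, population_pct):
    """On-screen share divided by population share; 1.0 would mean parity."""
    return on_screen_pct / population_pct

first_pair = representation_ratio(6, 10)   # the 6% versus 10% figures
new_films = representation_ratio(8, 14)    # the 8% versus 14% figures
print(round(first_pair, 2), round(new_films, 2))
```

On this rough measure both values sit well below 1.0, and the figure for new films (about 0.57) is no higher than the other (0.6), consistent with the underrepresentation persisting.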

In a similar vein, the researchers also noted that the caste and religious representation of people in the medical profession has been skewed heavily in favour of Hindu Brahmin men, reinforcing a well-documented casteist bias in the practice of medicine in India.

Geographic representation

Discs of Bollywood movies on display at a video store in Islamabad, October 2016. Photo: Reuters/Caren Firouz

By analysing the number of times a particular geographical region was mentioned in each movie, the researchers were also able to conclude that Mumbai and Delhi were the most common locales. This is not as surprising as the finding that the states of Manipur, Arunachal Pradesh, Meghalaya, Tripura and Mizoram found no mention at all in their corpus of films.
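Counting place mentions works much like counting pronouns. The sketch below assumes a short, hand-picked list of place names and three made-up subtitle lines; the study’s actual list and matching rules are not described in this article.

```python
import re
from collections import Counter

# Assumed, abbreviated list of places; a real analysis would cover every state and city of interest.
PLACES = ["Mumbai", "Delhi", "Manipur", "Arunachal Pradesh"]

def place_mentions(subtitle_lines, places=PLACES):
    """Count whole-word, case-insensitive mentions of each place name."""
    counts = Counter({place: 0 for place in places})
    for line in subtitle_lines:
        for place in places:
            pattern = r"\b" + re.escape(place) + r"\b"
            counts[place] += len(re.findall(pattern, line, flags=re.IGNORECASE))
    return counts

def never_mentioned(counts):
    """Places that did not appear even once, in alphabetical order."""
    return sorted(place for place, n in counts.items() if n == 0)

# Made-up sample lines, not taken from any film.
sample = ["I am leaving for Mumbai tonight.",
          "Delhi is cold; Mumbai is not.",
          "He came from Delhi."]
mentions = place_mentions(sample)
print(mentions["Mumbai"], mentions["Delhi"], never_mentioned(mentions))
```

Listing the places with a count of zero is what surfaces absences such as the ones reported above.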

The researchers also asked the program in cloze tests: “The biggest problem in India is _______” and “The biggest problem in America is _______”.

When the program was trained using the corpus of new Bollywood movies, it replied with “Pakistan” and “Kashmir” to the former query. When trained on Hollywood dialogues, it answered with “racism” for the latter.

The researchers wrote in their paper that such answers could be a useful way to extract information about national priorities, at least as portrayed in popular culture.

Quantifying bias

The researchers acknowledged that although most of their insights were not novel – in the sense that they have been discussed in qualitative social science research for some time – they had been able to scale their analysis to a large number of movies and to quantify the bias.

“Now, we have a proper number associated with each of the biases – with the gender disparity in dialogues, how dowry is represented in films, how babies born in films are predominantly sons, etc.,” Khadilkar told The Wire Science.

According to KhudaBukhsh, the quantitative nature of their study provides a tool with which to understand the sort of biases that have persisted in Bollywood movies and could now allow others to track how their presence changes over time.

At the same time, Chakrabarti cautioned against waiting for data to believe something when evidence of it is already common. “We don’t need to go from door to door to do a survey to prove that we live in a patriarchal society,” she said. “Similarly, we don’t need to get data out of hundreds of films to say that there is a sexist bias in the language used in cinema.”

KhudaBukhsh replied, “We don’t see our work as a competition to what social scientists do. In terms of research questions, I still feel that social scientists will have more meaningful research questions and more insightful understanding of the results that we found.” Instead, he added, their work could be “a tool to help social scientists scale their work to a much larger number of movies. Our work complements the contributions of social scientists.”

Going ahead, Khadilkar and KhudaBukhsh plan to apply their techniques to other texts, including radio transcripts and books, in the hope that their work will start conversations both within and outside Bollywood about representation, while supporting the work of social scientists. Chakrabarti herself had a wish: “I think it would be interesting to see what happens when the machine learns sarcasm.”

Sayantan Datta (they/them) is a queer-trans science writer, communicator and journalist. They currently work with the feminist multimedia science collective TheLifeofScience.com, and tweet at @queersprings.


  1. She also teaches at Krea University, where the author is also a faculty member.
