Scientists Elicit Universal Pattern of Sound Use in Languages

26/02/2018

We use a smaller variety of sounds with which to end words than we do to begin them.

The words we write and speak begin with a wide variety of sounds, but the set of sounds they end with is more restricted. Researchers at the Institute of Mathematical Sciences (Matscience), Chennai, who analysed this property and its implications in a new study, call it a directional asymmetry. They report that it is universal, shared by all known languages, and that it shapes the way we speak and write in any language.

The researchers believe that, by using the asymmetry, “we can infer the direction of unknown languages, too,” because they will also have the same property.

“This is a significant step towards deciphering undeciphered texts, including Indus valley inscriptions,” Mohammad Izhar Ashraf, a research scholar at the B.S. Abdur Rahman University, Chennai, and a co-author of the study, told The Wire. In keeping with that, Ashraf and his colleagues have confirmed that the Indus valley script, which remains undeciphered to this day, flows from right to left.

“For me, the work is most valuable in showing that [asymmetry] is a universal property of all languages and scripts,” said P.P. Divakaran, a retired professor of the Tata Institute of Fundamental Research, Mumbai. “The confirmation that the Indus script was written [from right to left] is a bonus.” He was not involved with the study.

While the directionality of writing – i.e. which way the text flows – in the Indus valley script has been discussed by scholars before, the present paper provides strong quantitative evidence, based on statistical considerations, that it flows from right to left.

Divakaran thinks that the present work fits in very well in the context of renewed interest in the study of syntactical properties of the Indus Valley script. “I think, it will play an increasingly important role in its eventual decipherment,” he said.

The Gini index

The researchers – Ashraf and Sitabhra Sinha, the latter a professor at Matscience – first compiled lists of the most frequently used words in 25 languages written using different systems, ranging from those using alphabets (such as English) to syllabic systems (Japanese kana) to logographic ones (Mandarin). The languages they used in their study include Hindi, English, Chinese, Hebrew, Arabic, German, Turkish and Russian, among others. They also factored in inscriptions written in languages that are now no longer spoken, such as ancient Greek, Egyptian hieroglyphs, Sumerian cuneiform and the Linear B script used in ancient Crete and Southern Greece.

Then they calculated the frequency with which certain signs or letters appeared at the beginnings and ends of the words. With English, for example, they analysed the number of times the letter ‘A’ appeared at the beginnings and ends of words; then the letter ‘B’, and so on, up to ‘Z’. Finally, they plotted the distribution of their occurrences.
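The counting step described above can be sketched in a few lines of Python. This is only an illustration: the function name `terminal_counts` and the toy word list are assumptions made here, not the study's code or data.

```python
from collections import Counter

def terminal_counts(words):
    """Count how often each sign starts a word and how often each ends one."""
    words = [w for w in words if w]          # ignore empty entries
    first = Counter(w[0] for w in words)     # sign in the first position
    last = Counter(w[-1] for w in words)     # sign in the last position
    return first, last

# Hypothetical toy word list (not the study's corpus).
sample = ["the", "and", "that", "have", "with", "this", "from", "they"]
first, last = terminal_counts(sample)
print(first.most_common(3))
print(last.most_common(3))
```

With a real frequency list in place of `sample`, the two counters give the distributions that the researchers went on to compare.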

They found that the signs used at the beginning of a word tend to be more uniformly distributed than those used at the end. “That means we have [more] freedom of choice while starting our utterances than when we end them,” Ashraf said. “It could be due to phonotactic restrictions inherent to [each] language.”

Next, they set about quantifying the distribution using information entropy and the Gini index. Entropy measures how random a distribution is. They found that the entropy value for the distribution of signs is maximum for the beginnings of the words, meaning that this position has the largest variety of letters and/or signs in use. The endings, on the other hand, showed low entropy, meaning that a few signs are used more commonly to end words with.
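As a rough illustration of the entropy measure, the sketch below computes Shannon entropy in bits from a table of sign counts. The use of base-2 logarithms and the function name `shannon_entropy` are assumptions made here; the study may normalise or define the quantity differently.

```python
import math
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no observations")
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Four signs used equally often: the highest entropy possible for four signs.
uniform = Counter("abcd")
# One sign dominating: much lower entropy.
skewed = Counter("aaaaaaab")

print(shannon_entropy(uniform))   # 2.0 bits
print(shannon_entropy(skewed))    # well below 1 bit
```

A position where many signs are used about equally often behaves like `uniform`, while a position dominated by a few signs behaves like `skewed`.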

The Gini index measures the degree of inequality in a distribution. The researchers used it to check whether the difference in the frequencies of letters or signs at the beginnings and ends of words was an artefact of the way entropy is defined or something more real.

As Sinha explained, a higher Gini index meant greater inequality, with G = 1 corresponding to perfect inequality: only one sign is used all the time while the others are never used. A lower Gini index meant more equality, with G = 0 corresponding to perfect equality, i.e., all signs are used with exactly the same frequency.
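One common way to compute a Gini index from a list of frequencies is sketched below. The sorted-rank formula used here is a standard textbook form and is an assumption on our part; the paper may use a different but equivalent definition. Note that with a finite inventory of n signs, this form reaches 1 - 1/n rather than exactly 1 in the most unequal case, and signs that never occur must be included as zeros.

```python
def gini(freqs):
    """Gini index of non-negative frequencies (0 = perfectly equal use)."""
    xs = sorted(freqs)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero frequency")
    # Rank-weighted sum over the sorted values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([5, 5, 5, 5]))    # all signs used equally
print(gini([0, 0, 0, 10]))   # one sign used all the time
```

The first call returns 0, matching Sinha's description of perfect equality; the second returns 0.75, the largest value this form allows for four signs.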

The Gini index was lower for signs at the beginnings of words than for those at the ends, meaning that signs are used more equally at the start of a word than at its end.

Semitic v. Roman

The way their analysis helps make sense of extinct languages is, as Ashraf said, “If it’s a writing system, it will have an asymmetry, because all languages have it. That helps in finding the directionality of the script, which is a critical first step in eventually understanding the language.”

When they looked at the terminal sign distribution of Indus valley inscriptions, they found that “right-side sign distribution is more uniform than left side,” in Sinha’s words. “It implies that the Indus valley inscriptions should be read from right to left direction.”
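The reasoning in the paragraph above can be sketched as a small procedure: record each inscription in a fixed visual order (here, left to right), compare how evenly signs are used at the two edges, and take the more uniform edge as the likely starting point. The function names, the tie handling and the toy sequences are all assumptions made for illustration; this is not the authors' implementation and the data are invented.

```python
from collections import Counter

def gini(freqs):
    """Gini index of non-negative frequencies (0 = perfectly equal use)."""
    xs = sorted(freqs)
    n, total = len(xs), sum(xs)
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

def infer_direction(sequences):
    """Guess the reading direction of sequences recorded left to right."""
    seqs = [s for s in sequences if len(s) > 0]
    left = Counter(s[0] for s in seqs)     # signs at the left edge
    right = Counter(s[-1] for s in seqs)   # signs at the right edge
    inventory = set(left) | set(right)     # unseen signs count as zero
    g_left = gini([left[s] for s in inventory])
    g_right = gini([right[s] for s in inventory])
    # The more uniform (lower Gini) edge is taken as the start of the text.
    if g_left < g_right:
        return "left-to-right"
    if g_right < g_left:
        return "right-to-left"
    return "undetermined"

# Invented toy inscriptions: varied signs on the right, repeated signs on the left.
toy = [("Z", "a"), ("Z", "b"), ("Z", "c"), ("Y", "d")]
print(infer_direction(toy))
```

For the toy data the right edge is the more uniform one, so the sketch reports "right-to-left". A real analysis would need far more inscriptions and a test of whether the difference is statistically meaningful.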

Sinha added that Iravatham Mahadevan, a former civil servant, was among the first scholars to have figured out that the Indus valley script is written from right to left by looking at how inscriptions were cramped up towards the left. This suggested “that the writer started from right and then, halfway through, realised there was not enough space to finish the inscription in the available space and so started to put the subsequent signs closer together.”

The archaeologist B.B. Lal had also deduced the right-to-left flow of the script in the 1960s by examining overlapping lines in the inscriptions, according to Sinha.

Among the languages in use today, Semitic languages such as Arabic are written from right to left, while languages that use the Roman alphabet are written from left to right. There are also examples of scripts written from top to bottom, from bottom to top, and in a direction that alternates with every other line.

Shared structural characteristics

A single line of an Indus valley inscription consists of one to 14 signs. The longest so far known has 26 signs in three distinct lines. Scholars differ in their estimation of the total number of signs the Indus script has. Generally, they say it’s around 400, while one scholar puts it at about 676, explained Nisha Yadav, a researcher at the Tata Institute of Fundamental Research who was not involved with the study.

An intriguing problem with the Indus script is, as Yadav put it, “We do not know what each sign corresponds to in the case of the Indus script. In fact, we have no clue about the underlying contents of the script.” However, stronger constraints on letters at the ends of words “suggest that words in different languages seem to share certain structural characteristics and that is interesting.”

Ashraf was quick to add that he and Sinha hadn’t factored in an underlying grammar. The focus of the paper, he said, was to analyse words at the level of individual sign use rather than using many words together.

Undeciphered languages are also expected to exhibit the same asymmetry and so, according to them, their analysis could help efforts to unravel them.

While the present work emphasises the asymmetry itself, the outstanding problem, according to Divakaran, could be “to understand the reason(s) for the universality of the asymmetry.” One reason, he thinks, could be that our sound-producing organs find it easier to utter certain combinations of sounds than others.

The study was published in the journal PLOS One on January 17, 2018.

G.B.S.N.P. Varma is a freelance journalist.