Can Statistics Help Crack the Mysterious Voynich Manuscript?

28/08/2021

Some pages of the Voynich manuscript. Photo: Unknown author, Beinecke Rare Book & Manuscript Library, Yale University/Wikimedia Commons, Public Domain

The 15th century Voynich manuscript has puzzled scholars and confounded attempts to decipher it for centuries.
Above the word level, at the level of pages or sections of the manuscript, there is an internal structure that looks pretty similar to other natural languages.
Researchers are applying different cypher methods to medieval and historical texts to see if the result matches the statistical properties of ‘Voynichese’.

The 15th century Voynich manuscript has puzzled scholars and confounded attempts to decipher it for centuries. Its 200-odd pages contain dozens of colourful illustrations of plants, astrological diagrams and naked female figures bathing in elaborately plumbed pools of green water.

Stranger still, the manuscript is not written in any known script or language. If it’s written in code, no one has cracked it, though many have tried.

The manuscript takes its name from Wilfrid Voynich, the Polish-born antiquarian who acquired and publicised it in the early 20th century. Some scholars have argued that the text is gibberish, the document an elaborate hoax. Others have variously claimed that the underlying language is Latin, or one of the Romance languages or Hebrew. In 2018, a pair of researchers, citing the apparent similarity of some of the manuscript’s plant illustrations to the flora of Central America, claimed that the manuscript was produced by the ancient Aztecs. None of these claims has gained widespread acceptance.

The manuscript now resides in the Beinecke Rare Book & Manuscript Library at Yale University. “It’s surprisingly small, a bit bigger than a paperback,” says Yale linguist Claire Bowern. It appears to have five main sections, she says. The section on plants is the longest, making up just over half of the manuscript. The astrological section includes zodiac charts and depictions of the sun and moon. The section with the bathing nymphs is often called the balneological section, a reference to the science of baths and bathing. A “pharmaceutical” section depicts what may be herbal remedies – plant roots alongside medicine bottles – and a fifth section, unillustrated, has blocks of text demarcated with little stars.

The mystery surrounding the Voynich manuscript has inspired novels, cameo appearances in popular TV shows and video games, and even a symphony – which debuted at Yale in 2017, along with an exhibit Bowern attended with a couple of her students. Seeing the manuscript in person got Bowern thinking: Even though her main research focus is on documenting endangered Indigenous languages in Australia (where she’s from), perhaps some of the statistical methods, software and approaches that she and other linguists use to study and compare languages could be used to study the Voynich manuscript.

Bowern created and taught an undergraduate class to explore the possibilities, which she and post-doctoral researcher Luke Lindemann describe in a recent paper in the Annual Review of Linguistics. She spoke with Knowable about some of their insights. This conversation has been edited for length and clarity.

Do we know where the manuscript came from or who created it?

No, not at all. We know that the manuscript was in Prague in the early 1600s. And from there it went to the library of the Jesuit scholar Athanasius Kircher, and presumably stayed there until it ended up in a Jesuit archive outside Rome, where Wilfrid Voynich found it in 1911 or 1912.

Voynich himself shrouded the manuscript in mystery. He was never clear in his lifetime about where it came from. He said he found it in a castle, but that seems like he was trying to be unclear about where he got it.

Wilfrid Voynich. Photo: Unknown author/Wikimedia Commons, Public Domain

Was he trying to increase the price he could get by creating an air of mystery around it? Or what was he up to?

Partly that, and also it’s not quite clear whether he obtained the manuscript totally above board. He received a number of manuscripts from the Jesuit archives, and it’s not quite clear whether they knew that this manuscript was part of it, or whether the person who was selling the manuscript had the authority to do so.

Is there any chance he created the manuscript himself?

I’m pretty comfortable saying this is an early 15th century object. We get that from the carbon dating of the parchment, which puts it between 1404 and 1438. The type of ink is typical of what was used in that time period, and the clothing of the figures in the illustrations and so on are all consistent with that time period. Of course, it could be a copy of even earlier material, just as we have modern paperbacks of Shakespeare but the plays themselves go back hundreds of years.

Why would someone in those days create a ciphered manuscript?

I think people in the medieval period probably acted from similar sorts of motivations to people these days. So, why do people encipher things in general? Either to hide it from people who shouldn’t see it, or to create some sort of in-group solidarity type of thing.

One theory that’s come up, which I’m not sure I buy into, is that this was witchcraft or it was a manuscript that contained information that the Catholic Church didn’t want to get out. But that strikes me rather more as a Dan Brown scenario than something that might have actually happened.

We do have examples of information being made secret, but it’s military information or political information, and it’s 100 or 150 years later. Books of herbal remedies, on the other hand, were widely distributed and not secret. So that raises the question of why someone would have enciphered information that was readily attainable.

One possible analogy is the technical terminology in academia. As a linguist, I have a huge number of technical terms I use with other linguists, and they’re not exactly meant to keep people out, but they’re a shorthand way of talking with other linguists and a marker that I’m part of the in-group of knowledgeable individuals. So maybe we should think about this not so much as hiding information from others, but more as a kind of in-joke or preservation of knowledge for people who knew that particular language or way of writing.

Some people still think it’s a medieval hoax. Why?

One reason is that – I’m going to be flippant about this – we haven’t solved it yet, so maybe there is nothing to solve. Another is that the language in some ways looks very different from other natural languages. Languages have particular statistical properties, which are very difficult to consciously manipulate. At the word level, “Voynichese” looks very, very different from other languages.

For example, we can look at how predictable different letters in a writing system are. For instance, in English, if I think of a word whose first letter is q, then it’s pretty likely the second letter is going to be u. We can calculate this for sequences of characters in different languages. It’s a metric called the h2, or second-order conditional entropy, and there is a range of values, between three and four, for languages across the world. For Voynichese it’s more like two, which makes it look at first sight like Voynichese is maybe not a natural language. The character sequences are much more predictable than they are in other languages.
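To make the idea of character predictability concrete, here is a minimal Python sketch of one common way to compute such a measure: the entropy of a character given the single character before it. It is an illustration only, not the researchers' code. The function name and the sample sentence are assumptions, and how the text was cleaned or transliterated before measuring is not specified in this interview.

```python
import math
from collections import Counter

def conditional_entropy(text):
    """Entropy, in bits, of a character given the one before it.

    Computed as H(pairs of adjacent characters) - H(first character of each pair).
    Lower values mean the next character is easier to guess.
    """
    pairs = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
    firsts = Counter(text[:-1])            # counts of the leading character of each pair
    total = sum(pairs.values())

    def entropy(counts):
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    return entropy(pairs) - entropy(firsts)

# Hypothetical sample; a real measurement needs a far longer text.
sample = "the manuscript is not written in any known script or language"
print(round(conditional_entropy(sample), 3))
```

A text in which every character fully determines the next, such as "abababab", scores zero. Values from a sentence this short are not comparable to the range quoted above, which refers to much larger bodies of text.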

So why do you think it actually is natural language?

When you look above the word level, at the level of pages or sections of the manuscript, we find internal structure that looks pretty similar to other natural languages. For instance, we can look at the way particular words cluster on a page. If you think about a newspaper, it has stories about particular items that use a lot of vocabulary related to that item. A story about COVID will have a lot of COVID-related vocabulary.

The Voynich manuscript shows that same sort of topic distribution. There are words used in the herbal pages that are not used in other parts of the manuscript. We can look at these computationally, and the sort of clustering of words we see on pages of the Voynich manuscript is extremely unlikely to be random.
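One simple way to check whether this kind of clustering could arise by chance is a shuffle test: count how many repeated words are confined to a single section, then see how often that count is matched when the same words are dealt out to pages at random. The Python sketch below assumes this particular statistic and uses invented English stand-in words. It is not the method or the data used in the study.

```python
import random
from collections import defaultdict

def section_exclusive_words(pages):
    """Count repeated words that occur in exactly one section.

    `pages` is a list of (section_label, list_of_words) tuples.
    """
    sections_of = defaultdict(set)
    counts = defaultdict(int)
    for section, words in pages:
        for word in words:
            sections_of[word].add(section)
            counts[word] += 1
    return sum(1 for w in counts if counts[w] > 1 and len(sections_of[w]) == 1)

def shuffled_baseline(pages, trials=200, seed=0):
    """The same count after shuffling all words across pages, keeping page lengths."""
    rng = random.Random(seed)
    pool = [word for _, words in pages for word in words]
    results = []
    for _ in range(trials):
        rng.shuffle(pool)
        it = iter(pool)
        shuffled = [(section, [next(it) for _ in words]) for section, words in pages]
        results.append(section_exclusive_words(shuffled))
    return results

# Toy pages with made-up vocabulary; the section labels are hypothetical too.
pages = [
    ("herbal", ["leaf", "root", "leaf", "stem", "the"]),
    ("herbal", ["root", "stem", "leaf", "the", "root"]),
    ("astro", ["star", "moon", "star", "sun", "the"]),
    ("astro", ["moon", "sun", "star", "the", "moon"]),
]
observed = section_exclusive_words(pages)
baseline = shuffled_baseline(pages)
share = sum(b >= observed for b in baseline) / len(baseline)
print(observed, sum(baseline) / len(baseline), share)
```

If the observed count sits well above nearly every shuffled count, random placement of words is a poor explanation for the pattern.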

Are there other insights that could come from using these kinds of computational tools?

We know from previous work that different parts of the manuscript were written by different scribes, so we looked at how the parts written by each scribe lined up with topics and word use. Scribe Four, for example, wrote all of the astrological and astronomical sections, whereas other scribes seem to have collaborated on other sections. Our computational modelling suggests that different scribes also had slightly different ways of writing or perhaps used different encipherment mechanisms.

For example, the scribes seem to share a substantial common vocabulary, but it appears as if some scribes are using certain terms or spelling words differently than other scribes, perhaps like how Australian and US English have subtle differences. Of course, we don’t know any of the words in Voynichese and how they’re spelled, so it’s impossible to know what’s really behind these differences.

Does this work reveal anything about how the manuscript was created?

It raises some questions, but overall it suggests the scribes were working in a consistent way, despite these minor differences. If it’s just gibberish, why would it matter if they’re all working the same way? I might expect that if the scribes each had their own encipherment methods then we would find much more differentiation.

Are you still working on this?

One thing we’ve just started working on is exploring what type of encipherment methods give the sort of odd character distributions — the low h2 — we see in Voynichese. We have a corpus of text from Wikipedia and an ancient language corpus of digitised materials from medieval and historical texts, and we can apply different cypher methods to these texts and see if the result matches the statistical properties of Voynichese.

We can also test other people’s claims about the underlying language and cypher mechanism. So if the claim is that it’s Latin encoded with a 15th century Crema cypher, we can take a block of Latin and apply a Crema cypher and see if it has similar properties to Voynichese. But what we’re finding so far is that the language we test doesn’t matter that much, because all natural languages are sufficiently alike and Voynichese is so different at this level.

That implies to me that someone did some very deliberate manipulation to the way the language is written, which coincidentally made it very different from other natural languages at the word level. But at the same time, they included information that ultimately works the same way as other natural languages, and we can see that structure when we look at sections or pages of the manuscript. It’s a very interesting contradiction. It’s quite a puzzle!
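The kind of comparison described in this answer can be sketched in a few lines of Python: encipher a sample text in different ways and measure how predictable the characters become. Both schemes below are toy assumptions chosen for illustration. Neither is the Crema cypher, and neither is a claim about how the manuscript was made. The first swaps each letter for another letter. The second is a "verbose" scheme that writes each letter as two symbols.

```python
import math
from collections import Counter

def h2(text):
    """Entropy, in bits, of a character given the character before it."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    total = sum(pairs.values())

    def entropy(counts):
        return -sum(n / total * math.log2(n / total) for n in counts.values())

    return entropy(pairs) - entropy(firsts)

def letters_only(text):
    """Keep only the letters a-z, lowercased (spaces and punctuation are dropped)."""
    return "".join(c for c in text.lower() if "a" <= c <= "z")

def simple_substitution(text, shift=3):
    """Toy one-for-one substitution: shift each letter along the alphabet."""
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) for c in text)

def verbose_substitution(text):
    """Toy verbose scheme: each letter becomes two symbols, e.g. 'a' -> 'Aa'."""
    return "".join(c.upper() + c for c in text)

# Hypothetical sample; a real test would use a large corpus.
plain = letters_only("the manuscript is not written in any known script or language")
for name, enciphered in [
    ("plain", plain),
    ("simple substitution", simple_substitution(plain)),
    ("verbose substitution", verbose_substitution(plain)),
]:
    print(name, round(h2(enciphered), 3))
```

A one-for-one substitution only relabels the characters, so it leaves the measure unchanged. The verbose scheme makes every second symbol fully predictable from the first, which roughly halves the value. This is one reason a plain letter swap on its own cannot account for unusually predictable character sequences.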

Do you think it will ever be solved?

I don’t know. I think it’s quite possible that we’ll have a pretty good idea about how the language was constructed, but we won’t be able to undo the code and recover the message. It’s not impossible, but I think that’s pretty unlikely at this stage unless we find an original source manuscript. Let’s just say: I’m having fun learning more about the manuscript without any expectation that I will ever be able to read what’s underneath.

Greg Miller is a science journalist based in Portland, Oregon.

This article originally appeared in Knowable Magazine, an independent journalistic endeavor from Annual Reviews.