How to calculate how difficult a language is
How accurate are the infographics floating around the web these days about language difficulty? Pretty inaccurate! They're mostly subjective opinions based on cursory glances at writing systems. And writing systems have nothing to do with achieving fluency in a foreign language! (We'll discuss the many reasons why you should avoid learning foreign scripts at the beginning in a future article.)
Today's post takes a very scientific approach to calculating how difficult languages are to learn based on your base language.
I'm going to tackle two very commonly asked questions, but the real answers to these questions require a considerable amount of explanation:
- Is there an objective method for measuring language difficulty?
- What are the most difficult languages in the world?
Is there an objective method for measuring language difficulty?
To come up with an objective way to measure difficulty between languages, I had to start with the generally accepted assumptions people hold about commonly studied languages such as French, German, and Spanish, and about languages assumed to be more difficult, such as Russian, Chinese, and Japanese. The test was whether the model I built could replicate, or at least confirm, these norms.
Although my views may be very different from other people's (for example, I find Chinese an easy language), I did not let my own perceptions bias the development of the model.
But what does it mean when I say that I find Chinese an easy language? I'm not being specific enough. And this is the fault that I find with most of the discussion happening online. What I mean is that grammatically speaking, Chinese is perhaps one of the easiest languages I have learned. However, its syntax is tricky and its phonology is definitely a challenge for English speakers.
So my objective model is built around three aspects of the language, all of which disregard the writing system entirely because writing systems should not create impediments for us in the process of acquiring fluency in a foreign language.
These three aspects of difficulty are:
- vocabulary acquisition
- syntax and morphology for fluency
- phonology for fluency
Not included: writing systems.
The majority of people agree that French and German are more difficult to pronounce than Spanish because they have a lot more vowels and consonants. French is probably harder due to the spelling, but I want to throw this wild card out and focus specifically on whether its phonology is harder or not.
Most people would agree that French and Spanish verbs are tough because of the large number of conjugations that need to be learned. German is a little easier in this regard, but its difficulty lies in the nouns. When German is compared with Russian or Icelandic, the model should reflect their greater complexity.
What I really wanted to find out from the data was the relative difficulty of languages like Polish, Hungarian, Russian, and Chinese, and whether it was possible to determine which of these is actually easier to learn. In the end, the results look accurate for all the languages so far, and quite surprising in certain cases.
The final model uses a point system: one point added for each set of data collected. This can only work if data is collected from all languages for each data point. So if certain data cannot be collected, then that data point is deleted from all languages. Most of the data can be acquired from the WALS online database.
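The rule that a data point is dropped from every language when it is missing for any one of them can be sketched as follows. This is a minimal illustration, not the code behind the model: the function name `drop_incomplete_features`, the feature names, and the values are all hypothetical, and a missing value is assumed to be represented by `None` or an absent key.

```python
from typing import Dict, Optional

Profile = Dict[str, Optional[int]]


def drop_incomplete_features(profiles: Dict[str, Profile]) -> Dict[str, Dict[str, int]]:
    """Keep only the data points that have a value for every language."""
    # Collect every feature name seen in any profile.
    all_features = set()
    for profile in profiles.values():
        all_features.update(profile)

    # A feature survives only if no language is missing it.
    complete = {
        feature
        for feature in all_features
        if all(p.get(feature) is not None for p in profiles.values())
    }

    return {
        language: {feature: profile[feature] for feature in sorted(complete)}
        for language, profile in profiles.items()
    }


# Hypothetical values, for illustration only.
profiles = {
    "Language A": {"tenses": 2, "vowels": 5, "tones": None},
    "Language B": {"tenses": 3, "vowels": 7, "tones": 4},
}
cleaned = drop_incomplete_features(profiles)
print(cleaned)
```

Here "tones" is removed from both profiles because Language A has no value for it, so the two languages are always compared on the same set of data points.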
Let's look at each of the categories to see how I calculated difficulty.
- Same subbranch: 0 pts
- Different subbranch: 1 pt
- Different branch: 10 pts
- Different family: 100 pts
All languages of the world can be classified into families. The inherited vocabulary of all the languages within a family comes from one source: the proto-language of that family. For example, English is an Indo-European language. So the vocabulary that English inherited from the proto-language has cognates in Bengali, Russian, Armenian, Greek, and so on. This is not an opinion; it's fact. I'll be writing more in a future post about how to learn massive amounts of vocabulary in a relatively short period of time using language family strategies.
Within families are subgroups and sometimes multiple layers of subgroups. These multiple layers of subgroups can be ignored at first. What we're looking for is how close the reflexes (the changes in the vocabulary passed down from the proto-language) are across these subgroups. Since the Germanic and Romance languages share a large amount of vocabulary with each other, with a relatively small number of differing reflexes, I group Germanic and Romance into a macro-group, which is further subdivided within. This means that for an English (Germanic) speaker learning German (Germanic) or French (Romance), there will only be a slight difference in difficulty.
Since most of our English colloquial language is more Germanic in nature, speaking colloquially and fluently in another Germanic language will feel very similar to how we express ourselves in English. The differing word order and grammar experienced with German will be dealt with in the next section. So learning any language within Germanic is awarded one point of difficulty in terms of vocabulary. Learning any language outside this group gets 10x more difficult in terms of vocabulary, including Slavic, Greek, and Indo-Iranian languages.
Most of the vocabulary that English shares with the Romance languages belongs to a higher, more prestigious register of English (because that vocabulary came directly from Latin and the Romance languages), so it could pose a bit more of a challenge to speak colloquially in these languages than it would in Germanic languages. I have given it a halfway mark in terms of difficulty: 5 points.
Any language outside of our Indo-European language family is completely different in terms of vocabulary, and I have given it a rating of 10x more difficult than any language within Indo-European, in other words: 100 points. This means that Arabic, Japanese, Chinese, etc, all start with 100 points of difficulty right from the beginning.
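The vocabulary tiers can be sketched as a small lookup. This is one possible reading of the scheme rather than the exact implementation: the `Classification` type and the function name are hypothetical, the subbranch labels are illustrative, and applying the 5-point Germanic/Romance macro-group in both directions is an assumption.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    family: str
    branch: str
    subbranch: str


# Assumption: Germanic and Romance form the macro-group described above.
MACRO_GROUP = {"Germanic", "Romance"}


def vocabulary_points(base: Classification, target: Classification) -> int:
    """Score vocabulary distance using the 0 / 1 / 5 / 10 / 100 tiers."""
    if base.family != target.family:
        return 100
    if base.branch != target.branch:
        # Halfway mark for two branches inside the same macro-group.
        if base.branch in MACRO_GROUP and target.branch in MACRO_GROUP:
            return 5
        return 10
    if base.subbranch != target.subbranch:
        return 1
    return 0


# Illustrative classifications; the subbranch labels are assumptions.
ENGLISH = Classification("Indo-European", "Germanic", "West Germanic")
SWEDISH = Classification("Indo-European", "Germanic", "North Germanic")
FRENCH = Classification("Indo-European", "Romance", "Gallo-Romance")
RUSSIAN = Classification("Indo-European", "Slavic", "East Slavic")
JAPANESE = Classification("Japonic", "Japonic", "Japanese")

for name, target in [("Swedish", SWEDISH), ("French", FRENCH),
                     ("Russian", RUSSIAN), ("Japanese", JAPANESE)]:
    print(name, vocabulary_points(ENGLISH, target))
```

Checking the family first and the subbranch last means each pair of languages lands in exactly one tier.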
I expect to improve the way that I have calculated this over time. One way I can improve it is by calculating the relative ease with which new vocabulary can be learned after gaining a foundation. English continues to prove unreasonably difficult even as students progress to more advanced levels. This is not the case with Chinese or Georgian, where advanced vocabulary is made up of logical compounds of vocabulary already acquired. In fact, the majority of the world's languages follow suit.
There are many things measured in this section. Here is a quick list with explanations:
- Language type: isolating (1 pt), agglutinating (2 pts), fusional (3 pts), polysynthetic (4 pts). Take the difference between the two languages, and count it only if the target language has the higher value, e.g. Chinese (1) to German (3) gets 2 points, German (3) to Chinese (1) gets 0 points.
- Word order (SVO, SOV, VSO, VOS, OVS, OSV): 1 pt increase if different.
- Adjective-Noun order (AN, NA): 1 pt increase if different.
- Genitive (possessor) - Noun order (GN, NG): 1 pt increase if different.
- Determiner-Noun order (DN, ND): 1 pt increase if different.
- Relative (clause) - Noun order (RN, NR): 1 pt increase if different.
- Noun Declension (Noun classes x Endings): Subtract differences when target language has more. Don't count declensions if they play the same role as postpositions, as in Uralic and Altaic languages.
- Tenses (count only marked, ex: English doesn't mark future tense but Spanish does): Subtract differences when target language has more.
- Aspect (perfect(ive), imperfect(ive), aorist, include subjunctive and conditional if marked; imperative has been left out): Subtract differences when target language has more.
- Mood: I did not count this category because the data was inconsistent across languages. However, high-frequency moods such as the subjunctive and conditional were included in the aspect category and counted for every language.
- Conjugation (person, number expressed in the verb): Subtract differences when target language has more.
- Adposition (preposition, circumfix, postposition, Austronesian focus): add 1 pt when different.
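The two scoring rules in this list (a one-way difference for graded or counted features, and a flat point for order features that differ) can be sketched like this. The function and feature names are hypothetical. The Japanese and Spanish profiles are partial and leave out the counted categories (declension, tenses, aspect, conjugation), so the result is not the full score for that pair.

```python
from typing import Dict, Union

Value = Union[int, str]

# Scored as max(0, target - base): only extra complexity in the target counts.
ONE_WAY = ("language_type", "noun_declension", "tenses", "aspect", "conjugation")
# Scored as 1 point when the two languages differ, 0 otherwise.
DIFFERENT = ("word_order", "adjective_noun", "genitive_noun",
             "determiner_noun", "relative_noun", "adposition")


def grammar_points(base: Dict[str, Value], target: Dict[str, Value]) -> int:
    """Sum syntax and morphology points for features present in both profiles."""
    points = 0
    for feature in ONE_WAY:
        if feature in base and feature in target:
            points += max(0, target[feature] - base[feature])
    for feature in DIFFERENT:
        if feature in base and feature in target:
            points += int(base[feature] != target[feature])
    return points


# The language-type example from the list above.
chinese = {"language_type": 1}
german = {"language_type": 3}
print(grammar_points(chinese, german))  # 2
print(grammar_points(german, chinese))  # 0

# Partial, illustrative profiles (counted categories omitted).
japanese = {"language_type": 2, "word_order": "SOV", "adjective_noun": "AN",
            "genitive_noun": "GN", "determiner_noun": "DN",
            "relative_noun": "RN", "adposition": "postposition"}
spanish = {"language_type": 3, "word_order": "SVO", "adjective_noun": "NA",
           "genitive_noun": "NG", "determiner_noun": "DN",
           "relative_noun": "NR", "adposition": "preposition"}
partial_score = grammar_points(japanese, spanish)
```

Skipping features that are missing from either profile keeps the sketch consistent with the rule that a data point only counts when it has been collected for every language.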
Let's take a look at a few languages in our database:
If we calculate between them, what this means is:
A German speaker learning French would end up with a difficulty score of 6 Points.
A Japanese speaker learning Spanish would get a difficulty score of 13 Points.
A Chinese speaker learning Polish would get a difficulty score of 34 Points.
Phonology for Fluency
Again, many things are measured here. Total calculations take into account the difference in total phonemes. If phonemes (the distinct sound units of a language) have significantly different allophones (the real sounds people actually say), then these are included. If allophones are inconsistent or quite different across a population, then these are not counted, or we separate the language into two dialects as we have for Mandarin.
12 Points of Articulation
- Labials: one point for each sound, subtract the difference when greater
- Labio-dentals: one point for each sound, subtract the difference when greater
- Dentals: one point for each sound, subtract the difference when greater
- Alveolars: one point for each sound, subtract the difference when greater
- Post-alveolars: one point for each sound, subtract the difference when greater
- Retroflex: one point for each sound, subtract the difference when greater
- Alveo-palatals: one point for each sound, subtract the difference when greater
- Palatals: one point for each sound, subtract the difference when greater
- Velars: one point for each sound, subtract the difference when greater
- Uvulars: one point for each sound, subtract the difference when greater
- Pharyngeals: one point for each sound, subtract the difference when greater
- Glottal: one point for each sound, subtract the difference when greater
Vowels and Intonation
- Vowels: total number of vowels including nasal and length
- Intonation / Tone: Each tone and each tone sandhi rule gets one point. If the language has pitch accent, one point is given for each pitch. If it is intonational, one point is given for each kind of intonation.
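One way to read the phonology rules is that each category (the twelve points of articulation, plus vowels and tone) holds a count of sounds, and the target language earns points only for the sounds it has beyond the base language's count in that category. The sketch below follows that reading. The function name is hypothetical and the counts are invented for illustration, not taken from any real language.

```python
from typing import Dict


def phonology_points(base: Dict[str, int], target: Dict[str, int]) -> int:
    """Sum, per category, the sounds the target has beyond the base language."""
    categories = set(base) | set(target)
    return sum(
        # A category missing from a profile is treated as having zero sounds.
        max(0, target.get(category, 0) - base.get(category, 0))
        for category in categories
    )


# Hypothetical counts, for illustration only.
base = {"labials": 3, "uvulars": 0, "vowels": 5, "tones": 0}
target = {"labials": 3, "uvulars": 2, "vowels": 4, "tones": 4}

print(phonology_points(base, target))  # 6: two uvulars plus four tones
print(phonology_points(target, base))  # 1: one extra vowel
```

The score is deliberately asymmetric: moving to a language with a smaller inventory in some category adds nothing for that category.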
Let's take a look at a few languages in our database:
If we calculate between them, what this means is:
A German speaker learning French would end up with a difficulty score of 1 Point.
A Japanese speaker learning Spanish would get a difficulty score of 11 Points.
A Chinese speaker learning Polish would get a difficulty score of 15 Points.
Writing/script/orthography should not be concerns because: if a language has writing, the language has been encoded. If the language has been encoded, then there are most certainly dictionaries available which means it's easier to learn the language.
Therefore, the most difficult languages are those that meet the following criteria:
- don't have writing systems (like most minority languages and undocumented divergent dialects, such as the Romani languages spoken by the Roma in Europe); the hardest language is of course one we cannot even approach to document, such as Sentinelese, spoken on the impossible-to-visit North Sentinel Island;
- have large phonology sets (like Ubykh, which had 84 consonants);
- are polysynthetic in nature (like Greenlandic or Bella Coola);
- have complex grammars.
Most of the official languages of the world are not polysynthetic and have few points of articulation. A language that meets all of these criteria would be Nuxalk (Bella Coola), which is only written by linguists to record the grammar.
In the next post, I'm going to put some real numbers on all of this theory to compare the relative difficulty between a dozen languages.
Proto-language: a language, usually hypothetical or reconstructed and unattested, from which a number of attested (documented) languages are believed to have descended by evolution, that is, by slow modification of the proto-language into the languages that form a language family.