Historical Linguistics Part 2: Language Evolution, The Comparative Method and Language Families

Cool Linguistics Stuff
Author

Vaishnav Sudarshan

Originally Published

June 30, 2026

Last Updated

July 1, 2026

Abstract
How did old people figure out the ancestor language to all the Indo-European languages? Or the Dravidian languages? Or any other group of languages that people believe to be related? The comparative method is how.

Introduction

Imagine you see the following Romanized words for “three” in a bunch of different languages:

Language | Word for “three” | IPA
English | three | [θɹiː]
German | drei | [dʁaɪ]
Swedish | tre | [tɾe]
Latin | trēs | [ˈtreːs]
French | trois | [tʁwa]
Sanskrit | tráyas | [ˈtr̩ajəs]
Romani | trin | [tɾin]
Bengali | tin | [tɪn]
Tajik | se | [se]
Ancient Greek | treîs | [treːs]
Russian | tri | [trʲi]
Polish | trzy | [tʂɨ]
Irish | trí | [tɾʲiː]

Most of these look very similar. They are almost certainly not just loanwords, because that would be historically implausible: many of these languages were never in contact with each other. Additionally, loanwords get adapted to the phonology of the language borrowing them, but the differences we see here don’t look like that kind of adaptation. Loanwords also typically aren’t words for core, fundamental concepts like numbers, but rather for some idea the speakers didn’t have access to before.

This means that these words must be related, which means they come from a common ancestor language that split and diverged into all of these different languages.

A simpler, more tangible example is the Romance languages, which most people know come from Latin. For example, the word for “three” in Spanish is “tres”, in Italian is “tre”, in French is “trois”, and in Romanian is “trei”. These are all very similar, and they all come from the Latin word “trēs”. As Latin speakers spread out, their forms and pronunciations changed over time.

Now, you wonder what the common ancestor word for “three” even is. This feels way too open-ended to answer just by looking at the words, because with enough creativity or weirdness, you could probably evolve any word into any other word. However, there is a technique, called the comparative method, that is used to reconstruct many such ancestor words with a pretty high degree of confidence. And if you had many more words and many more languages, you could even reconstruct large parts of the ancestor language and understand a bit about the culture of its speakers. By the way, the answer is *tréyes.

Proto-Languages

So, what is this kind of language that all these words come from? It is called a proto-language: a language that has never been directly observed in speech or writing, but has instead been reconstructed by linguists using the comparative method. While its reconstructed grammar and vocabulary aren’t necessarily 100% correct, because no one knows its exact form, its existence is almost guaranteed, since it could be reconstructed so thoroughly and consistently.

In this case, this language is called Proto-Indo-European. It is estimated to have been spoken around 6000 years ago in the Eurasian Steppe.

The name of a language family, which is a group of languages that are all related to each other, usually shares a name with that of the proto-language. For example, the Indo-European language family descends from Proto-Indo-European, and the Dravidian language family descends from Proto-Dravidian.

Proto-languages can also be the ancient, reconstructed ancestor of any group of languages, not just the whole family. For example, the ancestor of all Germanic languages, like English, German, and Swedish, is called Proto-Germanic.

Another strange example is the Romance languages. Even though we know they are all descended from Latin, there is still a proto-language called Proto-Romance, which sits in between the Latin spoken by the elites in the Roman Empire and the modern Romance languages. It’s obtained by using the comparative method on all the living Romance languages to find their most recent common ancestor. So, proto-languages can even be spoken after a known, recorded language.

The Comparative Method and Backtracking Sound Changes

I’ve mentioned that the comparative method is how linguists reconstruct proto-languages, but how does it work?

An important fact is that sound changes are regular: a rule like “a [k] sound turns into an [h] sound” will typically apply to all words in a language, and not just some of them. (That particular change, by the way, is one of the sound changes from Grimm’s Law, which I talked about in the previous post.) This regularity is what allows us to backtrack sound changes, which is the main technique used in the comparative method.
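To make the idea of regularity concrete, here is a minimal Python sketch. The word list is made up for illustration (toy forms, not real reconstructions), `apply_sound_change` is a hypothetical helper name, and the change is treated as unconditioned, even though real sound changes often depend on the surrounding sounds.

```python
def apply_sound_change(words, old, new):
    """Apply an unconditioned sound change to every word in the list."""
    return [word.replace(old, new) for word in words]


# Toy ancestor forms, invented for this example (not real data).
proto_words = ["kata", "tak", "miko"]

# A regular change hits every [k], no matter which word it is in.
daughter_words = apply_sound_change(proto_words, "k", "h")
print(daughter_words)  # ['hata', 'tah', 'miho']

# Backtracking: assuming the daughter language had no other source of [h],
# undoing the change recovers the ancestor forms exactly.
recovered = apply_sound_change(daughter_words, "h", "k")
print(recovered == proto_words)  # True
```

The backtracking step only works this cleanly because the toy ancestor forms contain no [h] of their own. If two different sounds had merged into [h], the daughter language alone couldn’t tell you which one each word started with, which is one reason you need to compare several languages.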

As a fun note, you can also find out timelines using sound changes. For example, in Arabic, there was a sound change where [p] turned into [f], and this time period is known. Nowadays, since the [p] sound doesn’t exist, speakers might borrow words with that sound but pronounce it like a [b]. So, if an Arabic loanword is pronounced as [b] while other languages pronounce that same word with a [p], then we know that this word was borrowed after the sound change, because if it was borrowed before, then it would be pronounced as an [f] instead.

Also, the Indo-European languages are usually classified as “centum” or “satem” languages (named after the Latin and Avestan words for “hundred”), based on the first consonant in their word for “hundred”. The Indo-Iranian, Armenian, and Balto-Slavic languages underwent a sound change where the *ḱ came to be pronounced as [s]. The Italic, Hellenic, and Celtic languages have the first consonant in “hundred” pronounced like [k], so we know that they split off before that sound change happened (otherwise, there would have to be a sound change that brought the [s] back to [k], and no such change is known). The Germanic languages are where we have to apply multiple sound laws. The first consonant in “hundred” (or “Hundert” in German) is pronounced as [h]. There is no law changing [s] to [h] for Germanic languages (although this does exist for Iranian languages). However, Grimm’s Law does bring *k to [h]. Therefore, we can conclude that the Germanic languages split off before the sound change that shaped the satem languages.

Here is a diagram showing how the word for “hundred” evolved through the different Indo-European languages. This is a much smaller diagram, only showing a few selected branches and languages. It also points out some of the notable sound changes.

graph TD
PIE["PIE: *ḱm̥tóm"] --> Centum["Centum Languages"]
PIE -->|*ḱ->s| Satem["Satem Languages"]
Satem --> PBS["Proto-Balto-Slavic: *śimtas"]
Satem --> PII["Proto-Indo-Iranian: *ćatám"]
PII --> Iranian
Iranian --> Avestan["Avestan: satem"]
PII --> Indo-Aryan
Indo-Aryan --> Sanskrit["Sanskrit: śatam"]
PBS --> Baltic
PBS --> Slavic
Baltic --> Lithuanian["Lithuanian: šimtas"]
Slavic --> Russian["Russian: sto"]
Centum --> Italic
Centum --> Germanic
Italic --> Latin["Latin: centum"]
Latin -->|*k->s| Spanish["Spanish: cien"]
Germanic -->|*k->h| PG["Proto-Germanic: *hundą"]
PG --> English["English: hundred"]

Note that many of the Romance languages, descended from Latin, pronounce it with an [s] or a similar sound, not [k]. This is the same kind of sound change as the one that caused the centum-satem split, but it happened independently and much later. Additionally, the satem languages don’t necessarily pronounce the first consonant as exactly [s], but it is generally a voiceless fricative in around that area, like [ʃ] for instance.

Some sound changes are much more likely to occur than others. For example, it’s much more likely for a /t/ sound to turn into a /tʃ/ sound if it’s near an /i/ or a /j/ sound than if it is near an /o/ or an /a/ sound, because of a trend called palatalization. This is how the Latin word “lacte(m)” turned into the Spanish word “leche”, by way of an intermediate form like “leite” (which is still the word in Portuguese). By contrast, /t/ probably won’t turn into a /ɬ/ near an /i/ or a /j/.

Typically, simplifications or reductions won’t be reversed, because the lost information is gone. A word like French “eau” couldn’t plausibly evolve into Latin “aqua”, because there would be nothing to predict the extra sounds from. So, if all we knew was that either French or Latin is the ancestor of the other, we could still tell that Latin is the ancestor of French.

Looking for corresponding phonemes cross-linguistically and across many cognates is the first step. For example, if you look across Tamil, Telugu, and Kannada to try to reconstruct Proto-South-Dravidian (a descendant of Proto-Dravidian), you might notice that wherever Telugu and Kannada end in a /u/, Tamil reduces this final vowel and turns it into the null phoneme, /∅/.
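Here is a rough Python sketch of what counting sound correspondences could look like. Everything in it is an assumption made for illustration: the three languages A, B, and C are invented, the cognates are toy forms that are already aligned segment by segment, and `correspondence_sets` is a hypothetical helper name. Real cognates need to be aligned by hand (or by a much smarter algorithm) first.

```python
from collections import Counter

# Invented, pre-aligned cognate sets for three toy languages (not real data).
# Each word is a list of segments; "∅" marks a position with no sound.
COGNATES = [
    {"A": ["k", "a", "t", "∅"], "B": ["k", "a", "t", "u"], "C": ["k", "a", "d", "u"]},
    {"A": ["p", "a", "l", "∅"], "B": ["p", "a", "l", "u"], "C": ["p", "a", "l", "u"]},
    {"A": ["m", "i", "t", "∅"], "B": ["m", "i", "t", "u"], "C": ["m", "i", "d", "u"]},
]


def correspondence_sets(cognates, languages):
    """Count how often each column of aligned sounds shows up."""
    counts = Counter()
    for cognate in cognates:
        # zip(...) walks the aligned words position by position.
        columns = zip(*(cognate[lang] for lang in languages))
        counts.update(columns)
    return counts


counts = correspondence_sets(COGNATES, ["A", "B", "C"])
for column, count in counts.most_common():
    print(column, count)
```

In this toy data, the column ("∅", "u", "u") shows up in every cognate set, which is the same shape of pattern as the final-vowel example above. The column ("t", "t", "d") is a second recurring correspondence, and each recurring column is a candidate for one reconstructed sound.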

Then, you try to figure out which sound change is the most likely to have happened, which kind of feels like a pretty hard puzzle. With enough data, you can reconstruct common vocabulary for the ancestor language without any logical contradictions.

Sometimes there are multiple possible reconstructions. Since they are all valid from the data alone, you need to rely on other tools, like how likely the sound changes are cross-linguistically.

Figuring out the culture of the speakers

Reconstructions from cognates can tell us a lot about how the speakers lived. For example, across the Indo-European languages, there are cognates for “wheel”, like “cyclos” (related to “cycle”) in Greek, “chakra” in Sanskrit, and the word “wheel” itself. The reconstructed word is *kʷékʷlos in Proto-Indo-European. While it’s possible that semantic drift happened, and the original word meant something else, it’s pretty unlikely, since that would imply that the exact same drift happened in every single Indo-European language, which is nearly impossible. Also, there are reconstructed words for many other transportation-related terms, like “axle”. So, the Proto-Indo-European speakers, at least at some point, probably had wheels.

One extra detail is that the Anatolian languages did NOT have a word for “wheel” that was related to the other Indo-European words for “wheel”. The Anatolian languages were discovered after Proto-Indo-European had already been reconstructed using non-Anatolian languages. Therefore, many people believe that the Anatolian languages (like Hittite or Luwian) were the earliest to split off from Proto-Indo-European, before Indo-European speakers had access to wheels. This is why some linguists prefer to call the language that gave rise to both Proto-Anatolian and the other Indo-European branches Proto-Indo-Anatolian, with Proto-Indo-European referring solely to the ancestor of the non-Anatolian Indo-European languages. However, it’s also possible that the ancestral speakers always had wheels, and Anatolian just obtained a different word for “wheel” from some other origin.

This works for mythology too. The word “Zeus” is the shortened form of “Zeus Pater”, meaning “Sky Father” in Greek. The Latin word “Jupiter”, representing the same god, also means the same and is etymologically related to it. Even the Hindu sky god Indra’s father’s name was “Dyaus Pitar”, again meaning the same thing. The Norse god “Tyr” is also derived from the same root for “sky”. So, the Proto-Indo-European speakers probably had a sky father god as well, whose reconstructed name was “*Dyēus Phtḗr”.

Representing Sound Changes

We can represent sound changes by writing the original and altered pronunciation of a sound, and also including the environment where the change takes place. This looks pretty similar to representing the allophonic variation of a phoneme.

For example, Proto-Dravidian *k could occur before unbacked (non-back) vowels. In Tamil, this *k sometimes changed to a c (pronounced as [s]) if the consonant after the unbacked vowel wasn’t retroflex.

We can represent this as the following rule:

*k -> c / _ V[-BACKED] C[-RETROFLEX]
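A rule in this notation maps quite naturally onto a regular expression. The sketch below is only an approximation under loud assumptions: it uses a simplified ASCII romanization where capital T, N, and L stand for retroflex consonants, it treats only i and e as unbacked vowels, it applies the rule every time (the real change was only sometimes applied), and the example words are toy forms rather than verified data.

```python
import re

# Assumed toy inventory (ASCII stand-ins, not a real transcription system).
UNBACKED_VOWELS = "ie"
NON_RETROFLEX = "kcptnmrlvy"  # retroflex T, N, L are deliberately left out

# k -> c / _ V[-BACKED] C[-RETROFLEX]
# The lookahead checks the environment without changing it.
RULE = re.compile(f"k(?=[{UNBACKED_VOWELS}][{NON_RETROFLEX}])")


def palatalize(word):
    """Turn k into c before an unbacked vowel plus a non-retroflex consonant."""
    return RULE.sub("c", word)


print(palatalize("kevi"))  # cevi  (v is not retroflex, so the rule applies)
print(palatalize("keTu"))  # keTu  (T is retroflex, so the rule is blocked)
print(palatalize("kalu"))  # kalu  (a is not an unbacked vowel here)
```

The part of the rule after the slash becomes the lookahead `(?=...)`, and the underscore in the notation corresponds to the position of the `k` being replaced.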

Order of sound changes

If you have multiple sound changes, the order in which they take place matters a lot. Here is a very simple hypothetical example.

Suppose that in Proto-X, the word for “cat” is *zop.

To get that word in the modern language X (zombe), we know there are three sound changes that happened:

  1. Final voiceless stops become voiced (zob).
  2. Final voiced stops have an “e” added after them (zobe).
  3. Vowels before voiced stops have a nasal added right after them (zombe).

Now, see what happens if we put 2, then 3, then 1:

Rule 2 does nothing, because the final consonant of “zop” isn’t voiced. Rule 3 does nothing either, because there is no vowel before a voiced stop. Finally, rule 1 voices the final consonant, so the word ends up as “zob” instead of “zombe”.

Thus, due to the nature of conditionals in sound changes, changing the order of sound changes can lead to different outcomes. In this sense, finding the right order is heavily reliant on logic.
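The Proto-X example can be checked with a few lines of Python. This is just a sketch: the function names are made up, the vowels and stops are limited to a small assumed inventory, and the choice of which nasal gets inserted (m before b, n before d, ŋ before g) is my own assumption, since the example only needs the m.

```python
import re

VOICED = {"p": "b", "t": "d", "k": "g"}
NASAL = {"b": "m", "d": "n", "g": "ŋ"}  # assumption: nasal matches the stop


def rule1(word):
    """Final voiceless stops become voiced."""
    return re.sub(r"[ptk]$", lambda m: VOICED[m.group()], word)


def rule2(word):
    """Final voiced stops have an "e" added after them."""
    return re.sub(r"([bdg])$", r"\1e", word)


def rule3(word):
    """Vowels before voiced stops have a nasal added right after them."""
    return re.sub(
        r"([aeiou])([bdg])",
        lambda m: m.group(1) + NASAL[m.group(2)] + m.group(2),
        word,
    )


def apply_in_order(word, rules):
    """Apply each sound change once, in the order given."""
    for rule in rules:
        word = rule(word)
    return word


print(apply_in_order("zop", [rule1, rule2, rule3]))  # zombe
print(apply_in_order("zop", [rule2, rule3, rule1]))  # zob
```

Each rule only fires if its condition is met at the moment it is applied, which is exactly why rules 2 and 3 do nothing when they run before rule 1.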

The order of sound changes doesn’t always matter, though. If the sound changes are independent of each other, so that they only build on each other and none of them interferes with another, any order gives the same result. For example, if you start with the sound *ɸ, you can have these seven sound changes, and all 5040 (7!) rearrangements of them will lead to the same result:

  1. fricatives become stops
  2. voiceless consonants become voiced
  3. unaspirated consonants become aspirated
  4. ungeminated consonants become geminated
  5. add an “a” after all consonants
  6. add a “u” before all consonants
  7. bilabial consonants become alveolar

There is a really easy NACLO problem exactly like this, except for a real language, called Rewrite me badd.
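If you don’t feel like trying all 5040 orders of the seven *ɸ changes by hand, here is a small Python sketch that does it. It models the word as a bundle of features, which is my own simplification (the field names and the `Word` type are hypothetical), and it applies each change exactly once.

```python
from itertools import permutations
from typing import NamedTuple


class Word(NamedTuple):
    before: str      # vowels added before the consonant
    manner: str      # "fricative" or "stop"
    voiced: bool
    aspirated: bool
    geminate: bool
    place: str       # "bilabial" or "alveolar"
    after: str       # vowels added after the consonant


# The starting sound *ɸ: a plain voiceless bilabial fricative.
START = Word("", "fricative", False, False, False, "bilabial", "")

RULES = [
    lambda w: w._replace(manner="stop") if w.manner == "fricative" else w,
    lambda w: w._replace(voiced=True),
    lambda w: w._replace(aspirated=True),
    lambda w: w._replace(geminate=True),
    lambda w: w._replace(after=w.after + "a"),
    lambda w: w._replace(before="u" + w.before),
    lambda w: w._replace(place="alveolar") if w.place == "bilabial" else w,
]


def apply_all(word, rules):
    """Apply each sound change once, in the order given."""
    for rule in rules:
        word = rule(word)
    return word


orders = list(permutations(RULES))
outcomes = {apply_all(START, order) for order in orders}
print(len(orders))    # 5040
print(len(outcomes))  # 1
```

Every rule touches a different feature and none of them checks a feature that another rule changes, so there is no way for one rule to enable or block another. That is what makes the order irrelevant here, unlike in the “zop” example.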

Proving Language Relationships

At first, we don’t know for sure if languages are related or not. One of the most reliable ways to prove that they are related is to try to reconstruct a proto-language that includes all of them, and see if it can be done thoroughly and without much of a stretch.

This isn’t always a simple yes or no. The Niger-Congo family is hypothesized to link the Atlantic-Congo and Mande language families. Although Proto-Niger-Congo has not been reconstructed apart from some vocabulary, the hypothesis is still fairly widely accepted.

Some people who weren’t quite linguists, such as Nicolaes Witsen and Philip Johan von Strahlenberg, believed that the Turkic, Mongolic, and Tungusic languages (Manchu, the language of the early part of the Qing Dynasty, is a Tungusic language) were all part of the same language family, a hypothetical Altaic family. The philologist Matthias Castrén even went so far as to add the Uralic languages as a separate branch too. The mainstream pro-Altaic view sometimes added Japonic and Koreanic languages (mainly Japanese and Korean respectively, but they have smaller languages too) to the proposed family as well, making it the Macro-Altaic hypothesis.

This theory was based mainly on similarities in grammar, like how they all had agglutinative morphology. They did have some shared vocabulary that appeared to be basic, like body parts. However, there was simply not enough data to reconstruct a proto-language. Therefore, the Altaic hypothesis was not proven. Some linguists argue that technically, all languages, even English and Chinese, could be related, but we just have no way of knowing since they have diverged so much.

Sprachbunds

In fact, there is a much more convincing explanation for the similarities between the Altaic languages, and this is that they form a sprachbund instead. A sprachbund is a group of languages that are not necessarily related but have influenced each other, in both loanwords and grammatical structure, due to their geographical proximity. Languages in a sprachbund typically get closer and closer over time, because of their continued contact, whereas languages in a family tend to diverge over time because they spread out. The Altaic languages showed signs of the former, because the early forms of the Mongolic, Turkic, and Tungusic languages differed more than they do now, and there is no evidence of them diverging at some point.

It is possible for sprachbunds to include related languages. For example, the Balkan sprachbund contains Greek, Bulgarian, Serbian, Albanian, Romanian, Romani, and some dialects of Turkish. Apart from Turkish (which is, and you’ll never guess this, a Turkic language), the others are all Indo-European. However, the similarities between them (postposed articles, evidentiality, shared cases, loanwords/calques, and a lot more) cannot be explained by the fact that they are Indo-European. That would mean the Romance, Slavic, Hellenic, Albanian, and even Indo-Iranian languages all share a pretty close common ancestor, which contradicts how ancient the centum-satem split is, since the Balkan languages fall on both sides of it.

Conservative Languages

Some people might be surprised that Sanskrit is more mutually intelligible with Lithuanian than it is with Hindi, even though Hindi is a direct descendant of Sanskrit, while Lithuanian isn’t even Indo-Iranian.

For example, look at the sentence “your gift is the fire”.

Sanskrit: तव दानं अग्निः अस्ति (tava dānaṃ agniḥ asti)

Lithuanian: tavo dovana ugnis

Hindi: तुम्हारा उपहार आग है (tumhārā upahār āg hai)

As you can see, the Sanskrit and Lithuanian sentences both have a similar structure, and the vocabulary appears more similar. While Hindi has cognates for “your” and “fire”, they are less recognizable than the Lithuanian ones. Also, the copula (the word for “is”) is mandatory in Hindi, because it is more analytic. Meanwhile, since Sanskrit and Lithuanian are more synthetic, grammatical information is built into the words themselves, so the copula (asti and yra, respectively) is not needed and is used more for emphasis. Lithuanian also happens to default to SVO word order while Sanskrit and Hindi default to SOV, but Sanskrit and Lithuanian both have flexible word order.

So why is Lithuanian so similar to Sanskrit? It’s because Lithuanian is pretty conservative, or resistant to change.

Sanskrit, being an ancient language, is relatively close to Proto-Indo-European, because it hasn’t had that much time to evolve compared to modern languages. While it was alive, it still was conservative, because it was heavily standardized.

Lithuanian, on the other hand, is still spoken natively to this day. However, the Baltic countries were relatively isolated from the rest of Europe and all the conquests that happened that caused language change, so their languages (excluding Estonian, which is Uralic) have preserved many features from Proto-Indo-European, and innovated new features relatively slowly. Because of this, Lithuanian and Sanskrit are both similar to Proto-Indo-European, and therefore similar to each other.

Tamil is another example of a conservative language. It is relatively close to Proto-Dravidian, compared to Telugu and Kannada, which have innovated more. This is why ancient Tamil literature is still understandable to modern Tamil speakers. However, Tamil has huge diglossia between the formal variety learned in school and spoken on the news, and the colloquial variety spoken on the streets and at home. The formal variety is the one that is more conservative. I only understand the colloquial variety, which is why ancient Tamil literature sounds like an alien language to me. I’m going to write a future blog post about languages and dialects, but this is a small taste of that.

Conclusion

Now, we know how proto-languages are reconstructed, and how to disprove language family claims. However, some linguists didn’t get the memo that theories about language families must be realistic, so that’s what part 3 is gonna be all about.