The pair score is then calculated by comparing how many pairs in the gold standard are identically clustered by the algorithm, and vice versa. This is simple and straightforward, since, for pairs, there are only two possible decisions, namely whether they are cognate or not. We can then simply count how many pairs in the gold standard are also judged to be cognate by the algorithm, or how many pairs proposed to be cognate by the algorithm are also cognate according to the gold standard.
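In code, this pair-based score might be computed as follows. This is a minimal sketch of the idea, not the evaluation code used in the study, and the four-word example is hypothetical.

```python
from itertools import combinations

def cognate_pairs(clustering):
    """All unordered word pairs that share a cognate set in a clustering."""
    return {(w1, w2)
            for w1, w2 in combinations(sorted(clustering), 2)
            if clustering[w1] == clustering[w2]}

def pair_scores(gold, test):
    """Pairwise precision and recall for two clusterings (word -> cluster id)."""
    gold_pairs, test_pairs = cognate_pairs(gold), cognate_pairs(test)
    true_positives = gold_pairs & test_pairs
    precision = len(true_positives) / len(test_pairs) if test_pairs else 1.0
    recall = len(true_positives) / len(gold_pairs) if gold_pairs else 1.0
    return precision, recall

# Hypothetical 'hand' words: the test partition wrongly merges the Greek word.
gold = {"German": 1, "English": 1, "Dutch": 1, "Greek": 2}
test = {"German": 1, "English": 1, "Dutch": 1, "Greek": 1}
```

Here the test partition proposes six cognate pairs of which only three are in the gold standard, so pairwise precision is 0.5 while recall is 1.0.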
The advantage of this score is that we can directly convert it into an intuitive notion of false positives and false negatives versus true positives and true negatives. Breaking down the comparison of two clusterings into pairs is, however, problematic, since it is strongly biased in favor of datasets containing large numbers of non-cognate words [19]. In order to avoid these problems, we used B-Cubed scores as our primary evaluation method [37, 60, 61]. For the calculation of B-Cubed scores, we need to determine, for each word, the intersection of its cognate set in the gold standard with its cognate set as proposed by the algorithm, as well as the sizes of the respective cognate sets.
This is illustrated in Table 5 for a fictive test analysis of the five words in Fig 1, which wrongly clusters the Greek word with the English and the German word. For the B-Cubed precision we then average, over all n words in our sample, the size of the intersection divided by the size of the cognate set proposed by the algorithm:

    precision = (1/n) * Σ_i |gold_i ∩ test_i| / |test_i|

For the B-Cubed recall we average the intersection size divided by the size of the cognate set in the gold standard:

    recall = (1/n) * Σ_i |gold_i ∩ test_i| / |gold_i|

Table 5: Cognate clusters, cluster size and cluster intersection for a fictive test analysis of the five words from Fig 1, compared to a gold standard.
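The word-wise intersection logic can be sketched as follows. This is a minimal illustration rather than the B-Cubed implementation used in the study, and the five-word configuration below is a hypothetical stand-in for the one in Table 5.

```python
def bcubed(gold, test):
    """B-Cubed precision and recall for two clusterings (word -> cluster id)."""
    precision_sum = recall_sum = 0.0
    for word in gold:
        # Cognate set of this word in the gold standard and in the test analysis.
        gold_set = {w for w in gold if gold[w] == gold[word]}
        test_set = {w for w in test if test[w] == test[word]}
        overlap = len(gold_set & test_set)
        precision_sum += overlap / len(test_set)
        recall_sum += overlap / len(gold_set)
    return precision_sum / len(gold), recall_sum / len(gold)

# Hypothetical five-word configuration in the spirit of Table 5: the test
# analysis wrongly clusters the Greek word with the English and German words.
gold_clusters = {"German": "A", "English": "A", "Greek": "B",
                 "Russian": "C", "Polish": "C"}
test_clusters = {"German": "A", "English": "A", "Greek": "A",
                 "Russian": "C", "Polish": "C"}
```

For this configuration the wrongly merged Greek word lowers the B-Cubed precision (to 11/15) while the recall remains perfect, since no gold-standard cognates are missed.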
Apart from the Turchin method, all analyses require a threshold ranging between 0 and 1, denoting the amount of similarity needed to judge two items as cognate. In order to find the most suitable threshold for each of the four remaining methods, we used the expert cognate decisions in our training set and ran the analyses on these data with varying thresholds starting from 0. Fig 2 shows box-plots of the training analyses for the four methods, depending on the threshold.
As can be seen from this figure, all methods show a definite peak where they yield the best results for all datasets. In order to select the best threshold for each of the four methods, we chose the threshold which showed the best average B-Cubed F-Score. For the Edit Distance method, the threshold was thus set to 0.
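The selection procedure amounts to a simple grid search over candidate thresholds. The sketch below is schematic, not the actual training setup; `toy_evaluate` and its peak at 0.6 are invented stand-ins for running one of the methods on a training set and scoring it against the expert decisions.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def best_threshold(evaluate, training_sets, thresholds):
    """Return the threshold with the best F-score averaged over training sets.

    `evaluate(dataset, threshold)` is a stand-in for one analysis run; it
    must return a (precision, recall) pair.
    """
    def mean_f(t):
        scores = [f_score(*evaluate(d, t)) for d in training_sets]
        return sum(scores) / len(scores)
    return max(thresholds, key=mean_f)

# Invented stand-in whose scores peak around a threshold of 0.6:
# precision falls as the threshold grows, recall peaks at 0.6.
def toy_evaluate(dataset, t):
    return 1.0 - t / 2, 1.0 - abs(0.6 - t)

thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
```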
The B-Cubed scores for these analyses are given in Table 6.
Of the two worst-performing methods, the Turchin method performs worst in terms of F-Scores, but shows a much higher precision than the Edit-Distance method.

Fig 2: The y-axis shows the B-Cubed F-Scores averaged over all training sets, and the x-axis shows the threshold for the methods we tested. Infomap shows the best results on average, Edit Distance performs worst. Dots in the plots indicate the mean for each sample, with triangular symbols indicating the peak.
We analyzed the datasets with each of the five methods described above, using the individual thresholds for each method, setting the number of permutations to 10,000, and using the default parameters in LingPy. For each analysis, we further calculated the B-Cubed scores to evaluate the performance of each method on each dataset. Table 7 shows the averaged results of our experiments.
While the LexStat method shows the highest precision, the Infomap method shows the highest recall and also the best general performance. The results are generally consistent with those reported by List [ 19 ] for the performance of Turchin, Edit Distance, SCA, and LexStat: The Turchin method is very conservative with a low amount of false positives as reflected by the high precision, but a very large amount of undetected cognate relations as reflected by the low recall.
The Edit Distance method shows a much higher cognate detection rate, but at the cost of a high rate of false positives. The SCA method outperforms the Edit Distance method, showing that refined distance scores can make a real difference in automatic cognate detection. However, as the performance of LexStat and Infomap shows, language-specific approaches for cognate detection clearly outperform language-independent ones.
The reason for this can be found in the specific similarity measure that is employed by the methods: the better performing methods are not based on surface similarities, but on similarities derived from previously inferred probability scores for sound correspondences. These methods are therefore much closer to the traditional comparative method than methods which employ simple surface similarities between sounds. Our experiment with the Infomap algorithm shows that a shift from simple agglomerative clustering approaches to a network perspective may further strengthen the results.
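The network perspective can be sketched in a few lines: words become nodes, and edges connect word pairs whose distance falls below the threshold. The sketch below is a simplified illustration with invented helper names; since the Infomap algorithm itself requires an external implementation (such as the one in the igraph library), plain connected components stand in here for the community detection step.

```python
def similarity_network(words, distance, threshold):
    """Build an adjacency map linking word pairs whose distance is small."""
    graph = {w: set() for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if distance(w1, w2) <= threshold:
                graph[w1].add(w2)
                graph[w2].add(w1)
    return graph

def components(graph):
    """Connected components of the network; a simple stand-in for the
    partition that a community detection method like Infomap would return."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, cluster = [start], set()
        while stack:
            node = stack.pop()
            if node not in cluster:
                cluster.add(node)
                stack.extend(graph[node] - cluster)
        seen |= cluster
        clusters.append(cluster)
    return clusters

# Invented toy distance: words count as similar if they share their first sound.
def toy_distance(a, b):
    return 0.0 if a[0] == b[0] else 1.0
```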
Similarity networks have been successfully employed in evolutionary biology for some time and should become a fruitful topic of research in computational historical linguistics as well. There are interesting differences in method performance across the language datasets, with marked variation in cognate identification accuracy between different languages.
Fig 3 shows the performance of the methods on the individual test sets, indicating which method performed best and which method performed worst. These results confirm the high accuracy of the LexStat method and the even better accuracy of the Infomap approach.
All methods apart from the Turchin method perform the worst on the Chinese data. Since compounding is very frequent in Chinese, it is difficult to clearly decide which words to assign to the same cognate set. Often, words show some overlap of cognate material without being entirely cognate. This is illustrated in Fig 4 , where cognates and partial cognates for Germanic and Sinitic languages are compared. We followed a strict procedure by which only words in which all morphemes are cognate are labelled as cognate [ 62 ], rather than loosely placing all words sharing a single cognate morpheme in the same cognate set [ 63 ].
Since none of the algorithms we tested is specifically sensitive to partial cognate relations (for a recent proposal addressing this task, see [53]), they all show a very low precision here, because they tend to classify only partially related words as fully cognate.

Fig 3: The figure shows the individual results of all algorithms based on B-Cubed F-Scores for each of the datasets. Results marked by a red triangle point to the worst result for a given subset, and results marked by a yellow star point to the best result.
Apart from Uralic, our new Infomap approach always performs best, while the Turchin approach performs worst in four out of six tests. The Turchin method has three extreme outliers in which it lags far behind the other methods: Chinese, Bahnaric and Romance. There are two major reasons for this. First, the Turchin method only compares the first two consonants and will be seriously affected by the problem of partial cognates discussed above.

Fig 4: The different cognate relations among the morphemes in the Chinese words make it impossible to give a binary assessment regarding the cognacy of the four words.
These partial cognates are especially prevalent in Chinese and Bahnaric, where compounding is an important linguistic process. Second, a specific weakness of the Turchin method is its lack of alignment: words are not exhaustively compared for structural similarities but are simply matched on their first two consonant classes. When there is substantial sound change, as is evident in both Bahnaric and some branches of Romance, this can lead to an increased number of false negatives.
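The consonant-class comparison just described can be sketched as follows. This is an illustrative reconstruction, not the original implementation, and the sound-class map below is a small invented fragment rather than the full class system used by the Turchin method.

```python
# Illustrative fragment of a sound-class map; the real method uses the
# full ten-class inventory of Dolgopolsky-style consonant classes.
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P",
    "t": "T", "d": "T",
    "k": "K", "g": "K", "x": "K", "h": "K",
    "s": "S", "z": "S",
    "m": "M", "n": "N",
    "r": "R", "l": "R",
}

def first_two_classes(word):
    """Map a word (a sequence of sounds) to its first two consonant classes."""
    classes = [SOUND_CLASSES[s] for s in word if s in SOUND_CLASSES]
    return tuple(classes[:2])

def ccm_cognate(word1, word2):
    """Judge two words cognate iff their first two consonant classes match."""
    return first_two_classes(word1) == first_two_classes(word2)
```

For example, German Hand [hant] and English hand [hænd] both map to ("K", "N") and are judged cognate, while Greek [xeri] maps to ("K", "R") and is not, regardless of any deeper correspondence.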
Since the Turchin method only distinguishes 10 different sound classes and only compares the first two consonant classes in each word in the data, it is very likely to miss obvious cognates. The main problem here is that the method does not allow for any transition probabilities between sound classes, but treats them as discrete units. As a result, it is likely that the Turchin method often misses valid cognate relations which are easily picked up by the other methods. This shortcoming of the Turchin approach is illustrated in Fig 5 , where the amount of true positives and negatives is contrasted with the amount of false positives and negatives in each dataset and for each of the five methods.
This figure indicates that the Turchin method shows exceptionally high amounts of false negatives in Bahnaric and Romance. The clear advantage of the Turchin method is its speed, as it can be computed in linear time. Its clear disadvantage is its simplicity which may under certain circumstances lead to a high amount of false negatives. The Edit-Distance method also performs very poorly.
While, on average, it performs better than the Turchin approach, it performs considerably worse on the Chinese and Huon test sets. The reason for this poor performance is a high number of false positives, as shown in Fig 5. While the Turchin method suffers from not finding valid cognates, the Edit-Distance method suffers from the opposite problem, identifying large numbers of false cognates. Since false positives are more deleterious for language comparison, as they may lead to false conclusions about genetic relationship [15], the Edit-Distance method should be used with great care.
Given that the SCA method performs better while being similarly fast, there is no particular need to use the Edit-Distance method at all. In Fig 6 , we further illustrate the difference between the worst and the best approaches in our study by comparing false positives and false negatives in Turchin and Infomap across all language pairs in the Chinese data. As can be seen from Fig 5 , the Turchin approach has about as many false positives as false negatives. The Infomap approach shows slightly more false positives than false negatives.
This general picture, however, changes when looking at the detailed data plotted in Fig 6. Here, we can see that false positives in the Turchin approach occur in almost all dialect pairings, while most cognates are missed in the mainland dialects (bottom of the y-axis). Infomap, on the other hand, shows drastically fewer false positives and false negatives; but while false negatives are mostly observed in the Northern dialects (bottom of the y-axis), false positives center around the highly diverse Southern dialects (top of the y-axis). This reflects the internal diversity of Northern and Southern Chinese dialects, and the challenges it poses for automatic cognate detection.
While word compounding is very frequent in the North of China, where almost all words are bisyllabic and bimorphemic, the Southern dialects often preserve monosyllabic words. And while Northern dialects are rather homogeneous, showing similar sound systems and rather large consonant inventories, Southern dialects have undergone many consonant mergers in their development and are highly diverse. The single threshold for cognate detection overestimates similarities among the Southern dialects (upper triangle, left quarter), while it underestimates similarities among Northern dialects compared to Southern dialects (lower triangle, left quarter).
A further contributing factor is the limited size of the word lists in our sample, which makes it difficult for the language-specific algorithms to acquire enough deep signal.

Fig 6: The figure compares the amount of false positives and false negatives, as measured in pairwise scores, for the Turchin method and our Infomap approach for all pairs of language varieties in the Chinese test set. The upper triangle of the heatmaps shows the amount of false positives, while the lower triangle shows the amount of false negatives.
In this study we have applied four published methods and one new method for automated cognate detection to six different test sets from five different language families. By tuning the methods on an already published dataset of similar size, we identified the thresholds that yield the highest accuracy in detecting truly related words for four out of the five methods; the Turchin method requires no threshold.
Using these thresholds, we tested the methods on our new gold standard and found that most methods identified cognates with considerable accuracy. Our new method, which builds on the LexStat method but employs the Infomap algorithm for community detection to partition words into cognate sets, outperforms all other methods in almost all regards, followed closely by the LexStat approach. Given that both the LexStat method and our Infomap approach are based on language-specific comparison, searching for similar patterns in individual language pairs, our results confirm the superiority of cognate detection approaches which are closer to the theoretical foundation of the classical comparative method in historical linguistics.
The Consonant Class Matching method by Turchin et al. and the Edit-Distance approach perform worst: while the major drawback of the Turchin approach is a rather large number of false negatives, the Edit-Distance approach shows the highest number of false positives in our test. The method of choice may well depend on the task to which cognate detection is to be applied. If the task is simply to identify some potential cognates for future inspection and annotation, then a fast algorithm like the one by Turchin et al. may be a good choice. This practice, which is already applied by some scholars [64], is further justified by the method's rather small number of false positives.
While the use of the Turchin method may be justified in computer-assisted workflows, the use of the Edit-Distance approach should be discouraged, since it lacks the speed advantages and is very prone to false positives. When searching for deeper signals in larger datasets, however, we recommend using the more advanced methods, like SCA, LexStat or our new Infomap approach. LexStat and Infomap have the great advantage of taking regular sound correspondences into account. As a result, these methods tend to reject chance resemblances and borrowings. Their drawback is the number of words needed to carry out the analysis.
As we know from earlier tests [65], language-specific methods require a minimum number of words even for moderately closely related languages; when applied to datasets with higher diversity among the languages, the number of words should be even higher. Thus, when searching for cognates in very short word lists, we recommend using the SCA method to achieve the greatest accuracy. However, as demonstrated by the poorer performance of all methods on the Chinese data, where compounding has played a major role in word formation, family-specific properties of the languages under comparison also need to be taken into consideration.
Our results show that the performance of computer-assisted automatic cognate detection methods has advanced substantially, both with respect to the applicability of the methods and the accuracy of the results. Moreover, the fact that the simple change we made from agglomerative to network-based clustering could further increase the accuracy of the results shows that we have still not exhausted the full potential of cognate detection methods. Essential tasks for the future include (a) work on parameter-free methods which do not require user-defined thresholds and which state their results as probabilities rather than as binary decisions, (b) the further development of methods for partial cognate detection [53], (c) approaches that search for cognates not only in the same meaning slot but across different meanings [66], and (d) approaches that integrate expert annotations to allow for a truly iterative workflow for computer-assisted language comparison.
A key problem to solve is the performance of these methods on larger datasets that trace language relationships to a greater depth. Most of our test cases in this paper are shallow families or subgroups of larger families. Deeper relationships between languages spoken in more complicated language situations are where the real challenge lies. Currently, automatic cognate detection algorithms can accurately detect a substantial proportion of the cognates in a lexical dataset.
Tools like LingPy are already at a stage where they can act as a computer-assisted framework for language comparison. These tools therefore provide a powerful way of supplementing the historical linguistics toolkit by enabling linguists to rapidly identify the cognate sets which can then be checked, corrected, and augmented as necessary by experts. In regions where there has been an absence of detailed historical comparative work, these automated cognate assignments can provide a way to pre-process linguistic data from less well studied languages and speed up the process by which experts apply the comparative method.
Additionally, these tools can be employed for exploratory data analysis of larger datasets, or to arrive at preliminary classifications for language families which have not yet been studied with the help of the classical methods.

Acknowledgments: We thank the anonymous reviewers for helpful advice. We also thank Saenko (Romance) for sharing their data with us by either exchanging them directly or making them accessible online.

PLoS One.
Published online Jan. Simon J. Greenhill and Russell D. Gray. Robert C. Berwick, Editor. Competing Interests: The authors have declared that no competing interests exist. Data curation: JML. Project administration: RDG. Software: JML. Writing, original draft: JML. Received Oct 18; Accepted Dec. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: The Supplementary Material contains additional results, as well as data and code to replicate the analyses.

Abstract: The amount of data from languages spoken all over the world is rapidly increasing.

Introduction: Historical linguistics is currently facing a dramatic increase in digitally available datasets [1-5].

Materials: There are few datasets available for testing the potential of cognate detection methods on language data; as such, testing algorithms run the risk of over-fitting.
Table 1: Test data used in our study.

Table 2: Training data used in our study.

Methods: Many methods for automatic cognate detection have been proposed in the past (see Table 3 below).

Table 3: Recent approaches to cognate detection.

Cognate detection using the Edit Distance approach: A second method provided by LingPy, the Edit Distance approach, takes the normalized Levenshtein distance [46] between all word pairs in the same meaning slot and clusters these words into potential cognate sets using a flat version of the UPGMA algorithm [47], which terminates once a certain threshold of average distances between all words is reached.
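The Edit Distance approach described above can be sketched as follows. This is a rough, self-contained illustration rather than LingPy's implementation: `flat_cluster` is a simplified average-linkage stand-in for the flat UPGMA variant, and the example words are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

def norm_distance(a, b):
    """Levenshtein distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))

def flat_cluster(words, threshold):
    """Average-linkage agglomerative clustering that stops once the closest
    pair of clusters exceeds the threshold (a rough stand-in for flat UPGMA)."""
    clusters = [[w] for w in words]

    def avg(c1, c2):
        return sum(norm_distance(a, b)
                   for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: avg(clusters[ij[0]], clusters[ij[1]]))
        if avg(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters
```

With a threshold of 0.5, the hypothetical forms "hand" and "hant" (normalized distance 0.25) merge into one cluster, while the dissimilar "kheri" stays apart.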
Fig 1: Workflows for automatic cognate detection.

Cognate detection using the LexStat method: The last publicly available method we tested, the LexStat method, is again based on flat UPGMA clustering, but in contrast to both the Edit-Distance method and the SCA method, it uses language-specific scoring schemes which are derived from a Monte-Carlo permutation of the data [19].

Differences between algorithms: In order to illustrate the differences between these four algorithms, we analysed the test set by Kessler [49].

Table 4: Cognate detection algorithms in LingPy.
Similarity networks: All the above cognate detection methods currently use a rather simple flat clustering procedure.

Table 5: Preliminaries for B-Cubed score calculation.

Fig 2: Determining the best thresholds for the methods.

Table 6: Results of the training analysis to identify the best thresholds. Bold numbers indicate best values.
Table 7: General results of the test analysis.
Fig 3: Individual test results (B-Cubed F-Scores).

Fig 4: Partial and non-partial cognate relations.

Fig 5: Distribution of true and false positives and true and false negatives.

Fig 6: Comparing false positives and false negatives in the Chinese data.
References
- Evolutionary Bioinformatics.
- Dunn M. Indo-European lexical cognacy database (IELex). Nijmegen: Max Planck Institute for Psycholinguistics.
- Greenhill SJ. Proc Biol Sci.
- Bowern C. Chirila: Contemporary and historical resources for the indigenous languages of Australia. Language Documentation and Conservation.
- Fox A. Linguistic reconstruction. Oxford: Oxford University Press.
- Language classification by numbers.
- Embleton S. Time depth in historical linguistics.
- Holm HJ. Journal of Quantitative Linguistics.
- Explorations in automated lexicostatistics. Folia Linguistica.
- Historical linguistics as a sequence optimization problem: the evolution and biogeography of Uto-Aztecan languages.
- Support for linguistic macrofamilies from weighted alignment.
- Campbell L. Current Anthropology.
- Levenshtein distances fail to identify language relationships accurately. Computational Linguistics.
- Sidwell P.
- Trask RL. The dictionary of historical and comparative linguistics. Edinburgh: Edinburgh University Press.
- Ross MD, Durie M. Introduction. In: Durie M, editor. The comparative method reviewed.
New York: Oxford University Press.
- List JM. Sequence comparison in historical linguistics.
- Austroasiatic dataset for phylogenetic analysis.