Introducing SearchPioneer.Lingua
Russ Cam
12 August 2024

A fast and accurate language detection library for .NET
We are thrilled to announce SearchPioneer.Lingua, a natural language detection library for .NET. This is a .NET port of the wonderful JVM and Rust implementations of Lingua created by Peter M. Stahl.
We really wanted to name the package Lingua to align with the other implementations, but unfortunately it's already taken by an unlisted and no longer maintained package 🙁.
We'll refer to SearchPioneer.Lingua as simply Lingua throughout this article.
The task of Lingua is simple: it tells you which language some provided text is written in. This is very useful as a preprocessing step in a variety of applications that deal with natural language, including:
- Text Classification
- Spell Checking
- Search / Information Retrieval
Language detection is often done as part of large machine learning frameworks or natural language processing (NLP) applications. There are some cases, however, where you may not want to use such large systems; one such case is the query flow of a high-traffic search system, where the overall latency of each request is paramount and every additional millisecond counts.
Lingua supports 79 languages, with a focus on quality over quantity; the full list of supported languages can be found in the repository.
Getting started
Lingua doesn't need much configuration to use and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.
Getting started with SearchPioneer.Lingua is as simple as installing the NuGet package:

```sh
dotnet add package SearchPioneer.Lingua
```

Then build a detector with the desired configuration and use it:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish) // detect only these languages
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(English, detectedLanguage);
```
Minimum relative distance
By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language; the word prologue, for instance, is both a valid English and a valid French word. Lingua would output either English or French, which might be wrong in the given context. For cases like this, it is possible to specify a minimum relative distance that the logarithmized and summed probabilities for each possible language have to satisfy. This can be configured as follows:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .WithMinimumRelativeDistance(0.9)
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(Unknown, detectedLanguage);
```
Be aware that the distance between the language probabilities depends on the length of the input text: the longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high, otherwise Unknown will be returned most of the time, as in the example above. Unknown is the return value for cases where language detection is not reliably possible.
Confidence values
Sometimes you want to know the confidence values for each of the configured languages. Referring back to our search system example, we may have search indexes and fields for each language that the search system supports, and want to query only those indexes for which the detected confidence of the query text is above a given threshold. The library supports this operation too:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .Build();

var detectedLanguages = detector.ComputeLanguageConfidenceValues("languages are awesome")
    .Select(kv => (kv.Key, Math.Round(kv.Value, 2)))
    .ToArray();

Assert.Equal(new[]
{
    (English, 0.93),
    (French, 0.04),
    (German, 0.02),
    (Spanish, 0.01)
}, detectedLanguages);
```
The return value is an ordered dictionary of languages, sorted by confidence value descending. In the example above, the dictionary is projected into an array of tuples of language and rounded confidence value. Each confidence value is a probability between 0.0 and 1.0, and the probabilities of all languages sum to 1.0. If a language is unambiguously identified by the rule engine, it will always receive a value of 1.0 and the other languages will receive a value of 0.0.
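Building on the search system example, here is a sketch of routing a query only to the per-language indexes whose confidence exceeds a threshold. The threshold value and the index naming scheme are hypothetical, and the actual search call is stubbed out with a console write:

```csharp
using System;
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .Build();

const double threshold = 0.2; // illustrative value, tune for your own data
var queryText = "languages are awesome";

foreach (var kv in detector.ComputeLanguageConfidenceValues(queryText))
{
    // The results are ordered by confidence descending, so we can stop at the
    // first value below the threshold.
    if (kv.Value < threshold)
        break;

    // Hypothetical routing to a per-language index, e.g. "products-English".
    Console.WriteLine($"Querying index products-{kv.Key} for \"{queryText}\"");
}
```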
Eager loading versus lazy loading
By default, Lingua lazy-loads only those language models which are considered relevant by the rule-based filter engine. Sometimes it is more beneficial to preload all language models into memory on application startup. This can be achieved as follows:
```csharp
using Lingua;

var detector = LanguageDetectorBuilder
    .FromAllLanguages()
    .WithPreloadedLanguageModels()
    .Build();
```
The language models loaded into memory are shared across all LanguageDetector instances in a thread-safe manner.
Low accuracy mode versus high accuracy mode
Lingua's high detection accuracy can come with a latency and memory trade-off; the large language models can consume sizeable amounts of memory. These requirements might not be feasible for systems running low on resources. If you want to classify mostly long texts or need to save resources, you can enable a low accuracy mode that loads only a small subset of the language models into memory:
```csharp
using Lingua;

var detector = LanguageDetectorBuilder
    .FromAllLanguages()
    .WithLowAccuracyMode()
    .Build();
```
The downside of low accuracy mode is that detection accuracy for short texts of less than 120 characters drops significantly. However, detection accuracy for texts longer than 120 characters remains mostly unaffected.
An alternative for a smaller memory footprint and faster performance is to reduce the set of languages when building the language detector. In most cases, it is not advisable to build the detector from all supported languages; when you have knowledge about the texts you want to classify, you can almost always rule out certain languages as impossible or unlikely to occur.
Accuracy
The source repository provides the ability to test the accuracy of Lingua or any other language detection library in .NET, using 1000 single words, 1000 word pairs and 1000 sentences extracted from the same dataset used for training, namely the Wortschatz corpora provided by Leipzig University.
Accuracy reports 🎯 are available in the repository. The following table summarizes the mean accuracy of Lingua in high accuracy mode, along with two other popular language detection libraries, LanguageDetection and NTextCat. The mean of means (Average Accuracy) is also included in each case.
Each language detection library is configured and tested with all the languages it supports:
- Lingua supports 79 languages
- LanguageDetection supports 49 languages
- NTextCat supports 15 languages
The number of languages configured should be taken into consideration when perusing the following table; language detection with more configured languages has a higher likelihood of being wrong when results are ambiguous, which is itself more likely with shorter input text. As an example, NTextCat scores 100% on Russian for 1000 single words, whilst Lingua scores 76.5%. Lingua erroneously classifies some Russian single words in high accuracy mode as:
- Ukrainian: 6.30%
- Bulgarian: 5.50%
- Serbian: 3.40%
- Belarusian: 3.30%
- Macedonian: 3.00%
- Mongolian: 1.10%
- Kazakh: 0.90%
i.e. a mixture of other Slavic languages and languages that use Cyrillic script. Importantly, NTextCat doesn't support any of these arguably similar languages, so it has a lower likelihood of being wrong than Lingua.
All values in the following table are accuracy percentages; empty cells indicate languages not supported by that library.

Language | Lingua Average High Accuracy | Lingua Single Words High Accuracy | Lingua Word Pairs High Accuracy | Lingua Sentences High Accuracy | LanguageDetection Average Accuracy | LanguageDetection Single Words Accuracy | LanguageDetection Word Pairs Accuracy | LanguageDetection Sentences Accuracy | NTextCat Average Accuracy | NTextCat Single Words Accuracy | NTextCat Word Pairs Accuracy | NTextCat Sentences Accuracy
---|---|---|---|---|---|---|---|---|---|---|---|---
Afrikaans | 78.63 | 58.3 | 80.8 | 96.8 | 0 | 0 | 0 | 0 | ||||
Albanian | 87.6 | 68.5 | 94.8 | 99.5 | 78.8 | 53.2 | 83.8 | 99.4 | ||||
Amharic | 97.63 | 97.3 | 99.7 | 95.9 | ||||||||
Arabic | 98.5 | 96.4 | 99.2 | 99.9 | 97.37 | 93.7 | 98.5 | 99.9 | ||||
Armenian | 100 | 100 | 100 | 100 | ||||||||
Azerbaijani | 89.5 | 77.2 | 92.3 | 99 | ||||||||
Basque | 83.57 | 70.9 | 87.3 | 92.5 | ||||||||
Belarusian | 96.87 | 91.5 | 99.2 | 99.9 | ||||||||
Bengali | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||||
Bokmal | 57.6 | 38.7 | 58.7 | 75.4 | 65.3 | 39.5 | 64.7 | 91.7 | 60.5 | 38.3 | 57.3 | 85.9 |
Bosnian | 34.53 | 29 | 34.6 | 40 | ||||||||
Bulgarian | 86.77 | 70.2 | 91.2 | 98.9 | 71.47 | 50.1 | 67.9 | 96.4 | ||||
Catalan | 70.17 | 50.6 | 73.9 | 86 | ||||||||
Chinese | 100 | 100 | 100 | 100 | 72.67 | 27.5 | 90.5 | 100 | 60.86 | 16.6 | 67.5 | 98.49 |
Croatian | 72.4 | 53.4 | 74.3 | 89.5 | 73.03 | 50.4 | 71.2 | 97.5 | ||||
Czech | 80.23 | 65.5 | 84.5 | 90.7 | 70.7 | 51.5 | 72.1 | 88.5 | ||||
Danish | 80.97 | 61.2 | 83.9 | 97.8 | 71 | 50.6 | 68.3 | 94.1 | 74.2 | 56.5 | 70.6 | 95.5 |
Dutch | 77.27 | 54.9 | 80.7 | 96.2 | 58.2 | 27.3 | 49.9 | 97.4 | 72.53 | 50.1 | 69.9 | 97.6 |
English | 80.77 | 54.6 | 88.6 | 99.1 | 61 | 23.4 | 60.4 | 99.2 | 76.97 | 54.2 | 76.7 | 100 |
Esperanto | 83.5 | 67.1 | 85.1 | 98.3 | ||||||||
Estonian | 91.6 | 79.1 | 95.9 | 99.8 | 82.77 | 61.7 | 87.1 | 99.5 | ||||
Finnish | 95.8 | 89.6 | 97.8 | 100 | 93.03 | 83.8 | 95.4 | 99.9 | ||||
French | 89.17 | 74.2 | 94.5 | 98.8 | 75.8 | 49.2 | 79.2 | 99 | 86.37 | 70.6 | 89.6 | 98.9 |
Ganda | 91.2 | 78.6 | 95 | 100 | ||||||||
Georgian | 99.97 | 100 | 100 | 99.9 | ||||||||
German | 89.2 | 73.8 | 94.1 | 99.7 | 73.13 | 49 | 70.7 | 99.7 | 81.83 | 62.9 | 82.7 | 99.9 |
Greek | 99.97 | 100 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Gujarati | 99.93 | 99.9 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Hebrew | 99.87 | 100 | 100 | 99.6 | 99.93 | 100 | 100 | 99.8 | ||||
Hindi | 72.6 | 60.7 | 64.3 | 92.8 | 77.3 | 60 | 72.8 | 99.1 | ||||
Hungarian | 94.8 | 86.5 | 97.9 | 100 | 88.57 | 74.6 | 91.4 | 99.7 | ||||
Icelandic | 93.07 | 82.7 | 96.7 | 99.8 | ||||||||
Indonesian | 60.3 | 39.2 | 60.9 | 80.8 | 80.27 | 56.1 | 84.8 | 99.9 | ||||
Irish | 90.7 | 81.8 | 94.3 | 96 | ||||||||
Italian | 86.8 | 68.9 | 91.8 | 99.7 | 77.67 | 52 | 81.7 | 99.3 | 88.03 | 74.5 | 90.3 | 99.3 |
Japanese | 100 | 100 | 100 | 100 | 25.65 | 1.27 | 4.8 | 70.87 | 95.08 | 85.35 | 99.9 | 100 |
Kazakh | 91.83 | 79.9 | 96.5 | 99.1 | ||||||||
Korean | 99.97 | 100 | 100 | 99.9 | 99.87 | 100 | 100 | 99.6 | 99.87 | 100 | 100 | 99.6 |
Latin | 86.97 | 71.8 | 92.5 | 96.6 | ||||||||
Latvian | 93.37 | 84.7 | 96.7 | 98.7 | 88.7 | 76.3 | 91.5 | 98.3 | ||||
Lithuanian | 94.57 | 86.4 | 97.6 | 99.7 | 87.33 | 71.7 | 91 | 99.3 | ||||
Macedonian | 83.5 | 65.7 | 86.3 | 98.5 | 86.43 | 70.6 | 88.7 | 100 | ||||
Malay | 31.43 | 26 | 38.3 | 30 | ||||||||
Maori | 91.97 | 84.4 | 92.5 | 99 | ||||||||
Marathi | 84.7 | 73.9 | 84.7 | 95.5 | 91.13 | 82.9 | 92.2 | 98.3 | ||||
Mongolian | 96.9 | 92.8 | 98.7 | 99.2 | ||||||||
Nynorsk | 65.5 | 40.8 | 65.4 | 90.3 | 59.8 | 31.7 | 53.4 | 94.3 | 53.03 | 31.8 | 46.5 | 80.8 |
Oromo | 94.7 | 92.9 | 97.7 | 93.5 | ||||||||
Persian | 90.2 | 77.6 | 93.7 | 99.3 | 80.1 | 63.2 | 77.7 | 99.4 | ||||
Polish | 94.57 | 85.4 | 98.4 | 99.9 | 89.7 | 75.8 | 93.4 | 99.9 | ||||
Portuguese | 80.73 | 59 | 85.3 | 97.9 | 61.9 | 30.3 | 57 | 98.4 | 66.8 | 42.3 | 61 | 97.1 |
Punjabi | 99.97 | 100 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Romanian | 86.5 | 68.7 | 91.7 | 99.1 | 77.37 | 56.3 | 78.2 | 97.6 | ||||
Russian | 89.7 | 76.5 | 94.8 | 97.8 | 84.13 | 69.7 | 87.6 | 95.1 | 100 | 100 | 100 | 100 |
Serbian | 87.53 | 73.5 | 90 | 99.1 | ||||||||
Shona | 91 | 77.6 | 95.5 | 99.9 | ||||||||
Sinhala | 99.77 | 100 | 100 | 99.3 | ||||||||
Slovak | 84.3 | 63.9 | 90.1 | 98.9 | 74.63 | 50.2 | 76.1 | 97.6 | ||||
Slovene | 82.17 | 61.2 | 86.7 | 98.6 | 72.5 | 47.9 | 71.4 | 98.2 | ||||
Somali | 89.9 | 75.8 | 94 | 99.9 | 90.5 | 77.2 | 94.4 | 99.9 | ||||
Sotho | 85.47 | 66.6 | 90.4 | 99.4 | ||||||||
Spanish | 69.63 | 43.6 | 68.6 | 96.7 | 57.5 | 26.2 | 48.6 | 97.7 | 72.47 | 49.3 | 69.6 | 98.5 |
Swahili | 80.47 | 59.5 | 83.7 | 98.2 | ||||||||
Swedish | 83.6 | 63.7 | 88.4 | 98.7 | 68 | 41.3 | 66.5 | 96.2 | 79.53 | 59.8 | 80.6 | 98.2 |
Tagalog | 77.8 | 51.7 | 83.1 | 98.6 | 76.77 | 53 | 78.5 | 98.8 | ||||
Tamil | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||||
Telugu | 99.97 | 100 | 100 | 99.9 | 100 | 100 | 100 | 100 | ||||
Thai | 99.7 | 100 | 100 | 99.1 | 99.93 | 100 | 100 | 99.8 | ||||
Tigrinya | 97.7 | 98.3 | 99.9 | 94.9 | ||||||||
Tsonga | 84.43 | 66.1 | 89.1 | 98.1 | ||||||||
Tswana | 84.1 | 65.1 | 88.5 | 98.7 | ||||||||
Turkish | 93.67 | 83.7 | 97.6 | 99.7 | 82.23 | 62.9 | 84.3 | 99.5 | ||||
Ukrainian | 92.3 | 84.4 | 97.3 | 95.2 | 82.87 | 66.4 | 84.8 | 97.4 | ||||
Urdu | 90.07 | 80 | 94.5 | 95.7 | 82 | 66.4 | 83.2 | 96.4 | ||||
Vietnamese | 90.73 | 78.84 | 94.15 | 99.2 | 84.39 | 61.09 | 92.27 | 99.8 | ||||
Welsh | 91.07 | 78.2 | 95.8 | 99.2 | ||||||||
Xhosa | 82.27 | 63.8 | 85 | 98 | ||||||||
Yoruba | 74.5 | 50.2 | 76.8 | 96.5 | ||||||||
Zulu | 80.47 | 61.7 | 83.1 | 96.6 |
Lingua consistently performs better than the other libraries, even when configured with more languages.
How does it work?
Most language detectors typically rely on a probabilistic N-gram model, such as that described in the N-Gram-Based Text Categorization (Cavnar & Trenkle, 1994) paper. The model is trained using the character distribution of a specific training corpus. Most libraries limit themselves to using 3-grams (trigrams), which work well for identifying the language of longer text fragments composed of multiple sentences. However, when it comes to short phrases or single words, trigrams fall short. The shorter the text, the fewer N-grams are available, making the probability estimates less reliable. To address this, Lingua employs N-grams ranging in size from 1 to 5, significantly improving the accuracy of language prediction. In low accuracy mode, only N-grams of size 3 are employed whereas high accuracy mode can use the full range of N-gram sizes from 1 to 5.
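To make the N-gram range concrete, here is a minimal illustration (not Lingua's internal code) of extracting character N-grams of sizes 1 to 5 from a single short word; a trigram-only model gets just one observation from it, whereas unigrams and bigrams contribute five more:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Slide a window of the given size across the text, yielding each character N-gram.
static IEnumerable<string> CharacterNgrams(string text, int size)
{
    for (var i = 0; i + size <= text.Length; i++)
        yield return text.Substring(i, size);
}

var word = "cat";
for (var size = 1; size <= 5; size++)
{
    var ngrams = CharacterNgrams(word, size).ToList();
    Console.WriteLine($"{size}-grams: {string.Join(", ", ngrams)}");
}

// Output:
// 1-grams: c, a, t
// 2-grams: ca, at
// 3-grams: cat
// 4-grams:
// 5-grams:
```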
In addition to a probabilistic N-gram model, Lingua also employs a rule engine. The engine first looks at the alphabet of the input text and searches for characters that are unique to specific languages. It does so by employing Unicode Script information and a lookup table of unique characters. For example, the character ß ("Eszett" or "sharp S") is unique to the German language. If a language can be unambiguously identified by the rule engine, the more computationally expensive probabilistic model doesn't need to be used.
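The following is a toy sketch of that rule-based idea, not Lingua's actual rule engine: if the input contains a character that only one of the configured languages uses, the statistical model can be skipped. The lookup table here contains only the ß example from above and would need to be far more complete in practice:

```csharp
using System;
using System.Collections.Generic;
using Lingua;
using static Lingua.Language;

// Deliberately tiny, illustrative table of characters unique to one language.
var uniqueCharacters = new Dictionary<char, Language>
{
    ['ß'] = German // "Eszett" / "sharp S"
};

Language? DetectByUniqueCharacter(string text)
{
    foreach (var c in text)
        if (uniqueCharacters.TryGetValue(c, out var language))
            return language;

    return null; // no unambiguous match, fall back to the probabilistic model
}

Console.WriteLine(DetectByUniqueCharacter("straße"));         // German
Console.WriteLine(DetectByUniqueCharacter("street") is null); // True
```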
Benchmarks
A language detection library needs to be both accurate and fast. The source repository contains a benchmarking suite to compare latency and allocations using BenchmarkDotNet, and the latest benchmarks ⚡ are available in the repository.
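For reference, a benchmark in that style might look roughly like the following sketch; it is not the repository's actual benchmark suite, and the detector configuration is purely illustrative:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Lingua;
using static Lingua.Language;

BenchmarkRunner.Run<DetectionBenchmarks>();

[MemoryDiagnoser]
[ShortRunJob]
public class DetectionBenchmarks
{
    private readonly LanguageDetector _detector = LanguageDetectorBuilder
        .FromLanguages(English, French, German, Spanish)
        .WithPreloadedLanguageModels()
        .Build();

    // Illustrative inputs taken from the results table below.
    [Params("suspiciously", "outlining issues")]
    public string Text { get; set; } = string.Empty;

    [Benchmark]
    public Language Detect() => _detector.DetectLanguageOf(Text);
}
```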
Running the English single word, word pairs and sentence benchmarks on the following local setup

```
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 8.0.303
  [Host]   : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3
```

yields the following results:
Benchmark | Method | Text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
English Single Word | LinguaLowAccuracy | suspiciously | 14.54 μs | 0.473 μs | 0.026 μs | 1.00 | 0.00 | 2.3651 | 0.0153 | - | 43.6 KB | 1.00 |
English Single Word | Lingua | suspiciously | 23.08 μs | 1.860 μs | 0.102 μs | 1.59 | 0.01 | 3.6316 | 0.0916 | - | 66.49 KB | 1.53 |
English Single Word | LanguageDetection | suspiciously | 95.41 μs | 16.064 μs | 0.881 μs | 6.56 | 0.07 | 1.9531 | 1.5869 | 1.2207 | 249.73 KB | 5.73 |
English Single Word | NTextCat | suspiciously | 16.29 μs | 0.279 μs | 0.015 μs | 1.12 | 0.00 | 1.6479 | 0.0610 | - | 30.46 KB | 0.70 |
English Word Pairs | LinguaLowAccuracy | outlining issues | 18.28 μs | 2.251 μs | 0.123 μs | 1.00 | 0.00 | 2.5635 | - | - | 47.21 KB | 1.00 |
English Word Pairs | Lingua | outlining issues | 28.64 μs | 6.390 μs | 0.350 μs | 1.57 | 0.02 | 3.8147 | 0.0916 | - | 70.16 KB | 1.49 |
English Word Pairs | LanguageDetection | outlining issues | 91.06 μs | 24.040 μs | 1.318 μs | 4.98 | 0.08 | 2.0752 | 1.5869 | 1.2207 | 250.11 KB | 5.30 |
English Word Pairs | NTextCat | outlining issues | 22.12 μs | 2.612 μs | 0.143 μs | 1.21 | 0.00 | 1.7090 | 0.0610 | - | 31.91 KB | 0.68 |
English Sentence | LinguaLowAccuracy | On n(...)own. [125] | 77.73 μs | 8.632 μs | 0.473 μs | 1.00 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
English Sentence | Lingua | On n(...)own. [125] | 81.47 μs | 7.988 μs | 0.438 μs | 1.05 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
English Sentence | LanguageDetection | On n(...)own. [125] | 105.56 μs | 31.676 μs | 1.736 μs | 1.36 | 0.03 | 2.1973 | 1.9531 | 0.9766 | 260.55 KB | 1.91 |
English Sentence | NTextCat | On n(...)own. [125] | 188.29 μs | 25.444 μs | 1.395 μs | 2.42 | 0.01 | 6.8359 | 0.7324 | - | 126.2 KB | 0.92 |
Each language detection library is configured with the same set of supported languages, and the input text in each benchmark has been verified to return the correct result for each implementation. This has been done to try to make the benchmark as fair as possible.
Lingua is generally the fastest in low accuracy mode, never touching Generation 2 (Gen2) of the managed heap and mostly allocating in Gen0. The NTextCat library allocates the least amount of memory, though it appears to get much slower as the length of the text increases. The LanguageDetection library is much slower and allocates more.
Lingua makes use of Span<T> and ref structs to limit allocations. With the proposal to allow an alternate lookup structure in .NET 9 for searching dictionaries and hash sets, it'll be possible to reduce allocations further by comparing N-grams parsed from input text as spans against the language model dictionaries held in memory, rather than allocating substrings for each N-gram size.
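As a rough sketch of what that could look like, assuming the proposed GetAlternateLookup API ships in .NET 9 as designed (this is not Lingua's code, and the trigram probabilities are made up):

```csharp
using System;
using System.Collections.Generic;

// A hypothetical slice of a language model: trigram -> probability.
var ngramProbabilities = new Dictionary<string, float>(StringComparer.Ordinal)
{
    ["lan"] = 0.012f,
    ["ang"] = 0.015f,
    ["ngu"] = 0.008f,
};

// The alternate lookup allows querying the dictionary with a ReadOnlySpan<char> key,
// so no substring needs to be allocated per N-gram.
var lookup = ngramProbabilities.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "languages";
for (var i = 0; i + 3 <= text.Length; i++)
{
    var trigram = text.Slice(i, 3);
    if (lookup.TryGetValue(trigram, out var probability))
        Console.WriteLine($"{trigram}: {probability}");
}
```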
Summary
SearchPioneer.Lingua offers a powerful, efficient, and accurate solution for language detection in .NET applications. We'll look to add new features in the future, such as identifying multiple languages in mixed-language texts, something the Rust implementation supports. We encourage you to explore the repository, try out the library in your own projects, and contribute to its continued development.
A special thank you to Peter for releasing the original implementations as open source 🙏.