Introducing SearchPioneer.Lingua

Russ Cam
12 August 2024

A fast and accurate language detection library for .NET

Tower of Babel

We are thrilled to announce SearchPioneer.Lingua, a natural language detection library for .NET. This is a .NET port of the wonderful JVM and Rust implementations of Lingua created by Peter M. Stahl.

We really wanted to name the package Lingua to align with the other implementations, but unfortunately it's already taken by an unlisted and no longer maintained package 🙁.

We'll refer to SearchPioneer.Lingua as simply Lingua throughout this article.

The task of Lingua is simple: it tells you which language some provided text is written in. This is very useful as a preprocessing step in a variety of applications that deal with natural language.

Language detection is often done as part of large machine learning frameworks or natural language processing (NLP) applications. There are some cases, however, where you may not want to use such large systems; one such case is the query flow of a high-traffic search system, where the overall latency of each request is paramount and every additional millisecond counts.

Lingua supports 79 languages, with a focus on quality over quantity. Languages supported are:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bengali, Bokmal (Norwegian), Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Ganda, Georgian, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Lithuanian, Macedonian, Malay, Maori, Marathi, Mongolian, Nynorsk (Norwegian), Oromo, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Serbian, Shona, Sinhala, Slovak, Slovene, Somali, Sotho, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Tigrinya, Tsonga, Tswana, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Xhosa, Yoruba, Zulu

Getting started

Lingua doesn't need much configuration to use and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

Getting started with SearchPioneer.Lingua is as simple as installing the NuGet package:

dotnet add package SearchPioneer.Lingua

Then build a detector with the desired configuration and use it:

using Lingua;
using Xunit; // for Assert.Equal
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish) // detect only these languages
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(English, detectedLanguage);

Minimum relative distance

By default, Lingua returns the most likely language for a given input text. However, certain words are spelled the same in more than one language; the word prologue, for instance, is both a valid English and a valid French word. Lingua would output either English or French, which might be wrong in the given context. For cases like this, it is possible to specify a minimum relative distance that the logarithmized and summed probabilities for each possible language have to satisfy. This can be configured as follows:

using Lingua;
using Xunit; // for Assert.Equal
using static Lingua.Language;

var detector = LanguageDetectorBuilder
  .FromLanguages(English, French, German, Spanish)
  .WithMinimumRelativeDistance(0.9)
  .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(Unknown, detectedLanguage);

Be aware that the distance between the language probabilities depends on the length of the input text: the longer the input text, the larger the distance between the languages. So if you want to classify very short phrases, do not set the minimum relative distance too high, otherwise the Unknown language will be returned most of the time, as in the example above. Unknown is the return value for cases where language detection is not reliably possible.
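To illustrate the idea, here is a standalone sketch (not Lingua's internal implementation, and with made-up confidence values) of what a minimum relative distance check amounts to: compare the two most likely languages and return Unknown when they are too close together.

```csharp
using System;
using System.Linq;

// Standalone sketch of the minimum relative distance idea, not Lingua's
// internal implementation. The confidence values below are made up.
var confidences = new (string Language, double Confidence)[]
{
    ("English", 0.52),
    ("French", 0.48),
};

const double minimumRelativeDistance = 0.9;

// Order by confidence and compare the two most likely languages;
// if they are too close together, the detection is considered unreliable.
var ordered = confidences.OrderByDescending(c => c.Confidence).ToArray();
var distance = ordered[0].Confidence - ordered[1].Confidence;

var result = distance >= minimumRelativeDistance
    ? ordered[0].Language
    : "Unknown";

Console.WriteLine(result); // the two candidates are only 0.04 apart, so "Unknown"
```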

Confidence values

Sometimes you want to know the confidence values for each of the configured languages. Referring back to our search system example, we may have search indexes and fields for each language that the search system supports, and want to query those indexes where the query text confidence value is above a given threshold. The library supports this operation too:

using Lingua;
using Xunit; // for Assert.Equal
using static Lingua.Language;

var detector = LanguageDetectorBuilder
  .FromLanguages(English, French, German, Spanish)
  .Build();

var detectedLanguages = detector.ComputeLanguageConfidenceValues("languages are awesome")
  .Select(kv => (kv.Key, Math.Round(kv.Value, 2)))
  .ToArray();

Assert.Equal(new []
{
  (English, 0.93),
  (French, 0.04),
  (German, 0.02),
  (Spanish, 0.01)
}, detectedLanguages);

The return value is an ordered dictionary of languages, ordered by confidence value descending. In the example above, the dictionary is projected into an array of tuples ordered by their truncated confidence value descending. Each confidence value is a probability between 0.0 and 1.0, and the probabilities of all languages sum to 1.0. If a language is unambiguously identified by the rule engine, it will always receive the value 1.0 and the other languages will receive 0.0.
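Returning to the search system example, the confidence values can drive which language-specific indexes to query. The following is a standalone sketch: the dictionary stands in for the output of ComputeLanguageConfidenceValues, and the index naming scheme is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Standalone sketch: decide which language-specific search indexes to query
// based on confidence values. The dictionary below stands in for the output
// of detector.ComputeLanguageConfidenceValues(queryText), and the index
// naming scheme ("products-en", etc.) is hypothetical.
var confidences = new Dictionary<string, double>
{
    ["English"] = 0.93,
    ["French"] = 0.04,
    ["German"] = 0.02,
    ["Spanish"] = 0.01,
};

var indexByLanguage = new Dictionary<string, string>
{
    ["English"] = "products-en",
    ["French"] = "products-fr",
    ["German"] = "products-de",
    ["Spanish"] = "products-es",
};

const double threshold = 0.1;

// Query only the indexes whose language confidence meets the threshold.
var indexesToQuery = confidences
    .Where(kv => kv.Value >= threshold)
    .Select(kv => indexByLanguage[kv.Key])
    .ToList();

Console.WriteLine(string.Join(", ", indexesToQuery)); // "products-en"
```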

Eager loading versus lazy loading

By default, Lingua lazy-loads only those language models which are considered relevant by the rule-based filter engine. Sometimes it is more beneficial to preload all language models into memory on application startup. This can be achieved as follows:

using Lingua;

var detector = LanguageDetectorBuilder
  .FromAllLanguages()
  .WithPreloadedLanguageModels()
  .Build();

The language models loaded into memory are shared across all LanguageDetector instances in a thread-safe manner.

Low accuracy mode versus high accuracy mode

Lingua's high detection accuracy can come with a latency and memory trade-off; the large language models can consume sizeable amounts of memory. These requirements might not be feasible for systems running low on resources. If you want to classify mostly long texts or need to save resources, you can enable a low accuracy mode that loads only a small subset of the language models into memory:

using Lingua;

var detector = LanguageDetectorBuilder
  .FromAllLanguages()
  .WithLowAccuracyMode()
  .Build();

The downside of low accuracy mode is that detection accuracy for short texts of less than 120 characters drops significantly. However, detection accuracy for texts longer than 120 characters remains mostly unaffected.

An alternative for a smaller memory footprint and faster performance is to reduce the set of languages when building the language detector. In most cases, it is not advisable to build the detector from all supported languages; when you have knowledge about the texts you want to classify, you can almost always rule out certain languages as impossible or unlikely to occur.

Accuracy

The source repository provides the ability to test the accuracy of Lingua or any other language detection library in .NET, using 1000 single words, 1000 word pairs and 1000 sentences extracted from the same dataset used for training, namely the Wortschatz corpora provided by Leipzig University.

Accuracy reports 🎯 are available in the repository. The following table summarizes the mean accuracy of Lingua in high accuracy mode, along with two other popular language detection libraries, LanguageDetection and NTextCat. The mean of means (Average Accuracy) is also included in each case.

Each language detection library is configured and tested with all the languages it supports:

  • Lingua supports 79 languages
  • LanguageDetection supports 49 languages
  • NTextCat supports 15 languages

The number of languages configured should be taken into consideration when perusing the following table; detection with more configured languages has a higher likelihood of being wrong when results are ambiguous, which is itself more likely with shorter input text.

As an example, NTextCat scores 100% on Russian across 1000 single words, whilst Lingua scores 76.5%. Lingua erroneously classifies some Russian single words in high accuracy mode as

  • Ukrainian: 6.30%
  • Bulgarian: 5.50%
  • Serbian: 3.40%
  • Belarusian: 3.30%
  • Macedonian: 3.00%
  • Mongolian: 1.10%
  • Kazakh: 0.90%

i.e. a mixture of other Slavic languages and languages that use Cyrillic script. Importantly, NTextCat doesn't support any of these arguably similar languages, so it has a lower likelihood of being wrong than Lingua.

Language · Lingua (High Accuracy): Average / Single Words / Word Pairs / Sentences · LanguageDetection: Average / Single Words / Word Pairs / Sentences · NTextCat: Average / Single Words / Word Pairs / Sentences

All figures are percentages. Rows with fewer values indicate that the trailing libraries do not support that language, so their columns are blank.
Afrikaans 78.63 58.3 80.8 96.8 0 0 0 0
Albanian 87.6 68.5 94.8 99.5 78.8 53.2 83.8 99.4
Amharic 97.63 97.3 99.7 95.9
Arabic 98.5 96.4 99.2 99.9 97.37 93.7 98.5 99.9
Armenian 100 100 100 100
Azerbaijani 89.5 77.2 92.3 99
Basque 83.57 70.9 87.3 92.5
Belarusian 96.87 91.5 99.2 99.9
Bengali 100 100 100 100 100 100 100 100
Bokmal 57.6 38.7 58.7 75.4 65.3 39.5 64.7 91.7 60.5 38.3 57.3 85.9
Bosnian 34.53 29 34.6 40
Bulgarian 86.77 70.2 91.2 98.9 71.47 50.1 67.9 96.4
Catalan 70.17 50.6 73.9 86
Chinese 100 100 100 100 72.67 27.5 90.5 100 60.86 16.6 67.5 98.49
Croatian 72.4 53.4 74.3 89.5 73.03 50.4 71.2 97.5
Czech 80.23 65.5 84.5 90.7 70.7 51.5 72.1 88.5
Danish 80.97 61.2 83.9 97.8 71 50.6 68.3 94.1 74.2 56.5 70.6 95.5
Dutch 77.27 54.9 80.7 96.2 58.2 27.3 49.9 97.4 72.53 50.1 69.9 97.6
English 80.77 54.6 88.6 99.1 61 23.4 60.4 99.2 76.97 54.2 76.7 100
Esperanto 83.5 67.1 85.1 98.3
Estonian 91.6 79.1 95.9 99.8 82.77 61.7 87.1 99.5
Finnish 95.8 89.6 97.8 100 93.03 83.8 95.4 99.9
French 89.17 74.2 94.5 98.8 75.8 49.2 79.2 99 86.37 70.6 89.6 98.9
Ganda 91.2 78.6 95 100
Georgian 99.97 100 100 99.9
German 89.2 73.8 94.1 99.7 73.13 49 70.7 99.7 81.83 62.9 82.7 99.9
Greek 99.97 100 100 99.9 99.97 100 100 99.9
Gujarati 99.93 99.9 100 99.9 99.97 100 100 99.9
Hebrew 99.87 100 100 99.6 99.93 100 100 99.8
Hindi 72.6 60.7 64.3 92.8 77.3 60 72.8 99.1
Hungarian 94.8 86.5 97.9 100 88.57 74.6 91.4 99.7
Icelandic 93.07 82.7 96.7 99.8
Indonesian 60.3 39.2 60.9 80.8 80.27 56.1 84.8 99.9
Irish 90.7 81.8 94.3 96
Italian 86.8 68.9 91.8 99.7 77.67 52 81.7 99.3 88.03 74.5 90.3 99.3
Japanese 100 100 100 100 25.65 1.27 4.8 70.87 95.08 85.35 99.9 100
Kazakh 91.83 79.9 96.5 99.1
Korean 99.97 100 100 99.9 99.87 100 100 99.6 99.87 100 100 99.6
Latin 86.97 71.8 92.5 96.6
Latvian 93.37 84.7 96.7 98.7 88.7 76.3 91.5 98.3
Lithuanian 94.57 86.4 97.6 99.7 87.33 71.7 91 99.3
Macedonian 83.5 65.7 86.3 98.5 86.43 70.6 88.7 100
Malay 31.43 26 38.3 30
Maori 91.97 84.4 92.5 99
Marathi 84.7 73.9 84.7 95.5 91.13 82.9 92.2 98.3
Mongolian 96.9 92.8 98.7 99.2
Nynorsk 65.5 40.8 65.4 90.3 59.8 31.7 53.4 94.3 53.03 31.8 46.5 80.8
Oromo 94.7 92.9 97.7 93.5
Persian 90.2 77.6 93.7 99.3 80.1 63.2 77.7 99.4
Polish 94.57 85.4 98.4 99.9 89.7 75.8 93.4 99.9
Portuguese 80.73 59 85.3 97.9 61.9 30.3 57 98.4 66.8 42.3 61 97.1
Punjabi 99.97 100 100 99.9 99.97 100 100 99.9
Romanian 86.5 68.7 91.7 99.1 77.37 56.3 78.2 97.6
Russian 89.7 76.5 94.8 97.8 84.13 69.7 87.6 95.1 100 100 100 100
Serbian 87.53 73.5 90 99.1
Shona 91 77.6 95.5 99.9
Sinhala 99.77 100 100 99.3
Slovak 84.3 63.9 90.1 98.9 74.63 50.2 76.1 97.6
Slovene 82.17 61.2 86.7 98.6 72.5 47.9 71.4 98.2
Somali 89.9 75.8 94 99.9 90.5 77.2 94.4 99.9
Sotho 85.47 66.6 90.4 99.4
Spanish 69.63 43.6 68.6 96.7 57.5 26.2 48.6 97.7 72.47 49.3 69.6 98.5
Swahili 80.47 59.5 83.7 98.2
Swedish 83.6 63.7 88.4 98.7 68 41.3 66.5 96.2 79.53 59.8 80.6 98.2
Tagalog 77.8 51.7 83.1 98.6 76.77 53 78.5 98.8
Tamil 100 100 100 100 100 100 100 100
Telugu 99.97 100 100 99.9 100 100 100 100
Thai 99.7 100 100 99.1 99.93 100 100 99.8
Tigrinya 97.7 98.3 99.9 94.9
Tsonga 84.43 66.1 89.1 98.1
Tswana 84.1 65.1 88.5 98.7
Turkish 93.67 83.7 97.6 99.7 82.23 62.9 84.3 99.5
Ukrainian 92.3 84.4 97.3 95.2 82.87 66.4 84.8 97.4
Urdu 90.07 80 94.5 95.7 82 66.4 83.2 96.4
Vietnamese 90.73 78.84 94.15 99.2 84.39 61.09 92.27 99.8
Welsh 91.07 78.2 95.8 99.2
Xhosa 82.27 63.8 85 98
Yoruba 74.5 50.2 76.8 96.5
Zulu 80.47 61.7 83.1 96.6

Lingua consistently performs better than the other libraries, even when configured with more languages.

How does it work?

Most language detectors typically rely on a probabilistic N-gram model, such as that described in the N-Gram-Based Text Categorization (Cavnar & Trenkle, 1994) paper. The model is trained using the character distribution of a specific training corpus. Most libraries limit themselves to using 3-grams (trigrams), which work well for identifying the language of longer text fragments composed of multiple sentences. However, when it comes to short phrases or single words, trigrams fall short. The shorter the text, the fewer N-grams are available, making the probability estimates less reliable. To address this, Lingua employs N-grams ranging in size from 1 to 5, significantly improving the accuracy of language prediction. In low accuracy mode, only N-grams of size 3 are employed whereas high accuracy mode can use the full range of N-gram sizes from 1 to 5.
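To make the N-gram idea concrete, here is a minimal standalone sketch of extracting all character N-grams of sizes 1 to 5 from a short input. It illustrates the concept only, not Lingua's actual model code.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of extracting character N-grams of sizes 1 to 5 from text.
// This illustrates the concept only; Lingua's actual implementation avoids
// these substring allocations by working with spans.
static List<string> ExtractNgrams(string text, int minSize = 1, int maxSize = 5)
{
    var ngrams = new List<string>();
    for (var size = minSize; size <= maxSize; size++)
    {
        for (var i = 0; i + size <= text.Length; i++)
        {
            ngrams.Add(text.Substring(i, size));
        }
    }
    return ngrams;
}

var trigramsOnly = ExtractNgrams("cat", 3, 3);
Console.WriteLine(trigramsOnly.Count); // a three-letter word yields just one trigram

var fullRange = ExtractNgrams("cat"); // sizes 1 to 5
Console.WriteLine(fullRange.Count);   // 3 unigrams + 2 bigrams + 1 trigram = 6
```

The shorter the text, the starker the difference: a trigram-only model has a single data point for a three-letter word, while the full 1-to-5 range has six.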

In addition to a probabilistic N-gram model, Lingua also employs a rule engine. The engine first examines the alphabet of the input text and looks for characters that are unique to specific languages, using Unicode script information and a lookup table of unique characters. For example, the character ß ("Eszett" or "sharp S") is unique to the German language. If a language can be unambiguously identified by the rule engine, the more computationally expensive probabilistic model doesn't need to be used.
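As a simplified illustration of the unique-character idea, here is a standalone sketch with a tiny hand-picked lookup table; it is not the library's real rule engine.

```csharp
using System;
using System.Collections.Generic;

// Standalone sketch of the unique-character rule idea. The lookup table here
// is a tiny hand-picked sample, not Lingua's real table.
var uniqueCharacters = new Dictionary<char, string>
{
    ['ß'] = "German",     // Eszett appears only in German
    ['ő'] = "Hungarian",  // double acute accent
};

static string? DetectByUniqueCharacter(string text, Dictionary<char, string> lookup)
{
    foreach (var ch in text)
    {
        if (lookup.TryGetValue(ch, out var language))
            return language; // unambiguous: no need for the statistical model
    }
    return null; // fall back to the probabilistic N-gram model
}

Console.WriteLine(DetectByUniqueCharacter("straße", uniqueCharacters)); // "German"
Console.WriteLine(DetectByUniqueCharacter("street", uniqueCharacters) ?? "no rule match");
```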

Benchmarks

A language detection library needs to be both accurate and fast. The source repository contains a benchmarking suite to compare latency and allocations using BenchmarkDotNet, and the latest benchmarks ⚡ are available in the repository.

Running the English single word, word pairs and sentence benchmarks on the following local setup

BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 8.0.303
  [Host]   : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

yields the following results

| Benchmark | Method | Text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
| English Single Word | LinguaLowAccuracy | suspiciously | 14.54 μs | 0.473 μs | 0.026 μs | 1.00 | 0.00 | 2.3651 | 0.0153 | - | 43.6 KB | 1.00 |
| English Single Word | Lingua | suspiciously | 23.08 μs | 1.860 μs | 0.102 μs | 1.59 | 0.01 | 3.6316 | 0.0916 | - | 66.49 KB | 1.53 |
| English Single Word | LanguageDetection | suspiciously | 95.41 μs | 16.064 μs | 0.881 μs | 6.56 | 0.07 | 1.9531 | 1.5869 | 1.2207 | 249.73 KB | 5.73 |
| English Single Word | NTextCat | suspiciously | 16.29 μs | 0.279 μs | 0.015 μs | 1.12 | 0.00 | 1.6479 | 0.0610 | - | 30.46 KB | 0.70 |
| English Word Pairs | LinguaLowAccuracy | outlining issues | 18.28 μs | 2.251 μs | 0.123 μs | 1.00 | 0.00 | 2.5635 | - | - | 47.21 KB | 1.00 |
| English Word Pairs | Lingua | outlining issues | 28.64 μs | 6.390 μs | 0.350 μs | 1.57 | 0.02 | 3.8147 | 0.0916 | - | 70.16 KB | 1.49 |
| English Word Pairs | LanguageDetection | outlining issues | 91.06 μs | 24.040 μs | 1.318 μs | 4.98 | 0.08 | 2.0752 | 1.5869 | 1.2207 | 250.11 KB | 5.30 |
| English Word Pairs | NTextCat | outlining issues | 22.12 μs | 2.612 μs | 0.143 μs | 1.21 | 0.00 | 1.7090 | 0.0610 | - | 31.91 KB | 0.68 |
| English Sentence | LinguaLowAccuracy | On n(...)own. [125] | 77.73 μs | 8.632 μs | 0.473 μs | 1.00 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
| English Sentence | Lingua | On n(...)own. [125] | 81.47 μs | 7.988 μs | 0.438 μs | 1.05 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
| English Sentence | LanguageDetection | On n(...)own. [125] | 105.56 μs | 31.676 μs | 1.736 μs | 1.36 | 0.03 | 2.1973 | 1.9531 | 0.9766 | 260.55 KB | 1.91 |
| English Sentence | NTextCat | On n(...)own. [125] | 188.29 μs | 25.444 μs | 1.395 μs | 2.42 | 0.01 | 6.8359 | 0.7324 | - | 126.2 KB | 0.92 |

Each language detection library is configured with the same set of supported languages, and the input text in each benchmark has been verified to return the correct result for each implementation. This has been done to try to make the benchmark as fair as possible.

Lingua is generally the fastest in low accuracy mode, never touching Generation 2 (Gen2) of the managed heap and mainly using Gen0. The NTextCat library allocates the least amount of memory, though it appears to get much slower as the length of the input text increases. The LanguageDetection library is both much slower and allocates more.

Lingua makes use of Span<T> and ref structs to limit allocations. With the .NET 9 proposal to allow an alternate lookup type for searching dictionaries and hash sets, it will be possible to reduce allocations further: N-grams parsed from the input text can be compared as spans against the language model dictionaries held in memory, rather than allocating a substring for each N-gram size.
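The shape of that alternate lookup can be sketched as follows; this assumes a .NET 9 or later runtime where the feature shipped as GetAlternateLookup, and the N-gram probabilities shown are made up.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the .NET 9 alternate lookup feature (requires .NET 9 or later).
// StringComparer.Ordinal implements the alternate equality comparer for
// ReadOnlySpan<char>, allowing dictionary lookups without allocating a
// string for each N-gram sliced from the input. The probabilities are made up.
var ngramProbabilities = new Dictionary<string, double>(StringComparer.Ordinal)
{
    ["lan"] = 0.012,
    ["ang"] = 0.009,
};

var lookup = ngramProbabilities.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "language";
// Slice a trigram as a span; no substring allocation is needed for the lookup.
if (lookup.TryGetValue(text.Slice(0, 3), out var probability))
{
    Console.WriteLine(probability); // probability == 0.012
}
```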

Summary

SearchPioneer.Lingua offers a powerful, efficient, and accurate solution for language detection in .NET applications. We'll look to add new features in the future, such as identifying multiple languages in mixed-language texts, something the Rust implementation already supports. We encourage you to explore the repository, try out the library in your own projects, and contribute to its continued development.

A special thank you to Peter for releasing the original implementations as open source 🙏.