Introducing SearchPioneer.Lingua
Russ Cam
12 August 2024

A fast and accurate language detection library for .NET
We are thrilled to announce SearchPioneer.Lingua, a natural language detection library for .NET. This is a .NET port of the wonderful JVM and Rust implementations of Lingua created by Peter M. Stahl.
We really wanted to name the package Lingua to align with the other implementations, but unfortunately it's already taken by an unlisted and no longer maintained package 🙁.
We'll refer to SearchPioneer.Lingua as simply Lingua throughout this article.
The task of Lingua is simple: it tells you which language some provided text is written in. This is very useful as a preprocessing step in a variety of applications that deal with natural language, including:
- Text Classification
- Spell Checking
- Search / Information Retrieval
Language detection is often done as part of large machine learning frameworks or natural language processing (NLP) applications. There are some cases, however, where you may not want to use such large systems; one such case is the query flow of a high-traffic search system, where the overall latency of each request is paramount and every additional millisecond counts.
Lingua supports 79 languages, with a focus on quality over quantity; the full list of supported languages can be found in the repository.
Getting started
Lingua doesn't need much configuration to use and yields pretty accurate results on both long and short text, even on single words and phrases. It draws on both rule-based and statistical methods but does not use any dictionaries of words. It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.
Getting started with SearchPioneer.Lingua is as simple as installing the NuGet package:

```sh
dotnet add package SearchPioneer.Lingua
```

Then build a detector with the desired configuration and use it:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish) // detect only these languages
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(English, detectedLanguage);
```
Minimum relative distance
By default, Lingua returns the most likely language for a given input text. However, there are certain words that are spelled the same in more than one language; the word prologue, for instance, is both a valid English and a valid French word. Lingua would output either English or French, which might be wrong in the given context. For cases like this, it is possible to specify a minimum relative distance that the logarithmized and summed probabilities for each possible language have to satisfy. This can be configured as follows:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .WithMinimumRelativeDistance(0.9)
    .Build();

var detectedLanguage = detector.DetectLanguageOf("languages are awesome");
Assert.Equal(Unknown, detectedLanguage);
```
Be aware that the distance between the language probabilities depends on the length of the input text: the longer the input text, the larger the distance between the languages. So if you want to classify very short text phrases, do not set the minimum relative distance too high, otherwise Unknown will be returned most of the time, as in the example above. Unknown is the return value for cases where language detection is not reliably possible.
Confidence values
Sometimes you want to know the confidence values for each of the configured languages. Referring back to our search system example, we may have search indexes and fields for each language that the search system supports, and want to query only those indexes for which the detected confidence of the query text is above a given threshold. The library supports this operation too:
```csharp
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .Build();

var detectedLanguages = detector.ComputeLanguageConfidenceValues("languages are awesome")
    .Select(kv => (kv.Key, Math.Round(kv.Value, 2)))
    .ToArray();

Assert.Equal(new[]
{
    (English, 0.93),
    (French, 0.04),
    (German, 0.02),
    (Spanish, 0.01)
}, detectedLanguages);
```
The return value is an ordered dictionary of languages, sorted by confidence value descending. In the example above, the dictionary is projected into an array of tuples of language and rounded confidence value. Each confidence value is a probability between 0.0 and 1.0, and the probabilities of all languages sum to 1.0. If a language is unambiguously identified by the rule engine, it will always receive a value of 1.0 and the other languages will receive a value of 0.0.
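Building on the search system example, here is a sketch of routing a query only to the per-language indexes whose confidence exceeds a threshold. The threshold value and the index naming scheme are hypothetical, and the actual search call is stubbed out with a console write:

```csharp
using System;
using Lingua;
using static Lingua.Language;

var detector = LanguageDetectorBuilder
    .FromLanguages(English, French, German, Spanish)
    .Build();

const double threshold = 0.2; // illustrative value, tune for your own data
var queryText = "languages are awesome";

foreach (var kv in detector.ComputeLanguageConfidenceValues(queryText))
{
    // The results are ordered by confidence descending, so we can stop at the
    // first value below the threshold.
    if (kv.Value < threshold)
        break;

    // Hypothetical routing to a per-language index, e.g. "products-English".
    Console.WriteLine($"Querying index products-{kv.Key} for \"{queryText}\"");
}
```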
Eager loading versus lazy loading
By default, Lingua lazy-loads only those language models which are considered relevant by the rule-based filter engine. Sometimes it is more beneficial to preload all language models into memory on application startup. This can be achieved as follows:
```csharp
using Lingua;

var detector = LanguageDetectorBuilder
    .FromAllLanguages()
    .WithPreloadedLanguageModels()
    .Build();
```
The language models loaded into memory are shared across all LanguageDetector instances in a thread-safe manner.
Low accuracy mode versus high accuracy mode
Lingua's high detection accuracy can come with a latency and memory trade-off; the large language models can consume sizeable amounts of memory. These requirements might not be feasible for systems running low on resources. If you want to classify mostly long texts or need to save resources, you can enable a low accuracy mode that loads only a small subset of the language models into memory:
```csharp
using Lingua;

var detector = LanguageDetectorBuilder
    .FromAllLanguages()
    .WithLowAccuracyMode()
    .Build();
```
The downside of low accuracy mode is that detection accuracy for short texts of less than 120 characters drops significantly. However, detection accuracy for texts longer than 120 characters remains mostly unaffected.
An alternative for a smaller memory footprint and faster performance is to reduce the set of languages when building the language detector. In most cases, it is not advisable to build the detector from all supported languages; when you have knowledge about the texts you want to classify, you can almost always rule out certain languages as impossible or unlikely to occur.
Accuracy
The source repository provides the ability to test the accuracy of Lingua or any other language detection library in .NET, using 1000 single words, 1000 word pairs and 1000 sentences extracted from the same dataset used for training, namely the Wortschatz corpora provided by Leipzig University.
Accuracy reports 🎯 are available in the repository. The following table summarizes the mean accuracy of Lingua in high accuracy mode, along with two other popular language detection libraries, LanguageDetection and NTextCat. The mean of means (Average Accuracy) is also included in each case.
Each language detection library is configured and tested with all the languages it supports:
- Lingua supports 79 languages
- LanguageDetection supports 49 languages
- NTextCat supports 15 languages
The number of languages configured should be taken into consideration when perusing the following table; language detection with more configured languages has a higher likelihood of being wrong when results are ambiguous, which is itself more likely with shorter input text. As an example, NTextCat scores 100% on Russian for 1000 single words, whilst Lingua scores 76.5%. Lingua erroneously classifies some Russian single words in high accuracy mode as:
- Ukrainian: 6.30%
- Bulgarian: 5.50%
- Serbian: 3.40%
- Belarusian: 3.30%
- Macedonian: 3.00%
- Mongolian: 1.10%
- Kazakh: 0.90%
i.e. a mixture of other Slavic languages and languages that use Cyrillic script. Importantly, NTextCat doesn't support any of these arguably similar languages, so it has a lower likelihood of being wrong than Lingua.
All values in the following table are accuracy percentages; empty cells indicate languages not supported by that library.

Language | Lingua Average High Accuracy | Lingua Single Words High Accuracy | Lingua Word Pairs High Accuracy | Lingua Sentences High Accuracy | LanguageDetection Average Accuracy | LanguageDetection Single Words Accuracy | LanguageDetection Word Pairs Accuracy | LanguageDetection Sentences Accuracy | NTextCat Average Accuracy | NTextCat Single Words Accuracy | NTextCat Word Pairs Accuracy | NTextCat Sentences Accuracy
---|---|---|---|---|---|---|---|---|---|---|---|---
Afrikaans | 78.63 | 58.3 | 80.8 | 96.8 | 0 | 0 | 0 | 0 | ||||
Albanian | 87.6 | 68.5 | 94.8 | 99.5 | 78.8 | 53.2 | 83.8 | 99.4 | ||||
Amharic | 97.63 | 97.3 | 99.7 | 95.9 | ||||||||
Arabic | 98.5 | 96.4 | 99.2 | 99.9 | 97.37 | 93.7 | 98.5 | 99.9 | ||||
Armenian | 100 | 100 | 100 | 100 | ||||||||
Azerbaijani | 89.5 | 77.2 | 92.3 | 99 | ||||||||
Basque | 83.57 | 70.9 | 87.3 | 92.5 | ||||||||
Belarusian | 96.87 | 91.5 | 99.2 | 99.9 | ||||||||
Bengali | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||||
Bokmal | 57.6 | 38.7 | 58.7 | 75.4 | 65.3 | 39.5 | 64.7 | 91.7 | 60.5 | 38.3 | 57.3 | 85.9 |
Bosnian | 34.53 | 29 | 34.6 | 40 | ||||||||
Bulgarian | 86.77 | 70.2 | 91.2 | 98.9 | 71.47 | 50.1 | 67.9 | 96.4 | ||||
Catalan | 70.17 | 50.6 | 73.9 | 86 | ||||||||
Chinese | 100 | 100 | 100 | 100 | 72.67 | 27.5 | 90.5 | 100 | 60.86 | 16.6 | 67.5 | 98.49 |
Croatian | 72.4 | 53.4 | 74.3 | 89.5 | 73.03 | 50.4 | 71.2 | 97.5 | ||||
Czech | 80.23 | 65.5 | 84.5 | 90.7 | 70.7 | 51.5 | 72.1 | 88.5 | ||||
Danish | 80.97 | 61.2 | 83.9 | 97.8 | 71 | 50.6 | 68.3 | 94.1 | 74.2 | 56.5 | 70.6 | 95.5 |
Dutch | 77.27 | 54.9 | 80.7 | 96.2 | 58.2 | 27.3 | 49.9 | 97.4 | 72.53 | 50.1 | 69.9 | 97.6 |
English | 80.77 | 54.6 | 88.6 | 99.1 | 61 | 23.4 | 60.4 | 99.2 | 76.97 | 54.2 | 76.7 | 100 |
Esperanto | 83.5 | 67.1 | 85.1 | 98.3 | ||||||||
Estonian | 91.6 | 79.1 | 95.9 | 99.8 | 82.77 | 61.7 | 87.1 | 99.5 | ||||
Finnish | 95.8 | 89.6 | 97.8 | 100 | 93.03 | 83.8 | 95.4 | 99.9 | ||||
French | 89.17 | 74.2 | 94.5 | 98.8 | 75.8 | 49.2 | 79.2 | 99 | 86.37 | 70.6 | 89.6 | 98.9 |
Ganda | 91.2 | 78.6 | 95 | 100 | ||||||||
Georgian | 99.97 | 100 | 100 | 99.9 | ||||||||
German | 89.2 | 73.8 | 94.1 | 99.7 | 73.13 | 49 | 70.7 | 99.7 | 81.83 | 62.9 | 82.7 | 99.9 |
Greek | 99.97 | 100 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Gujarati | 99.93 | 99.9 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Hebrew | 99.87 | 100 | 100 | 99.6 | 99.93 | 100 | 100 | 99.8 | ||||
Hindi | 72.6 | 60.7 | 64.3 | 92.8 | 77.3 | 60 | 72.8 | 99.1 | ||||
Hungarian | 94.8 | 86.5 | 97.9 | 100 | 88.57 | 74.6 | 91.4 | 99.7 | ||||
Icelandic | 93.07 | 82.7 | 96.7 | 99.8 | ||||||||
Indonesian | 60.3 | 39.2 | 60.9 | 80.8 | 80.27 | 56.1 | 84.8 | 99.9 | ||||
Irish | 90.7 | 81.8 | 94.3 | 96 | ||||||||
Italian | 86.8 | 68.9 | 91.8 | 99.7 | 77.67 | 52 | 81.7 | 99.3 | 88.03 | 74.5 | 90.3 | 99.3 |
Japanese | 100 | 100 | 100 | 100 | 25.65 | 1.27 | 4.8 | 70.87 | 95.08 | 85.35 | 99.9 | 100 |
Kazakh | 91.83 | 79.9 | 96.5 | 99.1 | ||||||||
Korean | 99.97 | 100 | 100 | 99.9 | 99.87 | 100 | 100 | 99.6 | 99.87 | 100 | 100 | 99.6 |
Latin | 86.97 | 71.8 | 92.5 | 96.6 | ||||||||
Latvian | 93.37 | 84.7 | 96.7 | 98.7 | 88.7 | 76.3 | 91.5 | 98.3 | ||||
Lithuanian | 94.57 | 86.4 | 97.6 | 99.7 | 87.33 | 71.7 | 91 | 99.3 | ||||
Macedonian | 83.5 | 65.7 | 86.3 | 98.5 | 86.43 | 70.6 | 88.7 | 100 | ||||
Malay | 31.43 | 26 | 38.3 | 30 | ||||||||
Maori | 91.97 | 84.4 | 92.5 | 99 | ||||||||
Marathi | 84.7 | 73.9 | 84.7 | 95.5 | 91.13 | 82.9 | 92.2 | 98.3 | ||||
Mongolian | 96.9 | 92.8 | 98.7 | 99.2 | ||||||||
Nynorsk | 65.5 | 40.8 | 65.4 | 90.3 | 59.8 | 31.7 | 53.4 | 94.3 | 53.03 | 31.8 | 46.5 | 80.8 |
Oromo | 94.7 | 92.9 | 97.7 | 93.5 | ||||||||
Persian | 90.2 | 77.6 | 93.7 | 99.3 | 80.1 | 63.2 | 77.7 | 99.4 | ||||
Polish | 94.57 | 85.4 | 98.4 | 99.9 | 89.7 | 75.8 | 93.4 | 99.9 | ||||
Portuguese | 80.73 | 59 | 85.3 | 97.9 | 61.9 | 30.3 | 57 | 98.4 | 66.8 | 42.3 | 61 | 97.1 |
Punjabi | 99.97 | 100 | 100 | 99.9 | 99.97 | 100 | 100 | 99.9 | ||||
Romanian | 86.5 | 68.7 | 91.7 | 99.1 | 77.37 | 56.3 | 78.2 | 97.6 | ||||
Russian | 89.7 | 76.5 | 94.8 | 97.8 | 84.13 | 69.7 | 87.6 | 95.1 | 100 | 100 | 100 | 100 |
Serbian | 87.53 | 73.5 | 90 | 99.1 | ||||||||
Shona | 91 | 77.6 | 95.5 | 99.9 | ||||||||
Sinhala | 99.77 | 100 | 100 | 99.3 | ||||||||
Slovak | 84.3 | 63.9 | 90.1 | 98.9 | 74.63 | 50.2 | 76.1 | 97.6 | ||||
Slovene | 82.17 | 61.2 | 86.7 | 98.6 | 72.5 | 47.9 | 71.4 | 98.2 | ||||
Somali | 89.9 | 75.8 | 94 | 99.9 | 90.5 | 77.2 | 94.4 | 99.9 | ||||
Sotho | 85.47 | 66.6 | 90.4 | 99.4 | ||||||||
Spanish | 69.63 | 43.6 | 68.6 | 96.7 | 57.5 | 26.2 | 48.6 | 97.7 | 72.47 | 49.3 | 69.6 | 98.5 |
Swahili | 80.47 | 59.5 | 83.7 | 98.2 | ||||||||
Swedish | 83.6 | 63.7 | 88.4 | 98.7 | 68 | 41.3 | 66.5 | 96.2 | 79.53 | 59.8 | 80.6 | 98.2 |
Tagalog | 77.8 | 51.7 | 83.1 | 98.6 | 76.77 | 53 | 78.5 | 98.8 | ||||
Tamil | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | ||||
Telugu | 99.97 | 100 | 100 | 99.9 | 100 | 100 | 100 | 100 | ||||
Thai | 99.7 | 100 | 100 | 99.1 | 99.93 | 100 | 100 | 99.8 | ||||
Tigrinya | 97.7 | 98.3 | 99.9 | 94.9 | ||||||||
Tsonga | 84.43 | 66.1 | 89.1 | 98.1 | ||||||||
Tswana | 84.1 | 65.1 | 88.5 | 98.7 | ||||||||
Turkish | 93.67 | 83.7 | 97.6 | 99.7 | 82.23 | 62.9 | 84.3 | 99.5 | ||||
Ukrainian | 92.3 | 84.4 | 97.3 | 95.2 | 82.87 | 66.4 | 84.8 | 97.4 | ||||
Urdu | 90.07 | 80 | 94.5 | 95.7 | 82 | 66.4 | 83.2 | 96.4 | ||||
Vietnamese | 90.73 | 78.84 | 94.15 | 99.2 | 84.39 | 61.09 | 92.27 | 99.8 | ||||
Welsh | 91.07 | 78.2 | 95.8 | 99.2 | ||||||||
Xhosa | 82.27 | 63.8 | 85 | 98 | ||||||||
Yoruba | 74.5 | 50.2 | 76.8 | 96.5 | ||||||||
Zulu | 80.47 | 61.7 | 83.1 | 96.6 |
Lingua consistently performs better than the other libraries, even when configured with more languages.
How does it work?
Most language detectors typically rely on a probabilistic N-gram model, such as that described in the N-Gram-Based Text Categorization (Cavnar & Trenkle, 1994) paper. The model is trained using the character distribution of a specific training corpus. Most libraries limit themselves to using 3-grams (trigrams), which work well for identifying the language of longer text fragments composed of multiple sentences. However, when it comes to short phrases or single words, trigrams fall short. The shorter the text, the fewer N-grams are available, making the probability estimates less reliable. To address this, Lingua employs N-grams ranging in size from 1 to 5, significantly improving the accuracy of language prediction. In low accuracy mode, only N-grams of size 3 are employed whereas high accuracy mode can use the full range of N-gram sizes from 1 to 5.
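To make the N-gram range concrete, here is a minimal illustration (not Lingua's internal code) of extracting character N-grams of sizes 1 to 5 from a single short word; a trigram-only model gets just one observation from it, whereas unigrams and bigrams contribute five more:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Slide a window of the given size across the text, yielding each character N-gram.
static IEnumerable<string> CharacterNgrams(string text, int size)
{
    for (var i = 0; i + size <= text.Length; i++)
        yield return text.Substring(i, size);
}

var word = "cat";
for (var size = 1; size <= 5; size++)
{
    var ngrams = CharacterNgrams(word, size).ToList();
    Console.WriteLine($"{size}-grams: {string.Join(", ", ngrams)}");
}

// Output:
// 1-grams: c, a, t
// 2-grams: ca, at
// 3-grams: cat
// 4-grams:
// 5-grams:
```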
In addition to a probabilistic N-gram model, Lingua also employs a rule engine. The engine first looks at the alphabet of the input text and searches for characters that are unique to specific languages. It does so by employing Unicode Script information and a lookup table of unique characters. For example, the character ß ("Eszett" or "sharp S") is unique to the German language. If a language can be unambiguously identified by the rule engine, the more computationally expensive probabilistic model doesn't need to be used.
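The following is a toy sketch of that rule-based idea, not Lingua's actual rule engine: if the input contains a character that only one of the configured languages uses, the statistical model can be skipped. The lookup table here contains only the ß example from above and would need to be far more complete in practice:

```csharp
using System;
using System.Collections.Generic;
using Lingua;
using static Lingua.Language;

// Deliberately tiny, illustrative table of characters unique to one language.
var uniqueCharacters = new Dictionary<char, Language>
{
    ['ß'] = German // "Eszett" / "sharp S"
};

Language? DetectByUniqueCharacter(string text)
{
    foreach (var c in text)
        if (uniqueCharacters.TryGetValue(c, out var language))
            return language;

    return null; // no unambiguous match, fall back to the probabilistic model
}

Console.WriteLine(DetectByUniqueCharacter("straße"));         // German
Console.WriteLine(DetectByUniqueCharacter("street") is null); // True
```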
Benchmarks
A language detection library needs to be both accurate and fast. The source repository contains a benchmarking suite to compare latency and allocations using BenchmarkDotNet, and the latest benchmarks ⚡ are available in the repository.
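For reference, a benchmark in that style might look roughly like the following sketch; it is not the repository's actual benchmark suite, and the detector configuration is purely illustrative:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Lingua;
using static Lingua.Language;

BenchmarkRunner.Run<DetectionBenchmarks>();

[MemoryDiagnoser]
[ShortRunJob]
public class DetectionBenchmarks
{
    private readonly LanguageDetector _detector = LanguageDetectorBuilder
        .FromLanguages(English, French, German, Spanish)
        .WithPreloadedLanguageModels()
        .Build();

    // Illustrative inputs taken from the results table below.
    [Params("suspiciously", "outlining issues")]
    public string Text { get; set; } = string.Empty;

    [Benchmark]
    public Language Detect() => _detector.DetectLanguageOf(Text);
}
```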
Running the English single word, word pairs and sentence benchmarks on the following local setup

```
BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3958/23H2/2023Update/SunValley3)
13th Gen Intel Core i9-13900K, 1 CPU, 32 logical and 24 physical cores
.NET SDK 8.0.303
  [Host]   : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2
  ShortRun : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1
WarmupCount=3
```

yields the following results:
Benchmark | Method | Text | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|
English Single Word | LinguaLowAccuracy | suspiciously | 14.54 μs | 0.473 μs | 0.026 μs | 1.00 | 0.00 | 2.3651 | 0.0153 | - | 43.6 KB | 1.00 |
English Single Word | Lingua | suspiciously | 23.08 μs | 1.860 μs | 0.102 μs | 1.59 | 0.01 | 3.6316 | 0.0916 | - | 66.49 KB | 1.53 |
English Single Word | LanguageDetection | suspiciously | 95.41 μs | 16.064 μs | 0.881 μs | 6.56 | 0.07 | 1.9531 | 1.5869 | 1.2207 | 249.73 KB | 5.73 |
English Single Word | NTextCat | suspiciously | 16.29 μs | 0.279 μs | 0.015 μs | 1.12 | 0.00 | 1.6479 | 0.0610 | - | 30.46 KB | 0.70 |
English Word Pairs | LinguaLowAccuracy | outlining issues | 18.28 μs | 2.251 μs | 0.123 μs | 1.00 | 0.00 | 2.5635 | - | - | 47.21 KB | 1.00 |
English Word Pairs | Lingua | outlining issues | 28.64 μs | 6.390 μs | 0.350 μs | 1.57 | 0.02 | 3.8147 | 0.0916 | - | 70.16 KB | 1.49 |
English Word Pairs | LanguageDetection | outlining issues | 91.06 μs | 24.040 μs | 1.318 μs | 4.98 | 0.08 | 2.0752 | 1.5869 | 1.2207 | 250.11 KB | 5.30 |
English Word Pairs | NTextCat | outlining issues | 22.12 μs | 2.612 μs | 0.143 μs | 1.21 | 0.00 | 1.7090 | 0.0610 | - | 31.91 KB | 0.68 |
English Sentence | LinguaLowAccuracy | On n(...)own. [125] | 77.73 μs | 8.632 μs | 0.473 μs | 1.00 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
English Sentence | Lingua | On n(...)own. [125] | 81.47 μs | 7.988 μs | 0.438 μs | 1.05 | 0.00 | 7.3242 | - | - | 136.46 KB | 1.00 |
English Sentence | LanguageDetection | On n(...)own. [125] | 105.56 μs | 31.676 μs | 1.736 μs | 1.36 | 0.03 | 2.1973 | 1.9531 | 0.9766 | 260.55 KB | 1.91 |
English Sentence | NTextCat | On n(...)own. [125] | 188.29 μs | 25.444 μs | 1.395 μs | 2.42 | 0.01 | 6.8359 | 0.7324 | - | 126.2 KB | 0.92 |
Each language detection library is configured with the same set of supported languages, and the input text in each benchmark has been verified to return the correct result for each implementation. This has been done to try to make the benchmark as fair as possible.
Lingua is generally the fastest in low accuracy mode, never touching Generation 2 (Gen2) of the managed heap and mostly allocating in Gen0. The NTextCat library allocates the least amount of memory, though it appears to get much slower as the length of the text increases. The LanguageDetection library is much slower and allocates more.
Lingua makes use of Span<T> and ref structs to limit allocations. With the proposal to allow an alternate lookup structure in .NET 9 for searching dictionaries and hash sets, it'll be possible to reduce allocations further by comparing N-grams parsed from input text as spans against the language model dictionaries held in memory, rather than allocating substrings for each N-gram size.
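As a rough sketch of what that could look like, assuming the proposed GetAlternateLookup API ships in .NET 9 as designed (this is not Lingua's code, and the trigram probabilities are made up):

```csharp
using System;
using System.Collections.Generic;

// A hypothetical slice of a language model: trigram -> probability.
var ngramProbabilities = new Dictionary<string, float>(StringComparer.Ordinal)
{
    ["lan"] = 0.012f,
    ["ang"] = 0.015f,
    ["ngu"] = 0.008f,
};

// The alternate lookup allows querying the dictionary with a ReadOnlySpan<char> key,
// so no substring needs to be allocated per N-gram.
var lookup = ngramProbabilities.GetAlternateLookup<ReadOnlySpan<char>>();

ReadOnlySpan<char> text = "languages";
for (var i = 0; i + 3 <= text.Length; i++)
{
    var trigram = text.Slice(i, 3);
    if (lookup.TryGetValue(trigram, out var probability))
        Console.WriteLine($"{trigram}: {probability}");
}
```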
Summary
SearchPioneer.Lingua offers a powerful, efficient, and accurate solution for language detection in .NET applications. We'll look to add new features in the future, such as identifying multiple languages in mixed-language texts, something the Rust implementation supports. We encourage you to explore the repository, try out the library in your own projects, and contribute to its continued development.
A special thank you to Peter for releasing the original implementations as open source 🙏.