In a modern world full of AI, detecting the language of a text sounds like a simple task. But in applied AI, it still matters a lot. And because of that, we naturally want a simple, compact library that can detect language reliably in any kind of application. By any kind, I mean literally any: a server app, a desktop app, a mobile app, or even a web app.
Why CLD2 Worked For Us
When we ran into this problem a long time ago at Spark, we decided we needed a cross-platform language detection library that could run on the client side on every platform. We chose CLD2 (Compact Language Detector 2), developed by Google. It is a fairly old Apache 2.0 C++ library based on n-grams, and it does exactly what its name promises: it is small, has zero dependencies, and embeds all data directly in the source code - just compile and run. We liked it, wrapped it in Swift, and started using it in our apps.
Why Modern Alternatives Stopped Feeling Compact
But over time, I kept wondering why we were still using such an old library, especially because I knew there were more modern options that performed better. Some Spark users had complained that detection was sometimes inaccurate, especially for short messages. CLD3 from Google and FastText from Facebook both do a better job in terms of accuracy. In most benchmarks and quality comparisons, they outperform CLD2. But here is the real problem: they are no longer compact. At least not in the way I understand that word.
CLD3 also became part of the Chromium open-source project. Its model weights are actually embedded in the source code, much like CLD2. But the codebase depends on Protocol Buffers for its internal framework - configuration parsing, feature extraction descriptors, and data structures all use protobuf types. You might ask: “Why would a language detector need Protocol Buffers?” The answer is that CLD3 inherited Chromium’s infrastructure. Everything there already uses C++ and Protocol Buffers, so it made sense internally. But from the point of view of someone consuming the library outside Chromium, this heavy dependency makes it much less practical in the kind of environment where we used CLD2.
FastText from Facebook takes a different approach. It ships the model as a separate binary file (lid.176.ftz), and you need the FastText library to load and run it. Great for research, but not exactly “drop into your project and compile.”
If CLD stands for Compact Language Detector, and its successors are no longer compact, then the fix seems obvious: make them compact again. Compact Compact.
What I Built
Let me introduce two libraries: Compact FastText and CCLD3.
Both libraries are written in pure C. They have zero dependencies and no external files, with all weights embedded directly in the source. For the build system, I provide CMake and my beloved SPM, but you can integrate them however you like.
In other words, “compact” here means four things:
- pure C
- zero dependencies
- no external model files
- weights embedded directly in the source
Quick Comparison
Here is a side-by-side summary of the tradeoffs:
| | CLD2 | Compact FastText | CCLD3 |
|---|---|---|---|
| Languages | 80+ | 176 | 109 |
| Model data | ~8.5 MB | ~2.6 MB | ~1.4 MB |
| Approach | N-gram lookup tables | PQ embeddings + HS tree | Neural network (quantized) |
| Validated against | — | Official Python ✅ | Official C++ ✅ |
| Agreement | — | 100% / 3000 samples | 100% / 3000 samples |
An important thing to note: we did not shrink or re-quantize the models ourselves. The weights are used exactly as the original authors published them. CLD3’s embeddings were already uint8-quantized by Google in the original source code. FastText’s ftz was already product-quantized by Facebook (that is how they compressed the model from 125 MB down to 917 KB). We just wrote the inference engine in plain C and converted the original weights into C arrays. That is why we can validate against the reference implementations and get 100% agreement - the math is the same, the weights are the same, only the packaging changed.
Bottom line: both projects preserve the original weights and behavior while packaging them as dependency-free C libraries.
Live Demo
To show how compact these libraries have become, all three compile into a single 2.8 MB WebAssembly binary. Here is a live comparison:
There is a known issue with CLD3 on short text. It is generally recommended for inputs longer than 150 characters, or to use it together with CLD2 as a companion.
I deliberately don’t want to compare accuracy here - there are plenty of articles and benchmarks about that already. We’re focused on evaluating these libraries against our own data and our own pain points at Spark, and we’re still deciding which one we’ll ultimately switch to. But the hardest part is done: both CLD3 and FastText have been compacted back to what a language detection library should be - a single dependency you drop into your project and forget about.
P.S. There is a separate page just for language detection where you can try all three libraries live in WebAssembly.