langdetect
Language detection library for OCaml using n-gram frequency analysis.
This is an OCaml port of the Cybozu langdetect algorithm, ported from the implementation in https://github.com/validator/validator. It identifies the natural language of a text by comparing the text's n-grams against per-language n-gram frequency profiles.
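To make the idea of an n-gram frequency profile concrete, here is a minimal sketch that is independent of the library. It assumes character n-grams of length 1 to 3 (as in the original Cybozu implementation) and normalizes over all n-grams together, which is a simplification. The names `uchars_of_string`, `ngrams`, and `profile` are illustrative and are not part of the `Langdetect` API. It needs OCaml 4.14 or later for `String.get_utf_8_uchar`.

```ocaml
(* Decode a UTF-8 string into an array of Unicode code points.
   Requires OCaml >= 4.14 for String.get_utf_8_uchar. *)
let uchars_of_string (s : string) : Uchar.t array =
  let rec loop i acc =
    if i >= String.length s then Array.of_list (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      loop (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  loop 0 []

(* All contiguous n-grams of [n] code points, re-encoded as UTF-8 strings. *)
let ngrams (n : int) (s : string) : string list =
  let cps = uchars_of_string s in
  let len = Array.length cps in
  List.init (max 0 (len - n + 1)) (fun i ->
      let b = Buffer.create (4 * n) in
      for j = i to i + n - 1 do
        Buffer.add_utf_8_uchar b cps.(j)
      done;
      Buffer.contents b)

(* Relative frequency of every n-gram of length 1..3 in [s].
   Illustrative only: real language profiles are built from large corpora. *)
let profile (s : string) : (string, float) Hashtbl.t =
  let counts = Hashtbl.create 64 in
  let total = ref 0 in
  List.iter
    (fun n ->
      List.iter
        (fun g ->
          let c = Option.value (Hashtbl.find_opt counts g) ~default:0 in
          Hashtbl.replace counts g (c + 1);
          incr total)
        (ngrams n s))
    [ 1; 2; 3 ];
  let freqs = Hashtbl.create 64 in
  Hashtbl.iter
    (fun g c -> Hashtbl.replace freqs g (float_of_int c /. float_of_int !total))
    counts;
  freqs
```

Working on code points rather than bytes keeps multi-byte characters intact, which matters for non-Latin scripts such as Arabic or Japanese.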
Features
- Detects 49 languages including English, Chinese, Japanese, Arabic, and many European languages
- Fast probabilistic detection using n-gram frequency analysis
- Configurable detection parameters (smoothing, convergence thresholds)
- Reproducible results with optional random seed control
- Pure OCaml implementation with minimal dependencies
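The smoothing and convergence parameters mentioned above can be illustrated with a simplified naive-Bayes-style scorer. This is a sketch under stated assumptions, not the library's implementation: the `profile` type, the `classify` function, and the `alpha` and `conv_threshold` parameters with their defaults are all hypothetical. The original Cybozu algorithm also samples n-grams at random over several trials, which is presumably where the seed option comes in; this sketch simply walks the n-grams in order.

```ocaml
(* Hypothetical, simplified type: one language's n-gram relative frequencies. *)
type profile = { lang : string; freq : (string * float) list }

(* Rescale the scores in place so that they sum to 1. *)
let normalize (scores : float array) : unit =
  let sum = Array.fold_left ( +. ) 0.0 scores in
  if sum > 0.0 then Array.iteri (fun i p -> scores.(i) <- p /. sum) scores

(* Score [grams] against [profiles], starting from a uniform prior.
   [alpha] is an additive smoothing weight that keeps unseen n-grams from
   zeroing out a language. Iteration stops early once one language's
   probability exceeds [conv_threshold].
   Returns (language, probability) pairs, most probable first. *)
let classify ?(alpha = 0.001) ?(conv_threshold = 0.999)
    (profiles : profile list) (grams : string list) : (string * float) list =
  let profs = Array.of_list profiles in
  let k = Array.length profs in
  if k = 0 then []
  else begin
    let scores = Array.make k (1.0 /. float_of_int k) in
    let rec loop = function
      | [] -> ()
      | g :: rest ->
          Array.iteri
            (fun i p ->
              let f = Option.value (List.assoc_opt g p.freq) ~default:0.0 in
              scores.(i) <- scores.(i) *. (alpha +. f))
            profs;
          normalize scores;
          if Array.exists (fun p -> p > conv_threshold) scores then ()
          else loop rest
    in
    loop grams;
    Array.to_list (Array.mapi (fun i p -> (p.lang, scores.(i))) profs)
    |> List.sort (fun (_, a) (_, b) -> Float.compare b a)
  end
```

A larger `alpha` makes the scorer more forgiving of n-grams missing from a profile, while a lower `conv_threshold` stops earlier at the cost of certainty.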
Installation
opam install langdetect
Usage
(* Create a detector with all built-in profiles *)
let detector = Langdetect.create_default ()

(* Detect the best matching language *)
let () =
  match Langdetect.detect_best detector "Hello, world!" with
  | Some lang -> Printf.printf "Detected: %s\n" lang
  | None -> print_endline "Could not detect language"

(* Get all possible languages with probabilities *)
let () =
  let results = Langdetect.detect detector "Bonjour le monde" in
  List.iter
    (fun r -> Printf.printf "%s: %.2f\n" r.Langdetect.lang r.Langdetect.prob)
    results

(* Use custom configuration *)
let config = { Langdetect.default_config with prob_threshold = 0.3 }
let detector = Langdetect.create_default ~config ()
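When a caller wants a single answer only if the detector is reasonably sure, the list returned by `detect` can be post-processed. The sketch below is one way to do that. It uses a stand-in `result` record so that it compiles without the library; the field names `lang` and `prob` follow the usage example above, while the type name and the `confident_best` helper are assumptions and not part of the `Langdetect` API.

```ocaml
(* Stand-in for the library's result record, so the sketch compiles on its
   own. The field names follow the usage example; the type name is assumed. *)
type result = { lang : string; prob : float }

(* Return the most probable language, but only if its probability reaches
   [min_prob]; otherwise report that there is no confident match. *)
let confident_best ~(min_prob : float) (results : result list) : string option =
  let best =
    List.fold_left
      (fun acc r ->
        match acc with
        | Some b when b.prob >= r.prob -> acc
        | _ -> Some r)
      None results
  in
  match best with
  | Some r when r.prob >= min_prob -> Some r.lang
  | _ -> None
```

With the real library, the same logic would apply to the output of `Langdetect.detect detector text`, accessing fields as `r.Langdetect.lang` and `r.Langdetect.prob`.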
Supported Languages
Arabic, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malayalam, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Sinhalese, Albanian, Spanish, Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Chinese (Simplified), Chinese (Traditional).
License
MIT License - see LICENSE file for details.
Based on the Cybozu langdetect algorithm. Copyright (c) 2007-2016 Mozilla Foundation and 2025 Anil Madhavapeddy.