Detect which human language a document uses, in OCaml. Ported from the Nu Html validator.

langdetect

Language detection library for OCaml using n-gram frequency analysis.

This is an OCaml port of the Cybozu langdetect algorithm, based on the implementation in https://github.com/validator/validator. It identifies the natural language of a text by scoring the text's n-grams against per-language n-gram frequency profiles.
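
To make the idea of an n-gram profile concrete, here is a minimal sketch of extracting n-grams from UTF-8 text and counting them. It assumes n-grams are taken over Unicode code points and needs OCaml 4.14 or later for the standard library's UTF-8 decoder. The function names (`code_points`, `ngrams`, `frequencies`) are illustrative and are not part of the langdetect API.

```ocaml
(* Decode a UTF-8 string into its Unicode code points (OCaml >= 4.14).
   Invalid bytes decode to U+FFFD, so the loop always makes progress. *)
let code_points (s : string) : Uchar.t array =
  let rec go i acc =
    if i >= String.length s then Array.of_list (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* Re-encode the code points in [start, start + n) as a UTF-8 string. *)
let encode (cps : Uchar.t array) (start : int) (n : int) : string =
  let b = Buffer.create (4 * n) in
  for i = start to start + n - 1 do
    Buffer.add_utf_8_uchar b cps.(i)
  done;
  Buffer.contents b

(* All n-grams of exactly [n] code points, in order of appearance. *)
let ngrams (n : int) (s : string) : string list =
  if n <= 0 then invalid_arg "ngrams: n must be positive";
  let cps = code_points s in
  let count = Array.length cps - n + 1 in
  List.init (max 0 count) (fun i -> encode cps i n)

(* Count how often each n-gram occurs. *)
let frequencies (grams : string list) : (string, int) Hashtbl.t =
  let tbl = Hashtbl.create 64 in
  List.iter
    (fun g ->
      let c = Option.value (Hashtbl.find_opt tbl g) ~default:0 in
      Hashtbl.replace tbl g (c + 1))
    grams;
  tbl
```

Working on code points rather than bytes keeps multi-byte characters, such as those in Japanese or Arabic text, intact inside each n-gram.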

Features

  • Detects 49 languages including English, Chinese, Japanese, Arabic, and many European languages
  • Fast probabilistic detection using n-gram frequency analysis
  • Configurable detection parameters (smoothing, convergence thresholds)
  • Reproducible results with optional random seed control
  • Pure OCaml implementation with minimal dependencies
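
The sketch below shows one way smoothing, a convergence threshold, and a random seed can fit together in a probabilistic detector. It is a simplified illustration, not the library's implementation: the `profile` type and the parameter names `alpha`, `conv_threshold`, `max_iter`, and `seed` are assumptions made for this example. It draws n-grams at random using a seeded generator, multiplies each language's score by the smoothed n-gram frequency, renormalizes, and stops once one language dominates.

```ocaml
(* Toy language profile: n-gram -> relative frequency in that language.
   Hypothetical type, not the library's internal representation. *)
type profile = { lang : string; freq : (string * float) list }

(* Rescale scores in place so they sum to 1. *)
let normalize (ps : float array) : unit =
  let total = Array.fold_left ( +. ) 0.0 ps in
  if total > 0.0 then Array.iteri (fun i p -> ps.(i) <- p /. total) ps

(* Score [grams] against [profiles]; returns (language, probability)
   pairs, most probable first. All optional parameters are illustrative. *)
let score ?(alpha = 1e-5) ?(conv_threshold = 0.99999) ?(max_iter = 1000)
    ?(seed = 0) (profiles : profile list) (grams : string list) =
  let profs = Array.of_list profiles in
  let grams = Array.of_list grams in
  let n = Array.length profs in
  (* Start from a uniform prior over the candidate languages. *)
  let probs = Array.make n (1.0 /. float_of_int (max 1 n)) in
  (* A fixed seed makes the random sampling, and so the result, repeatable. *)
  let rng = Random.State.make [| seed |] in
  let converged () = Array.exists (fun p -> p > conv_threshold) probs in
  let i = ref 0 in
  while Array.length grams > 0 && !i < max_iter && not (converged ()) do
    let g = grams.(Random.State.int rng (Array.length grams)) in
    Array.iteri
      (fun k p ->
        let f = Option.value (List.assoc_opt g p.freq) ~default:0.0 in
        (* Additive smoothing: an unseen n-gram never zeroes a language. *)
        probs.(k) <- probs.(k) *. (f +. alpha))
      profs;
    normalize probs;
    incr i
  done;
  Array.mapi (fun k p -> (p.lang, probs.(k))) profs
  |> Array.to_list
  |> List.sort (fun (_, a) (_, b) -> compare b a)
```

Calling `score` twice with the same `seed` gives identical results, which is the property the reproducibility feature refers to. Renormalizing after every step also keeps the running products from underflowing on long inputs.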

Installation

opam install langdetect

Usage

(* Create a detector with all built-in profiles *)
let detector = Langdetect.create_default ()

(* Detect the best matching language *)
let () =
  match Langdetect.detect_best detector "Hello, world!" with
  | Some lang -> Printf.printf "Detected: %s\n" lang
  | None -> print_endline "Could not detect language"

(* Get all possible languages with probabilities *)
let () =
  let results = Langdetect.detect detector "Bonjour le monde" in
  List.iter (fun r ->
    Printf.printf "%s: %.2f\n" r.Langdetect.lang r.Langdetect.prob
  ) results

(* Use custom configuration *)
let config = { Langdetect.default_config with prob_threshold = 0.3 }
let detector = Langdetect.create_default ~config ()
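
Short or mixed-language inputs can produce several low-probability candidates, so callers may want a fallback. The following sketch post-processes a result list; it uses a local `detection` record that mirrors only the `lang` and `prob` fields shown above, so it runs without the library. The helper names `confident` and `language_or`, and the `"und"` fallback code, are hypothetical choices for this example.

```ocaml
(* Local stand-in for the records returned by Langdetect.detect. Only the
   fields lang and prob appear in this README; the real type may differ. *)
type detection = { lang : string; prob : float }

(* Keep results at or above [min_prob], most probable first. *)
let confident ?(min_prob = 0.5) (results : detection list) : detection list =
  results
  |> List.filter (fun r -> r.prob >= min_prob)
  |> List.sort (fun a b -> compare b.prob a.prob)

(* Return the best language code, or [fallback] if none is confident. *)
let language_or ~(fallback : string) ?min_prob (results : detection list) =
  match confident ?min_prob results with
  | r :: _ -> r.lang
  | [] -> fallback

let () =
  let results = [ { lang = "en"; prob = 0.2 }; { lang = "fr"; prob = 0.7 } ] in
  print_endline (language_or ~fallback:"und" results)
```

With the real library, the same functions could be applied after mapping each returned record to a `detection` value.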

Supported Languages

Arabic, Bengali, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malayalam, Norwegian, Panjabi, Polish, Portuguese, Romanian, Russian, Sinhalese, Albanian, Spanish, Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Chinese (Simplified), Chinese (Traditional).

License

MIT License - see LICENSE file for details.

Based on the Cybozu langdetect algorithm. Copyright (c) 2007-2016 Mozilla Foundation and 2025 Anil Madhavapeddy.