detectCharset function

String detectCharset(Uint8List bytes)

Detects the character encoding of the given byte sequence.

Returns a lowercase IANA encoding label string. The following labels may be returned:

  • BOM-detected: 'utf-8', 'utf-16be', 'utf-16le', 'utf-32be', 'utf-32le'
  • UTF-8 structural: 'utf-8'
  • Legacy 8-bit: 'windows-1252', 'iso-8859-1', 'iso-8859-2', 'iso-8859-15'
  • CJK multi-byte: 'shift-jis', 'euc-jp', 'euc-kr', 'gbk'
  • Fallback: 'windows-1252'

Detection proceeds through three ordered stages, falling through only when the current stage cannot make a determination:

Stage 1 — BOM inspection (deterministic) A byte-order mark (BOM), when present, is authoritative. The four-byte UTF-32 BOMs are checked before the two-byte UTF-16 BOMs to prevent UTF-32 LE from being misidentified as UTF-16 LE (they share the same first two bytes: FF FE).
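The check order described for Stage 1 can be sketched as follows. This is a minimal illustration, not the package's private _detectBom helper: the name detectBomSketch and the prefix-matching approach are assumptions. The BOM byte values themselves are the standard Unicode ones.

```dart
import 'dart:typed_data';

/// Hypothetical helper: returns the label for a leading BOM, or null if
/// the input does not start with one.
String? detectBomSketch(Uint8List bytes) {
  bool startsWith(List<int> prefix) {
    if (bytes.length < prefix.length) return false;
    for (var i = 0; i < prefix.length; i++) {
      if (bytes[i] != prefix[i]) return false;
    }
    return true;
  }

  // Four-byte UTF-32 BOMs first: FF FE 00 00 begins with the UTF-16 LE BOM.
  if (startsWith(const [0x00, 0x00, 0xFE, 0xFF])) return 'utf-32be';
  if (startsWith(const [0xFF, 0xFE, 0x00, 0x00])) return 'utf-32le';
  if (startsWith(const [0xEF, 0xBB, 0xBF])) return 'utf-8';
  if (startsWith(const [0xFE, 0xFF])) return 'utf-16be';
  if (startsWith(const [0xFF, 0xFE])) return 'utf-16le';
  return null;
}
```

With this ordering, the bytes FF FE 00 00 yield 'utf-32le', while FF FE 41 00 yield 'utf-16le'. Swapping the UTF-32 and UTF-16 checks would report 'utf-16le' for both.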

Stage 2 — UTF-8 structural validation A leading 8 KB sample of the input is decoded with utf8.decode(allowMalformed: false). A successful decode means the content is UTF-8 (or pure ASCII, which is a strict UTF-8 subset). Empty input passes this stage and returns 'utf-8'.
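A strict decode of the kind Stage 2 describes might look like the sketch below. The name isValidUtf8Sketch is hypothetical and stands in for the private _isValidUtf8 helper, whose actual body is not shown on this page. Only utf8.decode from dart:convert is used.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical stand-in for the private _isValidUtf8 helper.
bool isValidUtf8Sketch(Uint8List sample) {
  try {
    // Strict mode: malformed sequences throw instead of becoming U+FFFD.
    utf8.decode(sample, allowMalformed: false);
    return true;
  } on FormatException {
    return false;
  }
}
```

Empty input and pure ASCII both return true here, matching the behavior described above. One thing to keep in mind with any fixed-size sample: a cut at a byte offset can land in the middle of a multi-byte sequence, and a strict decode treats that truncated tail as malformed. Whether the package trims the sample to a character boundary is not stated here.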

Stage 3 — Candidate probe via the charset package The sample is tested against each candidate Encoding using the static Charset.canDecode method. CJK encodings are promoted when more than 15% of sample bytes are ≥ 0x80. The first candidate to successfully decode the sample wins. If no candidate matches, 'windows-1252' is returned as the fallback (following the WHATWG Encoding specification default).
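The 15% promotion rule can be illustrated without touching the charset package at all. In the sketch below, the function names, the default ordering (legacy 8-bit labels first), and the reading of "promoted" as "moved to the front of the candidate list" are all assumptions. Only the threshold and the label sets come from the description above.

```dart
import 'dart:typed_data';

// Label sets taken from the list of possible return values.
const legacyLabels = ['windows-1252', 'iso-8859-1', 'iso-8859-2', 'iso-8859-15'];
const cjkLabels = ['shift-jis', 'euc-jp', 'euc-kr', 'gbk'];

/// Fraction of bytes in [sample] that are >= 0x80 (0.0 for empty input).
double highByteRatio(Uint8List sample) {
  if (sample.isEmpty) return 0.0;
  final high = sample.where((b) => b >= 0x80).length;
  return high / sample.length;
}

/// Hypothetical ordering step: CJK labels go first only when more than
/// 15% of the sample bytes are non-ASCII.
List<String> candidateOrder(Uint8List sample) {
  final cjkFirst = highByteRatio(sample) > 0.15;
  return cjkFirst
      ? [...cjkLabels, ...legacyLabels]
      : [...legacyLabels, ...cjkLabels];
}
```

A sample such as 82 A0 82 A2 (100% high bytes) would put 'shift-jis' first, while nine ASCII bytes followed by a single E9 (10% high bytes) would keep 'windows-1252' first. The intuition is that Western text in an 8-bit encoding is mostly ASCII with occasional accented letters, whereas CJK text is dominated by multi-byte sequences.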

Note: Charset.canDecode considers a decode invalid if the resulting string contains the Unicode replacement character U+FFFD ('?'). Input that legitimately contains U+FFFD will therefore be rejected by every non-UTF codec regardless of its actual encoding. This is a known limitation of the structural validity approach.
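The limitation is easy to reproduce with dart:convert alone. The helper below is a hypothetical approximation of a "no replacement character" validity check, not the charset package's canDecode implementation. A lenient Utf8Codec is used purely as a convenient codec that emits U+FFFD.

```dart
import 'dart:convert';

/// Hypothetical check: a decode counts as clean only if it neither throws
/// nor produces the replacement character U+FFFD.
bool decodesCleanlySketch(Encoding codec, List<int> bytes) {
  try {
    return !codec.decode(bytes).contains('\uFFFD');
  } on FormatException {
    return false;
  }
}
```

With const Utf8Codec(allowMalformed: true), the invalid byte FF decodes to U+FFFD and is rejected, which is the intended outcome. The bytes EF BF BD, however, are the valid UTF-8 encoding of U+FFFD itself, and the same check rejects them too: the check cannot distinguish a substituted character from one that was genuinely present in the input.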

Example (web-safe — works on all platforms):

import 'dart:typed_data';
import 'package:betto_charset_detector/betto_charset_detector.dart';

void main() {
  // Bytes may come from an HTTP response, file picker, dart:io, etc.
  final bytes = Uint8List.fromList([0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111]);
  final encoding = detectCharset(bytes);
  print('Detected encoding: $encoding'); // utf-8
}

On native platforms only, you can read bytes from a file:

import 'dart:io';
import 'package:betto_charset_detector/betto_charset_detector.dart';

void main() {
  final bytes = File('data.csv').readAsBytesSync();
  final encoding = detectCharset(bytes);
  print('Detected encoding: $encoding');
}

Implementation

String detectCharset(Uint8List bytes) {
  // Take a leading sample to bound the time and memory spent on detection.
  // sublistView creates a view over the same buffer, so no bytes are copied.
  // The cap is applied first so that all subsequent stages operate on the
  // same bounded input.
  final sample = bytes.length > _sampleSize
      ? Uint8List.sublistView(bytes, 0, _sampleSize)
      : bytes;

  // Stage 1: BOM inspection.
  final bomResult = _detectBom(sample);
  if (bomResult != null) {
    return bomResult;
  }

  // Stage 2: UTF-8 structural validation.
  if (_isValidUtf8(sample)) {
    return 'utf-8';
  }

  // Stage 3: Candidate probe.
  return _probeEncoding(sample);
}