detectCharset function
- Uint8List bytes
Detects the character encoding of the given byte sequence.
Returns a lowercase IANA encoding label string. The following labels may be returned:
- BOM-detected: 'utf-8', 'utf-16be', 'utf-16le', 'utf-32be', 'utf-32le'
- UTF-8 structural: 'utf-8'
- Legacy 8-bit: 'windows-1252', 'iso-8859-1', 'iso-8859-2', 'iso-8859-15'
- CJK multi-byte: 'shift-jis', 'euc-jp', 'euc-kr', 'gbk'
- Fallback: 'windows-1252'
Detection proceeds through three ordered stages, falling through only when the current stage cannot make a determination:
Stage 1 — BOM inspection (deterministic)
A byte-order mark (BOM), when present, is authoritative. The four-byte
UTF-32 BOMs are checked before the two-byte UTF-16 BOMs to prevent
UTF-32 LE from being misidentified as UTF-16 LE (they share the same
first two bytes: FF FE).
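The ordering described above can be sketched as follows. This is a minimal illustration, not the package's actual _detectBom; the name detectBomSketch and the position of the UTF-8 check between the two groups are assumptions made for the example.

```dart
import 'dart:typed_data';

/// Hypothetical helper: returns the label for a leading BOM, or null if
/// the input does not start with one.
String? detectBomSketch(Uint8List bytes) {
  // True when `bytes` begins with every byte of `prefix`.
  bool startsWith(List<int> prefix) {
    if (bytes.length < prefix.length) return false;
    for (var i = 0; i < prefix.length; i++) {
      if (bytes[i] != prefix[i]) return false;
    }
    return true;
  }

  // Four-byte UTF-32 BOMs first: FF FE 00 00 begins with the UTF-16 LE BOM.
  if (startsWith(const [0xFF, 0xFE, 0x00, 0x00])) return 'utf-32le';
  if (startsWith(const [0x00, 0x00, 0xFE, 0xFF])) return 'utf-32be';
  // Three-byte UTF-8 BOM (placement here is an assumption).
  if (startsWith(const [0xEF, 0xBB, 0xBF])) return 'utf-8';
  // Two-byte UTF-16 BOMs last.
  if (startsWith(const [0xFF, 0xFE])) return 'utf-16le';
  if (startsWith(const [0xFE, 0xFF])) return 'utf-16be';
  return null;
}
```

With this ordering, FF FE 00 00 yields 'utf-32le' while FF FE 41 00 yields 'utf-16le'. One consequence is that UTF-16 LE text whose first character is U+0000 would also be reported as UTF-32 LE; such input is rare in practice.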
Stage 2 — UTF-8 structural validation
A leading 8 KB sample of the input is decoded with
utf8.decode(allowMalformed: false). A successful decode means the
content is UTF-8 (or pure ASCII, which is a strict UTF-8 subset). Empty
input passes this stage and returns 'utf-8'.
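A strict decode of this kind might look like the sketch below. The name isValidUtf8Sketch is hypothetical, and the package's own _isValidUtf8 may differ in detail.

```dart
import 'dart:convert';
import 'dart:typed_data';

/// Hypothetical helper: true when `sample` is well-formed UTF-8.
/// Empty input and pure ASCII both decode without error.
bool isValidUtf8Sketch(Uint8List sample) {
  try {
    // With allowMalformed: false, malformed or truncated sequences
    // raise a FormatException instead of producing U+FFFD.
    utf8.decode(sample, allowMalformed: false);
    return true;
  } on FormatException {
    return false;
  }
}
```

Because a strict decode also rejects an incomplete trailing sequence, a sample cut at a fixed byte offset could in principle split a multi-byte character and fail this check even though the full input is valid UTF-8. Whether the package guards against that case is not stated here.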
Stage 3 — Candidate probe via the charset package
The sample is tested against each candidate Encoding using the static
Charset.canDecode method. CJK encodings are promoted when more than 15%
of sample bytes are ≥ 0x80. The first candidate to successfully decode
the sample wins. If no candidate matches, 'windows-1252' is returned as
the fallback (following the WHATWG Encoding specification default).
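The promotion rule can be illustrated with a small sketch that only computes the candidate order. The names highByteRatio and orderCandidates are hypothetical, and the order of labels within each group is an assumption; only the 15% threshold comes from the description above.

```dart
import 'dart:typed_data';

/// Threshold from the description above: CJK candidates are promoted
/// when more than 15% of sample bytes are >= 0x80.
const double highByteThreshold = 0.15;

/// Fraction of bytes in `sample` with the high bit set (0.0 for empty input).
double highByteRatio(Uint8List sample) {
  if (sample.isEmpty) return 0.0;
  var high = 0;
  for (final b in sample) {
    if (b >= 0x80) high++;
  }
  return high / sample.length;
}

/// Returns candidate labels in probe order. The order inside each group
/// is an assumption made for this sketch.
List<String> orderCandidates(Uint8List sample) {
  const cjk = ['shift-jis', 'euc-jp', 'euc-kr', 'gbk'];
  const legacy = ['windows-1252', 'iso-8859-1', 'iso-8859-2', 'iso-8859-15'];
  return highByteRatio(sample) > highByteThreshold
      ? [...cjk, ...legacy]
      : [...legacy, ...cjk];
}
```

A sample made up mostly of high bytes, as is typical of CJK text, puts the multi-byte candidates first. A mostly ASCII sample with an occasional accented letter keeps the 8-bit candidates first.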
Note: Charset.canDecode considers a decode invalid if the resulting
string contains the Unicode replacement character U+FFFD ('?'). Input
that legitimately contains U+FFFD will therefore be rejected by every
non-UTF codec regardless of its actual encoding. This is a known
limitation of the structural validity approach.
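The limitation can be reproduced with a stand-in check built only on dart:convert. The function canDecodeSketch is hypothetical and is not the charset package's Charset.canDecode. Because dart:convert ships no legacy multi-byte codecs, a lenient Utf8Codec is used here purely to produce U+FFFD.

```dart
import 'dart:convert';

/// Hypothetical stand-in for a "can this codec decode these bytes" check:
/// a decode counts as invalid if it throws or if the result contains
/// the replacement character U+FFFD.
bool canDecodeSketch(Encoding encoding, List<int> bytes) {
  try {
    return !encoding.decode(bytes).contains('\uFFFD');
  } on FormatException {
    return false;
  }
}
```

With const Utf8Codec(allowMalformed: true), the malformed bytes [0x68, 0xFF] are rejected because the decoder substitutes U+FFFD. The bytes [0xEF, 0xBF, 0xBD] are rejected as well, even though they are a well-formed encoding of U+FFFD itself. A check of this kind cannot tell the two cases apart.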
Example (web-safe — works on all platforms):
    import 'dart:typed_data';
    import 'package:betto_charset_detector/betto_charset_detector.dart';

    void main() {
      // Bytes may come from an HTTP response, file picker, dart:io, etc.
      final bytes = Uint8List.fromList([0xEF, 0xBB, 0xBF, 104, 101, 108, 108, 111]);
      final encoding = detectCharset(bytes);
      print('Detected encoding: $encoding'); // utf-8
    }
On native platforms only, you can read bytes from a file:
    import 'dart:io';
    import 'package:betto_charset_detector/betto_charset_detector.dart';

    void main() {
      final bytes = File('data.csv').readAsBytesSync();
      final encoding = detectCharset(bytes);
      print('Detected encoding: $encoding');
    }
Implementation
    String detectCharset(Uint8List bytes) {
      // Extract a leading sample to limit memory usage.
      // The sample cap is applied first so that all subsequent stages operate
      // on the same bounded input.
      final sample = bytes.length > _sampleSize
          ? Uint8List.sublistView(bytes, 0, _sampleSize)
          : bytes;

      // Stage 1: BOM inspection.
      final bomResult = _detectBom(sample);
      if (bomResult != null) {
        return bomResult;
      }

      // Stage 2: UTF-8 structural validation.
      if (_isValidUtf8(sample)) {
        return 'utf-8';
      }

      // Stage 3: Candidate probe.
      return _probeEncoding(sample);
    }