Package groovy.util
Class CharsetToolkit
java.lang.Object
groovy.util.CharsetToolkit
public class CharsetToolkit
extends java.lang.Object
Utility class to guess the encoding of a given text file.
Unicode files encoded in UTF-16 (little or big endian) or UTF-8 files with a Byte Order Marker (BOM) are correctly discovered. For UTF-8 files with no BOM, the charset should also be discovered if the buffer is wide enough.
A 4 KB byte buffer is used to guess the encoding.
Usage:
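import groovy.util.CharsetToolkit;
import java.io.BufferedReader;
import java.nio.charset.Charset;

// file is a java.io.File pointing at the text file to inspect
CharsetToolkit toolkit = new CharsetToolkit(file);
// guess the encoding
Charset guessedCharset = toolkit.getCharset();
// create a reader with the correct charset
BufferedReader reader = toolkit.getReader();
// read the file content
String line;
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}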
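The class description says UTF-8 without a BOM can be discovered when the buffer is wide enough. The sketch below shows one way such a check can be done with the JDK's strict decoder. It is an illustration only, not the algorithm `CharsetToolkit` uses; the class `Utf8Sniffer` and its method name are hypothetical. One assumption to note: a multi-byte sequence cut off at the end of the buffer is reported as malformed here, which a real implementation reading a fixed-size buffer would need to handle.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of groovy.util.CharsetToolkit.
class Utf8Sniffer {

    /** Returns true if the whole buffer decodes as well-formed UTF-8. */
    static boolean looksLikeUtf8(byte[] buffer) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail instead of silently substituting replacement characters.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buffer));
            return true;
        } catch (CharacterCodingException e) {
            // Some byte sequence is not valid UTF-8 (or is truncated at the end).
            return false;
        }
    }
}
```

Pure US-ASCII input also passes this check, since ASCII is a subset of UTF-8, so a caller would still need to distinguish the two cases.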
Constructor Summary
Constructor | Description
CharsetToolkit(java.io.File file) | Constructor of the CharsetToolkit utility class.
Method Summary
Modifier and Type | Method | Description
static java.nio.charset.Charset[] | getAvailableCharsets() | Retrieves all the available Charsets on the platform, among which the default charset.
java.nio.charset.Charset | getCharset() |
java.nio.charset.Charset | getDefaultCharset() | Retrieves the default Charset.
static java.nio.charset.Charset | getDefaultSystemCharset() | Retrieves the default charset of the system.
boolean | getEnforce8Bit() | Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
java.io.BufferedReader | getReader() | Gets a BufferedReader (indeed a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered.
boolean | hasUTF16BEBom() | Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
boolean | hasUTF16LEBom() | Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
boolean | hasUTF8Bom() | Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors).
void | setDefaultCharset(java.nio.charset.Charset defaultCharset) | Defines the default Charset used in case the buffer represents an 8-bit Charset.
void | setEnforce8Bit(boolean enforce) | If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Details
CharsetToolkit
public CharsetToolkit(java.io.File file) throws java.io.IOException
Constructor of the CharsetToolkit utility class.
Parameters:
file - of which we want to know the encoding.
Throws:
java.io.IOException
Method Details
setDefaultCharset
public void setDefaultCharset(java.nio.charset.Charset defaultCharset)
Defines the default Charset used in case the buffer represents an 8-bit Charset.
Parameters:
defaultCharset - the default Charset to be returned if an 8-bit Charset is encountered.
getCharset
public java.nio.charset.Charset getCharset()
setEnforce8Bit
public void setEnforce8Bit(boolean enforce)
If US-ASCII is recognized, forces the default encoding to be returned rather than US-ASCII. The file may simply contain no special characters in the range 128-255 yet still be, or later become, a file encoded with the default charset rather than US-ASCII.
Parameters:
enforce - a boolean specifying whether the default charset is returned in place of US-ASCII.
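The effect of the flag, as described above, can be pictured with a small decision function. This is a sketch of the documented behaviour only, not the class's source; `Enforce8BitSketch` and `charsetForAsciiBuffer` are hypothetical names, and the function assumes the caller has already established that the buffer contains only bytes in the range 0-127.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of groovy.util.CharsetToolkit.
class Enforce8BitSketch {

    /**
     * Picks the charset to report for a buffer that contains only 7-bit bytes.
     * With the flag set, the configured default is reported instead of US-ASCII.
     */
    static Charset charsetForAsciiBuffer(boolean enforce8Bit, Charset defaultCharset) {
        return enforce8Bit ? defaultCharset : StandardCharsets.US_ASCII;
    }
}
```

For example, with ISO-8859-1 as the default charset, a plain ASCII file would be reported as ISO-8859-1 when the flag is true and as US-ASCII when it is false.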
getEnforce8Bit
public boolean getEnforce8Bit()
Gets the enforce8Bit flag, in case we do not want to ever get a US-ASCII encoding.
Returns:
a boolean representing the flag of use of US-ASCII.
getDefaultCharset
public java.nio.charset.Charset getDefaultCharset()
Retrieves the default Charset.
getDefaultSystemCharset
public static java.nio.charset.Charset getDefaultSystemCharset()
Retrieves the default charset of the system.
Returns:
the default Charset.
hasUTF8Bom
public boolean hasUTF8Bom()
Has a Byte Order Marker for UTF-8 (used by Microsoft's Notepad and other editors).
Returns:
true if the buffer has a BOM for UTF-8.
hasUTF16LEBom
public boolean hasUTF16LEBom()
Has a Byte Order Marker for UTF-16 Little Endian (ucs-2le, ucs-4le, and ucs-16le).
Returns:
true if the buffer has a BOM for UTF-16 Little Endian.
hasUTF16BEBom
public boolean hasUTF16BEBom()
Has a Byte Order Marker for UTF-16 Big Endian (utf-16 and ucs-2).
Returns:
true if the buffer has a BOM for UTF-16 Big Endian.
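The three BOM checks above come down to comparing the first bytes of the buffer against the standard byte order marks: EF BB BF for UTF-8, FF FE for UTF-16 little endian, and FE FF for UTF-16 big endian. The following is a minimal standalone sketch of those comparisons. `BomSniffer` and its method names are hypothetical, and it operates on a byte array passed in by the caller rather than on the toolkit's internal buffer.

```java
// Hypothetical helper; not part of groovy.util.CharsetToolkit.
class BomSniffer {

    /** True if the buffer starts with the UTF-8 BOM: EF BB BF. */
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    /** True if the buffer starts with the UTF-16 little-endian BOM: FF FE. */
    static boolean hasUtf16LeBom(byte[] b) {
        // Note: the UTF-32 little-endian BOM (FF FE 00 00) also starts with these two bytes.
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFF
                && (b[1] & 0xFF) == 0xFE;
    }

    /** True if the buffer starts with the UTF-16 big-endian BOM: FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2
                && (b[0] & 0xFF) == 0xFE
                && (b[1] & 0xFF) == 0xFF;
    }
}
```

The `& 0xFF` masks are needed because Java bytes are signed, so a raw comparison such as `b[0] == 0xEF` would never be true.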
getReader
public java.io.BufferedReader getReader() throws java.io.FileNotFoundException
Gets a BufferedReader (indeed a LineNumberReader) from the File specified in the constructor of CharsetToolkit, using the charset discovered or the default charset if an 8-bit Charset is encountered.
Returns:
a BufferedReader
Throws:
java.io.FileNotFoundException - if the file is not found.
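For readers who want to see what such a reader looks like in plain JDK terms, the sketch below opens a `LineNumberReader` over a file with an explicitly chosen charset. It is an assumption about the general shape, not the toolkit's implementation; `ReaderSketch`, `openReader`, and `firstLineRoundTrip` are hypothetical names, and the second method exists only to exercise the first against a temporary file.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;

// Hypothetical helper; not part of groovy.util.CharsetToolkit.
class ReaderSketch {

    /** Opens a line-counting reader that decodes the file with the given charset. */
    static BufferedReader openReader(File file, Charset charset) throws FileNotFoundException {
        return new LineNumberReader(new InputStreamReader(new FileInputStream(file), charset));
    }

    /** Writes text to a temporary file in the given charset and reads its first line back. */
    static String firstLineRoundTrip(String text, Charset charset) {
        try {
            File tmp = File.createTempFile("charset-sketch", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), text.getBytes(charset));
            try (BufferedReader reader = openReader(tmp, charset)) {
                return reader.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Decoding with the wrong charset would not throw here; it would yield garbled characters, which is why guessing the charset before creating the reader matters.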
getAvailableCharsets
public static java.nio.charset.Charset[] getAvailableCharsets()
Retrieves all the available Charsets on the platform, among which the default charset.
Returns:
an array of Charsets.
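The JDK exposes the same information directly through `java.nio.charset.Charset`, which may help when comparing results. The sketch below builds an array of the platform's charsets from `Charset.availableCharsets()` and reads the JVM default with `Charset.defaultCharset()`. Whether `CharsetToolkit` delegates to these exact calls is an assumption; `AvailableCharsetsSketch` and its method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Collection;

// Hypothetical helper; not part of groovy.util.CharsetToolkit.
class AvailableCharsetsSketch {

    /** Returns every charset the running JVM supports, ordered by canonical name. */
    static Charset[] availableCharsets() {
        // availableCharsets() returns a sorted map from canonical name to Charset.
        Collection<Charset> all = Charset.availableCharsets().values();
        return all.toArray(new Charset[0]);
    }

    /** Returns the default charset of the running JVM. */
    static Charset systemDefault() {
        return Charset.defaultCharset();
    }
}
```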