Class ScriptUtils
This class is designed as a final utility class and cannot be instantiated.
It uses the ICU4J ArabicShaping and Transliterator classes for shaping and transliteration.
- Author:
- Chakib Daii
-
Field Summary
Fields (modifier and type, field, description):
- static final String ANSI_ESCAPE: ANSI escape sequence to clear the screen.
- static final String ARABIC_DIACRITICS_REGEX: Regular expression matching Arabic diacritic marks in Unicode.
- static final String ARABIC_LANGUAGE: Language code for Arabic.
- private static final char[] ARABIC_LETTERS: Arabic alphabet letters used for transliteration to Latin letters.
- static final Locale ARABIC_LOCALE: Locale instance representing Arabic language.
- static String CUSTOM_RULES: Custom transliteration rules defined as a multi-line string.
- static final ResourceBundle CUSTOM_RULES_BUNDLE: ResourceBundle loaded with custom transliteration rules for Arabic.
- CUSTOM_RULES_KEYS: A key set of custom transliteration rules for Arabic.
- static final String DEFAULT_ARABIC_LANGUAGE_COUNTRY: Default country code used in Arabic locale.
- ICU_RESERVED_WORDS: A set of reserved words used by the ICU (International Components for Unicode) transliteration and normalization APIs.
- private static final String IDENTIFIER_SPLIT_REGEX: Regular expression used to split identifiers into components based on transitions between uppercase letters, digits, and lowercase letters.
- static final String LATIN_ARABIC_TRANSLITERATION_ID: ICU Transliterator ID for Latin-to-Arabic and Arabic-to-Latin transliteration.
- private static final char[] LATIN_LETTERS: Latin uppercase letters used as transliteration equivalents for Arabic letters.
- static final String LTR_DIRECTION: Escape code to set Left-To-Right (LTR) text direction in compatible terminals.
- static ThreadLocal<com.ibm.icu.text.NumberFormat> NUMBER_FORMAT: A reusable NumberFormat instance configured for the Arabic locale.
- static final String RTL_DIRECTION: Escape code to set Right-To-Left (RTL) text direction in compatible terminals.
- private static Boolean SHOULD_RESHAPE: Cached flag indicating whether Arabic text reshaping should be applied for the current environment.
- TEXT_MATCHER_CACHE: Cache of precompiled Matcher instances for text processing, keyed by the input text string.
- static final Pattern TEXT_MULTILINE_PATTERN: Pattern to detect lines in multiline text, capturing line content and newline characters.
-
Constructor Summary
Constructors (modifier, constructor, description):
- private ScriptUtils(): Private constructor to prevent instantiation.
-
Method Summary
Methods (modifier and type, method, description):
- private static String addPadding(StringBuilder inputSb, int terminalWidth): Adds padding spaces to the given StringBuilder input to align the text to the specified terminal width.
- private static String addPadding(String input, int padding): Adds padding spaces to the left or right of the input to reach the specified padding length.
- static String applyBiFunction(String input, boolean print, ThrowingBiFunction<String, Boolean, String> function): Applies a bi-function to each line in the input text.
- static String applyFunction(String input, ThrowingFunction<String, String> function): Applies a function to each line in the input text.
- chunk: Splits the given string into consecutive substrings of the specified size.
- static boolean containsArabicLetters(String text): Checks if the given text contains any Arabic characters.
- static String convertArabicToLatinLetterByLetter: Converts an input string from Arabic characters and digits to their Latin and ASCII equivalents.
- private static String doPadText: Pads the input text to align within the terminal width, adjusting for overflow.
- private static String doPadText: Pads the input text to fit the specified terminal width, splitting it into multiple lines if necessary.
- private static StringBuilder doPadText(List<String> lines, String word, StringBuilder currentLine, int terminalWidth, boolean print): Splits a list of words into lines that fit the terminal width, adding padding if needed.
- private static String doShape: Performs Arabic shaping and bidirectional reordering on a single input line.
- getRawHexBytes(char[] charArray): Returns a list of pairs representing the Unicode code points (in hex) and characters from the given character array.
- getRawHexBytes(String text): Converts the given String into a list of pairs, where each pair contains the Unicode hexadecimal representation of a character and the character itself.
- private static Matcher getTextMatcher(String input): Retrieves a cached Matcher for the given input string using the TEXT_MULTILINE_PATTERN pattern.
- static boolean isArabicChar(int cp): Checks if the given Unicode code point belongs to the Arabic Unicode script.
- static boolean isArabicCharCp(int cp): Checks if the given Unicode code point is an Arabic character.
- static boolean isArabicIndicDigit(char ch): Checks whether a character is an Arabic-Indic digit (٠ to ٩).
- static boolean isArabicText(String text): Checks if the given text consists entirely of Arabic characters.
- static boolean isAsciiDigit(int ch): Checks whether a character is an ASCII digit (0-9).
- static boolean isLatinLetter(char ch): Checks whether a character is a Latin letter (A-Z or a-z).
- static boolean isMultiline(String input): Checks if the given input string contains multiple lines.
- static String numberToString(Number number): Converts a Number into a string using formatting rules, replacing the standard ASCII decimal separator with a comma (U+066C), and optionally converting ASCII digits (0–9) to Arabic-Indic digits (٠–٩).
- static String padText: Pads the input text to align it within the terminal width.
- parseRules(String rules): Parses a set of transformation rules from a string into a map.
- static String removeDiacritics(String text): Removes Arabic diacritic marks from the given Arabic text.
- static String shape: Applies Arabic shaping and bidirectional reordering to the input text.
- static boolean shouldReshape(): Determines whether Arabic text reshaping should be applied for the current runtime environment.
- splitIdentifier(String input): Splits an identifier string into constituent parts based on various naming conventions.
- static String transliterateScript(com.ibm.icu.text.Transliterator transliterator, boolean removeDiacritics, String word): Transliterates a single word using the given Transliterator.
- static String[] transliterateScript(String transliteratorID, boolean removeDiacritics, String customRules, String... text): Transliterates the given text(s) from Latin script to Arabic or vice versa, using the specified ICU Transliterator ID and optional custom rules.
- static String[] transliterateScript(String transliteratorID, String... text): Transliterates one or more strings using the specified transliterator ID.
- static String[] transliterateScript(String transliteratorID, String customRules, String... text): Transliterates one or more strings using the specified transliterator ID and custom rules.
- static String transliterateScriptLetterByLetter(String transliteratorID, String textInput): Transliterates the input text letter by letter using the specified transliterator ID.
- static String[] transliterateToArabicScript(boolean removeDiacritics, String... text): Transliterates one or more strings to Arabic script.
- static String[] transliterateToArabicScript(boolean removeDiacritics, String customRules, String... text): Transliterates one or more strings to Arabic script using provided custom rules.
- static String[] transliterateToArabicScript(String... text): Transliterates one or more strings to Arabic script.
- static String[] transliterateToArabicScript(String customRules, String... text): Transliterates one or more strings to Arabic script using the provided custom rules.
- static String[] transliterateToArabicScriptDefault(boolean removeDiacritics, String... text): Transliterates one or more strings to Arabic script using default custom rules.
- static String[] transliterateToArabicScriptDefault(String... text): Transliterates one or more strings to Arabic script using default custom rules.
- static String transliterateToArabicScriptLetterByLetter: Transliterates the given text to Arabic script letter by letter.
-
Field Details
-
RTL_DIRECTION
Escape code to set Right-To-Left (RTL) text direction in compatible terminals.
-
LTR_DIRECTION
Escape code to set Left-To-Right (LTR) text direction in compatible terminals.
-
ARABIC_DIACRITICS_REGEX
Regular expression matching Arabic diacritic marks in Unicode.
-
ANSI_ESCAPE
ANSI escape sequence to clear the screen.
-
LATIN_ARABIC_TRANSLITERATION_ID
ICU Transliterator ID for Latin-to-Arabic and Arabic-to-Latin transliteration.
-
ARABIC_LANGUAGE
Language code for Arabic.
-
DEFAULT_ARABIC_LANGUAGE_COUNTRY
Default country code used in Arabic locale.
-
ARABIC_LOCALE
Locale instance representing the Arabic language.
-
CUSTOM_RULES_BUNDLE
ResourceBundle loaded with custom transliteration rules for Arabic.
-
TEXT_MULTILINE_PATTERN
Pattern to detect lines in multiline text, capturing line content and newline characters.
-
CUSTOM_RULES_KEYS
A key set of custom transliteration rules for Arabic.
-
IDENTIFIER_SPLIT_REGEX
Regular expression used to split identifiers into components based on transitions between uppercase letters, digits, and lowercase letters.
For example:
- "JSONTo" → "JSON", "To"
- "userAccount" → "user", "Account"
- "IPv6" → "IPv", "6"
- "6Parser" → "6", "Parser"
-
ARABIC_LETTERS
private static final char[] ARABIC_LETTERS
Arabic alphabet letters used for transliteration to Latin letters.
The characters are mapped positionally (index by index) to uppercase Latin letters. This list includes 26 Arabic letters, from 'ا' to 'ه', and is intended for character-by-character mapping to the ASCII letters used as digits in positional numeral systems (e.g., base 11 to base 36).
Examples of mapping:
- 'ا' → 'A'
- 'ب' → 'B'
- 'ت' → 'C' ...
- 'ه' → 'Z'
-
LATIN_LETTERS
private static final char[] LATIN_LETTERS
Latin uppercase letters used as transliteration equivalents for Arabic letters.
Each letter corresponds to an Arabic letter by position in the ARABIC_LETTERS array. This mapping supports systems like base-36 encodings or custom symbolic notations using Arabic letters.
Examples of mapping:
- 'A' → 'ا'
- 'B' → 'ب'
- 'C' → 'ت' ...
- 'Z' → 'ه'
-
ICU_RESERVED_WORDS
A set of reserved words used by the ICU (International Components for Unicode) transliteration and normalization APIs. These words have special meaning in ICU transliteration rules and Unicode transformations.
Examples of usage contexts include:
- Transliteration rule syntax (e.g., "::NFD;" or "::Latin-ASCII;")
- Normalization forms (e.g., "NFC", "NFD", "NFKC", "NFKD")
- Unicode script and block identifiers (e.g., "Latin", "Greek", "Han")
- Keywords in rule definitions (e.g., "use", "import", "function")
This set can be used to:
- Validate user-defined transliteration rules
- Highlight or flag reserved words in editors or tools
- Prevent conflicts in custom ICU rule definitions
- See Also:
- Transliterator
- Normalizer2
- ICU Transliteration Guide
-
NUMBER_FORMAT
A reusable NumberFormat instance configured for the Arabic locale.
This formatter uses Arabic locale conventions for decimal and grouping separators, and may render numbers using Arabic-Indic digits (e.g., ٠١٢٣٤٥٦٧٨٩), depending on JVM settings and font support.
Note: NumberFormat instances are not thread-safe. The field is declared as a ThreadLocal, so each thread works with its own instance. If a formatter is shared across threads by other means, synchronize access or create a new instance via NumberFormat.getNumberInstance(ARABIC).
- See Also:
- Locale.forLanguageTag(String)
- NumberFormat.getNumberInstance(Locale)
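A minimal sketch of the per-thread formatter pattern described in the note. It uses the JDK's java.text.NumberFormat instead of the ICU4J class so that it runs without extra dependencies; the names ArabicNumberFormatSketch, FORMAT, and format are illustrative and not part of ScriptUtils.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class ArabicNumberFormatSketch {
    // One formatter per thread, because NumberFormat instances are not thread-safe.
    static final ThreadLocal<NumberFormat> FORMAT =
            ThreadLocal.withInitial(() -> NumberFormat.getNumberInstance(Locale.forLanguageTag("ar")));

    // Formats with the calling thread's own formatter instance.
    static String format(Number number) {
        return FORMAT.get().format(number);
    }

    public static void main(String[] args) {
        // The exact digits and separators depend on the JVM's locale data.
        System.out.println(format(1234.5));
    }
}
```

Each thread that calls format lazily creates its own formatter on first use, so no synchronization is needed.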
-
CUSTOM_RULES
Custom transliteration rules defined as a multi-line string. Each rule maps a Latin script sequence to its corresponding Arabic script sequence. For example, "com > كوم" transliterates "com" to Arabic "كوم".
-
TEXT_MATCHER_CACHE
Cache of Matcher instances for text processing, keyed by the input text string. Used to improve performance by reusing an existing matcher instead of creating a new one for input that has already been seen.
-
SHOULD_RESHAPE
Cached flag indicating whether Arabic text reshaping should be applied for the current environment.
-
-
Constructor Details
-
ScriptUtils
private ScriptUtils()
Private constructor to prevent instantiation. Always throws a NaftahBugError when called.
-
-
Method Details
-
parseRules
Parses a set of transformation rules from a string into a map.
The input string should contain one rule per line in the format:
source > target;
Each line:
- Is stripped of leading/trailing whitespace
- Is ignored if empty
- Has its trailing semicolon removed
- Is split on the first occurrence of the '>' character
Example input:
a > b;
c > d;
Will result in a map:
{ "a" -> "b", "c" -> "d" }
- Parameters:
rules - a string containing one or more transformation rules separated by newlines
- Returns:
- A map of source-to-target transformations
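The parsing steps listed above can be sketched as follows. This is an illustration written from the description, not the class's actual code: the class name RuleParserSketch, the choice of LinkedHashMap, the trimming of each side of the rule, and the skipping of lines without '>' are all assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RuleParserSketch {
    static Map<String, String> parseRules(String rules) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps rule order
        for (String rawLine : rules.split("\\R")) {          // one rule per line
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;                                    // ignore empty lines
            }
            if (line.endsWith(";")) {
                line = line.substring(0, line.length() - 1); // drop trailing semicolon
            }
            int arrow = line.indexOf('>');                   // first '>' only
            if (arrow < 0) {
                continue;                                    // assumption: skip malformed lines
            }
            result.put(line.substring(0, arrow).trim(), line.substring(arrow + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseRules("a > b;\nc > d;"));
    }
}
```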
-
isMultiline
Checks if the given input string contains multiple lines.
- Parameters:
input - the input string to check
- Returns:
- true if the input contains one or more newline characters; false otherwise
-
getTextMatcher
Retrieves a cached Matcher for the given input string using the TEXT_MULTILINE_PATTERN pattern. If a matcher for the input already exists in the cache, it is reset and returned; otherwise, a new matcher is created, cached, reset, and returned.
This caching mechanism improves performance by reusing matcher instances for repeated input strings.
- Parameters:
input - the input string to create or retrieve a matcher for
- Returns:
- a reset Matcher instance ready for matching against the input
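A minimal sketch of the cache-then-reset behaviour described above. The pattern shown is a hypothetical stand-in for TEXT_MULTILINE_PATTERN, whose real definition is not given here, and the names MatcherCacheSketch, LINE, CACHE, and matcherFor are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherCacheSketch {
    // Assumed pattern: a line's content followed by its terminator (or end of input).
    static final Pattern LINE = Pattern.compile("([^\\r\\n]*)(\\r\\n|\\r|\\n|\\z)");
    static final Map<String, Matcher> CACHE = new ConcurrentHashMap<>();

    static Matcher matcherFor(String input) {
        // Create the matcher only the first time this input is seen.
        Matcher matcher = CACHE.computeIfAbsent(input, LINE::matcher);
        // Rewind it so the caller always starts matching from the beginning.
        return matcher.reset();
    }
}
```

A Matcher is mutable and not thread-safe, so a shared cache like this is only safe when a given matcher is used by one thread at a time.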
-
applyBiFunction
public static String applyBiFunction(String input, boolean print, ThrowingBiFunction<String, Boolean, String> function)
Applies a bi-function to each line in the input text.
If the input is multiline, applies the function to each line individually, preserving line separators. Otherwise, applies the function once to the whole input.
- Parameters:
input - the input text (possibly multiline)
print - if true, the result is printed to the console; if false, the result is returned
function - a bi-function taking a line and the print flag, returning the processed line
- Returns:
- the processed text if print is false; otherwise, null
-
applyFunction
Applies a function to each line in the input text.
If the input is multiline, applies the function to each line individually, preserving line separators. Otherwise, applies the function once to the whole input.
- Parameters:
input - the input text (possibly multiline)
function - a function taking a line and returning the processed line
- Returns:
- the processed text with all lines processed by the function
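The per-line behaviour can be sketched like this. It is an illustration under stated assumptions: the line pattern is a guess at what TEXT_MULTILINE_PATTERN captures, a plain UnaryOperator replaces ThrowingFunction, carriage returns are treated as line breaks, and the names LineMapperSketch and applyPerLine are hypothetical.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LineMapperSketch {
    // Group 1: line content. Group 2: the line terminator, or end of input.
    static final Pattern LINE = Pattern.compile("([^\\r\\n]*)(\\r\\n|\\r|\\n|\\z)");

    static boolean isMultiline(String input) {
        return input.indexOf('\n') >= 0 || input.indexOf('\r') >= 0;
    }

    static String applyPerLine(String input, UnaryOperator<String> function) {
        if (!isMultiline(input)) {
            return function.apply(input);       // single line: apply once
        }
        StringBuilder out = new StringBuilder();
        Matcher matcher = LINE.matcher(input);
        while (matcher.find()) {
            if (matcher.end() == matcher.start()) {
                break;                          // empty match: end of input reached
            }
            // Transform the content, then re-append the original separator.
            out.append(function.apply(matcher.group(1))).append(matcher.group(2));
        }
        return out.toString();
    }
}
```

Because the separator is captured and appended unchanged, mixed line endings in the input survive the transformation.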
-
shape
Applies Arabic shaping and bidirectional reordering to the input text.
- Parameters:
input - the input Arabic text
- Returns:
- the shaped and reordered text suitable for visual rendering in terminals
-
doShape
Performs Arabic shaping and bidirectional reordering on a single input line.
- Parameters:
input - the input Arabic text
- Returns:
- the shaped and reordered text
- Throws:
com.ibm.icu.text.ArabicShapingException - if an error occurs during shaping
-
padText
Pads the input text to align it within the terminal width.
If print is true, prints the padded text; otherwise, returns it.
- Parameters:
input - the input text to pad
print - if true, print the padded text; else return it
- Returns:
- the padded text if print is false; otherwise null
-
doPadText
Pads the input text to align within the terminal width, adjusting for overflow.
- Parameters:
input - the input text to pad
print - if true, prints the padded lines; else returns them as a single string
- Returns:
- the padded text if print is false; otherwise null
-
doPadText
private static StringBuilder doPadText(List<String> lines, String word, StringBuilder currentLine, int terminalWidth, boolean print)
Splits a list of words into lines that fit the terminal width, adding padding if needed. If printing is enabled, lines are printed directly to the console; otherwise, they are collected in a list.
- Parameters:
lines - the list to store padded lines (ignored if printing)
word - the current word to add
currentLine - the StringBuilder holding the current line
terminalWidth - the width of the terminal for padding
print - whether to print lines immediately or store them in the list
- Returns:
- a new StringBuilder starting with the current word for the next line
-
doPadText
Pads the input text to fit the specified terminal width, splitting it into multiple lines if necessary. Lines are either printed directly or returned as a joined string depending on the print flag.
- Parameters:
input - the input text to pad
terminalWidth - the width of the terminal
print - if true, prints padded lines; otherwise returns them as a single string
- Returns:
- the padded text as a string if print is false; otherwise null
-
addPadding
Adds padding spaces to the given StringBuilder input to align the text to the specified terminal width. The padding is calculated as the difference between the terminal width and the current length of the input.
If any exception occurs during padding calculation, the original input string is returned without modification.
- Parameters:
inputSb - the StringBuilder containing the text to pad
terminalWidth - the total width of the terminal to align the text to
- Returns:
- a String with added padding spaces to align the text, or the original text if padding cannot be applied
-
addPadding
Adds padding spaces to the left or right of the input to reach the specified padding length.
Padding is appended on the right if the input contains Arabic characters; otherwise on the left.
- Parameters:
input - the input text
padding - the number of spaces to add
- Returns:
- the padded string
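A short sketch of the direction-dependent padding rule described above. The Arabic check via Character.UnicodeScript is an assumption about how Arabic characters are detected, and the names PaddingSketch, containsArabic, and addPadding are illustrative.

```java
final class PaddingSketch {
    // Assumption: "Arabic character" means a code point in the Unicode Arabic script.
    static boolean containsArabic(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    static String addPadding(String input, int padding) {
        String spaces = " ".repeat(Math.max(0, padding)); // negative counts add nothing
        // Arabic text gets the spaces on the right, everything else on the left.
        return containsArabic(input) ? input + spaces : spaces + input;
    }
}
```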
-
chunk
Splits the given string into consecutive substrings of the specified size.
Each chunk is created using String.substring(int, int). The last chunk may be shorter than size if the input string's length is not evenly divisible by the chunk size.
Examples:
- chunk("abcdef", 2) → ["ab", "cd", "ef"]
- chunk("abcde", 2) → ["ab", "cd", "e"]
- Parameters:
input - the string to split; must not be null
size - the size of each chunk; must be greater than zero
- Returns:
- a list containing the resulting substrings, in order
- Throws:
IllegalArgumentException - if size is less than 1
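The documented behaviour can be reproduced with a few lines. This is a sketch written from the description above; the List<String> return type and the class name ChunkSketch are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSketch {
    static List<String> chunk(String input, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be greater than zero");
        }
        List<String> parts = new ArrayList<>();
        for (int start = 0; start < input.length(); start += size) {
            // The final chunk is clipped to the end of the input.
            parts.add(input.substring(start, Math.min(input.length(), start + size)));
        }
        return parts;
    }
}
```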
-
removeDiacritics
Removes Arabic diacritic marks from the given Arabic text.
- Parameters:
text - the Arabic text possibly containing diacritics
- Returns:
- the Arabic text with diacritics removed
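A minimal sketch of regex-based diacritic removal. The actual value of ARABIC_DIACRITICS_REGEX is not shown in this page, so the range used here (U+064B FATHATAN through U+0652 SUKUN) is an assumption, as are the names DiacriticsSketch and DIACRITICS.

```java
import java.util.regex.Pattern;

final class DiacriticsSketch {
    // Assumed range: the tanwin, short-vowel, shadda, and sukun marks U+064B..U+0652.
    static final Pattern DIACRITICS = Pattern.compile("[\\u064B-\\u0652]");

    static String removeDiacritics(String text) {
        // Base letters are untouched; only the combining marks are dropped.
        return DIACRITICS.matcher(text).replaceAll("");
    }
}
```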
-
transliterateScript
public static String[] transliterateScript(String transliteratorID, boolean removeDiacritics, String customRules, String... text)
Transliterates the given text(s) from Latin script to Arabic or vice versa, using the specified ICU Transliterator ID and optional custom rules.
- Parameters:
transliteratorID - the ICU Transliterator ID to use
removeDiacritics - whether to remove diacritics after transliteration
customRules - optional custom transliteration rules; may be null
text - one or more strings to transliterate
- Returns:
- an array of transliterated strings in the same order
-
transliterateScript
public static String transliterateScript(com.ibm.icu.text.Transliterator transliterator, boolean removeDiacritics, String word)
Transliterates a single word using the given Transliterator.
- Parameters:
transliterator - the ICU Transliterator instance to use
removeDiacritics - whether to remove diacritics after transliteration
word - the input word to transliterate
- Returns:
- the transliterated word
-
transliterateScriptLetterByLetter
Transliterates the input text letter by letter using the specified transliterator ID.
- Parameters:
transliteratorID - the ICU Transliterator ID to use
textInput - the input text to transliterate
- Returns:
- the transliterated text
-
transliterateScript
Transliterates one or more strings using the specified transliterator ID. Diacritics are not removed.
- Parameters:
transliteratorID - the ICU Transliterator ID
text - the input strings
- Returns:
- transliterated strings
-
transliterateScript
public static String[] transliterateScript(String transliteratorID, String customRules, String... text)
Transliterates one or more strings using the specified transliterator ID and custom rules. Diacritics are not removed.
- Parameters:
transliteratorID - the ICU Transliterator ID
customRules - custom transliteration rules
text - the input strings
- Returns:
- transliterated strings
-
transliterateToArabicScript
Transliterates one or more strings to Arabic script. Diacritics are removed when removeDiacritics is true.
- Parameters:
removeDiacritics - whether to remove diacritics after transliteration
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScriptDefault
Transliterates one or more strings to Arabic script using default custom rules. Diacritics are removed when removeDiacritics is true.
- Parameters:
removeDiacritics - whether to remove diacritics after transliteration
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScript
public static String[] transliterateToArabicScript(boolean removeDiacritics, String customRules, String... text)
Transliterates one or more strings to Arabic script using provided custom rules. Diacritics are removed when removeDiacritics is true.
- Parameters:
removeDiacritics - whether to remove diacritics after transliteration
customRules - custom transliteration rules
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScript
Transliterates one or more strings to Arabic script. Diacritics are removed by default.
- Parameters:
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScriptDefault
Transliterates one or more strings to Arabic script using default custom rules. Diacritics are removed by default.
- Parameters:
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScript
Transliterates one or more strings to Arabic script using the provided custom rules. Diacritics are removed by default.
- Parameters:
customRules - custom transliteration rules
text - the input strings
- Returns:
- transliterated Arabic script strings
-
transliterateToArabicScriptLetterByLetter
Transliterates the given text to Arabic script letter by letter.
- Parameters:
text - the input text
- Returns:
- the transliterated Arabic script text
-
shouldReshape
public static boolean shouldReshape()
Determines whether Arabic text reshaping should be applied for the current runtime environment.
Arabic reshaping is required on platforms or terminal environments that do not perform proper contextual shaping and bidirectional rendering (such as Windows consoles, WSL environments, or real xterm-based terminals on Unix systems).
The result is computed once and cached in SHOULD_RESHAPE to avoid repeated OS and terminal capability checks.
- Returns:
- true if Arabic reshaping should be applied (Windows, WSL, or Unix running inside a real xterm); false otherwise
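A rough sketch of the compute-once-and-cache idea. How the real method detects Windows, WSL, or a real xterm is not documented here, so every probe below is an assumption: the os.name prefix check, the WSL_DISTRO_NAME environment variable, and the XTERM_VERSION environment variable. The class name ReshapeProbeSketch is hypothetical.

```java
import java.util.Locale;

final class ReshapeProbeSketch {
    private static Boolean cached; // null until the first call

    static synchronized boolean shouldReshape() {
        if (cached == null) {
            String os = System.getProperty("os.name", "").toLowerCase(Locale.ROOT);
            boolean windows = os.startsWith("windows");
            boolean wsl = System.getenv("WSL_DISTRO_NAME") != null; // assumed WSL marker
            boolean xterm = System.getenv("XTERM_VERSION") != null; // assumed real-xterm marker
            cached = windows || wsl || xterm;
        }
        return cached; // later calls skip the environment probes
    }
}
```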
-
containsArabicLetters
Checks if the given text contains any Arabic characters.
This method returns true if at least one character in the string is identified as an Arabic character according to isArabicChar(int).
- Parameters:
text - the text to check; may be null (treated as empty)
- Returns:
- true if the text contains one or more Arabic characters, false otherwise
-
isArabicText
Checks if the given text consists entirely of Arabic characters.
This method returns true only if every character in the string is an Arabic character according to isArabicChar(int).
- Parameters:
text - the text to check; may be null (treated as empty)
- Returns:
- true if all characters in the text are Arabic, false otherwise
-
isArabicCharCp
public static boolean isArabicCharCp(int cp)
Checks if the given Unicode code point is an Arabic character.
- Parameters:
cp - the Unicode code point
- Returns:
- true if the code point is in Arabic Unicode blocks, false otherwise
-
isArabicChar
public static boolean isArabicChar(int cp)
Checks if the given Unicode code point belongs to the Arabic Unicode script.
- Parameters:
cp - the Unicode code point
- Returns:
- true if the code point belongs to the Arabic Unicode script, false otherwise
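One way to express the script check and the two text-level predicates with the JDK alone is shown below. It is a sketch, not the class's implementation: the use of Character.UnicodeScript, the result for empty text in isAllArabic, and the names ArabicScriptSketch, isArabic, containsArabic, and isAllArabic are assumptions.

```java
final class ArabicScriptSketch {
    // True when the code point's Unicode script property is Arabic.
    static boolean isArabic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.ARABIC;
    }

    // At least one Arabic code point; null is treated as empty.
    static boolean containsArabic(String text) {
        return text != null && text.codePoints().anyMatch(ArabicScriptSketch::isArabic);
    }

    // Every code point is Arabic. Assumption: empty or null text yields false.
    static boolean isAllArabic(String text) {
        return text != null && !text.isEmpty()
                && text.codePoints().allMatch(ArabicScriptSketch::isArabic);
    }
}
```

Under this strict reading, spaces and punctuation belong to the Common script, so a sentence containing them would not count as entirely Arabic.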
-
getRawHexBytes
Returns a list of pairs representing the Unicode code points (in hex) and characters from the given character array.
- Parameters:
charArray - the array of characters to analyze
- Returns:
- list of pairs with Unicode code point hex strings and character strings
-
getRawHexBytes
Converts the given String into a list of pairs, where each pair contains the Unicode hexadecimal representation of a character and the character itself.
- Parameters:
text - the input string to process
- Returns:
- a list of pairs of the form ("U+XXXX", "char"), representing each character's Unicode code point and character
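A small sketch that produces pairs of the documented ("U+XXXX", "char") form. The pair type used by the real method is not shown, so Map.Entry stands in for it here; iterating by code point rather than by char is also an assumption, and the names HexPairsSketch and hexPairs are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class HexPairsSketch {
    static List<Map.Entry<String, String>> hexPairs(String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        text.codePoints().forEach(cp -> pairs.add(new SimpleEntry<>(
                String.format("U+%04X", cp),           // hex code point, at least 4 digits
                new String(Character.toChars(cp)))));  // the character itself
        return pairs;
    }
}
```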
-
splitIdentifier
Splits an identifier string into constituent parts based on various naming conventions. It handles underscores, dashes, whitespace, camelCase, PascalCase, acronyms, and digits.
Example:
- "userAccount" → ["user", "Account"]
- "IPv6Address" → ["IPv", "6", "Address"]
- "snake_case-name" → ["snake", "case", "name"]
- Parameters:
input - the identifier string to split
- Returns:
- a list of strings representing the split components of the identifier
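The examples above can be reproduced with a two-stage split: first on delimiters, then on case and digit boundaries. The regular expressions below are assumptions chosen to match the documented examples; they are not the actual value of IDENTIFIER_SPLIT_REGEX, and the names IdentifierSplitSketch, BOUNDARY, and DELIMITERS are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class IdentifierSplitSketch {
    // Assumed zero-width boundaries: lower|Upper, acronym|Word (after two or
    // more capitals), letter|digit, and digit|letter.
    static final Pattern BOUNDARY = Pattern.compile(
            "(?<=[a-z])(?=[A-Z])"
            + "|(?<=[A-Z]{2})(?=[A-Z][a-z])"
            + "|(?<=[A-Za-z])(?=[0-9])"
            + "|(?<=[0-9])(?=[A-Za-z])");
    // Underscores, dashes, and whitespace separate tokens outright.
    static final Pattern DELIMITERS = Pattern.compile("[_\\-\\s]+");

    static List<String> splitIdentifier(String input) {
        List<String> parts = new ArrayList<>();
        for (String token : DELIMITERS.split(input)) {
            if (token.isEmpty()) {
                continue; // leading delimiter produces an empty first token
            }
            for (String piece : BOUNDARY.split(token)) {
                parts.add(piece);
            }
        }
        return parts;
    }
}
```

Requiring two capitals before the acronym boundary is what keeps "IPv" together while still splitting "JSONTo" into "JSON" and "To".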
-
convertArabicToLatinLetterByLetter
Converts an input string from Arabic characters and digits to their Latin and ASCII equivalents.
This method supports:
- Arabic letters mapped one-to-one to Latin uppercase letters (A-Z).
- Arabic-Indic digits (٠-٩) mapped to ASCII digits (0-9).
- Latin letters (A-Z, a-z) and ASCII digits (0-9) passed through unchanged.
Any unsupported character will cause a NaftahBugError to be thrown.
- Parameters:
text - the input string containing Arabic characters and/or digits
- Returns:
- the Latin-equivalent string after transliteration
- Throws:
NaftahBugError - if the input contains unsupported characters
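A sketch of the positional, letter-by-letter mapping described above. The two arrays contain only the three pairs that this page documents ('ا' → 'A', 'ب' → 'B', 'ت' → 'C'); the real arrays hold 26 entries that are not listed here. IllegalArgumentException stands in for NaftahBugError, and the names LetterMapSketch and toLatin are illustrative.

```java
final class LetterMapSketch {
    // Illustrative three-letter excerpt of the positional tables.
    static final char[] ARABIC = {'\u0627', '\u0628', '\u062A'}; // ا ب ت
    static final char[] LATIN = {'A', 'B', 'C'};

    static String toLatin(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            int index = indexOf(ARABIC, ch);
            if (index >= 0) {
                out.append(LATIN[index]);                        // same index, other table
            } else if (ch >= '\u0660' && ch <= '\u0669') {
                out.append((char) ('0' + (ch - '\u0660')));      // Arabic-Indic digit to ASCII
            } else if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9')) {
                out.append(ch);                                  // passed through unchanged
            } else {
                throw new IllegalArgumentException("Unsupported character: " + ch);
            }
        }
        return out.toString();
    }

    private static int indexOf(char[] table, char ch) {
        for (int i = 0; i < table.length; i++) {
            if (table[i] == ch) {
                return i;
            }
        }
        return -1;
    }
}
```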
-
isLatinLetter
public static boolean isLatinLetter(char ch)
Checks whether a character is a Latin letter (A-Z or a-z).
- Parameters:
ch - the character to check
- Returns:
- true if the character is a Latin letter; false otherwise
-
isAsciiDigit
public static boolean isAsciiDigit(int ch)
Checks whether a character is an ASCII digit (0-9).
- Parameters:
ch - the character to check
- Returns:
- true if the character is an ASCII digit; false otherwise
-
isArabicIndicDigit
public static boolean isArabicIndicDigit(char ch)
Checks whether a character is an Arabic-Indic digit (٠ to ٩).
- Parameters:
ch - the character to check
- Returns:
- true if the character is an Arabic-Indic digit; false otherwise
-
numberToString
Converts a Number into a string using formatting rules, replacing the standard ASCII decimal separator with a comma (U+066C), and optionally converting ASCII digits (0–9) to Arabic-Indic digits (٠–٩). Note that Unicode names U+066C ARABIC THOUSANDS SEPARATOR; the code point named ARABIC DECIMAL SEPARATOR is U+066B.
If the system property naftah.number.arabicIndic.active is set to true, this method will convert each ASCII digit to its Arabic-Indic equivalent. Otherwise, digits remain unchanged.
This method does not use locale-aware formatting; it operates directly on the string representation of the number returned by Object.toString().
- Parameters:
number - the number to convert; must not be null
- Returns:
- a string representing the number with the replaced decimal separator, and optionally Arabic-Indic digits
- Throws:
NullPointerException - if number is null
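A minimal sketch of the conversion as described: take the toString() form, swap the '.' for the documented U+066C character, and optionally shift the digits. The two-argument overload exists only to make the behaviour easy to exercise and is an assumption, as is the class name NumberTextSketch.

```java
final class NumberTextSketch {
    // Hypothetical helper: the flag is passed in instead of read from the system property.
    static String numberToString(Number number, boolean arabicIndic) {
        // toString() throws NullPointerException for a null number, as documented.
        String text = number.toString().replace('.', '\u066C');
        if (!arabicIndic) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            // Shift '0'..'9' onto U+0660..U+0669; leave every other character alone.
            out.append(ch >= '0' && ch <= '9' ? (char) ('\u0660' + (ch - '0')) : ch);
        }
        return out.toString();
    }

    static String numberToString(Number number) {
        // Boolean.getBoolean returns true only when the property equals "true".
        return numberToString(number, Boolean.getBoolean("naftah.number.arabicIndic.active"));
    }
}
```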
-