Class ScriptUtils

java.lang.Object
org.daiitech.naftah.utils.script.ScriptUtils

public final class ScriptUtils extends Object
Utility class providing various methods for handling Arabic text processing, including text shaping, bidi reordering, transliteration, diacritics removal, padding for terminal display, and detection of Arabic characters.

This class is designed as a final utility class and cannot be instantiated.

It uses ICU4J ArabicShaping and Transliterator for shaping and transliteration.

Author:
Chakib Daii
  • Field Details

    • RTL_DIRECTION

      public static final String RTL_DIRECTION
      Escape code to set Right-To-Left (RTL) text direction in compatible terminals.
    • LTR_DIRECTION

      public static final String LTR_DIRECTION
      Escape code to set Left-To-Right (LTR) text direction in compatible terminals.
    • ARABIC_DIACRITICS_REGEX

      public static final String ARABIC_DIACRITICS_REGEX
      Regular expression matching Arabic diacritic marks in Unicode.
    • ANSI_ESCAPE

      public static final String ANSI_ESCAPE
      ANSI escape sequence to clear the screen.
    • LATIN_ARABIC_TRANSLITERATION_ID

      public static final String LATIN_ARABIC_TRANSLITERATION_ID
      ICU Transliterator ID for Latin-to-Arabic and Arabic-to-Latin transliteration.
    • ARABIC_LANGUAGE

      public static final String ARABIC_LANGUAGE
      Language code for Arabic.
    • DEFAULT_ARABIC_LANGUAGE_COUNTRY

      public static final String DEFAULT_ARABIC_LANGUAGE_COUNTRY
      Default country code used in Arabic locale.
    • ARABIC_LOCALE

      public static final Locale ARABIC_LOCALE
      Locale instance representing Arabic language.
    • CUSTOM_RULES_BUNDLE

      public static final ResourceBundle CUSTOM_RULES_BUNDLE
      ResourceBundle loaded with custom transliteration rules for Arabic.
    • TEXT_MULTILINE_PATTERN

      public static final Pattern TEXT_MULTILINE_PATTERN
      Pattern to detect lines in multiline text, capturing line content and newline characters.
    • CUSTOM_RULES_KEYS

      public static final Set<String> CUSTOM_RULES_KEYS
      The key set of the custom transliteration rules for Arabic.
    • IDENTIFIER_SPLIT_REGEX

      private static final String IDENTIFIER_SPLIT_REGEX
      Regular expression used to split identifiers into components based on transitions between uppercase letters, digits, and lowercase letters.

      For example:

      • "JSONTo" → "JSON", "To"
      • "userAccount" → "user", "Account"
      • "IPv6" → "IPv", "6"
      • "6Parser" → "6", "Parser"

    • ARABIC_LETTERS

      private static final char[] ARABIC_LETTERS
      Arabic alphabet letters used for transliteration to Latin letters.

      The characters are mapped positionally (index by index) to uppercase Latin letters. The list contains 26 Arabic letters, from 'ا' to 'ه', and is intended for character-by-character mapping in ASCII-based positional encodings (for example, base-11 to base-36 systems).

      Examples of mapping:

      • 'ا' → 'A'
      • 'ب' → 'B'
      • 'ت' → 'C'
      • ...
      • 'ه' → 'Z'
    • LATIN_LETTERS

      private static final char[] LATIN_LETTERS
      Latin uppercase letters used as transliteration equivalents for Arabic letters.

      Each letter corresponds to an Arabic letter by position in the ARABIC_LETTERS array. This mapping supports systems like base-36 encodings or custom symbolic notations using Arabic letters.

      Examples of mapping:

      • 'A' → 'ا'
      • 'B' → 'ب'
      • 'C' → 'ت'
      • ...
      • 'Z' → 'ه'
    • ICU_RESERVED_WORDS

      private static final Set<String> ICU_RESERVED_WORDS
      A set of reserved words used by the ICU (International Components for Unicode) transliteration and normalization APIs. These words have special meaning in ICU transliteration rules and Unicode transformations.

      Examples of usage contexts include:

      • Transliteration rule syntax (e.g., "::NFD;" or "::Latin-ASCII;")
      • Normalization forms (e.g., "NFC", "NFD", "NFKC", "NFKD")
      • Unicode script and block identifiers (e.g., "Latin", "Greek", "Han")
      • Keywords in rule definitions (e.g., "use", "import", "function")

      This set can be used to:

      • Validate user-defined transliteration rules
      • Highlight or flag reserved words in editors or tools
      • Prevent conflicts in custom ICU rule definitions
    • NUMBER_FORMAT

      public static volatile ThreadLocal<com.ibm.icu.text.NumberFormat> NUMBER_FORMAT
      A thread-local holder for a reusable NumberFormat instance configured for the Arabic locale.

      This formatter uses Arabic locale conventions for decimal and grouping separators, and may render numbers using Arabic-Indic digits (e.g., ٠١٢٣٤٥٦٧٨٩), depending on JVM settings and font support.

      Note: NumberFormat instances are not thread-safe. If this formatter is used across multiple threads, synchronize access or create a new instance via NumberFormat.getNumberInstance(ARABIC).
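
      The per-thread pattern implied by the ThreadLocal declaration can be sketched as follows. This is a minimal illustration, not the library's code: it uses the JDK's java.text.NumberFormat as a stand-in for com.ibm.icu.text.NumberFormat so that it runs without ICU4J, and the class name, the FORMAT field and the plain "ar" language tag are assumptions.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class NumberFormatSketch {
    // Assumption: a plain "ar" locale; the real ARABIC_LOCALE may also carry a country code.
    private static final Locale ARABIC = Locale.forLanguageTag("ar");

    // Each thread lazily receives its own formatter, so no synchronization is needed.
    static final ThreadLocal<NumberFormat> FORMAT =
            ThreadLocal.withInitial(() -> NumberFormat.getNumberInstance(ARABIC));

    static String format(Number number) {
        return FORMAT.get().format(number);
    }

    public static void main(String[] args) {
        // The digits and separators printed here depend on the JDK's locale data.
        System.out.println(format(1234567.89));
    }
}
```

      Within one thread, repeated calls to FORMAT.get() return the same formatter, which is what makes the instance reusable without locking.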

    • CUSTOM_RULES

      public static String CUSTOM_RULES
      Custom transliteration rules defined as a multi-line string. Each rule maps Latin script sequences to their corresponding Arabic script sequences. For example, "com > كوم" transliterates "com" to Arabic "كوم".
    • TEXT_MATCHER_CACHE

      private static Map<String,Matcher> TEXT_MATCHER_CACHE
      Cache of Matcher instances for text processing, keyed by the input text string. Used to improve performance by reusing an existing matcher instead of creating a new one each time the same input is processed.
    • SHOULD_RESHAPE

      private static Boolean SHOULD_RESHAPE
      Cached flag indicating whether Arabic text reshaping should be applied for the current environment.
  • Constructor Details

    • ScriptUtils

      private ScriptUtils()
      Private constructor to prevent instantiation. Always throws a NaftahBugError when called.
  • Method Details

    • parseRules

      public static Map<String,String> parseRules(String rules)
      Parses a set of transformation rules from a string into a map.

      The input string should contain one rule per line in the format:

      
       source > target;
       

      Each line is processed as follows:

      • Leading and trailing whitespace is stripped
      • Empty lines are ignored
      • Trailing semicolons are removed
      • The line is split on the first occurrence of the '>' character

      Example input:

      
       a > b;
       c > d;
       

      Will result in a map:

      
       {
         "a" -> "b",
         "c" -> "d"
       }
       
      Parameters:
      rules - A string containing one or more transformation rules separated by newlines
      Returns:
      A map of source-to-target transformations
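
      The parsing steps listed above can be sketched in plain Java as follows. This is an illustration under stated assumptions rather than the library's implementation: the class name is hypothetical, keys and values are trimmed, and lines without a '>' are skipped (the documentation does not say how such lines are handled).

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParseRulesSketch {
    static Map<String, String> parseRules(String rules) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String raw : rules.split("\\R")) {         // one rule per line
            String line = raw.trim();                   // strip surrounding whitespace
            if (line.isEmpty()) {
                continue;                               // ignore empty lines
            }
            line = line.replaceAll(";+$", "");          // remove trailing semicolons
            int arrow = line.indexOf('>');              // first '>' only
            if (arrow < 0) {
                continue;                               // assumption: skip lines without '>'
            }
            result.put(line.substring(0, arrow).trim(),
                    line.substring(arrow + 1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseRules("a > b;\nc > d;")); // prints {a=b, c=d}
    }
}
```

      Because only the first '>' splits the line, a rule such as "x > y > z;" maps "x" to "y > z" in this sketch.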
    • isMultiline

      public static boolean isMultiline(String input)
      Checks if the given input string contains multiple lines.
      Parameters:
      input - the input string to check
      Returns:
      true if the input contains one or more newline characters; false otherwise
    • getTextMatcher

      private static Matcher getTextMatcher(String input)
      Retrieves a cached Matcher for the given input string using the TEXT_MULTILINE_PATTERN pattern. If a matcher for the input already exists in the cache, it is reset and returned; otherwise, a new matcher is created, cached, reset, and returned.

      This caching mechanism improves performance by reusing matcher instances for repeated input strings.

      Parameters:
      input - the input string to create or retrieve a matcher for
      Returns:
      a reset Matcher instance ready for matching against the input
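
      A minimal sketch of the reset-and-reuse caching described above. The pattern is a hypothetical stand-in for TEXT_MULTILINE_PATTERN (line content followed by an optional newline), and the use of ConcurrentHashMap is an assumption; the documentation only states that the cache is a Map keyed by the input string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherCacheSketch {
    // Hypothetical pattern: group 1 is the line content, group 2 the optional line break.
    private static final Pattern LINE = Pattern.compile("([^\\r\\n]*)(\\r\\n|\\r|\\n)?");

    private static final Map<String, Matcher> CACHE = new ConcurrentHashMap<>();

    static Matcher getTextMatcher(String input) {
        // Create the matcher on first use, then hand back the cached instance.
        Matcher matcher = CACHE.computeIfAbsent(input, LINE::matcher);
        // Rewind so every caller starts matching from offset 0.
        return matcher.reset();
    }

    public static void main(String[] args) {
        Matcher first = getTextMatcher("x\ny");
        Matcher second = getTextMatcher("x\ny");
        System.out.println(first == second); // prints true
    }
}
```

      A Matcher carries mutable state and is not safe for concurrent use, so a cache like this suits callers that process a given input on one thread at a time.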
    • applyBiFunction

      public static String applyBiFunction(String input, boolean print, ThrowingBiFunction<String,Boolean,String> function)
      Applies a bi-function to each line in the input text.

      If the input is multiline, applies the function to each line individually, preserving line separators. Otherwise, applies the function once to the whole input.

      Parameters:
      input - the input text (possibly multiline)
      print - if true, the result is printed to the console; if false, the result is returned
      function - a bi-function taking a line and the print flag, returning the processed line
      Returns:
      the processed text if print is false; otherwise, null
    • applyFunction

      public static String applyFunction(String input, ThrowingFunction<String,String> function)
      Applies a function to each line in the input text.

      If the input is multiline, applies the function to each line individually, preserving line separators. Otherwise, applies the function once to the whole input.

      Parameters:
      input - the input text (possibly multiline)
      function - a function taking a line and returning the processed line
      Returns:
      the processed text with all lines processed by the function
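
      The line-by-line behaviour can be sketched as follows. The sketch assumes a pattern that captures line content and the trailing line break separately, and it uses java.util.function.UnaryOperator in place of the library's ThrowingFunction; the class and method names are hypothetical.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PerLineSketch {
    // Hypothetical pattern: group 1 is the line content, group 2 the optional line break.
    private static final Pattern LINE = Pattern.compile("([^\\r\\n]*)(\\r\\n|\\r|\\n)?");

    static String applyPerLine(String input, UnaryOperator<String> function) {
        if (input.indexOf('\n') < 0 && input.indexOf('\r') < 0) {
            return function.apply(input);              // single line: apply once
        }
        StringBuilder out = new StringBuilder();
        Matcher matcher = LINE.matcher(input);
        while (matcher.find()) {
            if (matcher.group().isEmpty()) {
                break;                                 // an empty match only occurs at end of input
            }
            out.append(function.apply(matcher.group(1)));
            if (matcher.group(2) != null) {
                out.append(matcher.group(2));          // keep the original separator
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Wraps each line in brackets while leaving "\n" and "\r\n" untouched.
        System.out.println(applyPerLine("ab\ncd\r\nef", s -> "[" + s + "]"));
    }
}
```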
    • shape

      public static String shape(String input)
      Applies Arabic shaping and bidirectional reordering to the input text.
      Parameters:
      input - the input Arabic text
      Returns:
      the shaped and reordered text suitable for visual rendering in terminals
    • doShape

      private static String doShape(String input) throws com.ibm.icu.text.ArabicShapingException
      Performs Arabic shaping and bidirectional reordering on a single input line.
      Parameters:
      input - the input Arabic text
      Returns:
      the shaped and reordered text
      Throws:
      com.ibm.icu.text.ArabicShapingException - if an error occurs during shaping
    • padText

      public static String padText(String input, boolean print)
      Pads the input text to align it within the terminal width.

      If print is true, prints the padded text; otherwise, returns it.

      Parameters:
      input - the input text to pad
      print - if true, print the padded text; else return it
      Returns:
      the padded text if print is false; otherwise null
    • doPadText

      private static String doPadText(String input, boolean print)
      Pads the input text to align within the terminal width, adjusting for overflow.
      Parameters:
      input - the input text to pad
      print - if true, prints the padded lines; else returns them as a single string
      Returns:
      the padded text if print is false; otherwise null
    • doPadText

      private static StringBuilder doPadText(List<String> lines, String word, StringBuilder currentLine, int terminalWidth, boolean print)
      Splits a list of words into lines that fit the terminal width, adding padding if needed. If printing is enabled, lines are printed directly to the console; otherwise, they are collected in a list.
      Parameters:
      lines - the list to store padded lines (ignored if printing)
      word - the current word to add
      currentLine - the StringBuilder holding the current line
      terminalWidth - the width of the terminal for padding
      print - whether to print lines immediately or store in list
      Returns:
      a new StringBuilder starting with the current word for the next line
    • doPadText

      private static String doPadText(String input, int terminalWidth, boolean print)
      Pads the input text to fit the specified terminal width, splitting it into multiple lines if necessary. Lines are either printed directly or returned as a joined string depending on the print flag.
      Parameters:
      input - the input text to pad
      terminalWidth - the width of the terminal
      print - if true, prints padded lines; otherwise returns them as a single string
      Returns:
      the padded text as a string if print is false; otherwise null
    • addPadding

      private static String addPadding(StringBuilder inputSb, int terminalWidth)
      Adds padding spaces to the given StringBuilder input to align the text to the specified terminal width. The padding is calculated as the difference between the terminal width and the current length of the input.

      If any exception occurs during padding calculation, the original input string is returned without modification.

      Parameters:
      inputSb - the StringBuilder containing the text to pad
      terminalWidth - the total width of the terminal to align the text to
      Returns:
      a String with added padding spaces to align the text, or the original text if padding cannot be applied
    • addPadding

      private static String addPadding(String input, int padding)
      Adds padding spaces to the left or right of the input to reach the specified padding length.

      Padding is appended on the right if the input contains Arabic characters; otherwise on the left.

      Parameters:
      input - the input text
      padding - the number of spaces to add
      Returns:
      the padded string
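
      A minimal sketch of the two addPadding overloads as documented above. It is not the library's code: the Arabic check uses the JDK's Character.UnicodeScript as an assumed detection rule, and the width calculation counts UTF-16 code units, which can differ from the number of terminal columns a string occupies.

```java
import java.util.Arrays;

final class PaddingSketch {
    static boolean containsArabic(String text) {
        return text.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC);
    }

    // As documented: spaces go on the right for Arabic input, on the left otherwise.
    static String addPadding(String input, int padding) {
        char[] spaces = new char[Math.max(0, padding)];
        Arrays.fill(spaces, ' ');
        String pad = new String(spaces);
        return containsArabic(input) ? input + pad : pad + input;
    }

    // Pads by the gap between the terminal width and the current line length.
    static String addPadding(StringBuilder line, int terminalWidth) {
        return addPadding(line.toString(), terminalWidth - line.length());
    }

    public static void main(String[] args) {
        System.out.println("|" + addPadding(new StringBuilder("abc"), 10) + "|");
    }
}
```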
    • chunk

      private static List<String> chunk(String input, int size)
      Splits the given string into consecutive substrings of the specified size.

      Each chunk is created using String.substring(int, int). The last chunk may be shorter than size if the input string's length is not evenly divisible by the chunk size.

      Examples:

      • chunk("abcdef", 2) → ["ab", "cd", "ef"]
      • chunk("abcde", 2) → ["ab", "cd", "e"]
      Parameters:
      input - the string to split; must not be null
      size - the size of each chunk; must be greater than zero
      Returns:
      a list containing the resulting substrings, in order
      Throws:
      IllegalArgumentException - if size is less than 1
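
      The chunking behaviour described above can be reproduced with a short loop over String.substring(int, int). The class name is hypothetical and the sketch is only meant to illustrate the documented contract.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkSketch {
    static List<String> chunk(String input, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < input.length(); start += size) {
            // The last chunk is clipped to the end of the input.
            int end = Math.min(input.length(), start + size);
            chunks.add(input.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcde", 2)); // prints [ab, cd, e]
    }
}
```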
    • removeDiacritics

      public static String removeDiacritics(String text)
      Removes Arabic diacritic marks from the given Arabic text.
      Parameters:
      text - the Arabic text possibly containing diacritics
      Returns:
      the Arabic text with diacritics removed
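
      A minimal sketch of regex-based diacritics removal. The character class below is an assumption, not the actual value of ARABIC_DIACRITICS_REGEX: it covers the tanween, short-vowel, shadda and sukun marks (U+064B to U+0652) plus the superscript alef (U+0670).

```java
final class DiacriticsSketch {
    // Assumed range; the library's ARABIC_DIACRITICS_REGEX may match more or fewer marks.
    private static final String DIACRITICS = "[\\u064B-\\u0652\\u0670]";

    static String removeDiacritics(String text) {
        return text.replaceAll(DIACRITICS, "");
    }

    public static void main(String[] args) {
        // Input is the word "kataba" written with a fatha on each letter.
        System.out.println(removeDiacritics("كَتَبَ"));
    }
}
```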
    • transliterateScript

      public static String[] transliterateScript(String transliteratorID, boolean removeDiacritics, String customRules, String... text)
      Transliterates the given text(s) from Latin script to Arabic or vice versa, using the specified ICU Transliterator ID and optional custom rules.
      Parameters:
      transliteratorID - the ICU Transliterator ID to use
      removeDiacritics - whether to remove diacritics after transliteration
      customRules - optional custom transliteration rules; may be null
      text - one or more strings to transliterate
      Returns:
      an array of transliterated strings in the same order
    • transliterateScript

      public static String transliterateScript(com.ibm.icu.text.Transliterator transliterator, boolean removeDiacritics, String word)
      Transliterates a single word using the given Transliterator.
      Parameters:
      transliterator - the ICU Transliterator instance to use
      removeDiacritics - whether to remove diacritics after transliteration
      word - the input word to transliterate
      Returns:
      the transliterated word
    • transliterateScriptLetterByLetter

      public static String transliterateScriptLetterByLetter(String transliteratorID, String textInput)
      Transliterates the input text letter by letter using the specified transliterator ID.
      Parameters:
      transliteratorID - the ICU Transliterator ID to use
      textInput - the input text to transliterate
      Returns:
      the transliterated text
    • transliterateScript

      public static String[] transliterateScript(String transliteratorID, String... text)
      Transliterates one or more strings using the specified transliterator ID. Diacritics are not removed.
      Parameters:
      transliteratorID - the ICU Transliterator ID
      text - the input strings
      Returns:
      transliterated strings
    • transliterateScript

      public static String[] transliterateScript(String transliteratorID, String customRules, String... text)
      Transliterates one or more strings using the specified transliterator ID and custom rules. Diacritics are not removed.
      Parameters:
      transliteratorID - the ICU Transliterator ID
      customRules - custom transliteration rules
      text - the input strings
      Returns:
      transliterated strings
    • transliterateToArabicScript

      public static String[] transliterateToArabicScript(boolean removeDiacritics, String... text)
      Transliterates one or more strings to Arabic script. Diacritics are removed only when removeDiacritics is true.
      Parameters:
      removeDiacritics - whether to remove diacritics after transliteration
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScriptDefault

      public static String[] transliterateToArabicScriptDefault(boolean removeDiacritics, String... text)
      Transliterates one or more strings to Arabic script using default custom rules. Diacritics are removed only when removeDiacritics is true.
      Parameters:
      removeDiacritics - whether to remove diacritics after transliteration
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScript

      public static String[] transliterateToArabicScript(boolean removeDiacritics, String customRules, String... text)
      Transliterates one or more strings to Arabic script using the provided custom rules. Diacritics are removed only when removeDiacritics is true.
      Parameters:
      removeDiacritics - whether to remove diacritics after transliteration
      customRules - custom transliteration rules
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScript

      public static String[] transliterateToArabicScript(String... text)
      Transliterates one or more strings to Arabic script. Diacritics are removed by default.
      Parameters:
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScriptDefault

      public static String[] transliterateToArabicScriptDefault(String... text)
      Transliterates one or more strings to Arabic script using default custom rules. Diacritics are removed by default.
      Parameters:
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScript

      public static String[] transliterateToArabicScript(String customRules, String... text)
      Transliterates one or more strings to Arabic script using the provided custom rules. Diacritics are removed by default.
      Parameters:
      customRules - custom transliteration rules
      text - the input strings
      Returns:
      transliterated Arabic script strings
    • transliterateToArabicScriptLetterByLetter

      public static String transliterateToArabicScriptLetterByLetter(String text)
      Transliterates the given text to Arabic script letter by letter.
      Parameters:
      text - the input text
      Returns:
      the transliterated Arabic script text
    • shouldReshape

      public static boolean shouldReshape()
      Determines whether Arabic text reshaping should be applied for the current runtime environment.

      Arabic reshaping is required on platforms or terminal environments that do not perform proper contextual shaping and bidirectional rendering (such as Windows consoles, WSL environments, or real xterm-based terminals on Unix systems).

      The result is computed once and cached in SHOULD_RESHAPE to avoid repeated OS and terminal capability checks.

      Returns:
      true if Arabic reshaping should be applied (Windows, WSL, or Unix running inside a real xterm); false otherwise
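
      The compute-once-and-cache behaviour can be sketched as follows. The detection rules here are assumptions made for illustration only: the sketch treats an os.name starting with "windows" as Windows, a non-empty WSL_DISTRO_NAME environment variable as WSL, and a non-empty XTERM_VERSION variable as a real xterm. The library's actual checks are not documented on this page and may differ.

```java
import java.util.Locale;

final class ReshapeCheckSketch {
    private static Boolean cached;   // null until the first call

    static synchronized boolean shouldReshape() {
        if (cached == null) {
            // Environment lookups happen only once; later calls reuse the stored result.
            cached = detect(System.getProperty("os.name", ""),
                    System.getenv("WSL_DISTRO_NAME"),
                    System.getenv("XTERM_VERSION"));
        }
        return cached;
    }

    // Pure decision function, kept separate so it can be exercised without touching the environment.
    static boolean detect(String osName, String wslDistro, String xtermVersion) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        boolean wsl = wslDistro != null && !wslDistro.isEmpty();
        boolean realXterm = xtermVersion != null && !xtermVersion.isEmpty();
        return windows || wsl || realXterm;
    }

    public static void main(String[] args) {
        System.out.println(shouldReshape());
    }
}
```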
    • containsArabicLetters

      public static boolean containsArabicLetters(String text)
      Checks if the given text contains any Arabic characters.

      This method returns true if at least one character in the string is identified as an Arabic character according to isArabicChar(int).

      Parameters:
      text - the text to check; may be null (treated as empty)
      Returns:
      true if the text contains one or more Arabic characters, false otherwise
    • isArabicText

      public static boolean isArabicText(String text)
      Checks if the given text consists entirely of Arabic characters.

      This method returns true only if every character in the string is an Arabic character according to isArabicChar(int).

      Parameters:
      text - the text to check; may be null (treated as empty)
      Returns:
      true if all characters in the text are Arabic, false otherwise
    • isArabicCharCp

      public static boolean isArabicCharCp(int cp)
      Checks if the given Unicode code point is an Arabic character.
      Parameters:
      cp - the Unicode code point
      Returns:
      true if the code point is in Arabic Unicode blocks, false otherwise
    • isArabicChar

      public static boolean isArabicChar(int cp)
      Checks if the given Unicode code point belongs to the Arabic Unicode script.
      Parameters:
      cp - the Unicode code point
      Returns:
      true if the code point belongs to the Arabic Unicode script, false otherwise
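
      The difference between a block-based check and a script-based check, and the "any" versus "all" text checks built on top of them, can be sketched with the JDK alone. Names are hypothetical, the list of blocks is an assumption, and returning false from the "all" check for null or empty input is also an assumption; the documented methods may behave differently.

```java
final class ArabicDetectionSketch {
    // Script-based check, in the spirit of isArabicChar(int).
    static boolean isArabicScript(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC;
    }

    // Block-based check, in the spirit of isArabicCharCp(int); the block list is assumed.
    static boolean isInArabicBlock(int cp) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_SUPPLEMENT
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    // True if at least one code point is Arabic; null is treated as empty.
    static boolean containsArabic(String text) {
        return text != null && text.codePoints().anyMatch(ArabicDetectionSketch::isArabicScript);
    }

    // True only if every code point is Arabic (assumption: false for null or empty input).
    static boolean isAllArabic(String text) {
        return text != null && !text.isEmpty()
                && text.codePoints().allMatch(ArabicDetectionSketch::isArabicScript);
    }

    public static void main(String[] args) {
        System.out.println(containsArabic("abc ب")); // prints true
        System.out.println(isAllArabic("abc ب"));    // prints false
    }
}
```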
    • getRawHexBytes

      public static List<com.ibm.icu.impl.Pair<String,String>> getRawHexBytes(char[] charArray)
      Returns a list of pairs representing the Unicode code points (in hex) and characters from the given character array.
      Parameters:
      charArray - the array of characters to analyze
      Returns:
      list of pairs with Unicode code point hex strings and character strings
    • getRawHexBytes

      public static List<com.ibm.icu.impl.Pair<String,String>> getRawHexBytes(String text)
      Converts the given String into a list of pairs, where each pair contains the Unicode hexadecimal representation of a character and the character itself.
      Parameters:
      text - the input string to process
      Returns:
      a list of pairs of the form ("U+XXXX", "char"), representing each character's Unicode code point and character
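
      A minimal sketch of producing ("U+XXXX", "char") pairs. It uses java.util.Map.Entry in place of com.ibm.icu.impl.Pair so that it runs without ICU4J, and it iterates by code point, which is an assumption; the documented method may iterate by char instead.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class HexDumpSketch {
    static List<Map.Entry<String, String>> rawHex(String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        text.codePoints().forEach(cp -> pairs.add(new SimpleImmutableEntry<>(
                String.format("U+%04X", cp),             // zero-padded uppercase hex
                new String(Character.toChars(cp)))));    // the character itself
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(rawHex("AB")); // prints [U+0041=A, U+0042=B]
    }
}
```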
    • splitIdentifier

      public static List<String> splitIdentifier(String input)
      Splits an identifier string into constituent parts based on various naming conventions. It handles underscores, dashes, whitespace, camelCase, PascalCase, acronyms, and digits.

      Examples:

      • "userAccount" → ["user", "Account"]
      • "IPv6Address" → ["IPv", "6", "Address"]
      • "snake_case-name" → ["snake", "case", "name"]

      Parameters:
      input - the identifier string to split
      Returns:
      a list of strings representing the split components of the identifier
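
      The splitting rules can be sketched with a lookaround-based regular expression. The expression below is an assumption chosen to reproduce the examples given on this page; it is not the value of IDENTIFIER_SPLIT_REGEX, which is not shown here. In particular, the acronym rule requires two preceding uppercase letters so that "IPv" stays in one piece while "JSONTo" splits into "JSON" and "To".

```java
import java.util.ArrayList;
import java.util.List;

final class IdentifierSplitSketch {
    // Assumed regex, not the library's IDENTIFIER_SPLIT_REGEX.
    private static final String SPLIT = "[_\\-\\s]+"      // underscores, dashes, whitespace
            + "|(?<=[a-z])(?=[A-Z])"                      // lower -> upper (camelCase)
            + "|(?<=[A-Z]{2})(?=[A-Z][a-z])"              // end of an acronym: JSON|To
            + "|(?<=[A-Za-z])(?=[0-9])"                   // letter -> digit
            + "|(?<=[0-9])(?=[A-Za-z])";                  // digit -> letter

    static List<String> splitIdentifier(String input) {
        List<String> parts = new ArrayList<>();
        for (String part : input.split(SPLIT)) {
            if (!part.isEmpty()) {                        // drop empties from leading separators
                parts.add(part);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitIdentifier("IPv6Address")); // prints [IPv, 6, Address]
    }
}
```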
    • convertArabicToLatinLetterByLetter

      public static String convertArabicToLatinLetterByLetter(String text)
      Converts an input string from Arabic characters and digits to their Latin and ASCII equivalents.

      This method supports:

      • Arabic letters mapped one-to-one to Latin uppercase letters (A-Z).
      • Arabic-Indic digits (٠-٩) mapped to ASCII digits (0-9).
      • Latin letters (A-Z, a-z) and ASCII digits (0-9) passed through unchanged.

      Any unsupported character will cause a NaftahBugError to be thrown.

      Parameters:
      text - the input string containing Arabic characters and/or digits
      Returns:
      the Latin-equivalent string after transliteration
      Throws:
      NaftahBugError - if the input contains unsupported characters
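
      A minimal sketch of the three documented cases (letter mapping, digit mapping, pass-through). The lookup table holds only the four letter pairs that this page lists explicitly for ARABIC_LETTERS; the remaining 22 pairs are deliberately left out because they are not shown here. IllegalArgumentException stands in for NaftahBugError, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class LetterMapSketch {
    // Partial table: only the pairs documented on this page.
    private static final Map<Character, Character> ARABIC_TO_LATIN = new HashMap<>();
    static {
        ARABIC_TO_LATIN.put('ا', 'A');
        ARABIC_TO_LATIN.put('ب', 'B');
        ARABIC_TO_LATIN.put('ت', 'C');
        ARABIC_TO_LATIN.put('ه', 'Z');
    }

    static String convert(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            if (ch >= '٠' && ch <= '٩') {
                out.append((char) ('0' + (ch - '٠')));    // Arabic-Indic digit -> ASCII digit
            } else if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9')) {
                out.append(ch);                           // Latin letters and ASCII digits pass through
            } else if (ARABIC_TO_LATIN.containsKey(ch)) {
                out.append(ARABIC_TO_LATIN.get(ch));      // positional letter mapping
            } else {
                // Stand-in for NaftahBugError, which is not available outside the library.
                throw new IllegalArgumentException("Unsupported character: " + ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("ابت٩x")); // prints ABC9x
    }
}
```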
    • isLatinLetter

      public static boolean isLatinLetter(char ch)
      Checks whether a character is a Latin letter (A-Z or a-z).
      Parameters:
      ch - the character to check
      Returns:
      true if the character is a Latin letter; false otherwise
    • isAsciiDigit

      public static boolean isAsciiDigit(int ch)
      Checks whether a character is an ASCII digit (0-9).
      Parameters:
      ch - the character to check
      Returns:
      true if the character is an ASCII digit; false otherwise
    • isArabicIndicDigit

      public static boolean isArabicIndicDigit(char ch)
      Checks whether a character is an Arabic-Indic digit (٠ to ٩).
      Parameters:
      ch - the character to check
      Returns:
      true if the character is an Arabic-Indic digit; false otherwise
    • numberToString

      public static String numberToString(Number number)
      Converts a Number into a string, replacing the ASCII decimal separator (.) with the comma-like character U+066C (named ARABIC THOUSANDS SEPARATOR in Unicode; the dedicated ARABIC DECIMAL SEPARATOR is U+066B), and optionally converting ASCII digits (0–9) to Arabic-Indic digits (٠–٩).

      If the system property naftah.number.arabicIndic.active is set to true, this method converts each ASCII digit to its Arabic-Indic equivalent. Otherwise, digits remain unchanged.

      This method does not use locale-aware formatting; it operates directly on the string representation of the number returned by Object.toString().

      Parameters:
      number - the number to convert; must not be null
      Returns:
      a string representing the number with a decimal separator, and optionally Arabic-Indic digits
      Throws:
      NullPointerException - if number is null
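
      The conversion described above can be sketched in plain Java. The sketch follows the description literally, including the U+066C separator, and exposes the digit switch as an explicit parameter so that it can be exercised without setting the system property; that two-argument overload and the class name are hypothetical.

```java
import java.util.Objects;

final class NumberToStringSketch {
    static String numberToString(Number number, boolean arabicIndic) {
        Objects.requireNonNull(number, "number");
        // Works on toString() output; no locale-aware formatting is involved.
        String text = number.toString().replace('.', '\u066C');
        if (!arabicIndic) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (char ch : text.toCharArray()) {
            // Shift ASCII digits into the Arabic-Indic digit range, leave everything else as is.
            out.append(ch >= '0' && ch <= '9' ? (char) ('\u0660' + (ch - '0')) : ch);
        }
        return out.toString();
    }

    static String numberToString(Number number) {
        // Boolean.getBoolean returns false when the property is unset.
        return numberToString(number, Boolean.getBoolean("naftah.number.arabicIndic.active"));
    }

    public static void main(String[] args) {
        System.out.println(numberToString(12.5, true));
    }
}
```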