How to strip a Hebrew text of vowels and punctuation in AppleScript

applescript internationalization

Take the first several verses of Genesis, in Hebrew, for example:

בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)

(That (פ) for some reason isn't formatting properly in the blockquote, but it does in my text file.)

Now, I'd like to strip this text of all characters except the standard 27-letter Hebrew alphabet אבגדהוזחטיכךלמםנןסעפףצץקרשת, plus line breaks (which Script Editor automatically parses as \n) and the verse and paragraph markers (׃ and (פ) or (ס)). You'll notice that several lines contain hyphens (־); those should be replaced with a space. Some lines also contain ׀; those should be replaced with a single space. When done, it should look like:

בראשית ברא אלהים את השמים ואת הארץ׃

והארץ היתה תהו ובהו וחשך על פני תהום ורוח אלהים מרחפת על פני המים׃

ויאמר אלהים יהי אור ויהי אור׃

וירא אלהים את האור כי טוב ויבדל אלהים בין האור ובין החשך׃

ויקרא אלהים לאור יום ולחשך קרא לילה ויהי ערב ויהי בקר יום אחד׃ (פ)

I tried something simple at first: put the Hebrew alphabet plus ׃, (, and ) in a list, set x to the length of the input string, then repeat over each character of the string: if it's in the list, append it to the output; if it's a -, append a space to the output; if it's a \ and the next character is an n, append \n to the output; and if there are two spaces in a row, delete the second.

I logged the output and got some gibberish:

(*אאית   א    ים  ת     ם   ת    ץץץץץץץץ    ה  הה   הה       ללללי    ם         ים     ת  ללללי    םםםםםאאר    ים   י   ר    ייייררררררא    ים  תתתתתר  ייייב     ל    ים  ין    ר   ין           א    ים    אאא   ם         א    ה    ייייב    ייייר   ם   דד (פ)*)

which seems to be every letter in the passage that has no vowel, duplicated whenever the following letter(s) do have one. (The repeats are my mistake; I wrote the repeat loop poorly.) What left me wondering is that it skips over consonants that carry vowels.

So I did a test:

set charNum to ASCII number "בְּ"
log charNum
set charNum to ASCII number "ב"
log charNum
-->result: (*63*) (*63*)

Although in the text editor, vowels and the like are separate characters overlaid on the previous ones, Script Editor doesn't see it that way, and sees בְּ and ב as the same letter. And yet, when comparing it to my list, it doesn't recognize the character and skips it.

So how can I strip the vowels and the like from the letters while not doing an if-loop for any possible letter and vowel combination?

Best Answer

ASCII number is deprecated and doesn't work correctly with Unicode text; use id of someCharacter instead:

set charNum to id of "בְּ" -- this returns the ids of 3 characters because "בְּ" is a composed character
log charNum
set charNum to id of "ב"
log charNum
-->result: 
(*1489, 1456, 1468*)
(*1489*)

I don't know of a way to do this in pure AppleScript.

However, you can run a perl command through do shell script:

-- The text may look wrong in this code block, but it will be correct once the script is compiled
set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"


return do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' <<< " & quoted form of theString

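Here is a brief explanation of the perl script:

the -CSD option : standard input, output, and error are treated as UTF-8, and UTF-8 is also the default encoding for any files perl opens
s~\\p{NonspacingMark}~~og : removes the nonspacing marks, i.e. the vowel points and cantillation marks (the backslash is doubled only because it sits inside an AppleScript string)
s~־|׀~ ~g : replaces every ־ and ׀ with a space
s~ +~ ~g : replaces a run of several spaces with a single space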

If your AppleScript reads the text from a file, perl can read the file directly:

do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' < " & quoted form of posix path of pathOfTheTextFile

The file must be encoded as UTF-8.

Another solution is to use a Cocoa-AppleScript:

        use framework "Foundation"
        use scripting additions
        -- The text may look wrong in this code block, but it will be correct once the script is compiled
        set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"

        return stripString(theString)

        on stripString(t)
            set sourceString to current application's NSMutableString's stringWithString:t
            set myOpt to current application's NSRegularExpressionSearch
            set theSuccess to sourceString's applyTransform:(current application's NSStringTransformStripCombiningMarks) |reverse|:false range:(current application's NSMakeRange(0, (sourceString's |length|))) updatedRange:(missing value)
            if theSuccess then
                -- *** Replace all "־" and "׀" by a space, each character must be separated by a vertical bar character, e.g. "a|d|z"
                sourceString's replaceOccurrencesOfString:"־|׀" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))

                -- **** Replace multiple spaces in a row by one space
                sourceString's replaceOccurrencesOfString:" +" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))
                return sourceString as string -- convert the NSString object to an AppleScript's string
            end if
            return "" -- else, the transform was not applied
        end stripString

In response to the comments:

For a droplet, the script needs an on open handler, like this:

on open theseFiles
    repeat with f in theseFiles
        set cleanText to do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
        -- do something with that cleanText
    end repeat
end open

If you want to edit the files in place, the perl command needs the -i option followed by a backup extension such as '.bak':

This creates a backup of each file (it appends ".bak" to the name):

on open theseFiles
    repeat with f in theseFiles -- ***  create a backup and edit the file in-place ***
        do shell script "perl -i'.bak' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
    end repeat
end open

If you don't want a backup of each file, pass the -i option with an empty extension (''), like this:

-- ***  edit the file in-place without backup***
do shell script "perl -i'' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f

Related Solutions

MacOS – What are these ???????????? characters, and how can I use them

They are mathematical alphanumeric characters, which look like glyph variants of basic Latin letters but have been encoded as separate characters, due to their special use in mathematical notations. Italic, bold face, and even use of a sans-serif form vs. serif form may carry an essential difference of meaning in mathematics. For example, a bold italic “a” may denote a vector, in a context when a normal-weight italic “a” denotes a scalar variable. Normally, such distinctions are made with styling or with markup, but the mathematical alphanumeric characters let you make the distinction in plain text, when desired or needed.

The shapes of these characters vary by font, even though the basic idea allows less glyph variation than for normal letters. A mathematical italic letter can still take different shapes. So no, they do not look the same to everyone else. See e.g. some samples of mathematical italic a in different fonts.

Moreover, not everyone sees them at all. Few fonts contain them, and it is quite possible that someone is using a computer where no font has them.

So it’s a matter of characters, not fonts. And these characters “are intended for use only in mathematical or technical notation, and not in nontechnical text” (Unicode Standard, chapter 15, page 481).

They are not used much, but people might be using them without knowing what happens. If you use a sufficiently new version of Microsoft Word and enter a formula, using the formula mode, and type an “a”, Word will actually convert the character to mathematical italic a.

Normally, you cannot type these characters directly. You would need to use a character picker like the one mentioned, or some input method based on the Unicode number of a character. But it is possible to create a keyboard driver that lets you type these characters using normal keyboard keys and some special keys – or to programmatically convert normal characters to these characters, as Word does.

There is nothing Apple-specific about this. Input methods vary by system and software, of course, like for other characters.

Applescript not returning string

on run
    set cat to value of variable "cat" of front workflow (*I have already defined the variable elsewhere in the workflow*)
    set num to {}
    set myString to ""
    tell application "Contacts"
        repeat with i in cat
            set inGroup to group i
            set phoneProps to value of phones of inGroup's people
            repeat with i in phoneProps
                try
                    set myString to myString & "\n" & first item of i (*gets first number only*)
                on error
                    set myString to myString & "\n" & "blank field" (*covers for empty phone number, otherwise would halt on error*)
                end try
            end repeat
        end repeat
        return myString
    end tell  
end run
