Take the first several verses of Genesis, in Hebrew, for example:
בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃
וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃
וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃
וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃
וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)
(That (פ) for some reason isn't formatting properly in the blockquote, but it does in my text file.)
Now, I'd like to strip this text of all characters except for the standard 27-letter Hebrew alphabet (אבגדהוזחטיכךלמםנןסעפףצץקרשת), plus line breaks (which Script Editor automatically parses as \n) and line and paragraph breaks (: and (פ) or (ס)). You'll notice on several lines that there are hyphens – those should be replaced with a space. Some lines also contain | – those should be replaced with a single space. When done, it should look like:
בראשית ברא אלהים את השמים ואת הארץ׃
והארץ היתה תהו ובהו וחשך על פני תהום ורוח אלהים מרחפת על פני המים׃
ויאמר אלהים יהי אור ויהי אור׃
וירא אלהים את האור כי טוב ויבדל אלהים בין האור ובין החשך׃
ויקרא אלהים לאור יום ולחשך קרא לילה ויהי ערב ויהי בקר יום אחד׃ (פ)
I tried something simple at first – set the Hebrew alphabet plus a space, (, and ) to a list, set x to the length of the inputted string, then do a repeat for each character of the string: if it's on the list, then append it to the output; if it's a -, append a space to the output; if it's a \ and the next one is an n, append \n to the output; and if there are two spaces in a row, delete the second.
I logged the output and got some gibberish:
(*אאית א ים ת ם ת ץץץץץץץץ ה הה הה ללללי ם ים ת ללללי םםםםםאאר ים י ר ייייררררררא ים תתתתתר ייייב ל ים ין ר ין א ים אאא ם א ה ייייב ייייר ם דד (פ)*)
which seems to be every letter in the passage without a vowel, duplicated in the event that the following letter(s) do. (My mistake on the repeats – wrote the repeat loop poorly.) But that it skips over consonants that also have vowels is what left me wondering.
So I did a test:
set charNum to ASCII number "בְּ"
log charNum
set charNum to ASCII number "ב"
log charNum
-->result: (*63*) (*63*)
Although in the text editor, vowels and the like are separate characters overlaid on the previous ones, Script Editor doesn't see it that way, and sees בְּ and ב as the same letter. And yet, when comparing it to my list, it doesn't recognize the character and skips it.
So how can I strip the vowels and the like from the letters while not doing an if-loop for any possible letter and vowel combination?
Best Answer
ASCII number is deprecated and doesn't work correctly with Unicode text; use id of someCharacter instead.
I do not know how to do this in pure AppleScript, but you can use a perl command in a do shell script.
Here is a brief explanation of the perl script:
-CSD option : the output and the error will be in UTF-8, the input is assumed to be in UTF-8
s~\\p{NonspacingMark}~~og : Remove non-spacing marks
s~־|׀~ ~g : Replace all ־ and ׀ by a space
s~ +~ ~g : Replace multiple spaces in a row by one space
If your AppleScript reads the text from a file, you can use perl to read the file.
The encoding of the file must be utf8.
Another solution is to use a Cocoa-AppleScript:
Following up on the comments:
For a droplet, the script needs an on open handler.
If you want to do in-place editing, the perl script needs the -i option plus a backup extension ('.some name extension'). This will create a backup of each file (it adds ".bak" after the name).
If you don't want a backup of each file, the perl script needs the -i option plus an empty extension ('').