MacOS – How to introduce a list object that is stored in a .txt file to an AppleScript

applescriptmacosperformancetext;

I have an AppleScript .scpt file, triggered by a key combination in FastScripts.app, that functions as a thesaurus. The script looks up the selected word in a pre-formatted list, and, if the word is found in this list, it displays the synonyms of this word to the user¹.

This list is contained in a plain text (.txt) file. The list is already formatted in the AppleScript list format. I would like my .scpt file to be able to accept this text as a true list².

It is important to note that the .txt file contains 2.5 million words³.

This is why I am not simply copying the contents of the .txt file into the .scpt file itself, despite the fact that the text file is 100% static and will never be altered. Inserting the text directly into my script would bring with it considerable lag and sluggishness as I edit and compile my .scpt file in Script Editor.app.

Script Editor.app froze every time that I tried to read the .txt file. The problem is that Script Editor reads into the memory a given text file in its entirety, instead of streaming the contents in a more efficient manner. So, I broke this text file up into 10 smaller text files⁴, each new .txt file containing about 250,000 words.

At 250,000 words, of course, the text files are still extremely large (by any standard).

Here is a (severely condensed) example of what the contents of each text file looks like:

{{"exaltation","accolade","adulation","advance","advancement"},{"exalted","winnowing","winsome"},{"exam","audition","blue book","examen","examination","final","examination","test","trial","tripos","viva","written","written examination"},{"examination","Pap test","Socratic method","airing","analysis","anatomic diagnosis","appraisal","work-up","written","written examination"},{"examine","air","analyze","appraise","archetype","asleep","assess","canvass","case"},{"examiner","analyst","analyzer","asker"},{"examining","analytic","examinational","exploratory"},{"example","admonishment","admonition","alarm","archetype"},{"exasperate","bedevil","vex","work up","worry"},{"exasperated","aggravated","amplified","angry","annoyed"},{"exasperating","annoying","bothering","bothersome"}}

As you can see, the contents of the text file is a nested list⁵ that is organized in the same way that AppleScript formats a list. Each text file contains no line breaks or paragraphs.

I am looking for a method to get this list into my AppleScript, with as little latency as possible⁶. This is why I pre-formatted it. So, speed is key.

_Footnotes:

^1. _{My thesaurus script is similar to the built-in thesaurus feature that exists in Microsoft Word. One notable difference is that my script works system-wide.}

^2. _{By true list, I mean that I can call, for example, item 12 of this list later on in my AppleScript.}

^3. _{My source for the thesaurus data is Grady Ward's "Moby" Thesaurus. I found this database from this answer: Looking for Thesaurus Data – Stack Overflow}

^4. _{I had to use Hex Fiend.app to cut from the text file and paste into a new text file. I could not edit the file in TextEdit.app, without TextEdit freezing on me.}

^5. _{The outer list contains each thesaurus entry. The inner lists contain all of the synonyms for that entry. The first item of each inner list is the entry title. Both the outer list and each inner list are ordered alphabetically (with the exception of the first word of each inner list, because, again, this word is the entry title).}

^6. _{I understand that even the fastest method will still have several seconds of latency, since the text file is so large.}

Best Answer

Obviously, I do not know the total scope of what you're doing or how you have other things coded, as you have not supplied all the details and code, however, I would take a different approach.

I downloaded the Moby Thesaurus from the linked page in your question and preformed the following actions on it.

Extracted the contents of the mthes.tar.Z file.
Opened the ./mthes/mobythes.aur file in TextWrangler and noticed two things to change.
- Change the line endings from Classic Mac (CR) to Unix (LF).
- Removed unwanted trailing commas from 6 lines.

Note that while I could make these changes in TextWrangler, nonetheless I prefer to use Terminal, and did so using the following command:

tr "\r" "\n" < mobythes.aur | sed -E 's/[,]{1,}$//' > mobythes.txt

Which took but literally a second to do (as I actually prefaced the above command with time, out of curiosity). With the mobythes.aur file having now been processed, saved to mobythes.txt and copied to my Documents folder, I will use this new plain CSV file as is, to query the search string for a match to the first field of each record and return the record, sans the first field, as a list to choose from in AppleScript. I found this method to be extremely fast, while searching for "zoom" the last record in the CSV file, it took but a second to return and create the list for that record on the fly.

In AppleScript Editor I use the following code to test against the plain CSV file as a single file containing the 30,260 lines with 2.5 million synonyms and related words.

set AppleScript's text item delimiters to ""
set theMobyThesaurus to POSIX path of (path to documents folder) & "mobythes.txt"

set theSearchString to the text returned of (display dialog "Find synonyms for:" default answer "" buttons {"Cancel", "Search"} default button 2 with title "Search Moby Thesaurus")

if theSearchString is not equal to "" then

    try
        set theSearchResults to (do shell script "grep -i -m 1 '^" & theSearchString & ",' " & theMobyThesaurus)
    on error
        display dialog "No match for \"" & theSearchString & "\" available." buttons {"OK"} default button 1
        return
    end try

    if theSearchResults is not equal to "" then
        set AppleScript's text item delimiters to ","
        set theSynonymsList to items 2 thru -1 of text items of theSearchResults as list
        set AppleScript's text item delimiters to ""

        choose from list theSynonymsList with prompt "Choose a synonym for: " & linefeed & theSearchString
        if the result is not false then
            set theChosenWord to (item 1 of the result)
        end if
    end if

end if

In this example, assuming a search match was made and nothing canceled, then the theChosenWord variable now contains what was chosen from the displayed list and can be processed further as needed/wanted.

Note that this is of course strictly example code for testing purposes and will need to be adapted to your use case scenario while incorporating appropriate error handling as needed.

I believe this is going to be the fastest way while leaving the Moby Thesaurus as a single CSV file, and is probably faster then whatever methods you tried thus far.

Related Solutions

How to Convert A .TXT File To A .CSV File Without Losing Any Of The Content Of That Document

I am simply trying to convert a .txt file to a .csv, without losing the contents of that document

You can just change the extension .txt into .csv.

'Comma separated value' files are just plain text files with the data columns separated by comma's (or ; or tab). 'txt' files are meant for text, and 'csv' files are ment for data storage, but it are both just text files.

Add on: The data looks organised like a dictionary type (or like a multidimensional array). A dictionary can not be converted to a 2D dataframe structure (like csv or excel) as it can have more dimensions than 2. What you want is not possible. But replacing csv with txt makes it readable though...

This is unreadable text: @1?舋H????A?!A?̃?-ƅm??I?؅?uҸ?

MacOS – How to use AppleScript to create a new Rich Text Format file from clipboard text

This answer has been updated to reflect rubik's sphere's misunderstanding between what Rich Text is (what was originally asked about) and what's actually being worked with from Google Chrome, being HTML. (See comments moved to chat.)

I'm leaving the original answer as is, and below this new content, as it does technically answer the original question as asked. It also contains relevant information to the overall process of dealing with Clipboard content in the context of the original and modified question.

The code below is example code to be tested and run within (Apple) Script Editor, as aside from the first sentence in the question, no explicit and specific usage within Automator has, yet, been given. The code may need to be edited to work within the unknown usage in Automator. That said, as written, if the entire script below was placed by itself into a Run AppleScript action, by itself, in Automator... it works as is. If using only segments of the code, some changes to the existing code could need to be made.

The code below contains sufficient commenting so as to understand overall what the script is doing.

--        # 
--        # Change the 'New RichText Filename.rtf' name to the wanted filename for the target file.
--        # Make sure you leave the double-quotes even if the filename does not contain spaces!
--        # Note that '(path to desktop as text)' can also be modified as needed, e.g. changed to,
--        # '(path to documents folder as text)' or the entire segment after 'set theRichTextFileName to'
--        # can be a fully qualified POSIX pathname, e.g.: set theRichTextFileName to "/path/to/filename.rtf"
--        #  

set theRichTextFileName to POSIX path of (path to desktop as text) & "New RichText Filename.rtf"


--    #        THE REMAINING CODE SHOULD NOT NEED TO BE MODIFIED.
--    # 
--    #     Note: This code, as is, works as written and intended when run from within (Apple) Script Editor.
--    #     Some AppleScript code when wrapped in Automator may not work the same as in (Apple) Script Editor.
--    #     In cases where is does not work from a Run AppleScript action in Automator, editing will be required.
--    # 
--    #     This AppleScript code preforms the following actions, sans errors caught during File I/O operations.
--    # 
--    # 1. See it the target file exists and prompts to be overwritten if it does. If yes is selected, it continues.
--    # 
--    # 2. If the Clipboard contains RTF content, writes it to the target file using plain AppleScript.
--    # 
--    # 3. If the Clipboard contains HTML content, writes it to the target file as RTF using a 'do shell script' command.
--    # 
--    # 4. If the Clipboard does not contain any RTF/HTML content, notify the user. 
--
--        #    Notes: The 'do shell script' makes use of the following:
--        #
--        #    'osascript' to get the HTML content from the Clipboard.
--        #    The content is a Hex stream within a data wrapper and
--        #    'awk' will be used to remove/replace the data wrapper.
--        #
--        #    'awk' to remove the data wrapper from 'osascript' output
--        #    replacing it with proper HTML opening/closing Tags to
--        #    ensure it actually gets processed by 'textutil' after 'xxd'.
--        #    Without the HTML opening/closing Tags 'textutil' does not 
--        #    properly, within limits, convert the HTML Clipboard content to RTF.
--        #
--        #    'xxd' to convert the Hex data from 'osascript/awk' to ASCII text.
--        #
--        #    'textutil' to convert the ASCII text HTML from 'xxd' to RTF
--        #    formatted data and write it to the target file.


tell application "Finder"

    if exists theRichTextFileName as POSIX file then
        tell current application
            display dialog "The file \"" & theRichTextFileName & "\" already exists!" & "\n\n" & "Do you want to overwrite the file?" buttons {"No", "Yes"} default button 1 with title "File Already Exists..." with icon caution

            if the button returned of the result is "No" then
                return
            else
                tell application "Finder"
                    delete the file (theRichTextFileName as POSIX file)
                end tell
            end if
        end tell
    end if

    tell current application

        --    # Find out what class types are available for the Clipboard content
        --    # and use this information to determine which action will be taken.

        set cbInfo to get (clipboard info) as string

        if cbInfo contains "RTF" then

            try
                set richTextfromClipboard to get the clipboard as «class RTF »
            on error eStr number eNum
                display dialog eStr & " number " & eNum buttons {"OK"} default button 1 with icon caution
                return
            end try
            try
                set fileHandle to open for access theRichTextFileName with write permission
                write richTextfromClipboard to fileHandle
                close access fileHandle
            on error eStr number eNum
                display dialog eStr & " number " & eNum buttons {"OK"} default button 1 with title "File I/O Error..." with icon caution
                try
                    close access fileHandle
                end try
            end try

        else if cbInfo contains "HTML" then

            try
                do shell script "osascript -e 'try' -e 'get the clipboard as «class HTML»' -e 'end try' | awk '{sub(/«data HTML/, \"3C68746D6C3E\") sub(/»/, \"3C2F68746D6C3E\")} {print}' | xxd -r -p | textutil -convert rtf -stdin -stdout > " & quoted form of theRichTextFileName
            on error eStr number eNum
                display dialog eStr & " number " & eNum buttons {"OK"} default button 1 with icon caution
            end try

        else

            display dialog "The Clipboard does not contain\nany usable RTF/HTML content!" buttons {"OK"} default button 1 with title "No RTF/HTML Content on Clipboard..." with icon caution

        end if

    end tell

end tell

Original answer to the original question asked:

To get Rich Text content from the Clipboard into a file using AppleScript, it's a bit more complex then a simple do shell script command.

The example AppleScript code below will, if the target file doesn't already exist and if Rich Text content exists on the Clipboard, write it to a file. It will have all the attributes as the RichText content on the Clipboard has, as it did when copied to the Clipboard.

Open Script Editor and copy and paste the code below into a new Untitled document and then run it from Script Editor, reviewing the output in Events/Replies. Run it a couple of times, with and without Rich Text content on the Clipboard and with and without the existence of the file, on the hard drive, defined by set theRichTextFileName ... at the start of the script.

You'll see the code makes sure the file doesn't exist, so as not to overwrite an exiting file of the target name and location and if the Clipboard doesn't contain Rich Text content, it displays a message for that too.

Now if using this in an Automator Service e.g., where Service receives selected rich text, then the code can be modified not to trap for an error if Rich Text content isn't on the Clipboard as the service will not appear on the Services menu if Rich Text is not selected in a document. Also if you want to overwrite the target file at its designated location the code around that can be removed too. I'll give those code examples as well.

Example code to paste into Script Editor for testing and review:

set theRichTextFileName to POSIX path of (path to documents folder as text) & "New RichText Filename.rtf"

tell application "Finder"
    if exists theRichTextFileName as POSIX file then
        tell current application
            display dialog "The file \"" & theRichTextFileName & "\" already exists!" buttons {"OK"} default button 1 with title "File Already Exists..." with icon caution
        end tell
    else
        tell current application
            try
                set richTextfromClipboard to get the clipboard as «class RTF »
            on error eStr number eNum
                display dialog eStr & " number " & eNum buttons {"OK"} default button 1 with title "No Rich Text Content on Clipboard..." with icon caution
                return
            end try
            try
                set fileHandle to open for access theRichTextFileName with write permission
                write richTextfromClipboard to fileHandle
                close access fileHandle
            on error eStr number eNum
                display dialog eStr & " number " & eNum buttons {"OK"} default button 1 with title "File I/O Error..." with icon caution
                try
                    close access fileHandle
                end try
                return
            end try
        end tell
    end if
end tell

Example code to use in an Automator Service e.g., where Service receives selected rich text:

set theRichTextFileName to POSIX path of (path to documents folder as text) & "New RichText Filename.rtf"

tell application "Finder"
    if exists theRichTextFileName as POSIX file then
        tell current application
            display dialog "The file \"" & theRichTextFileName & "\" already exists!" buttons {"OK"} default button 1 with title "File Already Exists..." with icon caution
        end tell
    else
        tell current application
            set richTextfromClipboard to get the clipboard as «class RTF »
            set fileHandle to open for access theRichTextFileName with write permission
            write richTextfromClipboard to fileHandle
            close access fileHandle
        end tell
    end if
end tell

Example code to use in an Automator Service e.g., where Service receives selected rich text and overwrites existing target file:

set theRichTextFileName to POSIX path of (path to documents folder as text) & "New RichText Filename.rtf"

tell current application
    set richTextfromClipboard to get the clipboard as «class RTF »
    set fileHandle to open for access theRichTextFileName with write permission
    write richTextfromClipboard to fileHandle
    close access fileHandle
end tell

The image below is of an example Automator Service that creates the New RichText Filename.rtf file from selected Rich Text from the Create Rich Text file from Clipboard service on the Services Context menu (by right-click) or Application_name > Services > menu, when Rich Text is selected in a document.

Now these are just examples and additional logic can be coded to tailor it to meet ones needs. As an example, code could be added to automatically increment an existing file's name so as not to overwrite it, or prompt for a new file name and complete the operation vs. aborting with a message that the file already exists, etc.

Update for use with a do shell script command:

If you really want to do it using a do shell script command, then use the following code while replacing /path/to/new rich text file.rtf with a valid path filename to a location you have write permissions. Note not to remove the \" before and after /path/to/new rich text file.rtf in the actual command as this handles the path filename if containing spaces. If the path filename does not contain spaces, then the \" before and after /path/to/new rich text file.rtf does not need to be used.

do shell script "osascript -e 'try' -e 'get the clipboard as «class RTF »' -e 'end try' | awk '{print substr($0, 12, length($0)-13)}' | xxd -r -p > \"/path/to/new rich text file.rtf\""

Here's the command line showing as wrapped text, for easier viewing:

do shell script "osascript -e 'try' -e 'get the clipboard as «class RTF »' -e 'end try' | awk '{print substr($0, 12, length($0)-13)}' | xxd -r -p > \"/path/to/new rich text file.rtf\""

While one can copy and paste (recommend) the code, nonetheless here's how to type the Double Angle Quotation Marks, which are also available under Parentheses in (Special) Characters, e.g. optioncommandT in TextEdit.

Note: Keep in mind the spacing in «class RTF » is meant to be offset in this use case.
- Typing Double Angle Quotation Marks
  - « Left Double Angle Quotation Mark, press: option\
  - » Right Double Angle Quotation Mark, press: shiftoption\

Note that, as written, this do shell script command overwrites the output file if it already exists, without prompting! It will either be a zero length file if the Clipboard does not contain any Rich Text content, otherwise the file will be of necessary length to contain the Rich Text content from the Clipboard. Obviously, additional logic could be coded into the osascript command however if you need more complexity then this you're better off using the method first presented in this answer. Or using an external script being called by do shell script command that handles all the necessary logic and error handling based on the complexity of the overall conditions this will be applied under.

Why such complexity vs. do shell script "pbpaste > /path/to/clipboard-file.rtf"?

That's a good question and while pbpaste does have the -Prefer {txt | rtf | ps} option, nonetheless pbpaste -Prefer rtf may not output Rich Text even if it exists on the Clipboard. Or what it does output, if not ASCII Text, will not be a form of Rich Text that is understood, by e.g TextEdit, and or will not contain all the Rich Text attributes if any, that the Clipboard content contains.

This makes it necessary to get the Rich Text content on Clipboard in a different manner and why get the clipboard as «class RTF » is being used instead. When using a do shell script command with this, it requires additional processing to utilize the data returned, as it's in a data wrapper when returned and not immediately usable, thus requires further processing.

As an example, Hello World! in Rich Text on the Clipboard can look like this in ASCII Text:

{\rtf1\ansi\ansicpg1252\cocoartf1187\cocoasubrtf400
{\fonttbl\f0\fnil\fcharset0 ComicSansMS;}
{\colortbl;\red255\green255\blue255;}
\pard\tx720\tx1440\tx2160\tx2880\tx3600\tx4320\tx5040\tx5760\tx6480\tx7200\tx7920\tx8640\pardirnatural

\f0\b\fs36 \cf0 Hello World!}

The above format unfortunately is not easily usable, if at all, in AppleScript and I believe that's why being grabbed as Hex encoded data is required.

I'm basing this on the fact that even though the get clipboard info command for the Clipboard content in this example returns {«class RTF », 265} among the info returned, nonetheless while the ASCII Text form of this Rich Text content is 265 bytes long it is returned in Hex format at over twice the bytes with the data wrapper. The fact that it's returned in Hex by get the clipboard as «class RTF » in both the Script Editor or using osascript support this supposition.

Here's the same Hello World! in Rich Text on the Clipboard in Hex:

7B5C727466315C616E73695C616E7369637067313235325C636F636F61727466313138375C636F636F617375627274663430300A7B5C666F6E7474626C5C66305C666E696C5C66636861727365743020436F6D696353616E734D533B7D0A7B5C636F6C6F7274626C3B5C7265643235355C677265656E3235355C626C75653235353B7D0A5C706172645C74783732305C7478313434305C7478323136305C7478323838305C7478333630305C7478343332305C7478353034305C7478353736305C7478363438305C7478373230305C7478373932305C7478383634305C7061726469726E61747572616C0A0A5C66305C625C66733336205C6366302048656C6C6F20576F726C64217D

However, what's returned by get the clipboard as «class RTF » for the Hello World! example above is:

«data RTF 7B5C727466315C616E73695C616E7369637067313235325C636F636F61727466313138375C636F636F617375627274663430300A7B5C666F6E7474626C5C66305C666E696C5C66636861727365743020436F6D696353616E734D533B7D0A7B5C636F6C6F7274626C3B5C7265643235355C677265656E3235355C626C75653235353B7D0A5C706172645C74783732305C7478313434305C7478323136305C7478323838305C7478333630305C7478343332305C7478353034305C7478353736305C7478363438305C7478373230305C7478373932305C7478383634305C7061726469726E61747572616C0A0A5C66305C625C66733336205C6366302048656C6C6F20576F726C64217D»

The Hex encoded string is in a «data RTF » wrapper which needs to be removed before converting the Hex encoded content to ASCII Text to be written to a disk file by the use of I/O Redirection, e.g.>, in the do shell script command example above.

So, the output of osascript -e 'try' -e ' get the clipboard as «class RTF »' -e 'end try' gets piped (|) to awk where it creates a substring, printing only the Hex encoded content itself, not the data wrapper portion, as that would not be processed properly by xxd in the next step of the process.

It then has to be piped (|) to xxd for conversion to ASCII Text to be written to a disk file using I/O Redirection, e.g.>, to the target path filename.

The image below is of Clipboard Viewer toggling between ASCII Text and Hex encoding views, showing Hello World! copied from a Rich Text Document, the one used in this example.

I hope this provides a better understanding of how AppleScript works with Rich Text content on the Clipboard, as either way conversion from a Hex encoded string to ASCII Text has to take place and this is being done transparently in the original example code while requiring additional processing outside of AppleScript code being processed by osascript when using the do shell script command in this context.

Best Answer

Related Solutions

How to Convert A .TXT File To A .CSV File Without Losing Any Of The Content Of That Document

MacOS – How to use AppleScript to create a new Rich Text Format file from clipboard text

Related Question