How to remove unwanted text from a string

applescript

I have an AppleScript that returns the title from a website. The only issue is that it also contains lots of unwanted HTML (I think?). Most of the time, I can overcome this by removing the common characters using the following code.

on CharacterRemover(inputString, ReplaceChar)

set TID to AppleScript's text item delimiters
set AppleScript's text item delimiters to ReplaceChar
set pieces to text items of inputString -- break string apart at each occurrence of ReplaceChar
set AppleScript's text item delimiters to "" -- join with nothing, i.e. delete ReplaceChar
set inputString to pieces as text -- put string back together without ReplaceChar
set AppleScript's text item delimiters to TID

return inputString

end CharacterRemover

set FirstTitle to "<!-- react-text: 45 -->“<!-- /react-text --><!-- 
react-text: 46 -->Megan Fox<!-- /react-text --><!-- react-text: 47 -- 
>”<!-- /react-text -->" --the format of the returned title
set FirstTitle to CharacterRemover(FirstTitle, "-")
set FirstTitle to CharacterRemover(FirstTitle, ">")
set FirstTitle to CharacterRemover(FirstTitle, "<")
set FirstTitle to CharacterRemover(FirstTitle, "!")
set FirstTitle to CharacterRemover(FirstTitle, "/")
set FirstTitle to CharacterRemover(FirstTitle, "reacttext")
set FirstTitle to CharacterRemover(FirstTitle, ":")
set FirstTitle to CharacterRemover(FirstTitle, "”")
set FirstTitle to CharacterRemover(FirstTitle, "“")

set z to 0

repeat 10 times
set FirstTitle to CharacterRemover(FirstTitle, z)
set z to z + 1
end repeat

set FirstTitle to CharacterRemover(FirstTitle, " ")

display dialog FirstTitle

However, since this code removes the numbers, when I get titles such as

<!-- react-text: 477 -->“<!-- /react-text --><!-- react-text: 478 -->iPhone 8<!-- /react-text --><!-- react-text: 479 -->”<!-- /react-text -->

it returns "iPhone" instead of "iPhone 8".

Edit: on the website "higherorlower.com", I am using JavaScript's "document.getElementsByClassName" to return the title of the given search amount.

Any ideas on how to overcome this?

Best Answer

I would advise you to look at (and, if you wish, share) the method you're using to retrieve the information from the website. The best and most reliable option would be a different retrieval method, so that you don't have to deal with the ReactJS comments at all.

Had you included that part of your AppleScript along with the rest, there might have been a chance to solve your problem at its source.

Nonetheless, here's one method of stripping the tags from your text strings, though by no means the only method, nor necessarily the most graceful or efficient. But it's reasonably clean and, presuming the tags are all simple ReactJS comment tags, it will do a reliable job.

    set string1 to "<!-- react-text: 45 -->“<!-- /react-text --><!-- \nreact-text: 46 -->Megan Fox<!-- /react-text --><!-- react-text: 47 -- \n>”<!-- /react-text -->"
    
    set string2 to "<!-- react-text: 477 -->“<!-- /react-text --><!-- react-text: 478 -->iPhone 8<!-- /react-text --><!-- react-text: 479 -->”<!-- /react-text -->"
    
    stripTags from string1 --> "“Megan Fox”"
    stripTags from string2 --> "“iPhone 8”"
    --------------------------------------------------------------------------------
    to stripTags from s as text
        local s
        
        # Eliminate linebreaks and join to form one line of text
        set the text item delimiters to {null, linefeed, return}
        set s to the text items of s as text
        
        # Use bash to isolate all the various tags within the string
        # Note: not suitable for tags with irregular content, such as
        # any that unexpectedly contain '<' or '>' as part of their
        # text content.  However, that shouldn't be an issue here.
        do shell script "egrep -io -e '<[^>]+>' <<<" & the quoted form of s
        
        # Use the tags as a basis for elimination using AS's TIDs
        set the text item delimiters to {null} & paragraphs of the result
        set s to the text items of s as text
        
        return s
    end stripTags

string1 is a copy of your variable FirstTitle, including the line breaks that it contained (I'm not sure whether these were in there intentionally or are an artefact of copying your script over into the browser). Their presence or absence doesn't affect the efficacy of my script; it merely necessitated the two lines at the start of the stripTags handler that get rid of them.

string2 is the text you supplied at the bottom of your question.

I've shown the output of each of these following processing. I retained the double so-called "smart" quotes that are part of the string and lie outwith the tags. I did see that you had opted to eliminate them, but their presence here, merely for demonstration purposes, is a nice visual reassurance that the script targets only the tags and preserves the text in between. I hope you don't mind if I leave those smart quotes for you to deal with as you wish.
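Since the title is already being fetched with JavaScript, the same idea could also be sketched on that side, without the shell call. This is only an illustration under assumptions: the function name stripTags is reused here for convenience, and the pattern is the same simple `<[^>]+>` one used in the handler, so it shares the limitation regarding stray '<' or '>' characters in the text.

```javascript
// Illustrative sketch only: a JavaScript counterpart of the stripTags handler.
// Assumes every tag is a simple comment tag with no '>' inside it.
function stripTags(s) {
  return String(s)
    .replace(/[\r\n]+/g, '')   // eliminate line breaks, joining into one line
    .replace(/<[^>]+>/g, '');  // remove everything that looks like a tag
}

const string1 = '<!-- react-text: 45 -->“<!-- /react-text --><!-- \nreact-text: 46 -->Megan Fox<!-- /react-text --><!-- react-text: 47 -- \n>”<!-- /react-text -->';
const string2 = '<!-- react-text: 477 -->“<!-- /react-text --><!-- react-text: 478 -->iPhone 8<!-- /react-text --><!-- react-text: 479 -->”<!-- /react-text -->';

console.log(stripTags(string1)); // “Megan Fox”
console.log(stripTags(string2)); // “iPhone 8”
```

As with the AppleScript version, the digits inside the text survive because only the tags are removed, not individual characters.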

Let me know if you have any queries.

ADDED 2018-05-12:

@cjeccjec Thank you for updating the website information with the correct URL. Tip for next time: include the code you're using to get the title. It'll be easier for people to help you and it will attract more help as well.

Luckily, this problem seems quite straightforward. Using getElementsByClassName() is a good idea, and you even managed to identify the classname of interest, term-keyword__keyword. Well done.

The elements assigned to that classname are <p> elements. They do have a title property, but it's empty, so I suspect that's not what you're using nor what you're after at all.

They also have a property called textContent, which, as it suggests, returns the text contained within the element, i.e. the labels of the items being compared in this game. I believe that's what you're after, and it's completely free from ReactJS tags.

This code returns an array of the textContent properties from the three p.term-keyword__keyword elements loaded on the site at any one time: the two currently visible and being compared, and one off-screen to the right, waiting to scroll into view for the next comparison.

    Array.from(document
              .getElementsByClassName('term-keyword__keyword'),
               e=>e.textContent.slice(1,-1)
              );

I also took the liberty of slicing off the quotes from the beginning and end of the texts.
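For reference, slice(1,-1) drops exactly one character from each end, whatever that character is. The sketch below shows this on a sample string, together with a slightly more defensive variant. The helper name unquote is hypothetical and is not part of the site or of the code above.

```javascript
// slice(1, -1) removes the first and last character unconditionally.
const label = '“iPhone 8”';
console.log(label.slice(1, -1)); // iPhone 8

// Hypothetical helper: only strip the curly quotes if they are really there,
// so that a label without quotes is left untouched.
function unquote(s) {
  return s.replace(/^“/, '').replace(/”$/, '');
}

console.log(unquote('“Moobs”')); // Moobs
console.log(unquote('Moobs'));   // Moobs
```

The plain slice is fine as long as every label is wrapped in quotes; the guarded variant may be worth considering if that ever stops being true.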

Incorporate this into AppleScript like so:

    tell application "Safari" to set labels ¬
        to do JavaScript "Array.from(document" & ¬
        ".getElementsByClassName('term-keyword__keyword')," & ¬
        "e=>e.textContent.slice(1,-1));" in the front document
    
    --> {"Microsoft Word", "Moobs", "Malaysia"}
    
    item 2 of labels --> "Moobs"

Those were the results that I got returned whilst playing the game. I was trying to guess whether "Microsoft Word" or "Moobs" had more internet searches, which I got correct; then "Malaysia" scrolled into view as I already knew it would.

Using this method, you don't need to strip any ReactJS tags away, nor any quote marks.