Firefox – Copying text from YouTube to Clipboard introduces dashes

character encoding, clipboard, firefox, special characters, youtube

Here's an example of a link I found on YouTube in the comments section of a video.

gnu.org/distros/free-distros.h­tml

This is the way it shows up in the comment.

If I highlight this link and copy to clipboard (ctrl+c), then go to a new browser tab and paste it (ctrl+v) in the address bar, then this is how it shows up.

gnu.org/distros/free-distros.h­tml

It looks the same, right? But if I hit Enter I get an error.

404 – Page Not Found

The page you were looking for could not be found on the GNU web
server.

If you followed a link that turned out to be broken, and the page with
the broken link mentions an explicit address to which to report bugs,
please use that address.

The URL also changes to the following.

http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF

If I remove %C2%ADtml%EF%BB%BF and type in tml so that I get back the address http://www.gnu.org/distros/free-distros.html and then hit Enter, well now it works, and the page loads.

I thought to myself that this is very strange so I tried pasting the same text from clipboard to a plain text editor (notepad) and this is what I got.

gnu.org/distros/free-distros.h­-tml

How was the dash between h and tml introduced? This is why I was getting the 404 error. But the URL appears correctly when pasted to the address bar. Is this some kind of hidden character perhaps?

Also, if I go back to YouTube and highlight the link, I can see that there is a bump on the last three letters. The highlighting is taller around "tml". You can see that in the screen capture below.

screen1

screen2

Why is this happening? What's going on? Could it be that Google is somehow intentionally salting the link?

Update

If I paste into Notepad++ (version 6.3) I get the following.

gnu.org/distros/free-distros.h­tml?

If I try to paste into the address bar of the Google Chrome browser, there appears to be some kind of hidden character at the end of the URL. See the screen capture below.

screen3

That's not a white space. It's something else… something alien! Something from planet X?

Note: The vertical line at the end is not the one I mean, that's just the text input cursor blinking.

Update 2

Here is the HTML code in Firefox, inspected with the element inspection tool.

screen4

Why is there a square within the opening wbr tag?

Update 3

The "square" appears to be the soft hyphen character entity. Here follows the actual source code of this particular line.

<p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>

The soft hyphen is the &shy; you see here. HTML tags, such as <wbr>, or <b> for bold text, are not selectable. When you highlight text on a web page in a browser, you are not selecting the HTML tags; nothing within <> is shown.

So it seems that the soft hyphen is the root cause of the copy and paste issue. It is not displayed on the web page, but it is selected when you highlight the text.
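As a rough cross-check, the entity in the page source and the escapes in the broken URL can be compared in a few lines of Python. This is only a sketch using the standard library; the variable names are mine, and the two strings are copied from the source line and the URL shown above.

```python
import html
from urllib.parse import unquote

# Both strings are copied from the question text above.
source_line = "<p>gnu.org/distros/free-distros.h<wbr>&shy;tml</p>"
broken_url = "http://www.gnu.org/distros/free-distros.h%C2%ADtml%EF%BB%BF"

shy = html.unescape("&shy;")      # the character the entity stands for
decoded_url = unquote(broken_url)  # percent-escapes are decoded as UTF-8

print(f"U+{ord(shy):04X}")         # U+00AD
print(shy.encode("utf-8"))         # b'\xc2\xad', i.e. the %C2%AD in the URL
print(shy in decoded_url)          # True
print(shy in html.unescape(source_line))  # True
```

So the %C2%AD in the address bar is the same soft hyphen that sits in the page source, just written as UTF-8 bytes.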

Update 4

This is what it looks like when I paste the URL into Microsoft Word 2010 and view hidden characters.

screen5

To move the text cursor from .|html to .ht|ml requires pressing the arrow key three times. You can tell by the image above why that is: it's because of this hidden character. With the cursor in front of that strange-looking character, pressing Alt+X shows 0068. With the cursor behind that character, and in front of the letter t, it reveals nothing at all. The 0068 is just the Unicode code point for the letter h.

Best Answer

Yes, it is a nuisance.

There are two hyphens: the normal one, \u002D, and the funny one, \u00AD. The funny one is sometimes used within YouTube comments and comes up as hidden.

Paste into Notepad (to remove formatting; Notepad also shows the character), and then into MS Word (or in MS Word just use Paste Special > Unformatted Unicode Text). Put your cursor to the right of the hyphen, or any character, and press Alt+X, and you see the ASCII or Unicode code for it.
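If you don't have Word handy, something like the following Python snippet does a similar job to Alt+X. It is only a sketch: `describe_hidden` is a name I made up, and the sample string writes the invisible characters as escapes so you can actually see them in the code.

```python
import unicodedata

# Stand-in for the pasted text; the escapes are the two invisible characters.
pasted = "free-distros.h\u00adtml\ufeff"

def describe_hidden(text):
    """Return (position, code point, Unicode name) for every non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

for position, code_point, name in describe_hidden(pasted):
    print(position, code_point, name)
```

Anything it prints is a character outside the 0-7F range, which is where the troublemakers live.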

This may seem strange, but be aware that a few characters come in two variants: the one you normally use, which is within the 0-7F range, and one that people rarely or never use, which is above 7F. There are two types of spaces: the normal one and the non-breaking space, character code 160 (\u00A0), which can be useful. There are two types of pipes, 7C and A6; the A6 one is just asking for problems, as it causes failures on the command line. And there are two types of hyphens; the second one behaves strangely too, as YouTube comments sometimes use it and hide it rather than displaying it as a hyphen.
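To make those pairs concrete, here is a small Python sketch that prints the official Unicode name of each one. The pairing itself is just my way of laying out the characters mentioned above.

```python
import unicodedata

# (everyday character, its look-alike above 7F), as listed above
pairs = [
    ("\u002d", "\u00ad"),  # hyphens
    ("\u0020", "\u00a0"),  # spaces
    ("\u007c", "\u00a6"),  # pipes
]

names = [(unicodedata.name(a), unicodedata.name(b)) for a, b in pairs]
for common, lookalike in names:
    print(f"{common:<15} vs {lookalike}")
# HYPHEN-MINUS    vs SOFT HYPHEN
# SPACE           vs NO-BREAK SPACE
# VERTICAL LINE   vs BROKEN BAR
```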

Another funny character I see used by YouTube in comments is \uFEFF. You can run Notepad2 (download it), choose File > Encoding > UTF-8, paste the text in, and search for \uFEFF, replacing it with nothing (check the box that says "Transform backslashes").

Similarly, you can open Notepad2, search for \u00AD (that funny hyphen) and replace it with a regular hyphen. The free version of EditPad might be able to do it, though I use the Pro version for its regex support.
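The same search-and-replace can be sketched in Python if you'd rather script it. Note that this version deletes both characters instead of swapping in a regular hyphen, since that is what a pasted URL needs; `clean_pasted` is a name I invented for the example.

```python
# Map both invisible characters to None so str.translate() deletes them.
INVISIBLE = {0x00AD: None, 0xFEFF: None}

def clean_pasted(text):
    """Strip soft hyphens (U+00AD) and U+FEFF from pasted text."""
    return text.translate(INVISIBLE)

pasted = "gnu.org/distros/free-distros.h\u00adtml\ufeff"
print(clean_pasted(pasted))  # gnu.org/distros/free-distros.html
```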

I'd note that charmap doesn't copy the funny hyphen correctly. (So if you want to experiment, and you choose Copy and paste it into a piece of software and it displays funny, blame charmap.) But it copies fine (that is, with the character) from your link in my browser (Chrome). It would be better if the character weren't there though; it is a nuisance! But you can see its code in MS Word, and you can search for and remove it in Notepad2.

You can see from charmap that it (\u00AD) is called the "soft hyphen" (I'm just glad they didn't hyphenate that title!).

In the pic I used MS Word and pressed Alt+X.

enter image description here
