How do I find this character (by Unicode search) in Notepad++: ﻁ
If I go to charmap
and I pick this character
I type FEC1 in the unicode search box and hit ENTER and it finds the character
I look it up on fileformat.info
http://www.fileformat.info/info/unicode/char/fec1/index.htm
UTF-8 (hex): 0xEF 0xBB 0x81 (efbb81); UTF-16 (hex): 0xFEC1 (fec1)
If I enter the character into the search box literally then it finds it
But I can't see what unicode to search for to find it
I'd like to be able to search for it in both UTF-8 and UTF-16
[\uFEC1] seems to find the character, but it finds more than that character
Now, if I throw a few FEC9s in there, then I see [\uFEC1] seems to find them too
So, how do I search for \uFEC1 and only that? I'm also interested in searching for it by its UTF-8 code.
Best Answer
To search by Unicode codepoints using UTF-16 you'd use \x{FEC1}, and it works whether the file is encoded with UTF-8 or UTF-16.
Bear in mind you wouldn't need to search by the UTF-8 code, because you can search by the UTF-16 code. But to address the part of your question that asks how do you search for that character by the UTF-8 code...
You can't. Well, you sort of can, but it's a hideous hack and you really shouldn't.
The obvious thing to try would be to search for \xef\xbb\x81 in your UTF-8 encoded document, but that doesn't work. (Note there's no {} here: Notepad++ expects either \xNN for 2 hex digits, or \x{NNNN} for 4 hex digits.) That's because Notepad++ doesn't actually search for byte values; it searches for Unicode codepoints. So you can search for the codepoint U+FEC1, but not for the UTF-8 bytes 0xEF 0xBB 0x81, because Notepad++ "hides" the encoding details from you. (Because in nearly every scenario, someone editing a text file will care far more about finding the actual character than about finding the UTF-8 bytes.)
There's another trick you might try, which is to take that UTF-8 encoded file and choose the
Encoding → Encode in ANSI menu option, at which point ﻁﻁﻉﻁﻉﻁﻉ appears to become ï»ï»ï»‰ï»ï»‰ï»ï»‰. (I say "appears to become" rather than "becomes" because... well, read on.) This is because it has taken the UTF-8 text of your file and reinterpreted it as "ANSI" (which is a terrible encoding name because it's completely wrong, and should really be called "Windows-1252", but that's a different question).
(By the way, the reason that ﻁﻁﻉﻁﻉﻁﻉ looks backwards in my text compared with your screenshot is that Notepad++ doesn't care that Arabic is written right-to-left, so it shows the characters left-to-right in the order they were pasted into the file. Your browser does care about presenting Arabic in proper right-to-left order, so the first two letters of that string (ﻁﻁ) appear on the right-hand side of the string, not on the left-hand side as they seem to in Notepad++.)
Digressions aside, here's why this will be helpful. In the "ANSI" (really Windows-1252) encoding, each byte is a single character, so now you're going to be able to search by individual bytes. Now, if you search for \xef\xbb\x81 (which doesn't need to be a regular expression, just an "Extended" search), it will find the characters. Sort of. It will look like it's highlighting the two characters ï», but it's really highlighting three characters: ï, », and an "invisible" 0x81 character that doesn't really exist. (Because there is no character at the 0x81 point in the Windows-1252 encoding: see for yourself.) And now you see why I said "appears to become": your UTF-8 encoded text has really become ï»_ï»_ï»‰ï»_ï»‰ï»_ï»‰, where _ represents an "invisible" character that doesn't officially exist in the Windows-1252 codepage.
Anyway, now that you've found the sequence of three characters with the byte values 0xEF, 0xBB, and 0x81 in Windows-1252, and Notepad++ has highlighted them, you can choose the Encoding → Encode in UTF-8 menu option, and your text will convert itself back to UTF-8, while Notepad++ will keep the highlight in the same place. Thus, you'll find that one ﻁ
character has been highlighted.
So why do I say that you really shouldn't do this? Because the only reason that it works is that Notepad++ didn't do the right thing when you switched codepages. The right thing to do when you find a missing character is to complain, or insert a character like the Unicode replacement character � (or a simple ? if you're in a legacy codepage that doesn't have � in it), or do something so that the user will know they had an invalid character in their text. Errors should never be silently ignored, and having a 0x81 value in Windows-1252 text is an error. The only reason this trick works is that Notepad++ does the wrong thing with invalid characters (that is, it ignores them). So you really shouldn't rely on this trick: with any update, Notepad++ could change its undocumented (and wrong) behavior and start putting proper replacement characters in wrongly-encoded text, at which point this trick would fail. Stick to searching for real Unicode codepoints, and you'll be much better off.
By the way, the reason your original attempt ([\uFEC1]) failed is that, according to Notepad++'s regular expression syntax, \u means "an uppercase letter". (Remember that in regular expressions, brackets represent "any of these characters".) The docs further say, "See note about lower case [sic] letters," and the note about lowercase letters says "this will fall back on "a word character" if the "Match case" search option is off", as it is in your screenshot. Therefore, the regex [\uFEC1] is searching for "any word character, or F, or E, or C, or 1", which matches every single character in your sample text.
Phew, that turned out to be a very long answer for what I said would be "very simple". I hope this helps you understand Unicode a bit better; if so, the hour I spent typing this up will have been worth it.
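
P.S. If you genuinely need to locate the character by its UTF-8 bytes, it is probably easier to do that outside Notepad++. What follows is only a minimal Python sketch under my own assumptions, not anything Notepad++ does internally: the helper names (encoded_forms, find_byte_offsets, decodes_as_cp1252) are made up for illustration, and the sample data is just the seven-character string from the question encoded as UTF-8. It shows how U+FEC1 maps to the byte values quoted from fileformat.info, how a plain byte search finds those bytes, and why 0xEF 0xBB 0x81 is not valid Windows-1252 text.

```python
# Illustrative sketch only: the function names below are hypothetical helpers,
# not part of Notepad++ or of any library.

TAH = "\uFEC1"  # the character from the question, written as a codepoint escape


def encoded_forms(ch: str) -> dict:
    """Return the hex bytes of one character in UTF-8 and big-endian UTF-16."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }


def find_byte_offsets(data: bytes, needle: bytes) -> list:
    """Return every byte offset at which `needle` occurs in `data`."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets


def decodes_as_cp1252(data: bytes) -> bool:
    """Report whether `data` is valid Windows-1252 according to Python's codec."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# The sample string from the question (U+FEC1 and U+FEC9 mixed), as UTF-8 bytes.
sample = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9".encode("utf-8")

print(encoded_forms(TAH))
print(find_byte_offsets(sample, b"\xef\xbb\x81"))
print(decodes_as_cp1252(b"\xef\xbb\x81"))
```

Python's cp1252 codec refuses the 0x81 byte instead of silently ignoring it, which is the "complain" behavior described above. To search a real file the same way, you would read it in binary mode and pass its contents to find_byte_offsets; the offsets returned are byte positions, not character positions.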