Bash Scripting – How to Grep for Unicode in a Bash Script

grep linux openssl scripting

if grep -q "�" out.txt
    then
        echo "working"
    else
        cat out.txt
fi

Basically, if the file "out.txt" contains "�" anywhere, I would like the script to echo "working"; if it does NOT contain "�" anywhere, I would like it to cat out.txt.

EDIT: So here's what I'm doing. I'm trying to brute-force an openssl decryption.

openssl enc returns 0 on success and non-zero otherwise. Note: you will get false positives, because AES/CBC can only determine whether "decryption works" based on getting the padding right. So the file decrypts, but not with the correct password, and the output contains gibberish. A common character in the gibberish is "�", so I want the loop to keep going if the output contains "�".

Here's my GitHub link: https://github.com/Raphaeangelo/OpenSSLCracker
Here's the script:

while read line
do
openssl aes-256-cbc -d -a -in $1 -pass pass:$line -out out.txt 2>out.txt >/dev/null && printf "==================================================\n"
if grep -q "�" out.txt
    then
        :
    else
        cat out.txt &&
            printf "\n==================================================" &&
            printf "\npassword is $line\n" &&
            read -p "press return key to continue..." < /dev/tty; 
fi
done < ./password.txt

It's still showing me output with the � character in it.

Best Answer

grep is the wrong tool for the job.

You see the � U+FFFD REPLACEMENT CHARACTER not because it’s literally in the file content, but because you looked at a binary file with a tool that is supposed to handle only text-based input. The standard way to handle invalid input (i.e., random binary data) is to replace everything that is not valid in the current locale (most probably UTF-8) with U+FFFD before it hits the screen.

That means it is very likely that a literal \xEF\xBF\xBD (the UTF-8 byte sequence for the U+FFFD character) never occurs in the file. grep is completely right in telling you there is none.
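To see this for yourself, here is a minimal sketch that searches a file for the actual bytes of U+FFFD. The helper name has_literal_fffd and the two demo bytes are assumptions made for illustration; they are not part of the original script.

```shell
#!/bin/sh
# has_literal_fffd FILE (hypothetical helper name): succeed only if FILE
# contains the byte sequence EF BF BD, i.e. U+FFFD encoded as UTF-8.
has_literal_fffd() {
    # LC_ALL=C makes grep compare raw bytes instead of decoding characters.
    LC_ALL=C grep -q "$(printf '\357\277\275')" "$1"
}

# Demo: two bytes (FF FE) that can never appear in valid UTF-8.
demo=$(mktemp)
printf '\377\376' > "$demo"
od -An -tx1 "$demo"          # hex dump of the raw bytes in the file
if has_literal_fffd "$demo"; then
    echo "literal U+FFFD found"
else
    echo "no literal U+FFFD"
fi
rm -f "$demo"
```

A UTF-8 terminal would typically render those two bytes as replacement characters, yet the search does not match, because the replacement only happens at display time.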

One way to detect whether a file contains some unknown binary is with the file(1) command:

$ head -c 100 /dev/urandom > rubbish.bin
$ file rubbish.bin
rubbish.bin: data

For any unknown file type it will simply say data. Try

$ file out.txt | grep '^out.txt: data$'

to check whether the file really contains any arbitrary binary and thus most likely rubbish.
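Inside a script, the same check can be used directly as a condition. This is only a sketch: the helper name looks_like_binary is made up here, and it assumes your file(1) supports the -b (brief) option. Keep in mind that file may occasionally classify random bytes as some other type instead of data.

```shell
#!/bin/sh
# looks_like_binary FILE (hypothetical helper name): succeed when file(1)
# cannot classify FILE and reports it as plain "data".
looks_like_binary() {
    # -b (brief) omits the leading "filename: " from the output.
    [ "$(file -b "$1")" = "data" ]
}

# Demo on a small text file; in the loop this would be out.txt.
sample=$(mktemp)
printf 'hello world\n' > "$sample"
if looks_like_binary "$sample"; then
    echo "rubbish, try the next password"
else
    echo "looks like text"
fi
rm -f "$sample"
```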

If you want to make sure that out.txt is a UTF-8 encoded text file only, you can alternatively use iconv:

$ iconv -f utf-8 -t utf-16 out.txt >/dev/null
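iconv exits with a non-zero status when the input is not valid UTF-8, so it can drive the loop from the question. The following is a rough sketch, not a drop-in replacement: the function names is_valid_utf8 and try_passwords are hypothetical, and the openssl options are simply copied from the question.

```shell
#!/bin/sh
# is_valid_utf8 FILE (hypothetical helper): succeed only if iconv can
# decode the whole FILE as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-16 "$1" >/dev/null 2>&1
}

# try_passwords ENCRYPTED_FILE WORDLIST (hypothetical): try each password
# and stop at the first one whose output decodes as UTF-8 text.
try_passwords() {
    enc_file=$1
    wordlist=$2
    while IFS= read -r line; do
        # openssl options are taken from the question's script.
        if openssl aes-256-cbc -d -a -in "$enc_file" -pass "pass:$line" \
               -out out.txt 2>/dev/null && is_valid_utf8 out.txt; then
            printf 'password is %s\n' "$line"
            cat out.txt
            return 0
        fi
    done < "$wordlist"
    return 1
}

# Usage (assumed file names): try_passwords secret.enc password.txt
```

This still allows rare false positives: a wrong password could produce output that happens to be valid UTF-8, so the result is worth checking by eye.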

Related Solutions

Bash – Merge Two Lists While Removing Duplicates

I think

sort file1 file2 | uniq
aaaaaa
bbbbbb
cccccc
mmmmmm
nnnnnn
yyyyyy
zzzzzz

will do what you want.
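The same merge can be written with sort -u, which sorts and drops duplicate lines in one step. A small sketch; the function name merge_unique is only illustrative:

```shell
#!/bin/sh
# merge_unique FILE... (hypothetical helper): print the sorted union of
# all lines in the given files, each distinct line once.
merge_unique() {
    sort -u "$@"
}

# Usage (file names from the answer above): merge_unique file1 file2
```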

Additional Documentation: uniq sort

How to find all files containing various strings from a long list of string combinations

Since agrep does not seem to be present on your system, have a look at this alternative, based on sed and awk, which applies grep with an AND operation using patterns read from a local file.

PS: Since you use OS X, I'm not sure whether your version of awk will support the usage below.

awk can simulate grep with an AND operation over multiple patterns like this:
awk '/pattern1/ && /pattern2/ && /pattern3/'

So you could transform your pattern file from this:

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

To this:

$ sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' ./tmp/d1.txt
/surveillance data/ && /surveillance technology/ && /cctv camera/
/social media/ && /surveillance techniques/ && /enforcement agencies/
/social control/ && /surveillance camera/ && /social security/
/surveillance data/ && /security guards/ && /social networking/
/surveillance mechanisms/ && /cctv surveillance/ && /contemporary surveillance/

PS: You can redirect the output to another file by appending >anotherfile to the command, or use the sed -i option to change the pattern file in place.

Then you just need to feed awk the awk-formatted patterns from this pattern file:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt #d1.txt = my test pattern file

Alternatively, you can leave the original pattern file untransformed and apply sed to each line as it is read, like this:

while IFS= read -r line;do 
  line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line")
  awk "$line" *.txt
done <./tmp/d1.txt

Or as one-liner:

$ while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt

The commands above return the correct AND results for my test files, which look like this:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Results:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt
#or while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Update:
The awk solution above prints the matching lines from the txt files.
If you want to display the filenames instead of the matching lines, use the following awk command where necessary:

awk "$line""{print FILENAME}" *.txt
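The command above prints the filename once for every matching line. If you would rather see each file only once per pattern line, the pieces can be combined as in the sketch below. The function name files_matching_all is hypothetical, and the sketch assumes the patterns contain no slashes or regex metacharacters that would need escaping in an awk /.../ pattern.

```shell
#!/bin/sh
# files_matching_all PATTERN_FILE FILE... (hypothetical helper): for each
# line of quoted patterns, print every file that has at least one line
# matching all of those patterns.
files_matching_all() {
    pattern_file=$1
    shift
    while IFS= read -r line; do
        [ -n "$line" ] || continue      # skip empty pattern lines
        # Turn "a" "b" "c" into /a/ && /b/ && /c/ (same sed as above).
        prog=$(printf '%s\n' "$line" |
               sed 's/" "/\/ \&\& \//g; s/^"/\//; s/"$/\//')
        # seen[] ensures each file is printed once per awk run.
        awk "$prog"' && !seen[FILENAME]++ { print FILENAME }' "$@"
    done < "$pattern_file"
}

# Usage (paths from the answer above): files_matching_all ./tmp/d1.txt *.txt
```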
