Text Processing – Removing Control Characters from Script Output

colors, escape-characters, terminal, text processing, typescript

I can use the "script" command to record an interactive session at the command line. However, this includes all control characters and colour codes. I can remove control characters (like backspace) with "col -b", but I can't find a simple way to remove the colour codes.

Note that I want to use the command line in the normal way, so I don't want to disable colours there; I just want to remove them from the script output. Also, I know I can play around and try to find a regexp to fix things up, but I am hoping there is a simpler and more reliable solution (what if there's a code I don't know about when I develop the regexp?).

To show the problem:

spl62 tmp: script
Script started, file is typescript
spl62 lepl: ls
add-licence.sed  build-example.sh  commit-test         push-docs.sh
add-licence.sh   build.sh          delete-licence.sed  setup.py
asn              build-test.sh     delete-licence.sh   src
build-doc.sh     clean             doc-src             test.ini
spl62 lepl: exit
Script done, file is typescript
spl62 tmp: cat -v typescript
Script started on Thu 09 Jun 2011 09:47:27 AM CLT
spl62 lepl: ls^M
^[[0m^[[00madd-licence.sed^[[0m  ^[[00;32mbuild-example.sh^[[0m  ^[[00mcommit-test^[[0m         ^[[00;32mpush-docs.sh^[[0m^M
^[[00;32madd-licence.sh^[[0m   ^[[00;32mbuild.sh^[[0m          ^[[00mdelete-licence.sed^[[0m  ^[[00msetup.py^[[0m^M
^[[01;34masn^[[0m              ^[[00;32mbuild-test.sh^[[0m     ^[[00;32mdelete-licence.sh^[[0m   ^[[01;34msrc^[[0m^M
^[[00;32mbuild-doc.sh^[[0m     ^[[00;32mclean^[[0m             ^[[01;34mdoc-src^[[0m             ^[[00mtest.ini^[[0m^M
spl62 lepl: exit^M

Script done on Thu 09 Jun 2011 09:47:29 AM CLT
spl62 tmp: col -b < typescript 
Script started on Thu 09 Jun 2011 09:47:27 AM CLT
spl62 lepl: ls
0m00madd-licence.sed0m  00;32mbuild-example.sh0m  00mcommit-test0m         00;32mpush-docs.sh0m
00;32madd-licence.sh0m   00;32mbuild.sh0m          00mdelete-licence.sed0m  00msetup.py0m
01;34masn0m              00;32mbuild-test.sh0m     00;32mdelete-licence.sh0m   01;34msrc0m
00;32mbuild-doc.sh0m     00;32mclean0m             01;34mdoc-src0m             00mtest.ini0m
spl62 lepl: exit

Script done on Thu 09 Jun 2011 09:47:29 AM CLT

Best Answer

The following script should filter out all ANSI/VT100/xterm control sequences (based on ctlseqs). It is only minimally tested, so please report any under- or over-matching.

#!/usr/bin/env perl
## uncolor — remove terminal escape sequences such as color changes
while (<>) {
    s/ \e[ #%()*+\-.\/]. |
       \e\[ [ -?]* [@-~] | # CSI ... Cmd
       \e\] .*? (?:\e\\|[\a\x9c]) | # OSC ... (ST|BEL)
       \e[P^_] .*? (?:\e\\|\x9c) | # (DCS|PM|APC) ... ST
       \e. //xg;
    print;
}

Known issues:

  • Doesn't complain about malformed sequences. That's not what this script is for.
  • Multi-line string arguments to DCS/PM/APC/OSC are not supported.
  • Bytes in the range 128–159 may be parsed as control characters, though this is rarely used.

Here's a version which parses non-ASCII control characters (this will mangle non-ASCII text in some encodings, including UTF-8):

#!/usr/bin/env perl
## uncolor — remove terminal escape sequences such as color changes
while (<>) {
    s/ \e[ #%()*+\-.\/]. |
       (?:\e\[|\x9b) [ -?]* [@-~] | # CSI ... Cmd
       (?:\e\]|\x9d) .*? (?:\e\\|[\a\x9c]) | # OSC ... (ST|BEL)
       (?:\e[P^_]|[\x90\x9e\x9f]) .*? (?:\e\\|\x9c) | # (DCS|PM|APC) ... ST
       \e.|[\x80-\x9f] //xg;
    print;
}