How to remove specific numbers from a txt file with SED or AWK

awk, sed, text processing

I work at a company that will not let me install any software on my computers, and I have to run awful Windows there.

I need to clean up a lot of text that I copy from the intranet and save as txt files, so I have to use online live editors for sed and/or awk, like this or this.

The texts look like this:

01

010010-26.2010.501.0026  fafas fasdf asdf asdfsadf asdfasd fasd asasdff

fdfsadf adsf adsf asdf asdfas fadsf asdfa

02

0011-15.2016.501.0012  fafas fasdf asdf asdfsadf asdfasd fasd asasdff
asdfasd fasd asasdff
asdfasd fasd asasdff
0011-125.2013.501.0012
asdfasd fasd asasdff

Numbers like 0011-15.2016.501.0012 are what I want. I do not care about the rest; I want to create a new, clean text with all of these numbers, one per line. For the example above, I need a text with

010010-26.2010.501.0026
0011-15.2016.501.0012
0011-125.2013.501.0012

The .501. is always present, in every number, as the 4th group.

I have tried this command on the sed online editor

's/\([0-9]*\-[0-9]*\.[0-9]*\.501\.[0-9]*\)/\1/'

It is not working.

Best Answer

It does work, but it doesn't change anything; or rather, it replaces each match with itself, and sed then prints every line as usual. With a very small modification of this code you can get what you want:

sed -n 's/\([0-9]*\-[0-9]*\.[0-9]*\.501\.[0-9]*\).*/\1/p'

Notice three things:

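the -n switch, which tells sed not to print anything by default
the .* after the group captured with \(...\), so that the rest of the line following the number is matched and dropped
the p flag at the end of the command, which prints the line only when a substitution was made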

Result:

010010-26.2010.501.0026
0011-15.2016.501.0012
0011-125.2013.501.0012
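
A quick way to check this is to run both commands on a small sample. The sketch below uses made-up sample text modelled on the question and writes the hyphen without a backslash, which should behave the same; the variable names are only for illustration.

```shell
# Made-up sample modelled on the question's text.
sample='01
010010-26.2010.501.0026  fafas fasdf asdf
fdfsadf adsf adsf asdf
0011-125.2013.501.0012'

# Original attempt: each match is replaced by itself and every line is printed.
unchanged=$(printf '%s\n' "$sample" |
  sed 's/\([0-9]*-[0-9]*\.[0-9]*\.501\.[0-9]*\)/\1/')

# Fixed version: -n suppresses default printing, .* swallows the rest of the
# line, and the p flag prints only lines where a substitution happened.
extracted=$(printf '%s\n' "$sample" |
  sed -n 's/\([0-9]*-[0-9]*\.[0-9]*\.501\.[0-9]*\).*/\1/p')

printf '%s\n' "$extracted"
```

Here unchanged comes back identical to the input, while extracted holds only the two numbers, one per line.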

BTW, you can simplify this a little by adding -E to use extended regular expressions, i.e. get rid of the backslashes in front of the capturing parentheses:

sed -E -n 's/([0-9]*-[0-9]*\.[0-9]*\.501\.[0-9]*).*/\1/p'

Both versions work on the web page mentioned.
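
Since the question also allows awk, here is a minimal sketch of an awk alternative, assuming a number may also appear in the middle of a line or more than once per line (the sed commands above keep any text that precedes the number). It uses [0-9]+ rather than [0-9]*, so every group must contain at least one digit. The sample input and the function name extract_numbers are made up for illustration.

```shell
# Made-up input; the third line is an added assumption (a number preceded by
# other words), which the question's sample does not contain.
input='010010-26.2010.501.0026  fafas fasdf
fdfsadf adsf adsf
see 0011-15.2016.501.0012 here
0011-125.2013.501.0012'

extract_numbers() {
  awk '{
    line = $0
    # match() sets RSTART/RLENGTH; keep searching the remainder of the line.
    while (match(line, /[0-9]+-[0-9]+\.[0-9]+\.501\.[0-9]+/)) {
      print substr(line, RSTART, RLENGTH)
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

result=$(printf '%s\n' "$input" | extract_numbers)
printf '%s\n' "$result"
```

Only the matched text is printed, so surrounding words are dropped wherever they appear on the line.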

Related Solutions

Sed – How to Parse File to Extract 3-Digit Numbers in Group

awk '
    $1 == "Group" {printf("\\section{%s%d}\n", $1, $2); next}
    {for (i=1; i<=NF; i++) 
        if ($i ~ /^[0-9][0-9][0-9]$/) {
            printf("\\Testdetails{%d}\n", $i)
            break
        }
    }
'

Update based on comment:

awk '
    $1 == "Group" {printf("\\section{%s %d}\n", $1, $2); next}
    {
      title = sep = ""
      for (i=1; i<=NF; i++) 
        if ($i ~ /^[0-9][0-9][0-9]$/) {
          printf("\\subsection{%s} \\Testdetails{%d}\n", title, $i)
          break
        }
        else {
          title = title sep $i
          sep = FS
        }
    }
'
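
The input for this script is not shown here, so the following is only a sketch of how the updated version behaves on a hypothetical input: lines starting with "Group" followed by a number, and other lines holding a title followed by a 3-digit number. The sample text and the function name convert are assumptions.

```shell
# Hypothetical input; the real file format is not shown in the answer.
sample='Group 1
Login page check 101 passed
Group 2
Logout 202'

convert() {
  awk '
    $1 == "Group" {printf("\\section{%s %d}\n", $1, $2); next}
    {
      title = sep = ""
      for (i=1; i<=NF; i++)
        if ($i ~ /^[0-9][0-9][0-9]$/) {
          # Words seen before the 3-digit field form the subsection title.
          printf("\\subsection{%s} \\Testdetails{%d}\n", title, $i)
          break
        }
        else {
          title = title sep $i
          sep = FS
        }
    }
  '
}

out=$(printf '%s\n' "$sample" | convert)
printf '%s\n' "$out"
```

Words after the first 3-digit field (such as "passed" above) are ignored because of the break.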

Grab text out of vtt file

Since your file appears to consist of a sequence of records separated by one or more blank lines, I'd suggest trying something based on the paragraph modes of either awk or perl.

For example, if you always need to strip off the first two lines, like

1
00:00:00.096 --> 00:00:05.047

you could split into newline-delimited fields within blank-separated paragraphs and skip the first two fields using either

awk -vRS= -vORS= -F'\n' '{for(j=3;j<=NF;j++) print $j; print " "}' file.vtt

perl -F'\n' -00ane 'print join("", @F[2..$#F]), " "' file.vtt
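
As a rough illustration of the awk variant, the sketch below feeds it a tiny made-up cue list (no WEBVTT header, one text line per cue) instead of file.vtt. The sample text and variable names are assumptions.

```shell
# Made-up sample: two cues separated by a blank line.
vtt='1
00:00:00.096 --> 00:00:05.047
hello there

2
00:00:05.047 --> 00:00:09.000
general kenobi'

# RS= enables paragraph mode, -F"\n" makes each line a field, and the empty
# ORS joins the printed fields; a single space is printed after each record.
text=$(printf '%s\n' "$vtt" |
  awk -v RS= -v ORS= -F'\n' '{for(j=3;j<=NF;j++) print $j; print " "}')

printf '[%s]\n' "$text"
```

Note that a cue with several text lines would have those lines joined without a space between them, because ORS is empty.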

If you can't rely on there being a fixed number of fields (lines) to be removed, then it's fairly easy to add a regular expression test - a little easier in perl since it allows us to grep directly on arrays rather than writing an explicit loop. For example, to split into blank-separated records and then print only those fields (lines) having at least one sequence of at least 3 alphabetic characters, you could use

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " "
' file.vtt

If you want to exclude the WEBVTT string you can simply skip the first record, i.e.

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
  ' file.vtt

It will be down to you to choose a suitable regex that captures the wanted lines and excludes the unwanted ones. You can add an END block in either awk or perl if you want to add a final newline to the concatenated output.
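
The same filtering idea can be sketched in awk with an explicit loop. This is only an approximation of the perl commands above: it writes [A-Za-z] three times instead of [[:alpha:]]{3}, an ASCII-only simplification chosen because not every awk supports character classes and interval expressions. The sample text is made up.

```shell
# Made-up sample with a WEBVTT header record followed by two cues.
vtt='WEBVTT

1
00:00:00.096 --> 00:00:05.047
hello there

2
00:00:05.047 --> 00:00:09.000
general kenobi'

filtered=$(printf '%s\n' "$vtt" |
  awk -v RS= -v ORS= -F'\n' '
    NR > 1 {                      # skip the first record (the WEBVTT header)
      for (j = 1; j <= NF; j++)   # keep lines with 3 consecutive ASCII letters
        if ($j ~ /[A-Za-z][A-Za-z][A-Za-z]/) print $j
      print " "
    }')

printf '[%s]\n' "$filtered"
```

Cue numbers and timestamps contain no run of three letters, so only the spoken text survives.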

NOTE: since (based on the discussion in comments) your files appear to have DOS-style CRLF line endings, you will need to deal with those - either by modifying the field and record separators in the above commands accordingly, or by stripping out the CRs first e.g.

sed 's/\r$//' file.vtt | 
  perl -F'\n' -00ane '
    print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
  '
you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate  organizational goals and courses action to best achieve those goals steeldriver@xenial-vm:~/test/$
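
To illustrate why the CRs matter: in paragraph mode, a line holding only a CR is not empty, so the records are never split. The following is a minimal sketch, assuming a Unix-like shell. It uses tr -d '\r' as a stand-in for the sed command above (it removes every CR, not only those at line ends), and the helper names make_crlf and count_records are made up.

```shell
# Emit a tiny made-up VTT fragment with DOS-style CRLF line endings.
make_crlf() {
  printf 'WEBVTT\r\n\r\n1\r\n00:00:00.096 --> 00:00:05.047\r\nhello there\r\n'
}

# Count blank-line-separated records using awk paragraph mode.
count_records() {
  awk -v RS= 'END { print NR }'
}

raw=$(make_crlf | count_records)                 # "blank" lines still hold a CR
clean=$(make_crlf | tr -d '\r' | count_records)  # CRs removed first

printf 'raw=%s clean=%s\n' "$raw" "$clean"
```

With the CRs in place the whole input is read as a single record; once they are stripped, the header and the cue are seen as two records.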
