Grab text out of vtt file

Tags: grep, json, regular expression, sed, text processing

vtt files look like this:

WEBVTT

1
00:00:00.096 --> 00:00:05.047
you're the four functions if you would of 
management first of all you have the planning

2
00:00:06.002 --> 00:00:10.079
the planning stages basically you were choosing appropriate 
 organizational goals and courses

3
00:00:11.018 --> 00:00:13.003
action to best achieve those goals

I need just the text, like this:

you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate organizational goals and courses action to best achieve those goals

On Ubuntu I tried:

cat file.vtt | grep -v [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9][[:space:]][[:punct:]][[:punct:]][[:punct:]][[:space:]][0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]

that gives me:

WEBVTT

1
you're the four functions if you would of 
management first of all you have the planning

2
the planning stages basically you were choosing appropriate 
 organizational goals and courses

3
action to best achieve those goals

But I can't figure out how to do the rest. What I want is to replace

\n[0-9]+\n\n with a space, but I can't figure out how to make sed or grep do that.

How do I use basic, portable tools (generally preinstalled on Ubuntu, CentOS, etc., e.g. grep, sed, or tr) to get just the raw text, with the subtitle timing removed and everything on one line (no newlines)?

NOTE: this has to work for characters from other languages, such as Chinese, Hindi, and Arabic, so preferably no [a-z]-type matches; instead, remove the timing lines, which are very consistent in format. Also, don't blindly remove numbers, as the text can contain numbers.

NOTE 2: the ultimate goal is to make the text safe for a JSON value, with all special characters removed and double quotes escaped, but that's somewhat beyond the scope of this question.

Best Answer

Since your file appears to consist of a sequence of records separated by one or more blank lines, I'd suggest trying something based on the paragraph modes of either awk or perl.

For example, if you always need to strip off the first two lines, like

1
00:00:00.096 --> 00:00:05.047

you could split into newline-delimited fields within blank-separated paragraphs and skip the first two fields using either

awk -vRS= -vORS= -F'\n' '{for(j=3;j<=NF;j++) print $j; print " "}' file.vtt

or

perl -F'\n' -00ane 'print join("", @F[2..$#F]), " "' file.vtt
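Before deciding how many leading lines to skip, it can help to see how paragraph mode actually splits the file. The following is a minimal sketch, not part of the original answer: vtt_records is a hypothetical helper name, and it only prints each record's number, its line count, and its first line.

```shell
# Hypothetical diagnostic helper (name is an assumption, not from the answer).
# RS= puts awk in paragraph mode: records are separated by blank lines.
# -F'\n' makes every line of a record its own field, so NF is the line count.
vtt_records() {
  awk -vRS= -F'\n' '{ print NR ": " NF " lines, first = " $1 }' "$1"
}

# Usage: vtt_records file.vtt
```

If the whole file shows up as a single record, the blank lines are probably not really empty (for example, they may contain carriage returns).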

If you can't rely on there being a fixed number of fields (lines) to remove, it's fairly easy to add a regular expression test. This is a little easier in perl, since it lets us grep directly on arrays rather than writing an explicit loop. For example, to split into blank-separated records and then print only those fields (lines) containing a run of at least 3 alphabetic characters, you could use

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " "
' file.vtt

If you want to exclude the WEBVTT string you can simply skip the first record, i.e.

perl -F'\n' -00ane '
  print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
' file.vtt

It will be down to you to choose a suitable regex that captures the wanted lines and excludes the unwanted ones. You can add an END block in either awk or perl if you want to add a final newline to the concatenated output.
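As one possible choice of regex, the sketch below drops lines by what they are rather than by what the text looks like, so it does not depend on any letter class. It is an assumption-laden variant, not the answer's own command: vtt_text is a hypothetical helper name, a cue number is only dropped when it is the first line of a record, and a timing line is assumed to start with digits, colons and dots followed by " --> ".

```shell
# Hypothetical helper (name and patterns are assumptions, not from the answer).
vtt_text() {
  awk -vRS= -vORS= -F'\n' '
    NR == 1 && $1 ~ /^WEBVTT/ { next }   # skip the header record
    {
      for (j = 1; j <= NF; j++) {
        # cue number: dropped only when it is the first line of a record
        if (j == 1 && $j ~ /^[0-9]+$/) continue
        # timing line such as 00:00:00.096 --> 00:00:05.047
        if ($j ~ /^[0-9:.]+ --> [0-9:.]+/) continue
        print $j " "
      }
    }
  ' "$1" | tr -s " "   # squeeze runs of spaces left by padded source lines
}

# Usage: vtt_text file.vtt
```

A subtitle line consisting only of digits is kept as long as it is not the first line of its record, which covers the requirement not to remove numbers blindly.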


NOTE: since (based on the discussion in comments) your files appear to have DOS-style CRLF line endings, you will need to deal with those - either by modifying the field and record separators in the above commands accordingly, or by stripping out the CRs first e.g.

sed 's/\r$//' file.vtt | 
  perl -F'\n' -00ane '
    print join("", grep { /[[:alpha:]]{3}/ } @F), " " if $. > 1
  '
you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate  organizational goals and courses action to best achieve those goals
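The CRs matter because a "blank" line that still holds a carriage return is not empty, so paragraph mode no longer sees it as a record separator. Since \r in a sed expression is a GNU extension rather than POSIX, a more portable way to strip the CRs is tr. A minimal sketch, assuming the fixed two-line prefix (cue number and timing) from the first awk example; vtt_text_crlf is a hypothetical helper name:

```shell
# Hypothetical helper (name is an assumption, not from the answer).
# tr -d '\r' deletes every carriage return before awk sees the data.
vtt_text_crlf() {
  tr -d '\r' < "$1" |
    awk -vRS= -vORS= -F'\n' 'NR > 1 { for (j = 3; j <= NF; j++) print $j " " }'
}

# Usage: vtt_text_crlf file.vtt
```

Note that tr removes every CR in the file, not only those at line ends, which is normally what you want for subtitle text.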