vtt files look like this:
WEBVTT

1
00:00:00.096 --> 00:00:05.047
you're the four functions if you would of
management first of all you have the planning

2
00:00:06.002 --> 00:00:10.079
the planning stages basically you were choosing appropriate
organizational goals and courses

3
00:00:11.018 --> 00:00:13.003
action to best achieve those goals
I need just the text, like this:
you're the four functions if you would of management first of all you have the planning the planning stages basically you were choosing appropriate organizational goals and courses action to best achieve those goals
on ubuntu I tried:
cat file.vtt | grep -v [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9][[:space:]][[:punct:]][[:punct:]][[:punct:]][[:space:]][0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]
that gives me:
WEBVTT

1
you're the four functions if you would of
management first of all you have the planning

2
the planning stages basically you were choosing appropriate
organizational goals and courses

3
action to best achieve those goals
but I can't figure out how to do the rest. what I want to replace is
\n[0-9]+\n\n
with space but I can't figure out how to make sed or grep do that.
How do I get, using basic / portable tools (e.g. ones generally preinstalled on Ubuntu, CentOS, etc., such as grep, sed, or tr), just the raw text with the subtitle timing removed, all on one line (no newlines)?
NOTE: this has to work for other languages' characters, like Chinese, Hindi, and Arabic, so preferably no [a-z] type matches; instead, remove the timing lines, which are very consistent in format. Also, don't blindly remove all numbers, as the text can contain numbers.
NOTE 2: the ultimate goal is to have the text safe for use as a JSON value, so all special chars removed and double quotes escaped, but that's sort of beyond the scope of this question.
Best Answer
Since your file appears to consist of a sequence of records separated by one or more blank lines, I'd suggest trying something based on the paragraph modes of either awk or perl.
For example, if you always need to strip off the first two lines of each record (the cue number and the timing line), you could split into newline-delimited fields within blank-separated paragraphs and skip the first two fields, using either awk or perl.
If you can't rely on there being a fixed number of fields (lines) to be removed, then it's fairly easy to add a regular expression test. This is a little easier in perl, since it allows us to grep directly on arrays rather than writing an explicit loop. For example, you could split into blank-separated records and then print only those fields (lines) having at least one sequence of at least 3 alphabetic characters.
If you want to exclude the WEBVTT string you can simply skip the first record. It will be down to you to choose a suitable regex that captures the wanted lines and excludes the unwanted ones. You can add an END block in either awk or perl if you want to add a final newline to the concatenated output.
NOTE: since (based on the discussion in comments) your files appear to have DOS-style CRLF line endings, you will need to deal with those, either by modifying the field and record separators in the above commands accordingly, or by stripping out the CRs first.