Bash command that outputs the result of a previous pipe

Tags: bash, pipe, sed, text-processing

I have a text file produced by a web spider. The spider extracts all sentences from a given list of URLs. I need to process this file, find all lines that contain more than 65 characters, and then determine the language of each of those lines. I have it working as a one-liner (my bash scripting skills are non-existent):

sed -n '/^.\{65\}/p' www.mbl.is | langid --line | grep is

langid is a Python module that identifies the language of a text and reports a score associated with how likely that language is. To install it, run:

pip install langid

or visit https://github.com/saffsd/langid.py for more information. What I need now is to print the lines piped into the langid command that are identified as 'is' (Icelandic), hence the grep. Below is a sample of the output of my current command:

('is', -288.34235095977783)
('is', -168.52833652496338)
('is', -255.30311250686646)
('is', -254.8700122833252)
('is', -664.7349543571472)
('is', -169.40936374664307)
('is', -315.0590629577637)
('is', -323.49001693725586)
('is', -281.2222490310669)
('is', -198.52733993530273)
('is', -152.1551775932312)
('is', -66.93532514572144)
('is', -231.61306524276733)
('is', -254.00042057037354)
('is', -322.7330708503723)
('is', -151.84487915039062)

EDIT: as per terdon's comment

Command:

sed -n '/^.\{65\}/p' www.mbl.is

Output:

Eftir stutt stopp i hofudborginni sem okkur heilt yfir leist agaetlega a var kominn timi a ad graeja visa fyrir Vietnam.    1
I gaer, paskadag, eyddum vid thvi deginum i ad koma okkur fyrir a Back Home, gerdum god kaup a Petaling Street (chinatown) og forum i paskaeggjaleit.   1
Vid, temmilega nyvoknud, stigum ut ur rutunni thar sem klassisku leigubilstjornarnir standa fyrir utan ad berjast um folk i bilana sina.    1
Vid forum med Boraj og Tino og leigdum okkur hljodeinangrad einkaherbergi med ollu innifoldu i klukkutima, fyrir taepa 20 dollara (1/4 af manadarlaunum theirra!) - fullt af bjor, starfsmadur med okkur allan timan og steiktar poddur i snakk med idyfum. 1
Vid ludarnir i "Good morning Vietnam" bolunum okkar umkringd moldriku folki klaett i italkst fra toppi og nidur.    1
Vid aetlum tho rett ad vona ad foreldrar okkar sjai ser faert ad geyma eins og eitt alvoru paskaegg handa hvoru okkar?  1
Hinsvegar var okkur bent a tyndu perluna, Mai Chau, sem hefur allt sem Sapa hefur upp a ad bjoda, nema thu dregur turismann fra.    1
Thetta var audvitad allt saman hreinasta lygi en vid letum okkur hafa thad og gistum eina nott a thessu annars agaeta hoteli.   1
Individual truth is constantly evolving, and a truth seeker must be willing to give up last week's major truth for whatever new discovery the innermost self reveals.   1
Um kvoldid forum vid svo oll saman ad borda vid mekong ana og attum mjog gott kvold saman.  1
Tha segja teir enn fremur ad bandarikjamenn hafi i raun verid ad reyna ad hindra frekari utbreidslu kommunisma i SA-Asiu, svo ad stridid var i raun bara einn stor misskilningur.   1

Command:

sed -n '/^.\{65\}/p' www.mbl.is | langid --line

Output:

('en', -193.52840971946716)
('en', -445.4644522666931)
('en', -158.1918339729309)
('en', -220.16202330589294)
('en', -596.61936211586)
('en', -379.3824007511139)
('en', -150.61454391479492)
('en', -379.3824007511139)
('en', -270.56594038009644)
('en', -446.9800910949707)
('en', -702.9869554042816)
('en', -208.84209847450256)
('en', -345.15056800842285)
('en', -321.2763195037842)
('en', -209.9769265651703)
('en', -144.31591272354126)
('en', -208.40711855888367)
('en', -161.14595460891724)
('en', -180.95807218551636)
('is', -151.84487915039062)
('en', -32.042465686798096)
('no', -73.23809719085693)
('lb', -194.81272649765015)
('et', -80.76274251937866)
('en', -129.17673206329346)
('en', -95.43238878250122)
('da', -30.086124420166016)

Is this possible in a one-liner, or would it be best to write a script? I can do it in Python, but its regex modules are painful, and I need to be able to change the character count quickly depending on the input file and switch the grep to different language codes easily. Plus, I thought this would be a good time to start my journey in bash scripting: bash commands are awesome, and I assume bash scripting is too (I've just got to get my head around the semantics and syntax, lots of $ signs).

Best Answer

You can do that with a while loop:

while IFS= read -r l; do
  [ "${#l}" -gt 65 ] && \
    printf '%s\n' "$l" | langid --line | grep -q "^('is'" && \
    printf '%s\n' "$l"
done <file

  • IFS= read -r l reads the input line by line and stores the current line in the variable $l. IFS= and -r keep leading whitespace and backslashes intact.
  • [ "${#l}" -gt 65 ] tests whether the line contains more than 65 characters. (The sed pattern in the question also keeps lines of exactly 65 characters; use -ge to match it.)
    • printf '%s\n' "$l" | langid --line | grep -q "^('is'" classifies the line and checks the language code at the start of langid's output. With -q, grep prints nothing; only its exit status is used.
    • printf '%s\n' "$l" prints the original line if the language code matched.
  • <file uses the contents of file as input.
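Since the question asks for an easy way to change the length and the language code, the loop above can be wrapped in a small function. This is only a sketch: the name filter_lang and its argument order are made up for illustration, and it assumes langid prints a tuple such as ('is', -151.84) for each input line, as in the sample output in the question.

```shell
#!/usr/bin/env bash
# filter_lang MIN LANG FILE   (hypothetical helper, not part of langid)
# Prints every line of FILE that is longer than MIN characters and that
# langid labels as LANG. Runs langid once per line, so it is slow.
filter_lang() {
  local min="$1" lang="$2" file="$3" line
  while IFS= read -r line; do
    # Skip lines that are not longer than MIN characters
    [ "${#line}" -gt "$min" ] || continue
    # Assumed langid output format: ('is', -151.84); match the quoted code
    if printf '%s\n' "$line" | langid --line | grep -q "^('$lang'"; then
      printf '%s\n' "$line"
    fi
  done < "$file"
}
```

It would be called as filter_lang 65 is www.mbl.is, and switching to another language or length only means changing the arguments.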

Edit: The above runs the langid command once per line, which is very slow. To classify all lines in a single pass (faster), use this:

awk 'FNR==NR{a[NR]=$0}
  FNR!=NR&&$1~"is"{print a[FNR]}' \
<(sed -n '/^.\{65\}/p' file) \
<(sed -n '/^.\{65\}/p' file | langid --line)

  • awk processes two "files" (both are bash process substitutions):
    • The output of sed -n '/^.\{65\}/p' file: all lines with 65 or more characters.
    • The output of sed -n '/^.\{65\}/p' file | langid --line: the langid result for each of those lines, produced in a single langid run.
  • Inside awk:
    • FNR==NR is true only while awk reads the first "file".
    • a[NR]=$0 stores each line in the array a, indexed by its line number.
    • FNR!=NR&&$1~"is" applies to the second "file" and checks whether the first field contains the string is.
    • print a[FNR] if so, prints the corresponding entry of the previously built array a, which holds the original sentence.
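
The same single-pass idea can also be sketched with paste, which glues each langid result to its sentence so that sed only has to run once. The function name lines_in_lang and the temp file are illustrative choices, not anything langid provides. The sketch assumes langid --line emits exactly one result per input line, in order, formatted like ('is', -151.84). As with the sed pattern above, it keeps lines of MIN or more characters.

```shell
#!/usr/bin/env bash
# lines_in_lang MIN LANG FILE   (hypothetical helper)
# Prints the lines of FILE with at least MIN characters that langid
# labels as LANG, using a single langid run for the whole file.
lines_in_lang() {
  local min="$1" lang="$2" file="$3" tmp
  tmp=$(mktemp) || return 1
  # Filter by length once and keep the result for pairing later
  sed -n "/^.\{$min\}/p" "$file" > "$tmp"
  # paste joins "langid result<TAB>sentence"; awk keeps the rows whose
  # first tab-separated field starts with ('LANG', and prints the rest
  langid --line < "$tmp" | paste - "$tmp" |
    awk -F '\t' -v want="('$lang'," \
      'index($1, want) == 1 { print substr($0, index($0, "\t") + 1) }'
  rm -f "$tmp"
}
```

For example, lines_in_lang 65 is www.mbl.is should give the same lines as the awk command above, under the stated assumptions about langid's output.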