Pipe – Read Until Pipe is Closed


I'm working on a homework assignment for an introductory Operating Systems course and am having quite a bit of fun, but also some confusion. I'm working on piping at the moment; my code is below.

Originally, my code looked like this:

    // Child process - write
    if (fork() == 0) {
        fprintf(stderr, "Child\r\n");
        close(1);
        dup(p[1]);
        close(p[0]);
        close(p[1]);
        runcmd(pcmd->left);
    // Parent process - read
    } else {
        wait(0);
        close(0);
        dup(p[0]);
        close(p[0]);
        close(p[1]);
        fprintf(stderr, "Parent\r\n");
        runcmd(pcmd->right);
    }

My thought process towards this was that the parent would wait until the child was terminated, then read from the pipe and that was it. I posted this code to my instructor on our discussion page and he told me that there were several issues with the code, one of which was:

  1. The parent process could hang infinitely if the child was running over a long enough input that it blocks the pipe.

He mentioned that the correct implementation therefore (in regards to wc), would be to use a blocking read command, which would wait on the pipe until data was available, and then begin reading until the pipe has closed.

I tried looking around for some way to "read" from the pipe the moment it had data in it, but was unsure of how to go about it. In the end, to avoid the possibility of waiting forever on a blocked pipe, I had the parent and child run in parallel, but that may mean that the reading process terminates first and does not read in all the data before the writer has finished. How would I go about addressing the issue?

    int p[2];
    pipe(p);
    // Child process - read
    if (fork() == 0) {
        fprintf(stderr, "Start child\r\n");
        close(0);
        dup(p[0]);
        close(p[0]);
        close(p[1]);
        fprintf(stderr, "Child\r\n");
        runcmd(pcmd->right);
    // Parent process - write
    } else {
        fprintf(stderr, "Start parent\r\n");
        close(1);
        dup(p[1]);
        close(p[0]);
        close(p[1]);
        fprintf(stderr, "Parent\r\n");
        runcmd(pcmd->left);
    }

Edit: I also tried the read command, but was unsure of how to actually use it since it requires the buffer, and also the expected size to read in (?). I'm uncertain of how to retrieve either of those when you don't know the size of the incoming data.

Best Answer

Piping is simple.  You’re making it hard on yourself by jumping into the pool at the deep end.  (Or perhaps it’s your instructor’s fault for not guiding you better.)

To become more comfortable with pipes, I suggest that you write two trivially simple programs:

  1. One that just writes some text to the standard output and exits.  It can be something simple — “The quick brown fox jumps over the lazy dog.”, “Lorem ipsum dolor sit amet, consectetur adipiscing elit, …”, a short string (maybe even a single character) repeated many times — whatever you want.  Use printf, write, fprintf(stdout, …), or whatever other function(s) you like.

    To test this program, just run it from a shell prompt.  It should display the chosen text and exit (return you to your shell prompt).

  2. And one that just reads text from the standard input and writes it to standard output.  Use getc, fgets, read, or whatever other function(s) you like (but not gets, which is unsafe and was removed from the language in C11).  Exit when you get end-of-file.  Check the man page for whatever function you use to see how it indicates end-of-file.

    To test this program, create a text file (called something like jon_file.txt) and put some text into it.  You can do this quickly by saying something like echo "Hello world" > jon_file.txt, or you can use an editor.  Then type prog2 < jon_file.txt.  It should display the contents of the file and exit (return you to your shell prompt).

Don’t call pipe, dup, or anything fancy — not even open or close.  (Do include whatever debugging and/or auditing code you want to ensure that you understand what is happening when.)  And then run prog1 | prog2.  If you’ve done it correctly, you’ll get the output you expect.
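If it helps to see the shape of the second program, here is a minimal sketch of its core using getc.  The function name copy_stream is my own invention for illustration, not anything from your assignment; a prog2 along these lines would just call it on stdin and stdout.

```c
#include <stdio.h>

/* Hypothetical helper for a prog2-style program: copy every byte from `in`
 * to `out` until end-of-file.  Returns the number of bytes copied,
 * or -1 on a read or write error. */
long copy_stream(FILE *in, FILE *out)
{
    long count = 0;
    int c;                      /* int, not char, so EOF stays distinguishable */

    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;          /* write error */
        count++;
    }
    /* getc returns EOF both at end-of-file and on error; tell them apart. */
    return ferror(in) ? -1 : count;
}
```

With that in place, main can be as small as `return copy_stream(stdin, stdout) < 0;`.  Note that nothing here knows or cares whether stdin is a file, a terminal, or a pipe.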

Now try to “break” it by adding sleep calls to the programs.  If you break it, let me know how you did it.  It should be almost impossible — unless you make one program (or both) sleep for longer than you’re willing to sit and wait, you’ll always get prog2 to output all the data that prog1 writes.

And in case the above example doesn’t make it clear: having the parent and child (or, in general, the processes on both sides of a pipe) run “simultaneously” is the right thing to do.[1]  The reading program won’t “terminate first” just because there is no data in the pipe currently.  As you should have learned from the above exercise, if a program tries to read from a pipe that has no data in it currently, the read system call will force the program to wait until data arrive.  The reading program won’t terminate until there are no data left in the pipe and no more coming, ever.[2]  (At this point, read will return an end-of-file.)  The “no more data coming ever” condition is indicated by the writing program closing the pipe (or exiting, which is equivalent, because exit calls close on all open file descriptors).  Note that every process holding a copy of the pipe’s write end has to close it before the reader sees end-of-file.
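On the question of what buffer and size to hand to read when you don’t know how much data is coming: you don’t need to know.  A rough sketch, assuming a helper I am calling copy_until_eof (a made-up name) and an arbitrary 4096-byte buffer, looks like this.  The size argument is only the most you are willing to accept per call; read tells you how much it actually delivered, and returns 0 at end-of-file.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: drain `in_fd` until end-of-file, copying to `out_fd`.
 * Returns the total number of bytes copied, or -1 on error. */
long copy_until_eof(int in_fd, int out_fd)
{
    char buf[4096];             /* any convenient size, unrelated to the data size */
    long total = 0;

    for (;;) {
        /* Blocks while the pipe is empty but a writer still has it open. */
        ssize_t n = read(in_fd, buf, sizeof buf);
        if (n == 0)
            break;              /* end-of-file: all write ends are closed */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal; just retry */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}
```

A reader built this way keeps going for as long as the writer keeps the pipe open, however slowly the data arrive.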

I don’t understand why you’re sweating the read system call at this point — although, if you don’t know how to use it yet, that confirms my suspicion that your instructor is presenting material out of logical order.  (I assume that you mean the read system call and not the read command.)  The only way your program makes sense is if runcmd(pcmd->right) is something that reads from standard input by some method (like our prog2 program, above).  It looks like your program is just doing the function of the shell — setting up the pipes, and then letting the programs run.  At that level, there’s no reason for your program (to the extent that you have shown it to us) to do any I/O (reading or writing).
__________
1 Related reading: In what order do piped commands run?
2 Of course this is an oversimplification.  As you will learn soon, if you haven’t already, you can design the reading program to terminate when there is no data in the pipe currently — but that’s not the default behavior.  Or you can design the reading program to terminate under any number of other conditions — e.g., if it reads a q from the pipe.  Or it could be killed by a signal, etc…


I’m looking back at this answer six months later, and I see that I really didn’t address the entire question;  I covered the second half, but not the first.  So, continuing from the above,

  1. Modify the first program to write a lot of data, at least 100,000 (10^5) or 102,400 (2^10 × 10^2) characters, to stdout.  Also, if you haven’t already done this, modify it to write some on-going status information to stderr.  This can be something very simple; e.g., one “.” to stderr for every 1000 (or 1024) characters to stdout, and “!\n” to stderr when it’s done.

    To test this, run prog1 > /dev/null.  If you followed my suggestion (above), you should see 100 dots (.), followed by ! and a newline.  If you don’t have any calls to sleep() or other time-consuming functions in prog1, this output should come fairly quickly.

    Then run prog1 | wc -c.  It should display your stderr status information, as mentioned above, followed by 100000 or 102400 or however many bytes you wrote to stdout.  (This will be the output from wc -c, reporting how many bytes it read from its stdin (the pipe).)

  2. Modify the second program to sleep 10 or 20 seconds before it starts reading.

    To test this, run prog2 < jon_file.txt again.  Obviously it should pause for the amount of time you specified in your sleep(), and then display the contents of the file and exit (return you to your shell prompt).
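For step 1, the modified writer might look roughly like the sketch below.  The name write_with_progress, the filler character 'x', and the 1000-byte chunk are all my own choices for illustration; a real prog1 would call it as write_with_progress(STDOUT_FILENO, STDERR_FILENO, 100000).  To keep it short, this version treats an interrupted write as an error.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical prog1 core: write `total` bytes of filler to out_fd,
 * one '.' to err_fd per 1000 bytes written, and "!\n" when finished.
 * Returns 0 on success, -1 on a write error. */
int write_with_progress(int out_fd, int err_fd, long total)
{
    char chunk[1000];
    memset(chunk, 'x', sizeof chunk);

    long sent = 0;
    while (sent < total) {
        long want = total - sent;
        if (want > (long)sizeof chunk)
            want = (long)sizeof chunk;

        /* On a pipe, this call waits whenever the pipe is full. */
        ssize_t n = write(out_fd, chunk, (size_t)want);
        if (n < 0)
            return -1;

        /* Emit one dot for each 1000-byte boundary just crossed. */
        long before = sent / 1000;
        sent += n;
        for (long d = before; d < sent / 1000; d++)
            if (write(err_fd, ".", 1) != 1)
                return -1;
    }
    return write(err_fd, "!\n", 2) == 2 ? 0 : -1;
}
```

Because the dots go to stderr, they show up on your terminal even when stdout is redirected or piped, which is what makes them useful as a progress indicator in the experiments here.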

Now run prog1 | prog2 > /dev/null.  But, before you do that, you might want to try to guess what will happen.

    ︙

    ︙

    ︙

I expect that it will print some dots — maybe 8, maybe 64 or 65, maybe some other number — and then the pause, and then the rest of the dots, and the !.  This is because prog1 can start writing immediately, even if prog2 isn’t reading yet.  The pipe can hold the data until prog2 is ready to start reading — but only up to a point.  The pipe has a buffering limit.  This may be 8000 (or 8192), 64000 (or 65536), or some other number.  When the pipe is full, the system will force prog1 to wait.  When prog2 starts reading, it drains the pipe; this makes room for the pipe to hold more data, and so prog1 is allowed to start writing again.
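If you are curious what the limit is on your system, one way to find out is to fill a pipe that nobody reads and count the bytes.  This is only a sketch; pipe_capacity is a name I made up, and the trick it relies on is putting the write end into non-blocking mode so that a full pipe produces an error (EAGAIN) instead of making the program wait forever.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: count how many bytes a fresh pipe accepts before
 * a writer would have to wait.  Returns -1 on error. */
long pipe_capacity(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    /* Non-blocking write end: a full pipe yields EAGAIN rather than
     * suspending us (nobody is reading, so we would never wake up). */
    int flags = fcntl(p[1], F_GETFL);
    if (flags < 0 || fcntl(p[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    long total = 0;
    int full = 0;
    char byte = 'x';
    for (;;) {
        ssize_t n = write(p[1], &byte, 1);
        if (n == 1) {
            total++;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;           /* interrupted by a signal; retry */
        full = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
        break;
    }

    close(p[0]);
    close(p[1]);
    return full ? total : -1;
}
```

Printing the result with something like printf("%ld\n", pipe_capacity()) should give a number in the same ballpark as the point where your dots pause.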

If you don’t see the above behavior at first, try increasing the numbers: 200,000 bytes, 30 seconds, etc.

So your teacher was partly right when he criticized the first draft of your program.  (Or, perhaps, he was exactly right, and you misquoted him.)  As you understand, that version of the program waited for the runcmd(pcmd->left) program (the pipe writer) to finish, and then it would start runcmd(pcmd->right) (the pipe reader).  But what if the left program outputs 100,000 bytes?  It will fill the pipe and then wait until it can write some more.  But it won’t be able to write more until “somebody” reads from the pipe and drains the storage buffer.  But the main program won’t start the pipe reader until the pipe writer has finished.  Everybody is waiting for somebody else to do something, which they won’t do until the first guy has done something.  (“I’ll give you the jewel as soon as you give me the money.”  /  “No, I’ll give you the money after you give me the jewel.”)  So, yeah; bottom line: if data stopped moving through the pipe because it was full and no process was reading from it, then both processes would hang infinitely.

This sort of situation is known casually, culturally, as a Catch-22.  In computer science, it is formally called a deadlock, informally called a deadly embrace.
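To avoid the deadlock, start both sides before waiting for either.  Here is a rough sketch of that ordering.  It is not your runcmd; run_pipeline is a hypothetical helper that takes two NULL-terminated argv arrays and uses execvp, and I have used dup2 where your code uses close followed by dup.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `left | right`.  Returns the reader's exit
 * status, or -1 if the pipeline could not be set up. */
int run_pipeline(char *const left[], char *const right[])
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    pid_t writer = fork();
    if (writer < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (writer == 0) {                  /* left side: stdout -> pipe */
        dup2(p[1], STDOUT_FILENO);
        close(p[0]);
        close(p[1]);
        execvp(left[0], left);
        _exit(127);                     /* only reached if exec failed */
    }

    pid_t reader = fork();
    if (reader == 0) {                  /* right side: stdin <- pipe */
        dup2(p[0], STDIN_FILENO);
        close(p[0]);
        close(p[1]);
        execvp(right[0], right);
        _exit(127);
    }

    /* The parent must close its copies too; otherwise the write end
     * stays open and the reader never sees end-of-file. */
    close(p[0]);
    close(p[1]);

    /* Only now, with both sides running, is it safe to wait. */
    int status = 0;
    waitpid(writer, NULL, 0);
    if (reader < 0 || waitpid(reader, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with the equivalent of head -c 200000 /dev/zero on the left and wc -c on the right, this should finish even though the writer produces more than a pipe can hold, because the reader is already draining the pipe while the writer runs.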
