Linux – Why is writing to a existing file faster than write a new empty file

filesfilesystemslinuxperformance

I use MappedByteBuffer to write file in linux.

File file = new File("testFile");
RandomAccessFile raf = new RandomAccessFile(file, "rw");
FileChannel fc = raf.getChannel();
MappedByteBuffer mbf = fc.map(FileChannel.MapMode.READ_WRITE, 0, file.length());
mbf.put(buffer);

If testFile has write 500MB, and write the 500MB data twice, it takes 1s.but when I rm testFile, write the 500MB data, it takes 4s.

Why override a file is faster than write a new file? How can I write a new file as faster as override a file?

Best Answer

Whether overwriting or creating a new file is faster depends on the filesystem type. Many filesystems overwrite file data in place; then overwriting is faster because it only requires writing the data, whereas creating a new file requires first allocating space and then writing the data in the newly allocated space. I wouldn't expect a large difference though. Some filesystems don't overwrite an existing block (to allow a write to be undone), and then overwriting an existing file is done by writing the new data followed by deleting the old data. I wouldn't expect a large difference in either case though.

The underlying layers can have similar effects to make one operation more costly than the other. For example, overwriting on a system that keeps snapshots keeps the old data around so that the snapshot can be restored. Flash media can only be erased in bulk so new data is written to free sectors but overwriting some data eventually does lead to it being freed which takes time.

By far the thing with the biggest effect on read and write timing is buffering and caching. Make sure that you're doing your benchmarks in a known cache configuration (you should probably flush the disk cache before starting each benchmarked operation) and ends with buffers all written (finish by calling sync) unless you want to measure warm-cache/buffers timings. For example, doing two consecutive writes where the first write only writes to memory buffers won't cost much more than doing a single write.

In any case, if it takes 4s to do the operation you want then it takes 4s. There's no magical way to make it 4 times faster.

Related Solutions

Linux – Why is dd using direct slower writing to disk than to a file

This difference undoubtedly comes down to one thing: caching.

It will be really difficult to pin down where, especially from userland, but all Linux kernels buffer (cache) filesystem writes, unless you perform the tricks to get synchronous writes. That is, the kernel will save the data dd sends to a file somewhere in kernel memory. The kernel probably uses file system code to do this. Some time in the future, the kernel will schedule a disk block to go out to the disk. That will happen "asynchronously", sometime after the kernel tells dd that the write finished.

The reason for this is that moving bytes over a bus and into a disk drive, and then on to the disk platters is much slower than even copying from user to kernel memory. Ordinarily, programs don't care too much that the data they just "wrote" won't make it to the disk for a while. Hardware reliability is high enough that the data makes it to platter almost always.

That's the simple answer, but once you've got reads/writes/deletes all buffered up in the kernel, the file system code can take advantage of short file lifetimes by never writing out the data of files that get deleted before they make it to disk. The file system code can group writes to take advantage of disk blocks larger than a group of writes and consolidate them into one write. There's tons of optimizations that can be done in most file systems.

Bash Grep Performance – Why ‘tac file | grep foo’ is Faster than ‘grep foo < <(tac file)'

The construction <(tac file) causes to shell to:

Create a pipe with a name
- On systems such as Linux and SysV which have /dev/fd, a regular pipe is used, and /dev/fd/<the-file-descriptor-of-the-pipe> is used as the name.
- On other systems, a named pipe is used, which requires creating an actual file entry on disk.
Launch the command tac file and connect it to one end of the pipe.
Replace the whole construction on the command line with the name of the pipe.

After the replacement, the command line becomes:

grep whatever < /tmp/whatever-name-the-shell-used-for-the-named-pipe

And then grep is executed, and it reads its standard input (which is the pipe), reads it, and searches for its first argument in that.

So the end result is the same as with...

tac file | grep whatever

...in that the same two programs are launched and a pipe is still used to connect them. But the <( ... ) construction is more convoluted because it involves more steps and may involve a temporary file (the named pipe).

The <( ... ) construct is an extension, and is not available in the standard POSIX bourne shell nor on platforms that do not support /dev/fd or named pipes. For this reason alone, because the two alternatives being considered are exactly equivalent in functionality, the more portable command | other-command form is a better choice.

The <( ... ) construction should be slower because of the additional convolution, but it's only in the startup phase and I don't expect the difference to be easily measurable.

NOTE: On Linux SysV platforms, < ( ... ) does not use named pipes but instead uses regular pipes. Regular pipes (indeed all file descriptors) can be referred to by the special named /dev/fd/<file-descriptor-number so that's what the shell uses as a name for the pipe. In this way it avoids creating a real named pipe with a bona fide temporary filename in the real filesystem. Although the /dev/fd trick is what was used to implement this feature when it originally appears in ksh, it is an optimization: on platforms that don't support this, a regular named pipe in the real filesystem is used as described above.

ALSO NOTE: To describe the syntax as <<( ... ) is misleading. In fact it's <( ... ), which is replaced with the name of a pipe, and then the other < character which prefixes the whole thing is separate from this syntax and it's the regular well-known syntax for redirecting input from a file.

Best Answer

Related Solutions

Linux – Why is dd using direct slower writing to disk than to a file

Bash Grep Performance – Why ‘tac file | grep foo’ is Faster than ‘grep foo < <(tac file)'

Related Question