Using File Date/Time as Metadata: Reliable?

filenames, files, filesystems, timestamps

Background: I have groups of files in their own directory which I merge into one file in order of their filename. I call them t1.txt, t2.txt, t3.txt... I merge them in order of the integer.

Situation: For various reasons, I want to get away from the filename as the metadata for later file merge ops.

Action: I'm thinking of moving to a file merging system that orders the file merge by the date/time of the file creation (obviously, I'll have to create the files in order of later merge).

Question:

  1. Will date/time sorted file merging be reliable? Are there hidden gotchas? Some of the files will be created only tenths of a second apart, or less. Is this an Achilles heel?

  2. Is there something different I should consider for ordering merges?

Date/time seems elementary to me. OTOH, what seems simple and straightforward at first often ends up being more complicated than envisioned. So I ask.

Best Answer

Most Unix systems don't track file creation times. They track a file's modification time, which is updated each time the file is written to. If the files are written sequentially when they are created (i.e. the first file is fully written before the second file is created) and not modified later, then the order of the modification times will be the same as the order of the file creations, but in more complex scenarios, this may not be the same.
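For illustration, here is a minimal Python sketch of ordering files by modification time. The function name `files_by_mtime` is made up for this example, and using the file name as a tie-breaker is my own assumption, not something the question specified.

```python
import os
from pathlib import Path

def files_by_mtime(directory):
    """Return the regular files in `directory`, oldest modification time first."""
    entries = [p for p in Path(directory).iterdir() if p.is_file()]
    # st_mtime_ns is the modification time as an integer number of nanoseconds.
    # The file name is only a tie-breaker (an assumption of this sketch), so
    # that files with identical timestamps still come out in a stable order.
    return sorted(entries, key=lambda p: (p.stat().st_mtime_ns, p.name))
```

Note that this orders by last write, not by creation, so it only matches the creation order under the conditions described above.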

In addition to the modification time (mtime), there are two other file timestamps on any Unix system: the access time (atime) and the inode change time (ctime). The access time is updated when the file is read, but some systems (in particular Linux by default) don't always update it for performance reasons. The inode change time is updated when some metadata about the file changes (name, permissions, etc.; also when the file is written to, but not when it's read, even if the atime changes). Neither the atime nor the ctime would be useful to you.
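If you want to see the three timestamps side by side, something like the following Python sketch should do. The helper name `file_timestamps` is hypothetical; the `st_*_ns` fields come from `os.stat`. On Unix, `st_ctime_ns` is the inode change time, not a creation time.

```python
import os

def file_timestamps(path):
    """Return the three classic Unix timestamps of `path` in nanoseconds."""
    st = os.stat(path)
    return {
        "mtime_ns": st.st_mtime_ns,  # last write to the file contents
        "atime_ns": st.st_atime_ns,  # last read (may be updated lazily)
        "ctime_ns": st.st_ctime_ns,  # last inode change on Unix (chmod, rename, write)
    }
```

Calling it before and after an `os.chmod` on the same file should show the mtime unchanged while the ctime moves forward, which is why the ctime says nothing about the order in which the contents were written.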

Many historical Unix systems tracked file timestamps with a resolution of one second. Modern Unix systems often have a better resolution, but this requires that several actors pay attention to it:

  • The kernel you're using must support this finer time resolution.
  • The filesystem must be able to store this finer time resolution.
  • Any component in the chain (e.g. NFS server for a file on NFS) must support this finer time resolution.
  • Any tool used to copy the files around (archiver, network synchronizer, …) must be able to preserve the finer time resolution, not just the seconds.
  • The application reading the file times must take sub-second resolution into account. Classic Unix programming interfaces don't support sub-second resolution on file timestamps, so applications need to use a relatively modern API (standardized in POSIX:2008, which is still relatively recent, as its adoption was not very fast).
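One rough way to check the storage part of that chain is to set a known sub-second mtime on a scratch file and read it back. This is only a sketch: the helper names `stored_mtime_ns` and `keeps_subsecond` and the probe value are invented for the example, and it tests just the filesystem holding `directory` as seen from this machine, not any copy tool or remote server further down the line.

```python
import os
import tempfile

def stored_mtime_ns(directory, wanted_ns):
    """Set `wanted_ns` as the mtime of a scratch file and return what is read back."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.utime(path, ns=(wanted_ns, wanted_ns))  # (atime_ns, mtime_ns)
        return os.stat(path).st_mtime_ns
    finally:
        os.remove(path)

def keeps_subsecond(directory):
    """True if a fractional-second mtime survives a write/read round trip."""
    probe = 1_700_000_000_123_456_789  # arbitrary instant with a non-zero fraction
    return stored_mtime_ns(directory, probe) % 1_000_000_000 != 0
```

A `False` result means the fraction was dropped somewhere between the call and the disk, so files created within the same second would be indistinguishable by mtime there.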

Even if everybody in the chain supports nanosecond timestamps, files will only have distinct timestamps if they're actually created more than one clock tick apart. Just because the kernel records nanoseconds doesn't guarantee that it will notice that more than one nanosecond has passed between the two file creations: reading the clock takes time, so it isn't done all the time. If you have a single thread opening the file, writing data and closing the file before moving on to the next file, then I think in practice any existing system that does record sub-second resolution will write different timestamps, but you are taking a small risk. (When different threads are writing to files, even with microsecond resolution, timestamp collisions are possible, but usually in that case you wouldn't be able to rely on the ordering for anything.)
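If you do go with timestamps, it seems prudent to check for collisions before merging rather than trust that none occurred. A minimal Python sketch, with a hypothetical helper name `mtime_collisions`:

```python
import os
from collections import defaultdict

def mtime_collisions(paths):
    """Return groups of paths that share exactly the same mtime (in nanoseconds)."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_mtime_ns].append(path)
    # Any group with more than one member has no order defined by its timestamps.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

An empty result only says the timestamps are distinct; it does not prove they reflect the order in which the files were created.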

So it's possible, and it's reliable as long as computers don't get much faster than they are now, provided that all the tools you use do support sub-second resolution. But you are at the mercy of a clock glitch or of a tool you hadn't reviewed for sub-second timestamp support. I recommend relying on file names: there's less that can go wrong.
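For comparison, merging by the integer in the file name (the t1.txt, t2.txt, t3.txt... scheme from the question) could look like the Python sketch below. The function names `numeric_key` and `merge_by_name`, the `t*.txt` glob, and the binary concatenation are assumptions made for the example.

```python
import re
from pathlib import Path

def numeric_key(path):
    """Extract the first integer in the file name, e.g. 't10.txt' -> 10."""
    match = re.search(r"\d+", Path(path).stem)
    if match is None:
        raise ValueError(f"no integer in file name: {path}")
    return int(match.group())

def merge_by_name(directory, output_path):
    """Concatenate t*.txt files from `directory` in numeric order."""
    # Sorting on the integer avoids the lexical trap where 't10' sorts before 't2'.
    parts = sorted(Path(directory).glob("t*.txt"), key=numeric_key)
    with open(output_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return [p.name for p in parts]
```

The output file should have a name that the glob does not match (or live in another directory), otherwise a second run would try to merge the previous result into itself. Zero-padded names such as t001.txt would also make a plain lexical sort give the same order.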
