Shell – How to manage a huge number of files in the shell

filesystems, python, shell

$ ls
./dir_with_huge_amount_of_files/errors/

Suppose a directory is full of pictures named with Unix timestamps, and I mean a lot: many GBs or even more. Shell commands like ls produce overflow-style warnings because they are not designed to work with millions (or more) of files. How can I manage such a huge number of files? If, for example, I want to find the picture in the middle (according to the timestamp in the name and the creation time), is there a file system that offers a built-in search feature? Which commands would you use?

I tried the convenient ls and find with the necessary flags, but they were either very slow or generated warnings, so I think I need either a better file system, a database, or something similar to pre-index the pictures. Basically, I need one array holding the inodes of the photos in chronological order. How can I do that? Metadata with Unix timestamps could be added later.

[Update]

There is a serious flaw in the current answers: people post sort-of-answers without empirical tests. Had they tested their suggestions, the suggestions would probably have failed. So I created a command-line tool that builds a sandbox with a huge number of files, so you can test your suggestions against, say, 1e7 files. Generating the files can take a long time, so be patient. If someone knows a quicker way to do this, please edit the code. Type python testFill.py --help to get the help. Have fun!

Usage example: creating many files spread across directories

$ ls ./data2
ls: ./data2: No such file or directory
$ python testFill.py -n 3 -d 7                                                 
$ tree data2/                                                                  
data2/
|-- 0
|   |-- 1302407302636973
|   |-- 1302407302638022
|   `-- 1302407302638829
|-- 1
|   |-- 1302407302639604
|   |-- 1302407302641652
|   `-- 1302407302642399
|-- 2
|   |-- 1302407302643158
|   |-- 1302407302645223
|   `-- 1302407302646026
|-- 3
|   |-- 1302407302646837
|   |-- 1302407302649110
|   `-- 1302407302649944
|-- 4
|   |-- 1302407302650771
|   |-- 1302407302652921
|   `-- 1302407302653685
|-- 5
|   |-- 1302407302654423
|   |-- 1302407302656352
|   `-- 1302407302656992
`-- 6
    |-- 1302407302657652
    |-- 1302407302659543
    `-- 1302407302660156

7 directories, 21 files

Code testFill.py

# Author: hhh
# License: ISC license

import os, time, optparse

DATA_DIR = "./data/"
DENSITY = 10**7  # start a new directory whenever the timestamp is a multiple of this

def createHugeAmountOfFiles(fileAmount, dirAmount):
    # Start at the first directory index that does not exist yet.
    counter = 0
    while os.path.exists(os.path.join(DATA_DIR, str(counter))):
        counter += 1

    for d in range(int(dirAmount)):
        for i in range(int(fileAmount)):
            # Microseconds since the epoch. Two files created within the same
            # microsecond get the same name, and the second overwrites the first.
            timeIt = int(time.time() * 1e6)
            if timeIt % DENSITY == 0:
                counter += 1

            do = os.path.join(DATA_DIR, str(counter))
            if not os.path.exists(do):
                os.makedirs(do)  # also creates ./data/ on the first run

            with open(os.path.join(do, str(timeIt)), 'w') as fh:
                fh.write("Automatically created file to test Huge amount of files.")
        counter += 1


def ls(subdir):
    # Print the file names found under ./data/<subdir>.
    for root, dirs, files in os.walk(os.path.join(DATA_DIR, subdir)):
        print(files)

def rm(subdir):
    # Remove every file under ./data/<subdir>; the directories are kept.
    for root, dirs, files in os.walk(os.path.join(DATA_DIR, subdir)):
        for name in files:
            os.remove(os.path.join(root, name))


def parseCli():
    parser = optparse.OptionParser()
    parser.add_option("-f", "--file", dest="filename",
                      help="Subdirectory of ./data whose files to list", metavar="DIR")
    parser.add_option("-n", "--number", dest="number", type="int",
                      help="Number of files to generate per directory", metavar="NUMBER")
    parser.add_option("-r", "--remove", dest="remove",
                      help="Subdirectory of ./data whose files to remove", metavar="DIR")
    parser.add_option("-d", "--dir", dest="dir", type="int",
                      help="Number of directories to generate", metavar="NUMBER")
    parser.add_option("-q", "--quiet",
                      action="store_false", dest="verbose", default=True,
                      help="don't print status messages to stdout")

    return parser.parse_args()

def main():
    (options, args) = parseCli()

    if options.filename:
        ls(options.filename)
    if options.number and options.dir:
        createHugeAmountOfFiles(options.number, options.dir)
    if options.remove:
        rm(options.remove)


if __name__ == "__main__":
    main()

Best Answer

Try a different shell. I'd recommend trying zsh, for instance, to see whether it accepts more arguments.

If I understand correctly, part of the filename is a UNIX timestamp. It might be advisable to divide the files into folders. If the date/time is a UNIX epoch number, group the files by ranges of that number, say blocks of 10000, and put each block in a separate folder.

If an ISO 8601 timestamp is part of the filename, simply divide by year, month or day.
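The folder split suggested above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the file names consist solely of an integer timestamp (as in the testFill.py sandbox), the destination is a different directory from the source, and the names bucket_name, bucket_files and the default block size of 10000 are hypothetical choices, not part of the original answer.

```python
import os
import shutil

def bucket_name(filename, bucket_size=10000):
    """Return the folder name for a file whose name is an integer timestamp."""
    timestamp = int(filename)
    # Round the timestamp down to the start of its block.
    return str(timestamp // bucket_size * bucket_size)

def bucket_files(src_dir, dst_dir, bucket_size=10000):
    """Move every timestamp-named file in src_dir into dst_dir/<block>/."""
    moved = 0
    # os.scandir yields entries lazily instead of building one huge list.
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if not entry.is_file() or not entry.name.isdigit():
                continue  # skip subdirectories and names that are not timestamps
            target = os.path.join(dst_dir, bucket_name(entry.name, bucket_size))
            os.makedirs(target, exist_ok=True)
            shutil.move(entry.path, os.path.join(target, entry.name))
            moved += 1
    return moved
```

Because every file is handled one at a time inside Python, no long argument list is ever passed to an external command. For microsecond timestamps like those in the sandbox, a much larger block size would probably be needed to keep the number of folders manageable.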

Related Solution: Python – Reading amount of tracks from an audio CD

cdparanoia

You can get a list of CD audio tracks using the command line tool, cdparanoia.

$ cdparanoia -sQ

Example

$ cdparanoia -sQ
cdparanoia III release 10.2 (September 11, 2008)


Table of contents (audio tracks only):
track        length               begin        copy pre ch
===========================================================
  1.    19497 [04:19.72]        0 [00:00.00]    OK   no  2
  2.    20253 [04:30.03]    19497 [04:19.72]    OK   no  2
  3.    29250 [06:30.00]    39750 [08:50.00]    OK   no  2
  4.    34263 [07:36.63]    69000 [15:20.00]    OK   no  2
  5.    28804 [06:24.04]   103263 [22:56.63]    OK   no  2
  6.    21582 [04:47.57]   132067 [29:20.67]    OK   no  2
  7.    33270 [07:23.45]   153649 [34:08.49]    OK   no  2
  8.    29919 [06:38.69]   186919 [41:32.19]    OK   no  2
  9.    29564 [06:34.14]   216838 [48:11.13]    OK   no  2
 10.    18993 [04:13.18]   246402 [54:45.27]    OK   no  2
 11.    35368 [07:51.43]   265395 [58:58.45]    OK   no  2
TOTAL  300763 [66:50.13]    (audio only)

You can count the number of lines that contain the string "OK" or that start with a track number, e.g. 1.:

$ cdparanoia -sQ |& grep OK | wc -l
11

-or-

$ cdparanoia -sQ |& grep -P "^\s+\d+\." | wc -l
11
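The same count can be taken from Python. The following is a minimal sketch, assuming the table of contents looks like the example output above; the names count_tracks and read_toc are hypothetical. Since the shell commands use |& to capture the listing, the sketch merges stderr into stdout as well.

```python
import re
import subprocess

# A track line starts with optional spaces, a number, and a dot, e.g. "  1. ".
TRACK_LINE = re.compile(r"^\s*\d+\.\s")

def count_tracks(toc_text):
    """Count the lines of a cdparanoia table of contents that describe a track."""
    return sum(1 for line in toc_text.splitlines() if TRACK_LINE.match(line))

def read_toc(command=("cdparanoia", "-sQ")):
    """Run cdparanoia and return its combined stdout and stderr as text."""
    result = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as |& in the shell commands
        universal_newlines=True,
    )
    return result.stdout
```

With a disc in the drive, count_tracks(read_toc()) should give the same number as the grep pipelines.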

cdda2wav

You can also use this command line tool, cdda2wav.

$ cdda2wav -J -D /dev/cdrom

Example

$ cdda2wav -J -D /dev/cdrom
Type: ROM, Vendor 'Optiarc ' Model 'DVD RW AD-7930H ' Revision '1.D1' MMC+CDDA
569344 bytes buffer memory requested, 4 buffers, 55 sectors
#icedax version 1.1.11, real time sched., soundcard, libparanoia support
AUDIOtrack pre-emphasis  copy-permitted tracktype channels
      1-11           no             yes     audio    2
Table of Contents: total tracks:11, (total time 66:50.13)
  1.( 4:19.72),  2.( 4:30.03),  3.( 6:30.00),  4.( 7:36.63),  5.( 6:24.04),
  6.( 4:47.57),  7.( 7:23.45),  8.( 6:38.69),  9.( 6:34.14), 10.( 4:13.18),
 11.( 7:51.43)

Table of Contents: starting sectors
  1.(       0),  2.(   19497),  3.(   39750),  4.(   69000),  5.(  103263),
  6.(  132067),  7.(  153649),  8.(  186919),  9.(  216838), 10.(  246402),
 11.(  265395), lead-out(  300763)
CDINDEX discid: gteZIAUq3lgZlEmNg_HnbWL6cJc-
CDDB discid: 0x950faa0b
CD-Text: detected
CD-Extra: not detected
Album title: 'Hour of the Olympics' [from Mary Pope Osborne]
Track  1: 'Viking Ships at Sunrise Ch 3'
Track  2: 'Viking Ships at Sunrise Ch 4'
Track  3: 'Viking Ships at Sunrise Ch 5-6'
Track  4: 'Viking Ships at Sunrise Ch 7-8'
Track  5: 'Viking Ships at Sunrise Ch 9-10'
Track  6: 'Hour of the Olympics Ch 1'
Track  7: 'Hour of the Olympics Ch 2-3'
Track  8: 'Hour of the Olympics Ch 4-5'
Track  9: 'Hour of the Olympics Ch 6-8'
Track 10: 'Hour of the Olympics Ch 9'
Track 11: 'Hour of the Olympics Ch 10'
No media catalog number present.
scanning for ISRCs: 11 ...
index scan: 11...

Again you can enlist the help of grep to get the number of tracks:

# traditional way to parse output
$ cdda2wav -J -D /dev/cdrom |& grep index | sed 's#.*: \(.*\)...#\1#g'
11

# use grep's PCRE capability
$ cdda2wav -J -D /dev/cdrom |& grep -Po '(?<=index scan: )\d+' | tail -1
11
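As an alternative to parsing the "index scan" line, a Python sketch could read the "total tracks" field from the table-of-contents line shown in the example output. It assumes that line keeps the format "total tracks:11"; the name total_tracks is hypothetical.

```python
import re

# Matches the "total tracks:11" field of the table-of-contents line.
TOTAL_TRACKS = re.compile(r"total tracks:\s*(\d+)")

def total_tracks(cdda2wav_output):
    """Return the track count reported by cdda2wav, or None if it is absent."""
    match = TOTAL_TRACKS.search(cdda2wav_output)
    return int(match.group(1)) if match else None
```

The function takes the captured output as a string, so it can be checked against saved output without a disc in the drive.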

