Ubuntu – How to count files with a particular extension, and the directories they are in

bash, command line, files, find, scripts

I want to know how many regular files have the extension .c in a large complex directory structure, and also how many directories these files are spread across. The output I want is just those two numbers.

I've seen this question about how to get the number of files, but I need to know the number of directories the files are in too.

  • My filenames (including directories) might have any characters; they may start with . or - and have spaces or newlines.
  • I might have some symlinks whose names end with .c, and symlinks to directories. I don't want symlinks to be followed or counted, or I at least want to know if and when they are being counted.
  • The directory structure has many levels and the top level directory (the working directory) has at least one .c file in it.

I hastily wrote some commands in the (Bash) shell to count them myself, but I don't think the result is accurate…

shopt -s dotglob
shopt -s globstar
mkdir out
for d in **/; do
     find "$d" -maxdepth 1 -type f -name "*.c" >> out/$(basename "$d")
done
ls -1Aq out | wc -l
cat out/* | wc -l

This outputs complaints about ambiguous redirects, misses files in the current directory, trips up on special characters (for example, redirected find output prints newlines in filenames), and writes a whole bunch of empty files (oops).

How can I reliably enumerate my .c files and their containing directories?


In case it helps, here are some commands to create a test structure with bad names and symlinks:

mkdir -p cfiles/{1..3}/{a..b} && cd cfiles
mkdir space\ d
touch -- i.c -.c bad\ .c 'terrible
.c' not-c .hidden.c
for d in space\ d 1 2 2/{a..b} 3/b; do cp -t "$d" -- *.c; done
ln -s 2 dirlink
ln -s 3/b/i.c filelink.c

In the resulting structure, 7 directories contain .c files, and 29 regular files end with .c (assuming dotglob is off when the commands are run). These are the numbers I want; if I've miscounted, please let me know.

Please feel free not to use this particular test.

N.B.: Answers in any shell or other language will be tested & appreciated by me. If I have to install new packages, no problem. If you know a GUI solution, I encourage you to share (but I might not go so far as to install a whole DE to test it) 🙂 I use Ubuntu MATE 17.10.

Best Answer

  • I haven't examined the output with symlinks but:

    find . -type f -iname '*.c' -printf '%h\0' |
      sort -z |
      uniq -zc |
      sed -zr 's/([0-9]) .*/\1 1/' |
      tr '\0' '\n' |
      awk '{f += $1; d += $2} END {print f, d}'
    
    • The find command prints the directory name of each .c file it finds.
    • sort | uniq -c gives us how many files are in each directory (the sort is needed: uniq only merges adjacent duplicates, and find can list a subdirectory's files between two files of its parent directory)
    • with sed, I replace the directory name with 1, thus eliminating all possible weird characters, with just the count and 1 remaining
    • enabling me to convert to newline-separated output with tr
    • which I then sum up with awk, to get the total number of files and the number of directories that contained those files. Note that d here is essentially the same as NR. I could have omitted inserting 1 in the sed command, and just printed NR here, but I think this is slightly clearer.
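The NR variant mentioned in the last bullet could look roughly like the sketch below. It assumes GNU find, sort, uniq and sed (for -printf and the -z options). The function name count_c is made up for illustration, and -name is used instead of -iname so that only a lowercase .c extension matches.

```shell
# Hypothetical helper (the name is not from the answer): prints "<files> <dirs>".
# Assumes GNU findutils, coreutils and sed.
count_c() {
    # "${1:-.}" is the directory to scan. A path starting with "-" should be
    # given as ./-path so that find does not read it as an option.
    find "${1:-.}" -type f -name '*.c' -printf '%h\0' |
        sort -z |
        uniq -zc |
        # Keep only the count; the directory name (newlines included) is dropped.
        sed -zE 's/^ *([0-9]+) .*/\1/' |
        tr '\0' '\n' |
        # One line per directory is left, so NR is the directory count.
        awk '{f += $1} END {print f + 0, NR}'
}
```

Called as count_c . from the top of the tree, it prints the two numbers on one line, files first.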

    Up until the tr, the data is NUL-delimited, safe against all valid filenames.
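Since only two numbers are wanted, one could also skip the conversion to newlines altogether and count NUL bytes instead. This is a minimal sketch assuming GNU find and GNU sort; the function names are hypothetical. Without -L, find does not descend into symlinked directories, and -type f does not match symlinks to files, so neither should be counted.

```shell
# Hypothetical helpers; assume GNU find and GNU sort.

# Count regular files named *.c: emit one NUL per match and count the bytes.
count_c_files() {
    find "${1:-.}" -type f -name '*.c' -printf '\0' | wc -c
}

# Count the directories holding them: print each parent directory
# NUL-terminated, deduplicate with sort -zu, then keep only the NULs.
count_c_dirs() {
    find "${1:-.}" -type f -name '*.c' -printf '%h\0' |
        sort -zu | tr -cd '\0' | wc -c
}

# Usage: prints "<files> <dirs>" for the current directory.
printf '%s %s\n' "$(count_c_files .)" "$(count_c_dirs .)"
```

The trade-off is that the tree is walked twice, which may matter on a very large structure.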


    With zsh and bash, you can use printf %q to get a quoted string, which would not have newlines in it. So, you might be able to do something like:

    shopt -s globstar dotglob nocaseglob
    printf "%q\n" **/*.c | awk -F/ '{NF--; f++} !c[$0]++{d++} END {print f, d}'
    

    However, even though ** is not supposed to expand for symlinks to directories, I could not get the desired output on bash 4.4.18(1) (Ubuntu 16.04).

    $ shopt -s globstar dotglob nocaseglob
    $ printf "%q\n" ./**/*.c | awk -F/ '{NF--; f++} !c[$0]++{d++} END {print f, d}'
    34 15
    $ echo $BASH_VERSION
    4.4.18(1)-release
    

    But zsh worked fine, and the command can be simplified:

    $ printf "%q\n" ./**/*.c(D.:h) | awk '!c[$0]++ {d++} END {print NR, d}'
    29 7
    

    D enables this glob to select dot files, . selects regular files (so, not symlinks), and :h prints only the directory path and not the filename (like find's %h) (See sections on Filename Generation and Modifiers). So with the awk command we just need to count the number of unique directories appearing, and the number of lines is the file count.
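To see what the awk part does in isolation, here is a small sketch fed with made-up directory names in place of the glob output. The function name is hypothetical.

```shell
# Hypothetical helper: reads one directory path per line (one line per file)
# and prints "<lines> <distinct lines>".
count_lines_and_unique() {
    # seen[$0]++ evaluates to 0 the first time a line appears, so
    # !seen[$0]++ is true exactly once per distinct line.
    awk '!seen[$0]++ {d++} END {print NR, d + 0}'
}

# Four files spread over three directories (sample data, not real output):
printf '%s\n' ./a ./a ./b ./a/c | count_lines_and_unique
# prints: 4 3
```

This only works because each input line is one whole directory path, which is why the quoting done by printf %q matters for names containing newlines.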
