How to improve this bash shell script for turning hardlinks into symlinks


This shell script is mostly the work of other people. It has gone through several iterations, and I have tweaked it slightly while also trying to fully understand how it works. I think I understand it now, but I don't have confidence to significantly alter it on my own and risk losing data when I run the altered version. So I would appreciate some expert guidance on how to improve this script.

The changes I am seeking are:

  1. make it even more robust to any strange file names, if possible. It currently handles spaces in file names, but not newlines. I can live with that (because I try to find any file names with newlines and get rid of them).
  2. make it more intelligent about which file gets retained as the actual inode content and which file(s) become symlinks. I would like to be able to choose to retain the file that either a) has the shortest path, b) has the longest path, or c) has the filename with the most alpha characters (which will probably be the most descriptive name).
  3. allow it to read the directories to process either from parameters passed in or from a file.
  4. optionally, write a log of all changes and/or all files not processed.

Of all of these, #2 is the most important for me right now. I need to process some files with it and I need to improve the way it chooses which files to turn into symlinks. (I tried using things like the find option -depth without success.)

Here's the current script:


# clean up known problematic files first.
## find /home -type f -wholename '*Icon*
## *' -exec rm '{}' \;

# Configure script environment
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
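set -o nounset
dir='/path/to/process'   # placeholder: the directory tree to process

# For each path which has multiple links
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# (except ones containing newline)
last_inode=
path_to_keep=
while IFS= read -r path_info
do
   #echo "DEBUG: path_info: '$path_info'"
   inode=${path_info%%:*}   # text before the first ':' is the inode number
   path=${path_info#*:}     # everything after it is the path
   if [[ $last_inode != "$inode" ]]; then
       # First path seen for this inode: keep it as the real file
       last_inode=$inode
       path_to_keep=$path
   else
       # Another link to the same inode: replace it with a symlink
       printf "ln -s\t'%s'\t'%s'\n" "$path_to_keep" "$path"
       rm "$path"
       ln -s "$path_to_keep" "$path"
   fi
done < <( find "$dir" -type f -links +1 ! -wholename '*
*' -printf '%i:%p\n' | sort --field-separator=: )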

# Warn about any excluded files
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
buf=$( find "$dir" -type f -links +1 -path '*
*' )
if [[ $buf != '' ]]; then
    echo 'Some files not processed because their paths contained newline(s):'$'\n'"$buf"
fi

exit 0

Best Answer


One simple change that keeps the script from failing on file names that start with - is to add -- (which means "all options have been given; only positional arguments follow") before the file name arguments, e.g.

rm -- "$path"
ln -s -- "$path_to_keep" "$path"

and so on.
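
As an illustration of why -- matters, here is a small sketch that creates a file named -f in a scratch directory and then tries to remove it with and without --. The scratch directory and the file name are made up for the demonstration; nothing outside the temporary directory is touched.

```shell
#!/usr/bin/env bash
# Illustration only: everything happens in a throwaway directory.
scratch=$(mktemp -d) || exit 1
cd "$scratch" || exit 1

# Create a file literally named "-f"; "--" keeps touch from parsing it as an option.
touch -- '-f'

# Without "--", rm takes "-f" as its own option and removes nothing.
rm '-f'
[ -e ./-f ] && echo 'still present after: rm -f'

# With "--", "-f" is an operand, so the file is removed.
rm -- '-f'
[ -e ./-f ] || echo 'removed by: rm -- -f'
```

Paths printed by find normally start with the directory you gave it, so the problem mostly shows up when that directory is a relative name beginning with a dash; adding -- costs nothing either way.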


To count alpha ("alphanumeric" is probably what you really want) characters in a file name you could do

numberofalnum=$(printf '%s' "$path" | tr -cd '[:alnum:]' | wc -m)

To count path depth, you could simply count the occurrences of '/' in the path. One caveat is that /home///daniel is equivalent to /home/daniel; find does not insert extra slashes of its own, so this is fine as long as the starting directory you pass to it contains no redundant slashes.

depth=$(printf '%s' "$path" | tr -cd / | wc -m)

One could also collapse repeated slashes by piping through tr -s / right after printf. Combining -s, -c and -d to do all of this in a single tr invocation does not seem to be possible.

In this case, since the script already uses find with -printf, you can simply add a :-separated %d field to the -printf format; find then prints each file's depth below the starting directory directly (as pointed out in a comment).
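
To tie these measurements to the second request in the question, here is a minimal sketch of a selection helper. The function name pick_keeper and the criterion names shortest, longest and alnum are invented for this example and are not part of the original script. It assumes "shortest" and "longest" refer to path depth (number of slashes), counts alphanumerics in the file name only, and lets the first path win a tie.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print whichever of the given paths should keep the data.
# Usage: pick_keeper <shortest|longest|alnum> path...
pick_keeper() {
    local criterion=$1 best='' best_score='' path score
    shift
    for path in "$@"; do
        case $criterion in
            shortest|longest)
                # Path depth = number of slashes in the path.
                score=$(printf '%s' "$path" | tr -cd '/' | wc -m) ;;
            alnum)
                # Alphanumeric characters in the file name (last component) only.
                score=$(printf '%s' "${path##*/}" | tr -cd '[:alnum:]' | wc -m) ;;
            *)
                printf 'unknown criterion: %s\n' "$criterion" >&2
                return 2 ;;
        esac
        score=$((score))   # normalise: some wc implementations pad with spaces
        if [[ -z $best_score ]] ||
           { [[ $criterion == shortest ]] && (( score < best_score )); } ||
           { [[ $criterion != shortest ]] && (( score > best_score )); }; then
            best=$path
            best_score=$score
        fi
    done
    printf '%s\n' "$best"
}

# Made-up paths, purely for demonstration.
pick_keeper shortest /srv/a/b/notes.txt /srv/notes.txt /srv/a/n.txt
pick_keeper longest  /srv/a/b/notes.txt /srv/notes.txt /srv/a/n.txt
pick_keeper alnum    /srv/a/x.txt /srv/a/holiday-photos-2010.txt
```

In the main loop, one way to use it would be to collect all paths that share an inode into an array, call the helper once the inode changes, and then symlink every other path in the group to the one it prints. Like the rest of the script, this does not cope with newlines in paths.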


To read directories as arguments from the command line, see this minimal snippet:

i=1
while [ $# -ne 0 ]; do
    printf -- 'Argument %d: %s\n' "${i}" "${1}"
    i=$((i + 1)); shift    # advance the counter and drop the argument just handled
done

($i is just a counter to show you what is happening)

If you wrap your logic in such a while loop, you can access the first argument as ${1}, then use shift which pops the first item off the argument list, and then iterate again and now ${1} is the originally second argument. Do this while the argument count $# is not 0.


To read the arguments from a file, wrap it instead like

i=1
while IFS= read -r line; do
    printf -- 'Argument %d: %s\n' "${i}" "${line}"
    i=$((i + 1))
done < "${1}"

Tip: instead of just increasing indent and wrapping the whole file logic that way, create functions of the current logic and call them at the end of the script. This will easily enable you to choose between either giving directories as arguments or reading them from a file without duplicating code in your script.
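
A rough sketch of what that tip could look like. process_dir is only a stand-in for the existing hardlink-to-symlink logic, and the -f FILE convention for naming a list file is an assumption made for this example, not something the original script defines.

```shell
#!/usr/bin/env bash
# Placeholder for the real hardlink-to-symlink logic; here it only reports.
process_dir() {
    printf 'Would process: %s\n' "$1"
}

# Hypothetical entry point: "-f FILE" reads one directory per line from FILE,
# otherwise every argument is taken as a directory.
main() {
    local dir
    if [ "${1:-}" = '-f' ]; then
        [ $# -ge 2 ] || { echo 'usage: main -f FILE' >&2; return 2; }
        while IFS= read -r dir; do
            # Skip empty lines in the list file.
            [ -z "$dir" ] || process_dir "$dir"
        done < "$2"
    else
        for dir in "$@"; do
            process_dir "$dir"
        done
    fi
}

main "$@"
```

Saved as a script, it could be called either with directories as arguments or with -f followed by a list file; both routes end up in the same process_dir function, so the logic exists only once.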



To write a log of the changes, add lines such as

printf 'My descriptive log message for path %s\n' "${path}" >> "${logfile}"

in the logic blocks where you decide whether or not to take action. Set $logfile to the desired log path earlier in the script.
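
If the same kind of message is written from several places, a small helper keeps the calls short. This is only a sketch: the function name log, the timestamp format, the example messages, and the fallback to a temporary file when $logfile is unset are all choices made for this example.

```shell
#!/usr/bin/env bash
# Assumed log location: use $logfile if already set, else a temporary file.
logfile=${logfile:-$(mktemp)}

# Hypothetical helper: append one timestamped line to $logfile.
log() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$logfile"
}

# Example calls with made-up paths.
log "replaced with symlink: /some/path -> /some/other/path"
log "skipped (path contains newline): /another/path"
```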
