How to remove duplicated files in a directory

bash

I downloaded a lot of images in a directory.
Downloader renamed files which already exist.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- manually renamed `b(3).jpg`
c.jpg
c(2).jpg
world.jpg      <-- manually renamed `d.jpg`
d(2).jpg
d(3).jpg

How to remove duplicated ones? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

note: name doesn't matter. I just want uniq files.

Best Answer

bash 4.x

#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
  [[ -f "$file" ]] || continue
   
  read cksm _ < <(md5sum "$file")
  if ((arr[$cksm]++)); then 
    echo "rm $file"
  fi
done

This is both recursive and handles any file name. Downside is that it requires version 4.x for the ability to use associative arrays and recursive searching. Remove the echo if you like the results.

gawk version

gawk '
  {
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
      cmd="echo rm " q FILENAME q
      system(cmd)
      close(cmd)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double-quotes in their name. No real way to get around that with awk. Remove the echo if you like the results.