Gary23b

How to De-Duplicate Your Personal Files

10/16/2024

A small and useful golang program that you can follow along with and code yourself

(Image: AI generated pixel cats)

Where did all these duplicates come from?

The duplicates grew one day at a time. One 'event' at a time.

You need some files on another computer, but you aren't sure which ones exactly. So, you do the easy thing. You copy a bunch and say to yourself, "I will just delete them later". During the project some of the files get updated. Of course you didn't keep notes of exactly which files you changed. It was due tomorrow, it was late, nobody does that even if it isn't late. Then you forget about that jump-drive for 6 months. When you find it again... You just copy everything into a new "jump-drive to sort" folder and delete everything off the jump-drive. "I will have to deal with it later" you say to yourself.

You get a new computer and copy all your documents over to the cool new machine. You still use the old one once in a while though. Maybe you use it as a media center, maybe you let your wife have it, maybe your kids play games on it. After 5 or so years the old computer finally dies. The hard-drive is still good though. You know there are some new files that you want to save, but where are they specifically... Well, your new computer has 2TB of storage, so the quick thing for now is to copy everything over again. "Old HP Tower duplicate documents".

You start dating someone and start sharing photos. Maybe you set up a shared DropBox folder. A couple years later you get married. Eventually the old laptop is dying. You now have a ton of duplicated photos, music, and movies. The file structures are different though. "Mary's laptop to sort".

This is how it happens, at least in my case. It is very easy to create a copy of a file, a folder, a drive. But merging them back together is a herculean task.

Just Go Use Some App

Yeah, I am sure there is some app you can find online. Just pay $10.99 per month and all your dreams will come true. Better yet, just use a Chromebook, keep all your photos and music synced with Apple, and only do office documents online with O365. I am sure it will be utopia. If you do most of your "computing" on your phone, then this article obviously isn't for you.

The transition to subscription-based software is strange to watch. It isn't even half over, either. But that isn't the topic for today.

Also, I think there need to be more tutorials online that show you, step by step, how to do or make something that is actually useful. I remember there used to be books for programming your Commodore 64. If you followed along, by the end you had a fully functional game. You didn't just end up with some classes about circles, squares, and bicycles.

I guess part of the trouble is that the number of lines required to create something of interest has gone up exponentially. Hand-making a text editor can't compete with just using Notepad++ or VS-Code, which has a million-and-one "features" already bulging out of its reanimated corpse.

There is also a trust problem. People use programs from big name companies because they are big name companies. I for one wouldn't be willing to trust an *.exe from "joesCoolDeDuplicationSoftware.com", let alone trust the website with my credit card number.

Even for the ones that I do know are useful, I don't want to pay for them. I used Beyond Compare at a previous job and know it is better than the likes of WinMerge, but I still haven't bought a copy. I guess in a way, FOSS pulls the ladder up so that the next generation of programmers has a harder time making money selling software products directly to end users. How do we solve this? No clue. What am I trying to say? Just rambling.

First Step

Enough rambling, let's build something.

First, create a new directory and call it DeDupe. Open the folder in a terminal and use the following go command:

go mod init dedupe

Now open the directory with VS-Code.

Create a file called util.go. We are going to put some helper functions in it.

First we need a function to convert a path into a filename safe string.

var pathToFilenameReplacer = strings.NewReplacer(":", "_", "/", "_", "\\", "_", ".", "_")

func ConvertPathToFilename(path string) string {
	//desiredPath := "D:/Documents/Photos"
	path = pathToFilenameReplacer.Replace(path)
	return path
}

Create another file called util_test.go and we will add a test.

func TestConvertPathToFilename(t *testing.T) {
	type args struct {
		path string
	}
	tests := []struct {
		name string
		args args
		want string
	}{
		// TODO: Add test cases.
		{name: "1", args: args{path: "D:\\bacon/time.bin"}, want: "D__bacon_time_bin"},
		{name: "2", args: args{path: "D:\\bacon"}, want: "D__bacon"},
	}
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			got := ConvertPathToFilename(tt.args.path)
			require.Equal(t, tt.want, got)
		})
	}
}

Pretty simple, but it works.
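
If you want to run just this one test from the project root, the standard -run flag works (you may need a go mod tidy first so the testify dependency gets pulled in):

go test -run TestConvertPathToFilename ./...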

Now, back in util.go we need a simple way to figure out if a file exists.

func DoesFileExist(filePath string) bool {
	_, err := os.Stat(filePath)
	if err == nil {
		return true
	}
	missing := errors.Is(err, os.ErrNotExist)
	return !missing
}

You give it a path, it returns a bool. Easy.
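
Since util_test.go already exists, we can drop a quick test for this in there too. A minimal sketch, reusing the testify require package from the earlier test; it also needs os and path/filepath imported, and uses t.TempDir so nothing is left behind:

func TestDoesFileExist(t *testing.T) {
	dir := t.TempDir()

	// A file we just wrote should exist.
	existing := filepath.Join(dir, "real.txt")
	require.NoError(t, os.WriteFile(existing, []byte("hi"), 0o644))
	require.True(t, DoesFileExist(existing))

	// A made-up name in the same temp directory should not.
	require.False(t, DoesFileExist(filepath.Join(dir, "missing.txt")))
}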

Now the next part is key, but it is still just a utility function. The easiest way to know if two files are the same is to check whether they have the same hash. Golang has great hashes built in, so we can just use them. In this case I used SHA256. I also convert the hashes to strings for ease of use elsewhere in the code.

func HashFile(filePath string) []byte {
	f, err := os.Open(filePath)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		panic(err)
	}

	ret := h.Sum(nil)
	return ret
}

func HashToBase64(byteHash []byte) string {
	return base64.StdEncoding.EncodeToString(byteHash)
}
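
As a sanity check, we can test HashFile against hashing the same bytes directly with the standard library. A minimal sketch, again for util_test.go, assuming the crypto/sha256, os, and path/filepath imports:

func TestHashFile(t *testing.T) {
	// Write a small file, hash it, and compare against sha256.Sum256 of the same bytes.
	path := filepath.Join(t.TempDir(), "sample.txt")
	content := []byte("hello dedupe")
	require.NoError(t, os.WriteFile(path, content, 0o644))

	want := sha256.Sum256(content)
	require.Equal(t, HashToBase64(want[:]), HashToBase64(HashFile(path)))
}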

The last utility we are going to add to this file is a way to save and load golang data structures. Hashing files takes time. But if we assume that a file hasn't changed when its path, size, and last-modified time are all the same, then we don't need to re-compute the hash. That means saving the path, size, time, and hash information to a file. Of course you could use JSON or the like, but I got used to being able to just pickle things with Python when a script was only going to be talking to a future run of itself. Golang has a similar feature called gob. And we can use gob with generics to save any struct easily, as long as interfaces aren't involved.

func SaveGob[T any](data T, filePath string) error {
	file, err := os.Create(filePath)
	if err != nil {
		return err
	}
	defer file.Close()

	enc := gob.NewEncoder(file)
	err = enc.Encode(&data)
	if err != nil {
		return err
	}

	return nil
}

func LoadGob[T any](filePath string) (T, error) {
	file, err := os.Open(filePath)
	if err != nil {
		var ret T
		return ret, err
	}
	defer file.Close()

	var ret T

	dec := gob.NewDecoder(file)
	err = dec.Decode(&ret)
	if err != nil {
		var ret T
		return ret, err
	}

	return ret, nil
}

With these two functions, you can save a structure to a file and load it again the next time the program runs.
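
A round-trip test makes sure the two stay in sync. A minimal sketch for util_test.go, using a plain map so no extra types are needed:

func TestSaveAndLoadGob(t *testing.T) {
	// Save a small map to a gob file in a temp directory, then load it back.
	path := filepath.Join(t.TempDir(), "cache.data")
	original := map[string]int{"a": 1, "b": 2}

	require.NoError(t, SaveGob(original, path))

	loaded, err := LoadGob[map[string]int](path)
	require.NoError(t, err)
	require.Equal(t, original, loaded)
}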

Force the golang extension to update the import statements by manually saving each file.

Then go to the console and run go mod tidy, go build ./..., and go test ./... to make sure everything compiles and passes. If there are no problems, we can move on.

Traverse The Files

The next thing to do is to take a starting directory and to recursively find everything within it that is of interest.

Create a new file called getFiles.go.

The first thing we need is a structure to hold information about each file we find. It will need the file's full path, its size, its last-modified time, and its hash. We can turn that into a structure with the following code:

type FileInfo struct {
	Path         string
	Size         int
	ModifiedTime time.Time
	Hash         string // base64 encoding of SHA256
}

Now, in its simplest form, a function walking through the directories is pretty small and can be done with the following code:

func GetAllFilesInDirRecursive(
	targetDirPath string,
) map[string]*FileInfo { // a map with the file's full path as the key. It holds a pointer to a FileInfo struct.
	directoriesToSearch := []string{targetDirPath}
	results := make(map[string]*FileInfo, 10000)
	for len(directoriesToSearch) > 0 {
		currentDirectory := directoriesToSearch[len(directoriesToSearch)-1]
		directoriesToSearch = directoriesToSearch[:len(directoriesToSearch)-1]
		entries, err := os.ReadDir(currentDirectory)
		if err != nil {
			log.Printf("ERROR: %s, %v\n", currentDirectory, err)
			continue
		}
		for _, entry := range entries {
			newPathPart := entry.Name()
			fullPath := filepath.Join(currentDirectory, newPathPart)

			isDir := entry.IsDir()
			if isDir {
				directoriesToSearch = append(directoriesToSearch, fullPath)
			} else {
				info, err := entry.Info()
				if err != nil {
					log.Printf("ERROR: %s, %v\n", fullPath, err)
					continue
				}
				size := int(info.Size())
				modifiedTime := info.ModTime()

				// Hash the file (no caching yet, so this runs every time)
				hash := HashToBase64(HashFile(fullPath))

				newFile := FileInfo{
					Path:         fullPath,
					Size:         size,
					ModifiedTime: modifiedTime,
					Hash:         hash,
				}
				results[newFile.Path] = &newFile
			}
		}
	}

	return results
}

This has a couple of problems. The code doesn't make sure the paths are uniform and consistent. But the biggest problem is that every time you run it, you have to hash all the files again. That can take hours. You will also find that it is lacking features. It is much nicer to be able to exclude things like hidden files, zero-size files, and even specific directories.

Let's make a config structure to pass in to solve these problems.

type GetAllFilesInDirRecursiveConfig struct {
	TargetDirPath        string
	ExcludeZeroSizeFiles bool
	ExcludeHidden        bool
	FoldersToExclude     []string
	RegExExcludes        []*regexp.Regexp
	PrevFileInfo         map[string]*FileInfo // use the result from the last run to make it so the hash does not need to be computed again.
}

This struct even has a spot for arbitrary regular expression matching. I found that helpful for excluding album art files created by iTunes.
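
For example, somewhere inside a function a config that skips hidden files and those album-art thumbnails might look like this (the path and pattern here are just placeholders; the real config I use shows up in main.go later):

cfg := GetAllFilesInDirRecursiveConfig{
	TargetDirPath: "D:/Documents/Music",
	ExcludeHidden: true,
	RegExExcludes: []*regexp.Regexp{
		regexp.MustCompile(`.*AlbumArt_.*\.jpg`),
	},
}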

We can update the function to be this instead:

func GetAllFilesInDirRecursive(
	config GetAllFilesInDirRecursiveConfig,
) map[string]*FileInfo { // a map with the file's full path as the key. It holds a pointer to a FileInfo struct.
	// Clean the desired target directory
	var err error
	config.TargetDirPath, err = filepath.Abs(config.TargetDirPath)
	if err != nil {
		panic(err)
	}

	// clean the folders to exclude:
	foldersToExclude := make(map[string]bool)
	for _, p := range config.FoldersToExclude {
		p, err = filepath.Abs(p)
		if err != nil {
			panic(err)
		}
		foldersToExclude[p] = true
	}

	// if PrevFileInfo is nil, make it so we don't need a nil check later.
	if config.PrevFileInfo == nil {
		config.PrevFileInfo = make(map[string]*FileInfo)
	}

	fmt.Println(config.TargetDirPath)
	directoriesToSearch := []string{config.TargetDirPath}
	results := make(map[string]*FileInfo, 10000)

	for len(directoriesToSearch) > 0 {
		currentDirectory := directoriesToSearch[len(directoriesToSearch)-1]
		directoriesToSearch = directoriesToSearch[:len(directoriesToSearch)-1]
		entries, err := os.ReadDir(currentDirectory)
		if err != nil {
			log.Printf("ERROR: %s, %v\n", currentDirectory, err)
			continue
		}
		for _, entry := range entries {
			newPathPart := entry.Name()
			fullPath := filepath.Join(currentDirectory, newPathPart)

			if config.ExcludeHidden && strings.HasPrefix(newPathPart, ".") {
				// fmt.Println("skipping hidden item", fullPath)
				continue
			}

			// A plain "continue" here would only skip to the next regex, not the
			// next directory entry, so track the match and skip the entry below.
			excludedByRegEx := false
			for _, re := range config.RegExExcludes {
				if re.FindString(fullPath) != "" {
					// fmt.Println("RE found: ", fullPath)
					excludedByRegEx = true
					break
				}
			}
			if excludedByRegEx {
				continue
			}

			isDir := entry.IsDir()
			if isDir {
				if foldersToExclude[fullPath] {
					// fmt.Println("skipping excluded folder", fullPath)
					continue
				}
				directoriesToSearch = append(directoriesToSearch, fullPath)
			} else {
				info, err := entry.Info()
				if err != nil {
					log.Printf("ERROR: %s, %v\n", fullPath, err)
					continue
				}
				size := int(info.Size())
				if config.ExcludeZeroSizeFiles && size == 0 {
					// fmt.Println("skipping zero size file", fullPath)
					continue
				}
				modifiedTime := info.ModTime()

				// Try to get the hash from the PrevFileInfo
				var hash string
				prevEntry, ok := config.PrevFileInfo[fullPath]
				isPerfectMatch := ok && size == prevEntry.Size && modifiedTime.Equal(prevEntry.ModifiedTime)
				if isPerfectMatch {
					// fmt.Println("prev hash found", fullPath)
					hash = prevEntry.Hash
				} else {
					hash = HashToBase64(HashFile(fullPath))
				}

				newFile := FileInfo{
					Path:         fullPath,
					Size:         size,
					ModifiedTime: modifiedTime,
					Hash:         hash,
				}
				results[newFile.Path] = &newFile
			}
		}
	}

	return results
}

This is the same skeleton as the previous version, but with additional checks for each of the exclusions. The paths are also normalized with filepath.Abs and filepath.Join (which cleans each path it builds). And the final piece, the one that speeds up every run after the first, is re-using previously generated hashes whenever the path, size, and modified time all match.

Main

Create a new directory called cmd. Inside that, create another directory called find. Then, inside find, create a file called main.go.

Inside we put most of it together.

package main

import (
	"fmt"
	"regexp"

	"dedupe"
)

func main() {
	desiredPath := `D:\Documents\Music`
	dataCacheFilePath := dedupe.ConvertPathToFilename(desiredPath) + ".data"

	var err error
	var oldFilesMap map[string]*dedupe.FileInfo
	if dedupe.DoesFileExist(dataCacheFilePath) {
		fmt.Println("Saved cache found:", dataCacheFilePath)
		oldFilesMap, err = dedupe.LoadGob[map[string]*dedupe.FileInfo](dataCacheFilePath)
		if err != nil {
			panic(err)
		}
	}

	config := dedupe.GetAllFilesInDirRecursiveConfig{
		TargetDirPath:        desiredPath,
		ExcludeZeroSizeFiles: true,
		ExcludeHidden:        true,
		PrevFileInfo:         oldFilesMap,
		FoldersToExclude: []string{
			"D:\\System Volume Information",
			"D:\\$RECYCLE.BIN",
			"D:\\Downloads",
			"D:\\Program Install Files",
		},
	}
	config.RegExExcludes = append(config.RegExExcludes, regexp.MustCompile(`.*AlbumArt_.*\.jpg`))
	config.RegExExcludes = append(config.RegExExcludes, regexp.MustCompile(`.*albumart_.*\.jpg`))
	config.RegExExcludes = append(config.RegExExcludes, regexp.MustCompile(`.*\\cmd\.bat`))
	config.RegExExcludes = append(config.RegExExcludes, regexp.MustCompile(`.*\\py\.bat`))

	fileInfoMap := dedupe.GetAllFilesInDirRecursive(config)

	err = dedupe.SaveGob(fileInfoMap, dataCacheFilePath)
	if err != nil {
		panic(err)
	}
}
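
From the project root you can run it with the usual go command (note that the .data cache file is written to whatever directory you run from):

go run ./cmd/find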

I would suggest running this on a small directory first. On my PC, the speed is limited by the speed of my hard drive. Once it has completed, if you run the program again, it should finish almost immediately because the hashes are loaded from the cache file.

So... This program doesn't do what we want. It doesn't output information about duplicates yet. But... All the information is there and ready. We just have to crunch it.

DeDupe

Make a new file in the top level called deDupe.go.

We need to keep track of all the hashes in a map. Keeping track of what directories have duplicates will also be useful.

type DeDupe struct {
	DirPathCounts map[string][]*FileInfo // a map with the file's parent directory as the key. Files only get added to this list if they are found to be a SHA duplicate
	FileHashMap   map[string][]*FileInfo // a map with the SHA hash as the key. each file with the key will be added to the array
	FilePathMap   map[string]*FileInfo   // a map with the file's full path as the key. It holds a pointer to a FileInfo struct.
}

Next is a factory function for building the struct.

func CreateDeDupeStruct() *DeDupe {
	expectedLength := 100000
	ret := &DeDupe{
		DirPathCounts: make(map[string][]*FileInfo, expectedLength),
		FileHashMap:   make(map[string][]*FileInfo, expectedLength),
		FilePathMap:   make(map[string]*FileInfo, expectedLength),
	}
	return ret
}

And finally, we take in the file information generated above and fill out the DeDupe struct.

func (s *DeDupe) AddFiles(files map[string]*FileInfo) {
	for _, file := range files {
		// Skip file paths we already have loaded
		_, ok := s.FilePathMap[file.Path]
		if ok {
			continue
		}

		s.FilePathMap[file.Path] = file

		foundHash, hashAlreadyPresent := s.FileHashMap[file.Hash]
		s.FileHashMap[file.Hash] = append(foundHash, file)
		
		if hashAlreadyPresent {
			// Add a duplicate count for the new file
			basePath := filepath.Dir(file.Path)
			s.DirPathCounts[basePath] = append(s.DirPathCounts[basePath], file)

			// If there was only 1 already in the hash array, then add a count for its directory as well
			if len(foundHash) == 1 {
				dirPath := filepath.Dir(foundHash[0].Path)
				s.DirPathCounts[dirPath] = append(s.DirPathCounts[dirPath], foundHash[0])
			}
		}
	}
}
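
The bookkeeping in AddFiles is the one subtle part: the first copy of a file only gets counted against its directory once a second copy shows up. A small sketch of a test for that, using made-up paths and hashes (it could live in a new deDupe_test.go and reuses the require package and path/filepath):

func TestAddFiles(t *testing.T) {
	a := &FileInfo{Path: filepath.Join("dirA", "x.bin"), Size: 3, Hash: "h1"}
	b := &FileInfo{Path: filepath.Join("dirB", "y.bin"), Size: 3, Hash: "h1"}

	dup := CreateDeDupeStruct()
	dup.AddFiles(map[string]*FileInfo{a.Path: a, b.Path: b})

	// Both copies end up under the same hash key...
	require.Len(t, dup.FileHashMap["h1"], 2)
	// ...and each parent directory gets exactly one duplicate entry.
	require.Len(t, dup.DirPathCounts[filepath.Dir(a.Path)], 1)
	require.Len(t, dup.DirPathCounts[filepath.Dir(b.Path)], 1)
}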

Find The Largest Duplicates

Back in your main file, add the following to the end of the function:

	dup := dedupe.CreateDeDupeStruct()
	dup.AddFiles(fileInfoMap)

	FindLargestFile(dup)

Then add the FindLargestFile function to sort through the found duplicates and print out the largest ones.

func FindLargestFile(dup *dedupe.DeDupe) {
	type duplicateSearchStruct struct {
		file *dedupe.FileInfo
		size int
	}

	y := make([]duplicateSearchStruct, 0, len(dup.FilePathMap))
	for _, b := range dup.FilePathMap {
		hashList := dup.FileHashMap[b.Hash]
		if len(hashList) > 1 {
			y = append(y, duplicateSearchStruct{file: b, size: b.Size})
		}
	}

	sort.Slice(y, func(i, j int) bool {
		return y[i].size > y[j].size
	})

	fmt.Print("\n\n########## FOUND RESULTS ##########\n\n")
	for i := 0; i < len(y) && i < 100; i++ {
		fmt.Printf("%9d, %s\n", y[i].file.Size/1024/1024, y[i].file.Path)
	}
}

Running this will print out the 100 largest files that have at least one duplicate in the chosen directory tree. The size is in megabytes.

Find The Folder With The Most Duplicates

Finding the folder with the most duplicates is pretty easy. Things only get complicated when we take that top result (the folder with the most duplicated files) and try to find which other folders its duplicates live in.

func FindFolderWithMostDuplicates(dup *dedupe.DeDupe) {
	type mostDuplicateFolderStruct struct {
		path  string
		files []*dedupe.FileInfo
		count int
	}

	y := make([]mostDuplicateFolderStruct, 0, len(dup.DirPathCounts))
	for a, b := range dup.DirPathCounts {
		y = append(y, mostDuplicateFolderStruct{path: a, files: b})
	}

	sort.Slice(y, func(i, j int) bool {
		return len(y[i].files) > len(y[j].files)
	})
	fmt.Println("folders with duplicates found: ", len(y))

	fmt.Print("\n\n########## FOUND RESULTS ##########\n\n")
	for i := 0; i < len(y) && i < 10; i++ {
		fmt.Println(y[i].path, len(y[i].files))
	}
	fmt.Print("\n########## Top Result from above, locations of duplicates ##########\n\n")

	if len(y) == 0 {
		return
	}

	// For the top result, Print out the folder locations that have matching files that need to be de-duplicated.
	top := y[0]
	dirMap := make(map[string]int, len(top.files))
	for _, x := range top.files {
		filesWithMatchingHash := dup.FileHashMap[x.Hash]
		dirMarked := map[string]bool{}
		for _, f := range filesWithMatchingHash {
			// fmt.Println(f.Path)
			dir := filepath.Dir(f.Path)

			// don't allow a bunch of copies in the same directory to jack up the count
			if dirMarked[dir] {
				continue
			}
			dirMarked[dir] = true
			dirMap[dir] = dirMap[dir] + 1
		}
	}
	// fmt.Print("####################\n\n")

	y = make([]mostDuplicateFolderStruct, 0, len(dirMap))
	for a, b := range dirMap {
		y = append(y, mostDuplicateFolderStruct{path: a, count: b})
	}

	sort.Slice(y, func(i, j int) bool {
		return y[i].count > y[j].count
	})
	for i := 0; i < len(y) && i < 20; i++ {
		fmt.Println(y[i].path, y[i].count)
	}
	fmt.Print("\n####################\n\n")
}
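
To get the output below, call this from main after the AddFiles call, either alongside FindLargestFile or in its place:

	FindFolderWithMostDuplicates(dup)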

When I run this on my music collection, I get the following trimmed result:

Saved cache found: D__Documents_Music.data
D:\Documents\Music
folders with duplicates found:  448


########## FOUND RESULTS ##########

D:\Documents\Music\Classical\Tchaikovsky\Pyotr Tchaikovsky\The Sleeping Beauty [Previn - London Symphony] Disc 2 35
D:\Documents\Music\Soundtrack\The Sleeping Beauty [Previn - London Symphony] Disc 2 35
D:\Documents\Music\Swing Music 23
...
...
...

########## Top Result from above, locations of duplicates ##########

D:\Documents\Music\Classical\Tchaikovsky\Pyotr Tchaikovsky\The Sleeping Beauty [Previn - London Symphony] Disc 2 35
D:\Documents\Music\Soundtrack\The Sleeping Beauty [Previn - London Symphony] Disc 2 35

####################

As you can see, I have the same Sleeping Beauty CD in two places. It was filed two different ways.

The Final Step

Now that you know where some duplicates are, what do you do? Well, that is up to you. Most likely not all the files in a directory will be exact duplicates. Maybe you copied a school project onto a jump drive and updated some files. But then you forgot about it and kept adding reports and assignments to the original location for the rest of the school year. To merge them together, you will need to use another tool like WinMerge or Beyond Compare. Point the program at the two directories and then tell it to perform a binary compare of all the files.

(Image: Folder comparison example in Beyond Compare)

Being able to quickly find which folders have the most overlap is very useful though. Without it you just have to traverse your directory tree guessing how you would have sorted it last time.

Project Directory Tree

The ending state of your directory structure should be as follows:

├── cmd
│   └── find
│       ├── D__Documents_Music.data
│       └── main.go
├── deDupe.go
├── getFiles.go
├── go.mod
├── go.sum
├── util_test.go
└── util.go

Thanks for reading!
