Add an option to avoid hard link size duplication

Post by **eugenesv** » Tue Jun 02, 2020 11:25 pm

Would it be possible to add an option that avoids counting the size of duplicate hard links?
For example, my Git installation folder is shown as having a size of ~600m, but it contains 129 (!) hard links of ~3m each, so the real disk space they use is not 129*3=387m, but just 3m.

I understand that this might be out of scope for Everything, but given that it does some great memory caching and is blazingly fast, it might be in a unique position to also provide a fast way to adjust file/folder sizes to avoid this hard link duplication.
Not sure how best to do it, so just sharing some ideas:

For each hard link (i.e. a file with a hard link reference count x>1), the size would be shown as CurrentSize/x (e.g. 3/129=0.023m), and this could propagate to all the folder sizes.

The hard link file size stays the same, but the folder size gets reduced if the folder contains multiple hard links pointing to the same file. This looks much more complicated, as it requires comparing the full lists(?), and it doesn't fully deduplicate (e.g. two subfolders with one identical hard link each will both show the full file size).

If that's too slow/complicated, the hard link size/folder size stays the same, but there is at least an additional optional column that shows the number of hard link files inside a given folder and their size, so that you could roughly see the potentially duplicated size (given the mix of sizes of the hard links, you won't be able to tell the exact size to deduct from the total, but a rough indication would still be better than nothing).

or something
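To make the first and third ideas a bit more concrete, here is a rough Python sketch (not how Everything works internally, just an illustration). It assumes `os.lstat` reports a usable link count in `st_nlink`, which is an assumption about the platform and file system. The function name `apportioned_folder_sizes` is made up for this example. Note that the apportioned total undercounts if some of the links live outside the scanned folder.

```python
import os
import stat
from collections import defaultdict


def apportioned_folder_sizes(root):
    """Return {folder: (apparent_bytes, apportioned_bytes, hardlinked_files)}.

    apparent_bytes:    every file name counted at its full size
    apportioned_bytes: every file name counted as size / link count (idea 1)
    hardlinked_files:  number of file names with a link count > 1 (idea 3)
    All totals include the contents of subfolders.
    """
    root = os.path.abspath(root)
    totals = defaultdict(lambda: [0, 0.0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and other non-regular files
            share = st.st_size / max(st.st_nlink, 1)
            # Propagate the numbers to this folder and every ancestor up to root.
            folder = dirpath
            while True:
                entry = totals[folder]
                entry[0] += st.st_size
                entry[1] += share
                entry[2] += int(st.st_nlink > 1)
                if folder == root:
                    break
                folder = os.path.dirname(folder)
    return {folder: tuple(values) for folder, values in totals.items()}
```

With 129 links of 3m each inside one folder, the apparent size would come out as 387m and the apportioned size as 3m, which matches the numbers above.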

Post by **void** » Wed Jun 03, 2020 8:27 am

I'll consider an option to only count hard link sizes once.

Maybe even an option to index only one instance of a hard link might help.

Thanks for the suggestion.
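For illustration only, "count once" could look roughly like the Python sketch below. This is not Everything's implementation. It assumes that `os.lstat` returns a stable file identity in `(st_dev, st_ino)` and a link count in `st_nlink`, and the function name `deduplicated_size` is hypothetical.

```python
import os
import stat


def deduplicated_size(root):
    """Return (apparent_bytes, unique_bytes) for all regular files under root.

    apparent_bytes counts every file name at full size; unique_bytes counts
    each underlying file only once, however many hard links point to it.
    """
    seen = set()  # (device, file id) pairs of hard-linked files already counted
    apparent = 0
    unique = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and other non-regular files
            apparent += st.st_size
            if st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # another name for a file that was already counted
                seen.add(key)
            unique += st.st_size
    return apparent, unique
```

Only files with a link count above 1 go into the `seen` set, so the memory overhead stays small when hard links are rare.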

Post by **eugenesv** » Wed Jun 03, 2020 1:08 pm

Thanks for considering this pro option!

Just an idea: for the sizes that reflect hard link deduplication, maybe it would make sense to have a separate column instead of a settings toggle? This way a user can have the familiar "Windows File Explorer" number and then the "real" (hard link deduplicated) number, which would slightly reduce the confusion.

void wrote: ↑Wed Jun 03, 2020 8:27 am Maybe even an option to index only one instance of a hard link might help.

That's a very interesting idea! Unfortunately, by design all hard links to a file are identical, so it'd be a challenge to pick "the only one instance".

In the example described in my first post, I'd be fine with the technically incorrect version where only "git.exe" is indexed with the full size of 3m and the other files' sizes are shown as 0 (or maybe shown as 3m only in that extra separate column). I just have no good idea how to make a general rule out of it (shortest file name? shortest path? prioritize some paths over others?), as it's too context-specific.
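Just to show what one such rule could look like, here is a rough Python sketch that picks the link with the shortest file name, then the shortest full path, then alphabetical order as a tie-breaker. The rule itself and the names `pick_canonical` and `displayed_sizes` are assumptions made up for this example, not anything Everything does. It also assumes `(st_dev, st_ino)` from `os.lstat` identifies the underlying file.

```python
import os
import stat
from collections import defaultdict


def pick_canonical(paths):
    """Pick one path: shortest file name, then shortest path, then alphabetical."""
    return min(paths, key=lambda p: (len(os.path.basename(p)), len(p), p))


def displayed_sizes(root):
    """Return {path: size to display}.

    The chosen link of each file keeps the full size; all other links show 0.
    """
    groups = defaultdict(list)  # (device, file id) -> all paths to that file
    sizes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks and other non-regular files
            groups[(st.st_dev, st.st_ino)].append(path)
            sizes[path] = st.st_size
    shown = {}
    for paths in groups.values():
        keep = pick_canonical(paths)
        for path in paths:
            shown[path] = sizes[path] if path == keep else 0
    return shown
```

With this rule "git.exe" would win against any longer-named link to the same file, but for other folders the result could easily feel arbitrary, which is the context problem mentioned above.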

voidtools forum
