Feature Request: (OPTIONAL) Background Hashing

Post by **collinchaffin** » Sun Jun 02, 2019 12:57 pm

I have been an avid ES evangelist for years, but I recently found a few features that ES lacks and that I think (and hope others will agree) would be valuable additions to the already impressive feature list.

This request is for the following:

Add hashing (xxHash) as an option

Add to database (optional index)

Add to GUI column display (optional)

Hashing in separate thread with perhaps user-configurable process priority setting

This hashing would in turn immediately add to ES's arsenal of "duplicate" comparison options

Notes / More Detail:

The xxHash algorithm is common in other file copy/verify tools (FastCopy, etc.), and its performance is far greater than that of the larger SHA algorithms. Those could still be offered as user-selectable options, but they would add far greater complexity, database storage, and processing overhead.

This writeup is just one of many that outline the pretty shocking speed of the xxHash algorithm compared to common SHA-based hashes when weighing performance against crypto benefits:

https://cyan4973.github.io/xxHash/

Enabling this, perhaps as a new "utilize ES hashes, IF AVAILABLE" option for duplicate file searches, would no doubt give more accurate results. Once the hashes are in the db, it is obviously viable to add basic comparison logic along the lines of "if all matches have computed hashes in db" to ensure apples-to-apples comparisons for duplicates.
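To make the "if all matches have computed hashes in db" idea concrete, here is a minimal sketch in Python. It is only an illustration, not how ES works or would work: the `find_duplicates` function and the `(path, size, hash)` record layout are hypothetical names made up for this example.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group files that share both size and stored hash.

    records: iterable of (path, size, digest_or_None) tuples, a hypothetical
    stand-in for rows read from the index.
    Size groups where any file still lacks a hash are skipped, so every
    verdict is an apples-to-apples hash comparison.
    """
    by_size = defaultdict(list)
    for path, size, digest in records:
        by_size[size].append((path, digest))

    duplicates = []
    for entries in by_size.values():
        if len(entries) < 2:
            continue  # a unique size cannot have a duplicate
        if any(digest is None for _, digest in entries):
            continue  # at least one candidate is not hashed yet
        by_hash = defaultdict(list)
        for path, digest in entries:
            by_hash[digest].append(path)
        duplicates.extend(sorted(p) for p in by_hash.values() if len(p) > 1)
    return duplicates
```

Grouping by size first means most files never need a hash comparison at all; the stored hash is only the tie-breaker among same-size candidates.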

Keeping the hashing in an always-separate thread, with low process and low I/O priority by default, will help minimize overhead as much as possible. I believe many (myself included) who already use ES for all file searching would gladly allow a LOW, IDLE-TIME-ONLY hashing thread to run once, for a considerable time, to perform the initial hashing of all files already present and scanned by ES. The ongoing LOW, IDLE-TIME-ONLY hashing of newly added files would be something to monitor initially, to determine average completion times and whether the low priority makes hashes available soon enough for my needs or whether I would need to increase the hashing priority, as would any user interested in this feature.
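A rough sketch of the separate-thread idea, in Python, under some loud assumptions: `BackgroundHasher` and `hash_file` are hypothetical names, the in-memory dict stands in for the ES database, and BLAKE2 (from the standard library) stands in for xxHash, which would need a third-party package. The Python standard library has no portable way to lower a thread's CPU or I/O priority, so that part of the request is not shown.

```python
import hashlib
import queue
import threading

CHUNK = 1 << 20  # read 1 MiB at a time so large files never sit fully in memory


def hash_file(path):
    """Hash one file in chunks; BLAKE2 with a 64-bit digest stands in for xxHash64."""
    h = hashlib.blake2b(digest_size=8)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            h.update(chunk)
    return h.hexdigest()


class BackgroundHasher:
    """Hashes queued files on one worker thread, away from the caller's thread."""

    def __init__(self):
        self.pending = queue.Queue()
        self.hashes = {}  # path -> hex digest (hypothetical stand-in for the db)
        self._lock = threading.Lock()
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, path):
        self.pending.put(path)

    def wait(self):
        self.pending.join()  # block until every queued file has been processed

    def _run(self):
        while True:
            path = self.pending.get()
            try:
                digest = hash_file(path)
                with self._lock:
                    self.hashes[path] = digest
            except OSError:
                pass  # unreadable or vanished file: leave it unhashed
            finally:
                self.pending.task_done()
```

Newly indexed files would simply be passed to `submit()`; anything not yet in `hashes` is treated as "hash not available".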

Hope this description/detail helps and that you'll consider this for the future features of ES. Thanks!

-C

Post by **void** » Mon Jun 03, 2019 12:37 pm

I'll consider an option to do background hashing (CRC32/MD5/SHA1/SHA256 and others) and storing this as a property that could be indexed, searched or sorted by.

Thanks for the suggestion.
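For readers wondering what computing several of these hashes costs in practice, here is a small Python sketch (not void's implementation; `multi_hash` is a hypothetical name) that produces CRC32, MD5, SHA1 and SHA256 in a single pass, so the file is read from disk only once however many hashes are stored.

```python
import hashlib
import zlib


def multi_hash(path, chunk_size=1 << 20):
    """Return CRC32/MD5/SHA1/SHA256 hex digests from one read of the file."""
    crc = 0
    digests = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # running CRC carried across chunks
            for h in digests.values():
                h.update(chunk)
    result = {name: h.hexdigest() for name, h in digests.items()}
    result["crc32"] = f"{crc:08x}"
    return result
```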

Post by **therube** » Mon Jun 03, 2019 1:51 pm

Does this belong in the realm of NoScript Everything?

Isn't this more apt for an actual duplicate file checker / hashing program?

As it is, I use all 3 types of programs.

NoScript & its' "dupe" functions.
And if I need further refinement, I use a duplicate file checker or hash program.

NoScript points out 2 (or more) dupes.
I throw them to Hasher or HashMyFiles.

Maybe next time, I'll send them (Send To) checksum (BLAKE2 algorithm) [if it can work in the way I'd want] or some utility using this xxHASH algorithm.
http://corz.org/windows/software/checksum/

---

And since void wrote that (before I saw it), if said functionality is in Everything, that's fine too.

---

A quick look (all I have had time for so far) with checksum: a BLAKE2 hash of a 1 GB file took 5.22 sec, compared to (I think) 31 sec with MD5.
And if xxHASH is quicker yet...


Post by **NotNull** » Mon Jun 03, 2019 4:21 pm

therube wrote: ↑Mon Jun 03, 2019 1:51 pm NoScript & its' "dupe" functions.

Where? How? Why? What?

(subtitles: I'm surprised NoScript even has such a function. Never encountered it in all those years using it. Where can I find this?)

Post by **therube** » Mon Jun 03, 2019 6:55 pm

Heh, you caught me. Fixed above.

Post by **therube** » Mon Jun 03, 2019 7:03 pm

(I am not familiar with checksum, above, or with what I may post below, & have so far only run them sandboxed, in Sandboxie.)

XP e4300 2GB RAM x86
W:\2012-06-05_23-27-45_125 - 1.09 GB

QuickHash v3.0.4 (link/version updated; the results below [two posts] are based on v2.8.0)

Code:

xxH-32  28 sec.  6% cpu
sha1    28 sec. 20% cpu
md5     28 sec. 25% cpu
sha512  1.39    50% cpu

So, surprising to me, all methods (save sha512) ran at the same speed (on my lowly XP box - so x86 only).
(CPU usage was the least using xxHash.
Not sure what version is being used?)

An older, rather old, Windows binary version - 64-bit only, xxHash v0.6.2.

Post by **NotNull** » Mon Jun 03, 2019 8:15 pm

See also this thread for more hashers

<offtopic>

BTW: I wondered how you could mix up Everything and NoScript. Turns out you are "quite" active on the NoScript forum too!
(and your name comes along very often on the BulkRenameUtility forum ...)

BTW2: Still 2 NoScript references in your post

</offtopic>

Post by **therube** » Tue Jun 04, 2019 12:52 am

Yep, I've been there (NoScript) for some time now.
%s/noscript/everything/g
There, done


BRU, not because I use that utility, particularly (most often Everything handles what I need), but rather I pick up on Regex knowledge there in a not overwhelming setting.

Post by **therube** » Tue Jun 04, 2019 1:03 am

i5-3570k 16 GB Win7 x64
QuickHash x64
3.92 GB .iso

Code:

                                                   2nd run:
xxH-64   2 sec.      25% cpu (spike only)          2 sec.
sha1     6 sec.      25% cpu                       14 sec.
sha512   15 sec.     25% cpu (don't ask me why)    15 sec. (i.e. why sha512 was quicker than sha256 & on par with sha1)
sha256   25 sec.     25% cpu                       25 sec.
md5      2:03 min.    4% cpu                       11 sec.

(25% cpu = 100% of 1 of 4 cores)
(started, first time with the particular file, with the md5 hash. then just ran down the list in sequence, then did it again.)

Then, picked a 3.47 GB file, using xxHash64 - first, & that took 27 sec.
So there must be some sort of caching, or whatever, involved.
So I guess to get "accurate" results one must start "afresh" with each method.
NOT doing that...
sha512, 13 sec, then back to xxHash64, 2 sec, so I guess the results returned above (based on how I did things earlier) are not really indicative of what you'd get in the real world, getting hashes of "random" files. Maybe xxHash is quicker, but how it actually plays out would take more extensive testing & also understanding how "cache" (or whatever) is skewing the results. (Closing the program, reopening, then selecting the same file, is not sufficient.)

http://fastcompression.blogspot.com/201 ... -xxh3.html
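The caching effect described above can be reproduced with a small timing harness. This is only a sketch in Python with hypothetical helper names (`time_hash`, `compare_runs`), and it uses the standard-library algorithms rather than xxHash. It does nothing to flush the OS file cache, so the very first pass over a file is likely to be the only "cold" one, which is exactly the skew to watch for.

```python
import hashlib
import time


def time_hash(path, algorithm, chunk_size=1 << 20):
    """Return (seconds, MB/s) for one full read-and-hash pass over path."""
    h = hashlib.new(algorithm)
    size = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (size / 1e6) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


def compare_runs(path, algorithms=("md5", "sha1", "sha256", "sha512"), runs=2):
    """Time each algorithm `runs` times in a row; only the first pass can be cold."""
    return {name: [time_hash(path, name) for _ in range(runs)]
            for name in algorithms}
```

If the first algorithm in the list looks much slower on its first run than on its second, that difference is disk I/O, not hashing speed. A fair comparison would need a file that has not been read since boot for every measurement.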

Post by **therube** » Tue Jun 04, 2019 2:42 am

xxHash, command-line utility, interestingly, will accept a quoted file name with only an opening quote needed.

(Some other utility comes to mind that only needs an opening operand in certain circumstances.)

Has a -b, benchmark.

Code:

C:\CHECKSUM>xxh -b xxx.iso
xxh 0.6.2 (64-bits little endian), by Yann Collet
Loading xxx.iso...
XXH32               :  844821011 ->  6572.2 MB/s
XXH32 unaligned     :  844821010 ->  6562.5 MB/s
XXH64               :  844821011 -> 11577.6 MB/s
XXH64 unaligned     :  844821010 -> 11429.8 MB/s

C:\CHECKSUM>

Post by **therube** » Tue Jun 04, 2019 3:57 am

RHash looks interesting for its functionality (though doesn't support xxhash).

(first 3 files were already run, so "cached", last file was a first time read.)

Code:

C:\CHECKSUM>rhash -M --speed "VTS_01_?.VOB"
8b111e4d314daa38aa082444e187e102  VTS_01_1.VOB
Calculated in 2.231 sec, 458.91 MBps
dab8de6901a0761ac30673ad21c288ce  VTS_01_2.VOB
Calculated in 2.235 sec, 458.18 MBps
248e43979c311ec00b0391ea496c8825  VTS_01_3.VOB
Calculated in 2.254 sec, 454.22 MBps
71d207796ee91a430c058ddba33bda7d  VTS_01_4.VOB
Calculated in 7.288 sec, 95.34 MBps
Total 14.059 sec, 267.94 MBps

C:\CHECKSUM>

Post by **NotNull** » Tue Jun 04, 2019 10:21 pm

therube wrote: ↑Tue Jun 04, 2019 12:52 am %s/noscript/everything/g
There, done .

and as a bonus: a free mini-course vi ...

BRU, not because I use that utility, particularly (most often Everything handles what I need), but rather I pick up on Regex knowledge there in a not overwhelming setting.

I post there too from time to time (under a different alias as I cherish my online anonymity) without ever having started BRU. Just for the fun of solving puzzles.

BTW: This is all going terribly off topic ..

Post by **therube** » Sat Jun 08, 2019 10:04 am

Messing with this for just a little bit now, pretty sure that xxHash is going to be worth it (from a speed perspective).
(Haven't found a decent GUI [or even the most recent binary] so just using a simple batch file [from SendTo] at the moment.)

XXhash64.BAT:

Code:

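@echo off
rem Hash every file passed in from SendTo (%* expands to all arguments)
xxhsum.exe %*
rem Keep the console window open so the results can be read
pause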

(xxhsum.exe needs to be in your %PATH% or you should include its full pathname in the batch file.)

Post by **NotNull** » Sat Jun 08, 2019 3:53 pm

therube wrote: ↑Sat Jun 08, 2019 10:04 am Messing with this for just a little bit now, pretty sure that xxHash is going to be worth it (from a speed perspective).

Someone has tested (in 2016) quite a few checksum algorithms (including code to test it yourself) and his results correspond with your observations: https://aras-p.info/blog/2016/08/09/Mor ... ion-Tests/

(Haven't found a decent GUI [or even the most recent binary] so just using a simple batch file [from SendTo] at the moment.)

A quick search came up with https://www.quickhash-gui.org/
