sha1: is of no value

General discussion related to "Everything".
therube
Posts: 4955
Joined: Thu Sep 03, 2009 6:48 pm

Post by therube »

sha1: is of no value (in & of itself, mainly)

that given ;-)
should a sha1: sort "sort" by sha1: ?

as in is there any reason to "sort by sha1:"?
as in there really isn't any meaning to that, except in respect to dups
as in there is no correlation between a sha1: & anything else, except itself (the sha1: itself)
to say that a sha1: is "close to" or "related to" anything else, is false

so would it make more sense for an "sha1: sort" to
"sort by sha1:", but /group/ by some other means?
(with the actual /display/ in a /group/|sha1: fashion)

so a listing of "sha1:'s",
grouped by name, or by path, or size...

(granted, if sha1:, then size must be equal...)

Advanced Sort can't do it

if you have a list of hash|name|path:
  • 00df|krishna|c:/heart of soul
    05c5|das|c:/heart of soul
    0aa8|hari|c:/heart of soul
    00df|meditation|c:/unkown
    05c5|the wolf|c:/andrewwk
you have 2 sets of dups & you sort by sha1: with a "tag along" of name
(i.e. a "pair" of name|sha1)

you would have a list like:
  • 05c5|das|c:/heart of soul
    05c5|the wolf|c:/andrewwk
    -
    00df|krishna|c:/heart of soul
    00df|meditation|c:/unkown
with D (das), T (the wolf) sorted by Name, by Hash &
with K (krishna), M (meditation) also sorted by Name, by Hash

so a subsort of Name, but only "per Hash"


with sha1: with a "tag along" of Path, you would have a list like:
  • 05c5|the wolf|c:/andrewwk
    05c5|das|c:/heart of soul
    -
    00df|krishna|c:/heart of soul
    00df|meditation|c:/unkown
so a subsort of Path, but only "per Hash"

would there be benefit to something like that?
(& all of this assume there is in fact no [hash] collisions)
(i'm having a hard time visualizing this without being able to "see" it)
(so a multi-key sort, say hash+name, but /displayed/ name[grouped by]hash)
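
A rough Python sketch of the grouping described above, in case it helps to "see" it. It is not how Everything sorts; `grouped_dupes` and the tuple layout are made up for illustration. It keeps only hashes that occur more than once, sorts each group by the "tag along" column, and orders the groups by their first tag-along value.

```python
from collections import defaultdict

def grouped_dupes(rows, key_index):
    """rows are (hash, name, path) tuples; key_index selects the tag-along column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    # Keep only hashes seen more than once, each group sorted by the tag-along column.
    dupes = [sorted(g, key=lambda r: r[key_index])
             for g in groups.values() if len(g) > 1]
    # Order the groups by their smallest tag-along value (hash breaks ties).
    dupes.sort(key=lambda g: (g[0][key_index], g[0][0]))
    return dupes

rows = [
    ("00df", "krishna", "c:/heart of soul"),
    ("05c5", "das", "c:/heart of soul"),
    ("0aa8", "hari", "c:/heart of soul"),
    ("00df", "meditation", "c:/unkown"),
    ("05c5", "the wolf", "c:/andrewwk"),
]

# Tag along by name (column 1); use 2 to tag along by path instead.
for group in grouped_dupes(rows, 1):
    for row in group:
        print("|".join(row))
    print("-")
```

With column 1 this reproduces the name listing above (das, the wolf, then krishna, meditation); with column 2 it reproduces the path listing.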

the only time a sort by hash is of value
is if a "human" were manually looking down a list of hashes
to find existence or lack thereof of a particular hash
- but, a human is not adept at doing something like that...

Sha1: sort, simply a list of sequential "numbers", no meaning
Path: sort, displayed by Path, then by Sha1: (or Name, then by Sha1...)
Sha1: 131, is first in Sha1: sort, but 7th in Path: sort (in this case)
(in this case there are no [Sha1:] dups)

Code:

Sorted by Sha1:                     Sorted by Path:
Sha1:	 Path:			    Sha1:    Path:
-----	 --------------		    -----    -------------
131	 3302 CLIFTON		    148	     2905 GARRISON
132	 3306 CLIFTON		    170	     2905 GARRISON
133	 3304 CLIFTON		    176	     2905 GARRISON
134	 3306 CLIFTON		    190	     2905 GARRISON
135	 3304 CLIFTON		    207	     2905 GARRISON
137	 3500 FAIRVIEW		    208	     2905 GARRISON
138	 3500 FAIRVIEW		    131	     3302 CLIFTON
139	 3500 FAIRVIEW		    199	     3302 CLIFTON
140	 3504 FAIRVIEW		    202	     3302 CLIFTON
141	 3500 FAIRVIEW		    203	     3302 CLIFTON
142	 3500 FAIRVIEW		    204	     3302 CLIFTON
143	 3514 CLIFTON		    206	     3302 CLIFTON
144	 3501 FAIRVIEW		    133	     3304 CLIFTON
145	 3512 CLIFTON		    135	     3304 CLIFTON
147	 3501 FAIRVIEW		    146	     3304 CLIFTON
148	 2905 GARRISON		    149	     3304 CLIFTON
149	 3304 CLIFTON		    185	     3304 CLIFTON
150	 3306 CLIFTON		    197	     3304 CLIFTON
152	 3512 CLIFTON		    132	     3306 CLIFTON
153	 3514 CLIFTON		    134	     3306 CLIFTON
154	 3516 CLIFTON		    150	     3306 CLIFTON
155	 3502 FAIRVIEW		    198	     3306 CLIFTON
156	 3516 CLIFTON		    200	     3306 CLIFTON
157	 3514 CLIFTON		    201	     3306 CLIFTON
158	 3516 CLIFTON		    164	     3314 CARLISLE
159	 3502 FAIRVIEW		    166	     3314 CARLISLE
160	 3501 FAIRVIEW		    175	     3314 CARLISLE
161	 3502 FAIRVIEW		    178	     3314 CARLISLE
162	 3501 FAIRVIEW		    179	     3314 CARLISLE
164	 3314 CARLISLE		    196	     3314 CARLISLE
165	 3413 FAIRVIEW		    165	     3413 FAIRVIEW
166	 3314 CARLISLE		    173	     3413 FAIRVIEW
167	 3504 FAIRVIEW		    180	     3413 FAIRVIEW
170	 2905 GARRISON		    188	     3413 FAIRVIEW
171	 3504 FAIRVIEW		    189	     3413 FAIRVIEW
172	 3501 FAIRVIEW		    192	     3413 FAIRVIEW
173	 3413 FAIRVIEW		    137	     3500 FAIRVIEW
174	 3504 FAIRVIEW		    138	     3500 FAIRVIEW
175	 3314 CARLISLE		    139	     3500 FAIRVIEW
176	 2905 GARRISON		    141	     3500 FAIRVIEW
177	 3500 FAIRVIEW		    142	     3500 FAIRVIEW
178	 3314 CARLISLE		    177	     3500 FAIRVIEW
179	 3314 CARLISLE		    144	     3501 FAIRVIEW
182	 3512 CLIFTON		    147	     3501 FAIRVIEW
184	 3514 CLIFTON		    160	     3501 FAIRVIEW
185	 3304 CLIFTON		    162	     3501 FAIRVIEW
186	 3502 FAIRVIEW		    172	     3501 FAIRVIEW
188	 3413 FAIRVIEW		    181	     3501 FAIRVIEW
189	 3413 FAIRVIEW		    155	     3502 FAIRVIEW
190	 2905 GARRISON		    159	     3502 FAIRVIEW
192	 3413 FAIRVIEW		    161	     3502 FAIRVIEW
194	 3504 FAIRVIEW		    186	     3502 FAIRVIEW
195	 3504 FAIRVIEW		    187	     3502 FAIRVIEW
196	 3314 CARLISLE		    205	     3502 FAIRVIEW
199	 3302 CLIFTON		    140	     3504 FAIRVIEW
202	 3302 CLIFTON		    167	     3504 FAIRVIEW
204	 3302 CLIFTON		    171	     3504 FAIRVIEW
205	 3502 FAIRVIEW		    174	     3504 FAIRVIEW
206	 3302 CLIFTON		    194	     3504 FAIRVIEW
207	 2905 GARRISON		    195	     3504 FAIRVIEW
208	 2905 GARRISON		    136	     3512 CLIFTON
				    145	     3512 CLIFTON
				    151	     3512 CLIFTON
				    152	     3512 CLIFTON
				    182	     3512 CLIFTON
				    193	     3512 CLIFTON
				    143	     3514 CLIFTON
				    153	     3514 CLIFTON
				    157	     3514 CLIFTON
				    163	     3514 CLIFTON
				    183	     3514 CLIFTON
				    184	     3514 CLIFTON
				    154	     3516 CLIFTON
				    156	     3516 CLIFTON
				    158	     3516 CLIFTON
				    168	     3516 CLIFTON
				    169	     3516 CLIFTON
				    191	     3516 CLIFTON

still not sure if this is meaningful?
(& yes i know the datasets are not the same length)
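
The "Path, then Sha1" ordering in the table can be sketched as a two-key sort. This is only an illustration in Python using a handful of rows copied from the table; `rank` is a made-up helper.

```python
# A few (sha1, path) rows taken from the table above.
records = [
    (131, "3302 CLIFTON"), (148, "2905 GARRISON"), (170, "2905 GARRISON"),
    (176, "2905 GARRISON"), (190, "2905 GARRISON"), (199, "3302 CLIFTON"),
    (207, "2905 GARRISON"), (208, "2905 GARRISON"),
]

by_sha1 = sorted(records)                              # sha1 only
by_path = sorted(records, key=lambda r: (r[1], r[0]))  # path, then sha1

def rank(rows, sha1):
    """1-based position of a sha1 value in a sorted listing."""
    return [r[0] for r in rows].index(sha1) + 1

print(rank(by_sha1, 131), rank(by_path, 131))
```

Here 131 is first by Sha1 but seventh by Path, because the six GARRISON rows sort ahead of it.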
void
Developer
Posts: 16680
Joined: Fri Oct 16, 2009 11:31 pm

Re: sha1: is of no value

Post by void »

There's no real reason to sort by sha1.
The hash doesn't have any real meaning.

Typically you'll do a:

dupe:size;sha1

which will sort the results by size, then sha1



You can use any sort you like with the sort: function.

For example sort by name after finding items with duplicate size and sha1:

dupe:size;sha1 sort:name

or, sort by name, then sha1:

dupe:size;sha1 sort:name;sha1



Should Everything implicitly sort by size when sorting by sha1?
for example:
dupe:sha1
would be the same as:
dupe:size;sha1
(i would rather avoid doing something like this as it might be unexpected and just force the user to specify dupe:size;sha1)



dupe:sha1 might be useful to find sha1 collisions?
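
For readers outside Everything, the idea behind dupe:size;sha1 can be sketched in Python: group by size first, then hash only the files that share a size. This is an illustration under that assumption, not Everything's implementation; the function names are invented.

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def sha1_of(path, chunk=1 << 20):
    """SHA-1 of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def dupes_by_size_then_sha1(paths):
    """Return (size, sha1, [paths]) for every set of files equal in both."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    result = []
    for size in sorted(by_size):
        if len(by_size[size]) < 2:
            continue  # a unique size cannot be a duplicate, so skip hashing
        by_hash = defaultdict(list)
        for p in by_size[size]:
            by_hash[sha1_of(p)].append(p)
        for digest in sorted(by_hash):
            if len(by_hash[digest]) > 1:
                result.append((size, digest, sorted(by_hash[digest])))
    return result

# Small demo with throwaway files.
with tempfile.TemporaryDirectory() as d:
    contents = {"a.bin": b"same", "b.bin": b"same", "c.bin": b"diff", "d.bin": b"longer"}
    paths = []
    for name, data in contents.items():
        p = os.path.join(d, name)
        with open(p, "wb") as f:
            f.write(data)
        paths.append(p)
    found = dupes_by_size_then_sha1(paths)
    names = [[os.path.basename(p) for p in group] for _, _, group in found]
    print(names)
```

Grouping by sha1 alone and then looking for groups whose members differ in size would be the collision check mentioned above.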
therube

Re: sha1: is of no value

Post by therube »

void wrote: "Typically you'll do a:"
Ah, but typically I don't - so I'll have to give it a go.
(Without looking, yet, thinking that should work, & should give me enough information to see if it actually gives meaningful results.)

void wrote: "dupe:sha1 might be useful to find sha1 collisions?"
You mean something like this.
MD5 collision, but not sha1, sha256...
[Image: Everything md5 hash collision.png]
raccoon
Posts: 1017
Joined: Thu Oct 18, 2018 1:24 am

Re: sha1: is of no value

Post by raccoon »

If we're still doing color groupings, without using dupe, then I think the sort of any column should behave as expected without weird special-treatment behaviors. If someone wants to use SHA to spot dupes, or to create a pseudo-random ordering, let them. Implicitly ordering by size would be bizarre and off-putting.

Task: Which file's SHA has the most prefix 0's?
therube

Re: sha1: is of no value

Post by therube »

raccoon wrote: "Which file's SHA has the most prefix 0's?"
None?
As in a sha hash cannot begin with a 0?
Wrong!

Otherwise, I'd say, I output ascii 0..127, singularly (no cr/lf), to individual files & didn't find an answer there either.
(Actually, I didn't, but worth a try.)

Oddly, scarily, my sha1sum gives a different answer than echo hello | sha1sum ?
And yet, cat - | sha1sum -, does correctly (?) give me, da39a3ee5e6b4b0d3255bfef95601890afd80709.
(cat - | sha1sum -, OR, sha1sum<cr><Ctrl+Z> [Ctrl+Z OR F6])
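
A small Python check, as a hedged aside: da39a3ee... is the SHA-1 of zero bytes of input, which is what sha1sum sees when stdin is closed straight away. One possible reason for the differing "hello" results (an assumption, not verified on that machine) is that different shells hand different bytes to the pipe, for example a trailing space and CR/LF from cmd.exe versus a bare LF elsewhere.

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

empty = sha1_hex(b"")               # nothing piped in at all
unix_echo = sha1_hex(b"hello\n")    # what a Unix-style echo would send
cmd_echo = sha1_hex(b"hello \r\n")  # assumed cmd.exe output for: echo hello | ...

print(empty)
print(unix_echo != cmd_echo)
```

Any one-byte difference in the input gives an unrelated hash, so the three values share nothing.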


OK, so then the answer is, "nobody knows".
Over time, you can compute hashes, & find those that have more prefix 0's than another, but you can never know which has the "most".
So the same type of question as how big can infinity be?
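
For a fixed set of inputs the question does have an answer. A minimal Python sketch, with made-up helper names, that counts leading zero hex digits and picks the best item seen so far:

```python
import hashlib

def leading_zero_digits(hex_digest: str) -> int:
    """Number of '0' hex digits at the start of a digest string."""
    return len(hex_digest) - len(hex_digest.lstrip("0"))

def most_leading_zeros(items):
    """items maps a label to its bytes; returns (label, zero_count) of the best one."""
    scored = {label: leading_zero_digits(hashlib.sha1(data).hexdigest())
              for label, data in items.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Single ASCII characters 0..127, as tried above.
candidates = {f"ascii_{i}": bytes([i]) for i in range(128)}
print(most_leading_zeros(candidates))
```

This only ever reports the best among the inputs tried, which is the point made above.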


C:\>sha1sum --version
sha1sum (GNU coreutils) 5.3.0
void
Developer

Re: sha1: is of no value

Post by void »

I've been thinking about soundex and metaphone properties.

These are hash-like values that represent the sound of the name.

Could be useful to find files with similar sounding names...

soundex:
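
To illustrate what such a value looks like, here is a minimal Python sketch of classic American Soundex. That this is the variant meant is an assumption; nothing here describes how a soundex: property would behave in Everything.

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # both R163
```

Names that sound alike map to the same short code, so grouping file names by this value would cluster similar-sounding ones.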
raccoon

Re: sha1: is of no value

Post by raccoon »

Is there anything for image or audio hashing to find files of dissimilar binary content but similar audio/visual content?
void
Developer

Re: sha1: is of no value

Post by void »

Perceptual hashing

I have put this on my TODO list.
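
As a rough idea of what perceptual hashing means, here is a sketch of an "average hash" in Python. It assumes the image has already been reduced to a small grid of grayscale values (real tools typically shrink to something like 8x8 first); it is one simple technique among many and not a statement about what Everything will use.

```python
def average_hash(pixels):
    """pixels is a small 2D list of brightness values; returns an int bit pattern."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)  # 1 = brighter than average
    return bits

def hamming(a, b):
    """Number of differing bits; small distance means visually similar."""
    return bin(a ^ b).count("1")

image = [[10, 200], [220, 30]]
brighter = [[v + 20 for v in row] for row in image]   # same picture, lighter
inverted = [[255 - v for v in row] for row in image]  # a different picture

print(hamming(average_hash(image), average_hash(brighter)))
print(hamming(average_hash(image), average_hash(inverted)))
```

Unlike sha1, a small change to the content gives a nearby value, so "close to" is meaningful for this kind of hash.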
therube

Re: sha1: is of no value

Post by therube »

WinMerge can find "duplicate" images, where the image is the same but otherwise different hash (due to something like one file having [incorrectly] a cr/lf at the eof).

Something like this will find identical "video" content itself - where there is other ancillary data that would otherwise not match by hash. (Maybe WinMerge can do video? FFmpeg can probably do images too.)

hashvideo.BAT:

Code:

:: hashmediastream.BAT
:: SjB 02-25-2023

:: drag&drop of *.mp4 works correctly, but hashvideo.bat *.mp4 doesn't - outputting all echo's, but only 1 hash - why?


@echo off

:: for %%i in (%*) do  echo "%%i"
:: for %%i in (%*) do  echo %%i
:: pause
:: exit



:: for %%i in (%*) do echo %%i >>0hashlist && ffmpeg -v 4 -i %%i  -map 0  -c copy  -f streamhash  -hash md5 - >>0hashlist
:: for %%i in (%*) do echo %%i | tee -a 0hashlist && ffmpeg -v 4 -i %%i  -map 0  -c copy  -f streamhash  -hash md5 - >>0hashlist


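:: "%%~i" strips any quotes already on the argument and re-adds them, so a path containing spaces reaches ffmpeg as a single argument
for %%i in (%*) do echo %%i | tee -a 0hashmedialist && ffmpeg -v 4 -i "%%~i"  -map 0  -c copy  -f streamhash  -hash md5 - | tee -a 0hashmedialist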


pause
:: jump over the notes below (they are not batch commands)
goto :end


exit



This just gets the MD5 using stream copy mode. There is no decoding.
	ffmpeg -i input.mp4 -map 0:v -c copy -f md5 -

Show video and audio checksums separately
Using the streamhash muxer:
	ffmpeg -i input.mp4 -map 0   -c copy -f streamhash -hash md5 -

Per frame
Using the framehash muxer:
	ffmpeg -i input.mp4 -map 0   -f framehash -


:end

Duplicate File Finders can find duplicate or similar audio/video using various algorithms.
(A big problem with duplicate finders is that they are directory based, where Everything can find everything - anywhere.)

AllDup
Duplicate Cleaner
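
The streamhash approach from the batch file above could also be used to group files by identical stream content. A hedged Python sketch: it assumes ffmpeg is on the PATH and that the streamhash muxer prints lines of the form index,type,ALGO=hash; the function names are invented and the hashes in the usage line are placeholders.

```python
import subprocess
from collections import defaultdict

def streamhash_command(path, algo="md5"):
    """Same ffmpeg arguments as the batch file: stream copy, no decoding."""
    return ["ffmpeg", "-v", "4", "-i", path, "-map", "0", "-c", "copy",
            "-f", "streamhash", "-hash", algo, "-"]

def parse_streamhash(output):
    """Turn lines like '0,v,MD5=<hex>' into a sorted tuple of (stream_type, hash)."""
    streams = []
    for line in output.splitlines():
        line = line.strip()
        if "=" not in line:
            continue
        head, digest = line.split("=", 1)
        parts = head.split(",")
        if len(parts) != 3:
            continue  # not in the assumed index,type,algo layout
        streams.append((parts[1], digest.lower()))
    return tuple(sorted(streams))

def group_by_streams(paths):
    """Files whose streams all hash the same end up in one group."""
    groups = defaultdict(list)
    for p in paths:
        done = subprocess.run(streamhash_command(p), capture_output=True,
                              text=True, check=True)
        groups[parse_streamhash(done.stdout)].append(p)
    return [g for g in groups.values() if len(g) > 1]

# Parsing a made-up sample (placeholder hashes, not real output):
print(parse_streamhash("0,v,MD5=AAAA\n1,a,MD5=bbbb\n"))
```

Files that differ only in container metadata would then land in the same group even though their sha1 values differ.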
therube

Re: sha1: is of no value

Post by therube »

so far, i'm not really coming up with particularly real usefulness depending on how i've mixed & matched dupe: & ...


dupe:size;sha1, might be some benefit,
but then it really needs pair coloring (even if only on the hash itself) within "groups"

dupe:name;sha1, again might be of some benefit,
perhaps more so in cases where you've first used a distinct:
(which would have filtered out 'name;sha1' pairs)
again coloring within a "group" would help, but probably less important than with 'size;sha1'

using either of those does group "pairs" together
- where otherwise they would not necessarily be ("together")

& then there would be times where having pairs together is not as "good" as having them "apart"


so as things are now, may be just fine
(i'm still experimenting...)