
Is there a hard drive file organizer that will ...


Recommended Posts

Guest Franc Zabkar
Posted

Re: Is there a hard drive file organizer that will ...

 

On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"

<erratic@nomail.afraid.org> put finger to keyboard and composed:

>> AFAICS, a fundamental flaw in duplicate finder software is that it

>> relies on direct binary comparisons. With programs like FindDup, if we

>> have 3 files of equal size, then we would need to compare file1 with

>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>> For n equally sized files, the number of reads is n(n-1).

>>

>> Alternatively, if we relied on MD5 checksums, then each file would

>> only need to be read once.

>

>So...once it is found to be the same checksum, what should the

>program do next? How important are these files? A fundamental

>flaw would be to trust MD5 checksums as an indication that the

>files are indeed duplicates. You can mostly trust MD5 checksums

>to indicate two files are different, but the other way around?

 

OK, I retract my ill-informed comment, but it still seems to me that

the benefits far outweigh the risks. FindDup has been running for the

past 18 hours or so as I write this, so I'm happy to accept a 30

minute alternative. In any case, all programs appear to require that

the user decides whether or not a file can be safely deleted. To this

end the programmer could allow for a binary comparison in those cases

where there is any doubt.

 

- Franc Zabkar

--

Please remove one 'i' from my address when replying by email.

Posted

Re: Is there a hard drive file organizer that will ...

 

"Bill in Co." wrote:

> > A fundamental flaw would be to trust MD5 checksums as an

> > indication that the files are indeed duplicates.

>

> Since when? What is the statistical likelihood of that being

> true?

 

If there was no malicious intent or source involved, I'd say the odds are pretty low. But even if you had two identical hashes, it's simple enough to just see if the files are the same length, and if they are, then do a byte-by-byte comparison.
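That confirmation step is simple to sketch (Python; the helper name and chunk size are only illustrative):

import os

def definitely_identical(path_a, path_b, chunk_size=64 * 1024):
    """Cheap length check first, then a byte-by-byte comparison."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a, block_b = fa.read(chunk_size), fb.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files exhausted at the same point
                return True

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does much the same job.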

Guest J. P. Gilliver (John)
Posted

Re: Is there a hard drive file organizer that will ...

 

In message <6og9g4ppkkap70p8ljdadfh8v8g6q6isc3@4ax.com>, Franc Zabkar

<fzabkar@iinternode.on.net> writes

[]

>AFAICS, a fundamental flaw in duplicate finder software is that it

>relies on direct binary comparisons. With programs like FindDup, if we

>have 3 files of equal size, then we would need to compare file1 with

>file2, file1 with file3, and file2 with file3. This requires 6 reads.

>For n equally sized files, the number of reads is n(n-1).

 

Yes, if none of the comparisons match; if file 1 is found to be the same

as file 2, then there's no need to compare file 3 to both of them, only

one.

>

>Alternatively, if we relied on MD5 checksums, then each file would

>only need to be read once.

 

It's a while since I played with FindDup - but I think using checksums

of some sort is one of its configuration options.

[]

>>http://www.steffengerlach.de/freeware/

>>; this is what I can only describe as a hierarchical piecharter, and you

>>should try it. Of course, it must be rubbish, as it's only a 164K

[]

>I *love* small utility software. At the moment I'm playing with

>Windows CE in a small GPS device. It reminds me what can be done with

>a small amount of resources, eg a 16KB calculator, 23KB task manager,

>6.5KB screen capture utility.

[]

Of course, I was being sarcastic - I like small util.s too; not just for

the intrinsic appeal, but because they tend to run more quickly and with

fewer problems.

 

I still haven't found anything to beat flamer.com - OK, it is only a

fire simulator, but how it manages to do it in 437 bytes (4xx, anyway) I

still don't know. (Works under everything I've tried up to XP.)

--

J. P. Gilliver. UMRA: 1960/<1985 MB++G.5AL(+++)IS-P--Ch+(p)Ar+T[?]H+Sh0!:`)DNAf

Lada for sale - see http://www.autotrader.co.uk

 

This trip should be called "Driving Miss Crazy" - Emma Wilson, on crossing the

southern United States with her mother, Ann Robinson, 2003 or 2004

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

 

"Franc Zabkar" <fzabkar@iinternode.on.net> wrote in message

news:rpn9g4h6d10d3kv20ud3j02e2phuq66ucg@4ax.com...

> On Sun, 26 Oct 2008 16:23:16 -0400, "FromTheRafters"

> <erratic@nomail.afraid.org> put finger to keyboard and composed:

>

>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>> relies on direct binary comparisons. With programs like FindDup, if we

>>> have 3 files of equal size, then we would need to compare file1 with

>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>> For n equally sized files, the number of reads is n(n-1).

>>>

>>> Alternatively, if we relied on MD5 checksums, then each file would

>>> only need to be read once.

>>

>>So...once it is found to be the same checksum, what should the

>>program do next? How important are these files? A fundamental

>>flaw would be to trust MD5 checksums as an indication that the

>>files are indeed duplicates. You can mostly trust MD5 checksums

>>to indicate two files are different, but the other way around?

>

> OK, I retract my ill-informed comment, but it still seems to me that

> the benefits far outweigh the risks. FindDup has been running for the

> past 18 hours or so as I write this, so I'm happy to accept a 30

> minute alternative. In any case, all programs appear to require that

> the user decides whether or not a file can be safely deleted. To this

> end the programmer could allow for a binary comparison in those cases

> where there is any doubt.

 

It all depends on the risk you are willing to assume. It would be nice

to have a hybrid case where you could switch between the MD5

mode and the byte by byte mode depending on such factors as type

or location of files etc.
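Something along those lines is easy to sketch (Python; hashlib and os are standard library, and the names are purely illustrative, not taken from FindDup or any real product): group by size, hash each candidate once so every file is read a single time, and leave any group that still matches to a byte-by-byte check or the user's judgement.

import hashlib
import os
from collections import defaultdict

def md5_of(path, chunk_size=64 * 1024):
    """Read the file once and return its MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def duplicate_groups(paths):
    """Group by size first, then by MD5; each file is read at most once."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[md5_of(path)].append(path)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups  # a cautious caller can still byte-compare each group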

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

"Bill in Co." <not_really_here@earthlink.net> wrote in message

news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

> FromTheRafters wrote:

>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>> relies on direct binary comparisons. With programs like FindDup, if we

>>> have 3 files of equal size, then we would need to compare file1 with

>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>> For n equally sized files, the number of reads is n(n-1).

>>>

>>> Alternatively, if we relied on MD5 checksums, then each file would

>>> only need to be read once.

>>

>> So...once it is found to be the same checksum, what should the

>> program do next? How important are these files?

>

>> A fundamental

>> flaw would be to trust MD5 checksums as an indication that the

>> files are indeed duplicates.

>

> Since when?

 

Forever.

 

Checksums are often smaller than the file they are derived from

(that's kinda the point, eh?).

> What is the statistical likelihood of that being true?

 

Greater than zero.

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

> Your understanding of networks is not nearly as impeccable as

> your logic of finding duplicates on one drive on which I

> commented in my previous reply.

 

Okay, so as this thread reaches its EOL, it may interest

someone that all might not be as it seems.

 

I'm not sure about modern disk operating systems, but

some older ones would not actually make a copy when

asked to do so. Rather, they would make another full

path to the same data on disk (why waste space with

redundant data). Copying to another disk, or partition

on the same disk, would actually necessitate a copy

and would take longer as a result. When access was

made to the file, and it was modified, then the path used

to access that file would point to a newly created file

while the *original* would still be accessed from the

other paths.

 

So, deleting duplicate files on a single drive in this case

would only clean up the file system without freeing up

any hard drive space.
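Whatever those older systems did, the same situation exists today with hard links: two directory entries pointing at one set of blocks, so deleting one "duplicate" frees nothing. A duplicate finder can check for that before promising any savings; a rough Python sketch (st_dev/st_ino come from os.stat, which on recent Python versions also reports NTFS hard links; the rest is illustrative):

import os

def are_hard_linked(path_a, path_b):
    """True when both paths refer to the same data on disk."""
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)

# If are_hard_linked(keep, extra) is True, deleting 'extra' only removes
# a directory entry and reclaims no space.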

Guest Bill in Co.
Posted

Re: Is there a hard drive file organizer that will ...

 

FromTheRafters wrote:

> "Bill in Co." <not_really_here@earthlink.net> wrote in message

> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

>> FromTheRafters wrote:

>>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>>> relies on direct binary comparisons. With programs like FindDup, if we

>>>> have 3 files of equal size, then we would need to compare file1 with

>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>>> For n equally sized files, the number of reads is n(n-1).

>>>>

>>>> Alternatively, if we relied on MD5 checksums, then each file would

>>>> only need to be read once.

>>>

>>> So...once it is found to be the same checksum, what should the

>>> program do next? How important are these files?

>>

>>> A fundamental

>>> flaw would be to trust MD5 checksums as an indication that the

>>> files are indeed duplicates.

>>

>> Since when?

>

> Forever.

>

> Checksums are often smaller than the file they are derived from

> (that's kinda the point, eh?).

 

No, that's not the point. Your statement was that the checksums did not

assure the integrity of the file, whatsoever - i.e., that two files could

have the same hash values and yet be different, which I still say is *highly*

unlikely. A statistically insignificant probability, so that using hash

values is often prudent and is much more expedient, of course.
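For a sense of scale, the accidental-collision odds can be put to the birthday bound. This treats MD5 as a random 128-bit function, which only holds when nobody is deliberately crafting collisions - the "malicious intent" caveat raised earlier in the thread:

# Birthday-bound estimate: the chance that at least two of n files
# collide on a 128-bit hash by accident is roughly n*(n-1) / 2**129.
n = 1_000_000                 # say, a million equal-sized files
p = n * (n - 1) / 2**129
print(f"{p:.1e}")             # prints 1.5e-27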

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

 

"Bill in Co." <not_really_here@earthlink.net> wrote in message

news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...

> FromTheRafters wrote:

>> "Bill in Co." <not_really_here@earthlink.net> wrote in message

>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

>>> FromTheRafters wrote:

>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>>>> relies on direct binary comparisons. With programs like FindDup, if we

>>>>> have 3 files of equal size, then we would need to compare file1 with

>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>>>> For n equally sized files, the number of reads is n(n-1).

>>>>>

>>>>> Alternatively, if we relied on MD5 checksums, then each file would

>>>>> only need to be read once.

>>>>

>>>> So...once it is found to be the same checksum, what should the

>>>> program do next? How important are these files?

>>>

>>>> A fundamental

>>>> flaw would be to trust MD5 checksums as an indication that the

>>>> files are indeed duplicates.

>>>

>>> Since when?

>>

>> Forever.

>>

>> Checksums are often smaller than the file they are derived from

>> (that's kinda the point, eh?).

>

> No, that's not the point. Your statement was that the checksums did not

> assure the integrity of the file, whatsoever

 

I didn't say anything about the integrity of a file, and I also didn't

say 'whatsoever'. You can still read what I said above.

 

If you want to ensure they are duplicates - compare the files exactly.

If you only need to be reasonably sure they are duplicates, checksums

are adequate.

> - i.e., that two files could have the same hash values and yet be

> different, which I still say is *highly* unlikely.

 

Highly unlikely -yes. But files can be highly valuable too. Just

how fast does such a program need to be? How much speed

is worth how much accuracy?

> A statistically insignificant probability, so that using hash values is

> often prudent and is much more expedient, of course.

 

True, but to aim toward accuracy instead of speed is not a flaw.

Guest thanatoid
Posted

Re: Is there a hard drive file organizer that will ...

 

"FromTheRafters" <erratic@nomail.afraid.org> wrote in

news:#mS$aA8NJHA.3876@TK2MSFTNGP04.phx.gbl:

>> Your understanding of networks is not nearly as impeccable as

>> your logic of finding duplicates on one drive on which I

>> commented in my previous reply.

>

> Okay, so as this thread reaches its EOL, it may interest

> someone that all might not be as it seems.

>

> I'm not sure about modern disk operating systems, but

> some older ones would not actually make a copy when

> asked to do so. Rather, they would make another full

> path to the same data on disk (why waste space with

> redundant data). Copying to another disk, or partition

> on the same disk, would actually necessitate a copy

> and would take longer as a result. When access was

> made to the file, and it was modified, then the path used

> to access that file would point to a newly created file

> while the *original* would still be accessed from the

> other paths.

>

> So, deleting duplicate files on a single drive in this case

> would only clean up the file system without freeing up

> any hard drive space.

 

OR deleting duplicates, it would seem (don't want to read it

again, see below).

 

Thanks for the headache. What a nightmare.

 

 

--

Those who cast the votes decide nothing. Those who count the

votes decide everything.

- Josef Stalin

 

NB: Not only is my KF over 4 KB and growing, I am also filtering

everything from discussions.microsoft and google groups, so no

offense if you don't get a reply/comment unless I see you quoted

in another post.

Guest Bill in Co.
Posted

Re: Is there a hard drive file organizer that will ...

 

FromTheRafters wrote:

> "Bill in Co." <not_really_here@earthlink.net> wrote in message

> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...

>> FromTheRafters wrote:

>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message

>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

>>>> FromTheRafters wrote:

>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>>>>> relies on direct binary comparisons. With programs like FindDup, if

>>>>>> we

>>>>>> have 3 files of equal size, then we would need to compare file1 with

>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>>>>> For n equally sized files, the number of reads is n(n-1).

>>>>>>

>>>>>> Alternatively, if we relied on MD5 checksums, then each file would

>>>>>> only need to be read once.

>>>>>

>>>>> So...once it is found to be the same checksum, what should the

>>>>> program do next? How important are these files?

>>>>

>>>>> A fundamental

>>>>> flaw would be to trust MD5 checksums as an indication that the

>>>>> files are indeed duplicates.

>>>>

>>>> Since when?

>>>

>>> Forever.

>>>

>>> Checksums are often smaller than the file they are derived from

>>> (that's kinda the point, eh?).

>>

>> No, that's not the point. Your statement was that the checksums did not

>> assure the integrity of the file, whatsoever

>

> I didn't say anything about the integrity of a file, and I also didn't

> say 'whatsoever'. You can still read what I said above.

>

> If you want to ensure they are duplicates - compare the files exactly.

> If you only need to be reasonably sure they are duplicates, checksums

> are adequate.

 

"Very reasonably sure" is correct.

>> - i.e., that two files could have the same hash values and yet be

>> different, which I still say is *highly* unlikely.

>

> Highly unlikely -yes. But files can be highly valuable too. Just

> how fast does such a program need to be? How much speed

> is worth how much accuracy?

 

That's the question, isn't it. Considering the difference in speed, and

for most of our applications, I'd say the hash checksum approach does just

fine. :-)

>> A statistically insignificant probability, so that using hash values is

>> often prudent and is much more expedient, of course.

>

> True, but to aim toward accuracy instead of speed is not a flaw.

 

And there is a point of diminishing returns. Prudence comes in here; i.e.,

using the appropriate technique for the case at hand.

Guest Franc Zabkar
Posted

Re: Is there a hard drive file organizer that will ...

 

On Sun, 26 Oct 2008 21:46:24 -0400, "FromTheRafters"

<erratic@nomail.afraid.org> put finger to keyboard and composed:

>

>"Bill in Co." <not_really_here@earthlink.net> wrote in message

>news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...

>> FromTheRafters wrote:

>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message

>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

>>>> FromTheRafters wrote:

>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>>>>> relies on direct binary comparisons. With programs like FindDup, if we

>>>>>> have 3 files of equal size, then we would need to compare file1 with

>>>>>> file2, file1 with file3, and file2 with file3. This requires 6 reads.

>>>>>> For n equally sized files, the number of reads is n(n-1).

>>>>>>

>>>>>> Alternatively, if we relied on MD5 checksums, then each file would

>>>>>> only need to be read once.

>>>>>

>>>>> So...once it is found to be the same checksum, what should the

>>>>> program do next? How important are these files?

>>>>

>>>>> A fundamental

>>>>> flaw would be to trust MD5 checksums as an indication that the

>>>>> files are indeed duplicates.

>>>>

>>>> Since when?

>>>

>>> Forever.

>>>

>>> Checksums are often smaller than the file they are derived from

>>> (that's kinda the point, eh?).

>>

>> No, that's not the point. Your statement was that the checksums did not

>> assure the integrity of the file, whatsoever

>

>I didn't say anything about the integrity of a file, and I also didn't

>say 'whatsoever'. You can still read what I said above.

>

>If you want to ensure they are duplicates - compare the files exactly.

>If you only need to be reasonably sure they are duplicates, checksums

>are adequate.

>

>> - i.e., that two files could have the same hash values and yet be

>> different, which I still say is *highly* unlikely.

>

>Highly unlikely -yes. But files can be highly valuable too. Just

>how fast does such a program need to be? How much speed

>is worth how much accuracy?

>

>> A statistically insignificant probability, so that using hash values is

>> often prudent and is much more expedient, of course.

>

>True, but to aim toward accuracy instead of speed is not a flaw.

 

Sorry, bad choice of word on my part. However, speed and accuracy, or

speed and safety, are legitimate compromises that we make on a daily

basis. For example, our residential speed limit has been reduced from

60kph to 50kph in the interests of public safety, but we could easily

have a zero road toll if we reduced the limit all the way to 1kph.

Similarly, I could have left FindDup running for several more hours (I

killed it after about 24), but the inconvenience finally got to me.

I'd rather go for speed with something like FastSum, and safeguard

against unlikely losses with a total backup. In fact, I wonder why it

is that no antivirus product seems to be able to reliably detect *all

known* viruses. Is this an intentional compromise of speed versus

security? For example, I used to download Trend Micro's pattern file

updates manually for some time, and noticed that the ZIP files grew to

as much a 23MB until about a year (?) ago when they suddenly shrank to

only 15MB. Have Trend Micro decided to exclude extinct or rare viruses

from their database, or have they really found a more efficient way to

do things?

 

- Franc Zabkar

--

Please remove one 'i' from my address when replying by email.

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

"Franc Zabkar" <fzabkar@iinternode.on.net> wrote in message

news:upjag4hvtmrrrnldacv22l39huud20jt74@4ax.com...

[snip]

> Sorry, bad choice of word on my part. However, speed and accuracy, or

> speed and safety, are legitimate compromises that we make on a daily

> basis. For example, our residential speed limit has been reduced from

> 60kph to 50kph in the interests of public safety, but we could easily

> have a zero road toll if we reduced the limit all the way to 1kph.

> Similarly, I could have left FindDup running for several more hours (I

> killed it after about 24), but the inconvenience finally got to me.

> I'd rather go for speed with something like FastSum, and safeguard

> against unlikely losses with a total backup.

 

Having read some of your excellent posts, I was sure you would

know where I was coming from with those comments.

> In fact, I wonder why it

> is that no antivirus product seems to be able to reliably detect *all

> known* viruses.

 

Virus detection is reducible to "The Halting Problem".

http://claymania.com/halting-problem.html

 

Add to that the many methods applied by viruses to make the task

more difficult for the detector.

 

Heuristics is a less accurate but faster approach, and reminds me of the current topic (only more markedly). It seems a shame to treat a near-100%-accurate byte-by-byte method as equivalent to an MD5 hash when, in the virus detection world, 100% is a pipe dream and heuristics must be dampened to avoid false positives getting out of hand.

 

Many of the better AV programs use a mixture of methods, including but not limited to the above.

> Is this an intentional compromise of speed versus

> security? For example, I used to download Trend Micro's pattern file

> updates manually for some time, and noticed that the ZIP files grew to

> as much as 23MB until about a year (?) ago when they suddenly shrank to

> only 15MB. Have Trend Micro decided to exclude extinct or rare viruses

> from their database, or have they really found a more efficient way to

> do things?

 

You might find this interesting:

 

http://us.trendmicro.com/imperia/md/content/us/pdf/threats/securitylibrary/perry-vb2008.pdf

 

I highly suspect that old (extinct?) viruses will still be detected.

Guest FromTheRafters
Posted

Re: Is there a hard drive file organizer that will ...

 

 

"Bill in Co." <not_really_here@earthlink.net> wrote in message

news:%23c5PZw%23NJHA.1144@TK2MSFTNGP05.phx.gbl...

> FromTheRafters wrote:

>> "Bill in Co." <not_really_here@earthlink.net> wrote in message

>> news:ugvJco8NJHA.3636@TK2MSFTNGP05.phx.gbl...

>>> FromTheRafters wrote:

>>>> "Bill in Co." <not_really_here@earthlink.net> wrote in message

>>>> news:%23RrPjC7NJHA.1908@TK2MSFTNGP04.phx.gbl...

>>>>> FromTheRafters wrote:

>>>>>>> AFAICS, a fundamental flaw in duplicate finder software is that it

>>>>>>> relies on direct binary comparisons. With programs like FindDup, if

>>>>>>> we

>>>>>>> have 3 files of equal size, then we would need to compare file1 with

>>>>>>> file2, file1 with file3, and file2 with file3. This requires 6

>>>>>>> reads.

>>>>>>> For n equally sized files, the number of reads is n(n-1).

>>>>>>>

>>>>>>> Alternatively, if we relied on MD5 checksums, then each file would

>>>>>>> only need to be read once.

>>>>>>

>>>>>> So...once it is found to be the same checksum, what should the

>>>>>> program do next? How important are these files?

>>>>>

>>>>>> A fundamental

>>>>>> flaw would be to trust MD5 checksums as an indication that the

>>>>>> files are indeed duplicates.

>>>>>

>>>>> Since when?

>>>>

>>>> Forever.

>>>>

>>>> Checksums are often smaller than the file they are derived from

>>>> (that's kinda the point, eh?).

>>>

>>> No, that's not the point. Your statement was that the checksums did

>>> not

>>> assure the integrity of the file, whatsoever

>>

>> I didn't say anything about the integrity of a file, and I also didn't

>> say 'whatsoever'. You can still read what I said above.

>>

>> If you want to ensure they are duplicates - compare the files exactly.

>> If you only need to be reasonably sure they are duplicates, checksums

>> are adequate.

>

> "Very reasonably sure" is correct.

>

>>> - i.e., that two files could have the same hash values and yet be

>>> different, which I still say is *highly* unlikely.

>>

>> Highly unlikely -yes. But files can be highly valuable too. Just

>> how fast does such a program need to be? How much speed

>> is worth how much accuracy?

>

> That's the question, isn't it. Considering the difference in speed, and

> for most of our applications, I'd say the hash checksum approach does just

> fine. :-)

>

>>> A statistically insignificant probability, so that using hash values is

>>> often prudent and is much more expedient, of course.

>>

>> True, but to aim toward accuracy instead of speed is not a flaw.

>

> And there is a point of diminishing returns. Prudence comes in here;

> i.e., using the appropriate technique for the case at hand.

 

We agree then! :o)

 

I think a hybrid approach would be best. For instance, filetypes like JPEG are rather large and I value them much lower than I do PDF, DOC, and even some JPEGs, depending on their location. Plus, in the event of a disaster, that puts the responsibility on the user, who made the informed decision to use a pretty nearly flawless approach instead of a most nearly flawless one.
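One way a tool might let the user express that kind of per-type, per-location choice (a sketch only; the table, names, and directories are invented for illustration, not taken from any product):

import os

# "hash"  = MD5 only (fast); "exact" = MD5 plus a byte-by-byte confirmation.
POLICY = {".jpg": "hash", ".jpeg": "hash", ".pdf": "exact", ".doc": "exact"}

def verification_for(path, valuable_dirs=("Documents", "Scans")):
    """Location overrides extension, as suggested above."""
    if any(marker in path.split(os.sep) for marker in valuable_dirs):
        return "exact"
    return POLICY.get(os.path.splitext(path)[1].lower(), "exact")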

 

Writing such a program and selling it as just as good as, but faster than, byte-by-byte comparison could leave one open to a lawsuit.
