[clamav-users] Scanning very large files in chunks

sapientdust+clamav at gmail.com sapientdust+clamav at gmail.com
Wed Aug 10 14:21:18 EDT 2016


On Wed, Aug 10, 2016 at 10:11 AM, G.W. Haywood
<clamav at jubileegroup.co.uk> wrote:
> Hello again,
> In August 2016, sapientdust+clamav at gmail.com wrote:
>> The specifics are not important to my question
> That's not what you said earlier.  To be specific, you said:
>> >> In my case, the consequence factor is very large

Those two statements are perfectly consistent. The consequences are
significant enough that I have to scan all files, but why the
consequences are large, or what the specific consequences are, doesn't
matter for my technical question.

>> >> Does anybody have any feedback on the proposed solution to scanning
>> >> large files in chunks?
>> > Stop worrying about it, it's a waste of time and effort.  The
>> > probability that you will actually find what you're looking for is
>> > very small.
>> What are the technical reasons that the probability is very small
>> (compared to the probability of finding a virus if the file is small
>> enough to be scanned in one instream call)?
> I didn't say anything about comparisons.  You asked for feedback, I
> gave you some, and I said you wouldn't like it.  You're not going to
> like it any better if you modify the question, because my feedback is
> going to be the same.  I've been using ClamAV for more than a decade
> so I have a reasonable idea what it can achieve and what it can't.

I didn't say that you mentioned comparisons. I was making clear that
I'm not asking for 100% reliability and I'm not asking whether the
multi-scan idea is perfect in some general sense, but only whether
it's significantly worse than scanning a smaller file that doesn't
need to be broken into multiple pieces.

I'm interested in knowing if there are technical reasons why the
following two scenarios would work very differently:

scenario 1:

I scan a 2.5 GB file in one instream call

scenario 2:

I scan a 4.5 GB file in multiple instream calls, by scanning the first
3 GB in one call, and then making a second instream call that provides
the first N MB followed by the last 2 GB of the file.
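
For concreteness, the byte ranges in the two scenarios could be planned
roughly as in the sketch below. It is only an illustration under stated
assumptions: the helper names plan_ranges and frame_instream are
hypothetical, N is taken to be 1 MB purely as a placeholder, and GB/MB
are assumed to be binary units. The framing follows clamd's documented
INSTREAM format (each chunk prefixed by its length as a 4-byte unsigned
integer in network byte order, terminated by a zero-length chunk). The
sketch never contacts a real clamd, and each call would still have to
fit within the StreamMaxLength configured in clamd.conf.

```python
import struct

GB = 1024 ** 3  # assumption: binary units
MB = 1024 ** 2


def plan_ranges(file_size, first_len=3 * GB, header_len=1 * MB, tail_len=2 * GB):
    """Return one list of (offset, length) byte ranges per INSTREAM call.

    header_len stands in for the "first N MB"; 1 MB is a placeholder.
    """
    if file_size <= first_len:
        # Scenario 1: the whole file fits in a single call.
        return [[(0, file_size)]]
    tail_len = min(tail_len, file_size)
    if first_len + tail_len < file_size:
        raise ValueError("the two calls would leave a gap in the middle")
    # Keep the repeated header from overlapping the tail.
    header_len = min(header_len, file_size - tail_len)
    return [
        [(0, first_len)],                                     # call 1
        [(0, header_len), (file_size - tail_len, tail_len)],  # call 2
    ]


def frame_instream(pieces):
    """Yield the bytes of one INSTREAM session for an iterable of byte strings."""
    yield b"zINSTREAM\0"
    for piece in pieces:
        if piece:  # an empty piece would be read as the terminator
            yield struct.pack("!I", len(piece)) + piece
    yield struct.pack("!I", 0)  # zero-length chunk ends the stream
```

Calling plan_ranges(int(4.5 * GB)) gives the two calls of scenario 2,
and plan_ranges(int(2.5 * GB)) gives the single call of scenario 1. In
practice each range would be read from disk in small pieces and passed
through frame_instream before being written to the clamd socket.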

Would clamav be expected to work similarly in the two cases in terms
of identifying a virus, assuming the virus is the same in the two
scenarios and it's in ClamAV's database? Or are there technical
reasons why ClamAV wouldn't detect the virus in the second scenario
but would in the first, even though the virus bytes are identical?

This is a question for clamav developers or those who understand the
codebase sufficiently to know the impact of scanning a partial file.

Should I have asked this question on the developer list? I asked here
because it looked like the developer list gets very little use, and I
thought developers would probably be on this list too.
