[clamav-users] Scanning very large files in chunks

sapientdust+clamav at gmail.com sapientdust+clamav at gmail.com
Fri Aug 12 17:32:22 EDT 2016

On Thu, Aug 11, 2016 at 10:15 AM, G.W. Haywood
<clamav at jubileegroup.co.uk> wrote:
> Hello once again,
> On Thu, 11 Aug 2016, sapientdust+clamav at gmail.com wrote:
>> I scan a 4.5 GB file in multiple instream calls, by scanning the first
>> 3 GB in one call, and then making a second instream call that provides
>> the first N MB followed by the last 2 GB of the file.
>> Would clamav be expected to work similarly in the two cases in terms
>> of identifying a virus, assuming the virus is the same in the two
>> scenarios and it's in ClamAV's database? Or are there technical
>> reasons why ClamAV wouldn't detect the virus in the second scenario
>> but would in the first, even though the virus bytes are identical?
> There's a possibility of failing to find it in the second scenario.
> It's anybody's guess what the probability will be; my guess would be
> that the probability of that failure would be small compared with the
> relatively large probability of not finding it at all in both cases.

I was hoping to hear from a developer, because non-developer "guesses"
don't help me very much. There is a definite answer to my question,
but only somebody familiar with the ClamAV code will know the answer.

From what I've been able to learn based on some emails on the
developer list, ClamAV specifies virus signatures as either offsets in
a file of a certain type (and they are only found at that offset), or
they are specified as a pattern that can appear anywhere in the file.
I think it most likely that a virus inserted in a very large file
(multiple GB) would be of the second kind, which is recognized
wherever those bytes appear in the file, as long as the correct file
type is identified. That means that as long as the first N bytes are
prepended to each chunk, so that ClamAV treats each chunk as a file of
the same type, it will identify the virus in a chunk of a large file
as reliably as it would identify the same virus in a file that is
small enough to be scanned in one call.
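
To make the prepending idea concrete, here is a minimal Python sketch
of how the data for one chunk could be assembled. The function name
chunk_with_header and its parameters are hypothetical and are not part
of ClamAV; the sketch only produces the bytes to be scanned and leaves
out how they are handed to clamd. It also assumes the chunk starts
after the header, so the header bytes are not sent twice.

```python
import io


def chunk_with_header(f, header_len, start, end, block=1 << 20):
    """Yield the first header_len bytes of f, then bytes [start, end) of f.

    f must be a seekable binary file object. Data is yielded in pieces of
    at most `block` bytes so a multi-GB slice is never held in memory.
    """
    # The header is what lets the scanner see the same file type
    # for every chunk (an assumption taken from the discussion above).
    f.seek(0)
    yield f.read(header_len)

    # Then stream the requested slice of the file.
    f.seek(start)
    remaining = end - start
    while remaining > 0:
        data = f.read(min(block, remaining))
        if not data:  # file is shorter than `end`
            break
        remaining -= len(data)
        yield data


# Tiny in-memory stand-in for a large file.
demo = io.BytesIO(b"HEADERxxxxPAYLOADyyyy")
payload = b"".join(chunk_with_header(demo, header_len=6, start=10, end=17))
print(payload)  # b'HEADERPAYLOAD'
```

Each yielded piece could then be written to whatever scanning
interface is in use. Note that a signature tied to a fixed offset in
the original file would no longer sit at that offset in such a chunk.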

>> This is a question for clamav developers or those who understand the
>> codebase sufficiently to know the impact of scanning a partial file.
> I don't think so.  Just think about it a bit:
> Much of ClamAV's operation is looking for pattern matches.
> Suppose you scan a 4.5GB file in two chunks.
> Suppose half this mysterious 'huge file virus' is in the first chunk.
> Presumably the other half is in the second chunk.
> What happens if the pattern is designed to match the entire virus?

First, the virus itself would not be huge. It would be just a normal
virus embedded in a large file, where almost all the size is
legitimate data. Every example I have given since my first email shows
the chunks being broken up in such a way that your scenario cannot
arise, because the data on either side of a chunk boundary is repeated
in the next block, where it appears whole rather than split in two.
Note above, where I said a 4.5 GB file would be scanned in two calls,
the first providing bytes 0-3GB, and the second providing the first N
bytes concatenated with the bytes from 2.5GB-4.5GB. The boundary at
3GB falls squarely inside the second block, so a virus that straddles
it is not split across blocks and is still recognized in its entirety,
as long as the virus isn't larger than 500MB, which as far as I can
tell is always the case for the sorts of things that ClamAV can
identify.
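
The overlap arithmetic can be checked with a short Python sketch. The
helper names plan_chunks and covered are hypothetical, and sizes are in
decimal megabytes so that 0.5 GB equals 500 MB as in the example. It
only reasons about byte ranges; it says nothing about how ClamAV
matches signatures internally.

```python
MB = 10**6  # decimal megabytes, so 4.5 GB == 4500 MB


def plan_chunks(file_size, chunk_size, overlap):
    """Return (start, end) byte ranges that cover the file.

    Each range is at most chunk_size long, and each range after the
    first begins `overlap` bytes before the previous range ended.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_size, file_size)
        ranges.append((start, end))
        if end >= file_size:
            break
        start = end - overlap
    return ranges


def covered(ranges, pos, length):
    """True if bytes [pos, pos + length) lie entirely inside one range."""
    return any(s <= pos and pos + length <= e for s, e in ranges)


# The 4.5 GB example: 3 GB chunks with a 500 MB overlap.
ranges = plan_chunks(4500 * MB, 3000 * MB, 500 * MB)
# ranges == [(0, 3000 * MB), (2500 * MB, 4500 * MB)]

# A 200 MB pattern straddling the 3 GB boundary is whole in chunk two.
print(covered(ranges, 2900 * MB, 200 * MB))  # True

# A 600 MB pattern is larger than the overlap and can be split.
print(covered(ranges, 2450 * MB, 600 * MB))  # False
```

The second check shows the limit stated above: a pattern is only
guaranteed to appear whole in some chunk if it is no longer than the
overlap.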
