[clamav-users] Scanning very large files in chunks

TR Shaw tshaw at oitc.com
Fri Aug 12 17:51:39 EDT 2016

Actually, there is always a probability that a detection will not occur if you break a file apart into pieces. This is due to the following:

1) MD5 signatures are applied to any file, regardless of its type, and match on the MD5 hash of that file AND the file’s size. If you break a file apart, neither the hash nor the file size will match the signature.

2) Complex signatures that are a logical grouping of the results of multiple other signature detections are the other type that can break if you break a file into pieces.
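
To make both points concrete, here is a minimal sketch in Python. It is a toy model, not ClamAV's engine: hash_sig and logical_and_match are hypothetical helpers, the sample data and marker strings are invented, and the "logical" rule is reduced to a plain AND over byte patterns.

```python
import hashlib

def hash_sig(data: bytes) -> tuple[str, int]:
    """Return the (MD5 hex digest, size) pair that a hash signature keys on."""
    return hashlib.md5(data).hexdigest(), len(data)

def logical_and_match(data: bytes, subpatterns: list[bytes]) -> bool:
    """Toy 'logical' rule: fires only if every sub-pattern occurs in data."""
    return all(p in data for p in subpatterns)

# Hypothetical file: two marker strings far apart in otherwise benign data.
whole = b"A" * 1000 + b"MARKER_ONE" + b"B" * 1000 + b"MARKER_TWO" + b"C" * 1000
first, second = whole[:1500], whole[1500:]

# 1) Hash signature: neither piece has the hash or the size of the whole file.
signature = hash_sig(whole)
hash_hits = [hash_sig(piece) == signature for piece in (first, second)]

# 2) Logical grouping: each piece holds only one of the two sub-patterns.
rule = [b"MARKER_ONE", b"MARKER_TWO"]
whole_hit = logical_and_match(whole, rule)
piece_hits = [logical_and_match(piece, rule) for piece in (first, second)]

print(hash_hits, whole_hit, piece_hits)
```

Scanned whole, the toy file matches both the hash signature and the logical rule; scanned as two pieces, it matches neither.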

This question of breaking files apart for scanning comes up regularly from folks who need to support high data rate inputs and still be NIST FISMA compliant, and the answer is always no, you can’t do that.


> On Aug 12, 2016, at 5:32 PM, sapientdust+clamav at gmail.com wrote:
> On Thu, Aug 11, 2016 at 10:15 AM, G.W. Haywood
> <clamav at jubileegroup.co.uk> wrote:
>> Hello once again,
>> On Thu, 11 Aug 2016, sapientdust+clamav at gmail.com wrote:
>>> I scan a 4.5 GB file in multiple instream calls, by scanning the first
>>> 3 GB in one call, and then making a second instream call that provides
>>> the first N MB followed by the last 2 GB of the file.
>>> Would clamav be expected to work similarly in the two cases in terms
>>> of identifying a virus, assuming the virus is the same in the two
>>> scenarios and it's in ClamAV's database? Or are there technical
>>> reasons why ClamAV wouldn't detect the virus in the second scenario
>>> but would in the first, even though the virus bytes are identical?
>> There's a possibility of failing to find it in the second scenario.
>> It's anybody's guess what the probability will be; my guess would be
>> that the probability of that failure would be small compared with the
>> relatively large probability of not finding it at all in both cases.
> I was hoping to hear from a developer, because non-developer "guesses"
> don't help me very much. There is a definite answer to my question,
> but only somebody familiar with the ClamAV code will know the answer.
> From what I've been able to learn based on some emails on the
> developer list, ClamAV specifies virus signatures as either offsets in
> a file of a certain type (and they are only found at that offset), or
> they are specified as a pattern that can appear anywhere in the file.
> I think it most likely that any virus inserted in a very large file
> (multiple GB) would probably be of the second kind, which will be
> recognized if those bytes are scanned anywhere within the file, as
> long as the correct file type is identified. That means that as long
> as the first N bytes are prepended to each chunk so that ClamAV thinks
> each file is a file of the same type, it will identify the virus in a
> block from a large file as reliably as it would identify the same
> virus in a large file that is nevertheless small enough to be scanned
> in one call.
>>> This is a question for clamav developers or those who understand the
>>> codebase sufficiently to know the impact of scanning a partial file.
>> I don't think so.  Just think about it a bit:
>> Much of ClamAV's operation is looking for pattern matches.
>> Suppose you scan a 4.5GB file in two chunks.
>> Suppose half this mysterious 'huge file virus' is in the first chunk.
>> Presumably the other half is in the second chunk.
>> What happens if the pattern is designed to match the entire virus?
> First, the virus itself would not be huge. It would be just a normal
> virus embedded in a large file, where almost all the size is
> legitimate data. Every example I gave since my first email shows the
> chunks being broken up in such a way that your scenario cannot arise,
> because data that would be split is repeated in the next block in such
> a way that it's not split into two pieces in the subsequent block.
> Note above, where I said a 4.5 GB file would be scanned in two calls,
> the first providing bytes 0-3GB, and the second providing the first N
> bytes concatenated with the bytes from 2.5GB-4.5GB. The boundary at
> 3GB is squarely inside the second block and not split across blocks so
> that it's still recognized in its entirety, as long as the virus isn't
> larger than 500MB, which as far as I can tell is always the case for
> the sorts of things that ClamAV can identify.
> _______________________________________________
> Help us build a comprehensive ClamAV guide:
> https://github.com/vrtadmin/clamav-faq
> http://www.clamav.net/contact.html#ml
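
For reference, the overlapping-chunk layout described in the quoted message can be sketched as below. This is only an illustration of the byte ranges: chunk_ranges is a hypothetical helper, GB is taken as 2**30 bytes, and prepending the first N bytes to later chunks is left out. It does nothing about the hash and logical signature cases above.

```python
def chunk_ranges(file_size, chunk_size, overlap):
    """Return (start, end) byte ranges covering the file.

    Each range after the first starts `overlap` bytes before the
    previous range ended, so data near a boundary is seen twice.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_size, file_size)
        ranges.append((start, end))
        if end >= file_size:
            break
        start = end - overlap
    return ranges

GB = 1024 ** 3  # assumption: binary gigabytes
# The 4.5 GB example from the thread: 3 GB chunks with a 0.5 GB overlap.
ranges = chunk_ranges(file_size=9 * GB // 2, chunk_size=3 * GB, overlap=GB // 2)
for start, end in ranges:
    print(start / GB, end / GB)
```

With these values the two ranges are 0 to 3 GB and 2.5 GB to 4.5 GB, so the 3 GB boundary lies inside the second range, as in the quoted description.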
