Oh, absolutely, Micah. My scan times were negligible because I was scanning a single PHP file of about 150 bytes (an opening PHP tag, two lines of comments, and a call to phpinfo), so the times I gave were entirely load time.

I'm glad that you found the information helpful.

--Maarten

On Tue, Apr 9, 2019 at 5:29 PM Micah Snyder (micasnyd) <micasnyd@cisco.com> wrote:

Maarten,

 

Your test results are pretty great.  I really like your breakdown of the signatures by category.  I will caution that scan times will vary quite heavily depending on what you’re scanning, based on Target type (https://www.clamav.net/documents/clamav-file-types). 

 

In addition, it’s important to distinguish between load and scan times.  The time reported by clamscan is load time plus scan time.  If you just want scan time, you will want to load the database with clamd and then measure the scan time with clamdscan.
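
As a rough illustration of that distinction, the sketch below measures the wall-clock time of an arbitrary command. The helper name wall_time and the sample path in the comments are hypothetical, and it assumes clamd is already running with the database loaded when clamdscan is the command being timed.

```python
import subprocess
import sys
import time


def wall_time(cmd):
    """Run cmd (a list of arguments) and return (elapsed seconds, exit code)."""
    start = time.perf_counter()
    proc = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start, proc.returncode


# Hypothetical usage (paths are placeholders):
#   wall_time(["clamscan", "/tmp/sample.pdf"])   # database load + scan
#   wall_time(["clamdscan", "/tmp/sample.pdf"])  # mostly scan, clamd holds the database
```

Comparing the two numbers for the same file gives an approximate idea of how much of the clamscan figure is database load.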

 

Regarding load time vs. scan time, all of the signatures must be loaded, but depending on the target type of the file being scanned, not all of the signatures will be matched against the file.  That is, daily_Win.ldb might take the longest to load due to the number or complexity of its signatures, but when scanning a PDF, those signatures probably won’t impact scan time, as Win signatures are probably mostly target type 1 (PE file).

 

I’ve spent a bit of time today investigating what I believe is responsible for the slow load and scan times for the Phishtank sigs.  I had a hunch, based on a conversation we saw a while back on the mailing list, that the identical beginning of the URL-based signatures results in an unbalanced and inefficient tree for matching. That is, some 3000 signatures each began with one of:

  1. href="http:// (687265663d22687474703a2f2f)
  2. HYPERLINK "http:// (48595045524c494e4b2022687474703a2f2f)
  3. S/URI/URI(http:// (532f5552492f55524928687474703a2f2f)
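
For anyone who wants to check the hex against the text, here is a minimal sketch that decodes the three prefixes and counts how many signature bodies start with each. The function names and the idea of passing in a list of hex bodies are assumptions for illustration; only the three hex strings come from the message.

```python
# The three hex prefixes quoted in the message.
PREFIXES = [
    "687265663d22687474703a2f2f",
    "48595045524c494e4b2022687474703a2f2f",
    "532f5552492f55524928687474703a2f2f",
]


def decode_prefix(hex_string):
    """Turn a hex-encoded prefix back into readable ASCII."""
    return bytes.fromhex(hex_string).decode("ascii")


def count_shared_prefixes(hex_bodies, prefixes=PREFIXES):
    """Count how many hex signature bodies begin with each known prefix."""
    counts = {p: 0 for p in prefixes}
    for body in hex_bodies:
        lowered = body.lower()
        for p in prefixes:
            if lowered.startswith(p):
                counts[p] += 1
                break
    return counts


for p in PREFIXES:
    print(decode_prefix(p))
```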

 

Looking at a few of the Phish.Phishing signatures, these appear to have the same issue (the href="http:// prefix).  In testing with a scan of a PDF document, I was able to reduce the scan time from 31.987 sec to 2.632 sec simply by changing the start of the Phishtank signatures as follows:

  1. href="http://
    1. from: 687265663d22687474703a2f2f
    2. to: 687265663d2268747470{3-4}
  2. HYPERLINK "http://
    1. from: 48595045524c494e4b2022687474703a2f2f
    2. to: 48595045524c494e4b202268747470{3-4}
  3. S/URI/URI(http://
    1. from: 532f5552492f55524928687474703a2f2f
    2. to: 532f5552492f5552492868747470{3-4}
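
The substitution listed above can be sketched as a small string rewrite. This is only an illustration under stated assumptions: the function name relax_http_prefix is made up, it operates on a bare hex body rather than a full database line, and it only touches the three prefixes named in this message. In ClamAV's hex signature syntax, {3-4} matches any 3 to 4 bytes, so it covers both "://" and "s://".

```python
HTTP_LITERAL = "687474703a2f2f"  # "http://"
HTTP_WILDCARD = "68747470{3-4}"  # "http" followed by any 3 to 4 bytes

# Leading parts of the three prefixes, without the "http://" portion.
KNOWN_PREFIXES = (
    "687265663d22",            # href="
    "48595045524c494e4b2022",  # HYPERLINK "
    "532f5552492f55524928",    # S/URI/URI(
)


def relax_http_prefix(hex_sig):
    """Replace a leading literal http:// with http{3-4} for the known prefixes."""
    lowered = hex_sig.lower()
    for prefix in KNOWN_PREFIXES:
        literal = prefix + HTTP_LITERAL
        if lowered.startswith(literal):
            # Keep the remainder of the signature exactly as it was.
            return prefix + HTTP_WILDCARD + hex_sig[len(literal):]
    return hex_sig  # anything else is left unchanged
```

For example, relax_http_prefix("687265663d22687474703a2f2f6578616d706c65") returns "687265663d2268747470{3-4}6578616d706c65".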

 

This should give the same detection with faster load and scan times, and it will also match https for better coverage.  To turn lemonade into really good lemonade, we may be able to apply the same optimization to the Phish.Phishing signatures identified by Maarten, reducing scan times to below where they were before the Phishtank signatures were added.

 

As noted by Maarten as well, the Phish.Phishing sigs are Target type 0, whereas we’d split the Phishtank.Phishing signatures up by target type to reduce scan times of files where the signatures won’t apply.  It should also speed things up quite a bit for other file types to split those up by Target types.
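
To see how a set of logical signatures is distributed across target types, something like the following sketch could be run over an unpacked .ldb file. It assumes the documented logical signature layout (SignatureName;TargetDescriptionBlock;LogicalExpression;Subsig0;...), with the target given as a Target:N entry in the second field. The function names and the sample lines in the comments are invented for illustration.

```python
from collections import Counter


def target_type(ldb_line):
    """Return the Target value from a logical signature line, or None if absent."""
    fields = ldb_line.strip().split(";")
    if len(fields) < 2:
        return None
    # The second field is a comma-separated list such as "Engine:51-255,Target:0".
    for item in fields[1].split(","):
        key, _, value = item.partition(":")
        if key == "Target":
            return int(value)
    return None


def count_targets(lines):
    """Tally signatures per target type, skipping blank lines."""
    return Counter(target_type(line) for line in lines if line.strip())


# Hypothetical usage with invented lines:
#   count_targets(["Example.A;Engine:51-255,Target:0;0;68747470",
#                  "Example.B;Target:10;0;25504446"])
```

A file dominated by Target:0 entries is matched against every file type, which is the situation described for the Phish.Phishing sigs.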

 

Further research into scan time optimization is definitely welcome and appreciated.

 

Regards,

Micah

 

 

From: clamav-users <clamav-users-bounces@lists.clamav.net> on behalf of Maarten Broekman via clamav-users <clamav-users@lists.clamav.net>
Reply-To: ClamAV users ML <clamav-users@lists.clamav.net>
Date: Tuesday, April 9, 2019 at 12:00 PM
To: ClamAV users ML <clamav-users@lists.clamav.net>
Cc: Maarten Broekman <maarten.broekman@gmail.com>
Subject: Re: [clamav-users] [External] Re: Scan very slow

 

Clearly the latest daily.cvd is performing better, but the remaining "Phishtank" sigs do not account for most of the slowness.

 

I unpacked the current (?) cvd (ClamAV-VDB:09 Apr 2019 03-53 -0400:25414:1548262:63:X:X:raynman:1554796413) and then ran a test scan with each part to see what the load times looked like:

daily.cdb ==== Time: 0.007 sec (0 m 0 s)

daily.cfg ==== Time: 0.004 sec (0 m 0 s)

daily.crb ==== Time: 0.006 sec (0 m 0 s)

daily.cvd ==== Time: 11.384 sec (0 m 11 s)

daily.fp ==== Time: 0.009 sec (0 m 0 s)

daily.ftm ==== Time: 0.005 sec (0 m 0 s)

daily.hdb ==== Time: 0.303 sec (0 m 0 s)

daily.hdu ==== Time: 0.006 sec (0 m 0 s)

daily.hsb ==== Time: 1.093 sec (0 m 1 s)

daily.hsu ==== Time: 0.005 sec (0 m 0 s)

daily.idb ==== Time: 0.006 sec (0 m 0 s)

daily.ldb ==== Time: 5.563 sec (0 m 5 s)

daily.ldu ==== Time: 0.005 sec (0 m 0 s)

daily.mdb ==== Time: 0.061 sec (0 m 0 s)

daily.mdu ==== Time: 0.007 sec (0 m 0 s)

daily.msb ==== Time: 0.005 sec (0 m 0 s)

daily.msu ==== Time: 0.005 sec (0 m 0 s)

daily.ndb ==== Time: 0.017 sec (0 m 0 s)

daily.ndu ==== Time: 0.005 sec (0 m 0 s)

daily.pdb ==== Time: 0.010 sec (0 m 0 s)

daily.sfp ==== Time: 0.006 sec (0 m 0 s)

daily.wdb ==== Time: 0.014 sec (0 m 0 s)
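
A measurement like the one above could be scripted roughly as follows. The exact command used for these numbers is not shown in this message, so running clamscan with -d pointing at one database file is an assumption, as are the function names. The regular expression matches the "Time:" line format shown in the results.

```python
import re

# Matches summary lines such as "Time: 11.384 sec (0 m 11 s)".
TIME_LINE = re.compile(r"^Time:\s+([0-9.]+)\s+sec", re.MULTILINE)


def parse_time(clamscan_output):
    """Extract the reported time in seconds, or None if no Time line is present."""
    match = TIME_LINE.search(clamscan_output)
    return float(match.group(1)) if match else None


def per_file_commands(db_files, sample):
    """Build one clamscan command per database file (assumed invocation)."""
    return [["clamscan", "-d", db, sample] for db in db_files]


# Each command could then be run with subprocess.run(cmd, capture_output=True, text=True)
# and its stdout passed to parse_time().
```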

 

So, about half the run time of a clamscan comes from daily.ldb. To break it down further, I split daily.ldb into "daily_<virus>.ldb" files, where <virus> is the first part of the dot-separated signature name.

daily_Andr.ldb ==== Time: 0.008 sec (0 m 0 s)

daily_Archive.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Asp.ldb ==== Time: 0.004 sec (0 m 0 s)

daily_Doc.ldb ==== Time: 0.116 sec (0 m 0 s)

daily_Eicar-Test-Signature.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Email.ldb ==== Time: 0.014 sec (0 m 0 s)

daily_Emf.ldb ==== Time: 0.007 sec (0 m 0 s)

daily_Heuristics.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Html.ldb ==== Time: 0.010 sec (0 m 0 s)

daily_Hwp.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Img.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Ios.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Java.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Js.ldb ==== Time: 0.007 sec (0 m 0 s)

daily_Legacy.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Lnk.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Mp4.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Multios.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Osx.ldb ==== Time: 0.008 sec (0 m 0 s)

daily_Pdf.ldb ==== Time: 0.007 sec (0 m 0 s)

daily_Phish.ldb ==== Time: 1.612 sec (0 m 1 s)

daily_Phishtank.ldb ==== Time: 0.146 sec (0 m 0 s)

daily_Php.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Ppt.ldb ==== Time: 0.007 sec (0 m 0 s)

daily_Py.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Rtf.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Svg.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Swf.ldb ==== Time: 0.007 sec (0 m 0 s)

daily_Ttf.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Txt.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Unix.ldb ==== Time: 0.008 sec (0 m 0 s)

daily_Vbs.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Win.ldb ==== Time: 3.391 sec (0 m 3 s)

daily_Xls.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Xml.ldb ==== Time: 0.007 sec (0 m 0 s)
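
The split by first name component can be sketched like this. It is an illustration rather than the script actually used: the function names are made up, and it assumes each .ldb line starts with the signature name followed by a semicolon.

```python
from collections import defaultdict
from pathlib import Path


def split_by_family(lines):
    """Group signature lines by the first dot-separated part of the signature name."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name = line.split(";", 1)[0]      # signature name is the first field
        family = name.split(".", 1)[0]    # e.g. "Win" from "Win.Trojan.Example-1"
        groups[family].append(line)
    return dict(groups)


def write_groups(groups, out_dir):
    """Write each group to daily_<family>.ldb inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for family, sigs in groups.items():
        (out / f"daily_{family}.ldb").write_text("\n".join(sigs) + "\n")
```

Splitting on the first two components instead of the first one would give the finer daily_Win.<type>.ldb style of breakdown.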

 

"Phish.", not "Phishtank.", and "Win." are the longest run times. Looking at the number of signatures in each, the 'Phish.' signatures are taking a disproportionate amount of time to load compared to the other signatures:

     216 daily_Andr.ldb

       3 daily_Archive.ldb

       1 daily_Asp.ldb

    2096 daily_Doc.ldb

       1 daily_Eicar-Test-Signature.ldb

    1017 daily_Email.ldb

       2 daily_Emf.ldb

       5 daily_Heuristics.ldb

     250 daily_Html.ldb

       1 daily_Hwp.ldb

      15 daily_Img.ldb

       6 daily_Ios.ldb

      16 daily_Java.ldb

      69 daily_Js.ldb

      27 daily_Legacy.ldb

       9 daily_Lnk.ldb

       1 daily_Mp4.ldb

       9 daily_Multios.ldb

     175 daily_Osx.ldb

     132 daily_Pdf.ldb

    2515 daily_Phish.ldb

    3516 daily_Phishtank.ldb

      18 daily_Php.ldb

       5 daily_Ppt.ldb

       3 daily_Py.ldb

      28 daily_Rtf.ldb

       1 daily_Svg.ldb

     103 daily_Swf.ldb

       2 daily_Ttf.ldb

     140 daily_Txt.ldb

     222 daily_Unix.ldb

      21 daily_Vbs.ldb

   43928 daily_Win.ldb

     165 daily_Xls.ldb

       8 daily_Xml.ldb
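
To put a number on "disproportionate", the times and counts above can be divided out per signature. This is simple arithmetic on the figures already listed; the helper name is made up, and the result ignores whatever fixed startup overhead each run includes.

```python
# (load time in seconds, signature count) taken from the two lists above.
MEASURED = {
    "Phish": (1.612, 2515),
    "Phishtank": (0.146, 3516),
    "Win": (3.391, 43928),
    "Doc": (0.116, 2096),
}


def per_sig_us(seconds, count):
    """Average time per signature in microseconds."""
    return seconds / count * 1_000_000


for family, (seconds, count) in MEASURED.items():
    print(f"{family:10s} {per_sig_us(seconds, count):8.1f} us/signature")
```

By this measure the Phish. signatures come out at several hundred microseconds each, roughly an order of magnitude more than the Phishtank., Win., and Doc. signatures.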

 

From the look of it, "Phish." has those REPHISH signatures. Those signatures seem to be looking at any file (Target 0) and have subsignatures that are combined to match depending on which filetype they are 'looking' for (so, href for HTML files, %PDF, Subtype, and URI objects for PDFs, etc) as opposed to the remaining Phishtank sigs which seem to have a separate signature depending on the target type.

 

Breaking up daily_Win into its constituent sub-parts doesn't reveal any particular culprit from just a simple scan timing, though...

daily_Win.Adware.ldb ==== Time: 0.013 sec (0 m 0 s)

daily_Win.Coinminer.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Win.Downloader.ldb ==== Time: 0.035 sec (0 m 0 s)

daily_Win.Dropper.ldb ==== Time: 0.240 sec (0 m 0 s)

daily_Win.Exploit.ldb ==== Time: 0.016 sec (0 m 0 s)

daily_Win.Ircbot.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Win.Keylogger.ldb ==== Time: 0.010 sec (0 m 0 s)

daily_Win.ldb ==== Time: 3.418 sec (0 m 3 s)

daily_Win.Macro.ldb ==== Time: 0.009 sec (0 m 0 s)

daily_Win.Malware.ldb ==== Time: 0.731 sec (0 m 0 s)

daily_Win.Packed.ldb ==== Time: 0.131 sec (0 m 0 s)

daily_Win.Packer.ldb ==== Time: 0.008 sec (0 m 0 s)

daily_Win.Phishing.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Win.Proxy.ldb ==== Time: 0.005 sec (0 m 0 s)

daily_Win.Ransomware.ldb ==== Time: 0.019 sec (0 m 0 s)

daily_Win.Spyware.ldb ==== Time: 0.006 sec (0 m 0 s)

daily_Win.Tool.ldb ==== Time: 0.008 sec (0 m 0 s)

daily_Win.Trojan.ldb ==== Time: 0.582 sec (0 m 0 s)

daily_Win.Virus.ldb ==== Time: 0.059 sec (0 m 0 s)

daily_Win.Worm.ldb ==== Time: 0.030 sec (0 m 0 s)

 

     158 daily_Win.Adware.ldb

      14 daily_Win.Coinminer.ldb

     561 daily_Win.Downloader.ldb

    8084 daily_Win.Dropper.ldb

     216 daily_Win.Exploit.ldb

       9 daily_Win.Ircbot.ldb

     193 daily_Win.Keylogger.ldb

   43928 daily_Win.ldb

       1 daily_Win.Macro.ldb

   14820 daily_Win.Malware.ldb

    4209 daily_Win.Packed.ldb

      20 daily_Win.Packer.ldb

       2 daily_Win.Phishing.ldb

       4 daily_Win.Proxy.ldb

     500 daily_Win.Ransomware.ldb

      32 daily_Win.Spyware.ldb

     121 daily_Win.Tool.ldb

   12051 daily_Win.Trojan.ldb

    1967 daily_Win.Virus.ldb

     966 daily_Win.Worm.ldb

 

Malware and Trojan take the longest, but they also have a majority of the signatures.

 

On Tue, Apr 9, 2019 at 11:19 AM Steve Basford <steveb_clamav@sanesecurity.com> wrote:

On 2019-04-09 12:02, Brent Clark via clamav-users wrote:
> Can't those be adopted / managed by Sanesecurity?
>
> For all you know, those are already in Sanesecurity.

They are... and have been for quite some time:


"The following databases are distributed by Sanesecurity, but produced
by Porcupine Signatures"

phishtank.ndb.

Briefly...

Number of sigs in phishtank.ndb: 9,309

eg:

PhishTank.Phishing.6002281, matches:

https://www.phishtank.com/phish_detail.php?phish_id=6002281

So, there is going to be some possible crossover now that
Phish.Phishing.REPHISH_ID_20190404_67-6931549-0
type signature names from the PhishTank feed are in daily.ldb and
daily.ndb.

I'll check back on the thread later.

--
Cheers,

Steve
Twitter: @sanesecurity

_______________________________________________

clamav-users mailing list
clamav-users@lists.clamav.net
https://lists.clamav.net/mailman/listinfo/clamav-users


Help us build a comprehensive ClamAV guide:
https://github.com/vrtadmin/clamav-faq

http://www.clamav.net/contact.html#ml