[clamav-users] Win.Trojan.URLspoof-2 signature and WARC files
Al Varnell
alvarnell at mac.com
Tue Dec 20 04:24:01 UTC 2016
One correction to the Group 2 signature, it's just '%00@'.
The only available method for having a signature removed or modified is to submit one or more False Positives at
<http://www.clamav.net/reports/fp> and include the details you have covered below. If you would like to be notified of changes to the virus database, you will need to join the clamav-virusdb mailing list
<http://lists.clamav.net/cgi-bin/mailman/listinfo/clamav-virusdb>.
You can submit any suggested revised signature through the ClamAV Community Signatures program
<http://blog.clamav.net/2014/02/introducing-clamav-community-signatures.html>.
I'm not a signature expert by any means, but I would have to agree that both the art and the ClamAV engine's capabilities have improved since this one was apparently written, and it should be easy to improve.
-Al-
On Mon, Dec 19, 2016 at 05:40 PM, Jay Gattuso wrote:
>
> Win.Trojan.URLspoof-2
> We’re encountering some issues with this particular “virus” and, having worked through what we’re seeing, I wanted to ask a couple of questions.
> The signature is pretty weak.
>
> [main.ndb] Win.Trojan.URLspoof-2:0:*:20687265663d22*0125303040*223e*3c2f
>
>
> We’ve seen hits against this signature 14 times in 8 years (I’m not sure how long it’s been in the defs, but we’ve been checking our ~20Mil files against ClamAV for 8 years).
> Every hit for Win.Trojan.URLspoof-2 we’ve seen is a false positive.
> Breaking the signature sequence into parts reveals the weakness of this particular signature:
>
> Group 1: 20687265663d22 = ‘ href="’
> Group 2: 0125303040 = ‘\x01%00@’
> Group 3: 223e = ‘">’
> Group 4: 3c2f = ‘</’
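The breakdown above can be checked mechanically. What follows is a minimal sketch, assuming only that the body of the quoted .ndb line is split on its '*' wildcards; the helper name decode_groups is made up for illustration and is not part of ClamAV.

```python
# Body of the signature as quoted above (the part after "0:*:").
SIGNATURE_BODY = "20687265663d22*0125303040*223e*3c2f"


def decode_groups(body: str) -> list[bytes]:
    """Split a signature body on the '*' wildcard and decode each hex group."""
    return [bytes.fromhex(group) for group in body.split("*")]


groups = decode_groups(SIGNATURE_BODY)
for index, raw in enumerate(groups, start=1):
    # repr() keeps non-printable bytes visible instead of dropping them.
    print(f"Group {index}: {raw!r}")
```

Printing with repr() makes any non-printable byte in a group show up as an escape rather than silently disappearing.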
>
> These false positives are appearing in WARC files (http://iipc.github.io/warc-specifications/) and in the earlier variant, ARC (http://archive.org/web/researcher/ArcFileFormat.php).
> I’ve been pulling these containers apart, and can see that we only get a hit when the signature parts are found spread across the content of the container. For us, group 1 appears in any piece of HTML, and group 2 appears in a variety of file formats including PDF, MP3, MP4 and JPG. Groups 3 and 4 are trivial and appear everywhere. The point here is that a hit is never caused by a single file as it would be found in the wild, only by the aggregation we undertake ourselves when creating these WARC files.
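The aggregation effect described above can be reproduced in a few lines. This is only a rough emulation, assuming that '*' in the signature body behaves like "any number of bytes" and approximating that with a regular expression; it is not the ClamAV engine, and the three record payloads are invented for illustration.

```python
import re

# The four fixed groups of the signature, decoded from hex.
GROUPS = [bytes.fromhex(h) for h in ("20687265663d22", "0125303040", "223e", "3c2f")]

# Emulate the '*' wildcard with a lazy "any bytes" gap between the groups.
UNBOUNDED = re.compile(b".*?".join(re.escape(g) for g in GROUPS), re.DOTALL)

# Hypothetical payloads standing in for unrelated harvested files.
html_record = b'<a href="index.html">home</a>'
binary_record = b"\x00\x10" + b"\x01%00@" + b"\xff\xd9"
more_html = b'<p class="x">text</p>'

# A crude stand-in for a container that concatenates the payloads.
aggregate = b"\r\n\r\n".join([html_record, binary_record, more_html])


def matches(data: bytes) -> bool:
    """Return True if all four groups occur in order anywhere in data."""
    return UNBOUNDED.search(data) is not None


for name, data in (("html", html_record), ("binary", binary_record),
                   ("more_html", more_html), ("aggregate", aggregate)):
    print(name, matches(data))
```

None of the individual payloads contains all four groups, but their concatenation does, which is the pattern reported for the W/ARC hits.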
>
> We run a slightly non-standard conf:
>
> # MaxScanSize
> # Default: 100M
> MaxScanSize 2048M
>
> And
>
> # MaxFileSize
> # Default: 25M
> MaxFileSize 2048M
>
> Questions:
>
> 1) How would I go about getting this signature either removed or hardened? For example, if the signature is specifically hunting for a URL, perhaps its span could be confined to twice the maximum URL length or some such (http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers), say 4000 bytes. I’ve never seen a true positive against this signature, and I have no idea how common the threat is or what the signature is actually looking for, so removing it might not be a great idea.
>
> Are there any resources that might help me to work on a stronger signature for this particular threat, and what’s the process for suggesting a revision or removal?
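One way to picture the "confine it to about 4000 bytes" idea: ClamAV's body syntax has bounded wildcards such as {-n} ("n or fewer bytes"), so a hardened variant could replace each '*' with {-4000}. The sketch below builds such a body and compares bounded and unbounded gaps using a regex approximation. The hardened body is hypothetical and untested against the real engine, MAX_GAP simply reuses the 4000-byte figure from the question, and both sample inputs are invented.

```python
import re

HEX_GROUPS = ["20687265663d22", "0125303040", "223e", "3c2f"]
MAX_GAP = 4000  # assumption: the 4000-byte figure suggested in the question


def hardened_body(max_gap: int) -> str:
    """Join the hex groups with a bounded '{-n}' wildcard instead of '*'."""
    return ("{-%d}" % max_gap).join(HEX_GROUPS)


def build_pattern(max_gap=None):
    """Regex approximation: unbounded gap if max_gap is None, else 0..max_gap bytes."""
    gap = b".*?" if max_gap is None else b".{0,%d}?" % max_gap
    parts = [re.escape(bytes.fromhex(h)) for h in HEX_GROUPS]
    return re.compile(gap.join(parts), re.DOTALL)


unbounded = build_pattern()
bounded = build_pattern(MAX_GAP)

# Invented samples: all groups close together vs. first group 5000 bytes away.
compact = b'<a href="http://example.com\x01%00@example.net">link</a>'
spread = b' href="' + b"A" * 5000 + b"\x01%00@" + b'">' + b"</"

print(hardened_body(MAX_GAP))
for name, sample in (("compact", compact), ("spread", spread)):
    print(name, bool(unbounded.search(sample)), bool(bounded.search(sample)))
```

With bounded gaps, the compact sample still matches while the spread-out one does not. Whether 4000 bytes is the right bound for the original threat would need checking against real samples.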
>
> 2) These hits all happen in the W/ARC container. These containers are simple serialisations of arbitrary files harvested from websites, together with their associated HTTP transactions. They are used to “replay” web harvests (like the Wayback Machine etc). Is there any way we can handle these particular file types differently? As these files are aggregations of any number of binary items, we are much more likely to encounter false positives, especially for weak signatures. We’ve only seen false positives for the Trojan URL signature, but I anticipate seeing more when we process the 80 TB of WARCs we have waiting to come in; these will translate into ~2 billion files housed in several hundred thousand WARC files.
>
> Ideally we ought to be ripping the (W)ARC into its binary parts. By parsing an arbitrary aggregation of many files as a single coherent payload, I think we’re doing ourselves a disservice. I wondered if there was a method within the ClamAV architecture that would support the construction of a WARC parser. This might allow WARC files to be “properly” consumed as a series of disconnected binary items, reducing the likelihood of false positives.
>
> We are also looking at what it would mean for our workflow to explode the W/ARCs into their parts before they are presented for scanning, and that’s a viable option. For now I’m mainly interested in knowing what we could/could not do.
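Exploding the container before scanning could look roughly like the sketch below. It assumes an uncompressed WARC stream in which each record starts with a "WARC/" version line, carries a Content-Length header, and is followed by a blank-line separator. The function names and sample records are invented for illustration, and a production workflow would more likely use an existing WARC library than this hand-rolled reader.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {line!r}")
        headers = {}
        while True:
            header_line = stream.readline()
            if header_line in (b"\r\n", b"\n", b""):
                break  # end of the header block
            name, _, value = header_line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly Content-Length bytes so binary payloads stay intact.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def make_record(target_uri: str, body: bytes) -> bytes:
    """Build a tiny illustrative 'resource' record (not a complete WARC record)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


sample = (
    make_record("http://example.com/a.html", b'<a href="x">a</a>')
    + make_record("http://example.com/b.bin", b"\x00\x01%00@\n\xff")
)
records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["warc-target-uri"], len(payload))
```

Each payload could then be written out and scanned on its own, so a signature would only ever see one harvested item at a time. Note that for "response" records the content block also includes the HTTP headers, which this sketch does not strip.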
>
>
> Jay Gattuso | Digital Preservation Analyst | Preservation, Research and Consultancy
> National Library of New Zealand | Te Puna Mātauranga o Aotearoa
> PO Box 1467 Wellington 6140 New Zealand | +64 (0)4 474 3064
> jay.gattuso at dia.govt.nz<mailto:jay.gattuso at natlib.govt.nz>
>
-Al-
--
Al Varnell
Mountain View, CA