Karl Bernard

RegEx for avoiding documents

Oct 23, 2012
In case you need to exclude URLs in a WAS scan based on file extensions...


We're currently scanning a large CMS-based site, and I noticed that a large number (552) of documents (.pdf, .doc, etc.) came up in the scan. I don't see any real value in scanning them, so I analyzed which extensions were present by running grep over the links listed under QID 150009 (Information Gathered / Links Crawled):

    grep -Eo '\.([a-zA-Z]{3,4})$' cms_host_files_crawled.txt | sort | uniq -c

This gave me a list and count of all extension-looking endings. Note that the pattern needs to be quoted so the shell passes the backslash, parentheses and $ through to grep untouched.
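If grep isn't handy, roughly the same tally can be sketched in Python. This is only an illustration of the counting step: the function name `count_extensions` and the sample URLs are made up for the example, and in practice you would read the lines from your exported Links Crawled file instead.

```python
import re
from collections import Counter

# Same pattern as the grep command: a dot followed by 3-4 letters
# at the very end of the line (case-sensitive, like grep without -i).
EXT_RE = re.compile(r"\.([a-zA-Z]{3,4})$")


def count_extensions(lines):
    """Tally extension-looking endings, similar to grep -Eo | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        match = EXT_RE.search(line.strip())
        if match:
            counts[match.group(0)] += 1
    return counts


# Hypothetical sample data standing in for the crawled-links export.
sample_links = [
    "http://cms.example.com/docs/report.pdf",
    "http://cms.example.com/docs/minutes.doc",
    "http://cms.example.com/docs/budget.pdf",
    "http://cms.example.com/about/",
    "http://cms.example.com/files/notes.docx",
]

counts = count_extensions(sample_links)
for ext, n in sorted(counts.items()):
    print(f"{n:7d} {ext}")
```

Because the pattern is anchored to the end of the line, URLs ending in a slash or carrying a query string are simply not counted.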


Based on these findings, I created the following RegEx, which I tested with grep -f:



I then ran another discovery scan with this RegEx in the blacklist section (Edit Application -> Crawl Exclusion Lists -> Regular Expressions), and no URLs with these extensions were listed in the results.
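For anyone who wants to sanity-check an extension-based exclusion pattern before putting it into WAS, here is a small Python sketch. The pattern in it is hypothetical and is not necessarily the expression used above: the extension list (pdf, doc/docx, xls/xlsx, ppt/pptx), the case-insensitive flag and the sample URLs are all assumptions for illustration. Python's regex engine may also differ in details from the one WAS uses, so treat this as a rough pre-check, not a guarantee.

```python
import re

# Hypothetical exclusion pattern: any URL ending in one of the listed
# document extensions. The extension list is an assumption; build yours
# from the counts you actually found in the crawl.
EXCLUDE_RE = re.compile(r"^.*\.(pdf|docx?|xlsx?|pptx?)$", re.IGNORECASE)


def split_urls(urls):
    """Return (excluded, kept) lists according to EXCLUDE_RE."""
    excluded = [u for u in urls if EXCLUDE_RE.match(u)]
    kept = [u for u in urls if not EXCLUDE_RE.match(u)]
    return excluded, kept


# Made-up URLs for the check.
sample_urls = [
    "http://cms.example.com/docs/report.pdf",
    "http://cms.example.com/docs/Minutes.DOCX",
    "http://cms.example.com/news/index.html",
    "http://cms.example.com/pdf-archive/",
]

excluded, kept = split_urls(sample_urls)
print("excluded:", excluded)
print("kept:", kept)
```

One thing to keep in mind with a pattern anchored at the end like this: a document URL with a query string appended (for example report.pdf?version=2) would not match and would still be crawled.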