Web Application Scanning - Controlling Links Crawled with Explicit URLs, Redundant Links, Black Lists, and White Lists

Document created by John Delaroderie on Feb 18, 2020

Qualys WAS offers many options to control which URLs are crawled and tested during a Web Application Scan.  However, it is possible to misconfigure your web application settings and end up scanning URLs you did not intend to scan, or missing URLs you wanted to test, because certain configurations take precedence over others.  In this document, I am going to review the most common configurations used to control which URLs are crawled and discuss the order of precedence each one takes in a Qualys Web Application Scan, so you can be sure not to create any conflicts with the rules you set up.

 


Controlling What URLs Are Crawled

By design, the Qualys Web Application Scanning engine identifies all hard-coded and dynamically generated links to URLs as it crawls through your application.  In most cases, you will not need to do anything to ensure complete coverage of your application.  The following methods provide greater granularity over which URLs are crawled or ignored in very specific use cases.

 

Explicit URLs to Crawl

Located under the Application Details tab of your Web Application configuration, Explicit URLs allow you to identify URLs that are not directly linked from any other URL in your web application.  These orphaned pages are uncommon, but can occur with special registration pages that are emailed out to site users.  Remember that you should never rely on security through obscurity, so intentionally hiding a page by removing links to it is not endorsed by security professionals.  Instead, this feature is primarily intended to reach URLs in very specific use cases, and any HTML links on the URLs you explicitly define will also be crawled and explored by the Web Application Scanning engine.  It is not necessary to specify any Explicit URLs to Crawl if you do not have any pages meeting the criteria outlined above.

 

The following QID, located under the Information Gathered - Scan Diagnostics portion of your scan report, can be helpful in troubleshooting Explicit URLs:

150009 (Links Crawled) - This will report the actual links crawled during a scan.

Redundant Links

Identifying and excluding pages with identical source code is one way to dramatically improve the efficiency and scan time of your web application testing.  For example, you may have a section of your web application that contains press releases going back multiple years.  The URLs may look something like the following:

https://www.example.com/press-releases/press-release?id=20200218

https://www.example.com/press-releases/press-release?id=20200213

https://www.example.com/press-releases/press-release?id=20200210

https://www.example.com/press-releases/press-release?id=20200201

https://www.example.com/press-releases/press-release?id=20200125

https://www.example.com/press-releases/press-release?id=20200115

https://www.example.com/press-releases/press-release?id=20200109

https://www.example.com/press-releases/press-release?id=20200104

From those URLs, it appears that the source code is the same on each press release page, and the "?id=XXXXX" query parameter is passed in for a database lookup that populates the content of each page.  In that case, after validating that the source code is indeed the same, you can use the Redundant Links feature to limit the number of press release pages that are crawled and tested.  The Redundant Links configuration, located under the Redundant Links tab of your Web Application configuration, allows you to specify a regular expression to match against these URLs.  A good resource for creating and testing regular expressions is https://regex101.com/

 

In this example, I could use the following regular expression to identify all press release URLs:

 

https:\/\/www\.example\.com\/press-releases\/press-release\?id=\d+

 

The setting for Max. Links to Crawl can be left at the default value of 5.  This just means that the Qualys Web Application Scanning engine will crawl and test up to 5 URLs that match the chosen regular expression pattern.  All other URLs matching the pattern will be skipped.
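
If you want to sanity check a Redundant Links pattern before adding it, the short Python sketch below (my own illustration, not part of Qualys WAS) matches the example press release URLs against the regular expression above and keeps only the first 5 matches, mimicking the default Max. Links to Crawl value:

import re

# Redundant Links pattern from above (the "\/" escapes used in the
# configuration are optional in Python)
pattern = re.compile(r"https://www\.example\.com/press-releases/press-release\?id=\d+")

MAX_LINKS_TO_CRAWL = 5  # default Max. Links to Crawl setting

# Example links a crawl might discover
discovered = [
    "https://www.example.com/press-releases/press-release?id=20200218",
    "https://www.example.com/press-releases/press-release?id=20200213",
    "https://www.example.com/press-releases/press-release?id=20200210",
    "https://www.example.com/press-releases/press-release?id=20200201",
    "https://www.example.com/press-releases/press-release?id=20200125",
    "https://www.example.com/press-releases/press-release?id=20200115",
    "https://www.example.com/about-us",  # does not match the pattern
]

matched = [url for url in discovered if pattern.fullmatch(url)]

print("Matched and crawled:", matched[:MAX_LINKS_TO_CRAWL])  # first 5 press releases
print("Matched but skipped:", matched[MAX_LINKS_TO_CRAWL:])  # the 6th press release
print("Not matched, crawled normally:",
      [url for url in discovered if not pattern.fullmatch(url)])

URLs that do not match the pattern are unaffected by the Redundant Links rule and are crawled normally.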

 

The following QIDs located under the Information Gathered - Scan Diagnostics portion of your scan report can be helpful in troubleshooting Redundant Links:

150009 (Links Crawled) - This will report the actual links crawled during a scan.

150140 (Redundant links/URL paths crawled and not crawled)  - This will report which Redundant Link URLs were matched, crawled, and tested, as well as which URLs were matched but skipped. 

 

Exclusions - White Lists

Located under the Exclusions tab of your Web Application configuration, White Lists (either URLs or regular expressions) allow you to restrict crawling to only the URLs you specify.  You can list specific URLs that you want tested, use regular expressions to match the URLs you want tested, or combine the two.

The important thing to remember is that a White List does not work like the Explicit URLs described above.  If you use URLs to identify the links you want crawled and tested, the Qualys Web Application Scanning engine will not directly enter those URLs and crawl them.  Starting from the Web Application URL you defined under the Target Definition box of the Asset Details tab, Qualys will identify links and ONLY crawl those permitted by the White List rules you have defined.  If you have white listed a URL but there is no link chain to it that is permitted by your White List rules, the Web Application Scanning engine will never reach it.  In other words, the engine is still crawling your site and still discovering URLs, but it only follows the links your White List rules permit.

At a minimum, your starting URL as defined in the Target Definition box has to match your White List rules, or no links will be crawled at all.  All URLs not matching your White List rules are effectively black listed.
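
To make the link chain requirement concrete, here is a minimal Python sketch of the behavior described above.  It is my own simplified model of a crawl, not the actual Qualys WAS engine, and the site structure and White List pattern are hypothetical:

import re
from collections import deque

# Hypothetical site structure: each page and the links found on it
links_on = {
    "https://www.example.com/app/": ["https://www.example.com/app/dashboard",
                                     "https://www.example.com/blog"],
    "https://www.example.com/app/dashboard": ["https://www.example.com/app/reports"],
    "https://www.example.com/blog": ["https://www.example.com/app/settings"],
}

# White List: only crawl URLs under /app/ (the starting URL must also match)
white_list = [re.compile(r"https://www\.example\.com/app/.*")]

def permitted(url):
    return any(rule.fullmatch(url) for rule in white_list)

def crawl(start):
    crawled, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        if not permitted(url):   # rejected links are never visited, so links
            continue             # found only on rejected pages stay undiscovered
        crawled.append(url)
        for link in links_on.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

print(crawl("https://www.example.com/app/"))

In this sketch, /blog is rejected by the White List, and /app/settings - although it matches the White List - is never crawled because the only link to it is on the rejected /blog page.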

 

The following QIDs located under the Information Gathered - Scan Diagnostics portion of your scan report can be helpful in troubleshooting White Lists:

150009 (Links Crawled) - This will report the actual links crawled during a scan.

150021 (Scan Diagnostics) - This will report if any White Lists or Black Lists were loaded as part of the scan.

150041 (Links Rejected) - This will identify any links encountered during a crawl that were rejected due to a White List or Black List rule. 

 

Exclusions - Black Lists

Located under the Exclusions tab of your Web Application configuration, Black Lists (either URLs or regular expressions) allow you to prevent the Qualys Web Application Scanning engine from crawling specific URLs.  You can list specific URLs that you want skipped, use regular expressions to match the URLs you want skipped, or combine the two.  The important caveat to remember is that any links downstream from the URLs you black list will not be crawled either.
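
The same kind of sketch (again, my own simplification with a hypothetical site structure, not the actual Qualys WAS engine) shows the downstream effect of a Black List rule:

import re
from collections import deque

# Hypothetical site structure: each page and the links found on it
links_on = {
    "https://www.example.com/": ["https://www.example.com/catalog",
                                 "https://www.example.com/admin/"],
    "https://www.example.com/admin/": ["https://www.example.com/admin/users",
                                       "https://www.example.com/help"],
    "https://www.example.com/catalog": [],
}

# Black List: never crawl anything under /admin/
black_list = [re.compile(r"https://www\.example\.com/admin/.*")]

def rejected(url):
    return any(rule.fullmatch(url) for rule in black_list)

def crawl(start):
    crawled, queue, seen = [], deque([start]), {start}
    while queue:
        url = queue.popleft()
        if rejected(url):   # /admin/ is skipped, so its downstream links
            continue        # are never discovered at all
        crawled.append(url)
        for link in links_on.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

print(crawl("https://www.example.com/"))

Here /help is not black listed, but it is never crawled because the only page linking to it (/admin/) was rejected by the Black List rule.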

 

The following QIDs located under the Information Gathered - Scan Diagnostics portion of your scan report can be helpful in troubleshooting Black Lists:

150009 (Links Crawled) - This will report the actual links crawled during a scan.

150021 (Scan Diagnostics) - This will report if any White Lists or Black Lists were loaded as part of the scan.

150041 (Links Rejected) - This will identify any links encountered during a crawl that were rejected due to a White List or Black List rule.

 

Order of Precedence

Now that we have covered how Explicit URLs, Redundant Links, White Lists, and Black Lists work, it is important to understand how they interact with each other.  For example, if you set an Explicit URL to crawl but it also matches your Black List rules, what exactly happens?  There is a definite order of precedence for how these crawling controls are applied.

 

Explicit URLs versus Other Rules

 

Explicit URL vs. Redundant Links - Redundant Links takes precedence.  The Max. Links to Crawl setting for Redundant Links overrides Explicit URLs and can result in Explicit URLs not being crawled.

Explicit URL vs. White List - White List takes precedence.  Any Explicit URL that does not match a White List rule will not be crawled.

Explicit URL vs. Black List - Black List takes precedence.  Any Explicit URL that conflicts with a Black List rule will not be crawled.

 

 

Redundant Links versus Other Rules

Redundant Links vs. Explicit URL - Redundant Links takes precedence.  The Max. Links to Crawl setting for Redundant Links overrides Explicit URLs and can result in Explicit URLs not being crawled.

Redundant Links vs. White List - Redundant Links takes precedence.  Links matching White List rules can be crawled, but the Redundant Links rule limits them to the Max. Links to Crawl setting.

Redundant Links vs. Black List - Black List takes precedence.  If a Redundant Links rule conflicts with a Black List rule, no links matching the Redundant Links rule will be crawled.

 

White Lists versus Other Rules

 

White List vs. Explicit URL - White List takes precedence.  Any Explicit URL that does not match a White List rule will not be crawled.

White List vs. Redundant Links - Redundant Links takes precedence.  Links matching White List rules can be crawled, but the Redundant Links rule limits them to the Max. Links to Crawl setting.

White List vs. Black List - Nothing is excluded if there is a conflict.  Both the White List and Black List rules are completely ignored, not just the specific conflict.

 

Black Lists versus Other Rules

 

Black List vs. Explicit URL - Black List takes precedence.  Any Explicit URL that conflicts with a Black List rule will not be crawled.

Black List vs. Redundant Links - Black List takes precedence.  If a Redundant Links rule conflicts with a Black List rule, no links matching the Redundant Links rule will be crawled.

Black List vs. White List - Nothing is excluded if there is a conflict.  Both the White List and Black List rules are completely ignored, not just the specific conflict.
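
Putting these tables together, the following Python sketch shows one way to reason about the per-link decision.  It is my own simplified model of the precedence rules above (the function and rule names are hypothetical), and it deliberately does not model the scan-wide case where a White List/Black List conflict causes both lists to be ignored entirely:

import re
from collections import defaultdict

def should_crawl(url, white_list, black_list, redundant_rules,
                 redundant_seen, max_links_to_crawl=5):
    # Explicit URLs get no special treatment here because, per the tables
    # above, they lose to Redundant Links, White Lists, and Black Lists alike.

    # Black List wins over Explicit URLs and Redundant Links.
    if any(rule.fullmatch(url) for rule in black_list):
        return False

    # White List wins over Explicit URLs: when a White List is defined,
    # every link (explicit or discovered) must match it to be crawled.
    if white_list and not any(rule.fullmatch(url) for rule in white_list):
        return False

    # Redundant Links wins over Explicit URLs and White Lists: only the
    # first Max. Links to Crawl matches of each pattern are crawled.
    for rule in redundant_rules:
        if rule.fullmatch(url):
            redundant_seen[rule.pattern] += 1
            if redundant_seen[rule.pattern] > max_links_to_crawl:
                return False

    return True

# Hypothetical rules for illustration
white_list = [re.compile(r"https://www\.example\.com/.*")]
black_list = [re.compile(r"https://www\.example\.com/logout.*")]
redundant_rules = [re.compile(r"https://www\.example\.com/press-releases/press-release\?id=\d+")]
redundant_seen = defaultdict(int)

for url in [
    "https://www.example.com/press-releases/press-release?id=20200218",
    "https://www.example.com/logout",
    "https://www.example.com/contact",
]:
    print(url, "->", should_crawl(url, white_list, black_list,
                                  redundant_rules, redundant_seen))

In this example, the press release URL is the first match for its Redundant Links pattern and is crawled, the logout URL is rejected by the Black List, and the contact page passes all of the rules.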


Global Exclusions

If you are using any Global Exclusions (White Lists or Black Lists), they follow exactly the same order of precedence as the White List and Black List rules outlined above.

 

A Final Note on Progressive Scanning

If you are making changes to your web application configuration to control which links are crawled during a scan, it is important to note that these changes will not be reflected in any scheduled scans that are part of a progression already in progress (i.e. progressive scanning is enabled).  The reason for this is simple: when you start a progressive scan, a snapshot of your web application configuration and scanning option profile is taken and used for each scan of the web application until the entire web application has been scanned and that progression ends.  At that point, a new snapshot of your web application configuration and scanning option profile is taken and used for the next progression of scheduled scans.

 

Using the "scan again" feature will also reuse the snapshot of the web application configuration and scanning option profile from the scan you are relaunching, so any changes you make will not be reflected when using the "scan again" option either.

 

A manual scan can be launched right away, however, and the new web application settings and scanning option profiles will be used.
