Host List Detection API Best Practices

Document created by Kyle Schuster on Dec 20, 2019. Last modified by Robert Dell'Immagine on Dec 26, 2019.

This article is shared courtesy of Nick Williams, Qualys Manager, Site Reliability Engineering. The topics here can help Qualys users understand and optimize the Host List Detection API for their internal needs.

 

 

Background

When API calls are made to pull large data sets, the backend processes the data by streaming it in batches to ensure data integrity and to prevent overloading the backend services. That means there will be brief periods where throughput dips while the next batch is retrieved and processed for streaming back to the client; however, the overall speed averages out over the course of the pull.

 

You also need to keep in mind the contributing factors that can impact performance on a shared resource. Data pulls performed during peak usage will run into congestion and will not be as fast as those conducted during off-peak hours. Optional parameters used in API calls can also add extra processing before the data is streamed; active_kernels_only is one example.

 

We have been, and will continue, innovating and re-architecting how large amounts of encrypted data are processed and streamed through the API so that it scales to our customers' needs. Providing customers with all of their vulnerability information as quickly as possible is a primary focal point, but it must be done in a way that keeps data integrity at the forefront of every release, and that takes time, effort, and dedicated resources to ensure full testing covers all aspects. With that in mind, automation, threading, and parallelism are techniques that can help increase the performance of API data pulls.

 

 

Multi-Threading

Many developers use APIs to integrate different tools into an automated solution. One of the best ways to get large amounts of data out of any solution is to pool resources and multi-thread. Examples of how to accomplish this are located here:

 

https://github.com/Qualys/qPyMultiThread

 

When pulling large amounts of data, using multiple threads greatly increases the potential throughput. The maximum benefit is seen when the batch size spreads the work evenly across the parallel threads. For example, a host detection call that returns 100,000 assets, pulled with 10 parallel threads, would benefit most from a batch size of 100,000 / 10 = 10,000. This was tested with 10 threads on a test subscription holding 120 GB of data; it produced 10 output XML files and the end-averaged speed was 2.7 MB/sec. The optimal batch size is subjective, so it is best to do some testing to find what works best for each subscription and use case.

 

This, of course, assumes there is no outside interference that would skew the results, e.g. connection speed on either the client side or the Qualys side, or performance impact during peak hours. To keep one thread from slowing down the entire process by hitting a congested server, we break the work out further into batches of 1,000 to 5,000 hosts. This results in more output files, but it compensates for the slower-performing threads that can arise.
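As a rough sketch of the idea, the snippet below (Python, standard library only) splits the host ID list evenly across 10 threads. Here all_ids and pull_detections are hypothetical placeholders for the list of host IDs and for a function that performs one Host List Detection call for a batch of IDs; they are not part of any Qualys library.

      from concurrent.futures import ThreadPoolExecutor

      THREADS = 10

      # all_ids: full list of host IDs to pull (hypothetical placeholder)
      # pull_detections(batch): performs one Host List Detection API call (hypothetical placeholder)
      batch_size = max(1, len(all_ids) // THREADS)          # e.g. 100,000 / 10 = 10,000
      batches = [all_ids[i:i + batch_size]
                 for i in range(0, len(all_ids), batch_size)]

      with ThreadPoolExecutor(max_workers=THREADS) as pool:
          results = list(pool.map(pull_detections, batches))

In practice, the POC script feeds the smaller batches (1,000-5,000 IDs) through a queue of work items, as outlined under Recommendations below.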

 

API Parameters

 

Host-based filters

Recommended parameters instead of (vm_scan_since, vm_scan_date_after, no_vm_scan_since, vm_scan_date_before):

vm_processed_before

vm_processed_after


The Host Detection API parameters vm_scan_since and vm_scan_date_after are "host-based" filters. This means they determine which hosts to include in the fetch of vulnerability data.

 

vm_scan_since adds a clause that states: "Show me hosts whose Last Scan Date is later than the date specified by the user. If Last Scan Date happens to be null for some reason, then pick the highest non-null value from 'Date IG Modified' or 'Date Vuln Modified' for the host."

 

With the deployment of QWEB Major Release 8.9.0, changes were made to scanning behavior that impact certain parameters of the Host Detection API (vm_scan_since, vm_scan_date_after, no_vm_scan_since, vm_scan_date_before).

 

References:

Qualys API (VM, PC) XML/DTD Reference Guide

See “Host List Output” and “Host List VM Detection Output”

Cloud Platform 8.9 API Release Notes

See “VM – Improvements to Reporting Host Scan Time”

 

Clipped from release notes:

Host scan time is now based on scan end date

We’ve changed the way we report the host scan time when updating vulnerabilities and tickets. The host scan time will now be based on when the scan finished, not when the scan started. We’ll get this date from QID 45038 “Host Scan Time”. If this QID was not included in your vulnerability scan then we’ll use the scan start date/time.

 

In depth:

Let's say you have a scan against a /15 network (131,072 hosts) that launches at 10:00 AM GMT with the following information:

         Scan start datetime: 04/15/2019 10:00:00 GMT
         Target: 10.0.0.1 – 10.1.255.254 (131,072)
         Scan end datetime: 04/15/2019 22:00:00 GMT
         Filters: Search list included for only authenticated severity 3-5; QID 45038 will not be included.

 

As the scan progresses and IPs are processed, since QID 45038 is not present and the exact scan time for each host cannot be extracted, the scan start time is used as each host's 'last scan date' (04/15/2019 10:00:00 GMT).

 

This means that if a host is scanned at 10am but not processed until 10pm, its last scan date will be set to 10am, 12 hours before its processing time. If data is streamed through the Host Detection API aggressively, it is possible to miss hosts. For that reason, additional parameters were introduced that use the date a host was processed, and they are recommended for any automation that pulls vulnerability data on a continuous basis:
vm_processed_before - Show hosts with vulnerability scan results processed before a certain date and time. Specify the date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT), like "2016-09-12" or "2016-09-12T23:15:00Z".
vm_processed_after - Show hosts with vulnerability scan results processed after a certain date and time. Specify the date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT), like "2016-09-12" or "2016-09-12T23:15:00Z".
When implementing automation to perform continuous data pulls, ensure the parameters used for host-based filtering are the ones listed above; otherwise, hosts will be missed. In the example below, host list detection is called at the top of every hour. Scans use a search list that does not include Information Gathering QIDs, so the scan start time is used as each host's 'last scan date'. The parameter used is 'vm_scan_since', dynamically configured for an hour earlier:
The 2pm host list detection call will include:
  • Host A’s first scan
  • Host B

The 3pm host list detection call will include:

  • Host D
  • Host A’s second scan is missed (last_vm_scan date is processed as 1:50pm)
  • Host C is missed (last_vm_scan date is processed as 1:40pm)
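One way to avoid missing Host A's second scan and Host C in the example above is to key each pull off the processing time instead of the scan time. Below is a minimal sketch, assuming a Python job run hourly; the checkpoint file and helper names are hypothetical, not part of the Qualys API.

      import json
      import os
      from datetime import datetime, timedelta, timezone

      STATE_FILE = 'last_pull.json'   # hypothetical checkpoint file

      def utc(dt):
          return dt.strftime('%Y-%m-%dT%H:%M:%SZ')

      def window_start(now):
          # Use the previous run's checkpoint as vm_processed_after;
          # fall back to one hour ago on the very first run.
          if os.path.exists(STATE_FILE):
              with open(STATE_FILE) as f:
                  return json.load(f)['vm_processed_after']
          return utc(now - timedelta(hours=1))

      now = datetime.now(timezone.utc)
      params = {
          'action': 'list',
          'truncation_limit': 0,
          'output_format': 'XML',
          'vm_processed_after': window_start(now),   # filter on processing time, not scan time
      }

      # ... perform the Host List Detection pull with these params ...

      with open(STATE_FILE, 'w') as f:
          json.dump({'vm_processed_after': utc(now)}, f)   # next run resumes from this point

Persisting the checkpoint, rather than computing "an hour ago" on each run, also means a skipped or delayed run does not open a gap in coverage.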

 

Now that we understand the scope of a host, it's also important to understand that the entire host is in scope. This means host-based filters apply at the host level, not at the vulnerability-finding level. All vulnerabilities from the host's previous scans will be included (by default), since the Host List Detection API leverages Host Based Findings.

 

QID-based filters

The Host Detection API parameters detection_updated_since, detection_updated_before, detection_processed_after, and detection_processed_before are "QID-based" filters. This means they do not affect which hosts data is pulled for; instead, they determine which vulnerabilities are retrieved for those hosts. Here is the logic behind each parameter:

 

detection_processed_after

      Date Last Processed >= specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT), like "2016-09-12" or "2016-09-12T23:15:00Z"

 

detection_processed_before

      Date Last Processed < specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT), like "2016-09-12" or "2016-09-12T23:15:00Z"


detection_updated_before

      (Was found in last scan and Date Last Processed < specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT)) OR (Was marked 'Fixed' and Date Last Fixed < specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT))


detection_updated_since

      (Was found in last scan and Date Last Processed >= specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT)) OR (Was marked 'Fixed' and Date Last Fixed >= specified date in YYYY-MM-DD[THH:MM:SSZ] format (UTC/GMT))
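Expressed as code, the detection_updated_since logic reads roughly as follows. This is only an illustration of the logic described above; the function name and the detection dict are hypothetical, not part of any Qualys library.

      from datetime import datetime

      def matches_detection_updated_since(detection, since):
          # detection is a hypothetical dict with 'found_in_last_scan' (bool),
          # 'status' (str), 'date_last_processed' and 'date_last_fixed' (datetime or None)
          if detection['found_in_last_scan'] and detection['date_last_processed'] >= since:
              return True
          if detection['status'] == 'Fixed' and detection.get('date_last_fixed') \
                  and detection['date_last_fixed'] >= since:
              return True
          return False

      # Example: detections processed (or fixed) on or after 2016-09-12 would match
      since = datetime(2016, 9, 12)

detection_updated_before is the same test with '<' in place of '>=', and the detection_processed_* parameters compare only Date Last Processed.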

 

Recommendations

To improve performance, multi-threading should be used. Here is an outline of what the POC multi-threading script does to obtain maximum throughput:

 1. Make an initial API call to the Host List API endpoint to retrieve all host IDs for the subscription that need to have data retrieved.

Note: It’s important to do any filtering on hosts at this point, as filtering during the detection pull can impact performance.

Host List API Endpoint:

         https://<qualysapi url>/api/2.0/fo/asset/host/

         Using Parameters:
               params = {
                        'action': 'list',
                        'truncation_limit': 0,
                        'output_format': 'XML',
                        'vm_processed_after': <DateTime in UTC>,
               }

 

2. Break the total Host IDs into batches of 1,000-5,000 and send to a Queue.

 

3. Launch X worker threads that will pull the batches from the Queue and launch an API call against:

         https://<qualysapi url>/api/2.0/fo/asset/host/detection/
         Using Parameters:

               params = {
                        'action': 'list',
                        'show_igs': 1, # Optional
                        'truncation_limit': 0,
                        'output_format': 'XML',
                        'status': 'Active,New,Fixed,Re-Opened', # Optional

                        'detection_updated_since': <DateTime in UTC>,

                        'ids': ids

               }       
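Putting the three steps together, here is a minimal sketch using Python's requests, threading, and queue modules. The platform URL, credentials, batch size, output file naming, and the assumption that host IDs appear as <ID> elements in the Host List XML output are placeholders or assumptions to adapt for your own subscription; the POC linked above is the more complete reference.

      import queue
      import threading
      import xml.etree.ElementTree as ET
      import requests

      API = 'https://qualysapi.qualys.com'            # substitute your platform's API URL
      AUTH = ('username', 'password')                 # placeholder credentials
      HEADERS = {'X-Requested-With': 'python requests'}

      def fetch_host_ids(processed_after):
          # Step 1: Host List call, filtered up front with vm_processed_after
          r = requests.post(API + '/api/2.0/fo/asset/host/',
                            data={'action': 'list', 'truncation_limit': 0,
                                  'output_format': 'XML',
                                  'vm_processed_after': processed_after},
                            auth=AUTH, headers=HEADERS, timeout=300)
          r.raise_for_status()
          return [e.text for e in ET.fromstring(r.content).iter('ID')]

      def worker(q, out_prefix, n):
          # Step 3: each worker pulls ID batches off the queue and calls the detection endpoint
          i = 0
          while True:
              batch = q.get()
              if batch is None:                       # sentinel: no more work
                  return
              r = requests.post(API + '/api/2.0/fo/asset/host/detection/',
                                data={'action': 'list', 'truncation_limit': 0,
                                      'output_format': 'XML', 'show_igs': 1,
                                      'status': 'Active,New,Fixed,Re-Opened',
                                      'ids': ','.join(batch)},
                                auth=AUTH, headers=HEADERS, timeout=600)
              with open('%s_%d_%d.xml' % (out_prefix, n, i), 'w') as f:
                  f.write(r.text)                     # one output XML file per batch
              i += 1

      def run(processed_after, threads=10, batch_size=2000):
          ids = fetch_host_ids(processed_after)
          q = queue.Queue()
          for i in range(0, len(ids), batch_size):    # Step 2: batches of 1,000-5,000
              q.put(ids[i:i + batch_size])
          for _ in range(threads):
              q.put(None)                             # one sentinel per worker
          workers = [threading.Thread(target=worker, args=(q, 'detections', n))
                     for n in range(threads)]
          for t in workers:
              t.start()
          for t in workers:
              t.join()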

Considerations


Batch size

On the backend, the host detection engine breaks up the set of hosts to retrieve information for, with a maximum batch size of 10,000, so using a batch size higher than this will not add any performance benefit. At the same time, multiple places need to pull information, so there is an overhead cost regardless of the batch size used; for that reason, a batch size that is too small can slightly hinder performance because of the overhead incurred on each small request. Different parameters and the total amount of data on the backend can make requests vary in duration, so it is best to experiment with different batch sizes during peak and non-peak hours to determine the optimal size to use.

 

Chunking

When using a chunking mechanism to put a host list into batches, minimizing a list of IDs down to a range is acceptable only when no filters are going to be used and the goal is to obtain all hosts. If filtering is used, then using a range instead of a list of IDs will include hosts outside the scope of that filter. For example, say we intend to pull all hosts that have been updated in the last 4 hours. We perform the Host List API call with 'vm_processed_after' and it returns 5,000 hosts with scattered IDs:
         1001,1009,1011 ... 16893,16898,16941

 

The batching mechanism is configured to pull chunks of 1,000 IDs at a time and breaks the full list of 5,000 into 5 smaller chunks of 1,000 each. Several hosts hadn't been scanned within that window, so our ID lists are scattered between 1001 and 16941 (i.e. 1001, 1009, 1042, 1065 ... 16898, 16941):
         - Chunk 1: 1001, 1009 ... 1999, 2358 (1000 ids)
         - Chunk 2: 2359, 2368 ... 3423, 3424 (1000 ids)
         - Chunk 3: 3425, 3435 ... 7585, 8602 (1000 ids)
         - Chunk 4: 9603, 9615 ... 11742, 12743 (1000 ids)
         - Chunk 5: 12744, 13745 ... 15898, 16941 (1000 ids)
If this were further minimized into ranges instead, it would look like this:
         - Chunk 1: 1001-2358 (1357 ids)

         - Chunk 2: 2359-3424 (1065 ids)
         - Chunk 3: 3425-8602 (5177 ids)
         - Chunk 4: 9603-12743 (3140 ids)
         - Chunk 5: 12744-16941 (4197 ids)


This means we would be including hosts that do not meet the criteria of vm_processed_after. Since the same host-based filters are used in the Host List Detection API call, those hosts won't be included in the data returned; however, they will be in scope while the data is processed on the backend and will thereby impact performance.
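A small sketch of the chunking step, keeping explicit ID lists (the helper name is hypothetical):

      def chunk_ids(host_ids, size=1000):
          # Explicit ID lists preserve the scope of the host-based filter already applied.
          for i in range(0, len(host_ids), size):
              yield host_ids[i:i + size]

      # Building the 'ids' parameter for one detection call from a chunk:
      # ids_param = ','.join(str(i) for i in chunk)
      #
      # Collapsing a chunk into a range ('%d-%d' % (chunk[0], chunk[-1])) is only safe
      # when no filters were used on the Host List call and every host is wanted,
      # because the range sweeps in IDs that the filter excluded.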

 

Error Handling

Robust error handling and logging are key to any automation, and it is a recommended best practice to implement mechanisms that catch exceptions and retry with exponentially increasing backoff times when errors are encountered. This includes all functions dealing with connection requests, parsing, or writing to disk. Take care to log as much precise detail as possible, so it will be easier to audit later should the need arise.
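For instance, a minimal retry wrapper with exponential backoff might look like the sketch below (Python requests; the function name, starting delay, and retry count are illustrative choices rather than Qualys requirements):

      import logging
      import time
      import requests

      log = logging.getLogger('qualys_pull')

      def call_with_backoff(url, params, auth, retries=5, timeout=300):
          # Retry transient failures with exponentially increasing waits,
          # logging enough detail to audit the run afterwards.
          delay = 30
          for attempt in range(1, retries + 1):
              try:
                  resp = requests.post(url, data=params, auth=auth,
                                       headers={'X-Requested-With': 'python requests'},
                                       timeout=timeout)
                  resp.raise_for_status()
                  return resp.text
              except requests.RequestException as exc:
                  log.warning('attempt %d/%d against %s failed: %s',
                              attempt, retries, url, exc)
                  if attempt == retries:
                      raise
                  time.sleep(delay)
                  delay *= 2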

 

Parsing

If an error is encountered, the API will return an error code and a description of the error, which will look like this:

 

         Simple Return with error:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE GENERIC_RETURN SYSTEM "https://qualysapi.qualys.com/api/2.0/simple_return.dtd”> <SIMPLE_RETURN>
<RESPONSE> <DATETIME>2018-02-14T02:51:36Z</DATETIME> <CODE>1234</CODE>
<TEXT>Description of Error</TEXT>
</RESPONSE>
</SIMPLE_RETURN>

 

 

  Generic Return with error:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE GENERIC_RETURN SYSTEM "https://qualysapi.qualys.com/generic_return.dtd"> <GENERIC_RETURN>
<API name="index.php" username="username at="2018-02-13T06:09:27Z"> <RETURN status="FAILED" number="999">
Internal error. Please contact customer support.
</RETURN>
</GENERIC_RETURN>
<!-- Incident signature: 123a12b12c1de4f12345678901a12a12 //-->

 

A full list of error code responses can be found in the API User Guide, Appendix 1 (https://www.qualys.com/docs/qualys-api-v2-user-guide.pdf).
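Because an error can arrive in the same stream where regular results are expected, it helps to check the root element before full parsing. A rough sketch with the Python standard library (the function name is hypothetical):

      import xml.etree.ElementTree as ET

      def api_error(xml_text):
          # Returns (code, text) when the response is a SIMPLE_RETURN error, else None.
          root = ET.fromstring(xml_text)
          if root.tag == 'SIMPLE_RETURN':
              return root.findtext('RESPONSE/CODE'), root.findtext('RESPONSE/TEXT')
          return None

      # A GENERIC_RETURN with status="FAILED" (as in the second example above)
      # can be detected in a similar way by checking root.tag == 'GENERIC_RETURN'.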

 

Connection Errors

Retrieving large data sets and continuously streaming through the API for prolonged periods of time brings the possibility of running into edge cases with connections. Whichever method is used to make the outbound connection to the API endpoint, it is recommended to set a timeout that aborts/retries the connection if it has not been established within a reasonable amount of time. This prevents stalling a thread and reducing overall performance. Also consider these types of connection errors, among others:

  • Empty Responses
  • Timeouts
  • Connection reset or internal error responses (status codes 500, 503)
  • Connection Closed

These types of issues can be caused by either side of the connection, so they need to be caught and logged, and if they continue to occur they should be reported to Qualys Support for investigation.

 

 

Appendix

 

Performance examples

Host Detection API speed tests with multithreading example

Example Python POC located at https://github.com/Qualys/community/tree/master/hostdetection, used with no modifications.

 

Parameters used:

'action': 'list',
'echo_request': 1,
'show_igs': 1,
'truncation_limit': 0,
'output_format': 'XML',
'status': 'Active,New,Re-Opened,Fixed',
'vm_scan_since': '<60 days ago>',
'ids': ids

 

 

Notes:
The test with a batch size of 3,000 used vm_scan_since set to 2 days ago instead of 60. Results can vary by subscription, and using a larger thread count doesn't always mean higher throughput.

