Test crawls
Running a test crawl lets you see how a crawl will behave with the current seed settings, without having to worry about the captured data being saved automatically. It is a safe way to test changes to seeds or collections, and it highlights scoping errors that may cause problems down the road. The data from a test crawl can be saved if the crawl completed as expected, or deleted if there were errors in the crawl that need to be addressed. The first step in properly scoping any seed is to run a test crawl.
Test Crawl Individual Seeds
To see how adding a new seed will affect a collection, start by running a test crawl on that seed. To run a test crawl on an individual seed:
1) Open the Collection Overview page for the collection that you have added the seed to.
2) Check the checkbox next to the seed you wish to run a test crawl on.
3) Click on 'Run Crawl'.
4) Select 'Test Crawl' as the crawl type.
5) Set the duration equal to the duration of the production crawl the seed will be added to. For example, if the seed will be in a collection with a recurring monthly crawl that lasts three days, set the time limit to three days.
6) Click on 'Crawl'.
Test Crawl a Batch of Seeds
Archive-It limits the number of test crawls that can run concurrently. If you are adding multiple seeds to a collection and they will be crawled with identical frequency and duration, you can run a single test crawl that includes all of them. To do so, follow the same steps as above, but select the checkbox of every seed that you wish to include in the test crawl.
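When many seeds are being added at once, it can help to sort them into batches before opening the web application. The following is a minimal Python sketch, not part of Archive-It: the `Seed` record, its field names, and the sample URLs are assumptions made for illustration. It groups seeds that share a crawl frequency and duration, so that each group can be selected together in one test crawl.

```python
from collections import defaultdict
from typing import NamedTuple


class Seed(NamedTuple):
    # Hypothetical record; these fields are not an Archive-It data format.
    url: str
    frequency: str      # e.g. "monthly"
    duration_days: int  # planned production crawl time limit, in days


def batch_seeds(seeds):
    """Group seed URLs by (frequency, duration) so each group can share one test crawl."""
    batches = defaultdict(list)
    for seed in seeds:
        batches[(seed.frequency, seed.duration_days)].append(seed.url)
    return dict(batches)


# Sample data for illustration only.
seeds = [
    Seed("https://example.org/", "monthly", 3),
    Seed("https://example.com/news/", "monthly", 3),
    Seed("https://example.net/", "weekly", 1),
]

for (frequency, days), urls in batch_seeds(seeds).items():
    print(f"{frequency}, {days} day(s): {urls}")
```

Each printed group corresponds to one batch test crawl: check the boxes for those seeds together and set the duration to the value shown for that group.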
Test Crawl Reports
Test crawl reports are identical to production crawl reports. They are intended to give you a complete overview of the data captured during a crawl as well as links to the wayback captures of the site for quality assurance review. Test crawl reports can be found by going to the main 'Crawls' tab of the Archive-It admin interface and then clicking on 'Test Crawls'. There are a few things to pay particularly close attention to when evaluating the effectiveness of a test crawl.
- Capture Completeness - This is best evaluated by looking at the wayback capture of the page and comparing it to the 'live' site. You want to ensure that all items on the live site were included in the wayback capture. You can look at layout, photos, navigation items and videos.
- Data Budget - The overview page of the crawl report shows you how much data was gathered during the crawl. You can compare this to other crawls you have completed to see if it appears to be an appropriate amount of data.
- Queued URLs - When looking at the test crawl report, you can examine the hosts list for each seed URL that was included in the test crawl. When examining the hosts list, pay attention to the number of 'Queued' URLs on the report. This number (and the URL list generated by clicking on the number) shows you the URLs that the crawler detected and wanted to crawl, but did not have time to crawl before the crawl's time limit was reached. Ideally this number would be 0.
- Fixing Queued URLs - You can remedy a large number of queued URLs by increasing the crawl duration or by modifying scoping rules so that crawler traps are avoided. An example of a crawler trap is a calendar in which every day is a link, even if no event or page is associated with that link, which can create thousands of 'false' links. The crawler sees each link as valid and will try to follow each one as far into the future as the calendar is programmed to show. More information on crawler traps can be found here.
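One way to spot a likely crawler trap in a long list of queued URLs is to collapse the numbers in each URL and count how many URLs share the same shape. The sketch below is an illustration in Python, not an Archive-It feature: it assumes the queued URLs have been copied into a plain list, and the function names `url_shape` and `likely_traps`, the threshold of 100, and the sample calendar URLs are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_shape(url):
    """Return host + path + query with every run of digits replaced by 'N'."""
    parts = urlsplit(url)
    shape = parts.netloc + parts.path
    if parts.query:
        shape += "?" + parts.query
    return re.sub(r"\d+", "N", shape)


def likely_traps(queued_urls, threshold=100):
    """List (shape, count) pairs for shapes shared by at least `threshold` URLs."""
    counts = Counter(url_shape(u) for u in queued_urls)
    return [(shape, n) for shape, n in counts.most_common() if n >= threshold]


# Hypothetical queued URLs: a calendar that links to every day of every year.
queued = [
    f"https://example.org/calendar?year={y}&month={m}&day={d}"
    for y in range(2025, 2030)
    for m in range(1, 13)
    for d in range(1, 29)
]
queued.append("https://example.org/about/staff")

for shape, n in likely_traps(queued):
    print(n, shape)
```

A shape that accounts for most of the queue, such as the calendar pattern in this sample, is a candidate for a scoping rule. A queue with no dominant shape is more likely to call for a longer crawl duration.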
Convert Test Crawl to Production Crawl
Once you have finished QA on a test crawl and are confident that the seed is ready to be added to a production crawl, do the following:
1) Open the collection landing page that contains the seed
2) Click on the 'Seeds' tab
3) Click the checkbox next to the seed you wish to add to a production crawl
4) Click 'Edit Settings'
5) Check the checkbox labeled 'Visible to the public'
6) Set the crawl frequency
7) Click 'Save'