Ongoing Maintenance

To keep the archive consistent and properly updated, a few recurring tasks should be carried out regularly. By updating crawl schedules, seed redirections, and the active state of seeds in the collection, we can ensure that the collection we present aligns well with our collecting strategy.

Crawl Schedules

Part of the organizational structure of our web archive is based on grouping similar sites together and crawling them at similar frequencies. This allows us to crawl news sites at one frequency, major college or university-wide departmental sites at another, and smaller sites at yet another. These frequencies roughly correspond to how likely the information on a website is to change. You will also notice a correlation between the number of seeds and the frequency of crawls: the more frequent the crawl, the fewer seeds it is likely to contain. Currently we have the following regular crawls scheduled:

URL Redirection / Seed Updating

From time to time a website's configuration may change and a particular URL may be swapped out for a new one. Generally, when this happens the old URL is redirected to the new URL for some period of time. However, redirections are rarely permanent, so the seed should be updated while the redirect is still in place. Within Archive-It there are two types of redirections that we should be aware of, and each should be handled differently.
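As a rough aid for spotting seeds that need attention, the sketch below labels the HTTP response for a seed URL as unchanged, temporarily redirected, or permanently redirected. It is only an HTTP-level illustration. The function name `classify_seed_response` and its labels are hypothetical, it does not call any Archive-It API, and the permanent/temporary split it draws is an assumption that may not match the two redirection types referred to above.

```python
from typing import Optional

# Standard HTTP redirect status codes.
PERMANENT_REDIRECTS = {301, 308}
TEMPORARY_REDIRECTS = {302, 303, 307}


def classify_seed_response(status: int, location: Optional[str] = None) -> str:
    """Label a seed URL's HTTP response (hypothetical helper, not an Archive-It call)."""
    # A redirect only counts if the server actually named a target URL.
    if status in PERMANENT_REDIRECTS and location:
        return "permanent-redirect"
    if status in TEMPORARY_REDIRECTS and location:
        return "temporary-redirect"
    if 200 <= status < 300:
        return "ok"
    # Errors, missing Location headers, and anything unexpected go to a person.
    return "needs-review"
```

The status code and Location header can come from any HTTP client that is configured not to follow redirects automatically. The labels are meant to prompt a manual check of the seed, not to replace one.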

Active / Inactive Seed and Collection Management

Archive-It gives administrators the option to label seeds or collections as active or inactive. Active seeds are crawled according to the schedule they are placed on. Inactive seeds remain visible as part of the collection but are not crawled. The most common reason for setting a seed as inactive is that its URL is no longer valid: either the website has been taken down or the URL has been permanently redirected to a new URL. When a seed is marked as inactive, visitors to the collection can still find it and see all of its capture dates.
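The active/inactive rule can be pictured with a small model. This is a sketch for illustration only: the `Seed` class, its fields, and the frequency labels are hypothetical and do not reflect how Archive-It stores seeds internally.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Seed:
    url: str
    frequency: str       # illustrative schedule label, e.g. "daily" or "monthly"
    active: bool = True


def seeds_to_crawl(seeds: List[Seed], frequency: str) -> List[Seed]:
    """Return the active seeds on the given schedule; inactive seeds are skipped."""
    return [s for s in seeds if s.active and s.frequency == frequency]


def seeds_visible(seeds: List[Seed]) -> List[Seed]:
    """Inactive seeds stay listed in the collection, so nothing is filtered out."""
    return list(seeds)
```

In this model, setting `active=False` removes a seed from future crawls without removing it from the collection listing, which mirrors the behavior described above.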