Web Archiving

This guide covers web archiving, UTSA's web archiving efforts, and considerations for researchers.

What is a web crawler?

A web crawler is a program that visits a set list of URLs, called seeds, identifies the hyperlinks on each seed, and copies and saves information as it "crawls" the seed. This information is saved as snapshots in the Web ARChive (WARC) file format, so when the snapshots are replayed in a WARC viewer, the pages appear as they were captured.
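The link-identification step can be sketched in a few lines of Python. This is only an illustration of the general idea, not how the Archive-It crawler is implemented; the seed URL, the page content, and the names LinkCollector and extract_links are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the seed URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(seed_url, html):
    """Return the hyperlinks found in the HTML of one seed page."""
    collector = LinkCollector(seed_url)
    collector.feed(html)
    return collector.links


# Hypothetical seed and page content, for illustration only.
SEED = "https://example.edu/library/"
PAGE = '<p><a href="hours.html">Hours</a> <a href="https://example.org/">Partner</a></p>'
links = extract_links(SEED, PAGE)
```

A real crawler would also fetch each discovered link, decide whether it is in scope, and write the responses to a WARC file; those steps are omitted here.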

What types of web content might not be adequately captured?

Not all contents of an archived web resource may be captured. Some site owners forbid web crawlers from accessing their pages, while others require a login to access their site. Some technical contents also cannot be captured, including audiovisual materials, interactive components, mailto links, forms, social media feeds, and input fields (e.g., comment boxes and search boxes).

Some style elements might not be captured in crawls either, so the organization and design of the page may differ from the original. These missing contents may return a 404 error on the captured resource, may rearrange site information, or, in the case of some embedded advertising, may show current content rather than the original content. Archived content that reaches out to the current web in this way is known as a zombie resource. All of these possibilities can affect the overall look and feel of an archived web resource, so keep them in mind when accessing captured contents.

What types of web content do you NOT collect?

  • Web resources created by private individuals for personal purposes 
  • Password-protected sites
  • Databases
  • Calendars
  • Non-UTSA web resources that have robots.txt exclusion requests 
  • Linked resources from utsa.edu that are not hosted by utsa.edu or not created by UTSA entities; the original URLs of these will be captured in the crawl, but Special Collections can currently devote resources only to UTSA-created content

How do I cite an archived website?

APA

Google Inc. (1998, December 2). Google! https://web.archive.org/web/19981202230410/http://www.google.com/

Chicago Manual of Style

McDonald, R. C. "Basic Canary Care." _Robirda Online_. 12 Sept. 2004. 18 Dec. 2006 <http://www.robirda.com/cancare.html>. _Internet Archive_. <http://web.archive.org/web/20041009202820/http://www.robirda.com/cancare.html>.

MLA

You should cite the webpage as you would normally, and then give the Wayback Machine information.

McDonald, R. C. "Basic Canary Care." _Robirda Online_. 12 Sept. 2004. 18 Dec. 2006 [http://www.robirda.com/cancare.html]. _Internet Archive_. [http://web.archive.org/web/20041009202820/http://www.robirda.com/cancare.html].

If the date the information was last updated is missing, use the closest capture date in the Wayback Machine. Next come the date the page was retrieved and the original URL. Neither URL should be underlined in the bibliography itself.
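The capture date and the original URL needed for a citation can both be read from a Wayback Machine address, which embeds a 14-digit timestamp (year, month, day, hour, minute, second) before the original URL. The sketch below assumes that plain form of the address, as seen in the examples above; the function name parse_wayback_url is hypothetical.

```python
import re
from datetime import datetime

# Matches the plain form used in the citation examples:
# http(s)://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(url):
    """Split a Wayback Machine URL into (capture datetime, original URL)."""
    match = WAYBACK_RE.match(url)
    if match is None:
        raise ValueError("not a plain Wayback Machine capture URL: " + url)
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return captured, match.group(2)


captured, original = parse_wayback_url(
    "https://web.archive.org/web/19981202230410/http://www.google.com/"
)
print(captured.strftime("%Y, %B %d"), original)
```

Addresses that carry extra modifiers after the timestamp are not handled by this pattern and would raise ValueError.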

How does UTSA Special Collections decide which websites to archive?

Special Collections staff use a variety of measures to determine whether or not a website is appropriate for archiving, and the frequency with which particular websites are crawled. Is the website:

  • Relevant to our thematic collections?
  • Of appropriate size and scope?
  • Already part of a larger seed?
  • Already crawled by the Wayback Machine? If so, do we agree with the frequency of the Wayback Machine crawls?
  • Updated frequently?
  • Protected by robots.txt exclusions? If so, do we have the right to crawl the content, or should we contact the site owner for permission to crawl?
  • Password protected?
  • Dynamic, i.e. does functionality depend upon user input (databases, JavaScript) that the crawler cannot capture?
  • Structured such that calendars can be avoided, as these cannot be easily crawled (is there a “/calendar” page we can exclude from the crawl)?
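Two of the checks above, robots.txt exclusions and calendar pages, lend themselves to a short illustration. The sketch below uses Python's standard urllib.robotparser module; the robots.txt content, the crawler name "example-crawler", and the excluded "/calendar" prefix are assumptions made for the example and do not describe how Archive-It is actually configured.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(url, rules, user_agent="example-crawler",
                 excluded_prefixes=("/calendar",)):
    """Return True if the URL is neither locally excluded nor disallowed."""
    path = urlsplit(url).path
    # Skip paths we have chosen to exclude, such as calendar pages.
    if any(path.startswith(prefix) for prefix in excluded_prefixes):
        return False
    # Honor the site owner's robots.txt exclusion requests.
    return rules.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
rules = build_rules("User-agent: *\nDisallow: /private/\n")
print(should_crawl("https://example.edu/about", rules))
print(should_crawl("https://example.edu/private/report.html", rules))
print(should_crawl("https://example.edu/calendar/2024", rules))
```

The remaining criteria, such as relevance to thematic collections, are judgment calls made by staff and are not something a script can decide.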

Does Archive-It capture the date and time when a web page is updated?

No. The Archive-It web crawler can only take a snapshot of a website at a given time. Comparison of a website with previous crawls of that website may show that content has changed between crawls. Currently, there is no way to exactly determine when and how a particular website modifies its content.
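Comparing two crawls can be sketched with Python's standard hashlib and difflib modules. The example assumes the text of each capture has already been extracted; the two snapshot strings and the function names are hypothetical. It shows what changed between captures, not when the change happened.

```python
import difflib
import hashlib


def fingerprint(text):
    """Return a SHA-256 digest, a quick test of whether two captures differ."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_lines(old, new):
    """List the lines removed ("- ") or added ("+ ") between two captures."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical text from two crawls of the same page.
first_crawl = "Library hours\nOpen 9-5\n"
second_crawl = "Library hours\nOpen 9-6\n"

if fingerprint(first_crawl) != fingerprint(second_crawl):
    for line in changed_lines(first_crawl, second_crawl):
        print(line)
```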

How do I know I am looking at an archived web page, and not the live web?

All archived web resources will appear with a yellow header at the top of the page so as not to be confused with the current, live version of the resource. This header displays the following text: “You are viewing an archived web page, collected at the request of University of Texas, San Antonio using Archive-It...”

Screenshot of message appearing at the top of an archived website that states "You are viewing an archived web page".

How do I make sure that my website is included in the archive?

If you are a member of the UTSA or South Texas community and your web content is not currently being archived by the Internet Archive or the Archive-It Program, you can contact Special Collections staff to inquire about having your website included in the UTSA web archive.

How do I remove my web page or photograph from the archive?

Web content in our collections is part of the Internet Archive. We are not able to remove content from past web crawls. If you would like for your website to not be archived by the UTSA Special Collections' web crawler in the future, please contact Special Collections staff.

Why am I directed to the Internet Archive's 'Not in Archive' page for internal links within some websites?

Some websites may be only partially archived by the Archive-It web crawler. Web content may be excluded from the archive because of robots.txt exclusion requests, because portions of the website are hosted on a different web domain, or because the crawler was not able to copy the entire site.

Do you archive social media?

If a social media platform does not require a login to access content, we will archive it, but if a platform requires a login, we are unable to archive it.

What is link rot?

Link rot refers to the phenomenon of a link ceasing to point to its original targeted file, web page, or server due to it being relocated or becoming permanently unavailable.

Screenshot of a 403 HTTP status code meaning access to the requested resource is forbidden for some reason.

Screenshot of "404 Not Found" HTTP status code meaning that the browser was able to communicate with a given server, but the server could not find what was requested.
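A link can be tested for rot by requesting it and looking at the HTTP status code that comes back. The sketch below uses only Python's standard library; the category labels and the function names describe_status and check_link are hypothetical choices made for illustration.

```python
import urllib.error
import urllib.request
from http import HTTPStatus


def describe_status(code):
    """Map an HTTP status code to a rough link-health label."""
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirected"
    if code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        return "missing"  # 404 or 410: the server could not find the resource
    if code in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        return "blocked"  # 401 or 403: access to the resource is refused
    return "error"


def check_link(url, timeout=10):
    """Request a URL (headers only) and describe the result."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return describe_status(response.status)
    except urllib.error.HTTPError as err:
        return describe_status(err.code)
    except urllib.error.URLError:
        return "unreachable"  # the server itself could not be contacted


print(describe_status(404), describe_status(403))
```

Some servers answer HEAD requests differently from ordinary page requests, so a result other than "ok" is a prompt to check the link by hand rather than proof that it has rotted.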

Why can't I just google archived web resources?

Google's search results are ranked based on popularity, and most users are looking for access to the latest version of a site. Google does provide a Chrome extension which allows users to search Google's cached content, but this content is separate from what is found in the Wayback Machine. To find an older version of a site or locate a now-defunct site, tools like the Wayback Machine or organizations' Archive-It partner pages are a much better bet.
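The Wayback Machine can also be queried programmatically. The sketch below assumes the Internet Archive's availability endpoint at archive.org/wayback/available and a JSON reply containing an "archived_snapshots" object with a "closest" entry; check the Internet Archive's current documentation before relying on either. The sample reply is made up for illustration, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def availability_query(url, timestamp=None):
    """Build a query URL asking for the capture closest to a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or a longer form
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the URL of the closest capture from a JSON reply, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


query = availability_query("example.com", "20060101")

# Made-up reply in the assumed format, for illustration only.
sample_reply = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
})
print(query)
print(closest_snapshot(sample_reply))
```

Fetching the query URL (for example with urllib.request.urlopen) and passing the body to closest_snapshot would complete the lookup; that network step is left out so the sketch runs offline.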