As I was looking at the Google cache of a page, I noticed that the layout was a bit weird. The issue was that the cache date was Nov. 1 and the site had undergone updates on the 3rd or so. The cached HTML was still pulling in the site's external CSS files, and the updated versions of those files weren't playing well with the old cached page.
So I hit refresh and noticed that the cache date was now Nov. 4, and the page looked fine, since the Nov. 4 copy was built for the updated external CSS files. I hit refresh a few more times and found I could randomly toggle between two different versions: the cached page from Nov. 1 and the one from Nov. 4. So Google evidently stores cached copies in several different places, and they aren't always in sync. This gave me an idea.
It has been said before that Google has kept copies of all of the different indices it has ever created. If this means what I think it means, then they should have all of the different cached copies for every URL that Google has ever crawled. OK, so maybe you know where I'm going with this, but keep reading anyway...
I'd like to introduce you to... Google "Versions" (or possibly Google "Timeport"). I'm going to use Google as the example search engine in this case. Sorry, Bing: this could equally apply to you, but I spend most of my day worrying about Google.
The search results page looks the same as always, BUT... each result has an extra link called "Versions" beside the standard "Cached" and "Similar" links.
When you click on this "Versions" link, you are presented with a list of the dates for which Google has a cached version of the page. You click on a date and get the cached version of the page on that date.
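To make the idea a little more concrete, here is a rough sketch in Python of the kind of lookup a "Versions" link would need: for each URL, a sorted list of crawl dates, plus the page as it was captured on each date. Everything here (the VersionStore class, its method names, the example URL, and the year in the dates) is made up for illustration; I have no idea how Google actually stores its cache.

```python
from bisect import insort
from datetime import date


class VersionStore:
    """Hypothetical in-memory store of dated page snapshots, keyed by URL."""

    def __init__(self):
        self._dates = {}  # url -> sorted list of crawl dates
        self._pages = {}  # (url, crawl date) -> cached HTML

    def add(self, url, crawl_date, html):
        """Record the HTML captured for `url` on `crawl_date`."""
        if (url, crawl_date) not in self._pages:
            # Keep the per-URL date list sorted as new snapshots arrive.
            insort(self._dates.setdefault(url, []), crawl_date)
        self._pages[(url, crawl_date)] = html

    def versions(self, url):
        """Dates to list when the user clicks the "Versions" link."""
        return list(self._dates.get(url, []))

    def get(self, url, crawl_date):
        """The cached page for one of those dates, or None if there is none."""
        return self._pages.get((url, crawl_date))


if __name__ == "__main__":
    store = VersionStore()
    url = "http://example.com/"  # placeholder URL
    # The year is arbitrary; only the two crawl dates matter here.
    store.add(url, date(2009, 11, 4), "<html>new layout</html>")
    store.add(url, date(2009, 11, 1), "<html>old layout</html>")
    for d in store.versions(url):
        print(d.isoformat(), store.get(url, d))
```

Clicking "Versions" would correspond to calling versions(), and clicking one of the listed dates to calling get().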
Yes, this is basically the concept of archive.org aka the "Wayback Machine", except that archive.org does a relatively bad job of crawling pages often enough for it to be useful. I say "relatively" because they obviously don't have the resources of a company such as Google or Microsoft. So perhaps archive.org does a fantastic job given their resources, but they're terrible when compared to either of the aforementioned companies.
But archive.org has synchronously cached pages and their associated objects for years, so presumably Google could do the same if it were so inclined.
Another problem is the potential copyright issues, but I don't see this really being a hurdle - especially in the US. The robots.txt file is the de facto standard for exclusion from search engines, and a separate user agent string could be used for Google Versions, e.g. "VersionsBot". The meta robots noarchive tag should likewise keep a page out of the Versions archive.
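Here is a minimal sketch, in Python, of how those two opt-outs could be checked before a page goes into the archive. It uses the standard library's urllib.robotparser and html.parser modules. "VersionsBot" is the hypothetical user agent from above, and the may_archive and has_noarchive helpers, the sample robots.txt, and the example URLs are all my own inventions for illustration, not anything Google has published.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# A sample robots.txt that shuts out the hypothetical VersionsBot entirely
# while leaving every other crawler mostly unrestricted.
ROBOTS_TXT = """\
User-agent: VersionsBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, url, user_agent="VersionsBot"):
    """True if robots.txt lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class _RobotsMetaParser(HTMLParser):
    """Looks for <meta name="robots" content="... noarchive ...">."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
            if "noarchive" in directives:
                self.noarchive = True


def has_noarchive(html):
    """True if the page asks not to be cached via the meta robots tag."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    return parser.noarchive


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, noarchive"></head></html>'
    print(may_archive(ROBOTS_TXT, "http://example.com/page.html"))
    print(may_archive(ROBOTS_TXT, "http://example.com/page.html", "Googlebot"))
    print(has_noarchive(page))
```

With this sample robots.txt, the site stays in the regular index but is excluded from the Versions archive, which is the kind of separate control a dedicated user agent would give site owners.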
It's an interesting idea that I would like to see Google or Bing introduce. It's certainly in line with Google's mission statement to "organize the world's information and make it universally accessible and useful."
What are your thoughts?