In his All Our Yesterdays presentation at An Event Apart in Boston, MA in 2011, Jeremy Keith outlined the problem of digital preservation on the Web and provided some strategies for taking a long-term view of our Web pages. Here are my notes from his talk:
- Our concept of time on the Web is accelerated. Things move very quickly. There’s lots of excitement about the real-time Web and we can do a lot with instant connections. But it also helps to think about the Web in longer terms: things that last years and even decades.
- Flow is the feed: the stream of updates that reminds you that you exist. Stock is the durable stuff that stays around and continues to come up in search. The flow can be a beautiful source for the stock of content on the Web.
- The Millennium Seed Bank project is creating a thousand-year seed bank that allows us to re-grow plants if needed. Storing nuclear waste is a very long-term project. Once you are considering keeping things safe for thousands of years, you need to move beyond language to communicate.
- The Long Now Foundation was founded by a group including Brian Eno and Stewart Brand and creates projects that encourage people to think very long term. One such project is the Clock of the Long Now, engineered to run for ten thousand years.
- The Voyager spacecraft is the farthest man-made object from Earth. There’s a chance it could be found by another civilization. The Voyager Golden Record was designed as an album that conveyed information about planet Earth. It was an analog device. Digital is harder for long-term storage.
Digital is Harder
- Digital makes it easier and cheaper to make non-destructive copies. On top of the Internet we have the Web (HTML + URI + HTTP), which is small pieces loosely joined.
- There’s a common belief that things that get put online always stay there. But “the Internet never forgets” is simply not true. The Internet forgets all the time. Intuitively we don’t think there is a problem. But if we are trying to tell stories and leave a legacy online it’s a real problem.
- We’re building on sand on the Internet. URLs are pretty ephemeral. Jeremy made a bet on longbets.org that the URL of his bet would not be around in 11 years.
- One of the major problems with URLs is that we don’t own them. We give our stories to companies, start-ups, and other organizations. Many of these get bought up or shut down. The content goes away.
- Yahoo! recently shut down GeoCities. Though the content was ugly, it was a reflection of our past: our history on the Web.
- The reasons given for taking content down are that bandwidth costs are high or that no one is viewing the content. But these two reasons are mutually exclusive. If no one is viewing it, how can it be too expensive to maintain?
- Pownce shut down when it was acquired and removed all its content. Ma.gnolia suffered an outage and lost all its data. There are many examples of sites removing data and URLs.
- One option is to host a canonical copy of our content ourselves and ping out to other services. We can host our own content, but it’s hard. Self-hosting is still the domain of geeks.
- You don’t own domain names. You rent them. So there is no guarantee they will be around.
- Decentralization has allowed the Web to expand. But there is one central authority in the system: ICANN.
- A better way of thinking about your URL might be the IP address instead of the domain name.
- Even if you host yourself, you need to figure out what formats to use. This is a hard question. Stone and paper have proven to be more resilient than CDs and Laserdiscs.
- Text content is more durable because it is simpler. You can more easily recover a text file. Text files are readable by humans and machines.
- Binary data like images, audio, and video are readable by machines but not easily read by humans. They are inherently harder to archive and recover.
- Text formats have a longer life than binary formats.
- Binary files need to be decoded. In order to be decoded they need instructions on how to be decoded. This all needs to be preserved.
- The PLANETS project is designed to preserve formats and how to decode them. A location in the mountains stores instructions on how to decode files in many different formats, including HTML.
- Likely HTML will be the best archival format. It has proved to be pretty durable. While it’s kind of messy, it is mostly good enough.
- HTML5 is designed to preserve our information long-term. 10 years from now HTML will still be around. It’s not the best format, but it is the right format.
- The problem with locking down HTML is that it prevents content from being copied and thereby preserved.
- Copyright was set up to protect publishers for 14 years. It was later extended to 28 years. Both of these terms were renewable. In 1976 the US Copyright Act changed the term to the life of the author plus 50 years. In 1998 it became life plus 70 years.
- Anytime Mickey Mouse is in danger of falling into the public domain, copyright law is extended. If your content came after Steamboat Willie, you can have copyright forever.
- When you combine restrictive formats with copyright, that’s the worst thing for long-term preservation. Set your content free. Copies will keep your content around longer.
- Hosting yourself helps keep your content around. Using open formats, especially HTML, can keep your content around. Non-restrictive licenses like Creative Commons will keep your content around.
- Preserving our culture requires holding on to the little things that define our history. It’s not a technical problem to preserve our culture and our story. But we need people to want to do so.