Oct. 24th, 2012

Some thoughts on generalizing files with external components, such as HTML pages and Office documents with linked files, and on imagining a standard for the same at the OS or userspace level.

Component files

Consider data files composed of multiple components that different users may edit. These may be useful for:

  • Programs allowing both global and local settings
  • Collaborative work with strictly defined realms
  • Web and file services with user-controlled realms

Examples

The core file may look like:

{
	foo: bar
	baz: import(~bob/component.dat)
	quux: import(~fred/component.dat)
}

The processed file may look like:

{
	foo: bar
	baz: {
		alpha: asdf
		beta: bsdf
	}
	quux: {
		alpha: bits
		beta: breakers
	}
}
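The pre-processing step above can be sketched in Python. This is an illustration only: the import() syntax comes from the example, and a real implementation would also need recursive expansion, cycle detection, and per-user permission checks.

```python
import os
import re

# Matches the hypothetical import(path) directive from the example above.
IMPORT_RE = re.compile(r'import\(([^)]+)\)')

def expand(text):
    """Produce the processed file by splicing each imported component's
    contents in place of its import() directive."""
    def splice(match):
        path = os.path.expanduser(match.group(1))
        with open(path) as component:
            return component.read().strip()
    return IMPORT_RE.sub(splice, text)
```

For example, `expand("baz: import(~bob/component.dat)")` would return the line with bob's component inlined in place of the directive.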

Editing

An editor, acting as root, should be able to edit the processed file, with the result that edits made to a user-contributed part of the file are saved back to that user's file.

For this to be possible, there must be a system that can:

  1. Know the contents of the old versions of the processed and pre-processed file.
  2. Know what lines in the processed file are obtained from which user-edited files.
  3. Know what syntax scopes in the processed file are obtained from which user-edited files.
  4. Know when the processed file has been changed (using filesystem events).
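Requirement 2 (a line-to-source map) can be sketched alongside the expansion itself. The whole-line import() convention and the file names are assumptions for illustration; requirement 3 (syntax scopes) would need a real parser.

```python
import os
import re

# Matches the hypothetical import(path) directive from the earlier example.
IMPORT_RE = re.compile(r'import\(([^)]+)\)')

def expand_with_map(core_path):
    """Expand whole-line import() directives and record, for each line of
    the processed output, which file that line came from."""
    out_lines, source_map = [], []
    with open(core_path) as core:
        for line in core:
            match = IMPORT_RE.search(line)
            if match:
                component_path = os.path.expanduser(match.group(1))
                with open(component_path) as component:
                    for comp_line in component:
                        out_lines.append(comp_line.rstrip('\n'))
                        source_map.append(component_path)
            else:
                out_lines.append(line.rstrip('\n'))
                source_map.append(core_path)
    return out_lines, source_map
```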

Potential solutions

  • Have all edits use a special command that handles the comparisons and magic, analogous to the vipw command for editing the Unix password file.
  • Have a special command that only handles the comparisons and magic, and expect the user to run it after editing the file. The command then saves the new file to a .bak file for future comparisons.
  • Create a fake device for the file which wraps the edit around read and write filters. The read filter produces the processed file, while the write filter detects differences in the edited version and merges the differences into the appropriate files.
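The comparison-and-merge step shared by all three solutions can be sketched as follows, assuming a per-line source map (requirement 2 above) and edits that preserve the line count; a real merge would need proper diffing to handle insertions and deletions.

```python
from collections import defaultdict

def route_edits(old_lines, new_lines, source_map):
    """Group changed lines by the file that owns them, so each user's
    edits can be written back to that user's component file.
    Assumes old_lines, new_lines, and source_map are the same length."""
    changes = defaultdict(list)
    for old, new, owner in zip(old_lines, new_lines, source_map):
        if old != new:
            changes[owner].append((old, new))
    return dict(changes)
```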

Security considerations

Components must be complete and self-contained so that their inclusion does not produce a side effect or syntax error. This requires that the including system have knowledge of how to parse the included data.

Components must not be able to change settings outside of their scope, unless they are specifically granted access to those settings.

Components may be denied the authority to change certain settings, and attempts to change these settings will be ignored.
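The scope rules above amount to a filtered merge. A minimal sketch, with illustrative names: allowed_keys stands in for whatever mechanism grants a component access to settings outside its own scope.

```python
def apply_component(settings, component, allowed_keys, denied_keys=()):
    """Merge a component's settings into the parent configuration,
    silently ignoring any key the component may not change."""
    merged = dict(settings)
    for key, value in component.items():
        if key in denied_keys or key not in allowed_keys:
            continue  # attempts to change these settings are ignored
        merged[key] = value
    return merged
```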

Speed considerations

A component that has not changed may be cached. The operating system's regular filesystem cache may be the best solution.

If all components are unchanged, the final result may be cached.

In a future system, a component may not be a flat file but may be a function of variables, like exec("/bin/perl foo.pl 123 foo"). To cache the result of this function, a system must be able to identify every variable in the function and be able to tell if any of them may have changed from a previous run of the function. Variables are not only explicit $variables but anything else that can change and can have an effect on program flow, such as a resource that is read.

Alternatives

  • As it is today, a program may have custom logic to search a user's home directory for a configuration file. There is no standard; this duplicates effort but allows for the development of alternatives.

What are the chances that a given <a href="..."> hyperlink will still be valid in the future?

Set 1: Mr. T vs. Everything

A site from 2002-2003, pulled from the Internet Archive.

 393 total links
-214 completely dead
 179 surviving in any form
- 69 surviving only as Internet Archive copies, 24.4% archival rate (69 of the 283 otherwise-dead links)
 110 surviving links, 28.0% retention rate


 143 total hosts (duplicates include Angelfire, Geocities, etc.)
- 95 hosts with irretrievable content
  48 hosts with no lost content

  60 hosts with retrievable content
- 22 surviving only as Internet Archive copies, 21.0% archival rate (22 of the 105 otherwise-dead hosts)
  38 surviving hosts, 26.6% retention rate
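The Set 1 link arithmetic can be checked mechanically; here "archival rate" is read as the share of otherwise-dead links (total minus live survivors) that survive only as Internet Archive copies, which matches the 69/283 figure above.

```python
def link_stats(total, dead, archived_only):
    """Recompute surviving-in-any-form, live survivors, archival rate,
    and retention rate from the raw counts."""
    surviving_any = total - dead
    live = surviving_any - archived_only
    archival_rate = round(100.0 * archived_only / (dead + archived_only), 1)
    retention_rate = round(100.0 * live / total, 1)
    return surviving_any, live, archival_rate, retention_rate
```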

Set 2: My links page

Officially last updated in 2003, but I may have added links to it up until 2005.

 872 total links
-403 dead
 469 surviving links, 53.8% retention rate


 769 total hosts 
-368 hosts associated with at least one dead link
 401 hosts with no dead links, 52.1% retention rate

The Internet Archive was not checked for this larger data set, but I suspect it will have most of the sites that were not configured to exclude robots.

Summary notes

Link rot is a very serious problem. You can expect 40%-75% of links to be broken within ten years.

Novelty sites seem to have a lower retention rate than other sites. They're funny at one point in time, and then the owner forgets about them and lets the domain expire.

Links to individual articles often died after a website was reworked.

Redirections were counted as surviving links, even though they are more fragile. This includes plain HTML links to the page's current location, as long as the page was there at the new location.
