[TheForge] Re: searchable theforge archive revisited
Mike Spencer
mspencer at tallships.ca
Wed Nov 24 21:00:59 EST 2004
> 0. e-mail addresses in theforge archives. should they be:
b. munged
Human-readable but in a format that will prevent (or at least
hinder) automated harvesting.
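
As a rough illustration of the munging option, here is a minimal Python sketch that rewrites "user@host" as "user at host". The function name munge_addresses and the pattern are assumptions made for this example, not anything the archive software is known to use, and the pattern is deliberately simple rather than a full address parser.

```python
import re

# Illustrative pattern only: it catches common addresses and is not
# a complete parser for every valid address form.
EMAIL_RE = re.compile(r"\b([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

def munge_addresses(text: str) -> str:
    """Replace 'user@host' with 'user at host' (hypothetical helper)."""
    return EMAIL_RE.sub(r"\1 at \2", text)

print(munge_addresses("Write to mspencer@tallships.ca for details."))
```

The result stays readable to a person while no longer matching the simplest harvesting patterns; a determined harvester can of course reverse it.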
> 1. urls. should they be:
>
b. left as is.
A 404 URL is at worst annoying. At best it may offer a clue to
finding the/a current location. Since this will be an
archive, old URLs may have historical value.
> 2. signatures
b. left as is.
Personal sigs left as is. Listserv boilerplate can be deleted.
In any case, many mailers don't adhere rigidly to the '-- '
convention for the .sig marker, so it can't realistically be used
as an EOT flag.
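
To show why a strict test for the marker is fragile, here is a small Python sketch. The helper find_sig_start is hypothetical and only checks for a line that is exactly '-- ' (dash, dash, space); a message whose mailer dropped the trailing space is not detected at all.

```python
def find_sig_start(lines):
    """Return the index of the last line that is exactly '-- ', else None.

    Hypothetical helper for illustration; it applies the strict
    convention and nothing looser.
    """
    for i in range(len(lines) - 1, -1, -1):
        if lines[i] == "-- ":
            return i
    return None

strict = ["Body text.", "-- ", "Mike"]
loose = ["Body text.", "--", "Mike"]  # trailing space lost by the mailer

print(find_sig_start(strict))
print(find_sig_start(loose))
```

Loosening the test to accept a bare '--' would catch the second message but would also risk cutting off body text that happens to contain such a line.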
> general comments on the searchable theforge archive.
>
> 0. blank lines are being deleted.
No. A blank line is only 1 or 2 bytes. Some writers carefully
format their ASCII text and in some cases -- say, tables -- blank
lines may even be essential for readability.
> 1. the various footers inserted by qth.net are being deleted.
Yes.
> 2. lines which contain only '>' are being deleted.
No. Often original messages are very poorly formatted as ASCII,
e.g. when the sender's mailer uses a variable-width font. Good
form for quoting may result in '>'-quoted blank lines that enhance
readability. Cf. item 0 under "general comments...", above.
Trailing blank lines, i.e. those that come after all text, with
or without '>' quoting, are content-free and won't be missed if
elided.
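
A minimal Python sketch of that rule, assuming the message body is already split into lines: only the trailing run of blank or '>'-only lines is dropped, and interior ones are left alone. The function name strip_trailing_blank and the pattern are assumptions for this example.

```python
import re

# A line counts as "content-free" here if it holds nothing but
# '>' quote marks and whitespace (or is empty).
QUOTED_BLANK_RE = re.compile(r"^[>\s]*$")

def strip_trailing_blank(lines):
    """Drop blank or '>'-only lines at the end; keep interior ones."""
    end = len(lines)
    while end > 0 and QUOTED_BLANK_RE.match(lines[end - 1]):
        end -= 1
    return lines[:end]

body = ["> first quoted line", ">", "> second quoted line", "", "> ", ">"]
print(strip_trailing_blank(body))
```

The '>' line between the two quoted lines survives because it is followed by text; only the last three lines are removed.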
In general, an archiver should never omit or change content. Archival
meta-data additions should be clearly flagged. Format changes are
optional.
HTH, IMHO, IMNSHO, YMMV, IANAL etc. etc.,
- Mike
--
Michael Spencer Nova Scotia, Canada .~.
/V\
mspencer at tallships.ca /( )\
http://home.tallships.ca/mspencer/ ^^-^^