Voortbestaan
Early Microsoft Excel
The first version of Microsoft Excel was released for the Macintosh in 1985. Before that there was MultiPlan.
MultiPlan (1981-1988) is the ancestor of Excel.
Until v4, it used blitted files.
From v4 on, it uses BIFF1, like Excel 1: files start with 09 00 (BoF) and end with 0A 00 (EoF).
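As a rough illustration of that start and end check, here is a minimal Python sketch. The function name is hypothetical, and looking for the EoF marker anywhere in the last four bytes is an assumption on my part, made so that trailing length bytes after the marker do not cause a miss. It is a loose sanity check, not a validated signature.

```python
BOF_MARKER = b"\x09\x00"  # bytes the post gives for the start of the file (BoF)
EOF_MARKER = b"\x0a\x00"  # bytes the post gives for the end of the file (EoF)


def looks_like_early_biff(data: bytes) -> bool:
    """Loose check: BoF marker at offset 0 and an EoF marker near the end.

    Assumption: the EoF marker may be followed by up to two more bytes,
    so we search the final four bytes rather than only the last two.
    """
    return data.startswith(BOF_MARKER) and EOF_MARKER in data[-4:]
```

A file that passes this check is only a candidate; many unrelated files could start with 09 00.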
Easter egg? Plan lives forever
MultiPlan version 4 and Excel version 2 used the well-known and documented BIFF format. Before BIFF2 the formats are a bit of a mystery. AFAIK, Microsoft never released any documentation on the file format used for Excel version 1 and MultiPlan 1-3; instead it emphasized using the SYLK format for interchange. To make matters worse, there were upwards of 100 different versions of the early MultiPlan, ported to dozens of different systems. Some of them are discussed on the TRS-80 website.
Or you can take MultiPlan 1.06 for a spin over at PCjs!
Needless to say, documenting these formats and finding a pattern that could be used to identify the early versions of MultiPlan and Excel 1 is difficult. These versions are missing from the PRONOM registry, but hopefully, with enough samples, some patterns can be found to confidently identify formats from the early days of spreadsheets!
Marco Pontello’s TrID identifier software has signatures for the early Multiplan and Excel formats. His software scans samples for patterns and finds the commonalities between them, so the more samples it can scan, the more accurate the identification can be.
Currently the signatures are as follows.
Microsoft Excel for Mac Spreadsheet (v1.x)
<Pattern> <Bytes>532700</Bytes> <ASCII> S '</ASCII> <Pos>0</Pos> </Pattern>
<Pattern> <Bytes>AB27000000000000000203</Bytes> <ASCII> . '</ASCII> <Pos>4</Pos> </Pattern>
Multiplan for Mac spreadsheet (v1.x)
<Pattern> <Bytes>11AB000013E8000000000000</Bytes> <ASCII> . . . . . . . . . . . .</ASCII> <Pos>0</Pos> </Pattern>
Multiplan spreadsheet (v1.x)
<Pattern> <Bytes>0CE9000008AB08001F0016000200</Bytes> <Pos>0</Pos> </Pattern>
Multiplan spreadsheet (v1.0x)
<Pattern> <Bytes>08E700</Bytes> <Pos>0</Pos> </Pattern>
<Pattern> <Bytes>0100</Bytes> <Pos>6</Pos> </Pattern>
<Pattern> <Bytes>000000</Bytes> <Pos>11</Pos> </Pattern>
Multiplan spreadsheet (v2.x)
<Pattern> <Bytes>0CEC000008AB08001F001A000300</Bytes> <Pos>0</Pos> </Pattern>
Multiplan for Xenix spreadsheet (v2.x)
<Pattern> <Bytes>0AEC000008AB08001F001A000300</Bytes> <Pos>0</Pos> </Pattern>
Multiplan spreadsheet (v3.x)
<Pattern> <Bytes>0CED000008AB08001F001A000000</Bytes> <Pos>0</Pos> </Pattern>
There seem to be some patterns shared between versions, but also some major differences. Without a specification or an understanding of the system the samples were created on, it is hard to identify these formats with certainty. Some hex values could be the same in the samples we have but different in others. Headers often hold values such as dates or the length of the file, so finding the variations between files is key to a good signature.
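To show how positional patterns like these are applied, here is a small Python sketch. It is not TrID's actual implementation. The `SIGNATURES` table and the `identify` function are hypothetical names, and only three of the signatures above are included, copied as (position, bytes) pairs.

```python
from typing import Dict, List, Tuple

# Each signature is a list of (position, bytes) pairs taken from the
# patterns listed above. Only a subset is reproduced here.
Pattern = Tuple[int, bytes]

SIGNATURES: Dict[str, List[Pattern]] = {
    "Microsoft Excel for Mac Spreadsheet (v1.x)": [
        (0, bytes.fromhex("532700")),
        (4, bytes.fromhex("AB27000000000000000203")),
    ],
    "Multiplan spreadsheet (v1.0x)": [
        (0, bytes.fromhex("08E700")),
        (6, bytes.fromhex("0100")),
        (11, bytes.fromhex("000000")),
    ],
    "Multiplan spreadsheet (v2.x)": [
        (0, bytes.fromhex("0CEC000008AB08001F001A000300")),
    ],
}


def identify(header: bytes) -> List[str]:
    """Return the names of all signatures whose patterns all match."""
    matches = []
    for name, patterns in SIGNATURES.items():
        # A signature matches only if every pattern is found at its position.
        if all(header[pos:pos + len(sig)] == sig for pos, sig in patterns):
            matches.append(name)
    return matches
```

Bytes that fall between the listed positions are ignored, which is what lets a signature skip over values that vary from file to file.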
Keep an eye on my GitHub PRONOM Research folder as I add more samples and prepare a signature for PRONOM.
Adobe Illustrator and PDF
Adobe Illustrator is a powerful design tool. Originally released in 1987 for the Macintosh, it has been the vector design tool of choice for many professionals.
Originally the Adobe Illustrator format (AI) was based on PostScript, with each file having a PostScript header. This all changed with Illustrator version 9, released in the year 2000, which moved to PDF as its core.
Even though AI files begin with a PDF header, there is much more to them which makes them a unique file format. So as Dov Isaacs put it, “PDF files are not Adobe Illustrator files and vice versa”.
In digital preservation, identifying a file format is vital to the process. It is also important to identify when the format changes over time in order to properly maintain that file. Adobe Illustrator files created in version 8 or earlier are substantially different from those created in version 9 and later, and they need different software to render properly.
This is where identification tools come in handy.
Using the “file” command in a CLI, we get:
Illustrator9v8-s04.ai: PostScript document text conforming DSC level 3.0
Illustrator9-s04.ai: PDF document, version 1.4, 1 pages
While these results are partly true, we need more specific identification if we want to properly preserve these files in the long term. Enter PRONOM, a file format registry whose signatures are used to identify file formats. Using a tool like DROID or Siegfried with the PRONOM registry, we can get better identification.
siegfried   : 1.10.0
scandate    : 2023-04-07T11:59:18-06:00
signature   : default.sig
created     : 2023-03-23T15:09:43Z
identifiers :
  - name    : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'Illustrator9-s04.ai'
filesize : 77829
modified : 2023-04-07T11:04:53-06:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/558'
    format  : 'Adobe Illustrator'
    version : '9.0'
    mime    : 'application/postscript'
    class   : 'Image (Vector)'
    basis   : 'extension match ai; byte match at [[0 8] [1536 557]]'
    warning :
---
filename : 'Illustrator9v8-s04.ai'
filesize : 323748
modified : 2023-04-07T11:05:11-06:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/557'
    format  : 'Adobe Illustrator'
    version : '8.0'
    mime    : 'application/postscript'
    class   : 'Image (Vector)'
    basis   : 'extension match ai; byte match at 0, 673'
    warning :
This identification is possible because of signatures built for the file format specific to each version. The file format wiki has a list of the current signatures for the Illustrator format. The problem is, the last signature added to PRONOM was for version 16 (CS6). Since then there have been more changes to the format.
If we attempt an identification of an Illustrator file created with current 2023 software, we get this result.
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'fmt/20'
    format  : 'Acrobat PDF 1.6 - Portable Document Format'
    version : '1.6'
    mime    : 'application/pdf'
    class   : 'Page Description'
    basis   : 'byte match at [[0 8] [1195439 5]]'
    warning : 'extension mismatch'
Although this is technically correct, as the Illustrator file has a PDF 1.6 header, identification needs to recognize that this is an Illustrator file. So we create a new signature by adding the following hexadecimal pattern:
255044462D312E36*3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E*252150532D41646F62652D332E30*254149355F46696C65466F726D6174203134
With that signature in place, the same file is identified as:
filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors   :
matches  :
  - ns      : 'pronom'
    id      : 'BYUDev/3'
    format  : 'Adobe Illustrator CC 2020'
    version : '24.2+'
    mime    : 'application/postscript'
    class   :
    basis   : 'extension match ai; byte match at [[0 8] [8766 45] [45347 348]]'
    warning :
Let’s break down the hexadecimal pattern. The “*” is a wildcard indicating that there are zero or more bytes in between.
255044462D312E36 translates to: %PDF-1.6
3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E translates to: <illustrator:Type>Document</illustrator:Type>
252150532D41646F62652D332E30 translates to: %!PS-Adobe-3.0
254149355F46696C65466F726D6174203134 translates to: %AI5_FileFormat 14
Identification is based first on the PDF header, then on some XMP metadata indicating this is an Illustrator document, then on the PostScript header, and finally on the version identifier. Each Illustrator release since version 5 has a file format version. When Adobe switched from the CS labels to CC, it stuck with version 13 until 2020, when the format was changed to version 14. There is one catch: when Illustrator version 24 (2020) was first released, it used format version 14 but still had the PDF 1.5 header. This was changed in version 24.2 to a PDF 1.6 header, which added a bigger canvas size.
The current PRONOM signatures going back to version 9 assume some fixed offsets for the space between the PDF header, the PostScript header, and the version number. Across many samples I have found quite a few that fall outside those offsets, especially as the AI file gets larger. Therefore I am suggesting the “*” wildcard between all segments.
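The effect of the “*” wildcard can be sketched in a few lines of Python. This is not how DROID or Siegfried work internally. The function name is hypothetical, and anchoring the first segment at offset 0 is my assumption, based on the PDF header matching at byte 0 in the output above. Each later segment only has to appear somewhere after the previous one.

```python
from typing import List, Optional

# The hexadecimal pattern from above; "*" separates the four segments.
AI_2020_PATTERN = (
    "255044462D312E36"
    "*3C696C6C7573747261746F723A547970653E446F63756D656E74"
    "3C2F696C6C7573747261746F723A547970653E"
    "*252150532D41646F62652D332E30"
    "*254149355F46696C65466F726D6174203134"
)


def match_wildcard_signature(data: bytes,
                             pattern: str = AI_2020_PATTERN) -> Optional[List[int]]:
    """Return the offset of each segment, or None if the pattern fails."""
    segments = [bytes.fromhex(part) for part in pattern.split("*")]
    # Assumption: the first segment (the PDF header) must sit at offset 0.
    if not data.startswith(segments[0]):
        return None
    offsets = [0]
    cursor = len(segments[0])
    for segment in segments[1:]:
        # "*" means any number of bytes may precede the next segment.
        found = data.find(segment, cursor)
        if found == -1:
            return None
        offsets.append(found)
        cursor = found + len(segment)
    return offsets
```

Because no maximum gap is enforced, the match does not depend on how large the file is, which is the point of dropping the fixed offsets.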
One area that still needs a bit more research is Illustrator versions 9-12 (CS2). These do not include the XMP metadata indicating they are Illustrator documents, so they are more often misidentified as PDF. I did find, however, that AI files contain the string “/AIPrivateData”, while files saved as PDF contain “/AIPDFPrivateData”. So the signature will have this string added to distinguish the two.
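As a minimal sketch of that distinction in Python: the function name and the returned labels are hypothetical, and checking for the native marker first is an assumption about what to do if a file contained both strings. Neither string is a substring of the other, so the two plain byte searches do not interfere.

```python
def classify_private_data(data: bytes) -> str:
    """Label a file by which private-data marker it contains."""
    # Assumption: if both markers were present, treat the file as native AI.
    if b"/AIPrivateData" in data:
        return "native Illustrator file"
    if b"/AIPDFPrivateData" in data:
        return "PDF saved from Illustrator"
    return "no Illustrator private data found"
```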
Another anomaly is a set of samples I found on the Illustrator 9 CD-ROM. Illustrator 9 was released in June of 2000, but many of these files were created in February of 2000. They have a PDF 1.4 header but format version 4, which is what version 8 uses. So these files were probably created with an early build of Illustrator 9, and the format version was incremented to 5 in the public release.
You can see my submission suggestion on my GitHub page along with the PRONOM signature and sample files. There are still a couple of tweaks I need to make, but let me know what you think.
Note: All Illustrator files, and PDFs saved with Illustrator compatibility checked, include a section of the file called “AI Private Data”. This is where all the Illustrator data lives. It includes a “creator” version and a “container” version, which could also be used to identify an Illustrator file’s version.
PhD Placement focussing on Manuscripts from West Africa
New online - March 2023
New online - February 2023
The curse of HTML mail
It’s been most of a year since I last posted here, but I wanted to rant about HTML mail, and this is the right blog for it. People complain about the intrusiveness of Web tracking, but email tracking is even worse. I’ve noticed this especially after subscribing to a couple of Substack newsletters. They’re sent as HTML, and whenever possible, I click the link to the equivalent Web page, which is less intrusive. Every link in a Substack newsletter is a tracking link, with the odd exception of the link to the Substack page.
The links in a Substack newsletter don’t go to the target page but to a Substack redirection URL. Their purpose is to let Substack know about everything you click on. There are no terms or privacy policy in the email telling you what Substack uses the information for.
Substack has a privacy policy on its website, but there’s no direct way to get to it. The policy says it collects personally identifiable information, including your name, address, picture, and phone number, and shares them with “affiliates.” Other services, such as Mailchimp, do much the same. Some HTML email services put “web bugs,” single-pixel images, into their mail. If your client displays images, the service knows each time you open the message.
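For the curious, here is a rough Python sketch of how one might spot such single-pixel images in a message’s HTML, using only the standard library. The class and function names are made up for this example, and it assumes the tracker declares width and height attributes of exactly 1, which real trackers don’t always do.

```python
from html.parser import HTMLParser
from typing import List


class WebBugFinder(HTMLParser):
    """Collect the src of every <img> declared as 1 pixel by 1 pixel."""

    def __init__(self) -> None:
        super().__init__()
        self.suspects: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Assumption: a web bug announces itself with width="1" height="1".
        if attributes.get("width") == "1" and attributes.get("height") == "1":
            self.suspects.append(attributes.get("src") or "")


def find_web_bugs(html: str) -> List[str]:
    finder = WebBugFinder()
    finder.feed(html)
    finder.close()
    return finder.suspects
```

An image sized with CSS, or one that simply has no declared size, would slip past this check, so not displaying remote images at all remains the safer approach.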
The tracking links are tailored to you, so email is less private than opening a page on a site you haven’t logged into.
Tracking links make it difficult or impossible to tell where a link is actually going. Substack links use an encoding that doesn’t show the actual target in plain text, even if you view the message source.
You can read Substack messages as plain text; they’re sent as multipart messages with a plaintext version. With some newsletters, this doesn’t work too badly, but others are so interspersed with long URLs that they’re painful to read.
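If you want to pull out the plaintext part yourself, Python’s standard email package can do it. This is a minimal sketch; the function name is mine, and it assumes the message really does carry a text/plain alternative.

```python
from email import message_from_bytes, policy
from typing import Optional


def plain_text_part(raw_message: bytes) -> Optional[str]:
    """Return the text/plain body of a message, or None if it has none."""
    # policy.default gives us the modern EmailMessage API, including get_body().
    message = message_from_bytes(raw_message, policy=policy.default)
    body = message.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else None
```

The long tracking URLs will still be there in the text, of course; this only spares you the HTML.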
There is one way email is less bad than websites. Few modern email client applications, if any, will run JavaScript in email. Some early ones did, but opening a message from a malicious spammer and letting it run JavaScript would be a security disaster. If you read your email in a Web client, though, it will usually run its own JavaScript (the client’s, not the sender’s). It could also modify the links to add its own tracking.
The security risks of HTML email are widely known. Before the format was widely used, the idea of spreading malware by email was a joke. Now people are advised not to open email from suspicious-looking senders, with good reason. The battle is lost, and email for personal communication has gone into steep decline.
Thunderbird and some other clients offer “simple HTML” as a compromise. It does basic formatting but doesn’t display images. If you have to open HTML messages, that’s the safest way.
Personally, I view all my email as text when it’s possible. If a message is unreadable that way, I discard it unless it’s really important.
New online - November 2022
New online - October 2022
EAP Cataloguer Vacancy
New online - September 2022
EAP video
New online - August 2022
The Marvels of the Manaki Brothers
Webinars for Applicants – Round 18
West African Manuscripts Crowdsourcing Project Fellowship: Call now open
New online - July 2022
EAP Regional Hub Event at Jadavpur University, 14 September 2022
Job Opportunity
The Secret Service text message situation
The disappearance of the Secret Service’s text messages from January 6, 2021 is a data preservation issue, so I’m briefly reviving this blog from its long sleep to analyze it the best I can.
What we know
“Text messages” sent between Secret Service phones on January 6, 2021, during the unrest in Washington, DC, became unavailable within the bureau. News reporting has gotten so bad that it’s hard to find out just what this means; this CNN article contains more detail than most of the reports I’ve found.
The DHS Inspector General requested text records from the phones of 24 individuals in the Secret Service. These people included the heads of the details for the president and vice president. Only one record was given in response, and the bureau said no additional records were available. Ten phones had metadata indicating the transfer of text messages but didn’t have the messages’ content. On July 20, 2022, the Inspector General announced a criminal investigation into the lost messages.
The Secret Service has stated that it lost messages as the result of a “system migration,” which occurred sometime between January 6 and February 26. It further claims that “none of the texts it [the Office of Inspector General] was seeking had been lost in the migration.” In other words, it’s saying there were no lost messages within the investigation’s scope.
Messaging and data retention
That’s not a lot to go on. Depending on whom you believe, we could be looking at anything from inconsequential sloppiness to a deliberate cover-up. But let’s see what we can get out of it.
“Text messages” usually means SMS messaging, but I haven’t found anything that explicitly says so. SMS messages are encrypted, but not end-to-end; they’re vulnerable to man-in-the-middle and spoofing attacks. If the Secret Service values the “secret” in its name and it’s guarding against tech-savvy terrorists, I’d think it should use something more secure. But in the absence of other information, I’ll assume SMS. (But see below; iMessage may also have been used.)
A government agency dealing with sensitive data needs a data retention policy. It needs to make sure information doesn’t get lost and doesn’t get into unauthorized hands. The Federal Records Act requires such policies in many cases. SMS messages are normally retained only on the sender’s and recipient’s devices, so a data retention policy needs to focus there. If both the sender’s and recipient’s phones were destroyed and their text messages were never backed up, the data could be gone for good. However, it appears this isn’t what happened.
Data backup prior to migration was left up to individual Secret Service agents. This amounts to no retention policy. Even if everyone made a good-faith effort to do a backup, the saved messages would be all over the place, some of them stored on insecure servers, some irrecoverably lost.
A Washington Post article comments: “Cybersecurity professionals said that policy was ‘highly unusual,’ ‘ludicrous,’ a ‘failure of management’ and ‘not something any other organization would ever do.'” The article suggests some agents may have used iMessage on iPhones rather than SMS. It includes this extremely interesting bit:
In a letter to the House select committee investigating the insurrection, Secret Service officials said they began planning in the fall of 2020 to move all devices onto Microsoft Intune, a “mobile device management” service, known as an MDM, that companies and other organizations can use to centrally manage their computers and phones.
That sounds as if it wasn’t a matter of tossing old phones on the fire but merely installing some new software. A software installation isn’t supposed to wipe out existing data by default. It certainly shouldn’t delete it so thoroughly that forensic software can’t find at least some of the lost data.
The situation invites comparison to Hillary Clinton’s unauthorized use of a private email server for official business as Secretary of State, which became a major issue in 2016. Some people overreacted to it, even calling for her execution, but the situations are similar in their failure to handle sensitive government records properly. The present situation is much more likely to involve the actual and possibly deliberate loss of vital information.
There’s a saying: “Never attribute to malice what can be explained by stupidity.” Is the Secret Service message black hole the result of a cover-up or gross negligence? Hopefully we’ll find out soon.
New online - June 2022
(From the Endangered Archives Blog: Lynda Barraclough on histories in peril)