Voortbestaan

Early Microsoft Excel

Obsolete Thor - 14 April 2023 - 5:53pm

The first version of Microsoft Excel was released on Macintosh in 1985. Before that there was MultiPlan.

The ancestor of Excel is Multiplan 1981-1988.
Until v4, it used blitted files.
From v4 on, it’s using Biff1, like Excel 1: Starts with 09 00 BoF, Ends with 0A 00 EoF.

Easter egg? Plan lives forever

— Ange (@angealbertini) March 12, 2023

MultiPlan version 4 and Excel version 2 used the well-known and documented BIFF format. Before BIFF2, the formats are a bit of a mystery. AFAIK, Microsoft never released any documentation on the file format used for Excel version 1 and MultiPlan 1-3; instead, they emphasized using the SYLK format for interchange. To make matters worse, there were upwards of 100 different versions of the early MultiPlan, ported to dozens of different systems. Some of them are discussed on the TRS-80 website.

Or you can take MultiPlan 1.06 for a spin over at PCjs!

Needless to say, documenting the early versions of MultiPlan and Excel 1, and finding a pattern that could be used to identify them, is difficult. These versions are missing from the PRONOM registry, but hopefully, with enough samples, some patterns can be found to confidently identify formats from the early days of spreadsheets!

Marco Pontello’s TrID identifier software has signatures for the early Multiplan and Excel formats. His software scans for patterns in samples and finds commonalities between them, so the more samples he can scan, the more accurate the identification can be.

Currently the signatures are as follows.

Microsoft Excel for Mac Spreadsheet (v1.x)
    <Pattern> <Bytes>532700</Bytes> <ASCII> S '</ASCII> <Pos>0</Pos> </Pattern>
    <Pattern> <Bytes>AB27000000000000000203</Bytes> <ASCII> . '</ASCII> <Pos>4</Pos> </Pattern>

Multiplan for Mac spreadsheet (v1.x)
    <Pattern> <Bytes>11AB000013E8000000000000</Bytes> <ASCII> . . . . . . . . . . . .</ASCII> <Pos>0</Pos> </Pattern>

Multiplan spreadsheet (v1.x)
    <Pattern> <Bytes>0CE9000008AB08001F0016000200</Bytes> <Pos>0</Pos> </Pattern>

Multiplan spreadsheet (v1.0x)
    <Pattern> <Bytes>08E700</Bytes> <Pos>0</Pos> </Pattern>
    <Pattern> <Bytes>0100</Bytes> <Pos>6</Pos> </Pattern>
    <Pattern> <Bytes>000000</Bytes> <Pos>11</Pos> </Pattern>

Multiplan spreadsheet (v2.x)
    <Pattern> <Bytes>0CEC000008AB08001F001A000300</Bytes> <Pos>0</Pos> </Pattern>

Multiplan for Xenix spreadsheet (v2.x)
    <Pattern> <Bytes>0AEC000008AB08001F001A000300</Bytes> <Pos>0</Pos> </Pattern>

Multiplan spreadsheet (v3.x)
    <Pattern> <Bytes>0CED000008AB08001F001A000000</Bytes> <Pos>0</Pos> </Pattern>

There seem to be some patterns shared between versions, but also some major differences. Without a specification or an understanding of the system the samples were created on, it is hard to identify these formats with certainty. Some hex values could be the same across the samples we have but different in others, since headers often contain values such as dates or the length of the file. Finding the variations between files is therefore key to a good signature.
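To make the offset-based matching concrete, here is a minimal Python sketch that checks a file header against the TrID patterns listed above. The table layout and the function name `identify` are my own assumptions for illustration; this is not how TrID itself is implemented, and the patterns are only as reliable as the samples they were derived from.

```python
from typing import Dict, List, Tuple

# (offset, hex bytes) pairs copied from the TrID patterns listed above.
# The dictionary layout is an assumption made for this sketch.
SIGNATURES: Dict[str, List[Tuple[int, str]]] = {
    "Microsoft Excel for Mac Spreadsheet (v1.x)": [
        (0, "532700"),
        (4, "AB27000000000000000203"),
    ],
    "Multiplan for Mac spreadsheet (v1.x)": [(0, "11AB000013E8000000000000")],
    "Multiplan spreadsheet (v1.x)": [(0, "0CE9000008AB08001F0016000200")],
    "Multiplan spreadsheet (v1.0x)": [(0, "08E700"), (6, "0100"), (11, "000000")],
    "Multiplan spreadsheet (v2.x)": [(0, "0CEC000008AB08001F001A000300")],
    "Multiplan for Xenix spreadsheet (v2.x)": [(0, "0AEC000008AB08001F001A000300")],
    "Multiplan spreadsheet (v3.x)": [(0, "0CED000008AB08001F001A000000")],
}


def identify(header: bytes) -> List[str]:
    """Return the names of all signatures whose patterns all match the header."""
    matches = []
    for name, patterns in SIGNATURES.items():
        ok = True
        for pos, hex_bytes in patterns:
            expected = bytes.fromhex(hex_bytes)
            # A header that is too short yields a shorter slice and fails here.
            if header[pos:pos + len(expected)] != expected:
                ok = False
                break
        if ok:
            matches.append(name)
    return matches
```

Reading the first 32 bytes of a file and passing them to `identify` is enough for these patterns, since the deepest one ends at offset 15.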

Keep an eye on my GitHub PRONOM Research folder as I add more samples and prepare a signature for PRONOM.

Adobe Illustrator and PDF

Obsolete Thor - 7 April 2023 - 11:47pm

Adobe Illustrator is a powerful design tool. Originally released in 1987 for the Macintosh, it has been the vector design tool of choice for many professionals.

Originally the Adobe Illustrator format (AI) was based on PostScript, with each file having a PostScript header. This all changed in 2000 with Illustrator version 9, which moved to PDF as its core.

Illustrator 8 vs Illustrator 9 header.

Even though AI files begin with a PDF header, there is much more to them which makes them a unique file format. So as Dov Isaacs put it, “PDF files are not Adobe Illustrator files and vice versa”.

In digital preservation, being able to identify a file format is vital to the process. It is also important to identify when the format changes over time in order to properly maintain that file. Adobe Illustrator files created in version 8 or earlier are substantially different from those created in version 9 and later, and need different software to render properly.

This is where identification tools come in handy.

Using the “file” command in a CLI we get:

Illustrator9v8-s04.ai: PostScript document text conforming DSC level 3.0

Illustrator9-s04.ai: PDF document, version 1.4, 1 pages

While partly true, we need more specific identification if we want to properly preserve these files in the long term. Enter PRONOM, a file identification registry based on signatures that identify file formats. Using a tool like DROID or Siegfried with the PRONOM registry, we can get better identification.

siegfried : 1.10.0
scandate : 2023-04-07T11:59:18-06:00
signature : default.sig
created : 2023-03-23T15:09:43Z
identifiers :
  - name : 'pronom'
    details : 'DROID_SignatureFile_V111.xml; container-signature-20230307.xml'
---
filename : 'Illustrator9-s04.ai'
filesize : 77829
modified : 2023-04-07T11:04:53-06:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/558'
    format : 'Adobe Illustrator'
    version : '9.0'
    mime : 'application/postscript'
    class : 'Image (Vector)'
    basis : 'extension match ai; byte match at [[0 8] [1536 557]]'
    warning :
---
filename : 'Illustrator9v8-s04.ai'
filesize : 323748
modified : 2023-04-07T11:05:11-06:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/557'
    format : 'Adobe Illustrator'
    version : '8.0'
    mime : 'application/postscript'
    class : 'Image (Vector)'
    basis : 'extension match ai; byte match at 0, 673'
    warning :

This identification is possible because of signatures built for the file format specific to each version. The file format wiki has a list of the current signatures for the Illustrator format. The problem is, the last signature added to PRONOM was for version 16 (CS6). Since then there have been more changes to the format.

If we attempt an identification of an Illustrator file created with current 2023 software, we get this result.

filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors :
matches :
  - ns : 'pronom'
    id : 'fmt/20'
    format : 'Acrobat PDF 1.6 - Portable Document Format'
    version : '1.6'
    mime : 'application/pdf'
    class : 'Page Description'
    basis : 'byte match at [[0 8] [1195439 5]]'
    warning : 'extension mismatch'

Although this is technically correct, as the Illustrator file has a PDF 1.6 header, identification needs to recognize it as an Illustrator file. Creating a new signature with the following hexadecimal pattern gives the identification result shown beneath it:

255044462D312E36*3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E*252150532D41646F62652D332E30*254149355F46696C65466F726D6174203134

filename : 'Illustrator2023-s01.ai'
filesize : 1195445
modified : 2023-02-16T12:29:16-07:00
errors :
matches :
  - ns : 'pronom'
    id : 'BYUDev/3'
    format : 'Adobe Illustrator CC 2020'
    version : '24.2+'
    mime : 'application/postscript'
    class :
    basis : 'extension match ai; byte match at [[0 8] [8766 45] [45347 348]]'
    warning :

Let’s break down the hexadecimal pattern. The “*” is a wildcard indicating that zero or more bytes may sit in between.

255044462D312E36 translates to: %PDF-1.6
3C696C6C7573747261746F723A547970653E446F63756D656E743C2F696C6C7573747261746F723A547970653E translates to: <illustrator:Type>Document</illustrator:Type>
252150532D41646F62652D332E30 translates to: %!PS-Adobe-3.0
254149355F46696C65466F726D6174203134 translates to: %AI5_FileFormat 14
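The translation above is easy to reproduce. This small Python sketch splits the pattern on the “*” wildcard and decodes each hexadecimal segment as ASCII; the function name `decode_segments` is just an illustrative choice.

```python
from typing import List

# The hexadecimal signature pattern proposed above; "*" separates the segments.
PATTERN = (
    "255044462D312E36"
    "*3C696C6C7573747261746F723A547970653E446F63756D656E74"
    "3C2F696C6C7573747261746F723A547970653E"
    "*252150532D41646F62652D332E30"
    "*254149355F46696C65466F726D6174203134"
)


def decode_segments(pattern: str) -> List[str]:
    """Split a wildcard pattern on '*' and decode each hex segment as ASCII."""
    return [bytes.fromhex(segment).decode("ascii") for segment in pattern.split("*")]


for text in decode_segments(PATTERN):
    print(text)
```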

Identification is based first on the PDF header, then on some XMP metadata indicating this is an Illustrator document, then on the PostScript header, and finally on the version identifier. Each Illustrator release since version 5 has a file format version. When Adobe switched from the CS labels to CC, they stuck with format version 13 until 2020, when the format was changed to version 14. There is one catch: when Illustrator version 24 (2020) was first released, it used format version 14 but still had the PDF 1.5 header. This was changed in version 24.2 to a PDF 1.6 header, which added a bigger canvas size.

The current PRONOM signatures going back to version 9 assume some fixed offsets for the space between the PDF header, the PostScript header, and the version number. Across many samples I have found quite a few that fall outside those offsets, especially as the size of the AI file gets larger. Therefore I am suggesting the “*” wildcard between all segments.
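As a rough illustration of what the wildcard approach means in practice, the following Python sketch checks that the four segments occur in order with any number of bytes between them. It assumes the first segment is anchored at the very start of the file, and the function name is hypothetical; DROID and Siegfried implement their own matching engines, so this only demonstrates the idea.

```python
from typing import Sequence

# The four segments of the proposed signature, in the order they must appear.
AI_CC_SEGMENTS = (
    b"%PDF-1.6",
    b"<illustrator:Type>Document</illustrator:Type>",
    b"%!PS-Adobe-3.0",
    b"%AI5_FileFormat 14",
)


def matches_in_order(data: bytes, segments: Sequence[bytes]) -> bool:
    """True if data starts with segments[0] and the rest follow in order,
    separated by zero or more arbitrary bytes (the "*" wildcard)."""
    if not segments or not data.startswith(segments[0]):
        return False
    pos = len(segments[0])
    for segment in segments[1:]:
        found = data.find(segment, pos)
        if found == -1:
            return False
        pos = found + len(segment)
    return True
```

Because each search simply continues from where the previous segment ended, the gap between segments can grow with the file without breaking the match, which is the point of dropping the fixed offsets.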

One area that still needs a bit more research is Illustrator versions 9-12 (CS2). These do not include the XMP metadata indicating they are Illustrator documents, so they are more often misidentified as PDF. I did find, however, that AI files contain the string “/AIPrivateData”, while files saved as PDF contain “/AIPDFPrivateData”. The signature will therefore include this string to distinguish the two.
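A quick way to test that observation against a pile of samples is a simple byte search. This Python sketch is only an exploratory helper under the assumption that the two strings appear verbatim in the file; the function name and return labels are my own, and it is not a substitute for a proper PRONOM signature.

```python
def classify_private_data(data: bytes) -> str:
    """Classify a file by which Illustrator private-data marker it contains."""
    # "/AIPrivateData" is not a substring of "/AIPDFPrivateData",
    # so the two checks cannot be confused with each other.
    if b"/AIPrivateData" in data:
        return "AI"
    if b"/AIPDFPrivateData" in data:
        return "PDF with Illustrator data"
    return "no Illustrator private data"
```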

Another anomaly turned up in some samples I found on the Illustrator 9 CD-ROM. Illustrator 9 was released in June of 2000, but many of these files were created in February of 2000. They have a PDF 1.4 header but format version 4, which is what version 8 uses. So these files were probably created with an early build of Illustrator 9, and the format was incremented to 5 in the public release.

You can see my submission suggestion on my GitHub page along with the PRONOM signature and sample files. There are still a couple of tweaks I need to make, but let me know what you think.

Note: All Illustrator files, and PDFs saved with Illustrator compatibility checked, include a section of the file called “AI Private Data”; this is where all the Illustrator data lives. It includes a “creator” version and a “container” version, which could also be used to identify an Illustrator file’s version.

PhD Placement focussing on Manuscripts from West Africa

Endangered Archives Blog - 30 March 2023 - 4:15pm
As a PhD placement student at the British Library, I had the privilege of being part of the Endangered Archives Programme. It allowed me to dive into the rich history and culture of West Africa through its manuscripts, and to play a role in making these unique works accessible to... Endangered Archives

New online - March 2023

Endangered Archives Blog - 23 March 2023 - 3:47pm
This month we would like to highlight five new collections that have recently been made available online. They have come from South Africa, India, Nepal and from Georgia. The first project we would like to showcase is EAP1190. This was a completely new type of project for EAP. The archive... Endangered Archives

New online - February 2023

Endangered Archives Blog - 3 March 2023 - 3:14pm
This month we would like to highlight five new collections that can be accessed via the EAP website. Two of these are from India, with the others from Mali, Mongolia, and Brazil. Creating a digital archive of eighteenth- and nineteenth-century criminal and notarial records in Mamanguape, São João do Cariri,... Endangered Archives

The curse of HTML mail

Mad File Format Science - 16 February 2023 - 5:15pm

It’s been most of a year since I last posted here, but I wanted to rant about HTML mail, and this is the right blog for it. People complain about the intrusiveness of Web tracking, but email tracking is even worse. I’ve noticed this especially after subscribing to a couple of Substack newsletters. They’re sent as HTML, and whenever possible, I click the link to the equivalent Web page, which is less intrusive. Every link in a Substack newsletter is a tracking link, with the odd exception of the link to the Substack page.

The links in a Substack newsletter don’t go to the target page but to a Substack redirection URL. Their purpose is to let Substack know about everything you click on. There are no terms or privacy policy in the email telling you what Substack uses the information for.

It has a privacy policy on its website, but there’s no direct way to get to it. The policy says it collects personally identifiable information, including your name, address, picture, and phone number, and shares them with “affiliates.” Other services, such as Mailchimp, do much the same. Some HTML email services put “web bugs,” single-pixel images, into their mail. If your client displays images, the service knows each time you open the message.

The tracking links are tailored to you, so email is less private than opening a page on a site you haven’t logged into.

Tracking links make it difficult or impossible to tell where a link is actually going. Substack links use an encoding that doesn’t show the actual target in plain text, even if you view the message source.

You can read Substack messages as plain text; they’re sent as multipart messages with a plaintext version. With some newsletters, this doesn’t work too badly, but others are so interspersed with long URLs that they’re painful to read.
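For readers who want to pull out that plaintext version themselves, here is a minimal sketch using Python’s standard `email` package. The function name `plain_text_body` is hypothetical, and the sketch assumes the raw message source is already available as a string.

```python
from email import message_from_string, policy
from typing import Optional


def plain_text_body(raw_message: str) -> Optional[str]:
    """Return the text/plain part of a (possibly multipart) message, if any."""
    message = message_from_string(raw_message, policy=policy.default)
    # Ask only for a plain-text body; HTML alternatives are ignored.
    part = message.get_body(preferencelist=("plain",))
    if part is None:
        return None
    return part.get_content()
```

Any tracking URLs are still present in the plain text, but nothing is fetched or rendered, so no images load and no link is followed unless you choose to open it.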

There is one way email is less bad than websites. Few modern email client applications, if any, will run JavaScript in email. Some early ones did, but opening a message from a malicious spammer and letting it run JavaScript would be a security disaster. If you read your email in a Web client, though, it will usually run its own JavaScript (the client’s, not the sender’s). It could also modify the links to add its own tracking.

The security risks of HTML email are widely known. Before the format was widely used, the idea of spreading malware by email was a joke. Now people are advised not to open email from suspicious-looking senders, with good reason. The battle is lost, and email for personal communication has gone into steep decline.

Thunderbird and some other clients offer “simple HTML” as a compromise. It does basic formatting but doesn’t display images. If you have to open HTML messages, that’s the safest way.

Personally, I view all my email as text when it’s possible. If a message is unreadable that way, I discard it unless it’s really important.

New online - November 2022

Endangered Archives Blog - 19 December 2022 - 5:28pm
This month we are highlighting the following three projects that have recently been made available to view online. EAP1073: Creation of Historical Photography Archive at the History Department of Khartoum University [Sudan] EAP1293: Documenting and Copying (Estampage) Sluice Inscriptions: A Case Study of Pudukottai [India] EAP1294: Safeguarding for Posterity Two... Endangered Archives

New online - October 2022

Endangered Archives Blog - 16 November 2022 - 11:47am
This month we are highlighting the following four projects that have recently been made available to view online. The Historical Archive of the Institute of Charity "Hermandad De Dolores" (Fraternity of Sorrows), Santiago De Chile [EAP1289] The Manuscript Collection of Issa Iskandar al Maa’luf, Beirut [EAP1423] The Manuscripts Collection of... Endangered Archives

EAP Cataloguer Vacancy

Endangered Archives Blog - 2 November 2022 - 4:00pm
We are seeking to recruit a cataloguer to join the EAP team at the British Library’s St Pancras site. This post is until 31 December 2023 (with the hope that it will be renewed). The purpose of the post is to support the team by cataloguing material received from the... Endangered Archives

New online - September 2022

Endangered Archives Blog - 5 October 2022 - 4:13pm
We have another four projects that recently went online to highlight this month. Two projects from India, and one each from Cuba and Colombia: Preservation and Digitisation of Manuscripts Belonging to 16th to 20th Century of Central Kerala (EAP1320) Creating a digital archive of ecclesiastical records in the original seven... Endangered Archives

EAP video

Endangered Archives Blog - 30 September 2022 - 12:07pm
EAP recently commissioned a short film, in the hope that it would raise the profile of the Programme and highlight the importance of making digitised content freely available to everyone. The video is now available on the Library’s YouTube channel and we hope you enjoy watching it. EAP would like... Endangered Archives

New online - August 2022

Endangered Archives Blog - 6 September 2022 - 6:13pm
We have another four projects that recently went online to highlight this month, including two from Peru: Manuscripts and Documents at the Biblioteca Generale di Terra Santa: the second step [EAP1142] The Ancash Community Archive Digitisation (ACAD) Project, Peru [EAP1325] Traditional Mongolian Script Newspapers at Sukhbaatar District Library (1928-1935) [EAP1391]... Endangered Archives

The Marvels of the Manaki Brothers

Endangered Archives Blog - 25 August 2022 - 10:42am
EAP1470 has started with a bang - an exhibition at the State Archives of the Republic of North Macedonia to celebrate the 140th anniversary of the birth of Milton Manaki. Milton and his brother are known as the first cinematographers in the Balkans. Photographers who left a lasting legacy... Endangered Archives

Webinars for Applicants – Round 18

Endangered Archives Blog - 24 August 2022 - 3:30pm
We are pleased to announce the dates of the Webinars for Applicants to Round 18. The call goes out on 19th September and we encourage anyone interested in submitting an application to attend the webinar which will give a broad overview of the requirements of the Programme and things to... Endangered Archives

West African Manuscripts Crowdsourcing Project Fellowship: Call now open

Endangered Archives Blog - 8 August 2022 - 11:20am
We are delighted to be partnering with Chevening to offer a professional development fellowship. The Chevening Fellow will develop a community crowdsourcing project to improve the discoverability of approximately 10,000 digitised West African manuscripts within the EAP collections. We are keen to ensure these manuscripts are assigned titles in Arabic... Endangered Archives

New online - July 2022

Endangered Archives Blog - 4 August 2022 - 5:01pm
This month we are highlighting four pilot projects that have recently been made available online, from Indonesia, Kenya, Russia, and Tunisia. Early Cyrillic books and manuscripts of old believers communities in Kostroma, Russia [EAP990] Family Manuscript Libraries on the island of Jerba, Tunisia [EAP993] Endangered manuscripts digitised in Kampar, Riau... Endangered Archives

EAP Regional Hub Event at Jadavpur University, 14 September 2022

Endangered Archives Blog - 3 August 2022 - 4:16pm
In 2021, the British Library launched a project to establish a network of institutional hubs as a framework for local training and outreach work. We are very happy to announce that the School of Cultural Texts and Records, Jadavpur University, Kolkata has been chosen as the EAP Regional Hub for... Endangered Archives

Job Opportunity

Endangered Archives Blog - 1 August 2022 - 12:10pm
The British Library's International Team is seeking an International Engagement Manager to work with partners across the world and lead on setting up international hubs for EAP. You would be working across both the International Office and the Endangered Archives Programme with a focus on skills and knowledge exchange. You... Endangered Archives

The Secret Service text message situation

Mad File Format Science - 30 July 2022 - 3:28pm

The disappearance of the Secret Service’s text messages from January 6, 2021 is a data preservation issue, so I’m briefly reviving this blog from its long sleep to analyze it the best I can.

What we know

“Text messages” sent between Secret Service phones on January 6, 2021, during the unrest in Washington, DC, became unavailable within the bureau. News reporting has gotten so bad that it’s hard to find out just what this means; this CNN article contains more detail than most of the reports I’ve found.

The DHS Inspector General requested text records from the phones of 24 individuals in the Secret Service. These people included the heads of the details for the president and vice president. Only one record was given in response, and the bureau said no additional records were available. Ten phones had metadata indicating the transfer of text messages but didn’t have the messages’ content. On July 20, 2022, the Inspector General announced a criminal investigation into the lost messages.

Secret Service has stated that it lost messages as the result of a “system migration,” which occurred sometime between January 6 and February 26. It further claims that “none of the texts it [the Office of Inspector General] was seeking had been lost in the migration.” In other words, it’s saying there were no lost messages within the investigation’s scope.

Messaging and data retention

That’s not a lot to go on. Depending on whom you believe, we could be looking at anything from inconsequential sloppiness to a deliberate cover-up. But let’s see what we can get out of it.

“Text messages” usually means SMS messaging, but I haven’t found anything that explicitly says so. SMS messages are encrypted, but not end-to-end; they’re vulnerable to man-in-the-middle and spoofing attacks. If Secret Service values the “secret” in its name and it’s guarding against tech-savvy terrorists, I’d think it should use something more secure. But in the absence of other information, I’ll assume SMS. (But see below; iMessage may also have been used.)

A government agency dealing with sensitive data needs a data retention policy. It needs to make sure information doesn’t get lost and doesn’t get into unauthorized hands. The Federal Records Act requires such policies in many cases. SMS messages are normally retained only on the sender’s and recipient’s devices, so a data retention policy needs to focus there. If both the sender’s and recipient’s phones were destroyed and their text messages were never backed up, the data could be gone for good. However, it appears this isn’t what happened.

Data backup prior to migration was left up to individual Secret Service agents. This amounts to no retention policy. Even if everyone made a good-faith effort to do a backup, the saved messages would be all over the place, some of them stored on insecure servers, some irrecoverably lost.

A Washington Post article comments: “Cybersecurity professionals said that policy was ‘highly unusual,’ ‘ludicrous,’ a ‘failure of management’ and ‘not something any other organization would ever do.'” The article suggests some agents may have used iMessage on iPhones rather than SMS. It includes this extremely interesting bit:

In a letter to the House select committee investigating the insurrection, Secret Service officials said they began planning in the fall of 2020 to move all devices onto Microsoft Intune, a “mobile device management” service, known as an MDM, that companies and other organizations can use to centrally manage their computers and phones.

That sounds as if it wasn’t a matter of tossing old phones on the fire but merely installing some new software. A software installation isn’t supposed to wipe out existing data by default. It certainly shouldn’t delete it so thoroughly that forensic software can’t find at least some of the lost data.

The situation invites comparison to Hillary Clinton’s unauthorized use of a private email server during her tenure as Secretary of State (2009-2013), which became a major controversy in 2016. Some people overreacted to it, even calling for her execution, but the situations are similar in their failure to handle sensitive government records properly. The present situation is much more likely to involve the actual and possibly deliberate loss of vital information.

There’s a saying: “Never attribute to malice what can be explained by stupidity.” Is the Secret Service message black hole the result of a cover-up or gross negligence? Hopefully we’ll find out soon.

New online - June 2022

Endangered Archives Blog - 22 June 2022 - 11:13am
We have another 4 new projects online to bring to your attention. This time from Indonesia, Iran, India, and West Africa: Bima Manuscripts [EAP988] Zoroastrian historical documents and Avestan...

(From the Endangered Archives Blog: Lynda Barraclough on histories in peril)
