

The 20th Century Time Machine

Internet Archive - October 13, 2017 - 4:24am

by Nancy Watzman & Katie Dahl

Jason Scott

With the turn of a dial, some flashing lights, and the requisite puff of fog, emcees Tracey Jaquith, TV architect, and Jason Scott, free range archivist, cranked up the Internet Archive 20th Century Time Machine on stage before a packed house at the Internet Archive’s annual party on October 11.

Eureka! The cardboard contraption worked! The year was 1912, and out stepped Alexis Rossi, director of Media and Access, her hat adorned with a 78 rpm record.


D’Anna Alexander (center) with her mother (right) and grandmother (left).

“Close your eyes and listen,” Rossi asked the audience. And then, out of the speakers floated the scratchy sounds of Billy Murray singing “Low Bridge, Everybody Down” written by Thomas S. Allen. From 1898 to the 1950s, some three million recordings of about three minutes each were made on 78rpm discs. But these discs are now brittle, the music stored on them precious. The Internet Archive is working with partners on the Great 78 Project to store these recordings digitally, so that we and future generations can enjoy them and reflect on our music history. New collections include the Tina Argumedo and Lucrecia Hug 78rpm Collection of dance music collected in Argentina in the mid-1930s.


Next to emerge from the Time Machine was David Leonard, president of the Boston Public Library, which was the first free, municipal library founded in the United States. The mission was and remains bold: make knowledge available to everyone. Knowledge shouldn’t be hidden behind paywalls, restricted to the wealthy but rather should operate under the principle of open access as public good, he explained. Leonard announced that the Boston Public Library would join the Internet Archive’s Great 78 Project, by authorizing the transfer of 200,000 individual 78s to digitize for the 78rpm collection, “a collection that otherwise would remain in storage unavailable to anyone.”

David Leonard and Brewster Kahle

Brewster Kahle, founder and digital librarian of the Internet Archive, then came through the Time Machine to present the Internet Archive Hero Award to Leonard. “I am inspired every time I go through the doors,” said Kahle of the library, noting that the Boston Public Library was the first to digitize not just a presidential library, that of John Quincy Adams, but also modern books. Leonard was presented with a tablet imprinted with the Boston Public Library homepage.


Kahle then set the Time Machine to 1942 to explain another new Internet Archive initiative: liberating books published between 1923 and 1941. Working with Elizabeth Townsend Gard, a copyright scholar at Tulane University, the Internet Archive is liberating these books under a little-known, and perhaps never-used, provision of US copyright law, Section 108(h), which allows libraries to scan and make available materials published from 1923 to 1941 if they are not being actively sold. The name of the new collection: the Sonny Bono Memorial Collection, named for the late congressman who led the passage of the Copyright Term Extension Act of 1998, which had the effect of keeping most books published since 1923 out of the public domain.

One of these books includes “Your Life,” a tome written by Kahle’s grandfather, Douglas E. Lurton, a “guide to a desirable living.” “I have one copy of this book and two sons. According to the law, I can’t make one copy and give it to the other son. But now it’s available,” Kahle explained.


Sab Masada

The Time Machine cranked to 1944, and out came Rick Prelinger, Internet Archive board president, archivist, and filmmaker. Prelinger introduced a new addition to the Internet Archive’s film collection: long-forgotten footage of an Arkansas Japanese internment camp from 1944. As the film played on the screen, Prelinger welcomed Sab Masada, 87, who lived at this very camp as a 12-year-old.

Masada talked about his experience at the camp and why it is important for people today to remember it, “Since the election I’ve heard echoes of what I heard in 1942. Using fear of terrorism to target the Muslims and people south of the border.”


Next to speak was Wendy Hanamura, the director of partnerships. Hanamura explained how as a sixth grader she discovered a book at the library, Executive Order 9066, published in 1972, which told the tale of Japanese internment camps during World War II.

“Before I was an internet archivist, I was a daughter and granddaughter of American citizens who were locked up behind barbed wires,” said Hanamura. That one book – now out of print – helped her understand what had happened to her family.

Inspired by making it to the semi-final round of the MacArthur 100&Change initiative with a proposal that provides libraries and learners with free digital access to four million books, the Internet Archive is forging ahead with plans despite not winning the $100 million grant. Among the books the Internet Archive is making available: Executive Order 9066.


The year display turned to 1985, and Jason Scott reappeared on stage, explaining his role as a software curator. New this year to the Internet Archive are collections of early Apple software, he explained, with browser emulation allowing the user to experience just what it was like to fire up a Macintosh computer back in its heyday. This includes a collection of the then wildly popular HyperCard stacks, a programming tool that enabled users to create programs linking materials in creative ways, before the rise of the World Wide Web.


After this tour through the 20th century, the Time Machine was set to the present day, 2017. Mark Graham, director of the Wayback Machine, and Vinay Goel, senior data engineer, stepped on stage. Back in 1996, when the Wayback Machine began archiving websites on the still-new World Wide Web, the entire thing amounted to 2.2 terabytes of data. Now the Wayback Machine contains 20 petabytes. Graham explained how the Wayback Machine is preserving tweets, government websites, and other materials that could otherwise vanish. One example: a report from The Rachel Maddow Show, which aired on December 16, 2016, about Michael Flynn, then slated to become national security advisor. Flynn deleted a tweet he had made linking to a falsified story about Hillary Clinton, but the Internet Archive saved it through the Wayback Machine.

Goel took the microphone to announce new improvements to Wayback Machine 2.0 search. Now it’s possible to search for keywords, such as “climate change,” and find not just web pages from a particular time period mentioning those words, but also different format types, such as images, PDFs, or yes, even an old Internet Archive favorite: GIFs from the now-defunct GeoCities, including snow globes!

Thanks to all who came out to celebrate with the Internet Archive staff and volunteers, or watched online. Please join our efforts to provide Universal Access to All Knowledge, whatever century it is from.


Syncing Catalogs with thousands of Libraries in 120 Countries through OCLC

Internet Archive - October 12, 2017 - 6:37pm

We are pleased to announce that the Internet Archive and OCLC have agreed to synchronize the metadata describing our digital books with OCLC’s WorldCat. WorldCat is a union catalog that itemizes the collections of thousands of libraries in more than 120 countries that participate in the OCLC global cooperative.

What does this mean for readers?
When the synchronization work is complete, library patrons will be able to discover the Internet Archive’s collection of 2.5 million digitized monographs through the libraries around the world that use OCLC’s bibliographic services. Readers searching for a particular volume will know that a digital version of the book exists in our collection. With just one click, readers will be taken to archive.org to examine and possibly borrow the digital version of that book. In turn, readers who find a digital book at archive.org will be able, with one click, to discover the nearest library where they can borrow the hard copy.

There are additional benefits: in the process of the synchronization, OCLC databases will be enriched with records describing books that may not yet be represented in WorldCat.

“This work strengthens the Archive’s connection to the library community around the world. It advances our goal of universal access by making our collections much more widely discoverable. It will benefit library users around the globe by giving them the opportunity to borrow digital books that might not otherwise be available to them,” said Brewster Kahle, Founder and Digital Librarian of the Internet Archive. “We’re glad to partner with OCLC to make this possible and look forward to other opportunities this synchronization will present.”

“OCLC is always looking for opportunities to work with partners who share goals and objectives that can benefit libraries and library users,” said Chip Nilges, OCLC Vice President, Business Development. “We’re excited to be working with Internet Archive, and to make this valuable content discoverable through WorldCat. This partnership will add value to WorldCat, expand the collections of member libraries, and extend the reach of Internet Archive content to library users everywhere.”

We believe this partnership will be a win-win-win for libraries and for learners around the globe.

Better discovery, richer metadata, more books borrowed and read.

Boston Public Library’s Sound Archives Coming to the Internet Archive for Preservation & Public Access

Internet Archive - October 11, 2017 - 8:06pm

Today, the Boston Public Library announced the transfer of significant holdings from its Sound Archives Collection to the Internet Archive, which will digitize, preserve, and make these recordings accessible to the public. The Boston Public Library (BPL) sound collection includes hundreds of thousands of audio recordings in a variety of historical formats, including wax cylinders, 78 rpm discs, and LPs. The recordings span many genres, including classical, pop, rock, jazz, and opera, from 78s produced in the early 1900s to LPs from the 1980s. These recordings have never been circulated and were in storage for several decades, uncataloged and inaccessible to the public. By collaborating with the Internet Archive, the Boston Public Library’s audio collection can be heard by new audiences of scholars, researchers, and music lovers worldwide.

Some of the thousands of 20th century recordings in the Boston Public Library’s Sound Archives Collection.

“Through this innovative collaboration, the Internet Archive will bring significant portions of these sound archives online and to life in a way that we couldn’t do alone, and we are thrilled to have this historic collection curated and cared for by our longtime partners for all to enjoy going forward,” said David Leonard, President of the Boston Public Library.

78 rpm recordings from the Boston Public Library Sound Archive Collection

Listening to the 78 rpm recording of “Please Pass the Biscuits, Pappy,” by W. Lee O’Daniel and his Hillbilly Boys from the BPL Sound Archive, what do you hear? Internet Archive Founder, Brewster Kahle, hears part of a soundscape of America in 1938.  That’s why he believes Boston Public Library’s transfer is so significant.

“Boston Public Library is once again leading in providing public access to their holdings. Their Sound Archive Collection includes hillbilly music, early brass bands and accordion recordings from the turn of the last century, offering an authentic audio portrait of how America sounded a century ago,” says Brewster Kahle, the Internet Archive’s Digital Librarian. “Every time I walk through Boston Public Library’s doors, I’m inspired to read what is carved above it: ‘Free to All.’”

The 78 rpm records from the BPL’s Sound Archives Collection fit into the Internet Archive’s larger initiative called The Great 78 Project. This community effort seeks to digitize all the 78 rpm records ever produced, supporting their preservation, research, and discovery. From about 1898 to the 1950s, an estimated 3 million sides were published on 78 rpm discs. While commercially viable recordings have been restored or remastered onto LPs or CDs, there is significant research value in the remaining artifacts, which include often-rare 78 rpm recordings.

“The simple fact of the matter is most audiovisual recordings will be lost,” says George Blood, an internationally renowned expert on audio preservation. “These 78s are disappearing right and left. It is important that we do a good job preserving what we can get to, because there won’t be a second chance.”

George Blood LP’s 4-arm turntable used for 78 digitization.

The Internet Archive is working with George Blood LP and the IA’s music curator, Bob George of the Archive of Contemporary Music, to discover, transfer, digitize, catalog, and preserve these often fragile discs. This team has already digitized more than 35,000 sides. The BPL collection joins more than 20 collections already transferred to the Internet Archive for physical and digital preservation and access. Curated by many volunteer collectors, these collections will be preserved for future generations.

The Internet Archive began working with the Boston Public Library in 2007, and our scanning center is housed at its Central Library in Copley Square.  There, as a digital-partner-in-residence, the Internet Archive is scanning bound materials for Boston Public Library, including the John Adams Library, one of the BPL’s Collections of Distinction.

To honor Boston Public Library’s long legacy and pioneering role in making its valuable holdings available to an ever wider public online, we will be awarding the 2017 Internet Archive Hero Award to David Leonard, the President of BPL, at a public celebration tonight at the Internet Archive headquarters in San Francisco.



Structural Metadata: Key to Structured Content

Story Needle - October 11, 2017 - 11:27am

Structural metadata is the most misunderstood form of metadata.  It is widely ignored, even among those who work with metadata. When it is discussed, it gets confused with other things.  Even people who understand structural metadata correctly don’t always appreciate its full potential. That’s unfortunate, because structural metadata can make content more powerful. This post takes a deep dive into what structural metadata is, what it does, and how it is changing.

Why should you care about structural metadata? The immediate, self-interested answer is that structural metadata facilitates content reuse, taking content that’s already created to deliver new content. Content reuse is nice for publishers, but it isn’t a big deal for audiences.  Audiences don’t care how hard it is for the publisher to create their content. Audiences want content that matches their needs precisely, and that’s easy to use.  Structural metadata can help with that too.

Structural metadata matches content with the needs of audiences. Content delivery can evolve beyond creating many variations of content — the current preoccupation of many publishers. Publishers can use structural metadata to deliver more interactive content experiences.  Structural metadata will be pivotal in the development of multimodal content, allowing new forms of interaction, such as voice interaction.  Well-described chunks of content are like well-described buttons, sliders and other forms of interactive web elements.  The only difference is that they are more interesting.  They have something to say.

Some of the following material will assume background knowledge about metadata.  If you need more context, consult my very approachable book, Metadata Basics for Web Content.

What is Structural Metadata?

Structural metadata is data about the structure of content. In some ways it is not mysterious at all. Every time you write a paragraph and enclose it within a <p> paragraph element, you’ve created some structural metadata. But structural metadata entails far more than basic HTML tagging. It gives machines data about how to deliver the content to audiences. When structural metadata is considered just a fancy name for HTML tagging, much of its potency gets missed.
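As a minimal illustration of the difference between generic tagging and richer structural markup (the element and class choices below are my own, not drawn from any particular standard):

```html
<!-- Generic markup: a machine knows only that these are paragraphs -->
<p>Overview of the product.</p>
<p>Handle the battery with care.</p>

<!-- Structural markup: a machine also knows the role each part plays,
     and can deliver the warning differently from the introduction -->
<section class="introduction">
  <p>Overview of the product.</p>
</section>
<aside class="warning">
  <p>Handle the battery with care.</p>
</aside>
```

With the second version, a delivery system could, for example, surface the warning on its own, something the undifferentiated paragraphs do not permit.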

The concept of structural metadata originated in the library and records management field around 20 years ago. To understand where structural metadata is heading, it pays to look at how it has been defined already.

In 1996, a metadata initiative known as the Warwick Framework first identified structural metadata as “data defining the logical components of complex or compound objects and how to access those components.”

In 2001, a group of archivists, who need to keep track of the relationships between different items of content, came up with a succinct definition:  “Structural metadata can be thought of as the glue that binds compound objects together.”

By 2004, the National Information Standards Organization (NISO) was talking about structural metadata in their standards.  According to their definition in the z39.18 standard, “Structural metadata explain the relationship between parts of multipart objects and enhance internal navigation. Such metadata include a table of contents or list of figures and tables.”

Louis Rosenfeld and Peter Morville introduced the concept of structural metadata to the web community in their popular book, Information Architecture for the World Wide Web (the “Polar Bear” book). Rosenfeld and Morville use the structural metadata concept as a prompt for defining the information architecture of a website:

“Describe the information hierarchy of this object. Is there a title? Are there discrete sections or chunks of content? Might users want to independently access these chunks?”

A big theme of all these definitions is the value of breaking content into parts. The bigger the content, the more it needs breaking down. The structural metadata for a book relates to its components: the table of contents, the chapters, parts, index, and so on. It helps us understand what kinds of material are within the book and lets us access specific sections, even if it doesn’t tell us all the specific things the book discusses. This is important information which, surprisingly, wasn’t captured when Google undertook its massive book digitization initiative a number of years ago. When the books were scanned, each book became one big file, like a PDF. Finding a specific figure or table within a book on Google Books requires searching or scrolling to navigate through the book.

The contents of scanned books in Google Books lack structural metadata, limiting the value of the content.

Navigation is an important purpose of structural metadata: to access specific content, such as a specific book chapter.  But structural metadata has an even more important purpose than making big content more manageable.  It can unbundle the content, so that the content doesn’t need to stay together. People don’t want to start with the whole book and then navigate through it to get to a small part in which they are interested. They want only that part.

In his recent book Metadata, Richard Gartner touches on a more current role for structural metadata: “it defines structures that bring together simpler components into something larger that has meaning to a user.” He adds that such information “builds links between small pieces of data to assemble them into a more complex object.”

In web content, structural metadata plays an important role in assembling content. When content is unbundled, it can be rebundled in various ways. Structural metadata identifies the components within content types. It indicates the role of the content, such as whether the content is an introduction or a summary.

Structural metadata plays a different role today than it did in the past, when the assumption was that there was one fixed piece of large content that would be broken into smaller parts, identified by structural metadata.  Today, we may compose many larger content items, leveraging structural metadata, from smaller parts.

The idea of assembling content from smaller parts has been promoted in particular by DITA evangelists such as Ann Rockley (DITA is a widely used framework for technical documentation). Rockley uses the phrase “semantic structures” to refer to structural metadata, which she says “enable(s) us to understand ‘what’ types of content are contained within the documents and other content types we create.” Rockley’s discussion helpfully makes reference to content types, which some other definitions don’t explicitly mention. She also introduces another concept with a similar-sounding name, “semantically rich” content, to refer to a different kind of metadata: descriptive metadata. In XML (which is used to represent DITA), the term semantic is used generically for any element. Yet the difference between structural and descriptive metadata is significant, though it is often obscured, especially in the XML syntax.

Curiously, semantic web developments haven’t focused much on structural metadata for content (though I see a few indications that this is starting to change).  Never assume that when someone talks about making content semantic, they are talking about adding structural metadata.

Don’t Confuse Structural and Descriptive Metadata

When information professionals refer to metadata, most often they are talking about descriptive metadata concerning people, places, things, and events. Descriptive metadata indicates the key information included within the content. It typically describes the subject matter of the content, and is sometimes detailed and extensive. It helps one discover what the content is about, prior to viewing the content. Traditionally, descriptive metadata was about creating an external index, a proxy, such as assigning keywords or subject headings to the content. Over the past 20 years, descriptive metadata has evolved to describing the body of the content in detail, noting entities and their properties.

Richard Gartner refers to descriptive metadata as “finding metadata”: it locates content that contains some specific information.  In modern web technology, it means finding values for a specific field (or property).  These values are part of the content, rather than separate from it.  For example, find smartphones with dual SIMs that are under $400.  The  attributes of SIM capacity and price are descriptive metadata related to the content describing the smartphones.
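The smartphone example could be carried as descriptive metadata embedded in the content itself, for instance with the schema.org vocabulary; the product name and values below are invented for illustration:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Phone X",
  "additionalProperty": {
    "@type": "PropertyValue",
    "name": "SIM slots",
    "value": "2"
  },
  "offers": {
    "@type": "Offer",
    "price": "379.00",
    "priceCurrency": "USD"
  }
}
</script>
```

A search service could then answer a query like “dual-SIM phones under $400” by filtering on these property values rather than parsing prose.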

Structural metadata indicates how people and machines can use the content.  If people see a link indicating a slideshow, they have an expectation of how such content will behave, and will decide if that’s the sort of content they are interested in.  If a machine sees that the content is a table, it uses that knowledge to format the content appropriately on a smartphone, so that all the columns are visible.  Machines rely extensively on structural metadata when stitching together different content components into a larger content item.

Structural and descriptive metadata can be indicated in the same HTML tag. This tag indicates the start of an introductory section discussing Albert Einstein.
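One sketch of how a single tag can carry both kinds of metadata, using RDFa-style attributes; the attribute values here are illustrative, not taken from a particular schema:

```html
<!-- typeof carries structural metadata (this is an introduction section);
     about carries descriptive metadata (the subject is Albert Einstein) -->
<section typeof="Introduction"
         about="https://dbpedia.org/resource/Albert_Einstein">
  <p>Albert Einstein reshaped physics twice before the age of forty.</p>
</section>
```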

Structural metadata sometimes is confused with descriptive metadata because many people use vague terms such as “structure” and “semantics” when discussing content. Some people erroneously believe that structuring content makes the content “semantic”.  Part of this confusion derives from having an XML-orientation toward content.  XML tags content with angle-bracketed elements. But XML elements can be either structures such as sections, or they can be descriptions such as names.  Unlike HTML, where elements signify content structure while descriptions are indicated in attributes, the XML syntax creates a monster hierarchical tree, where content with all kinds of roles are nested within elements.  The motley, unpredictable use of elements in XML is a major reason it is unpopular with developers, who have trouble seeing what role different parts of the content have.

The buzzword “semantically structured content” is particularly unhelpful, as it conflates two different ideas together: semantics, or what content means, with structure, or how content fits together.  The semantics of the content is indicated by descriptive metadata, while the structure of the content is indicated by structural metadata.  Descriptive metadata can focus on a small detail in the content, such as a name or concept (e.g., here’s a mention of the Federal Reserve Board chair in this article).  Structural metadata, in contrast, generally focuses on a bigger chunk of content: here’s a table, here’s a sidebar.   To assemble content, machines need to distinguish what the specific content means, from what the structure of the content means.

Interest in content modeling has grown recently, spurred by the desire to reuse content in different contexts. Unfortunately, most content models I’ve seen don’t address metadata at all; they just assume that the content can be pieced together. The models almost never distinguish between the properties of different entities (descriptive metadata) and the properties of different content types (structural metadata). This can lead to confusion. For example, a place has an address, and that address can be used in many kinds of content. You may have specific content types dedicated to discussing places (perhaps tourist destinations) and want to include address information. Alternatively, you may need to include the address information in content types that are focused on other purposes, such as a membership list. Unless you make a clear distinction in the content model between what’s descriptive metadata about entities and what’s structural metadata about content types, many people will be inclined to think there is a one-to-one correspondence between entities and content types, for example, that all addresses belong to the content type discussing tourist destinations.

Structural metadata isn’t merely a technical issue to hand off to a developer.  Everyone on a content team who is involved with defining what content gets delivered to audiences, needs to jointly define what structural metadata to include in the content.

Three More Reasons Structural Metadata Gets Ignored…

Content strategists have inherited frameworks for working with metadata from librarians, database experts, and developers. None of those roles involves creating content, and their perspective on content is an external one, rather than an internal one. These hand-me-down concepts don’t fit the needs of online content creators and publishers very well. It’s important not to be misled by legacy ideas about structural metadata that were developed by people who aren’t content creators and publishers. Structural metadata gets sidelined when people fail to focus on the value that content parts can contribute in different scenarios.

Reason 1: Focus on Whole Object Metadata

Librarians have given little attention to structural metadata, because they’ve been most concerned with cataloging and  locating things that have well defined boundaries, such as books and articles (and most recently, webpages).  Discussion of structural metadata in library science literature is sparse compared with discussions of descriptive and administrative metadata.

Until recently, structural metadata has focused on identifying parts within a whole.  Metadata specialists assumed that a complete content item existed (a book or document), and that structural metadata would be used to locate parts within the content.  Specifying structural metadata was part of cataloging existing materials. But given the availability of free text searching and more recently natural language processing, many developers question the necessity of adding metadata to sub-divide a document. Coding structural metadata seemed like a luxury, and got ignored.

In today’s web, content exists as fragments that can be assembled in various ways.  A document or other content type is a virtual construct, awaiting components. The structural metadata forms part of the plan for how the content can fit together. It’s important to define the pieces first.

Reason 2: Confusion with Metadata Schemas

I’ve recently seen several cases where content strategists and others mix up the concept of structural metadata, with the concept of metadata structure, better known as metadata schemas.  At first I thought this confusion was simply the result of similar sounding terms.  But I’ve come to realize that some database experts refer to structural metadata in a different way than it is being used by librarians, information architects, and content engineers.  Some content strategists seem to have picked up this alternative meaning, and repeat it.

Compared to semi-structured web content, databases are highly regular in structure. They are composed of tables of rows and columns. The first column of a row typically identifies what the values relate to. Some database admins refer to those keys or properties as the structure of the data, or the structural metadata. For example, the OECD, the international economic organization, says: “Structural metadata refers to metadata that act as identifiers and descriptors of the data. Structural metadata are needed to identify, use, and process data matrixes and data cubes.” What is actually being referred to is the schema of the data table.

Database architects develop many custom schemas to organize their data in tables.  Those schemas are very different from the standards-based structural metadata used in content.  Database tables provide little guidance on how content should be structured.  Content teams shouldn’t rely on a database expert to guide them on how to structure their content.

Reason 3: Treated as Ordinary Code

Web content management systems are essentially big databases built in programming languages like PHP or .Net. There’s a proclivity among developers to treat chunks of content as custom variables. As one developer noted when discussing WordPress: “In WordPress (WP), the meaning of Metadata is a bit fuzzier. It stores post metadata such as custom fields and additional metadata added via plugins.”

As I’ve noted elsewhere, many IT systems that manage content ignore web metadata standards, resulting in silos of content that can’t work together. It’s not acceptable to define chunks of content as custom variables. The purpose of structural metadata is to allow different chunks of content to connect with each other. CMSs need to rely on web standards for their structural metadata.
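As one sketch of the difference, compare a CMS-specific custom field with a standards-based alternative; the `doc-abstract` role comes from the W3C DPUB-ARIA vocabulary, while the custom field name is hypothetical:

```html
<!-- CMS custom variable: only this one system knows what it means -->
<!-- post_meta["_my_summary"] = "A short summary of the article." -->

<!-- Standards-based structural metadata: any consumer that understands
     DPUB-ARIA can recognize the summary section -->
<article>
  <section role="doc-abstract">
    <p>A short summary of the article.</p>
  </section>
  <section>
    <p>The full article body.</p>
  </section>
</article>
```

The second form travels with the content itself, so a different CMS, a search engine, or an assembly pipeline can all identify the summary without knowing anything about the originating system.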

Current Practices for Structural Metadata

For machines to piece together content components into a coherent whole, they need to know the standards for the structural metadata.

Until recently, structural metadata has been indicated only during the prepublication phase, an internal operation where standards were less important.  Structural metadata was marked up in XML together with other kinds of metadata, and transformed into HTML or PDF.  Yet a study in the journal Semantic Web last year noted: “Unfortunately, the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, expressed in bespoke document type definitions (DTDs). There is thus a need to integrate these different languages into a single, unifying framework that may be used for all content.”

XML continues to be used in many situations. But a recent trend has been to adopt more lightweight approaches, using HTML to publish content directly. Bypassing XML is often simpler, though the plainness of HTML creates some issues as well.

As Jeff Eaton has noted, getting specific about the structure of content using HTML elements is not always easy:

“We have workhorse elements like ul, div, and span; precision tools like cite, table, and figure; and new HTML5 container elements like section, aside, and nav. But unless our content is really as simple as an unattributed block quote or a floated image, we still need layers of nested elements and CSS classes to capture what we really mean.”

Because HTML elements are not very specific, publishers often don’t know how to represent structural metadata within HTML.  We can learn from the experience of publishers who have used XML to indicate structure, and who are adapting their structures to HTML.

Scientific research and technical documentation are two genres where content structure is well established and structural metadata is mature.  Both genres have explored how to indicate the structure of their content in HTML.

Scientific research papers are a distinct content type that follows a regular pattern. The National Library of Medicine’s Journal Article Tag Suite (JATS) formalizes the research paper structure into a content type as an XML schema.  It provides a mixture of structural and descriptive metadata tags that are used to publish biomedical and other scientific research.  The structure might look like:

<sec sec-type="intro">
<sec sec-type="materials|methods">
<sec sec-type="results">
<sec sec-type="discussion">
<sec sec-type="conclusions">
<sec sec-type="supplementary-material" ... >

Scholarly HTML is an initiative to translate the typical sections of a research paper into common HTML.  It uses HTML elements, and supplements them with typeof attributes to indicate more specifically the role of each section.  Here’s an example of some attribute values in their namespace, noted by the prefix “sa”:

<section typeof="sa:MaterialsAndMethods">
<section typeof="sa:Results">
<section typeof="sa:Conclusion">
<section typeof="sa:Acknowledgements">
<section typeof="sa:ReferenceList">

As we can see, these sections overlap with the JATS, since both are describing similar content structures.  The Scholarly HTML initiative is still under development, and it could eventually become a part of the schema.org effort.

DITA — the technical documentation architecture mentioned earlier — is a structural metadata framework that embeds some descriptive metadata.  DITA structures topics, which can be different information types: Task, Concept, Reference, Glossary Entry, or Troubleshooting, for example.  Each type is broken into structural elements, such as title, short description, prolog, body, and related links.  DITA is defined in XML, and uses many idiosyncratic tags.

HDITA is a draft syntax to express DITA in HTML.  It converts DITA-specific elements into HTML attributes, using the custom data-* attribute.  For example, a “key definition” element <keydef> becomes an attribute within an HTML element, e.g. <div data-hd-class="keydef">.  Types are expressed with the attribute data-hd-type.
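To make this concrete, here is a sketch of what an HDITA-style topic might look like. The mapping pattern follows the draft described above, but the specific attribute values and content are illustrative, not taken from the HDITA specification:

```html
<!-- Hypothetical HDITA-style topic: DITA semantics carried by data-hd-* attributes -->
<article data-hd-type="task" id="configure-widget">
  <h1>Configure the widget</h1>
  <p data-hd-class="shortdesc">How to configure the widget after installation.</p>
  <ol data-hd-class="steps">
    <li data-hd-class="step">Open the settings panel.</li>
    <li data-hd-class="step">Choose a configuration and save.</li>
  </ol>
</article>
```

The HTML remains renderable in any browser; the data-hd-* attributes simply preserve the DITA type and class information for tools that know to look for it.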

The use of data-* attributes offers some advantages, such as JavaScript access by clients.  They are not, however, intended for use as a cross-publisher metadata standard. The W3C notes: “A custom data attribute is an attribute in no namespace…intended to store custom data private to the page or application.”  It adds:

“These attributes are not intended for use by software that is not known to the administrators of the site that uses the attributes. For generic extensions that are to be used by multiple independent tools, either this specification should be extended to provide the feature explicitly, or a technology like microdata should be used (with a standardized vocabulary).”

The HDITA drafting committee appears to use “hd” in the data attribute to signify that the attribute is specific to HDITA.  But they have not declared a namespace for these attributes (the XML namespace for DITA is xmlns:ditaarch.)  This will prevent automatic machine discovery of the metadata by Google or other parties.

The Future of Structural Metadata

Most recently, several initiatives have explored possibilities for extending structural metadata in HTML.  These revolve around three distinct approaches:

  1. Formalizing structural metadata as properties
  2. Using WAI-ARIA to indicate structure
  3. Combining class attributes with other metadata schemas

New Vocabularies for Structures

The web standards community is starting to show more interest in structural metadata.  Earlier this year, the W3C released the Web Annotation Vocabulary.  It provides properties to indicate comments about content.  Comments are an important structure in web content, used in many genres and scenarios. Imagine readers highlighting passages of text: for such annotations to be captured, there must be a way to indicate what part of the text is being referenced.  The annotation vocabulary can reference specific HTML elements and even CSS selectors within a body of text.
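For instance, an annotation in the Web Annotation model can target a passage through a CSS selector. The sketch below follows the W3C vocabulary, though the page URL, selector value, and comment text are hypothetical:

```html
<script type="application/ld+json">
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "body": {
    "type": "TextualBody",
    "value": "This passage summarizes the argument."
  },
  "target": {
    "source": "http://example.org/article",
    "selector": {
      "type": "CssSelector",
      "value": "#introduction > p:first-of-type"
    }
  }
}
</script>
```

An annotation client can resolve the CssSelector against the page to locate exactly which element the comment refers to.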

Outside of the W3C, a European academic group has developed the Document Components Ontology (DoCO), “a general-purpose structured vocabulary of document elements.”  It is a detailed set of properties for describing common structural features of text content.  The DoCO vocabulary can be used by anyone, though its initial adoption will likely be limited to research-oriented publishers.  However, many specialized vocabularies such as this one have become extensions to schema.org.  If DoCO were in some form absorbed by schema.org, its usage would increase dramatically.

[Diagram: the Document Components Ontology (DoCO)]

WAI-ARIA

WAI-ARIA is commonly thought of as a means to make functionality accessible.  However, it should be considered more broadly as a means to enhance the functionality of web content overall, since it helps web agents understand the intentions of the content. WAI-ARIA can indicate many dynamic content structures, such as alerts, feeds, marquees, and regions.
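A minimal sketch of how some of these core roles appear in markup (the content is invented for illustration):

```html
<!-- An alert announced to assistive technology when it appears -->
<div role="alert">Your session will expire in five minutes.</div>

<!-- A labeled region containing a feed of dynamically loaded items -->
<section role="region" aria-label="Live updates">
  <div role="feed" aria-busy="false">
    <article>First update…</article>
    <article>Second update…</article>
  </div>
</section>
```

Nothing here changes how the content renders; the roles tell web agents what kind of structure each block is.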

The new Digital Publishing WAI-ARIA developed out of the EPUB standards, which have a richer set of structural metadata than is available in standard HTML5.  The goal of the Digital Publishing WAI-ARIA is to “produce structural semantic extensions to accommodate the digital publishing industry”.  It defines the following structural roles:

  • doc-abstract
  • doc-acknowledgments
  • doc-afterword
  • doc-appendix
  • doc-backlink
  • doc-biblioentry
  • doc-bibliography
  • doc-biblioref
  • doc-chapter
  • doc-colophon
  • doc-conclusion
  • doc-cover
  • doc-credit
  • doc-credits
  • doc-dedication
  • doc-endnote
  • doc-endnotes
  • doc-epigraph
  • doc-epilogue
  • doc-errata
  • doc-example
  • doc-footnote
  • doc-foreword
  • doc-glossary
  • doc-glossref
  • doc-index
  • doc-introduction
  • doc-noteref
  • doc-notice
  • doc-pagebreak
  • doc-pagelist
  • doc-part
  • doc-preface
  • doc-prologue
  • doc-pullquote
  • doc-qna
  • doc-subtitle
  • doc-tip
  • doc-toc


To indicate the structure of a text box showing an example:

<aside role="doc-example">
  <h1>An Example of Structural Metadata in WAI-ARIA</h1>
  …
</aside>

Content expressing a warning might look like this:

<div role="doc-notice" aria-label="Explosion Risk">
  <p><em>Danger!</em> Mixing reactive materials may cause an explosion.</p>
</div>

Although book-focused, the Digital Publishing WAI-ARIA roles provide a rich set of structural elements that can be used with many kinds of content.  In combination with the core WAI-ARIA roles, these attributes can describe the structure of web content in extensive detail.

CSS as Structure

For a long while, developers have been creating pseudo structures using CSS, such as making infoboxes to enclose certain information. Class is a global attribute of HTML, but has become closely associated with CSS, so much so that some believe that is its only purpose.  Yet Wikipedia notes: “The class attribute provides a way of classifying similar elements. This can be used for semantic purposes, or for presentation purposes.”  Some developers use what are called “semantic classes” to indicate what content is about.  The W3C advises when using the class attribute: “authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.”

Some developers claim that the class attribute should never be used to indicate the meaning of content within an element, because HTML elements will always make that clear. I agree that web content should never use the class attribute as a substitute for using a meaningful HTML element. But the class attribute can sometimes further refine the meaning of an HTML element. Its chief limitation is that class names involve private meanings. Yet if they are self-describing they can be useful.
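For example, a class can refine an element that is already meaningful; the self-describing class names below are hypothetical:

```html
<!-- <figure> indicates a figure; the class indicates what kind -->
<figure class="infobox">
  <figcaption>Quick facts</figcaption>
  <p>Founded: 1996</p>
</figure>

<!-- <blockquote> indicates quoted content; the class marks it as a pull quote -->
<blockquote class="pullquote">Structure carries meaning.</blockquote>
```

In each case the HTML element supplies the base meaning, and the class narrows it rather than replacing it.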

Class attributes are useful for selecting content, but they operate outside of metadata standards.  However, schema.org is proposing a property that will allow class values to be specified within schema.org metadata.  This has potentially significant implications for extending the scope of structural metadata.

The motivating use case is as follows: “There is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page.”

The pending cssSelector property in schema.org can identify named portions of a web page.  The class could be a structure such as a summary or a headline that would be more specific than an HTML element.  The cssSelector has a companion property called xpath, which identifies HTML elements positionally, such as the paragraphs after h2 headings.
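A sketch of how the pending properties might address the read-aloud use case, expressed as JSON-LD. Since the vocabulary is still pending, the exact shape may change, and the class names and URL here are hypothetical:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "WebPage",
  "url": "http://example.org/article",
  "speakable": {
    "@type": "SpeakableSpecification",
    "cssSelector": [".headline", ".summary"]
  }
}
</script>
```

A voice agent reading this page could then speak only the elements matched by .headline and .summary.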

These features are not yet fully defined. In addition to indicating speakable content, the cssSelector can indicate parts of a web page. According to a Github discussion: “The ‘cssSelector’ (and ‘xpath’) property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath.  Note that this isn’t ‘element’ in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.”  This could be useful for selecting content targeted at specific devices.

The class attribute can identify structures within the web content, working together with entity-focused properties that describe specific data relating to the content.  Both of these indicate content variables, but they deliver different benefits.

Entity-based (descriptive) metadata can be used for content variables about specific information. These will often be text or numeric variables. Use descriptive metadata variables when choosing what informational details to put in a message.

Structural metadata can be used for phrase-based variables, indicating reusable components.  Phrases can be either blocks (paragraphs or divs) or snippets (a span).  Use structural metadata variables when choosing the wording to convey a message in a given scenario.
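To illustrate the difference, here is a sketch of a message template combining both kinds of variables (the class names are hypothetical; orderQuantity is a schema.org property used here for illustration):

```html
<p class="notice-shipping">
  <!-- descriptive (entity) variable: an informational detail -->
  Your order of <span itemprop="orderQuantity">3</span> items
  <!-- structural variable: a reusable phrase chosen for this scenario -->
  <span class="phrase-delivery-standard">should arrive within five days</span>.
</p>
```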

A final interesting point about cssSelector in schema.org: like other properties in schema.org, it can be expressed either as inline markup in HTML (microdata) or as an external JSON-LD script.  This gives developers the flexibility to choose whether to use coding libraries that are optimized for arrays (JSON-flavored) or ones focused on selectors.  For too long, what metadata gets included has been influenced by developer preferences in coding libraries.  The fact that CSS selectors can be expressed as JSON suggests that hurdle is being transcended.


Structural metadata is finally getting some love in the standards community, even though awareness of it remains low among developers.  I hope that content teams will consider how they can use structural metadata to be more precise in indicating what their content does, so that it can be used flexibly in emerging scenarios such as voice interactions.

— Michael Andrews

The post Structural Metadata: Key to Structured Content appeared first on Story Needle.

Books from 1923 to 1941 Now Liberated!

Internet Archive - 10 October 2017 - 9:18pm

The Internet Archive is now leveraging a little known, and maybe never used, provision of US copyright law, Section 108(h), which allows libraries to scan and make available materials published 1923 to 1941 if they are not being actively sold. Elizabeth Townsend Gard, a copyright scholar at Tulane University, calls this “Library Public Domain.”  She and her students helped bring the first scanned books of this era online in a collection named for the author of the bill making this necessary: The Sonny Bono Memorial Collection. Thousands more books will be added in the near future as we automate. We hope this will encourage libraries that have been reticent to scan beyond 1923 to start mass scanning their books and other works, at least up to 1942.

While good news, it is too bad it is necessary to use this provision.

Trend of Maximum U.S. General Copyright Term by Tom W Bell

If the Founding Fathers had their way, almost all works from the 20th century would be public domain by now (14-year copyright term, renewable once if you took extra actions).

Some corporations saw adding works to the public domain to be a problem, and when Sonny Bono got elected to the House of Representatives, representing part of Los Angeles, he helped push through a law extending copyright’s duration another 20 years to keep things locked-up back to 1923.  This has been called the Mickey Mouse Protection Act due to one of the motivators behind the law, but it was also a result of Europe extending copyright terms an additional twenty years first. If not for this law, works from 1923 and beyond would have been in the public domain decades ago.

Lawrence Lessig

Creative Commons founder, Larry Lessig fought the new law in court as unreasonable, unneeded, and ridiculous.  In support of Lessig’s fight, the Internet Archive made an Internet bookmobile to celebrate what could be done with the public domain. We drove the bookmobile across the country to the Supreme Court to make books during the hearing of the case. Alas, we lost.

Internet Archive Bookmobile in front of
Carnegie Library in Pittsburgh: “Free to the People”

There is an exemption from this extension of copyright, but only for libraries and only for works that are not actively for sale: we can scan them and make them available. Professor Townsend Gard had two legal interns work with the Internet Archive last summer to find out how we can automate finding appropriate scanned books that could be liberated, and they hand-vetted the first books for the collection. Professor Townsend Gard has just released an in-depth paper giving libraries guidance on how to implement Section 108(h), based on her work with the Archive and other libraries. Together, we have called them “Last Twenty” Collections, as libraries and archives can copy and distribute to the general public qualified works in the last twenty years of their copyright.

Today we announce the “Sonny Bono Memorial Collection” containing the first books to be liberated. Anyone can download, read, and enjoy these works that have been long out of print. We will add another 10,000 books and other works in the near future. “Working with the Internet Archive has allowed us to do the work to make this part of the law usable,” reflected Professor Townsend Gard. “Hopefully, this will be the first of many ‘Last Twenty’ Collections around the country.”

Now is the chance for libraries and citizens who have been reticent to scan works beyond 1923 to push forward to 1941, and the Internet Archive will host them. “I’ve always said that the silver lining of the unfortunate Eldred v. Ashcroft decision was the response from people to do something, to actively begin to limit the power of the copyright monopoly through action that promoted open access and CC licensing,” says Carrie Russell, Director of ALA’s Program of Public Access to Information. “As a result, the academy and the general public has rediscovered the value of the public domain. The Last Twenty project joins the Internet Archive, the HathiTrust copyright review project, and the Creative Commons in amassing our public domain to further new scholarship, creativity, and learning.”

We thank and congratulate Team Durationator and Professor Townsend Gard for all the hard work that went into making this new collection possible. Professor Townsend Gard, along with her husband, Dr. Ron Gard, have started a company, Limited Times, to assist libraries, archives, and museums implementing Section 108(h), “Last Twenty” collections, and other aspects of the copyright law.

Prof. Elizabeth Townsend Gard · Tomi Aina, Law Student · Stan Sater, Law Student
Hundreds of thousands of books can now be liberated. Let’s bring the 20th century to 21st-century citizens. Everyone, rev your cameras!

Internet Archive’s Annual Bash this Wednesday! — Get your tickets now before we run out!

Internet Archive - 10 October 2017 - 12:32am

Limited tickets left for 20th Century Time Machine — the Internet Archive’s Annual Bash – happening this Wednesday at the Internet Archive from 5pm-9:30pm. In case you missed it, here’s our original announcement.

Tickets start at $15 here.

Once tickets sell out, you’ll have the opportunity to join the waitlist. We’ll release tickets as spaces free up and let you know via email.

We’d love to celebrate with you!

History is happening, and we’re not just watching

Internet Archive - 10 October 2017 - 12:02am
  1. Which recent hurricane got the least amount of attention from TV news broadcasters?
    1. Irma
    2. Maria
    3. Harvey
  2. Thomas Jefferson said, “Government that governs least governs best.”
    1. True
    2. False
  3. Mitch McConnell shows up most on which cable TV news channel?
    1. CNN
    2. Fox News
    3. MSNBC

Answers at end of post.

The Internet Archive’s TV News Archive, our constantly growing online, free library of TV news broadcasts, contains 1.4 million shows, some dating back to 2009, searchable by closed captioning. History is happening, and we preserve how broadcast news filters it to us, the audience, whether it’s through CNN’s Jake Tapper, Fox’s Bill O’Reilly, MSNBC’s Rachel Maddow or others. This archive becomes a rich resource for journalists, academics, and the general public to explore the biases embedded in news coverage and to hold public officials accountable.

Last October we wrote how the Internet Archive’s TV News Archive was “hacking the election,” then 13 days away. In the year since, we’ve been applying our experience using machine learning to track political ads and TV news coverage in the 2016 elections to experiment with new collaborations and tools to create more ways to analyze the news.

Helping fact-checkers

Since we launched our Trump Archive in January 2017, and followed in August with the four congressional leaders, Democrat and Republican, as well as key executive branch figures, we’ve collected some 4,534 hours of curated programming and more than 1,300 fact-checks of material on subjects ranging from immigration to the environment to elections.


The 1,340 fact-checks–and counting–represent a subset of the work of partners FactCheck.org, PolitiFact, and The Washington Post’s Fact Checker, as we link only to fact-checks that correspond to statements that appear on TV news. Most of the fact-checks (524) come from PolitiFact; 492 are by FactCheck.org, and 324 from The Washington Post’s Fact Checker.

We’re also proud to be part of the Duke Reporter’s Lab’s new Tech & Check collaborative, where we’re working with journalists and computer scientists to develop ways to automate parts of the fact-checking process.  For example, we’re creating processes to help identify important factual claims within TV news broadcasts to help guide fact-checkers where to concentrate their efforts. The initiative received $1.2 million from the John S. and James L. Knight Foundation, the Facebook Journalism Project and the Craig Newmark Foundation.

See the Trump, US Congress, and executive branch archives and collected fact-checks.

TV News Kitchen

We’re collaborating with data scientists, private companies and nonprofit organizations, journalists, and others to cook up new experiments available in our TV News Kitchen, providing new ways to analyze TV news content and understand ourselves.

Dan Schultz, our senior creative technologist, worked with the start-up Matroid to develop Face-o-Matic, which tracks faces of selected high level elected officials on major TV cable news channels: CNN, Fox News, MSNBC, and BBC News. The underlying data are available for download here. Unlike caption-based searches, Face-o-Matic uses facial recognition algorithms to recognize individuals on TV news screens. It is sensitive enough to catch this tiny, dark image of House Minority Leader Nancy Pelosi, D., Calif., within a graphic, and this quick flash of Senate Minority Leader Chuck Schumer, D., N.Y., and Senate Majority Leader Mitch McConnell, R., Ky.

Third Eye, the work of TV Architect Tracey Jaquith, scans the lower thirds of TV screens, using OCR (optical character recognition) to turn these fleeting missives into downloadable data ripe for analysis. Launched in September 2017, Third Eye tracks BBC News, CNN, Fox News, and MSNBC; it collected more than four million chyrons in just over two weeks, and counting.

Download Third Eye data. API and TSV options available.

Follow Third Eye on Twitter.

Vox news reporter Alvin Chang used the Third Eye chyron data to report how Fox News paid less attention to Hurricane Maria’s destruction in Puerto Rico than it did to Hurricanes Irma and Harvey, which battered Florida and Texas. Chang’s work followed a similar piece by Dhrumil Mehta for FiveThirtyEight, which used Television Explorer, a tool developed by data scientist Kalev Leetaru to search and visualize closed captioning on the TV News Archive.


FiveThirtyEight used TV News Archive captions to create this look at how cable networks covered recent hurricanes.

CNN’s Brian Stelter followed up with a similar analysis on “Reliable Sources” October 1.

We’re also working with academics who are using our tools to unlock new insights. For example, Schultz and Jaquith are working with Bryce Dietrich at the University of Iowa to apply the Duplitron, the audio fingerprinting tool that fueled our political ad airing data, to analyze floor speeches of members of Congress. The study identifies which floor speeches were aired on cable news programs and explores the reasons why those particular clips were selected for airing. A draft of the paper was presented at the 2017 PolInformatics Workshop in Seattle and will begin review for publication in the coming months.

What’s next? Our plans include making more than a million hours of TV news available to researchers from both private and public institutions via a digital public library branch of the Internet Archive’s TV News Archive. These branches would be housed in computing environments, where networked computers provide the processing power needed to analyze large amounts of data. Researchers will be able to conduct their own experiments using machine learning to extract metadata from TV news. Such metadata could include, for example, speaker identification–a way to identify not just when a speaker appears on a screen, but when she or he is talking. Metadata generated through these experiments would then be used to enrich the TV News Archive, so that any member of the public could do increasingly sophisticated searches.

Going global

We live in an interdependent world, but we often lack understanding about how other cultures perceive us. Collecting global TV could open a new window for journalists and researchers seeking to understand how political and policy messages are reported and spread across the globe. The same tools we’ve developed to track political ads, faces, chyrons, and captions can help us put news coverage from around the globe into perspective.

We’re beginning work to expand our TV collection to include more channels from around the globe. We’ve added the BBC and recently began collecting Deutsche Welle from Germany and the English-language Al Jazeera. We’re talking to potential partners and developing strategy about where it’s important to collect TV and how we can do so efficiently.

History is happening, but we’re not just watching. We’re collecting, making it accessible, and working with others to find new ways to understand it. Stay tuned. Email us at tvnews@archive.org. Follow us @tvnewsarchive, and subscribe to our weekly newsletter here.

Answer Key

  1. b. (See: “The Media Really Has Neglected Puerto Rico,” FiveThirtyEight.)
  2. b. False. (See: Vice President Mike Pence statement and linked PolitiFact fact-check.)
  3. c. MSNBC. (See: Face-O-Matic blog post.)

Members of the TV News Archive team: Roger Macdonald, director; Robin Chin, Katie Dahl, Tracey Jaquith, Dan Schultz, and Nancy Watzman.

TV News Record: 1,340 fact checks collected and counting

Internet Archive - 5 October 2017 - 2:07pm

A weekly round up on what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.

In an era when social media algorithms skew what people see online, the Internet Archive TV News Archive’s collections of on-the-record statements by top political figures serve as a powerful model for how preservation can provide a deep resource for who really said what, when, and where.

Since we launched our Trump Archive in January 2017, and followed in August with the four congressional leaders, Democrat and Republican, as well as key executive branch figures, we’ve collected some 4,534 hours of curated programming and more than 1,300 fact-checks of material on subjects ranging from immigration to the environment to elections.

The 1,340 fact-checks–and counting–represent a subset of the work of partners FactCheck.org, PolitiFact, and The Washington Post’s Fact Checker, as we link only to fact-checks that correspond to statements that appear on TV news. Most of the fact-checks (524) come from PolitiFact; 492 are by FactCheck.org, and 324 from The Washington Post’s Fact Checker.

As a library, we’re dedicated to providing a record – sometimes literally, as in the case of 78s! – that can help researchers, journalists, and the public find trustworthy sources for our collective history. These clip collections, along with fact-checks, now largely hand-curated, provide a quick way to find public statements made by elected officials.

See the Trump, US Congress, and executive branch archives and collected fact-checks.

The big picture

Given his position at the helm of the government, it is not surprising that Trump garners most of the fact-checking attention.  Three out of four, or 1008 of the fact-checks, focus on Trump’s statements. Another 192 relate to the four congressional leaders: Senate Majority Leader Mitch McConnell, R., Ky.; Senate Minority Leader Chuck Schumer, D., N.Y.; House Speaker Paul Ryan, R., Wis.; and House Minority Leader Nancy Pelosi, D., Calif. We’ve also logged 140 fact-checks related to key administration figures such as Sean Spicer, Jeff Sessions, and Mike Pence.

[Pie chart: breakdown of fact-checks among Trump, congressional leaders, and administration officials]

The topics

The topics covered by fact-checkers run the gamut of national and global policy issues, history, and everything in between. For example, the debate on tax reform is grounded with fact-checks of the historical and global context posited by the president. Fact-checkers have also examined his aides’ claims on the impact of the current reform proposal on the wealthy and on the deficit. They’ve also followed the claims made by House Speaker Paul Ryan, R., Wis., the leading GOP policy voice on tax reform.

Another large set of fact-checks cover health care, going back as far as this claim made in 2010 by Pelosi about job creation under healthcare reform (PolitiFact rated it “Half True.”) The most recent example is the Graham-Cassidy bill that aimed to repeal much of Obamacare. One of the most sharply contested debates about that legislation was whether or not it would require coverage of people with pre-existing conditions. Fact-checkers parsed the he-said he-said debate as it unfolded on TV news, for example examining dueling claims by Schumer and Trump.

Browse or download fact-checked TV clips by topic

The old stuff

The collection of Trump fact-checks includes a few dating back to 2011, long before his successful presidential campaign. Here he is at the CPAC conference that year claiming no one remembered now-former President Barack Obama from school, part of his campaign to question Obama’s citizenship. (PolitiFact rated: “Pants on Fire!”) And here he is with what FactCheck.org called a “100 percent wrong” claim about the Egyptian people voting to overturn a treaty with Israel.

This fact-check of McConnell dates back to 2009, when PolitiFact rated “false” his claim of how much federal spending occurred under Obama’s watch: “In just one month, the Democrats have spent more than President Bush spent in seven years on the war in Iraq, the war in Afghanistan and Hurricane Katrina combined.”

Meanwhile, this 2010 statement by Schumer, rated “mostly false” by PolitiFact, asserted that the U.S. Supreme Court “decided to overrule the 100-year-old ban on corporate expenditures.” The ban on giving directly to candidates is still in place; however,  corporations are free to spend unlimited funds on elections providing they do so separate from a candidate’s official campaign.

The repetition

Twenty-four million people will be forced off their health insurance, young farmers have to sell the farm to pay estate tax, NATO members owe the United States money, millions of women turn to Planned Parenthood for mammograms, and sanctuary cities lead to higher crime. These are all examples of claims found to be inaccurate or misleading, but that continued or continue to be repeated by public officials.

The unexpected

Whether you lean one political direction or another, there are always surprises from the fact-checkers that can keep all our assumptions in check. For example, if you’re opposed to building a wall on the southern border to keep people from crossing into the U.S., you might guess Trump’s claim that people use catapults to toss drugs over current walls is an exaggeration. In fact, that statement was rated “mostly true” by PolitiFact. Or if you’re conservative, you might be surprised to learn an often repeated quote ascribed to Thomas Jefferson, in this case by Vice President Mike Pence, is in fact falsely attributed to him.

How to find

If you’re looking for the most recent TV news statements with fact-checks, you can see the latest offerings on the TV Archive’s homepage by scrolling down.

You can review whole speeches, scanning for just the fact-checked claims by looking for the fact-check icon on a program timeline. For example, starting in the Trump Archive, you can choose a speech or interview and see if and how many of the statements were checked by reporters.


You can also find the fact-checks in the growing table, also available to download, which includes details on the official making the claim, the topic(s) covered, the url for the corresponding TV news clip, and the link to the fact-checking article.


To receive the TV News Archive’s email newsletter, subscribe here.


Wayback Machine Playback… now with Timestamps!

Internet Archive - 5 October 2017 - 6:16am

The Wayback Machine has an exciting new feature: it can list the dates and times, the Timestamps, of all page elements compared to the date and time of the base URL of a page.  This means that users can see, for instance, that an image displayed on a page was captured X days before the URL of the page or Y hours after it.  Timestamps are available via the “About this capture” link on the right side of the Wayback Toolbar.  Here is an example:

The Timestamps list includes the URL and the date and time difference, compared to the current page, for the following page elements: images, scripts, CSS, and frames. Elements are presented in descending order. If you hover your cursor over a list element, the corresponding element on the page will be highlighted, and if you click on it you will be shown a playback of just that element.

Under the hood

Web pages are usually a composition of multiple elements such as images, scripts and CSS. The Wayback Machine tries to archive and playback web pages in the best possible manner, including all their original elements.  Each web page element has its own URL and Timestamp, indicating the exact date and time it was archived. Page elements may have similar Timestamps but they could also vary significantly for various reasons which depend on the web crawling process. By using the new Timestamps feature, users can easily learn the archive date and time for each element of a page.
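Under the hood, Wayback capture timestamps are 14-digit `YYYYMMDDhhmmss` strings embedded in archive URLs. A minimal sketch of how an element's offset from the base page could be computed (the function name and the rounding thresholds are illustrative, not the Archive's actual code):

```python
from datetime import datetime

def capture_delta(page_ts: str, element_ts: str) -> str:
    """Describe the difference between two 14-digit Wayback-style
    timestamps (YYYYMMDDhhmmss), relative to the base page."""
    fmt = "%Y%m%d%H%M%S"
    page = datetime.strptime(page_ts, fmt)
    element = datetime.strptime(element_ts, fmt)
    delta = (element - page).total_seconds()
    direction = "after" if delta >= 0 else "before"
    seconds = abs(delta)
    if seconds >= 86400:
        return f"{seconds / 86400:.1f} days {direction} the page"
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours {direction} the page"
    return f"{seconds / 60:.1f} minutes {direction} the page"

print(capture_delta("20170901120000", "20170903120000"))
# 2.0 days after the page
```

Applied to every image, script, CSS file, and frame on a capture, this is the kind of comparison the Timestamps list surfaces.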

Why this is important

The Wayback Machine is increasingly used in critical procedures, such as supplying legal evidence or political debate material.  It is important that what is presented is clear and transparent, even in the light of a web that was not designed to be archived. One of the ways a web archive could be confusing is via anachronisms: displaying content from different dates and times than the user expects. For example, when an archived page is played back, it could include some images from the current web, making it look like the image came from the past when it did not. We implemented Timestamps to provide users with more context about, and in turn hopefully greater confidence in, what they are seeing.

The Lazy Person’s Guide to Text Wrangling

Story Needle - October 4, 2017 - 7:52am

Writing in plain text is increasingly popular. Writers of all kinds are adopting Markdown, the plain text writing format. They swear by the zen-like benefits of writing in plain text. WYSIWYG isn’t cool any more. Yet little attention has been given to how to work with plain text when editing material originating from many different sources.  The growing popularity of plain text opens new opportunities for the creative reuse of text content, because plain text is inherently portable, able to move between different applications easily. This post will describe how to edit plain text content when different sources provide raw text that is used to develop new content.

Reusing and repurposing text is helpful in many situations.  For a personal side project, I’ve been exploring content design options, using representative content, for a prototype.  I want to reuse public information from different sources and in different formats (e.g., lists, tables, descriptive paragraphs). This content offers possibilities to combine and remix information to highlight specific themes. Even if I were a fast and accurate typist, retyping large volumes of text that already exists is not desirable or feasible.  Cutting and pasting text is tedious, especially if the text is formatted.  I wanted better tools to manipulate text encoded in different formats, and change the structure of the content.

Digital text is generally not plain text.  Digital formats can, however, be converted to some flavor of plain text.  Content designers may acquire digital text that exists as HTML, as CSV (a barebones spreadsheet), and even as PDF.  Each format involves implicit forms of structure, at various levels of granularity.  CSV assumes tabular information.  Plain text assumes linear information.  HTML can describe text content that contains different levels of information:

  • Names, words, dates, numbers and other discrete strings
  • Phrases that combine several strings together
  • Sentences
  • Paragraphs
  • Headings and other structural elements

We can edit and fine tune all these levels using plain text.  Text wrangling makes it possible.

What is Text Wrangling?

Text wrangling converts and transforms information at different levels of granularity.  For example, information in a list could be converted into a table, or vice versa.  Text wrangling can restructure and renarrate information. It can also clean up content from different sources, such as standardizing spelling or wording.

Word processors (Word, iA Writer, Scrivener, etc.) are designed for people writing fresh content.  They aren’t designed to support the reuse and repurposing of existing content.  When trying to manipulate text, word processors are rather clumsy.  Word processors fall short when you want to:

  • Generate content variations
  • Explore alternative wordings
  • Make multiple changes simultaneously
  • Ingest content from different sources that may be in different formats
  • Clean up text acquired from different sources.

Text wrangling differs from normal editing.  Instead of editing a single document, text wrangling involves gathering text from many sources, and rewriting and consolidating that text into a unified document or content repository.  Text wrangling applies large scale changes to text, by automating some low level transformations.  It uses functionality available in different applications to reduce typing and cut-and-paste operations.  This editing occurs during a “pre-drafting” phase, before the text evolves into a readable “draft”.  Editors can wrangle “raw” text fragments, to define themes and structures, and unify editorial consistency.

Tools for Text Wrangling

Many applications have useful features to wrangle plain text.   Ironically, none of these applications was designed for writers; most were designed for coders or data geeks.  As writing in plain text becomes more popular (in Markdown, Textile, AsciiDoc, or reStructuredText), more people are using coding tools to write.  These tools have enhanced editing features lacking in word processors, particularly the many “distraction free” apps designed for writing in Markdown.

Because none of the wrangling applications was designed specifically for text prose, no one application does everything I want. I use a combination of tools, and switch between them, depending on which is easiest to use for a specific purpose.  That sounds complicated, but it isn’t.  Plain text can be opened in many applications, and can be copied easily between them.

I use three kinds of tools to rework text:

  1. Spreadsheets (Google Sheets, Excel)
  2. Text editors that are primarily designed for coders (TextWrangler, Brackets, Sublime Text)
  3. Global utilities that are available to use within any application (TextSoap, Paste).

Before I share some tips, a few caveats (remember, I promised a lazy approach). These tips are not a comprehensive review of available apps and functionality.  Other apps provide alternative approaches, and some will be unknown to me. Because I use a Mac, my experience is limited to that platform.  My preferences are motivated by a desire to find an easy way to perform a text task, without needing to learn anything fiddly.  Most developers would use regular expression (regex) scripting to clean text, which is a powerful option for those comfortable with scripting.  I’ve opted for a quick and dirty approach, even if it is occasionally a messy one.  Lastly, apologies to my tech writer friends for my bloggerly presentation — I’m assuming everyone can locate more specific instructions elsewhere.   You’ll learn more from a Google search than I can provide you through a single link.

General Approach to Wrangling Plain Text

The basic approach to text wrangling is to work with lines of text.  Text editors organize text by line, as do spreadsheets (which call them rows). To take advantage of the functionality these tools offer, we convert text into lines.  Everything, whether a word, a sentence, a paragraph or a heading, can become a line that can be transformed.  Lines can be split, and they can be joined, to form different levels of meaning.

All text can be considered as a line, which can be transformed into different structures.

Some lines of text are just words or phrases — these lines are lists.  Some lines are complete sentences.  Working with sentences on separate lines is flexible.  It is easier to perform operations on sentences, such as changing the capitalization, when each sentence is on a separate line.  It is also easier to reorder sentences.  When finished with editing, individual sentences can be joined together into paragraphs.  Brackets has a “Join Lines” function (an extension) that makes it easy.  Just highlight all the lines you want joined together, and they become a single line of several sentences forming a paragraph.
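The split-and-join workflow can be sketched in a few lines of Python (the sample paragraph is invented, and the sentence split is deliberately naive):

```python
paragraph = ("Plain text is portable. It moves between applications easily. "
             "Each sentence can be edited on its own line.")

# Split the paragraph into one sentence per line (naive split on ". ").
lines = [s if s.endswith(".") else s + "." for s in paragraph.split(". ")]

# ... reorder, rewrite, or delete individual lines here ...
lines.reverse()

# Join the edited lines back into a single paragraph.
rebuilt = " ".join(lines)
print(rebuilt)
```

A "Join Lines" command in an editor performs the same final step interactively.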

Stripping out HTML and other markup

The first task is to get the content into plain text.  Working in plain text makes it easy to focus on the text.   If you have text content that’s encoded in HTML, you’ll want to get it into plain text, without all the distracting markup.  Even if you are comfortable reading HTML, you’ll find CSS and JavaScript markup that’s irrelevant to the text.

Sometimes you can get plain text by selecting the “Reader View” in your browser, and copying or emailing the text and saving it.  Alternatively, you may be able to acquire tabular or structured text on websites from within Google Sheets, using “ImportXML” or “ImportHTML”.  These functions take a moment to learn, but can be very helpful when you need to get a little bit of text from many different webpages.

When you are working with text files instead of live content, you want a way to directly clean the text without having to first view it in a browser.  Open the file in text editor, highlight text, and use TextSoap to strip out the HTML markup.  TextSoap is a handy app that can clean text, and can be used with any Mac application.
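For those who prefer scripting to an app, Python’s standard-library HTML parser can do the same stripping (a simplified sketch; a dedicated cleaner like TextSoap handles far more cases):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text content, discarding tags, scripts, and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    return "".join(parser.parts).strip()

print(strip_html("<p>Hello <em>plain</em> text.</p><script>ignored()</script>"))
# Hello plain text.
```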

Breaking Apart Phrases

Phrases are the foundation of sentences, labels, and headings.  If you need to wordsmith many words or phrases, it may be easiest to get these into a list, and work with them in a spreadsheet.  A list is essentially a one column spreadsheet.  By breaking apart the text in one column into different columns, you can modify different segments of the text, changing their order, standardizing wording and formatting, or extracting a sub-string of text within a longer string.  A common, simple example relates to people’s names.  Do you want to list an author’s name as a single phrase (“Ellen B. Smith”)? Or would you like to separate given name(s) and family name?  What order do you want the given names and surnames?

Spreadsheets can split or extract the text in one column into one or more new columns.  You can create a new column for each distinct word, which allows you to group-edit distinct words.  Or you may want to extract and put into a new column what’s distinctive or unique in each line.

Google Sheets has a function called “Split” that breaks the text in a column into separate columns.  When words are in separate columns, it can be easy to make changes to specific words.  “Substitute” allows specific words to be swapped.  “Replace” allows a sub-string to be changed.
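The same split-and-extract moves work outside a spreadsheet, too. A Python sketch of the name example (the names are invented):

```python
# Python analog of a spreadsheet "Split" on whitespace:
authors = ["Ellen B. Smith", "Jorge Luis Borges"]

rows = [name.split() for name in authors]       # one "column" per word
surnames = [parts[-1] for parts in rows]        # extract the family name

# Reorder to "Surname, Given name(s)":
flipped = [f"{parts[-1]}, {' '.join(parts[:-1])}" for parts in rows]
print(flipped)  # ['Smith, Ellen B.', 'Borges, Jorge Luis']
```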

Consolidating Phrases

Spreadsheets are good not just for breaking apart words, but for combining them.  You can take words in different columns on the same row and combine them together using “Join.”  “Concatenate” allows words from different columns and different rows to be combined.   This is an even more flexible option, because it lets you combine unlimited combinations of words in different cells.  For example, you could play with different word hierarchies (broader or narrower words on different rows within a single column), or array a range of related verbs or adjectives across a single row in different columns.  “Concatenate” can enable simple sentence generation.

A different situation occurs when you want to take information that’s within a table, and express it as a list.  A matrix table will have a column header, a row header, and a value associated with the column-row combination.  Suppose you have a table listing rainfall, with the columns representing months, and rows representing years, and the cell representing the amount of rainfall.  This information can be transformed into a single row or line.  In Excel, a function called “Unpivot” does this (Google Sheets lacks this functionality).  It presents all the information in a single row, such as “May | 2017| 2 cm”, which can be joined together.  These values can be transformed further into a list of complete sentences, such as “In May 2017, the rainfall was 2 cm”.  That list of sentences could become the beginning of separate paragraphs that discuss the implications of each month’s rainfall on the local economy.
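A Python sketch of the unpivot-to-sentences idea (the rainfall numbers are invented):

```python
# A matrix table: columns are months, rows are years, cells are rainfall in cm.
months = ["Apr", "May"]
table = {2016: [3, 5], 2017: [1, 2]}

# "Unpivot" into one (month, year, value) triple per line...
rows = [(month, year, values[i])
        for year, values in table.items()
        for i, month in enumerate(months)]

# ...then turn each triple into a complete sentence.
sentences = [f"In {m} {y}, the rainfall was {v} cm." for m, y, v in rows]
print(sentences[-1])  # In May 2017, the rainfall was 2 cm.
```

Each generated sentence could then seed a paragraph of discussion, as described above.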

Removing Redundancy

It’s helpful to put each discrete idea in the text on a separate line.  These may be names of topics, phrases, or facts.  During the text development phase, you’ll want to collect all text strings of interest.  Text wrangling tools can help you collect everything of potential interest, and worry later about whether you’ve already covered these items.

Suppose you need content about all of your products.  You can create a list of all your products, with each product on a separate line.  If you have many product variations that sound similar, it can be confusing to know if it’s already in the list.  If the list is a spreadsheet, it is easy to remove duplicates.  If the list is a text file, TextWrangler has a feature called “Kill Duplicates”.  To spot near-duplicates, sorting the lines will often reveal suspiciously-similar items.
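Outside a spreadsheet or TextWrangler, a few lines of Python do the same deduplication, with a normalization step to catch near-duplicates (the product names are invented):

```python
products = [
    "Widget Pro",
    "Widget Pro ",   # trailing space hides a duplicate
    "widget pro",    # case variant
    "Widget Mini",
]

# Normalize, then keep the first occurrence of each line (order preserved,
# since dict keys remember insertion order).
seen = dict.fromkeys(p.strip().lower() for p in products)
unique = list(seen)
print(unique)  # ['widget pro', 'widget mini']
```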

Spotting duplicate or redundant paragraphs takes an extra step.  To compare alternative paragraphs, put them in separate files.  TextWrangler allows you to compare two files using the function “Find Differences”.  Both files are displayed side-by-side, with their differences highlighted.  This approach is more flexible than a word processor, which may assume you want to merge documents or to choose one document as the right one.
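Python’s standard-library `difflib` offers a scripted equivalent of a “Find Differences” comparison (the draft text is invented):

```python
import difflib

draft_a = ["Our library is open to everyone.", "Hours: 9-5 daily."]
draft_b = ["Our library is open to everyone.", "Hours: 9-6 daily."]

# unified_diff marks lines unique to each draft with -/+ prefixes.
diff = list(difflib.unified_diff(draft_a, draft_b, lineterm=""))
for line in diff:
    print(line)
```

Lines present in both drafts appear unmarked, so redundancy stands out immediately.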

Harmonizing Style

A big task when assembling text from many sources is harmonizing the style.  Different texts may use different terminology.

Text editors, like word processors, have “Find and Replace” functionality, but they offer even better tools.  Find and Replace is inefficient because it assumes you have one word you already know you need to replace with another word.  Suppose instead you have many different words referring to the same concept?  Suppose you aren’t sure what would be the best replacement?  This is where the magic of “multiple selections” comes into play.

Brackets’ Multiple Selections feature can let you make many edits at once.  (Sublime Text has a similar feature).  All you need to do is highlight all the words that you want to change.  Then, you type over all the highlighted text at once, and see the changes happen as you type.  Words change on many lines at once, and you can try out different text to decide which works best across different sentences.  And to repeat: the highlighted words being changed don’t need to be the same word.  If you have some sentences that talk about dogs, some sentences that talk about canines, and some sentences that talk about mutts, you can highlight all these words (dogs, canines, mutts), and change them all to “hounds” — before deciding to say “man’s best friend” instead.
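A scripted equivalent of this many-words-at-once edit is a single regular expression with alternation (the sentences are invented; the case-preserving helper is illustrative):

```python
import re

sentences = [
    "Dogs make loyal companions.",
    "Canines have an acute sense of smell.",
    "Most mutts are healthy.",
]

# One pattern matches every variant term, like a multiple-selection edit.
pattern = re.compile(r"\b(dogs|canines|mutts)\b", re.IGNORECASE)

def keep_case(match):
    # Preserve a leading capital when replacing, e.g. "Canines" -> "Hounds".
    return "Hounds" if match.group().istitle() else "hounds"

harmonized = [pattern.sub(keep_case, s) for s in sentences]
print(harmonized[1])  # Hounds have an acute sense of smell.
```

Swapping “hounds” for “man’s best friend” is a one-word change to the helper, so alternatives are cheap to try.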

Brackets has a related feature called “Multiple Cursors” that is also amazing.  It allows you to place your cursor on multiple lines, and edit multiple lines at once.  Suppose you want to decide the best construction for some headings.  You want to know if saying “{Product X} helps you {Benefit Y}” is best, or “{Product X} makes it possible to {Benefit Y}”.  You list all the products and their respective benefits on separate lines.  Then you can edit all the headings at once, and try out each variation to see which sounds best.

Shifting Perspective

Many wording changes involve changing voice, or flipping emphasis.  Do you want to discuss a task using the imperative “invest” to emphasize action, or by using the gerund “investing” to emphasize a series of activities?  If you have many such tasks, you might want to put them in a list of statements, and try out both options.  You can then decide on a consistent approach.

In addition to the multiple cursor approach, you can edit multiple lines of text using the “Prefix/Suffix” functionality available in TextWrangler.  This allows you to either insert or remove either a prefix or a suffix to a line.  This could be useful with deciding on the wording of headings.  Maybe you want to see what the headings would sound like if they began “Case Study:” or whether they should end with “(Case Study)”.
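The prefix/suffix experiment is equally mechanical in Python (the headings are invented; `str.removeprefix` requires Python 3.9+):

```python
headings = [
    "Migrating a legacy archive",
    "Scaling full-text search",
]

# Try a prefix...
prefixed = [f"Case Study: {h}" for h in headings]
# ...or a suffix, then compare which reads better.
suffixed = [f"{h} (Case Study)" for h in headings]

# Removing a prefix is just as mechanical:
restored = [h.removeprefix("Case Study: ") for h in prefixed]
print(restored == headings)  # True
```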

Skeleton Frameworks

Plain text tools can help you reuse text elements again and again.  This can be useful if you have a template or framework you are using to collect text.

Sublime Text has a feature called “Snippets”, where you can store any text you want to reuse, and inject it into any file you are working with.

Another option is a small utility called Paste, which works with any application on a Mac.  It is like a huge clipboard, where you can store large snippets of text, give these snippets names, and reuse them wherever you may need them.

Adding Markup to Plain Text

Plain text is great for writing and editing.  But eventually it will need some markup to become more useful.  Several options are available to turn plain text into web text.

Many writers have adopted Markdown. You can add Markdown syntax to the text, and convert the text to HTML.

You can also add basic HTML elements to plain text using TextSoap, which is a utility that can be used in any Mac application. You simply highlight the words you want to tag, and choose the HTML element you want to use. This option may be desirable if you need elements that aren’t well supported in Markdown.

The most robust option is to use the tagging functionality available in some text editors.  You can add markup using Brackets’s “Surround” extension, where you highlight your text, then define any tagging you want to place around the text.  Sublime Text has a similar feature: “Tag > Wrap Selection.”  These features let you add metadata beyond simple HTML elements; for example, to indicate in what language a phrase is.
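A toy version of such a wrap-selection command, here adding a language attribute (the `wrap` helper is hypothetical, not an actual editor API):

```python
def wrap(text: str, selection: str, tag: str, **attrs) -> str:
    """Wrap the first occurrence of `selection` in `tag`, like a
    'Wrap Selection' editor command (illustrative helper)."""
    attr_str = "".join(f' {k}="{v}"' for k, v in attrs.items())
    wrapped = f"<{tag}{attr_str}>{selection}</{tag}>"
    return text.replace(selection, wrapped, 1)

print(wrap("She said bonjour to everyone.", "bonjour", "span", lang="fr"))
# She said <span lang="fr">bonjour</span> to everyone.
```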

Limitations of Text Wrangling

Text wrangling techniques can be handy in many situations, but will be inefficient in others.  They are intended for early content development work.  As I’ve discovered, they can be helpful for assembling text to prototype content.

These techniques aren’t efficient for editing massive content repositories, or editing single documents that aren’t very long.  If you need to migrate large volumes of content, you’ll want some custom scripts written to transform that content appropriately.

Text wrangling focuses on redrafting raw text rather than on collaboration, which will generally be delegated to another platform such as GitHub. Word processing apps offer better support for the review of well-defined drafts, where comments and change tracking are important.  If you are editing or rewriting individual documents, especially in collaboration with others, tools such as Google Docs that track comments will be a better option.

Someday, I hope someone will develop the perfect tool to edit text.  Until then, using a combination of tools is the best option.

—Michael Andrews

The post The Lazy Person’s Guide to Text Wrangling appeared first on Story Needle.

TV News Record: Wayback Machine saves deleted prez tweets

Internet Archive - September 29, 2017 - 4:17pm

A weekly round up on what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.

In this week’s TV News Archive roundup, we explain how presidential tweets are forever, show how different TV cable news networks summarized NFL protests via Third Eye chyron data, and present FiveThirtyEight’s analysis of hurricane coverage (hint: Puerto Rico got less attention.)

Wayback Machine preserved deleted prez tweets; PolitiFact fact-checks legality of prez tweet deletions (murky)

The Internet Archive’s Wayback Machine has preserved President Donald Trump’s deleted tweets praising failed GOP Alabama U.S. Senate candidate Luther Strange following his defeat by Roy Moore on September 26. So does the Pulitzer Prize-winning investigative journalism site ProPublica, through its Politwoops project.

Kudos @propublica saving @realDonaldTrump deleted tweets. Also @internetarchive on Wayback https://t.co/FMkJNZ4xNS https://t.co/xAPRTzCCb0 pic.twitter.com/zXkHzDvkLP

— TV News Archive (@TVNewsArchive) September 27, 2017

The story of Trump’s deleted tweets about Strange was reported far and wide, including this segment on MSNBC’s “Deadline: White House” that aired on September 27.

In a fact-check on the legality of a president deleting tweets, linked in the TV News Archive clip above, John Kruzel reports for PolitiFact that the law is murky and still being fleshed out:

Experts were split over how much enforcement power courts have in the arena of presidential record-keeping, though most seemed to agree the president has the upper hand.

“One of the problems with the Presidential Records Act is that it does not have a lot of teeth,” said Douglas Cox, a professor at the City University of New York School of Law. “The courts have held that the president has wide and almost unreviewable discretion to interpret the Presidential Records Act.”

That said, many of the experts we spoke to are closely monitoring how the court responds to the litigation around Trump administration record-keeping.

He also provides background on that litigation, a lawsuit brought by Citizens for Responsibility and Ethics in Washington. The case is broadly about requirements for preserving presidential records, and a previous set of deleted presidential tweets is a part of it.

Fact Check: NFL attendance and ratings are way down because people love their country (Mostly false)

Speaking of Trump’s tweets, the president ignited an explosion of coverage with an early morning tweet on Sunday, Sept. 24, ahead of a long day of football games: “NFL attendance and ratings are WAY DOWN. Boring games yes, but many stay away because they love our country.”

Manuela Tobias of PolitiFact rated this claim as “mostly false,” reporting, “Ratings were down 8 percent in 2016, but experts said the drop was modest and in line with general ratings for the sports industry. The NFL remains the most watched televised sports event in the United States.” “As for political motivation, there’s little evidence to suggest people are boycotting the NFL. Most of the professional sports franchises are dealing with declines in popularity.”

How did different cable TV news networks cover the NFL protests?

We first used the Television Explorer tool to see where there was a spike in the use of the word “NFL” near the word “Trump.” Sunday showed the most use of these words. After a closer look, we saw MSNBC, Fox News, and CNN all showed the highest mentions of these terms around 2 pm Pacific.

Spike at 2 pm (PST) for CNN, MSNBC, and Fox News

Then we downloaded data from the new Third Eye project, which turns TV News chyrons into data, filtering for that date and hour. We were able to see how the three cable news networks were summarizing the news at that particular point in time.

At about 2:02, CNN broadcast this chyron: “NFL teams kneel, link arms in defiance of Trump.”

Screen grab of chyron caught by Third Eye from 2:02 pm 9/24/17 on CNN

Fox News chose the following, also seen below tweeted from one of the Third Eye twitter bots: “Some NFL owners criticize Trump’s statements on player protests, link arms with players”


— The Third Eye (@tvThirdEyeF) September 24, 2017

Meanwhile, MSNBC chose a different message:  “Taking a knee: NFL teams send a message.”

Screen grab of chyron caught by Third Eye from 2:02 pm 9/24/17 on MSNBC

About eight minutes later, all three cable channels were still reporting on the NFL protests:

Puerto Rico’s hurricane Maria got less media attention than hurricanes Harvey & Irma

Writing for FiveThirtyEight.com, Dhrumil Mehta demonstrated that both online news sites and TV news broadcasters paid less attention to Puerto Rico’s hurricane Maria than to hurricanes Harvey and Irma, which hit the U.S. mainland primarily in Texas and Florida. Mehta used TV News Archive data via Television Explorer, as well as data from Media Cloud on online news coverage, to help make his case:

While Puerto Rico suffers after Hurricane Maria, much of the U.S. media (FiveThirtyEight not excepted) has been occupied with other things: a health care bill that failed to pass, a primary election in Alabama, and a spat between the president and sports players, just to name a few. Last Sunday alone, after President Trump’s tweets about the NFL, the phrase “national anthem” was said in more sentences on TV news than “Puerto Rico” and “Hurricane Maria” combined.

To receive the TV News Archive’s email newsletter, subscribe here.



Experiments Day Hackathon 2017

Internet Archive - September 22, 2017 - 4:45pm

Join us this Saturday, September 23 @ 10:30am PT for our Experiments Day Hackathon

It’s almost that time again — October 11 — the day the Internet Archive invites you to celebrate another year of preserving our cultural heritage and the progress our community has made towards building tools that facilitate universal access to all knowledge.

Making these collections as discoverable and accessible as possible is a huge task, and we need your help! It’s often our community members who bring our items to life.

Now’s your chance!

Champions of open access, unite: This Saturday, September 23 @ 10:30am PT, join us in person at the Internet Archive HQ or join us remotely online for an Experiments Day Hackathon: a day of camaraderie and civic action fuelled by freshly ground coffee and abundant amounts of pizza.

Let’s team up to prototype experimental interfaces, remix content, and build tools to make knowledge more accessible to those who need it most.

What experiments would you craft with 2M hours of TV news, 5B archived images, 3M books, and petabytes of free storage? Proposed themes include #decentralization, #accessibility, #books, #scholarly-papers, #annotations.

We’ve helped back up over 2M hours of television news, hundreds of billions of webpages through time, audio from tens of thousands of live music concerts and 78rpm discs, and have helped digitize and lend millions of public domain and modern books. The breadth, archival quality, and uniqueness of our collections make the Internet Archive a rich terrain for experimentation and hacking.  Many of our top engineers will be on hand to guide you through the APIs to build with.

Then, be sure to come back on October 11 for our annual celebration, where many of our experiments will be on display!

Register/RSVP: https://www.eventbrite.com/e/internet-archive-experiments-hackathon-2017-tickets-37012125263

Join our chat: https://gitter.im/ArchiveExperiments/Lobby

Schedule + more event details: https://experiments.archivelab.org/hackathon

Watch remotely: https://www.youtube.com/watch?v=IJov5X5Sht4

TV news chyron data provide ways to explore breaking news reports & bias

Internet Archive - September 21, 2017 - 2:14pm

Today the Internet Archive’s TV News Archive announces a new way to plumb our TV news collections to see how news stories are reported: data feeds for the news that appears as chyrons on the lower thirds of TV screens.  Our Third Eye project scans the lower thirds of TV screens, using OCR, or optical character recognition, to turn these fleeting missives into downloadable data ripe for analysis.  At launch, Third Eye tracks BBC News, CNN, Fox News, and MSNBC, and contains more than four million chyrons captured in just over two weeks.

Download Third Eye data. API and TSV options available.

Follow Third Eye on Twitter.

Third Eye joins a growing suite of TV News Archive tools that help researchers, journalists, and the public analyze how news is filtered through TV and presented to the public. These include Face-o-Matic, created through a partnership with Matroid, which uses facial recognition to find top political leaders on TV news shows; and Television Explorer, an interface created by data scientist Kalev Leetaru that allows easy searching and visualization of TV News Archive closed captioning. The Political TV Ad Archive used audio fingerprinting to find airings of political ads in the 2016 elections, and the Trump and U.S. Congress archives provide a quick way to see news clips featuring top political figures, alongside associated fact checks by FactCheck.org, PolitiFact, and The Washington Post‘s Fact Checker.

Breaking news often appears as chyrons on TV before newscasters begin reporting or video is available, whether the subject is a hurricane or a breaking political story. Which chyrons a TV news network chooses to display often reveals editorial decisions that can demonstrate a particular slant on the news. With Third Eye data, journalists, fact-checkers, and researchers can explore how messages are delivered to the public in near real-time.

Third Eye on Twitter tweets the most clear, representative chyron from a one-minute period on a particular TV news channel. This can serve as an alert system, showing how TV networks are reporting news.

For example, on September 6, 2017, in the midst of a heavy news day featuring Hurricane Irma, the debate over a deal on immigration, and other stories, TV news cable networks began to show the breaking news that Facebook had turned over information about $100,000 in ads purchased by Russian sources during the 2016 elections to Robert S. Mueller III, the special counsel investigating ties between the Trump campaign and Russia. Our Third Eye CNN Twitter bot tweeted out this chyron recorded at 2:38 pm Pacific Standard Time.


— The Third Eye (@tvThirdEye) September 6, 2017

Here is the corresponding clip as it appears on the TV News Archive.

At 2:51 p.m., MSNBC ran this chyron: “FACEBOOK: WE SOLD POLITICAL ADS DURING ELECTION TO COMPANY LIKELY OPERATED IN RUSSIA.” The corresponding clip is below.

However, our data do not show Fox News running any chyrons on the Facebook ad news that day. To cross-check, we used Television Explorer, a tool for searching TV News Archive closed captions. (Captions differ from chyrons; captions capture what news anchors are actually saying, as opposed to chyrons, which feature text chosen by the TV channel to run at the bottom of the screen.) Television Explorer shows CNN and MSNBC covering the story on September 6, but not Fox News.

However, the Facebook ad story did make it on to the Fox News website during the 2 p.m. hour, as this search on the Wayback Machine shows.

This is just one example of the way that researchers might use Third Eye chyron data in conjunction with other tools to explore how a particular story is portrayed on TV news. We’d love for others to dig in, explore, and give us feedback on this new public data source.

More on Third Eye data

The work of the Internet Archive’s TV architect Tracey Jaquith, the Third Eye project applies OCR to the “lower thirds” of TV cable news screens to capture the text that appears there. The chyrons are not captions, which provide the text for what people are saying on screen, but rather are narrative display text that accompanies news broadcasts.

Created in real time by TV news editors, chyrons sometimes include misspellings. The OCR process frequently introduces errors of its own, so entries may be garbled. To make sense of the noise, Jaquith applies algorithms that choose the most representative chyrons from each channel over 60-second increments. This cleaned-up feed is what fuels the Twitter bots that post the chyrons appearing on TV news screens.
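The selection step can be sketched in a few lines of Python. This is a minimal illustration, not the Archive's production algorithm: the entry shape (timestamp, text, duration) and the "longest cumulative on-screen time wins" heuristic are assumptions of this sketch.

```python
from collections import defaultdict

def representative_chyrons(entries, window=60):
    """Pick one representative chyron per fixed time window.

    `entries` is a list of (timestamp_sec, text, duration_sec) tuples,
    a hypothetical shape used for illustration only. The heuristic here,
    keeping the text with the longest cumulative on-screen time in each
    60-second window, is an assumption, not the Archive's algorithm.
    """
    screen_time = defaultdict(lambda: defaultdict(float))
    for ts, text, duration in entries:
        screen_time[ts // window][text.strip()] += duration
    # One winner per window, keyed by window index in chronological order.
    return {
        w: max(texts.items(), key=lambda kv: kv[1])[0]
        for w, texts in sorted(screen_time.items())
    }
```

Any real implementation would also need to merge near-duplicate OCR readings of the same chyron, which this sketch ignores.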

We provide options to download the filtered feed and/or the raw feed, each available nearly as soon as the text appears on the TV screen. Both may be useful, depending on the type of project. In addition, the Twitter feed itself is a good way to see what the filtered feed looks like.

Some notes:

  • Chyrons are derived in near real-time from the TV News Archive’s collection of TV news. The constantly updating public collection contains 1.4 million TV news shows, some dating back to 2009.
  • At launch, Third Eye captures four TV cable news channels: BBC News, CNN, Fox News, and MSNBC.
  • Data can be affected by temporary collection outages, which typically can last minutes or hours, but rarely more. If you are concerned about a specific time gap in a feed and would like to know if it’s the result of an outage, please inquire at tvnews@archive.org.
  • The “raw feed” option provides all of the OCR’ed text from chyrons at the rate of approximately one entry per second. The “filtered tweets feed” provides the data that fuels our Twitter bots; this has been filtered to find the most representative, clearest chyrons from a 60-second period, with no more than one entry/tweet per minute (though the duration may be shorter than 60 seconds). The filtered feed relies on algorithms that are a work in progress; we invite you to share your ideas on how to effectively filter the noise from the raw data.
  • Dates/times are in UTC (Coordinated Universal Time) in raw feeds, PST (Pacific Standard Time) in Twitter feed.
  • Because the size of the raw data is so large (about 20 megabytes per day), we limit results to seven days per request.
  • We began collecting raw data on August 25, 2017; the filtered feed begins on September 7, 2017.
  • The “Duration” column is in seconds: the amount of time that particular chyron appeared on the screen.
  • To view clips in context on the TV News Archive, paste “https://archive.org/details/” before the field that begins with a channel name. For example, “FOXNEWSW_20170919_100000_FOX__Friends/start/792” becomes “https://archive.org/details/FOXNEWSW_20170919_100000_FOX__Friends/start/792”
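The last rule above can be expressed as a one-line helper. The function name is ours; the URL prefix and the example identifier come straight from the note above.

```python
def clip_url(identifier):
    """Turn a Third Eye channel/show/start field into a TV News Archive link."""
    return "https://archive.org/details/" + identifier

# Example from the note above:
print(clip_url("FOXNEWSW_20170919_100000_FOX__Friends/start/792"))
```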

We want to hear from you! Please contact us with questions, feedback, concerns – and also to tell us what project you’ve done with the TV News Archive’s Third Eye project: tvnews@archive.org. Follow us @tvnewsarchive, and subscribe to our weekly newsletter here.

Thanks to Robin Chin, Katie Dahl, Dan Schultz, and the TV News Archive director, Roger Macdonald, for contributing to this project.



MacArthur Foundation’s $100 Million Award Finalists

Internet Archive - 19 September 2017 - 8:09pm

Today, the MacArthur Foundation announced the finalists for its 100&Change competition, awarding a single organization $100 million to solve one of the world’s biggest problems. The Internet Archive’s Open Libraries project, one of eight semifinalists, did not make the cut to the final round. We want to congratulate the 100&Change finalists and thank the MacArthur Foundation for inspiring us to think big. For the last 15 months, the Internet Archive team has been building the partnerships that can transform US libraries for the digital age and put millions of ebooks in the hands of more than a billion learners. We’ve collaborated with the world’s top copyright experts to clarify the legal framework for libraries to digitize and lend their collections. And we’ve learned an amazing amount from the leading organizations serving the blind and people with disabilities that impact reading.

To us, that feels like a win.

In the words of MacArthur Managing Director Cecilia Conrad:

The Internet Archive project will unlock and make accessible bodies of knowledge currently located on library shelves across the country. The proposal for curation, with the selection of books driven not by commercial interests but by intellectual and cultural significance, is exciting. Though the legal theory regarding controlled digital lending has not been tested in the courts, we found the testimony from legal experts compelling. The project has an experienced, thoughtful and passionate team capable of redefining the role of the public library in the 21st Century.

Copyright scholar and Berkeley Law professor Pam Samuelson (center) convenes a gathering of more than twenty legal experts to help clarify the legal basis for libraries digitizing and lending physical books in their collections.

So, the Internet Archive and our partners are continuing to build upon the 100&Change momentum. We are meeting October 11-13 to refine our plans, and we invite interested stakeholders to join us at the Library Leaders Forum. If you are a philanthropist interested in leveraging technology to provide more open access to information—well, we have a project for you.

For 20 years, at the Internet Archive we have passionately pursued one goal: providing universal access to knowledge. But there is almost a century of books missing from our digital shelves, beyond the reach of so many who need them. So we cannot stop. We now have the technology, the partners and the plan to transform library hard copies into digital books and lend them as libraries always have. So all of us building Open Libraries are moving ahead.

Members of the Open Libraries Team at the Internet Archive headquarters, part of a global movement to provide more equitable access to knowledge.

Remember: a century ago, Andrew Carnegie funded a vast network of public libraries because he recognized democracy can only exist when citizens have equal access to diverse information. Libraries are more important than ever, welcoming all of society to use their free resources, while respecting readers’ privacy and dignity. Our goal is to build an enduring asset for libraries across this nation, ensuring that all citizens—including our most vulnerable—have equal and unfettered access to knowledge.

Thank you, MacArthur Foundation, for inspiring us to turn that idea into a well-thought-out project.


–The Open Libraries Team

TV News Record: Debt ceiling, hurricane funding, GDP

Internet Archive - 14 September 2017 - 7:21pm

A weekly round up on what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.

In this week’s TV News Archive roundup, we examine the latest Face-O-Matic data (you can too!) and present our partners’ fact-checks of Sen. Ted Cruz’s claim that Hurricane Sandy emergency funding was filled with “unrelated pork” and of President Donald Trump’s claims about other countries’ GDPs.

What got political leaders sustained face-time on TV news last week?

What got Trump, McConnell, Schumer, Ryan, and Pelosi the longest clips on TV cable news screens this past week? Thanks to our new trove of Face-O-Matic data developed with the start-up Matroid’s facial recognition algorithms, reporters and researchers can get quick answers to questions like these.

House Minority Leader Nancy Pelosi, D., Calif., got almost six minutes, an unusually large amount of sustained face-time for her, from “MSNBC Live” on September 7, covering her press conference following President Donald Trump’s surprise deal with congressional Democrats on the debt ceiling.

Senate Majority Leader Mitch McConnell, R., Ky., also enjoyed his longest sustained face-time segment last week on the debt ceiling, clocking in at 34 seconds on MSNBC’s “Morning Joe.”

House Speaker Paul Ryan, R., Wis., got 11 minutes on September 7 on Fox News’ “Happening Now” for his weekly press conference, where he was shown discussing a variety of topics, including Hurricane Harvey, tax reform, and debt relief. For Senate Minority Leader Chuck Schumer, D., N.Y., the topic that got him the most sustained time (21 seconds) was also his unexpected deal with the president on the debt ceiling.

For President Donald Trump, however, who never lacks for TV news face-time, the longest sustained appearance this past week was his speech at the 9/11 memorial ceremony at the Pentagon.

Fact-check: Hurricane Sandy relief was 2/3 filled with pork and unrelated spending (false)

In the aftermath of Hurricane Harvey, Sen. Ted Cruz, R., Texas, came under criticism for supporting federal funding for Harvey victims while having opposed such funding for victims of Hurricane Sandy in 2013. Cruz defended himself by saying, “The problem with that particular bill is it became a $50 billion bill that was filled with unrelated pork. Two-thirds of that bill had nothing to do with Sandy.”

But Lori Robertson of FactCheck.org labeled this claim as “false,” noting that a Congressional Research Service study pegged at least 69 percent of that bill’s funding as related to Sandy, and that even more of the money could be attributed to hurricane relief funding: “Cruz could have said he thought the Sandy relief legislation included too many non-emergency items. That’s fair enough, and his opinion. But he was wrong to specifically say two-thirds of the bill ‘had nothing to do with Sandy,’ or ‘little or nothing to do with Hurricane Sandy.’”

Fact-check: Trump spoke with world leader unhappy with nine percent GDP growth rate (three Pinocchios)

At a recent press conference on his tax reform plan, President Donald Trump remarked that some foreign leaders are unhappy even with rates of gross domestic product (GDP) growth higher than the U.S. rate. “I spoke to a leader of a major, major country recently. Big, big country. They say ‘our country is very big, it’s hard to grow.’ Well believe me this country is very big. How are you doing, I said. ‘Cause I have very good relationships believe it or not with the leaders of these countries. I said, how are you doing? He said ‘not good, not good at all. Our GDP is 7 percent.’ I say 7 percent? Then I speak to another one. ‘Not good. Not good. Our GDP is only 9 percent.’”

Nicole Lewis of The Washington Post’s Fact Checker gave this claim “three Pinocchios”: “Of the 58 heads of state he’s met or spoke with since taking office, not one can claim 9 percent GDP growth. Perhaps Trump misheard. Or perhaps the other leader was fibbing. Or maybe Trump just thought the pitch for a tax cut sounded better if he could quote two leaders….In any case, Trump is making a major economic error in comparing the GDP of a developed country to a developing one. For his half-truths, and for comparing apples to oranges, Trump receives Three Pinocchios.”

To receive the TV News Archive’s email newsletter, subscribe here.

Rubbing the Internet Archive

Internet Archive - 13 September 2017 - 7:58am

In July 2017, Los Angeles-based artist Katie Herzog visited our headquarters in San Francisco and created Rubbing the Internet Archive — a 10-foot-high by 84-foot-wide rubbing of the exterior of the building, made using rubbing wax on non-fusible interfacing. The imposing 1923 building—formerly a Christian Science church and now a library—features an intricate facade that translated well into two dimensions.

The drawing is now adhered to the walls of Klowden Mann’s main exhibition space, allowing the to-scale exterior of the Internet Archive to form the interior built environment of the gallery.

Rubbing the Internet Archive is on view at Klowden Mann, 6023 Washington Blvd., Culver City, California, through October 14th.

Content Velocity, Scope, and Strategy

Story Needle - 12 September 2017 - 8:12am

I want to discuss three related concepts.

First, the velocity of the content, or how quickly the essence of content changes.

Second, the scope that the content addresses, that is, whether it is meant for one person or many people, and whether it is intended to have a short or a long life span.

Lastly, how those factors affect publishing strategy, and the various capabilities publishers need.

These concepts — velocity, scope and strategy — can help publishers diagnose common problems in content operations.  Many organizations produce too much content, and find they have repetitive or dated content.  Others struggle to implement approaches that were developed and successfully used in one context, such as technical documentation, and apply them to another, such as marketing communication.  Some publishers don’t have clear criteria addressing the utility of content, such as when new content is needed, or how long published content should be kept online.  Instead of relying on hunches to deal with these issues, publishers should structure their operations to reflect the ultimate purpose of their content.

Content should have a core rationale for why it exists.  Does the organization publish content to change minds — to get people to perceive topics differently, learn about new ideas, or to take an action they might not otherwise take without the content?  Or does it publish content to mind the change — to keep audiences informed with the most current information, including updates and corrections, on topics they already know they need to consult?

When making content, publishers should be able to answer: In what way is the content new?  Is it novel to the audience, or just an update on something they already know about?  The concept of content velocity can help us understand how quickly the information associated with content changes, and the extent to which newly created  content provides new information.

Content Velocity: Assessing Newness and Novelty

All content is created based on the implicit assumption that it says something new, or says it better than the existing content that’s available.  Unfortunately, much new content gets created without ever questioning whether it is truly necessary.  True, people need information about a topic to support a task or goal they have.  But is new content really necessary? Or could existing content be revised to address these needs?

Let’s walk through the process by which new content gets created.

Is new content necessary, or should existing content be revised?

The first decision is whether the topic or idea warrants the creation of new content, or whether existing content covers much of the same material.  If the topic or idea is genuinely new and has not been published previously, then new content is needed.  If the publisher has only a minor update to material they’ve previously published, they should update the existing content, and not create new content.  They may optionally issue an alert indicating that a change has been made, but such a notification won’t be part of the permanent record.  Too often, publishers decide to write new articles about minor changes that get added to the permanent stock of content.  Since the changes were minor, most such articles repeat information already published elsewhere, resulting in duplication and confusion for all concerned.

The next issue is to decide if the new content is likely to be viewed by an individual more than once. This is the shelf life of the content, considered from the audience’s perspective. Some content is disposable: its value is negligible after being viewed, or if never viewed by a certain date.  Content strategists seldom discuss short-lived, disposable content, except to criticize it as intrinsically wasteful. Yet some content, owing to its nature, is short lived. Like worn razor blades or leftover milk, it won’t be valuable forever.  It needs to disappear from the individual’s field of vision when it is no longer useful. If the audience considers the content disposable, then the publisher needs to treat it that way as well, and have a process for getting the content off the shelf.  Other content is permanent: it always needs to be available, because people may need to consult it more than once.

Publishers must also decide whether the content is either custom (intended for a specific individual), or generic (intended for many people).  We will return to custom and generic content shortly.

If the publisher already has content covering the topic, it needs to ask whether new information has emerged that requires existing content to be updated. We’d also like to know if some people may have seen this existing content previously, and will be interested in knowing what’s changed.  For example, I routinely consult W3C standards drafts.  I may want to know what’s different between one revision and the prior one, and appreciate when that information is called out.  For content I don’t routinely consult, I am happy to simply know that all details are current as of a certain date when the content was last revised.

One final case exists, which is far too common.  The publisher has already covered the topic or idea, and has no new information to offer.  Instead, they simply repackage existing content, giving it a veneer of looking new.  While repackaged content is sometimes okay if it involves a genuinely different approach to presenting the information, it is generally not advisable.  Repackaged content results from the misuse of the concept of content reuse.  Many marketing departments have embraced content reuse as a way to produce ever more content, saying the same thing, in the hopes that some of this content will be viewed.  The misuse of content reuse, particularly the automated creation of permanent content, is fueling an ever growing content bubble.   Strategic content reuse, in contrast, involves the coordination of different content elements into unique bundles of information, especially customized packages of information that support targeted needs or interests.

Once publishers decide to create new content, they need to decide content scope, the content’s expected audience and expected use.

Content Scope: Assessing Uniqueness and Specificity

Content scope refers to how unique or specific newly created content is.  We can consider uniqueness in terms of audiences (whether the content is for a specific individual, or a collective group), and in terms of time (is the content meant to be used at a specific moment only, or will it be viewed again?).  Content that is intended for a specific individual, or for viewing at a specific time, is more unique, and has a narrower range of uses, than content that’s been created for many people, or for viewing multiple times by the same person. How and when the audience uses the content will influence how the publisher will need to create and manage that content.

Scope can vary according to four dimensions:

  1. The expected frequency of use
  2. The expected audience size
  3. The archival and management approach (which will mirror the expected frequency of use)
  4. The content production approach (which will mirror the expected audience size)
How content scope can vary
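Because the third and fourth dimensions mirror the first two, the scope model can be captured in a small sketch. The class and field names here are our own illustration, not a schema the essay prescribes.

```python
from dataclasses import dataclass

@dataclass
class ContentScope:
    """Illustrative model of the four scope dimensions (names are ours)."""
    repeat_use: bool       # 1. expected frequency of use: viewed more than once?
    broad_audience: bool   # 2. expected audience size: many people, or one person?

    @property
    def archival_approach(self) -> str:
        # 3. archival/management approach mirrors expected frequency of use
        return "permanent" if self.repeat_use else "disposable"

    @property
    def production_approach(self) -> str:
        # 4. production approach mirrors expected audience size
        return "generic" if self.broad_audience else "custom"

# Example: a monthly lease-payment notice is one reader, one viewing.
notice = ContentScope(repeat_use=False, broad_audience=False)
```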

The expected frequency of use looks at whether someone is likely to want to view content again after seeing it once.  This looks at relevance from an individual’s perspective, rather than a publisher’s perspective.  Publishers may like to think they are creating so-called evergreen content that people will always find relevant, but from an audience perspective, most content, once viewed, will never be looked at again.  When audiences encounter content they’ve previously viewed, they are likely to consider it as clutter, unless they’ve a specific reason to view it again.  Audiences are most likely to consider longer, more substantive content on topics of enduring interest as permanent content.  They are likely to consider most other content as disposable.

Disposable content is often event driven.  Audiences need content that addresses a specific situation, and what is most relevant to them is content that addresses their needs at that specific moment.  Typically this content is either time sensitive, or customized to a specific scenario.  Most news has little value unless seen shortly after it is created.  Customized information can deliver only the most essential details that are relevant to that moment.  Customers may not want to know everything about their car lease — they only want to know about the payment for this month.  Once this month’s payment question has been answered, they no longer need that information.  This scenario shows how disposable content can be a subset of permanent content.  Audiences may not want to view all the permanent content, and only want to view a subset of it.  Alerts are one way to deliver disposable content that highlights information that is relevant, but only for a short time.

The expected audience refers to whether the content is intended for an individual, or addresses the interests of a group of individuals.  Historically, nearly all online content addressed a group of people, frequently everyone.  More recently, digital content has become more customized to address individual situational needs and interests, where the content one person views will not be the same as the content another views, even if the content covers the same broad topic.  The content delivered can consider factors such as geolocation, viewing history, purchase history, and device settings to provide content that is more relevant to a specific individual.  By extension, the more that content is adjusted to be relevant to a specific individual, the less that same content will be relevant to other individuals.

A tradeoff exists between how widely viewed content is and how helpful it might be to a specific individual.  Generic reference content may generate many views, and be helpful to many people, but it might not provide exactly what any one of those people wants.  Single-use content created for an individual may provide exactly what that person needed, at the specific time they viewed the content.  But that content will be helpful to a single person only, unless such customization is scalable across time and different individuals.

Disposable content is moment-rich, but duration-poor.  Marketing emails highlight the essential features of disposable content.  People never save marketing emails, and they rarely forward them to family and friends.  They rarely even open and read them, unless they are checking their email at a moment of boredom and want a distraction — fantasizing about some purchase they may not need, or wanting to feel virtuous for reading a tip they may never actually use.  Disposable content sometimes generates zero views by an individual, and almost never will generate more than one view.  If there’s ever a doubt about whether someone might really need the information later, publishers can add a “save for later” feature — but only when there’s a strong reason to believe an identifiable minority has a critical need to access the content again.

Publishers face two hurdles with disposable content: being able to quickly produce new content, and being able to deliver time-sensitive or urgent content to the right person when it is needed.  They don’t need to worry about archiving the content, since it is no longer valuable.  Disposable content is always changing, so that different people on different days will receive different content.

With permanent content, publishers need to worry about managing existing content, and having a process for updating it.    Publishers become concerned with consistency, tracking changes, and versioning.  These tasks are less frenetic than those for disposable content, but they can be more difficult to execute well.  It is easy to keep adding layers of new material on top of old material, while failing to indicate what’s now important, and for whom.

Content that’s used repeatedly, but is customized to specific individual needs, can present tricky information architecture challenges.  These can be addressed by having a user login to a personal account, where their specific content is stored and accessible.

Strategies for Fast and Slow Content: Operations Fit to Purpose

All publishers operate somewhere along a spectrum.  One end emphasizes quick turn-around, short-lived content (such as news organizations), and the other end emphasizes slowly evolving, long-lived content (such as healthcare advice for people with chronic conditions). Many organizations will publish a mix of fast and slow content.  But it’s important for organizations to understand whether they are primarily a fast or slow content publisher, so that they can decide on the best strategy to support their publishing goals.

Most organizations will be biased toward either fast or slow content.  Fast moving consumer goods, unsurprisingly, tend to create fast content.  In contrast, heavy equipment manufacturers, whose products may last for decades, tend to generate slow content that’s revised and used over a long period.

Different roles in organizations gravitate toward either fast or slow content.  Consider a software company.  Marketers will blitz customers with new articles talking about how revolutionary the latest release of their software is.  Customer support may be focused on revising existing content about the product, and reassuring customers that the changes aren’t frightening, but easy to learn and not disruptive.  Even if the new release generates a new instance of technical documentation devoted to that release, the documentation will reuse much of the content from previous releases, and will essentially be a revision to existing content, rather than fundamentally new content.

Fast content is different from slow content

Some marketers want their copywriters to become more like journalists, and have set up “newsrooms” to churn out new content.  When emulating journalists, marketers are sticking with the familiar fast content paradigm, where content is meant to be read once only, preferably soon after it’s been created.  Most news gets old quickly, unless it is long form journalism that addresses long-term developments.  Marketing content frequently has a lifespan of a mosquito.

Marketing content tends to focus on:

  • Creating new content, or
  • Repackaging existing content, and
  • Making stuff sound new (and therefore attention worthy)

For fast content, production agility is essential.

Non-marketing content has a different profile. Non-marketing content includes advisory information from government or health organizations, and product content, such as technical documentation, product training, online support content, and other forms of UX content such as on-screen instructions.  Such content is created for the long term, and tends to emphasize that it is solid, reliable and up-to-date.  Rather than creating lots of new content, existing content tends to evolve.  It gets updated, and expands as products gain features or knowledge grows.  It may lead with what’s new, but will build on what’s been created already.

Much non-marketing content is permanent content about a fixed set of topics. The key task is not brainstorming new topics to talk about, but keeping published information up-to-date.  New permanent topics are rare.  When new topics are necessary, it’s common for new topics to emerge as branches of an existing topic.

Fast and slow content are fundamentally different in orientation.  Organizations are experimenting with ways to bridge these differences.  Organizations may try to make their marketing content more like product content, or conversely, make their product content more like marketing content.

Some marketing organizations are adopting technical communications methods, for example, content management practices developed for technical documentation such as DITA.  Marketing communications are seeking to leverage lessons from slow content practices, and apply them to fast content, so that they can produce more content at a larger scale.

Marketers want their content to become more targeted.  They want to componentize content so they can reuse content elements in endless combinations.  They embrace reuse, not as a path to revise existing content, but as a mechanism to push out new content quickly, using automation.  At its best, such automation can address the interests of audiences more precisely.  At its worst, content automation becomes a fatigue-inducing, attention-fragmenting experience for audiences, who are constantly goaded to view messages without ever developing an understanding.  Content reuse is a poor strategy for getting attention from audiences. New content, when generated from the reuse of existing content components, never really expresses new ideas.  It just recombines existing information.

Some technical communicators, who develop slow content, are implementing practices associated with marketing communications.  Rather than only producing permanent documents to read, technical communication teams are seeking to push specific disposable messages to resolve issues.  Technical communication teams are embracing more push tactics, such as developing editorial calendars, to highlight topics to send to audiences, instead of waiting for audiences to come to them. These teams are seeking to become more agile, and targeted, in the content they produce.

As the boundaries between the practices of fast and slow content begin to overlap, delivery becomes more important.  Publishers need to choose between targeted versus non-targeted delivery. They must decide if their content will be customized and dynamically created according to user variables, or pre-made to anticipate user needs.

The value of fast content depends above all on the accuracy of its targeting.  There is no point creating disposable content if it doesn’t resolve a problem for users.  If publishers rely on fast content, but can’t deliver it to the right users at the right time, the user may never find out the answer to their question, especially if permanent content gets neglected in the push for instant content delivery.

Generic fast content is becoming ever more difficult to manage.  Individuals don’t want to see content they’ve viewed already, or decided they weren’t interested in viewing to begin with.  But because generic content is meant for everyone, it is difficult to know who has seen or not seen content items.  Fast generic content still has a role. Targeting has its limits.  Publishers are far from being able to produce personalized content for everyone that is useful and efficient.  Much content will inevitably have no repeat use.  Yet fast generic content can easily become a liability that is difficult to manage.  Recommendation engines based on user viewing behaviors and known preferences can help prioritize this content so that more relevant content surfaces. But publishers should be judicious when creating fast generic content, and should enforce strict rules on how long such content stays available online.

Automation is making new content easier to create, which is increasing the temptation to create more new content.  Unfortunately, digital content can resemble plastic shopping bags, which are useful when first needed, but which generally never get used again, becoming waste. Publishers need to consider content reuse not just from their own parochial perspective, but from the perspective of their audiences.  Do their audiences want to view their content more than once?   Marketing content is the source of most fast content. Most marketing content is never read more than once.  Can that ever change?  Are marketers capable of producing content that has long term value to their audiences?  Or will they insist on controlling the conversation, directing their customers on what content to view, and when to view it?

Creating new content is not always the right approach.  Automation can make it more convenient for publishers to pursue the wrong strategy, without scrutinizing the value of such content to the organization, and its customers.   Content production agility is valuable, but having robust content management is an even more strategic capability.

— Michael Andrews

The post Content Velocity, Scope, and Strategy appeared first on Story Needle.

Face-O-Matic data show Trump dominates – Fox focuses on Pelosi; MSNBC features McConnell

Internet Archive - September 6, 2017 - 5:06pm

For every ten minutes that TV cable news shows featured President Donald Trump’s face on the screen this past summer, the four congressional leaders’ visages were presented for one minute, according to an analysis of the free, downloadable Face-O-Matic data fueled by the Internet Archive’s TV News Archive and made available to the public today.

Face-O-Matic is an experimental service, developed in collaboration with the start-up Matroid, that tracks the faces of selected high-level elected officials on major TV cable news channels: CNN, Fox News, MSNBC, and the BBC. The service first launched as a Slack app in July; after receiving feedback from journalists, the TV News Archive is now making the underlying data available to the media, researchers, and the public. It will be updated daily here.

Unlike caption-based searches, Face-O-Matic uses facial recognition algorithms to recognize individuals on TV news screens. Face-O-Matic finds images of people when TV news shows use clips of the lawmakers speaking; frequently, however, the lawmakers’ faces also register when their photos or clips are used to illustrate a story, or when they appear as part of a montage as the news anchor talks. Alongside closed caption research, these data provide an additional metric to analyze how TV cable news networks present public officials to their millions of viewers.

Our concentration on public officials and our bipartisan tracking is purposeful; in experimenting with this technology, we strive to respect individual privacy and extract only information for which there is a compelling public interest, such as the role the public sees our elected officials playing through the filter of TV news. The TV News Archive is committed to doing this right by adhering to these Artificial Intelligence principles for ethical research developed by leading artificial intelligence researchers, ethicists, and others at a January 2017 conference organized by the Future of Life Institute. As we go forward with our experiments, we will continue to explore these questions in conversations with experts and the public.

Download Face-O-Matic data here.

We want to hear from you:

What other faces would you like us to track? For example, should we start by adding the faces of foreign leaders, such as Russia’s Vladimir Putin and North Korea’s Kim Jong-un? Should we add former President Barack Obama and contender Hillary Clinton? Members of the White House staff? Other members of Congress?

Do you have any technical feedback? If so, please let us know by contacting tvnews@archive.org or participating on the GitHub Face-O-Matic page.

Trump dominates, Pelosi gets little face time

Overall, from July 13 through September 5, analysis of Face-O-Matic data shows:

  • Altogether, we found 7,930 minutes, or some 132 hours, of face time for President Donald Trump and the four congressional leaders. Of that amount, Trump dominated with 90 percent of the face time. Collectively, the four congressional leaders garnered 15 hours of face time.
  • House Minority leader Nancy Pelosi, D., Calif., got the least amount of time on the screen: just 1.4 hours over the whole period.
  • Of the congressional leaders, Senate Majority Leader Mitch McConnell’s face was found most often: 7.6 hours, compared to 3.8 hours for House Speaker Paul Ryan, R., Wis.; 1.7 hours for Senate Minority Leader Chuck Schumer, D., N.Y., and 1.4 hours for Pelosi.
  • The congressional leaders got bumps in coverage when they were at the center of legislative fights, such as in this clip of McConnell aired by CNN, in which the senator is shown speaking on July 25 about the upcoming health care reform vote. Schumer got coverage on the same date from the network in this clip of him talking about the Russia investigation. Ryan got a huge boost on CNN when the cable network aired his town hall on August 21.
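The totals above boil down to simple aggregation over per-appearance records. As a rough sketch of how a reader might compute face-time shares from such data (the field names below are assumptions for illustration, not the published Face-O-Matic schema):

```python
from collections import defaultdict

# Illustrative records: one row per detected on-screen appearance.
# Field names are assumptions, not the actual Face-O-Matic schema.
records = [
    {"person": "Donald Trump", "network": "MSNBC", "seconds": 120},
    {"person": "Mitch McConnell", "network": "MSNBC", "seconds": 45},
    {"person": "Donald Trump", "network": "FOXNEWS", "seconds": 90},
    {"person": "Nancy Pelosi", "network": "FOXNEWS", "seconds": 30},
]

def face_time_shares(records):
    """Total on-screen seconds per person, plus each person's share
    (percent) of all tracked face time."""
    totals = defaultdict(int)
    for r in records:
        totals[r["person"]] += r["seconds"]
    grand = sum(totals.values())
    return {p: (s, round(100 * s / grand, 1)) for p, s in totals.items()}

print(face_time_shares(records))
```

The same aggregation, grouped additionally by the `network` field, would yield the per-channel comparisons discussed below.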

Fox shows most face time for Pelosi; MSNBC, most Trump and McConnell

The liberal cable network MSNBC gave Trump more face time than any other network. Ditto for McConnell. A number of these stories highlight tensions between the senate majority leader and the president. For example, here, on August 25, the network uses a photo of McConnell, and then a clip of both McConnell and Ryan, to illustrate a report on Trump “trying to distance himself” from GOP leaders. In this excerpt, from an August 21 broadcast, a clip of McConnell speaking is shown in the background to illustrate his comments that “most news is not fake,” which is interpreted as “seem[ing] to take a shot at the president.”

MSNBC uses photos of both Trump and McConnell in an August 12 story on the “feud” between the two.

While Pelosi does not get much face time on any of the cable news networks examined, Fox News shows her face more than any other. In this commentary report on August 20, Jesse Watters criticizes Pelosi for favoring the removal of Confederate statues placed in the Capitol building. “Miss Pelosi has been in Congress for 30 years. Now she speaks up?” On August 8, “Special Report With Bret Baier” used a clip of Pelosi speaking in favor of women’s right to choose the size and timing of their families as an “acid test for party base.”

Example of Fox News using a photo of House Minority Leader Nancy Pelosi to illustrate a story, in this case about a canceled San Francisco rally.

While the BBC gives some Trump face time, it gives scant attention to the congressional leaders. Proportionately, however, the BBC gives Trump less face time than any of the U.S. networks.

On July 13 the BBC’s “Outside Source” ran a clip of Trump talking about his son, Donald Trump, Jr.’s, meeting with a Russian lobbyist.

For details about the data available, please visit the Face-O-Matic page. The TV News Archive is an online, searchable, public archive of 1.4 million TV news programs aired from 2009 to the present. This service allows researchers and the public to use television as a citable and sharable reference. Face-O-Matic is part of ongoing experiments in generating metadata for reporters and researchers, enabling analysis of the messages that bombard us daily in public discourse.


Why Bitcoin is on the Internet Archive’s Balance Sheet

Internet Archive - September 2, 2017 - 6:09pm


A foundation was curious as to why we have Bitcoin on our balance sheet, and I thought I would explain it publicly.

The Internet Archive explores how bitcoin and other Internet innovations can be useful in the non-profit sphere – this is part of that exploration. We want to see how donated bitcoin can be used, not just sold off. We are doing this publicly so others can learn from us. And it is fun. And it is interesting.

We started receiving donations in bitcoin in 2012. In the first year we got about 2,700, and we sold them to an employee who was heavily involved (for the prevailing price of $2 per bitcoin). The next year, we held onto them and offered them to employees as an optional way to receive their salary – about ⅓ took some. We set up an ATM at the Internet Archive. We got the sushi place next door to take bitcoin, and encouraged our employees to buy books at Green Apple Books in bitcoin. We set up a vanity address. We started taking bitcoin in our swag store. We tried (and failed) to get our credit union to help bitcoin firms.

Another year we gave a small amount of bitcoin as an xmas bonus to those who set up a wallet (from a matching grant of bitcoins from me).

We paid vendors and contractors in bitcoin when they wanted it. We started getting micropayments from the Brave Browser. We hosted a movie with filmmakers about living on bitcoin. We publicly tested whether people were stealing bitcoins, as the press was saying (no one stole ours).

A few years later, the price had gone up so much that I personally bought some at the going rate to decrease the financial risk to the Internet Archive, but then I did not just cash those in for dollars. We may seem like geniuses, but we are not: we saw the price go down as well, and we did not sell out then either.

Recently the Zcash folks helped us set up a Zcash address, and we would love people to donate there.

What we are doing is trying to “play the game” and see how it works for non-profits. It is not an investment for us, it is testing a technology in an open way. If you want to see the donations to us in bitcoin, they are here. Zcash here.

Bitcoin donations have been decreasing in recent years, which may reflect that we are not moving with the times. I am hoping that someone will say, gosh, I will donate a thousand bitcoins to these guys who have been so good :). Here’s to hoping.

So the Internet Archive has some bitcoin on its balance sheet to be a living example of an organization that is trying this innovative Internet technology. We do the same with bittorrent, tor, and decentralized web tech.

Please donate, and we will put them to good use supporting the Internet Archive’s mission.


The Internet Archive’s Annual Bash – Come Celebrate With Us!

Internet Archive - August 30, 2017 - 10:50pm

What’s your personal rabbit hole?

78 rpm recordings?
20th Century women writers?
Friendster sites?
Vintage software?
Educational films from the 50s?

Find out at the Internet Archive’s Annual Bash:

The Internet Archive invites you to enter our 20th Century Time Machine to experience the audio, books, films, web sites, ephemera and software fast disappearing from our midst. We’ll be connecting the centuries—transporting 20th century treasures to curious minds in the 21st. Come explore the possibilities at our annual bash on Wednesday, October 11, 2017, from 5-9:30 pm.

Tickets start at $15 here.

We’ll kick off the evening with cocktails, food trucks and hands-on demos of our coolest collections. Come scan a book, play in a virtual reality arcade, or spin a 78 rpm recording. When you arrive, be sure to get your library card. If you “check out” all the stations on your card, we’ll reward you with a special Internet Archive gift.

Starting at 7 p.m., we’ll unveil the latest media the Internet Archive has to offer, presented by the artists, writers, and scientists who lose themselves in our collections every day. And to keep you dancing into the evening, DJ Phast Phreddie the Boogaloo Omnibus will once again be spinning records from 8-9:30. Come join our celebration!

Event Info: Wednesday, October 11th
5pm: Cocktails, food trucks, and hands-on demos
7pm: Program
8pm: Dessert and Dancing

Location:  Internet Archive, 300 Funston Avenue, San Francisco

Get your tickets now!


