A biweekly round up on what’s happening at the TV News Archive by Katie Dahl and Nancy Watzman.
This week we dive into research resources, fact-checks, and backgrounders on the sexual harassment charges sweeping the worlds of media, politics, and entertainment.
Past TV news appearances by O’Reilly, Franken, Halperin & more preserved, searchable
The names of well-known and influential men accused of sexual harassment continue to pile up: Louis C.K.; Rep. John Conyers, D., Mich.; Sen. Al Franken, D., Minn.; Mark Halperin of ABC; Garrison Keillor of Prairie Home Companion fame; NBC’s Matt Lauer; Alabama GOP Senate candidate Roy Moore; Fox’s Bill O’Reilly; Charlie Rose of CBS and PBS; Leon Wieseltier, formerly of The New Republic; Harvey Weinstein; and of course the president himself, Donald Trump.
Some of these men made their living on TV and all of them have many TV appearances preserved in the TV News Archive. These include, for example, clips making the rounds in this week’s stories about Lauer’s television history: this one where NBC staff did a fake joke story about Lauer himself being harassed by a colleague; and this interview with the actress Anne Hathaway after a photographer took a photo up her skirt and a tabloid printed it.
We’ve also got this September 2017 interview Lauer did with Bill O’Reilly, in which he asks the former Fox News host: “Have you done some soul searching? Have you done some self-reflection and have you looked at the way you treated women that you think now or think about differently now than you did at the time?” O’Reilly answers, “My conscience is clear.”
Mentions of the term “sexual harassment” on cable TV news programs are picking up, but are not yet at the level they reached in November 2011, according to a search of TV News Archive caption data via Television Explorer. What was big news then? Accusations of sexual harassment against then-presidential candidate Herman Cain, who dropped out of the 2012 race the following month. Cain was back on TV this week, in an interview on the current claims of sexual harassment against powerful men, on Fox News’ The Ingraham Angle.
Cain says: “Now, what’s different about my situation, and let’s just say Roy Moore’s situation, is that they came after me with repeatedly attacks and accusations, but no confirmation. They now believe that if they throw more and more and more mud on the wall, that eventually people are going to believe it. But that has backfired because, as you know, the latest poll shows Roy Moore is now back in the lead in Alabama, and the people in Alabama are going to have to decide.”
Our fact-checking partners have produced numerous fact-checks and background pieces on sexual harassment charges and statements.
“How politicians react to such charges often appears to reflect who is being accused. Democrats are quick to jump on allegations about Republicans — and vice versa. But the bets start to get hedged when someone in the same party falls under scrutiny,” writes Meg Kelly for The Washington Post’s Fact Checker.
For example, here’s House Minority Leader Nancy Pelosi, D., Calif., on “Meet the Press” in 1998, defending then-President Bill Clinton. At the time, Pelosi said, referring to the investigation by Ken Starr into allegations against Clinton: “The women of America are just like other Americans, in that they value fairness, they value privacy, and do not want to see a person with uncontrolled power, uncontrolled time, uncontrolled – unlimited money investigating the president of the United States.”
And here’s President Donald Trump on the White House south lawn, defending GOP Alabama Senate candidate Roy Moore: “Roy Moore denies this. That’s all I can say. And by the way, he totally denies it.”
Kelly also wrote a round-up of sexual harassment charges against the president himself: “During the second presidential debate, Anderson Cooper asked then-candidate Trump point blank whether he had “actually kiss[ed] women without consent or grope[d] women without consent?” Trump asserted that “nobody has more respect for women” and Cooper pushed him, asking, “Have you ever done those things?” Trump denied that he had, responding: “No, I have not.”…But it’s not as simple as that. Many of the women have produced witnesses who say they heard about these incidents when they happened — long before Trump’s political aspirations were known. Three have produced at least two witnesses.”
In another piece, Glenn Kessler, editor of The Washington Post’s Fact Checker, writes in a round up of corroborators, “Such contemporaneous accounts are essential to establishing the credibility of the allegation because they reduce the chances that a person is making up a story for political purposes. In the case of sexual allegations, such accounts can help bolster the credibility of the “she said” side of the equation.” One of the statements of denial he quotes is this one from White House Press Secretary Sarah Huckabee Sanders, who says when asked if all of the accusers are lying: “Yeah, we’ve been clear on that from the beginning, and the president’s spoken on it.”
Here’s FactCheck.org’s Eugene Kiely on Sen. Tim Kaine, D., Va., and his claim that the Clinton campaign can’t give back contributions from Harvey Weinstein: “Asked if the Clinton-Kaine campaign will return contributions it received from movie mogul Harvey Weinstein, Sen. Tim Kaine repeatedly said the campaign is over. That’s true, but it doesn’t mean the campaign can’t refund donations.” FactCheck.org deemed this claim “misleading.”
This year something magical happened.
Our film curator, Rick Prelinger, noticed a film for sale on eBay. The description said “taken at a Japanese Internment Camp,” so Rick bought it, suspecting it might be of historical significance. In October, when he digitized the 16mm reel and showed it to me, I couldn’t believe it.
On the screen was a home movie shot in 1944 at the WWII camp in Jerome, Arkansas where 8,500 Japanese Americans were incarcerated. This American concentration camp was once the fifth largest town in Arkansas. Rick thinks the film was shot by a camp administrator and hidden away for the last 73 years.
There are only a handful of movies ever shot inside the camps—I know, because my mother and grandparents were locked up in a similar camp for three and a half years.
What a miracle, then, that the Internet Archive found this film and preserved it while there are still people who can bear witness to what we see on screen. One of them, Sab Masada, was 12 years old when a truck came to haul his family away from their farm in Fresno. At our annual event, Sab remembered:
We were shipped to Jerome, Arkansas. It turned extremely cold, the beginning of November. In fact, we had some snow. The camp was still being completed so our barracks had no heat and my father caught pneumonia. 21 days after we arrived, he died in a makeshift barrack hospital…
This film will tell America that these concentration camps we were in—it wasn’t a myth! They were real. So it’s a historical record: proof of what really happened to 120,000 Americans and legal residents.
The best part for me? That this film will live on at archive.org, accessible to the public, forever, for free. Filmmakers can download it. Scholars can study it. Teachers can weave it into their lessons.
Here’s our promise to you: the Internet Archive will keep updating these files every time a major, new format emerges. We will preserve them for the long term, against fire, neglect and all types of more human disaster. We will cherish your stories as if they were our own.
I’m part of a small staff of 150, running a site the whole world depends on. I’ve worked at huge media corporations where the only things that matter are the ratings, because ratings = profits. At the end of the day, I couldn’t stomach it. I wanted to do more. I wanted to work at a place aligned with my values: creating a world where everyone has equal access to knowledge, because knowledge equals power. It’s society’s great leveler—at least that’s how it worked for my family.
This is why I give to the Internet Archive: because I believe every story deserves to be saved for the future.
When you give, you’re helping make sure the whole world can access the books, concerts, radio shows, web pages and yes, the home movies, that tell our human story.
So please, support our mission by donating today. Right now, a very generous supporter will match your donation 3-to-1, so you can triple your impact.
Think of it as an investment in our children, and their children, so that one day in the future, they might understand our joys and learn from our mistakes.
—Wendy Hanamura, Director of Partnerships, Internet Archive
Many are Latin American music from the David Chomowicz and Esther Ready Collection.
Others are square dance music, with and without the calls, from the Larry Edelman Collection. (Thank you David Chomowicz, Esther Ready, and Larry Edelman for the donations.)
We are still working on some of the display issues with this month’s materials, so some changes are yet to come.
Unfortunately, we have found dates for only about half of this month’s batch using our automatic techniques of looking through 78discography.com, 45worlds, Discogs, DAHR, and full-text searching of Cashbox Magazine. There are currently over 2,000 songs with missing dates.
If you like internet sleuthing, or leveraging our scanned discographies or your own, and would like to join in on finding dates and reviews, please jump in. We have a Slack channel for those doing this.
Congratulations to B George’s group, George Blood’s group, and the collections group at the Internet Archive for another large batch of largely disappeared 78’s.
Join Public Knowledge to Discuss…
Net neutrality is on the chopping block. Public Knowledge has spent nearly ten years fighting for an open internet, but we expect that at the Federal Communications Commission’s December 14th meeting a majority of Commissioners will vote to eliminate our strong net neutrality rules. The current FCC has made dominant industry interests a priority, putting startup and consumer interests at risk. As one of the world’s largest centers for businesses and individuals financing, creating, and building the technology and inputs to the digital economy, Silicon Valley will be directly impacted by these new policies. At the same time, AT&T is seeking to merge with Time Warner, and Sinclair Broadcasting with the Tribune Company — potentially undermining video and broadband competition as net neutrality rules disappear. So, Public Knowledge is coming to California to discuss these important political shifts with engaged individuals, and to build new connections with individuals who want to learn more about standing up for an open internet.
The event will kick off with a reception from 6-7pm and will include a discussion from 7-8pm. We hope that you will join us and stick around afterwards for food and drinks until 9pm.
Please be sure to register, spread the word via Twitter, and keep an eye out for additional information coming soon.
Tuesday December 5th, 2017
6:00 pm Reception
7:00 pm Program
8:00 pm Reception
300 Funston Ave.
San Francisco, CA 94118
This week, as you’ve watched your Bitcoin Cash and Bitcoin rise and fall and rise again, perhaps you’ve been wondering: how can I put my cryptocurrencies to good use? Should I buy a new car or yacht? Plow it into Amazon stock? Well, at least some of you have turned to the Internet Archive—a place where you can donate your cryptocurrencies directly to help ensure that the Web is free, secure and backed up for all time.
At the Internet Archive, we are big fans of the cryptocurrency movement and have been trying to do our part to test and support alternative means of commerce. We’ve been accepting Bitcoin donations since 2012, and starting this week, we are now accepting donations of Bitcoin Cash and Zcash.
This week it all started when UKcryptocurrency tweeted us asking the Internet Archive to start accepting Bitcoin Cash. We love a good challenge and got that link up within hours.
Here’s how you can donate in cryptocurrency:
Bitcoin is an experimental, cryptographically secure, semi-anonymous method of transferring value between parties. Introduced in 2008, it has been successfully used as a token system between thousands of people. The Internet Archive has proudly experimented with bitcoin including paying some employees with it and encouraging local businesses to experiment as well.
Internet Archive Bitcoin Address: 1Archive1n2C579dMsAu3iC6tWzuQJz8dN
When Bitcoin was first created, developers and miners questioned whether the cryptocurrency could scale properly. To ensure its future, on August 1, 2017, developers and miners initiated what’s known as a ‘hard fork’ and created a new currency called Bitcoin Cash. For more information, visit: https://www.bitcoincash.org/
Internet Archive Bitcoin Cash Address: 12PRZjrLo5yqnHMmUCtPUse4kCyuneby3S
Zcash is a first-of-its-kind cryptocurrency that offers both privacy and selective transparency for all transactions. Zcash gives you the option of using transparent addresses (what the Internet Archive uses, listed below) or shielded addresses, which keep sender, receiver, and amounts private. For more information on Zcash visit https://z.cash/, and for details on the differences between transparent and shielded transactions visit: https://z.cash/support/security/privacy-security-recommendations.html
Internet Archive Zcash Address: t1W6JqMECmbqmDGZ9uLSXWQZ6EgxFkfuty8
We are a non-profit organization with a huge mission: to give everyone access to all knowledge—the books, web pages, audio, television and software of our shared humanity. Forever. For Free. But to build this digital library of the future, we need your help. If you’re feeling flush from a cryptocurrency windfall, please consider giving to the Internet Archive today.
More info on why cryptocurrencies matter to us: http://blog.archive.org/2017/09/02/why-bitcoin-is-on-the-internet-archives-balance-sheet/
A biweekly round up on what’s happening at the TV News Archive by Katie Dahl and Nancy Watzman
Meanwhile, House Speaker Paul Ryan, R., Wis., keeps talking about tax reform being a “once in a generation opportunity,” and, coincidence!, so does Senate Majority Leader Mitch McConnell, R., Ky. It’s a recurring theme.
These types of repeated phrases, often vetted via communication staff, are known as “talking points,” and it’s the way politicians, lobbyists, and other denizens of the nation’s capital sell policy. The TV News Archive is working toward the goal of applying artificial intelligence (AI) to our free, online library of TV news to help ferret out talking points so we can better understand how political messages are crafted and disseminated.
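As a rough illustration of what spotting repeated phrases could look like, the sketch below counts word sequences that show up in more than one speaker's captions. It is not the TV News Archive's method: the function name `repeated_phrases`, the five-word window, and the sample snippets are all assumptions invented for this example.

```python
import re
from collections import defaultdict


def repeated_phrases(transcripts, n=5, min_speakers=2):
    """Map each n-word phrase to the speakers who used it, keeping only
    phrases shared by at least `min_speakers` different speakers."""
    speakers_by_phrase = defaultdict(set)
    for speaker, text in transcripts:
        # Lowercase and strip punctuation so small differences don't hide a match.
        words = re.findall(r"[a-z0-9']+", text.lower())
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            speakers_by_phrase[phrase].add(speaker)
    return {
        phrase: sorted(speakers)
        for phrase, speakers in speakers_by_phrase.items()
        if len(speakers) >= min_speakers
    }


# Hypothetical caption snippets, invented for illustration only.
sample = [
    ("Speaker A", "This is a once in a generation opportunity to fix the tax code."),
    ("Speaker B", "We have a once in a generation opportunity, and we will take it."),
    ("Speaker C", "The bill does not add up."),
]

shared = repeated_phrases(sample, n=5)
for phrase, speakers in sorted(shared.items()):
    print(phrase, "->", ", ".join(speakers))
```

Exact matching on fixed-length word windows is the simplest possible choice here; it misses paraphrases and would need tuning on real caption data.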
For now, we don’t have an automated way to identify such repeated phrases from the thousands of hours of television news coverage. However, searching within our curated archives of top political leaders can provide a quick way to check for a phrase you think you’re hearing often. Visit archive.org/tv to find our Trump archive, executive branch archive, and congressional archives, click into an archive, then search for the phrase within that archive.
Funny, you look familiar
Wait, is this former President George W. Bush trying out a new look?
No, it’s not. This is Bob Massi, a legal analyst for Fox Business News and host of “Bob Massi is the Property Man.” In a test run of new faces for our Face-o-Matic facial detection tool, Massi’s uncanny resemblance (minus the hair) to the former president earned him a “false positive” – the algorithm identified this appearance as Bush incorrectly.
This doesn’t get us too worried, as we still include human testers and editors in our secret sauce: we’ll retrain our algorithm to disregard photos of Massi in the TV news stream. It does point toward why we want to be very careful, particularly with facial recognition, where a private individual may be tracked inadvertently or a public official misrepresented. Our concern about developing ethical practices with facial recognition is why, for the present, we are restricting our face-finding to elected officials. We invite discussion with the greater community about ethical practices in applying AI to the TV News Archive at email@example.com.
In our current Face-o-Matic set we track the faces of President Donald Trump and the four congressional leaders in their TV news appearances. After receiving feedback from journalists and researchers, our next set will include living ex-presidents and recent major presidential party nominees: Jimmy Carter, Bill Clinton, George H.W. Bush, George W. Bush, Barack Obama, Hillary Clinton, John McCain, and Mitt Romney. Stay tuned, while we fine tune our model.
Fact-check: everyone will get a tax cut (false)
In an interview on November 7, on Fox News’s new “The Ingraham Angle,” House Speaker Paul Ryan, R., Wis., says: “Everyone enjoys a tax cut all across the board.”
Pulling in information from the Tax Policy Center and a tax model created by the American Enterprise Institute, The Washington Post’s Fact Checker Glenn Kessler counters Ryan’s claim: “In the case of married families with children — whom Republicans are assiduously wooing as beneficiaries of their plan — about 40 percent are estimated to receive tax hikes by 2027, even if the provisions are retained.”
Ryan changed his language, according to Kessler, following an inquiry on November 8 from the Fact Checker. Now he is saying, “the average taxpayer in all income levels gets a tax cut.”
In an interview on November 12 on CNN’s “State of the Union,” Senate Minority Whip Dick Durbin, D., Ill., claimed that the GOP tax plan is “not being scored by the Congressional Budget Office, as it is traditionally. It’s because it doesn’t add up.”
“Under the most obvious interpretation of that statement, Durbin is incorrect. The nonpartisan analysis for tax bills is actually a task handled by the Joint Committee on Taxation, and the committee has been actively analyzing the Republican tax bills,” reported Louis Jacobson of PolitiFact.
We are honored to announce that the Internet Archive and artists Paul D. Miller (aka DJ Spooky) and Greg Niemeyer have been awarded one of the first Hewlett 50 Arts Commissions to support the creation of “Sonic Web,” an acoustic portrait of the Internet. Sampling from the millions of hours of audio preserved in the Internet Archive, these experimental composers and artists will collaborate to create an 11-movement multimedia production for a string quartet, vocalist and original electronic instruments about the origins of the Internet and what needs to happen to keep it accessible, neutral, and free.
“Art is always a reflection of the changing dynamics of any society. Leonardo Da Vinci once said ‘Learning never exhausts the mind,’” explained DJ Spooky. “I think that we have so many things to learn from these kinds of interdisciplinary projects, and the William and Flora Hewlett Foundation is collaborating with Artists to show how these initiatives can affect the entire spectrum of the creative economy.”
The Internet Archive team is among the first 10 recipients of the Hewlett 50 Arts Commissions, an $8 million commissioning initiative that is the largest of its kind in the United States. These $150,000 grants support Bay Area nonprofits working with world-class artists on major new music compositions spanning myriad genres including chamber, electronic, jazz, opera, and hip hop. These commissions honor the Hewlett Foundation’s 50th anniversary, commemorating decades of leadership in the Bay Area arts world.
“The Hewlett 50 Arts Commissions are a symbol of the foundation’s longstanding commitment to performing arts in the Bay Area,” said Larry Kramer, president of the Hewlett Foundation. “We believe the awards will fund the creation of new musical works of lasting significance that are as dynamic and diverse as the Bay Area communities where they will premiere.”
“Sonic Web” is conceived to push boundaries in both music and technology. New media artist Greg Niemeyer will build an original Sonic Web Instrument: a large touchscreen with a software tool to draw network diagrams. It will enable DJ Spooky to build and take apart simple networks using sampled sounds from the Internet Archive, further layered by a vocalist and string quartet.
“Sonic Web will dig into the big crate of the Internet Archive and remix internet history in a new, networked way,” says Greg Niemeyer. “We will break out of linear musical structures towards a more networked and connected sound.”
The artists will also take these tools on the road, partnering with Berkeley Center for New Media, Stanford Live, Youth Radio, and Bay Area high schools for music and technology workshops and a service learning course at UC Berkeley.
The work will premiere at the Internet Archive Great Room during the summer of 2018. We will also provide free global access to a downloadable Sonic Web album with music videos and the livestream of the premiere at archive.org.
NOTE: DJ Spooky, Niemeyer and the Internet Archive collaborated in 2016 to create “Memory Palace,” a new multimedia work performed at our own 20th anniversary celebration. For a taste of what’s to come, watch this.
A biweekly round up on what’s happening at the TV News Archive by Katie Dahl and Nancy Watzman
Fox News downplayed Mueller indictment, according to NYT editorial chyron analysis
In the most intensive use of the Internet Archive’s Third Eye data to date, The New York Times editorial page analyzed chyron data to show how Fox News downplayed this week’s news of the indictment of former Trump campaign manager Paul Manafort and other legal developments. The graphic-heavy opinion piece was featured at the top of the online homepage for much of the day on Wednesday, Nov. 1:
Though it is far from the only possible way to evaluate news coverage, the chyron has become something of a touchstone for media analysts, being both the most obvious visual example of spin or distraction and the most shareable. Any negative coverage of the president usually prompts a flurry of tweets cataloguing the differences among networks in their chyron text. While CNN, MSNBC and the BBC are typically in alignment, Monday morning was a particularly stark example of how Fox News pushes its own version of reality.
Fox News actively tried to “plant doubt in viewers’ minds” as Mueller brought charges against former Trump campaign officials, according to an analysis of a week’s worth of closed captions by Alvin Chang of Vox. Chang used Television Explorer, fueled by TV News Archive data, to crunch the numbers behind charts such as the one below.
And The Trace, an independent, nonprofit news organization that focuses on gun violence, used TV News Archive caption data via Television Explorer to show how TV news coverage of mass shootings declines quickly.
— Daniel Nass (@dnlnss) October 27, 2017
Face-o-Matic captures congressional leaders’ reactions on indictments
In the 24 hours following news breaking about the indictments, our Face-o-Matic data feed captured cable news networks’ editorial choices on how much face-time to allot to congressional leaders’ reactions. The answer: not much.
Altogether, the four congressional leaders’ faces were shown for a total of 2.5 minutes in indictment-related reporting on CNN, Fox News, and MSNBC. Ryan got the lion’s share of the attention. Much of this was devoted to airings of his photo in connection with his official statement, “[N]othing is going to derail what we are doing in Congress, because we are working on solving people’s problems.”
The image of Senate Majority Leader Mitch McConnell, R., Ky., was not featured by any network. House Minority Leader Nancy Pelosi, D., Calif., got attention only from Fox News, which featured her photo with discussion of her statement, in which she said despite the news, “we still need an outside fully independent investigation.”
Fact-check: Papadopoulos had a limited role in Trump campaign (had seat at table/not the whole story)
One of the most parsed statements this week was White House press secretary Sarah Huckabee Sanders’ claim that George Papadopoulos, who pleaded guilty to lying to the FBI, had an “extremely limited” role in the campaign. “It was a volunteer position,” she said. “And again, no activity was ever done in an official capacity on behalf of the campaign.”
“Determining how important Papadopoulos was on the Trump team is open to interpretation, so we won’t put this argument to the Truth-O-Meter,” wrote Louis Jacobson, reporting for PolitiFact. Jacobson, however, laid out the known facts. For example, in March 2016, then-presidential candidate Donald Trump tweeted out a photo of himself and advisors sitting at a table, saying it was a “national security meeting.” Papadopoulos is seen at the table sitting near future Attorney General Jeff Sessions. However, Jacobson also writes, “There is some evidence to support the argument that Papadopoulos was freelancing by pushing the Russia connection.”
Reviewing Sanders’ claim, as well as a Trump tweet along similar lines, Robert Farley and Eugene Kiely took a similar tack for FactCheck.org, concluding that Papadopoulos had a “seat at the table” in the campaign and that his role went beyond licking envelopes and posting lawn signs: “What we do know is that during this time — from late March to mid-August — Papadopoulos was in regular contact with senior Trump campaign officials and attended a national security meeting with Trump. We will let readers decide if this constitutes a ‘low-level volunteer.’”
Embed TV News Archive clips on web annotations
Now you can embed TV News Archive news clips when commenting and annotating the web, thanks to a new integration from Hypothes.is. From the Hypothes.is blog:
This integration makes it easy for journalists, fact-checkers, educators, scholars and anyone that wants to relate specific text in a webpage, PDF, or EPUB to a particular snippet of video news coverage. All you need to do to use it is copy the URL of a TV News Archive video page, paste it into the Hypothesis annotation editor and save your annotation. You can adjust the start and end of the video to include any exact snippet. The video will then automatically be available to view in your annotation alongside the annotated text.
See a live example of the integration in this annotation with an embedded news video of Senator Charles Schumer at a news conference over a post that checks the facts in one of his statements.
“This integration means that one of the world’s most valuable resources — the news that the Internet Archive captures across the world everyday — will be able to be brought into close context with pages and documents across the web,” said Hypothesis CEO Dan Whaley. “For instance, a video of a politician making an actual statement next to an excerpt that claims the opposite, or a video of a newsworthy event next to a deeper analysis of it.”
Please take Hypothes.is for a spin and let us know what you think: firstname.lastname@example.org.
The Internet Archive will host the “Dodging the Memory Hole” (DTMH) forum on Nov. 15 and 16. This will be the fifth in the series of outreach efforts over the past four years. Presented by the Donald W. Reynolds Journalism Institute, with support from the Institute of Museum and Library Services, the conference will address issues related to archiving and access to online news.
We are happy to be able to present a range of people, and projects, involved in a wide cross-section of activities related to news archiving, representing local, national and world-wide efforts. As a bonus, our special guest speaker, Daniel Ellsberg, will highlight the value of the First Amendment and the need to make sure the public has free access to accurate information in the digital age.
News has been called the “first rough draft of history.” Some think the risk to this history is at an all time high. The possibility exists that large portions of our cultural record, as captured by journalists and others, will be lost forever if no action is taken to provide long-term solutions for access. The loss of digital records is happening at an unprecedented pace – faster than the loss of comparable print and analog resources. Access and preservation are two sides of the same coin in this regard.
The Internet Archive has become increasingly important as a means of collecting and preserving online news content. As if the challenges of capturing more traditional news sources such as newspapers and television stations aren’t enough, the rise of social media as major distribution channels has made it even more difficult to address the complex set of issues involved. Since many of the challenges end up being technical in nature, bringing Internet Archive staff together with the DTMH community offers the chance to identify problems and approach solutions to some of the stumbling blocks we’ve encountered at this point in the journey.
Journalists, memory institutions, technologists, historians, political scientists and anyone with an interest in having long-term access to a trustworthy and accurate record of life in the digital age will find this gathering of interest. I urge anyone interested in this urgent and important issue to come join us at the Internet Archive on Nov. 15-16. We have a limited number of seats available. Registration is required, but it is free. If you want to register in time for us to order food for you, please register by Monday, Oct. 30. Final cutoff for registrations is Nov. 5. I hope to see you there!
For more information, and to register, click here.
Chatbots and voice interaction are hot topics right now. New services such as Facebook Messenger and Amazon Alexa have become popular quickly. Publishers are exploring how to make their content multimodal, so that users can access content in varied ways on different devices. User interactions may be either screen-based or audio-based, and will sometimes be hands-free.
Multimodal content could change how content is planned and delivered. Numerous discussions have looked at one aspect of conversational interaction: planning and writing sentence-level scripts. Content structure is another dimension relevant to voice interaction, chatbots and other forms of multimodal content. Structural metadata can support the reuse of existing web content to support multimodal interaction. Structural metadata can help publishers escape the tyranny of having to write special content for each distinct platform.
Seamless Integration: The Challenge for Multimodal Content
In-Vehicle Infotainment (IVI) systems such as Apple’s CarPlay illustrate some of the challenges of multimodal content experiences. Apple’s Human Interface Guidelines state: “On-screen information is minimal, relevant, and requires little decision making. Voice interaction using Siri enables drivers to control many apps without taking their hands off the steering wheel or eyes off the road.” People will interact with content hands-free, and without looking. CarPlay includes six distinct inputs and outputs:
- Car Data
- Knobs and Controls
- Voice (Siri)
The CarPlay UIKit even includes “Drag and Drop Customization”. When I review these details, much seems as if it could be distracting to drivers. Apple states with CarPlay “iPhone apps that appear on the car’s built-in display are optimized for the driving environment.” What that iPhone app optimization means in practice could determine whether the driver gets in an accident.
CarPlay: if it looks like an iPhone, does it act like an iPhone? (screenshot via Apple)
Multimodal content promises seamless integration between different modes of interaction, for example, reading and listening. But multimodal projects carry a risk as well if they try to port smartphone or web paradigms into contexts that don’t support them. Publishers want to reuse content they’ve already created. But they can’t expect their current content to suffice as it is.
In a previous post, I noted that structural metadata indicates how content fits together. Structural metadata is a foundation of a seamless content experience. That is especially true when working with multimodal scenarios. Structural metadata will need to support a growing range of content interactions, involving distinct modes. A mode is a form of engaging with content, both in terms of requesting and receiving information. A quick survey of these modes suggests many aspects of content will require structural metadata.

| Platform Example | Input Mode | Output Mode |
| --- | --- | --- |
| Chatbots | Typing | Text |
| Devices with Mic & Display | Speaking | Visual (Video, Text, Images, Tables) or Audio |
| Smart Speakers | Speaking | Audio |
| Camera/IoT | Showing or Pointing | Visual or Audio |
Multimodal content will force content creators to think more about content structure. Multimodal content encompasses all forms of media, from audio to short text messages to animated graphics. All these forms present content in short bursts. When focused on other tasks, users aren’t able to read much, or listen very long. Steven Pinker, the eminent cognitive psychologist, notes that humans can only retain three or four items in short term memory (contrary to the popular belief that people can hold 7 items). When exploring options by voice interaction, for example, users can’t scan headings or links to locate what they want. Instead of the user navigating to the content, the content needs to navigate to the user.
Structural metadata provides information to machines to choose appropriate content components. Structural metadata will generally be invisible to users, especially when working with screen-free content. Behind the scenes, the metadata indicates hidden structures that are important to retrieving content in various scenarios.

Metadata is meant to be experienced, not seen. A photo of an Amazon customer’s Echo Show, revealing code (via Amazon)

Optimizing Content With Structural Metadata
When interacting with multimodal content, users have limited attention, and a limited capacity to make choices. This places a premium on optimizing content so that the right content is delivered, and so that users don’t need to restate or reframe their requests.
Existing web content is generally not optimized for multimodal interaction — unless the user is happy listening to a long article being read aloud, or seeing a headline cropped in mid-sentence. Most published web content today has limited structure. Even if the content was structured during planning and creation, once delivered, the content lacks structural metadata that allows it to adapt to different circumstances. That makes it less useful for multimodal scenarios.
In the GUI paradigm of the web, users are expected to continually make choices by clicking or tapping. They see endless opportunities to “vote” with their fingers, and this data is enthusiastically collected and analyzed for insights. Publishers create lots of content, waiting to see what gets noticed. Publishers don’t expect users to view all their content, but they expect users to glance at their content, and scroll through it until users have spotted something enticing enough to view.
Multimodal content shifts the emphasis away from planning delivery of complete articles, and toward delivering content components on-demand, which are described by structural metadata. Although screens remain one facet of multimodal content, some content will be screen-free. And even content presented on screens may not involve a GUI: it might be plain text, such as with a chatbot. Multimodal content is post-GUI content. There are no buttons, no links, no scrolling. In many cases, it is “zero tap” content — the hands will be otherwise occupied driving, cooking, or minding children. Few users want to smudge a screen with cookie dough on their hands. Designers will need to unlearn their reflexive habit of adding buttons to every screen.
Users will express what they want, by speaking, gesturing, and if convenient, tapping. To support zero-tap scenarios successfully, content will need to get smarter, suggesting the right content, in the right amount. Publishers can no longer present an endless salad bar of options, and expect users to choose what they want. The content needs to anticipate user needs, and reduce demands on the user to make choices.
Users will always want to choose what topics they are interested in. They may be less keen on actively choosing the kind of content to use. Visiting a website today, you find articles, audio interviews, videos, and other content types to choose from. Unlike the scroll-and-scan paradigm of the GUI web, multimodal content interaction involves an iterative dialog. If the dialog lasts too long, it gets tedious. Users expect the publisher to choose the most useful content about a topic that supports their context.

Pattern: after saying what you want information about, now tell us how you’d like it (screenshot via Google News)
In the current use pattern, the user finds content about a topic of interest (topic criteria), then filters that content according to format preferences. In future, publishers will be more proactive deciding what format to deliver, based on user circumstances.
Structural metadata can help optimize content, so that users don’t have to choose how they get information. Suppose the publisher wants to show something to the user. They have a range of images available. Would a photo be best, or a line drawing? Without structural metadata, both are just images portraying something. But if structural metadata indicates the type of image (photo or line drawing), then deeper insights can be derived. Images can be A/B tested to see which type is most effective.
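As a rough illustration of that kind of test, the sketch below tallies engagement by image type. The log format, the type labels, and the numbers are all made up for the example; a real test would also need enough traffic to be statistically meaningful.

```python
from collections import defaultdict

# Hypothetical interaction log: (image_type, user_engaged) pairs.
# The values are invented purely to show the calculation.
impressions = [
    ("photo", True), ("photo", False), ("photo", True), ("photo", False),
    ("line-drawing", True), ("line-drawing", True), ("line-drawing", False),
]

def engagement_by_type(log):
    """Return the share of impressions that led to engagement, per image type."""
    shown = defaultdict(int)
    engaged = defaultdict(int)
    for image_type, was_engaged in log:
        shown[image_type] += 1
        if was_engaged:
            engaged[image_type] += 1
    return {t: engaged[t] / shown[t] for t in shown}

rates = engagement_by_type(impressions)
for image_type, rate in rates.items():
    print(f"{image_type}: {rate:.0%} engagement")
```

The comparison is only possible because each image carries a type label; without it, the two groups could not be told apart.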
A/B testing of content according to its structural properties can yield insights into user preferences. For example, a major issue will be learning how much to chunk content. Is it better to offer larger size chunks, or smaller ones? This issue involves the tradeoffs for the user between the costs of interaction, memory, and attention. By wrapping content within structural metadata, publishers can monitor how content performs when it is structured in alternative ways.

Component Sequencing and Structural Metadata

Multimodal content is not delivered all at once, as is the case with an article. Multimodal content relies on small chunks of information, which act as components. How to sequence these components is important.

Alexa showing some cards on an Echo Show device (via Amazon)
Screen-based cards are a tangible manifestation of content components. A card could show the current weather, or a basketball score. Cards, ideally, are “low touch.” A user wants to see everything they need on a single card, so they don’t need to interact with buttons or icons on the card to retrieve the content they want. Cards are post-GUI, because they don’t rely heavily on forms, search, links and other GUI affordances. Many multimodal devices have small screens that can display a card-full of content. They aren’t like a smartphone, cradled in your hand, with a screen that is scrolled. An embedded screen’s purpose is primarily to display information rather than for interaction. All information is visible on the card [screen], so that users don’t need to swipe or tap. Because most of us are accustomed to using screen-based cards already, but may be less familiar with screen-free content, cards provide a good starting point for considering content interaction.
Cards let us consider components both as units (providing an amount of content) and as plans (representing a purpose for the content). User experiences are structured from smaller units of content, but these units need to have a cohesive purpose. Content structure is more than breaking content into smaller pieces. It is about indicating how those pieces can fit together. In the case of multimodal content, components need to fit together as an interaction unfolds.
Each card represents a specific type of content (recipe, fact box, news headline, etc.), which is indicated with structural metadata. The cards also present information in a sequence of some sort.[1] Publishers need to know how various types of components can be mixed and matched. Some component structures are intended to complement each other, while other structures work independently.
Content components can be sequenced in three ways. They can be:
- Modular
- Fixed
- Adaptive
Truly modular components can be sequenced in any order; they have no intrinsic sequence. They provide information in response to a specific task. Each task is assumed to be unrelated. A card providing an answer to the question of “What is the height of Mount Everest?” will be unrelated to a card answering the question “What is the price of Facebook stock?”
The technical documentation community uses an approach known as topic-based writing that attempts to answer specific questions modularly, so that every item of content can be viewed independently, without need to consult other content. In principle, this is a desirable goal: questions get answered quickly, and users retrieve the exact information they need without wading through material they don’t need. But in practice, modularity is hard to achieve. Only trivial questions can be answered on a card. If publishers break a topic into several cards, they should indicate the relations between the information on each card. Users get lost when information is fragmented into many small chunks, and they are forced to find their way through those chunks.
Modular content structures work well for discrete topics, but are cumbersome for richer topics. Because each module is independent of others, users, after viewing the content, need to specify what they want next. The downside of modular multimodal content is that users must continually specify what they want in order to get it.
Components can be sequenced in a fixed order. An ordered list is a familiar example of structural metadata indicating a fixed order. Narratives are made from sequential components, each representing an event that happens over time. The narrative could be a news story, or a set of instructions. When considered as a flow, a narrative involves two kinds of choices: whether to get details about an event in the narrative, or whether to get to the next event in the narrative. Compared with modular content, fixed sequence content requires less interaction from the user, but longer attention.
Adaptive sequencing manages components that are related, but can be approached in different orders. For example, content about an upcoming marathon might include registration instructions, sponsorship info, a map, and event timing details, each as a separate component/card. After viewing each card, users need options that make sense, based on content they’ve already consumed, and any contextual data that’s available. They don’t want too many options, and they don’t want to be asked too many questions. Machines need to figure out what the user is likely to need next, without being intrusive. Does the user need all the components now, or only some now?
Adaptive sequencing is used in learning applications; learners are presented with a progression of content matching their needs. It can utilize recommendation engines, suggesting related components based on choices favored by others in a similar situation. An important application of adaptive sequencing is deciding when to ask a detailed question. Is the question going to be valuable for providing needed information, or is the question gratuitous? A goal of adaptive sequencing is to reduce the number of questions that must be asked.
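A minimal sketch of the idea, using the marathon example: after each card, offer only a couple of components the user has not yet seen, ranked by how often others chose them. The `popularity` scores and the `suggest_next` helper are hypothetical, standing in for whatever a recommendation engine would supply.

```python
def suggest_next(components, seen, max_options=2):
    """Offer a few unseen components, most popular first."""
    unseen = [c for c in components if c["id"] not in seen]
    unseen.sort(key=lambda c: c["popularity"], reverse=True)
    return [c["id"] for c in unseen[:max_options]]

# Hypothetical components for the marathon; popularity values are invented.
marathon = [
    {"id": "registration", "popularity": 0.9},
    {"id": "sponsorship", "popularity": 0.2},
    {"id": "map", "popularity": 0.7},
    {"id": "timing", "popularity": 0.6},
]

options = suggest_next(marathon, seen={"registration"})
print(options)
```

Capping the list with `max_options` reflects the constraint described above: few options, and no extra questions to the user.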
Structural metadata generally does not explicitly address temporal sequencing, because (until now) publishers have assumed all content would be delivered at once on a single web page. For fixed sequences, attributes are needed to indicate order and dependencies, to allow software agents to follow the correct procedure when displaying content. Fixed sequences can be expressed by properties indicating step order, rank order, or event timing. Adaptive sequencing is more programmatic. Publishers need to indicate the relation of components to parent content type. Until standards catch up, publishers may need to indicate some of these details in the data-* attribute.
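To make the fixed-sequence case concrete, here is a small sketch that reads step order from a data-* attribute and puts components in the right order regardless of how they appear in the markup. The attribute name `data-step` is an assumption chosen for illustration, not part of any standard vocabulary, and the parser assumes each step element contains only plain text.

```python
from html.parser import HTMLParser

class StepCollector(HTMLParser):
    """Collect (order, text) pairs from elements carrying a data-step attribute."""

    def __init__(self):
        super().__init__()
        self.steps = []
        self._current = None  # [order, text] while inside a step element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-step" in attrs:
            self._current = [int(attrs["data-step"]), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[1] += data

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the step (no nested markup).
        if self._current is not None:
            self.steps.append((self._current[0], self._current[1].strip()))
            self._current = None

# Hypothetical markup, deliberately out of order.
markup = """
<div data-step="2">Loosen the lug nuts.</div>
<div data-step="1">Park on level ground.</div>
<div data-step="3">Raise the car with the jack.</div>
"""

collector = StepCollector()
collector.feed(markup)
ordered = [text for _, text in sorted(collector.steps)]
print(ordered)
```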
The sequencing of cards illustrates how new patterns of content interaction may necessitate new forms of structural metadata.

Composition and the Structure of Images
One challenge in multimodal interaction is how users and systems talk about images, as either an input (via a camera), or as an output. We are accustomed to reacting to images by tapping or clicking. We now have the chance to show things to systems, waving an object in front of a camera. Amazon has even introduced a hands-free voice activated IoT camera that has no screen. And when systems show us things, we may need to talk about the image using words.
Machine learning is rapidly improving, allowing systems to recognize objects. That will help machines understand what an item is. But machines still need to understand the structural relationship of items that are in view. They need to understand ordinary concepts such as near, far, next to, close to, background, group of, and other relational terms. Structural metadata could make images more conversational.
Vector graphics are composed of components that can represent distinct ideas, much like articles that are composed of structural components. That means vector images can be unbundled and assembled differently. The WAI-ARIA standard for web accessibility has an SVG Graphics Module that covers how to mark up vector images. It includes properties to add structural metadata to images, such as group (a role indicating similar items in the image) and background (a label for elements in the image in the background). Such structural metadata could be useful for users interacting with images using voice commands. For example, the user might want to say, “Show me the image without a background” or “with a different background”.
Photos do not have interchangeable components the way that vector graphics do. But photos can present a structural perspective of a subject, revealing part of a larger whole. Photos can benefit from structural metadata that indicates the type of photo. For example, if a user wants a photo of a specific person, they might have a preference for a full-length photo or for a headshot. As digital photography has become ubiquitous, many photos are available of the same subject that present different dimensions of the subject. All these dimensions form a collection, where the compositions of individual photos reveal different parts of the subject. The IPTC photo metadata schema includes a controlled vocabulary for “scenes” that covers common photo compositions: profile, rear view, group, panoramic view, aerial view, and so on. As photography embraces more kinds of perspectives, such as aerial drone shots and omnidirectional 360 degree photographs, the value of perspective and scene metadata will increase.
For voice interaction with photo images to become seamless, machines will need to connect conversational statements with image representations. Machines may hear a command such as “show me the damage to the back bumper,” and must know to show a photo of the rear view of a car that’s been in an accident. Sometimes users will get a visual answer to a question that’s not inherently visual. A user might ask: “Who will be playing in Saturday’s soccer game?”, and the display will show headshots of all the players at once. To provide that answer, the platform will need structural metadata indicating how to present an answer in images, and how to retrieve players’ images appropriately.
Structural metadata for images lags behind structural metadata for text. Working with images has been labor intensive, but structural metadata can help with the automated processing of image content. Like text, images are composed of different elements that have structural relationships. Structural metadata can help users interact with images more fluidly.

Reusing Text Content in Voice Interaction

Voice interaction can be delivered in various ways: through natural language generation, through dedicated scripting, and through the reuse of existing text content. Natural language generation and scripting are especially effective in short answer scenarios, for example, “What is today’s 30-year mortgage rate?” Reusing text content is potentially more flexible, because it lets publishers address a wide scope of topics in depth.
While reusing written text in voice interactions can be efficient, it can potentially be clumsy as well. The written text was created to be delivered and consumed all at once. It needs some curation to select which bits work most effectively in a voice interaction.
The WAI-ARIA standards for web accessibility offer lessons on the difficulties and possibilities of reusing written content to support audio interaction. By becoming familiar with what ARIA standards offer, we can better understand how structural metadata can support voice interactions.
ARIA standards seek to reduce the burdens of written content for people who can’t scan or click through it easily. Much web content contains unnecessary interaction: lists of links, buttons, forms and other widgets demanding attention. ARIA encourages publishers to prioritize these interactive features with the tabindex attribute. It offers a way to help users fill out forms they must submit to get to content they want. But given a choice, users don’t want to fill out forms by voice. Voice interaction is meant to dispense with these interactive elements. Voice interaction promises conversational dialog.
Talking to a GUI is awkward. Listening to written web content can also be taxing. The ARIA standards enhance the structure of written content, so that content is more usable when read aloud. ARIA guidelines can help inform how to indicate structural metadata to support voice interaction.
ARIA encourages publishers to curate their content: to highlight the most important parts that can be read aloud, and to hide parts that aren’t needed. ARIA designates content with landmarks. Publishers can indicate what content has role="main", or they can designate parts of content by region. The ARIA standard states: “A region landmark is a perceivable section containing content that is relevant to a specific, author-specified purpose and sufficiently important that users will likely want to be able to navigate to the section easily and to have it listed in a summary of the page.” ARIA also provides a pattern for disclosure, so that not all text is presented at once. All of these features allow publishers to indicate more precisely the priority of different components within the overall content.
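As an illustration of how landmarks let software pick out the part of a page worth reading aloud, the sketch below keeps only text inside an element marked role="main" (or the HTML main element) and skips navigation and sidebars. The sample page and the `MainLandmarkText` class are invented for the example, and this is not how any particular screen reader or voice assistant works.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not affect nesting depth.
VOID = {"br", "img", "hr", "input", "meta", "link"}

class MainLandmarkText(HTMLParser):
    """Collect text found inside elements marked as the main landmark."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the landmark; 0 means outside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif tag == "main" or dict(attrs).get("role") == "main":
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

# Hypothetical page with three landmarks.
page = """
<nav role="navigation"><a href="/">Home</a></nav>
<div role="main"><h1>Mortgage rates</h1><p>Rates rose this week.</p></div>
<aside role="complementary">Related links</aside>
"""

reader = MainLandmarkText()
reader.feed(page)
print(reader.chunks)
```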
ARIA supports screen-free content, but it is designed primarily for keyboard/text-to-speech interaction. Its markup is not designed to support conversational interaction; schema.org’s pending speakable specification, mentioned in my previous post, may be a better fit. But some ARIA concepts suggest the kinds of structures that written text needs to work effectively as speech. When content conveys a series of ideas, users need to know what are major and minor aspects of text they will be hearing. They need the spoken text to match the time that’s available to listen. Just as some word processors can provide an “auto summary” of a document by picking out the most important sentences, voice-enabled text will need to identify what to include in a short version of the content. The content might be structured in an inverted pyramid, so that only the heading and first paragraph are read in the short version. Users may even want the option of hearing a short version or a long version of a story or explanation.

Structural metadata and User Intent in Voice Interaction
Structural metadata will help conversational interactions deliver appropriate answers. On the input side, when users are speaking, the role of structural metadata is indirect. People will state questions or commands in natural language, which will be processed to identify synonyms, referents, and identifiable entities, in order to determine the topic of the statement. Machines will also look at the construction of the statement to determine the intent, or the kind of content sought about the topic. Once the intent is known — what kind of information the user is seeking — it can be matched with the most useful kind of content. It is on the output side, when users view or hear an answer, that structural metadata plays an active role selecting what content to deliver.
Already, search engines such as Google rely on structural metadata to deliver specific answers to speech queries. A user can ask Google the meaning of a word or phrase (What does ‘APR’ mean?) and Google locates a term that’s been tagged with structural metadata indicating a definition, such as with the HTML element <dfn>.
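The sketch below shows, in a simplified way, how software can locate terms tagged with the dfn element. It illustrates the markup only and makes no claim about how Google processes pages; the sample sentence and the `DefinedTerms` class are invented for the example.

```python
from html.parser import HTMLParser

class DefinedTerms(HTMLParser):
    """Collect the text of every <dfn> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_dfn = False
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag == "dfn":
            self.in_dfn = True
            self.terms.append("")

    def handle_endtag(self, tag):
        if tag == "dfn":
            self.in_dfn = False

    def handle_data(self, data):
        if self.in_dfn:
            self.terms[-1] += data

# Hypothetical snippet of published content.
snippet = "<p><dfn>APR</dfn> stands for annual percentage rate.</p>"

finder = DefinedTerms()
finder.feed(snippet)
print(finder.terms)
```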
When a machine understands the intent of a question, it can present content that matches the intent. If a user asks a question starting with the phrase Show me… the machine can select a clip or photograph about the object, instead of presenting or reading text. Structural metadata about the characteristics of components makes that matching possible.
Voice interaction supplies answers to questions, but not all answers will be complete in a single response. Users may want to hear alternative answers, or get more detailed answers. Structural metadata can support multi-answer questions.
Schema.org metadata indicates content that answers questions using the Answer type, which is used by many forums and Q&A pages. Schema.org distinguishes between two kinds of answers. The first, acceptedAnswer, indicates the best or most popular answer, often the answer that received most votes. But other answers can be indicated with a property called suggestedAnswer. Alternative answers can be ranked according to popularity as well. When sources have multiple answers, users can get alternative perspectives on a question. After listening to the first “accepted” answer, the user might ask “tell me another opinion” and a popular “suggested” answer could be read to them.
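A small sketch of that interaction: the accepted answer is read first, and each “tell me another opinion” request moves to the next suggested answer by vote count. The question data is invented; `acceptedAnswer`, `suggestedAnswer`, and `upvoteCount` are schema.org property names, while the `answers_in_order` helper is hypothetical.

```python
# Illustrative JSON-LD-style data expressed as a Python dict; the content is made up.
question = {
    "@context": "https://schema.org",
    "@type": "Question",
    "name": "Example question",
    "acceptedAnswer": {"@type": "Answer", "text": "Accepted answer", "upvoteCount": 42},
    "suggestedAnswer": [
        {"@type": "Answer", "text": "Less popular suggestion", "upvoteCount": 7},
        {"@type": "Answer", "text": "Popular suggestion", "upvoteCount": 19},
    ],
}

def answers_in_order(q):
    """Accepted answer first, then suggested answers by descending votes."""
    accepted = q.get("acceptedAnswer")
    suggested = q.get("suggestedAnswer", [])
    if isinstance(suggested, dict):  # a single answer may appear without a list
        suggested = [suggested]
    ranked = sorted(suggested, key=lambda a: a.get("upvoteCount", 0), reverse=True)
    return ([accepted] if accepted else []) + ranked

playlist = [a["text"] for a in answers_in_order(question)]
print(playlist)
```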
Another kind of multi-part answer involves “How To” instructions. The HowTo type indicates “instructions that explain how to achieve a result by performing a sequence of steps.” The example the schema.org website provides to illustrate the use of this type involves instructions on how to change a tire on a car. Imagine tire-changing instructions being read aloud on a smartphone or by an in-vehicle infotainment system as the driver tries to change a flat tire along a desolate roadway. This is a multi-step process, so the content needs to be retrievable in discrete chunks.
Schema.org includes several additional types related to HowTo that structure the steps into chunks, including preconditions such as tools and supplies required. These are:
- HowToSection : “A sub-grouping of steps in the instructions for how to achieve a result (e.g. steps for making a pie crust within a pie recipe).”
- HowToDirection : “A direction indicating a single action to do in the instructions for how to achieve a result.”
- HowToSupply : “A supply consumed when performing the instructions for how to achieve a result.”
- HowToTool : “A tool used (but not consumed) when performing instructions for how to achieve a result.”
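The types listed above can be combined roughly as follows. This is a sketch rather than validated markup: the tire-changing content is invented, and nesting HowToDirection items under a HowToSection through `itemListElement` with a `position` value is one plausible arrangement, assumed here for illustration. The `spoken_chunks` helper is hypothetical.

```python
# Illustrative JSON-LD-style data as a Python dict; the wording is made up.
how_to = {
    "@context": "https://schema.org",
    "@type": "HowTo",
    "name": "How to change a flat tire",
    "tool": [
        {"@type": "HowToTool", "name": "Jack"},
        {"@type": "HowToTool", "name": "Lug wrench"},
    ],
    "step": [
        {
            "@type": "HowToSection",
            "name": "Remove the flat",
            "itemListElement": [
                {"@type": "HowToDirection", "position": 2,
                 "text": "Raise the car with the jack."},
                {"@type": "HowToDirection", "position": 1,
                 "text": "Loosen the lug nuts."},
            ],
        },
    ],
}

def spoken_chunks(doc):
    """Split a HowTo into short utterances: tools first, then one direction each."""
    chunks = []
    tools = [t["name"] for t in doc.get("tool", [])]
    if tools:
        chunks.append("You will need: " + ", ".join(tools) + ".")
    for section in doc.get("step", []):
        directions = sorted(section.get("itemListElement", []),
                            key=lambda d: d["position"])
        for d in directions:
            chunks.append(f"{section['name']}, step {d['position']}: {d['text']}")
    return chunks

chunks = spoken_chunks(how_to)
for chunk in chunks:
    print(chunk)
```

Each chunk is short enough to be read aloud on its own, and the listener can ask for the next one when ready.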
These structures can help the content match the intent of users as they work through a multi-step process. The different chunks are structurally connected through the step property. Only the HowTo type (and its more specialized subtype, Recipe) currently accepts the step property and thus can address temporal sequencing.

Content Agility Through Structural Metadata
Chatbots, voice interaction and other forms of multimodal content promise a different experience than is offered by screen-centric GUI content. While it is important to appreciate these differences, publishers should also consider the continuities between traditional and emerging paradigms of content interaction. They should be cautious before rushing to create new content. They should start with the content they have, and see how it can be adapted before making content they don’t have.
A decade ago, the emergence of smartphones and tablets triggered an app development land rush. Publishers obsessed over the discontinuity these new devices presented, rather than recognizing their continuity with existing web browser experiences. Publishers created multiple versions of content for different platforms. Responsive web design emerged to remedy the siloing of development. The app bust shows that parallel, duplicative, incompatible development is unsustainable.
Existing content is rarely fully ready for an unpredictable future. The idealistic vision of single source, format free content collides with the reality of new requirements that are fitfully evolving. Publishers need an option between the extremes of creating many versions of content for different platforms, and hoping one version can serve all platforms. Structural metadata provides that bridge.
Publishers can use structural metadata to leverage content they have already that could be used to support additional forms of interaction. They can’t assume they will directly orchestrate the interaction with the content. Other platforms such as Google, Facebook or Amazon may deliver the content to users through their services or devices. Such platforms will expect content that is structured using standards, not custom code.
Sometimes publishers will need to enhance existing content to address the unique requirements of voice interaction, or differences in how third party platforms expect content. The prospect of enhancing existing content is preferable to creating new content to address isolated use case scenarios. Structural metadata by itself won’t make content ready for every platform or form of interaction. But it can accelerate its readiness for such situations.
— Michael Andrews
[1] Dialogs in chatbots and voice interfaces also involve sequences of information. But how to sequence a series of cards may be easier to think about than a series of sentences, since viewing cards doesn’t necessarily involve a series of back and forth questions.
The post Seamless: Structural Metadata for Multimodal Content appeared first on Story Needle.
A weekly round up on what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.
All three major U.S. cable news networks covered President Donald Trump’s impromptu press conference with Sen. Mitch McConnell, R., Ky., on Monday, October 16, but there were notable differences in their editorial choices for chyrons – the captions that appear in real-time on the bottom third of the screen – throughout the broadcast. We used the TV News Archive’s new Third Eye chyron extraction data tool to demonstrate these differences, similar to how The Washington Post examined FBI director James B. Comey’s hearing in June 2017.
The beauty of the Third Eye tool is you can do this too, any time there is breaking news or other widely covered live events, like yesterday’s Senate Judiciary Committee hearing where AG Jeff Sessions testified (7:31am-9:46am PT) or the October 5 White House briefing about Puerto Rico (11:20am-11:48am PT). Third Eye data – which includes chyrons from BBC News, CNN, Fox News, and MSNBC – is available for data download, via API, in both raw and filtered formats. (Get into the weeds over on the Third Eye collection page.) Please take Third Eye for a spin, and let us know if you have questions: email@example.com or @tvnewsarchive.
For example, at 11:03 PT, Trump began answering a question about pharmaceutical companies “making money.” MSNBC chooses a chyron that characterizes Trump’s statements as a claim, whereas Fox News displays Trump’s assertion that Obamacare is a disaster. CNN goes with a chyron saying that Trump is “very happy” to end Obamacare subsidies. In the following minute, 11:04, Fox News chooses other bold statements from Trump: “I do not need pharma money” and “I want tax reform this year.” CNN’s chyron instead says Trump “would like to see” tax reform, a less bold statement.
(Note: these are representative chyrons from the minute period and did not necessarily display for the full 60-second period.)
Later in the press conference, the discussion turns to natural disasters before then focusing on the proposed wall on the border with Mexico. Again, Fox News features Trump making bold, simple assertions: “we are getting high marks for our hurricane response,” and “PR was in bad shape before the storm hit.” MSNBC instead uses the word “claims”: “Trump claims Puerto Rico now has more generators than any place in the world.”
The day following The Washington Post-60 Minutes report on legislation passed by Congress and signed by President Barack Obama to weaken the authority of the Drug Enforcement Administration, Sen. Claire McCaskill, D., Mo., called for repeal of the law. In an interview, she also said, “Now, I did not go along with this. I wasn’t here at the time. I was actually out getting breast cancer treatment. I don’t know that I would have objected. I like to believe I would have, but the bottom line is, once the DEA [Drug Enforcement Administration] kind of, the upper levels at the DEA obviously said it was okay, that’s what gave it the green light.”
But “despite her claim that she ‘wasn’t here at the time,’ McCaskill was clearly back at the Senate, participating in votes and hearings,” according to Glenn Kessler of The Washington Post’s Fact Checker. “McCaskill’s staff acknowledged the error, saying that they had forgotten she had come back at that time. ‘It was sloppy on our part, and we take responsibility,’ a spokesman said.”

Fact-check: Pressure from Trump led to stepped up NATO members’ defense spending (half true)
In an interview on October 15, Secretary of State Rex Tillerson said, “The president early on called upon NATO member countries to step up their contributions — step up their commitment to NATO, modernize their own forces… He’s been very clear, and as a result of that countries have stepped up contributions toward their own defense.”
PolitiFact reporter Allison Graves found that “25 NATO allies plan to increase spending in real terms in 2017.” And “according to NATO, over the last 3 years, European allies and Canada spent almost $46 billion more on defense, meaning increases in spending have occurred before Trump’s presidency. Experts said it’s possible that Trump’s pressure has contributed to the continuation of the upward trend, but Tillerson’s explanation glazes over the other factors that have led to increases, including the conflict in the Ukraine in 2014.”
by Lisa Rein, Cofounder and Coordinator, Aaron Swartz Day
In memory of Aaron Swartz, whose social, technical, and political insights still touch us daily, Lisa Rein, in partnership with the Internet Archive, will be hosting a weekend of events on Saturday, November 4th and Sunday, November 5th. Friends, collaborators, and hackers can participate in a two-day Hackathon and Aaron Swartz Day Evening Reception.
Schedule of events held at the Internet Archive:
Saturday, November 4th, from 10 am – 6 pm and Sunday, November 5th, from 11am – 5pm — Participate in the hackathon, which will focus on SecureDrop, the whistleblower submission system originally created by Aaron just before he passed away, and other projects inspired by Aaron’s work.
Saturday night, November 4th, from 6:00pm – 9:30pm — Celebrate and remember Aaron, and also the grand tradition of working hard to make the world a better place, at the Aaron Swartz Day Evening Celebration:
Reception: 6:00pm – 7:00pm – Come mingle with the speakers and enjoy nectar, wine & tasty nibbles.
Migrate your way upstairs: 7:10-7:30pm – Finish your nibbles and wine at the reception, exchange contact info, and make your way upstairs to grab a seat to watch the speakers, which will begin promptly at 7:30 pm – a half hour earlier than usual, because we have so many amazing speakers this year.
Speakers 7:30-9:30 pm (Break 8:15-8:30pm)
- Chelsea Manning (Network Security Expert, Former Intelligence Analyst)
- Lisa Rein (Chelsea Manning’s Archivist, Co-founder Creative Commons, Co-founder Aaron Swartz Day)
- Daniel Rigmaiden (Transparency Advocate)
- Barrett Brown (Journalist, Activist, Founder of the Pursuance Project) (via SKYPE)
- Jason Leopold (Senior Investigative Reporter, Buzzfeed News)
- Jennifer Helsby (Lead Developer, SecureDrop, Freedom of the Press Foundation)
- Cindy Cohn (Executive Director, Electronic Frontier Foundation)
- Gabriella Coleman (Hacker Anthropologist, Author, Researcher, Educator)
- Caroline Sinders (Designer/Researcher, Wikimedia Foundation, Creative Dissent Fellow, YBCA)
- Brewster Kahle (Co-founder and Digital Librarian, Internet Archive, Co-founder Aaron Swartz Day)
- Steve Phillips (Project Manager, Pursuance)
- Mek Karpeles (Citizen of the World, Internet Archive)
- Brenton Cheng (Senior Engineer, Open Library, Internet Archive)
Saturday, November 4th, 2017
10:00 am Hackathon
6:00 pm Reception
7:30 pm Program
Sunday, November 5th, 2017
11:00 am Hackathon
300 Funston Ave.
San Francisco, CA 94118
For more information, contact:
by Nancy Watzman & Katie Dahl
With the turn of a dial, some flashing lights, and the requisite puff of fog, emcees Tracey Jaquith, TV architect, and Jason Scott, free range archivist, cranked up the Internet Archive 20th Century Time Machine on stage before a packed house at the Internet Archive’s annual party on October 11.
Eureka! The cardboard contraption worked! The year was 1912, and out stepped Alexis Rossi, director of Media and Access, her hat adorned with a 78 rpm record.
1912
“Close your eyes and listen,” Rossi asked the audience. And then, out of the speakers floated the scratchy sounds of Billy Murray singing “Low Bridge, Everybody Down” written by Thomas S. Allen. From 1898 to the 1950s, some three million recordings of about three minutes each were made on 78rpm discs. But these discs are now brittle, the music stored on them precious. The Internet Archive is working with partners on the Great 78 Project to store these recordings digitally, so that we and future generations can enjoy them and reflect on our music history. New collections include the Tina Argumedo and Lucrecia Hug 78rpm Collection of dance music collected in Argentina in the mid-1930s.
1927
Next to emerge from the Time Machine was David Leonard, president of the Boston Public Library, which was the first free, municipal library founded in the United States. The mission was and remains bold: make knowledge available to everyone. Knowledge shouldn’t be hidden behind paywalls, restricted to the wealthy but rather should operate under the principle of open access as public good, he explained. Leonard announced that the Boston Public Library would join the Internet Archive’s Great 78 Project, by authorizing the transfer of 200,000 individual 78s to digitize for the 78rpm collection, “a collection that otherwise would remain in storage unavailable to anyone.”
Brewster Kahle, founder and digital librarian of the Internet Archive, then came through the time machine to present the Internet Archive Hero Award to Leonard. “I am inspired every time I go through the doors,” said Kahle of the library, noting that the Boston Public Library was the first to digitize not just a presidential library, of John Quincy Adams, but also modern books. Leonard was presented with a tablet imprinted with the Boston Public Library homepage.
1942
Kahle then set the Time Machine to 1942 to explain another new Internet Archive initiative: liberating books published between 1923 and 1941. Working with Elizabeth Townsend Gard, a copyright scholar at Tulane University, the Internet Archive is liberating these books under a little known, and perhaps never used, provision of US copyright law, Section 108h, which allows libraries to scan and make available materials published from 1923 to 1941 if they are not being actively sold. The name of the new collection: the Sonny Bono Memorial Collection, named for the late congressman who led the passage of the Copyright Term Extension Act of 1998, which had the effect of locking up most books from the public domain back to 1923.
One of these books is “Your Life,” a tome written by Kahle’s grandfather, Douglas E. Lurton, a “guide to a desirable living.” “I have one copy of this book and two sons. According to the law, I can’t make one copy and give it to the other son. But now it’s available,” Kahle explained.
1944
With the Time Machine cranked to 1944, out came Rick Prelinger, Internet Archive board president, archivist, and filmmaker. Prelinger introduced a new addition to the Internet Archive’s film collection: long-forgotten footage of an Arkansas Japanese internment camp from 1944. As the film played on the screen, Prelinger welcomed Sab Masada, 87, who lived at this very camp as a 12-year-old.
Masada talked about his experience at the camp and why it is important for people today to remember it: “Since the election I’ve heard echoes of what I heard in 1942. Using fear of terrorism to target the Muslims and people south of the border.”
1972
Next to speak was Wendy Hanamura, the director of partnerships. Hanamura explained how as a sixth grader she discovered a book at the library, Executive Order 9066, published in 1972, which told the tale of Japanese internment camps during World War II.
“Before I was an internet archivist, I was a daughter and granddaughter of American citizens who were locked up behind barbed wires,” said Hanamura. That one book – now out of print – helped her understand what had happened to her family.
Inspired by making it to the semi-final round of the MacArthur 100&Change initiative with a proposal that provides libraries and learners with free digital access to four million books, the Internet Archive is forging ahead with plans despite not winning the $100 million grant. Among the books the Internet Archive is making available: Executive Order 9066.
1985
When the year display turned to 1985, Jason Scott reappeared on stage, explaining his role as a software curator. New this year to the Internet Archive are collections of early Apple software, he explained, with browser emulation allowing the user to experience just what it was like to fire up a Macintosh computer back in its heyday. This includes a collection of the then wildly popular “HyperCards,” a programmatic tool that enabled users to create programs that linked materials in creative ways, before the rise of the world wide web.
2017
After this tour through the 20th century, the Time Machine was set to present day, 2017. Mark Graham, director of the Wayback Machine, and Vinay Goel, senior data engineer, stepped on stage. Back in 1996, when the Wayback Machine began archiving websites on the still new world wide web, the entire thing amounted to 2.2 terabytes of data. Now the Wayback Machine contains 20 petabytes. Graham explained how the Wayback Machine is preserving tweets, government websites, and other materials that could otherwise vanish. One example: this report from The Rachel Maddow Show, which aired on December 16, 2016, about Michael Flynn, then slated to become national security advisor. Flynn deleted a tweet he had made linking to a falsified story about Hillary Clinton, but the Internet Archive saved it through the Wayback Machine.
Goel took the microphone to announce new improvements to Wayback Machine 2.0 search. Now it’s possible to search for keywords, such as “climate change,” and find not just web pages from a particular time period mentioning these words, but also different format types — such as images, pdfs, or yes, even an old Internet Archive favorite, gifs from the now-defunct GeoCities–including snow globes!
Thanks to all who came out to celebrate with the Internet Archive staff and volunteers, or watched online. Please join our efforts to provide Universal Access to All Knowledge, whatever century it is from.
We are pleased to announce that the Internet Archive and OCLC have agreed to synchronize the metadata describing our digital books with OCLC’s WorldCat. WorldCat is a union catalog that itemizes the collections of thousands of libraries in more than 120 countries that participate in the OCLC global cooperative.
What does this mean for readers?
When the synchronization work is complete, library patrons will be able to discover the Internet Archive’s collection of 2.5 million digitized monographs through the libraries around the world that use OCLC’s bibliographic services. Readers searching for a particular volume will know that a digital version of the book exists in our collection. With just one click, readers will be taken to archive.org to examine and possibly borrow the digital version of that book. In turn, readers who find a digital book at archive.org will be able, with one click, to discover the nearest library where they can borrow the hard copy.
There are additional benefits: in the process of the synchronization, OCLC databases will be enriched with records describing books that may not yet be represented in WorldCat.
“This work strengthens the Archive’s connection to the library community around the world. It advances our goal of universal access by making our collections much more widely discoverable. It will benefit library users around the globe by giving them the opportunity to borrow digital books that might not otherwise be available to them,” said Brewster Kahle, Founder and Digital Librarian of the Internet Archive. “We’re glad to partner with OCLC to make this possible and look forward to other opportunities this synchronization will present.”
“OCLC is always looking for opportunities to work with partners who share goals and objectives that can benefit libraries and library users,” said Chip Nilges, OCLC Vice President, Business Development. “We’re excited to be working with Internet Archive, and to make this valuable content discoverable through WorldCat. This partnership will add value to WorldCat, expand the collections of member libraries, and extend the reach of Internet Archive content to library users everywhere.”
We believe this partnership will be a win-win-win for libraries and for learners around the globe.
Better discovery, richer metadata, more books borrowed and read.
Boston Public Library’s Sound Archives Coming to the Internet Archive for Preservation & Public Access
Today, the Boston Public Library announced the transfer of significant holdings from its Sound Archives Collection to the Internet Archive, which will digitize, preserve and make these recordings accessible to the public. The Boston Public Library (BPL) sound collection includes hundreds of thousands of audio recordings in a variety of historical formats, including wax cylinders, 78 rpms, and LPs. The recordings span many genres, including classical, pop, rock, jazz, and opera, from 78s produced in the early 1900s to LPs from the 1980s. These recordings have never been circulated and were in storage for several decades, uncataloged and inaccessible to the public. Through the collaboration with the Internet Archive, the Boston Public Library’s audio collection can be heard by new audiences of scholars, researchers and music lovers worldwide.
“Through this innovative collaboration, the Internet Archive will bring significant portions of these sound archives online and to life in a way that we couldn’t do alone, and we are thrilled to have this historic collection curated and cared for by our longtime partners for all to enjoy going forward,” said David Leonard, President of the Boston Public Library.
Listening to the 78 rpm recording of “Please Pass the Biscuits, Pappy,” by W. Lee O’Daniel and his Hillbilly Boys from the BPL Sound Archive, what do you hear? Internet Archive Founder, Brewster Kahle, hears part of a soundscape of America in 1938. That’s why he believes Boston Public Library’s transfer is so significant.
“Boston Public Library is once again leading in providing public access to their holdings. Their Sound Archive Collection includes hillbilly music, early brass bands and accordion recordings from the turn of the last century, offering an authentic audio portrait of how America sounded a century ago,” says Brewster Kahle, Internet Archive’s Digital Librarian. “Every time I walk through Boston Public Library’s doors, I’m inspired to read what is carved above it: ‘Free to All.’”
The 78 rpm records from the BPL’s Sound Archives Collection fit into the Internet Archive’s larger initiative called The Great 78 Project. This community effort seeks to digitize all the 78 rpm records ever produced, supporting their preservation, research and discovery. From about 1898 to the 1950s, an estimated 3 million sides were published on 78 rpm discs. While commercially viable recordings will have been restored or remastered onto LPs or CDs, there is significant research value in the remaining artifacts, which include often rare 78 rpm recordings.
“The simple fact of the matter is most audiovisual recordings will be lost,” says George Blood, an internationally renowned expert on audio preservation. “These 78s are disappearing right and left. It is important that we do a good job preserving what we can get to, because there won’t be a second chance.”
The Internet Archive is working with George Blood LP and the IA’s music curator, Bob George of the Archive of Contemporary Music, to discover, transfer, digitize, catalog and preserve these often fragile discs. This team has already digitized more than 35,000 sides. The BPL collection joins more than 20 collections already transferred to the Internet Archive for physical and digital preservation and access. Curated by many volunteer collectors, these collections will be preserved for future generations.
The Internet Archive began working with the Boston Public Library in 2007, and our scanning center is housed at its Central Library in Copley Square. There, as a digital-partner-in-residence, the Internet Archive is scanning bound materials for Boston Public Library, including the John Adams Library, one of the BPL’s Collections of Distinction.
To honor Boston Public Library’s long legacy and pioneering role in making its valuable holdings available to an ever wider public online, we will be awarding the 2017 Internet Archive Hero Award to David Leonard, the President of BPL, at a public celebration tonight at the Internet Archive headquarters in San Francisco.
Structural metadata is the most misunderstood form of metadata. It is widely ignored, even among those who work with metadata. When it is discussed, it gets confused with other things. Even people who understand structural metadata correctly don’t always appreciate its full potential. That’s unfortunate, because structural metadata can make content more powerful. This post takes a deep dive into what structural metadata is, what it does, and how it is changing.
Why should you care about structural metadata? The immediate, self-interested answer is that structural metadata facilitates content reuse, taking content that’s already created to deliver new content. Content reuse is nice for publishers, but it isn’t a big deal for audiences. Audiences don’t care how hard it is for the publisher to create their content. Audiences want content that matches their needs precisely, and that’s easy to use. Structural metadata can help with that too.
Structural metadata matches content with the needs of audiences. Content delivery can evolve beyond creating many variations of content — the current preoccupation of many publishers. Publishers can use structural metadata to deliver more interactive content experiences. Structural metadata will be pivotal in the development of multimodal content, allowing new forms of interaction, such as voice interaction. Well-described chunks of content are like well-described buttons, sliders and other forms of interactive web elements. The only difference is that they are more interesting. They have something to say.
Some of the following material will assume background knowledge about metadata. If you need more context, consult my very approachable book, Metadata Basics for Web Content.
What is Structural Metadata?
Structural metadata is data about the structure of content. In some ways it is not mysterious at all. Every time you write a paragraph, and enclose it within a <p> paragraph element, you’ve created some structural metadata. But structural metadata entails far more than basic HTML tagging. It gives data to machines on how to deliver the content to audiences. When structural metadata is considered as a fancy name for HTML tagging, much of its potency gets missed.
The concept of structural metadata originated in the library and records management field around 20 years ago. To understand where structural metadata is heading, it pays to look at how it has been defined already.
In 1996, a metadata initiative known as the Warwick Framework first identified structural metadata as “data defining the logical components of complex or compound objects and how to access those components.”
In 2001, a group of archivists, who need to keep track of the relationships between different items of content, came up with a succinct definition: “Structural metadata can be thought of as the glue that binds compound objects together.”
By 2004, the National Information Standards Organization (NISO) was talking about structural metadata in their standards. According to their definition in the z39.18 standard, “Structural metadata explain the relationship between parts of multipart objects and enhance internal navigation. Such metadata include a table of contents or list of figures and tables.”
Louis Rosenfeld and Peter Morville introduced the concept of structural metadata to the web community in their popular book, Information Architecture for the World Wide Web (the “Polar Bear” book). Rosenfeld and Morville use the structural metadata concept as a prompt to define the information architecture of a website:
“Describe the information hierarchy of this object. Is there a title? Are there discrete sections or chunks of content? Might users want to independently access these chunks?”
A big theme of all these definitions is the value of breaking content into parts. The bigger the content, the more it needs breaking down. The structural metadata for a book relates to its components: the table of contents, the chapters, parts, index and so on. It helps us understand what kinds of material are within the book, and to access specific sections of the book, even if it doesn’t tell us all the specific things the book discusses. This is important information, which, surprisingly, wasn’t captured when Google undertook their massive book digitization initiative a number of years ago. When the books were scanned, entire books became one big file, like a PDF. To find a specific figure or table within a book on Google Books requires searching or scrolling to navigate through the book.
The contents of scanned books in Google Books lack structural metadata, limiting the value of the content.
Navigation is an important purpose of structural metadata: to access specific content, such as a specific book chapter. But structural metadata has an even more important purpose than making big content more manageable. It can unbundle the content, so that the content doesn’t need to stay together. People don’t want to start with the whole book and then navigate through it to get to a small part in which they are interested. They want only that part.
In his recent book Metadata, Richard Gartner touches on a more current role for structural metadata: “it defines structures that bring together simpler components into something larger that has meaning to a user.” He adds that such information “builds links between small pieces of data to assemble them into a more complex object.”
In web content, structural metadata plays an important role in assembling content. When content is unbundled, it can be rebundled in various ways. Structural metadata identifies the components within content types. It indicates the role of the content, such as whether the content is an introduction or a summary.
Structural metadata plays a different role today than it did in the past, when the assumption was that there was one fixed piece of large content that would be broken into smaller parts, identified by structural metadata. Today, we may compose many larger content items, leveraging structural metadata, from smaller parts.
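To make the idea of composing larger items from smaller parts concrete, here is a minimal Python sketch. The Component class, the role names, and the sample text are all hypothetical, invented for illustration rather than taken from any standard; the point is only that a machine can select and order chunks by their structural role without knowing what the chunks say.

```python
from dataclasses import dataclass


@dataclass
class Component:
    role: str  # structural metadata: the part this chunk plays (hypothetical role names)
    text: str  # the content itself


# Hypothetical reusable chunks of content.
components = [
    Component("summary", "Structural metadata describes how content fits together."),
    Component("introduction", "Metadata comes in several kinds."),
    Component("procedure", "Step 1: identify the parts."),
]


def assemble(parts, wanted_roles):
    """Return the text of the chunks whose role is wanted, in the order of wanted_roles."""
    by_role = {part.role: part for part in parts}
    return [by_role[role].text for role in wanted_roles if role in by_role]


# The same chunks can be rebundled into different content items.
article = assemble(components, ["introduction", "procedure", "summary"])
teaser = assemble(components, ["summary"])
```

Here the full article and the short teaser are built from the same pool of components; only the list of requested roles changes.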
The idea of assembling content from smaller parts has been promoted in particular by DITA evangelists such as Ann Rockley (DITA is a widely used framework for technical documentation). Rockley uses the phrase “semantic structures” to refer to structural metadata, which she says “enable(s) us to understand ‘what’ types of content are contained within the documents and other content types we create.” Rockley’s discussion helpfully makes reference to content types, which some other definitions don’t explicitly mention. She also introduces another concept with a similar sounding name, “semantically rich” content, to refer to a different kind of metadata: descriptive metadata. In XML (which is used to represent DITA), the term semantic is used generically for any element. Yet the difference between structural and descriptive metadata is significant, though it is often obscured, especially in the XML syntax.
Curiously, semantic web developments haven’t focused much on structural metadata for content (though I see a few indications that this is starting to change). Never assume that when someone talks about making content semantic, they are talking about adding structural metadata.
Don’t Confuse Structural and Descriptive Metadata
When information professionals refer to metadata, most often they are talking about descriptive metadata concerning people, places, things, and events. Descriptive metadata indicates the key information included within the content. It typically describes the subject matter of the content, and is sometimes detailed and extensive. It helps one discover what the content is about, prior to viewing the content. Traditionally, descriptive metadata was about creating an external index (a proxy), such as assigning keywords or subject headings about the content. Over the past 20 years, descriptive metadata has evolved to describing the body of the content in detail, noting entities and their properties.
Richard Gartner refers to descriptive metadata as “finding metadata”: it locates content that contains some specific information. In modern web technology, it means finding values for a specific field (or property). These values are part of the content, rather than separate from it. For example, find smartphones with dual SIMs that are under $400. The attributes of SIM capacity and price are descriptive metadata related to the content describing the smartphones.
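The smartphone query can be sketched in a few lines of Python. The records, field names (sim_slots, price_usd), and prices below are made up for illustration; the sketch only shows that “finding metadata” amounts to filtering on the values of named properties.

```python
# Hypothetical product records; each key is a descriptive property of the item.
phones = [
    {"name": "Phone A", "sim_slots": 2, "price_usd": 349},
    {"name": "Phone B", "sim_slots": 1, "price_usd": 299},
    {"name": "Phone C", "sim_slots": 2, "price_usd": 499},
]


def find_phones(items, min_sims, max_price):
    """Return names of items with at least min_sims SIM slots priced under max_price."""
    return [
        item["name"]
        for item in items
        if item["sim_slots"] >= min_sims and item["price_usd"] < max_price
    ]


# "Find smartphones with dual SIMs that are under $400."
matches = find_phones(phones, min_sims=2, max_price=400)
```

Nothing in this query depends on how the content about each phone is structured; it relies only on descriptive properties.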
Structural metadata indicates how people and machines can use the content. If people see a link indicating a slideshow, they have an expectation of how such content will behave, and will decide if that’s the sort of content they are interested in. If a machine sees that the content is a table, it uses that knowledge to format the content appropriately on a smartphone, so that all the columns are visible. Machines rely extensively on structural metadata when stitching together different content components into a larger content item.
Structural and descriptive metadata can be indicated in the same HTML tag. This tag indicates the start of an introductory section discussing Albert Einstein.
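As a rough illustration of how one tag can carry both kinds of metadata, the sketch below runs Python’s standard html.parser over a hypothetical tag. The class value and the data-about attribute are assumptions chosen for the example, not a published vocabulary; the element name and class stand in for the structural part, and the attribute naming the subject stands in for the descriptive part.

```python
from html.parser import HTMLParser


class TagInspector(HTMLParser):
    """Collect every start tag with its attributes as (tag, dict) pairs."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.found.append((tag, dict(attrs)))


# Hypothetical markup: an introductory section about Albert Einstein.
snippet = '<section class="introduction" data-about="Albert Einstein"><p>...</p></section>'

inspector = TagInspector()
inspector.feed(snippet)

tag, attrs = inspector.found[0]
structural = (tag, attrs.get("class"))  # how the chunk fits into the content
descriptive = attrs.get("data-about")   # what the chunk is about
```

A machine assembling a page would act on the structural pair, while a search or recommendation feature would act on the descriptive value.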
Structural metadata sometimes is confused with descriptive metadata because many people use vague terms such as “structure” and “semantics” when discussing content. Some people erroneously believe that structuring content makes the content “semantic”. Part of this confusion derives from having an XML-orientation toward content. XML tags content with angle-bracketed elements. But XML elements can be either structures such as sections, or they can be descriptions such as names. Unlike HTML, where elements signify content structure while descriptions are indicated in attributes, the XML syntax creates a monster hierarchical tree, where content with all kinds of roles is nested within elements. The motley, unpredictable use of elements in XML is a major reason it is unpopular with developers, who have trouble seeing what role different parts of the content have.
The buzzword “semantically structured content” is particularly unhelpful, as it conflates two different ideas: semantics, or what content means, with structure, or how content fits together. The semantics of the content is indicated by descriptive metadata, while the structure of the content is indicated by structural metadata. Descriptive metadata can focus on a small detail in the content, such as a name or concept (e.g., here’s a mention of the Federal Reserve Board chair in this article). Structural metadata, in contrast, generally focuses on a bigger chunk of content: here’s a table, here’s a sidebar. To assemble content, machines need to distinguish what the specific content means from what the structure of the content means.
Interest in content modeling has grown recently, spurred by the desire to reuse content in different contexts. Unfortunately, most content models I’ve seen don’t address metadata at all; they just assume that the content can be pieced together. The models almost never distinguish between the properties of different entities (descriptive metadata), and the properties of different content types (structural metadata). This can lead to confusion. For example, a place has an address, and that address can be used in many kinds of content. You may have specific content types dedicated to discussing places (perhaps tourist destinations) and want to include address information. Alternatively, you may need to include the address information in content types that are focused on other purposes, such as a membership list. Unless you make a clear distinction in the content model between what’s descriptive metadata about entities, and what’s structural metadata about content types, many people will be inclined to think there is a one-to-one correspondence between entities and content types, for example, that all addresses belong to the content type discussing tourist destinations.
Structural metadata isn’t merely a technical issue to hand off to a developer. Everyone on a content team who is involved with defining what content gets delivered to audiences needs to jointly define what structural metadata to include in the content.
Three More Reasons Structural Metadata Gets Ignored…
Content strategists have inherited frameworks for working with metadata from librarians, database experts and developers. None of those roles involves creating content, and their perspective of content is an external one, rather than an internal one. These hand-me-down concepts don’t fit the needs of online content creators and publishers very well. It’s important not to be misled by legacy ideas about structural metadata that were developed by people who aren’t content creators and publishers. Structural metadata gets sidelined when people fail to focus on the value that content parts can contribute in different scenarios.
Reason 1: Focus on Whole Object Metadata
Librarians have given little attention to structural metadata, because they’ve been most concerned with cataloging and locating things that have well defined boundaries, such as books and articles (and most recently, webpages). Discussion of structural metadata in library science literature is sparse compared with discussions of descriptive and administrative metadata.
Until recently, structural metadata has focused on identifying parts within a whole. Metadata specialists assumed that a complete content item existed (a book or document), and that structural metadata would be used to locate parts within the content. Specifying structural metadata was part of cataloging existing materials. But given the availability of free text searching and more recently natural language processing, many developers question the necessity of adding metadata to sub-divide a document. Coding structural metadata seemed like a luxury, and got ignored.
In today’s web, content exists as fragments that can be assembled in various ways. A document or other content type is a virtual construct, awaiting components. The structural metadata forms part of the plan for how the content can fit together. It’s important to define the pieces first.
Reason 2: Confusion with Metadata Schemas
I’ve recently seen several cases where content strategists and others mix up the concept of structural metadata, with the concept of metadata structure, better known as metadata schemas. At first I thought this confusion was simply the result of similar sounding terms. But I’ve come to realize that some database experts refer to structural metadata in a different way than it is being used by librarians, information architects, and content engineers. Some content strategists seem to have picked up this alternative meaning, and repeat it.
Compared to semi-structured web content, databases are highly regular in structure. They are composed of tables of rows and columns. The first column of a row typically identifies what the values relate to. Some database admins refer to those keys or properties as the structure of the data, or the structural metadata. For example, the OECD, the international statistical organization, says: “Structural metadata refers to metadata that act as identifiers and descriptors of the data. Structural metadata are needed to identify, use, and process data matrixes and data cubes.” What is actually being referred to is the schema of the data table.
Database architects develop many custom schemas to organize their data in tables. Those schemas are very different from the standards-based structural metadata used in content. Database tables provide little guidance on how content should be structured. Content teams shouldn’t rely on a database expert to guide them on how to structure their content.
Reason 3: Treated as Ordinary Code
Web content management systems are essentially big databases built in programming languages like PHP or .NET. There’s a proclivity among developers to treat chunks of content as custom variables. As one developer noted when discussing WordPress: “In WordPress (WP), the meaning of Metadata is a bit fuzzier. It stores post metadata such as custom fields and additional metadata added via plugins.”
As I’ve noted elsewhere, many IT systems that manage content ignore web metadata standards, resulting in silos of content that can’t work together. It’s not acceptable to define chunks of content as custom variables. The purpose of structural metadata is to allow different chunks of content to connect with each other. CMSs need to rely on web standards for their structural metadata.
Current Practices for Structural Metadata
For machines to piece together content components into a coherent whole, they need to know the standards for the structural metadata.
Until recently, structural metadata has been indicated only during the prepublication phase, an internal operation where standards were less important. Structural metadata was marked up in XML together with other kinds of metadata, and transformed into HTML or PDF. Yet a study in the journal Semantic Web last year noted: “Unfortunately, the number of distinct vocabularies adopted by publishers to describe these requirements is quite large, expressed in bespoke document type definitions (DTDs). There is thus a need to integrate these different languages into a single, unifying framework that may be used for all content.”
XML continues to be used in many situations. But a recent trend has been to adopt more lightweight approaches, using HTML, to publish content directly. Bypassing XML is often simpler, though the plainness of HTML creates some issues as well.
As Jeff Eaton has noted, getting specific about the structure of content using HTML elements is not always easy:
“We have workhorse elements like ul, div, and span; precision tools like cite, table, and figure; and new HTML5 container elements like section, aside, and nav. But unless our content is really as simple as an unattributed block quote or a floated image, we still need layers of nested elements and CSS classes to capture what we really mean.”
Because HTML elements are not very specific, publishers often don’t know how to represent structural metadata within HTML. We can learn from the experience of publishers who have used XML to indicate structure, and who are adapting their structures to HTML.
Scientific research and technical documentation are two genres where content structure is well established and structural metadata is mature. Both genres have explored how to indicate the structure of their content in HTML.
Scientific research papers are a distinct content type that follows a regular pattern. The National Library of Medicine’s Journal Article Tag Suite (JATS) formalizes the research paper structure into a content type as an XML schema. It provides a mixture of structural and descriptive metadata tags that are used to publish biomedical and other scientific research. The structure might look like:
<sec sec-type="intro">
<sec sec-type="materials|methods">
<sec sec-type="results">
<sec sec-type="discussion">
<sec sec-type="conclusions">
<sec sec-type="supplementary-material" ... >
Scholarly HTML is an initiative to translate the typical sections of a research paper into common HTML. It uses HTML elements, and supplements them with typeof attributes to indicate more specifically the role of each section. Here’s an example of some attribute values in their namespace, noted by the prefix “sa”:
<section typeof="sa:MaterialsAndMethods">
<section typeof="sa:Results">
<section typeof="sa:Conclusion">
<section typeof="sa:Acknowledgements">
<section typeof="sa:ReferenceList">
As we can see, these sections overlap with those in JATS, since both describe similar content structures. The Scholarly HTML initiative is still under development, and it could eventually become a part of the schema.org effort.
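To make the idea concrete, here is a minimal sketch of how a program might read the section roles out of a Scholarly HTML page. It uses only Python's standard html.parser module. The sample markup and the SectionTypeCollector class are illustrative assumptions, not part of the Scholarly HTML initiative itself.

```python
from html.parser import HTMLParser


class SectionTypeCollector(HTMLParser):
    """Collect the typeof value of every <section> element, in document order."""

    def __init__(self):
        super().__init__()
        self.section_types = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "section":
            typeof = dict(attrs).get("typeof")
            if typeof:
                self.section_types.append(typeof)


# Hypothetical page fragment using the "sa:" values quoted above.
SAMPLE = """
<article>
  <section typeof="sa:MaterialsAndMethods"><h2>Methods</h2></section>
  <section typeof="sa:Results"><h2>Results</h2></section>
  <section typeof="sa:Conclusion"><h2>Conclusion</h2></section>
</article>
"""

collector = SectionTypeCollector()
collector.feed(SAMPLE)
print(collector.section_types)
```

Because the roles live in a shared attribute rather than in private class names, the same few lines would work on any page that follows the convention.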
DITA — the technical documentation architecture mentioned earlier — is a structural metadata framework that embeds some descriptive metadata. DITA structures topics, which can be different information types: Task, Concept, Reference, Glossary Entry, or Troubleshooting, for example. Each type is broken into structural elements, such as title, short description, prolog, body, and related links. DITA is defined in XML, and uses many idiosyncratic tags.
HDITA is a draft syntax to express DITA in HTML. It converts DITA-specific elements into HTML attributes, using the custom data-* attribute. For example, a “key definition” element <keydef> becomes an attribute within an HTML element, e.g. <div data-hd-class="keydef">. Types are expressed with the attribute data-hd-type.
The HTML5 specification, however, says of custom data-* attributes:
“These attributes are not intended for use by software that is not known to the administrators of the site that uses the attributes. For generic extensions that are to be used by multiple independent tools, either this specification should be extended to provide the feature explicitly, or a technology like microdata should be used (with a standardized vocabulary).”
The HDITA drafting committee appears to use “hd” in the data attribute to signify that the attribute is specific to HDITA. But they have not declared a namespace for these attributes (the XML namespace for DITA is xmlns:ditaarch). This will prevent automatic machine discovery of the metadata by Google or other parties.
The Future of Structural Metadata
Most recently, several initiatives have explored possibilities for extending structural metadata in HTML. These revolve around three distinct approaches:
- Formalizing structural metadata as properties
- Using WAI-ARIA to indicate structure
- Combining class attributes with other metadata schemas
The web standards community is starting to show more interest in structural metadata. Earlier this year, the W3C released the Web Annotation Vocabulary. It provides properties to indicate comments about content. Comments are an important structure in web content that are used in many genres and scenarios. Imagine that readers may be highlighting passages of text. For such annotations to be captured, there must be a way to indicate what part of the text is being referenced. The annotation vocabulary can reference specific HTML elements and even CSS selectors within a body of text.
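As a rough sketch of what such an annotation could look like, the snippet below builds one as JSON-LD following the W3C Web Annotation Data Model, with a CssSelector pointing at the passage being commented on. The page URL, the selector string, and the make_annotation helper are hypothetical and are there only for illustration.

```python
import json


def make_annotation(page_url, css_selector, comment):
    """Build a Web Annotation whose target is one part of a page."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        # The reader's comment.
        "body": {"type": "TextualBody", "value": comment, "format": "text/plain"},
        # Which part of which page the comment refers to.
        "target": {
            "source": page_url,
            "selector": {"type": "CssSelector", "value": css_selector},
        },
    }


annotation = make_annotation(
    "https://example.org/article.html",      # hypothetical page
    "section.summary > p:first-of-type",     # hypothetical selector
    "This passage states the main claim.",
)
print(json.dumps(annotation, indent=2))
```

The selector does the structural work here: without a reliable way to name the part of the page, the comment would have nothing to attach to.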
Outside of the W3C, a European academic group has developed the Document Components Ontology (DoCO), “a general-purpose structured vocabulary of document elements.” It is a detailed set of properties for describing common structural features of text content. The DoCO vocabulary can be used by anyone, though its initial adoption will likely be limited to research-oriented publishers. However, many specialized vocabularies such as this one have become extensions to schema.org. If DoCO were in some form absorbed by schema.org, its usage would increase dramatically.
[Diagram showing document components ontology]
WAI-ARIA
WAI-ARIA is commonly thought of as a means to make functionality accessible. However, it should be considered more broadly as a means to enhance the functionality of web content overall, since it helps web agents understand the intentions of the content. WAI-ARIA can indicate many dynamic content structures, such as alerts, feeds, marquees, and regions.
The new Digital Publishing WAI-ARIA developed out of the ePub standards, which have a richer set of structural metadata than is available in standard HTML5. The goal of the Digital Publishing WAI-ARIA is to “produce structural semantic extensions to accommodate the digital publishing industry”. It has the following structural attributes:
To indicate the structure of a text box showing an example:
<aside role="doc-example">
<h1>An Example of Structural Metadata in WAI-ARIA</h1>
…
</aside>
Content expressing a warning might look like this:
<div role="doc-notice" aria-label="Explosion Risk">
<p><em>Danger!</em> Mixing reactive materials may cause an explosion.</p>
</div>
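Here is a small sketch of how a web agent might pick out these structures. It scans markup for role values beginning with "doc-" and records the element, the role, and any aria-label. The DocRoleCollector class is a hypothetical helper written for this example, and the page fragment simply reuses the two snippets above.

```python
from html.parser import HTMLParser


class DocRoleCollector(HTMLParser):
    """Record (tag, role, aria-label) for every element with a doc-* role."""

    def __init__(self):
        super().__init__()
        self.structures = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        role = attributes.get("role") or ""
        if role.startswith("doc-"):
            self.structures.append((tag, role, attributes.get("aria-label")))


PAGE = """
<aside role="doc-example">
  <h1>An Example of Structural Metadata in WAI-ARIA</h1>
</aside>
<div role="doc-notice" aria-label="Explosion Risk">
  <p><em>Danger!</em> Mixing reactive materials may cause an explosion.</p>
</div>
"""

doc_roles = DocRoleCollector()
doc_roles.feed(PAGE)
for tag, role, label in doc_roles.structures:
    print(tag, role, label)
```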
Although book-focused, DOC-ARIA roles provide a rich set of structural elements that can be used with many kinds of content. In combination with the core WAI-ARIA, these attributes can describe the structure of web content in extensive detail.
CSS as Structure
For a long while, developers have been creating pseudo structures using CSS, such as making infoboxes to enclose certain information. Class is a global attribute of HTML, but has become closely associated with CSS, so much so that some believe that is its only purpose. Yet Wikipedia notes: “The class attribute provides a way of classifying similar elements. This can be used for semantic purposes, or for presentation purposes.” Some developers use what are called “semantic classes” to indicate what content is about. The W3C advises when using the class attribute: “authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.”
Some developers claim that the class attribute should never be used to indicate the meaning of content within an element, because HTML elements will always make that clear. I agree that web content should never use the class attribute as a substitute for using a meaningful HTML element. But the class attribute can sometimes further refine the meaning of an HTML element. Its chief limitation is that class names involve private meanings. Yet if they are self-describing they can be useful.
Class attributes are useful for selecting content, but they operate outside of metadata standards. However, schema.org is proposing a property that will allow class values to be specified within schema.org metadata. This has potentially significant implications for extending the scope of structural metadata.
The motivating use case is as follows: “There is a need for authors and publishers to be able to easily call out portions of a Web page that are particularly appropriate for reading out aloud. Such read-aloud functionality may vary from speaking a short title and summary, to speaking a few key sections of a page; in some cases, it may amount to speaking most non-visual content on the page.”
The pending cssSelector property in schema.org can identify named portions of a web page. The class could be a structure such as a summary or a headline that would be more specific than an HTML element. The cssSelector has a companion property called xpath, which identifies HTML elements positionally, such as the paragraphs after h2 headings.
These features are not yet fully defined. In addition to indicating speakable content, the cssSelector can indicate parts of a web page. According to a GitHub discussion: “The ‘cssSelector’ (and ‘xpath’) property would be particularly useful on http://schema.org/WebPageElement to indicate the part(s) of a page matching the selector / xpath. Note that this isn’t ‘element’ in some formal XML sense, and that the selector might match multiple XML/HTML elements if it is a CSS class selector.” This could be useful for selecting content targeted at specific devices.
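Since the property is still pending, the following is only a sketch of one plausible shape: a WebPage whose parts are WebPageElement items, each identified by a cssSelector. Grouping the parts under hasPart, along with the class names and the page URL, are assumptions made for the example rather than anything schema.org has settled.

```python
import json


def page_parts(page_url, selectors_by_name):
    """Describe named parts of a page as schema.org WebPageElement items."""
    return {
        "@context": "https://schema.org",
        "@type": "WebPage",
        "url": page_url,
        # Assumption: parts are attached with hasPart; the pending
        # proposal does not fix how they must be grouped.
        "hasPart": [
            {"@type": "WebPageElement", "name": name, "cssSelector": selector}
            for name, selector in selectors_by_name.items()
        ],
    }


markup = page_parts(
    "https://example.org/article.html",  # hypothetical page
    {"headline": ".headline", "summary": ".summary"},
)
print(json.dumps(markup, indent=2))
```

A voice assistant reading this markup could then choose to speak only the elements matching .headline and .summary.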
The class attribute can identify structures within the web content, working together with entity-focused properties that describe specific data relating to the content. Both of these indicate content variables, but they deliver different benefits.
Entity-based (descriptive) metadata can be used for content variables about specific information. They will often serve as text or numeric variables. Use descriptive metadata variables when choosing what informational details to put in a message.
Structural metadata can be used for phrase-based variables, indicating reusable components. Phrases can be either blocks (paragraphs or divs), or snippets (a span). Use structural metadata variables when choosing the wording to convey a message in a given scenario.
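The two kinds of variable can be pictured with a toy example. In the sketch below, the structural variable selects which phrase to use for a given scenario, and the descriptive variable supplies the detail dropped into it. The PHRASES table, the scenario names, and the compose function are all invented for illustration.

```python
from string import Template

# Structural variables: reusable phrases, keyed by (component, scenario).
PHRASES = {
    ("warning", "screen"): Template("Danger! Mixing $materials may cause an explosion."),
    ("warning", "voice"): Template("Warning. Do not mix $materials. They may explode."),
}

# Descriptive variables: the specific details a message refers to.
DETAILS = {"materials": "reactive materials"}


def compose(component, scenario, details):
    """Pick the wording for the scenario, then fill in the details."""
    return PHRASES[(component, scenario)].substitute(details)


message = compose("warning", "voice", DETAILS)
print(message)
```

Changing the scenario changes the wording, while changing the details changes the facts conveyed. The two can vary independently.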
A final interesting point about cssSelectors in schema.org: like other properties in schema.org, they can be expressed either as inline markup in HTML (microdata) or as an external JSON-LD script. This gives developers the flexibility to choose whether to use coding libraries that are optimized for arrays (JSON-flavored) or ones that focus on selectors. For too long, what metadata gets included has been influenced by developer preferences in coding libraries. The fact that CSS attributes can be expressed as JSON suggests that hurdle is being transcended.
Conclusion
Structural metadata is finally getting some love in the standards community, even though awareness of it remains low among developers. I hope that content teams will consider how they can use structural metadata to be more precise in indicating what their content does, so that it can be used flexibly in emerging scenarios such as voice interactions.
— Michael Andrews
The Internet Archive is now leveraging a little known, and maybe never used, provision of US copyright law, Section 108h, which allows libraries to scan and make available materials published 1923 to 1941 if they are not being actively sold. Elizabeth Townsend Gard, a copyright scholar at Tulane University, calls this “Library Public Domain.” She and her students helped make the first scanned books of this era available online in a collection named for the author of the bill making this necessary: The Sonny Bono Memorial Collection. Thousands more books will be added in the near future as we automate. We hope this will encourage libraries that have been reluctant to scan beyond 1923 to start mass scanning their books and other works, at least up to 1942.
While good news, it is too bad it is necessary to use this provision.
If the Founding Fathers had their way, almost all works from the 20th century would be public domain by now (14-year copyright term, renewable once if you took extra actions).
Some corporations saw adding works to the public domain to be a problem, and when Sonny Bono got elected to the House of Representatives, representing part of Los Angeles, he helped push through a law extending copyright’s duration another 20 years to keep things locked-up back to 1923. This has been called the Mickey Mouse Protection Act due to one of the motivators behind the law, but it was also a result of Europe extending copyright terms an additional twenty years first. If not for this law, works from 1923 and beyond would have been in the public domain decades ago.
Creative Commons founder Larry Lessig fought the new law in court as unreasonable, unneeded, and ridiculous. In support of Lessig’s fight, the Internet Archive made an Internet bookmobile to celebrate what could be done with the public domain. We drove the bookmobile across the country to the Supreme Court to make books during the hearing of the case. Alas, we lost.
There is, however, an exemption from this extension of copyright, but only for libraries and only for works that are not actively for sale: we can scan them and make them available. Professor Townsend Gard had two legal interns work with the Internet Archive last summer to work out how we can automate finding appropriate scanned books that could be liberated, and they hand-vetted the first books for the collection. Professor Townsend Gard has just released an in-depth paper giving libraries guidance as to how to implement Section 108(h) based on her work with the Archive and other libraries. Together, we have called them “Last Twenty” Collections, as libraries and archives can copy and distribute to the general public qualified works in the last twenty years of their copyright.
Today we announce the “Sonny Bono Memorial Collection” containing the first books to be liberated. Anyone can download, read, and enjoy these works that have been long out of print. We will add another 10,000 books and other works in the near future. “Working with the Internet Archive has allowed us to do the work to make this part of the law usable,” reflected Professor Townsend Gard. “Hopefully, this will be the first of many “Last Twenty” Collections around the country.”
Now is the chance for libraries and citizens who have been reluctant to scan works beyond 1923 to push forward to 1941, and the Internet Archive will host them. “I’ve always said that the silver lining of the unfortunate Eldred v. Ashcroft decision was the response from people to do something, to actively begin to limit the power of the copyright monopoly through action that promoted open access and CC licensing,” says Carrie Russell, Director of ALA’s Program of Public Access to Information. “As a result, the academy and the general public has rediscovered the value of the public domain. The Last Twenty project joins the Internet Archive, the HathiTrust copyright review project, and the Creative Commons in amassing our public domain to further new scholarship, creativity, and learning.”
We thank and congratulate Team Durationator and Professor Townsend Gard for all the hard work that went into making this new collection possible. Professor Townsend Gard, along with her husband, Dr. Ron Gard, has started a company, Limited Times, to assist libraries, archives, and museums implementing Section 108(h), “Last Twenty” collections, and other aspects of the copyright law.
Hundreds of thousands of books can now be liberated. Let’s bring the 20th century to 21st-century citizens. Everyone, rev your cameras!
Limited tickets left for 20th Century Time Machine — the Internet Archive’s Annual Bash – happening this Wednesday at the Internet Archive from 5pm-9:30pm. In case you missed it, here’s our original announcement.
Tickets start at $15 here.
Once tickets sell out, you’ll have the opportunity to join the waitlist. We’ll release tickets as spaces free up and let you know via email.
We’d love to celebrate with you!
- Which recent hurricane got the least amount of attention from TV news broadcasters?
- Thomas Jefferson said, “Government that governs least governs best.”
- Mitch McConnell shows up most on which cable TV news channel?
- Fox News
Answers at end of post.
The Internet Archive’s TV News Archive, our constantly growing online, free library of TV news broadcasts, contains 1.4 million shows, some dating back to 2009, searchable by closed captioning. History is happening, and we preserve how broadcast news filters it to us, the audience, whether it’s through CNN’s Jake Tapper, Fox’s Bill O’Reilly, MSNBC’s Rachel Maddow or others. This archive becomes a rich resource for journalists, academics, and the general public to explore the biases embedded in news coverage and to hold public officials accountable.
Last October we wrote how the Internet Archive’s TV News Archive was “hacking the election,” then 13 days away. In the year since, we’ve been applying our experience using machine learning to track political ads and TV news coverage in the 2016 elections to experiment with new collaborations and tools to create more ways to analyze the news.
Since we launched our Trump Archive in January 2017, and followed in August with the four congressional leaders, Democrat and Republican, as well as key executive branch figures, we’ve collected some 4,534 hours of curated programming and more than 1,300 fact-checks of material on subjects ranging from immigration to the environment to elections.
The 1,340 fact-checks–and counting–represent a subset of the work of partners FactCheck.org, PolitiFact and The Washington Post’s Fact Checker, as we link only to fact-checks that correspond to statements that appear on TV news. Most of the fact-checks–524–come from PolitiFact; 492 are by FactCheck.org, and 324 from The Washington Post’s Fact Checker.
We’re also proud to be part of the Duke Reporter’s Lab’s new Tech & Check collaborative, where we’re working with journalists and computer scientists to develop ways to automate parts of the fact-checking process. For example, we’re creating processes to help identify important factual claims within TV news broadcasts to help guide fact-checkers where to concentrate their efforts. The initiative received $1.2 million from the John S. and James L. Knight Foundation, the Facebook Journalism Project and the Craig Newmark Foundation.
We’re collaborating with data scientists, private companies and nonprofit organizations, journalists, and others to cook up new experiments available in our TV News Kitchen, providing new ways to analyze TV news content and understand ourselves.
Dan Schultz, our senior creative technologist, worked with the start-up Matroid to develop Face-o-Matic, which tracks faces of selected high level elected officials on major TV cable news channels: CNN, Fox News, MSNBC, and BBC News. The underlying data are available for download here. Unlike caption-based searches, Face-o-Matic uses facial recognition algorithms to recognize individuals on TV news screens. It is sensitive enough to catch this tiny, dark image of House Minority Leader Nancy Pelosi, D., Calif., within a graphic, and this quick flash of Senate Minority Leader Chuck Schumer, D., N.Y., and Senate Majority Leader Mitch McConnell, R., Ky.
The work of TV Architect Tracey Jaquith, our Third Eye project scans the lower thirds of TV screens, using OCR, or optical character recognition, to turn these fleeting missives into downloadable data ripe for analysis. Launched in September 2017, Third Eye tracks BBC News, CNN, Fox News, and MSNBC, and collected more than four million chyrons in just over two weeks, and counting.
Vox news reporter Alvin Chang used the Third Eye chyron data to report how Fox News paid less attention to Hurricane Maria’s destruction in Puerto Rico than it did to Hurricanes Irma and Harvey, which battered Florida and Texas. Chang’s work followed a similar piece by Dhrumil Mehta for FiveThirtyEight, which used Television Explorer, a tool developed by data scientist Kalev Leetaru to search and visualize closed captioning on the TV News Archive.
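For a sense of what this kind of analysis involves, here is a minimal sketch that counts how often each storm is named in chyron text, per channel. The rows below are invented placeholders and not real Third Eye data, and the (channel, text) layout is an assumption. The actual download format may differ.

```python
from collections import Counter

# Invented sample rows: (channel, chyron text). Not real Third Eye data.
CHYRONS = [
    ("Fox News", "Hurricane Irma batters Florida"),
    ("Fox News", "Harvey flooding continues in Texas"),
    ("CNN", "Puerto Rico without power after Maria"),
    ("CNN", "Irma and Maria relief efforts compared"),
]

STORMS = ("harvey", "irma", "maria")


def count_mentions(rows):
    """Count chyrons naming each storm, keyed by (channel, storm)."""
    counts = Counter()
    for channel, text in rows:
        lowered = text.lower()
        for storm in STORMS:
            if storm in lowered:
                counts[(channel, storm)] += 1
    return counts


mentions = count_mentions(CHYRONS)
for (channel, storm), n in sorted(mentions.items()):
    print(channel, storm, n)
```

Simple substring matching like this would overcount or undercount on real data (a guest named Harvey, a misspelled OCR capture), so a real analysis would need more careful matching.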
CNN’s Brian Stelter followed up with a similar analysis on “Reliable Sources” October 1.
We’re also working with academics who are using our tools to unlock new insights. For example, Schultz and Jaquith are working with Bryce Dietrich at the University of Iowa to apply the Duplitron, the audio fingerprinting tool that fueled our political ad airing data, to analyze floor speeches of members of Congress. The study identifies which floor speeches were aired on cable news programs and explores the reasons why those particular clips were selected for airing. A draft of the paper was presented at the 2017 Polinfomatics Workshop in Seattle and will begin review for publication in the coming months.
What’s next? Our plans include making more than a million hours of TV news available to researchers from both private and public institutions via a digital public library branch of the Internet Archive’s TV News Archive. These branches would be housed in computing environments, where networked computers provide the processing power needed to analyze large amounts of data. Researchers will be able to conduct their own experiments using machine learning to extract metadata from TV news. Such metadata could include, for example, speaker identification–a way to identify not just when a speaker appears on a screen, but when she or he is talking. Metadata generated through these experiments would then be used to enrich the TV News Archive, so that any member of the public could do increasingly sophisticated searches.
Going global
We live in an interdependent world, but we often lack understanding about how other cultures perceive us. Collecting global TV could open a new window for journalists and researchers seeking to understand how political and policy messages are reported and spread across the globe. The same tools we’ve developed to track political ads, faces, chyrons, and captions can help us put news coverage from around the globe into perspective.
We’re beginning work to expand our TV collection to include more channels from around the globe. We’ve added the BBC and recently began collecting Deutsche Welle from Germany and the English-language Al Jazeera. We’re talking to potential partners and developing strategy about where it’s important to collect TV and how we can do so efficiently.
History is happening, but we’re not just watching. We’re collecting, making it accessible, and working with others to find new ways to understand it. Stay tuned. Email us at firstname.lastname@example.org. Follow us @tvnewsarchive, and subscribe to our weekly newsletter here.
- b. (See: “The Media Really Has Neglected Puerto Rico,” FiveThirtyEight.)
- b. False. (See: Vice President Mike Pence statement and linked PolitiFact fact-check.)
- c. MSNBC. (See: Face-O-Matic blog post.)
Members of the TV News Archive team: Roger Macdonald, director; Robin Chin, Katie Dahl, Tracey Jaquith, Dan Schultz, and Nancy Watzman.
A weekly round up on what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.
In an era when social media algorithms skew what people see online, the Internet Archive TV News Archive’s collections of on-the-record statements by top political figures serve as a powerful model for how preservation can provide a deep resource for who really said what, when, and where.
Since we launched our Trump Archive in January 2017, and followed in August with the four congressional leaders, Democrat and Republican, as well as key executive branch figures, we’ve collected some 4,534 hours of curated programming and more than 1,300 fact-checks of material on subjects ranging from immigration to the environment to elections.
The 1,340 fact-checks–and counting–represent a subset of the work of partners FactCheck.org, PolitiFact and The Washington Post’s Fact Checker, as we link only to fact-checks that correspond to statements that appear on TV news. Most of the fact-checks–524–come from PolitiFact; 492 are by FactCheck.org, and 324 from The Washington Post’s Fact Checker.
As a library, we’re dedicated to providing a record – sometimes literally, as in the case of 78s! – that can help researchers, journalists, and the public find trustworthy sources for our collective history. These clip collections, along with fact-checks, now largely hand-curated, provide a quick way to find public statements made by elected officials.
Given his position at the helm of the government, it is not surprising that Trump garners most of the fact-checking attention. Three out of four, or 1,008 of the fact-checks, focus on Trump’s statements. Another 192 relate to the four congressional leaders: Senate Majority Leader Mitch McConnell, R., Ky.; Senate Minority Leader Chuck Schumer, D., N.Y.; House Speaker Paul Ryan, R., Wis.; and House Minority Leader Nancy Pelosi, D., Calif. We’ve also logged 140 fact-checks related to key administration figures such as Sean Spicer, Jeff Sessions, and Mike Pence.
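The figures quoted in this post can be cross-checked with a few lines of arithmetic. The sketch below simply adds up the numbers as stated above, once by fact-checking partner and once by subject, and computes Trump's share. The variable names are our own.

```python
# Counts as quoted in this post.
by_source = {
    "PolitiFact": 524,
    "FactCheck.org": 492,
    "The Washington Post's Fact Checker": 324,
}
by_subject = {
    "Trump": 1008,
    "congressional leaders": 192,
    "administration figures": 140,
}

total_by_source = sum(by_source.values())
total_by_subject = sum(by_subject.values())
trump_share = by_subject["Trump"] / total_by_subject

print(total_by_source, total_by_subject, f"{trump_share:.1%}")
```

Both breakdowns add up to the same 1,340 total, and Trump's share comes to roughly three quarters, matching the "three out of four" description.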
The topics covered by fact-checkers run the gamut of national and global policy issues, history, and everything in between. For example, the debate on tax reform is grounded with fact-checks of the historical and global context posited by the president. Fact-checkers have also examined his aides’ claims on the impact of the current reform proposal on the wealthy and on the deficit. They’ve also followed the claims made by House Speaker Paul Ryan, R., Wis., the leading GOP policy voice on tax reform.
Another large set of fact-checks covers health care, going back as far as this claim made in 2010 by Pelosi about job creation under healthcare reform (PolitiFact rated it “Half True.”) The most recent example is the Graham-Cassidy bill that aimed to repeal much of Obamacare. One of the most sharply contested debates about that legislation was whether or not it would require coverage of people with pre-existing conditions. Fact-checkers parsed the he-said he-said debate as it unfolded on TV news, for example examining dueling claims by Schumer and Trump.
The collection of Trump fact-checks includes a few dating back to 2011, long before his successful presidential campaign. Here he is at the CPAC conference that year claiming no one remembered now-former President Barack Obama from school, part of his campaign to question Obama’s citizenship. (PolitiFact rated: “Pants on Fire!”) And here he is with what FactCheck.org called a “100 percent wrong” claim about the Egyptian people voting to overturn a treaty with Israel.
This fact-check of McConnell dates back to 2009, when PolitiFact rated “false” his claim of how much federal spending occurred under Obama’s watch: “In just one month, the Democrats have spent more than President Bush spent in seven years on the war in Iraq, the war in Afghanistan and Hurricane Katrina combined.”
Meanwhile, this 2010 statement by Schumer, rated “mostly false” by PolitiFact, asserted that the U.S. Supreme Court “decided to overrule the 100-year-old ban on corporate expenditures.” The ban on giving directly to candidates is still in place; however, corporations are free to spend unlimited funds on elections providing they do so separate from a candidate’s official campaign.
The repetition
Twenty-four million people will be forced off their health insurance, young farmers have to sell the farm to pay estate tax, NATO members owe the United States money, millions of women turn to Planned Parenthood for mammograms, and sanctuary cities lead to higher crime. These are all examples of claims found to be inaccurate or misleading, but that continued or continue to be repeated by public officials.
The unexpected
Whether you lean one political direction or another, there are always surprises from the fact-checkers that can keep all our assumptions in check. For example, if you’re opposed to building a wall on the southern border to keep people from crossing into the U.S., you might guess Trump’s claim that people use catapults to toss drugs over current walls is an exaggeration. In fact, that statement was rated “mostly true” by PolitiFact. Or if you’re conservative, you might be surprised to learn an often repeated quote ascribed to Thomas Jefferson, in this case by Vice President Mike Pence, is in fact falsely attributed to him.
How to find
If you’re looking for the most recent TV news statements with fact-checks, you can see the latest offerings on the TV News Archive’s homepage by scrolling down.
You can review whole speeches, scanning for just the fact-checked claims by looking for the fact-check icon on a program timeline. For example, starting in the Trump Archive, you can choose a speech or interview and see if and how many of the statements were checked by reporters.
You can also find the fact-checks in the growing table, also available to download, which includes details on the official making the claim, the topic(s) covered, the url for the corresponding TV news clip, and the link to the fact-checking article.
To receive the TV News Archive’s email newsletter, subscribe here.