Library of Congress to collect Twitter data

Tuesday - 4/20/2010, 8:00pm EDT

Matt Raymond, director of communications, Library of Congress

By Dorothy Ramienski
Internet Editor
Federal News Radio

The Library of Congress is getting a new collection -- Twitter accounts.

Twitter recently announced that it will donate its digital archive of public tweets to the Library of Congress (LOC).

Matt Raymond, director of communications at the Library of Congress, explained why the library is interested in this type of information.

"The Library of Congress is the world's largest storehouse of knowledge and information under one roof. We've been collecting universally for about 200 years, and when I say that I mean on every conceivable topic in . . . about 470 languages. We have a particular interest in history and culture and this fits very much right in with that."

He explained that, on the surface, many tweets seem mundane or unimportant; however, there is also a lot of historical and relevant information on Twitter.

"In fact, in many ways, Twitter has affected history -- such as the protests in Iran, or [the] journalist who was kidnapped in Egypt -- taken as a prisoner -- and was later released because of what ended up on the Twitter-sphere. So, there is historical relevance [and] a great deal that can be learned about how people viewed society at a particular time in history."

Raymond said the Library of Congress regards tweets in the same way that it regards letters from soldiers during the Civil War.

"These are firsthand accounts of history. It's important that we document these because websites do come and go -- I'm not saying that Twitter might come or go -- but there is a large loss of data on the Internet that I don't think people think about. This is about digital preservation."

Reactions to the project have been mixed. Raymond said he used Twitter on the day the Library of Congress announced that it would start collecting tweets. The reaction was largely positive, but some did express reservations about privacy.

"I would point out that people agree to terms of service that say anything that they submit or post [to Twitter] will be viewed by other users and third-party services and websites, and that people need to ensure that they're comfortable with what they're sharing under these terms. People have always had the option of setting their feeds as private and they can pass direct messages from user to user privately. These will not become part of the archive. We have an impeccable record of data storage, retrieval and security here, and we are going to continue that with this collection."

This is not a new trend, either. The Library of Congress has been collecting digital material for years now. Raymond said books make up one-fourth of the overall collection, which also includes prints, photographs, sound recordings, movies and even comic books.

"There are so many things that people might not associate with the Library . . . so we've been collecting in many different formats. Within the past 10 years, we have had a mandate from Congress to preserve what we can that is 'born digital' -- websites. Actually, we've also been leading a national partnership called 'The National Digital Information Infrastructure and Preservation Program', which works with 130 partners to preserve other kinds of data, such as geospatial data, state government records, public broadcasting and the like."

The LOC has archived 167 terabytes of born-digital content so far, primarily through its web capture program. Raymond said the Twitter archive alone will be about five terabytes.

The project has brought up some interesting questions at the LOC. Raymond said the digital age itself has brought up discussions about access to data.

"Our traditional model has been -- you come into the Library, you go into a reading room, you request a book or another item and it's delivered to you. We are not a circulating library, but because of the collections we have, a lot of times we're viewed as the library of last resort, so we have to maintain the collections here. Obviously, with digital collections, there are many different considerations to take into account. The fact that they are very easily copied. . . . They are definitely a beast of another nature and that's something that we're trying to deal with. How do you serve these up? What's the user interface? How easily can researchers slice and dice that data?"

The Library of Congress is currently working with a variety of organizations to answer those questions.

Raymond said that, overall, the LOC wants to get the word out about its collection and make sure everyone understands that it is a public institution. "Yes, we are Congress' library, but we are also the nation's library. So, for us it's crucial to get out the word when we get a new collection here."