[http://www.hathitrust.org]

Jeremy York, HathiTrust

HathiTrust

I have prepared to talk about some of the challenges that we have faced in HathiTrust, so I will go ahead and do that. I am really glad to be a part of this discussion. To me it is one that is deep, multifaceted, exciting, and challenging. Some of the challenges we face have to do with incorporating changes that people may make, with liability and authority, and simply with having adequate resources.

Here are some quick examples. You may know that a lot of the corpus material in HathiTrust was digitized by Google, and Google places certain restrictions on our ability to distribute that data. So a key challenge that often comes up with regard to OCR correction, which everybody would like, is that if we were to take OCR corrections from the community, they would get out of sync with the materials we re-download from Google, who are constantly improving them. We have a kind of classic update problem: we might get improved images, but then the OCR corrections no longer match, and all kinds of things go wrong from there. That is a real challenge for us.
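To make that update problem concrete, here is a minimal sketch of one way a repository could key community corrections to a fingerprint of the source OCR they were made against, so that a re-downloaded, improved version flags stale corrections instead of silently diverging. All names and mechanics here are illustrative assumptions, not HathiTrust's actual system.

```python
# Illustrative sketch only: community OCR corrections are stored with a
# hash of the source text they were made against, so a re-download that
# changes the source OCR can be detected rather than silently ignored.
import hashlib
from dataclasses import dataclass, field

def text_hash(text: str) -> str:
    """Fingerprint the OCR text a correction was made against."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass
class Correction:
    page_id: str
    corrected_text: str
    base_hash: str  # hash of the OCR text the contributor actually saw

@dataclass
class PageStore:
    ocr: dict = field(default_factory=dict)          # page_id -> current OCR text
    corrections: dict = field(default_factory=dict)  # page_id -> Correction

    def submit_correction(self, page_id: str, corrected_text: str) -> None:
        self.corrections[page_id] = Correction(
            page_id, corrected_text, text_hash(self.ocr[page_id])
        )

    def ingest_update(self, page_id: str, new_ocr: str) -> None:
        """Called when improved data is re-downloaded from the source."""
        self.ocr[page_id] = new_ocr
        correction = self.corrections.get(page_id)
        if correction and correction.base_hash != text_hash(new_ocr):
            # The source OCR changed underneath the correction:
            # drop it (or queue it for re-review) instead of keeping it blindly.
            del self.corrections[page_id]

store = PageStore()
store.ocr["p1"] = "Tne quick brown fox"
store.submit_correction("p1", "The quick brown fox")
store.ingest_update("p1", "The quick brown fox jumps")  # correction is now stale
```

The hash exists only to detect that the ground truth moved; what to do with a stale correction (discard it, queue it for review, attempt a merge) is a separate policy decision, and is exactly the part that makes this hard in practice.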

Another challenge is that, as far as I am aware, a lot of the use cases for improving HathiTrust have to do not with a single text or book but with a body of works, often for computational research. And it is often a real project to do that: it is not something one person can just go and do; it is often a team of people working on it. They may end up cleaning the data in one way or another as part of the project, so we have another challenge: people clean the data in different ways depending on what they want to do with it.

And I will offer just a quick example of liability. We have bibliographic metadata that is submitted with all of the digital books we receive, and we make rights determinations based on that metadata. When institutions partner with HathiTrust they sign an agreement taking responsibility for their bibliographic metadata, meaning that if something is opened inappropriately because a date was recorded as 1917 instead of 1971, they take responsibility for that. It is a real challenge for community-wide engagement and correction that the institutions hold the liability: they do not want anyone else to correct their bibliographic metadata.
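A deliberately simplified sketch shows why a single transposed date is so consequential. This is not HathiTrust's actual rights logic; the 95-year figure is only the common rule of thumb for US published works, and real determinations involve many more factors.

```python
# Simplified illustration (not HathiTrust's rights logic) of how a
# 1917/1971 metadata typo flips a rights determination: under current US
# law, works published more than 95 years ago are generally public domain.
from datetime import date

PD_TERM_YEARS = 95  # simplified rule of thumb for US published works

def rights_determination(pub_year: int) -> str:
    """Return a coarse rights call from a publication year alone."""
    if pub_year <= date.today().year - PD_TERM_YEARS:
        return "public domain: open access"
    return "in copyright: restricted"

print(rights_determination(1917))  # public domain: open access
print(rights_determination(1971))  # in copyright: restricted
```

A book from 1971 mistakenly recorded as 1917 would be opened to the world, which is why institutions that sign for the accuracy of their metadata are reluctant to let others change it.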

And then regarding resources: a few years ago we were working with Google, and it is a real challenge for them that libraries have such varied metadata. We came up with a whole scheme for institutions to record their enumeration and chronology information (e.g., "Volume One, 1973") in a structured way. It was beautiful, it was great, but what institution has the resources to undertake that? Those are some of the real challenges we have faced.
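To give a sense of the kind of normalization that scheme implied, here is a hypothetical sketch that parses free-text enumeration/chronology statements like "Volume One, 1973" into structured fields. The patterns and word list are illustrative assumptions; real library holdings statements are far messier than this.

```python
# Hypothetical sketch: turning free-text enumeration/chronology statements
# ("Volume One, 1973", "v.2 1980") into structured volume/year fields.
import re
from typing import Optional

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def parse_enum_chron(raw: str) -> dict:
    """Extract a volume number and year from a free-text statement."""
    volume: Optional[int] = None
    year: Optional[int] = None

    # Match "Volume", "vol.", or "v." followed by a digit or number word.
    m = re.search(r"\b(?:volume|vol\.?|v\.?)\s*(\w+)", raw, re.IGNORECASE)
    if m:
        token = m.group(1).lower()
        volume = int(token) if token.isdigit() else WORD_NUMBERS.get(token)

    # Match a plausible four-digit publication year.
    m = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", raw)
    if m:
        year = int(m.group(1))

    return {"raw": raw, "volume": volume, "year": year}

print(parse_enum_chron("Volume One, 1973"))  # volume 1, year 1973
print(parse_enum_chron("v.2 1980"))          # volume 2, year 1980
```

Even this toy version hints at the cost: every institution's local conventions need their own patterns, which is exactly the resource problem being described.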

This presentation was part of the workshop Engaging the Public: Best Practices for Crowdsourcing Across the Disciplines. See the full report here.