In an effort to expand the horizons of artificial intelligence training, technology companies are increasingly turning to a rich trove of historical knowledge rather than the familiar depths of the internet. A significant step in this direction is the release to AI researchers of nearly one million books from Harvard University’s vast collection, some dating back to the 15th century and spanning 254 languages. Extensive archives of old newspapers and government documents held by the Boston Public Library are expected to follow.
By unlocking these historical resources, tech companies can tap into a treasure trove of data, rich with cultural, historical, and linguistic significance. This move is particularly critical as AI systems face legal challenges from authors and artists concerned about the unauthorized use of their copyrighted works for AI training. “Starting with data from the public domain avoids the controversies associated with copyrighted material,” explained Burton Davis, deputy general counsel at Microsoft. This wealth of historical data promises to fill in the gaps not addressed by the more recent online data typically utilized by AI chatbots.
Supporting this initiative are “unrestricted gifts” from Microsoft and OpenAI, the maker of ChatGPT. The Institutional Data Initiative, based at Harvard, is working with libraries and museums around the world to prepare their collections for AI use in ways that also benefit their local communities. “Our goal is to empower these institutions during this pivotal moment for AI,” said Aristana Scourtas, head of research at Harvard Law School’s Library Innovation Lab, noting that librarians have always been custodians of data and knowledge.
The newly available Harvard dataset, dubbed Institutional Books 1.0, consists of more than 394 million scanned pages. It includes works from as early as the 1400s, such as a Korean painter’s musings on cultivating flowers and trees, though the bulk of the collection is made up of 19th-century works on subjects such as literature, philosophy, law, and agriculture. Because the material has been meticulously preserved and cataloged, it gives AI developers a way to improve the accuracy and reliability of their models.
Greg Leppert, executive director of the data initiative and chief technologist at Harvard’s Berkman Klein Center, noted that the collection offers invaluable material because it goes back to the original sources as scanned by the very institutions that gathered them. Prior to the surge in popularity of AI technologies like ChatGPT, many researchers did not consider the origins of their data, often pulling from diverse platforms with limited emphasis on authenticity, focusing instead on acquiring as many data points, or tokens, as possible.
Harvard’s AI training dataset currently includes 242 billion tokens. While substantial, that is merely a fraction of what some major tech companies use to train their most sophisticated AI systems. Meta, the parent company of Facebook, said the latest version of its AI language model was trained on more than 30 trillion tokens drawn from text, images, and videos.
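For readers unfamiliar with the term, a token is simply the unit a model counts when ingesting text, usually a word fragment. The sketch below is a minimal illustration of that idea; it assumes the open-source tiktoken library and one of its general-purpose tokenizers, not any tokenizer tied to the Harvard collection.

```python
# Minimal illustration of how text breaks into "tokens" (an assumption:
# uses the open-source tiktoken library, not a tokenizer specific to the
# Harvard collection).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a general-purpose tokenizer

text = "Harvard's collection includes works dating back to the 1400s."
token_ids = enc.encode(text)

# Each ID stands for a word fragment; figures like "242 billion tokens"
# refer to totals of these fragments across an entire corpus.
print(len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])
```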
Real-world libraries are now stepping up to offer their vast and distinctive holdings as they cautiously navigate the AI realm. OpenAI, which faces ongoing copyright lawsuits, contributed $50 million to a group of research institutions that includes Oxford’s venerable Bodleian Library, which is digitizing ancient texts with AI as an aid. OpenAI was seeking vast troves of training data, and Jessica Chapel, head of digital services at the Boston Public Library, emphasized that anything the library digitizes will be accessible to everyone, in keeping with its mission.
Digitization endeavors are resource-intensive, often involving meticulous work to prepare vast collections like New England’s French-language newspapers, which were popular among Quebec immigrants in the late 19th and early 20th centuries. Now, these texts serve dual purposes—training AI technology and supporting library projects.
Harvard’s collection was first digitized beginning in 2006 for another tech giant, Google, as part of its controversial project to create an online library of more than 20 million books; the resulting copyright dispute was not settled until 2016. Now, for the first time, Google has worked with Harvard to retrieve the public domain volumes from Google Books and clear the way for their release to AI researchers.
Mary Rasenberger, chief executive of the Authors Guild, praised the initiative, saying, “This dataset will offer unprecedented access and knowledge, democratizing AI model creation.” The dataset is being released Thursday on Hugging Face, a platform that hosts datasets and open-source AI models. Less than half of the collection is in English, and although European languages dominate, it offers considerable linguistic diversity.
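For developers curious about exploring the release, a rough sketch of loading it with the Hugging Face datasets library follows; the repository identifier shown is a placeholder assumption, so check the Hugging Face Hub listing for Institutional Books 1.0 for the actual name and record fields.

```python
# Hedged sketch: pulling the collection with the Hugging Face `datasets`
# library. The repository ID below is a placeholder, not a confirmed name.
from datasets import load_dataset

books = load_dataset(
    "institutional/institutional-books-1.0",  # hypothetical repo ID
    split="train",
    streaming=True,  # stream rather than download the full corpus up front
)

# Peek at a few records to see which fields (text, language, metadata)
# the release actually provides.
for record in books.take(3):
    print(sorted(record.keys()))
```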
The archive of 19th-century thought could prove valuable for developing AI systems capable of more human-like reasoning and problem-solving; Leppert noted that universities are environments rich in pedagogy around reasoning and the scientific process. At the same time, the collection inevitably contains outdated and potentially harmful content.
Kristi Mukk, a coordinator at Harvard’s Library Innovation Lab, said the initiative’s role includes helping developers use the data in an informed and ethical way, offering guidance on mitigating the risks associated with such material in the vast dataset.