British Library to archive billions of web pages

06 Apr 2013

The British Library is embarking on one of its biggest expansions, investing three million pounds in a project to collect archive material through an "annual trawl" of the UK web domain.

The library said it would start recording the country's burgeoning collection of online cultural and intellectual works.

Regulations coming into force on 6 April will enable six major libraries, namely the British Library, the National Library of Scotland, the National Library of Wales, the Bodleian Libraries, Cambridge University Library and Trinity College Library Dublin, to collect, preserve and provide long-term access to the increasing proportion of the nation's cultural and intellectual output that appears in digital form – including blogs, e-books and the entire UK web domain.

Under the new regulations, major libraries in the UK will have the right to receive a copy of every UK electronic publication, on the same basis as they have received print publications such as books, magazines and newspapers for several centuries.

The new regulations will ensure that transient data and websites can be collected and preserved for future use, say 50, 100 or even 200 or more years from now.

The British Library estimates that a billion web pages could now be amassed each year, alongside books, magazines and newspapers, and then stored for several centuries.

The library is reported to have lost a lot of material, particularly relating to events such as the 7/7 London bombings and the 2008 financial crisis. Most of that material can no longer be traced, say library sources.

The operation to "capture the digital universe" will begin with an automatic "web harvest" of an initial 4.8 million websites – or one billion web pages – from the UK domain.

"Legal deposit arrangements remain vitally important. Preserving and maintaining a record of everything that has been published provides a priceless resource for the researchers of today and the future. So it's right that these long-standing arrangements have now been brought up to date for the 21st century, covering the UK's digital publications for the first time," culture minister Ed Vaizey MP said.

The principle of extending legal deposit beyond print was established with the Legal Deposit Libraries Act of 2003. The present regulations implement it in practical terms, encompassing electronic publications such as e-journals and e-books, offline (or hand-held) formats like CD-ROM and an initial 4.8 million websites from the UK web domain.

Access to non-print materials, including archived websites, will be offered via on-site reading room facilities at each of the legal deposit libraries. While the initial offering to researchers will be limited in scope, the libraries will gradually increase their capability for managing large-scale deposit, preservation and access over the coming months and years.

By the end of this year, the first archived materials from the UK web domain will be available to researchers, along with tens of thousands of e-journal articles, e-books and other materials.