If you have too much to read, or too much information to digest, could a machine do it for you? That is in essence the motivation behind the expert area of content mining, here including both text mining and data mining.
This is a research study commissioned by the Publishing Research Consortium on the topic of Content Mining of Journal Articles. Content mining is defined as the automated processing of large amounts of digital content for purposes of information retrieval, information extraction and meta‐analysis. This study, carried out between February and May 2011, aims to provide an overview of current practices, players, policies, plans and expectations for text mining and data mining of content in academic journals. The research consisted of a series of 29 interviews with experts and people working on content mining and was concluded by a survey among scholarly publishers.
Overall, experts expect a further acceleration of text and data mining of scholarly content, sparked by a greater availability of digital content corpora, ever‐increasing computing capabilities, improved user‐friendliness of software tools and easier access to content. Semantic annotation of content is expected by some to develop into a new standard for STM content, facilitating better and deeper search and browse facilities into related articles, even if use cases and business propositions are still in their infancy.
This optimism about Journal Article Mining is generally shared by publishers across the board, who expect an increase in publishers mining their own content. Half of the publishers surveyed also already see an increase in mining requests from third parties. The mining requests that publishers receive are not very frequent (mostly fewer than 10 per year, and for a good share even fewer than 5 per year) and come mostly from Abstracting and Indexing services and from corporate customers. Respondents also note a fair amount of illegal crawling and downloading, which suggests unreported mining activities.
Publishers tend to treat mining requests from third parties in a liberal way, certainly so for mining requests with a research purpose. Publishers are less permissive if the mining results can replace or compete with the original content. Few publishers have a publicly available mining policy; the large majority handle mining requests on a case‐by‐case basis. Approximately 30% of publisher respondents allow any kind of mining of their content without restrictions, in most cases as part of their Open Access policies. Of the other publishers, nearly all require information about the intent and purpose of the mining request.
Regarding measures to make content mining easier across content from multiple publishers, the interviews generated a broad spectrum of possible improvements: the creation of one shared content mining platform across publishers, commonly agreed permission rules for research‐based mining requests, collaboration with (national) libraries, and standardization of mining‐friendly content formats, including basic, common ontologies. Of these options, the suggestion for more cross‐publisher standardization of content formats received the most support in the survey, especially from the (self‐declared) mining experts. Collaboration with libraries was least popular, while one content mining platform received a good response overall but less positive responses from those respondents who have expertise or experience in content mining.
The survey results were cross‐analyzed for differences in responses between small and large publishers, between different types of publishers, and between experts and non‐experts. Size and type of publisher showed no statistically significant deviations, except that larger publishers (with more than 50 journals) receive more mining requests, do more mining of their own content and more often have publicly available permission guidelines for content mining. Differences between expert and non‐expert responses were most prominent regarding solutions to make content mining easier: experts were more articulate in their opinions, giving a higher rating to standardization of content formats and a lower rating to the creation of one shared content mining platform. Experts were also more negative than non‐experts about collaboration with libraries.