I went to the JISC Collections Annual General Meeting today. They are the people who procure and negotiate licences for electronic content (electronic journals, search tools like Web of Science and Scopus, and e-books) for UK Higher Education, Research Councils and various bits of lower education. The formal part of the AGM was no more or less dull than these things usually are, but then they laid on an interesting quartet of speakers to talk about text mining. At first glance there may seem to be little connection between procurement and text mining, but it soon became clear that there is an important link. If text mining is to be successful then some new thinking about electronic resource licences is needed.
Liam Earney from JISC Collections briefly introduced text mining, defining it as gathering up big chunks of literature and performing computations on it to learn something new. Cliff Lynch has written that text mining opens up
entirely new ways to think about the scholarly literature (and the underlying evidence that supports scholarship) as an active, computationally enabled representation of knowledge that lives, grows and interacts with its contributors rather than as a passive archive or record.
But there are barriers to text mining – publishers (in general) do not make it easy for their material to be gathered in this way. As Cliff Lynch wrote:
As the scholarly literature moves to digital form, [how can we] move beyond a system that just replicates all of our assumptions that this literature is only read, and read only by human beings, one article at a time? What is needed to allow the application of computational technologies to extract new knowledge, correlations and hypotheses from collections of scholarly literature?
The indefatigable Peter Murray-Rust explained the problems to us very clearly. He started by outlining an appalling story in today’s Guardian about Ordnance Survey and their very restrictive approach to the use of their mapping data, then retold a couple of other well-known stories about scientific publishers being overzealous in protecting their rights. This happens despite the main scientific publishers’ organisations having agreed that data in research articles should wherever possible be made freely accessible to other scholars. In reality this is not happening, and would-be text miners cannot gain access to the journal source material they need. Peter stated forcefully that ejournal licences should be amended to allow text mining. The team from JISC Collections were sympathetic to his request and agreed it was something that must be worked on.
Peter also mentioned two of his projects: Crystaleye, which trawls for and aggregates crystallographic data from selected journals, and TheOREm, a system for creating semantic theses.
Sophia Ananiadou, from the National Centre for Text Mining (NaCTeM), gave an excellent overview of text mining and its potential for helping to deal with information overload and information overlook (I like that term). She described a pipeline of information retrieval, entity extraction, mining and pattern-finding, and finally visualisation. NaCTeM have built a range of tools and services with strange names like TerMine, KLEIO, MEDIE and FACTA (all available from their website) and are looking forward to applying these to UK PubMed Central.
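To make that pipeline a little more concrete, here is a toy sketch in Python of the retrieval, entity-extraction and pattern-finding stages. It is purely my own illustration – the documents and lexicon are invented, and it bears no relation to how NaCTeM's actual tools (TerMine, KLEIO, MEDIE, FACTA) work:

```python
from collections import Counter
from itertools import combinations

# "Retrieval": a real system would pull documents from a corpus such as
# UK PubMed Central; three made-up sentences stand in here.
documents = [
    "Aspirin inhibits COX-1 and COX-2 in human platelets.",
    "Ibuprofen also inhibits COX-2, reducing inflammation.",
    "COX-2 expression rises in inflamed tissue.",
]

# "Entity extraction": a simple dictionary lookup standing in for a
# trained named-entity recogniser. The lexicon is invented for the demo.
LEXICON = {"Aspirin", "Ibuprofen", "COX-1", "COX-2"}

def entities(doc):
    return sorted(term for term in LEXICON if term in doc)

# "Mining / pattern-finding": count how often pairs of entities appear
# in the same document; frequently co-occurring pairs hint at
# relationships worth a closer look (the visualisation step would then
# plot these).
pairs = Counter()
for doc in documents:
    for a, b in combinations(entities(doc), 2):
        pairs[(a, b)] += 1

for (a, b), count in pairs.most_common():
    print(f"{a} <-> {b}: {count} co-occurrence(s)")
```

Crude as it is, even this picks out the COX-2 pairings – the real tools do the same thing with statistical models and proper linguistic analysis at vastly greater scale.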
Richard Kidd, from the Royal Society of Chemistry, described the impressive Project Prospect that has brought semantic enrichment to the RSC journals. He also referred to Sciborg – another project involving Peter Murray-Rust – on which the RSC are collaborating.
Finally Alastair Dunning, from the JISC Digitisation Programme, described how text mining is providing new ways to look at the results of digitisation efforts. His main example concerned English newsbooks from the mid-17th century.
All in all, this mini-conference nicely complemented the Open Knowledge Foundation workshop I blogged about earlier, highlighting the link between open knowledge and new knowledge generation.
Frank –
It is very exciting to think of where text mining and improvements in journal interactivity are going. In an earlier post of mine I mentioned my own preference, and I hope to expand on that thought in the very near future. Thank you though for this great post – the links were excellent.
Surely text mining and data mining have been around a long time. It is the process automation that is the new development. A lot of my research involves trawling the literature for data, as one of my interests is the representation of physical properties of materials in dimensionless groupings (normalised by physical constants). I spend a lot of time keyword hunting and following reference trees to the original dataset. Of course the real issue is data filtering and not data mining itself.
However, opening up the data and providing tools to identify graphical data representations (and decode them) will be very useful.
Sophia Ananiadou said that the term “text mining” was coined in 1999. Of course you are right that manual text trawls are nothing new. The process of systematic reviews (e.g. Cochrane) usually involves a good deal of manual searching as well as literature searches.
But I think the computational approach makes text mining something a bit different, partly because you can analyse a huge amount of text that would be infeasible to cover without computers. It’s the pattern-finding and clustering that is the interesting part for me.
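To give a flavour of what I mean by clustering, here is a toy Python sketch – entirely my own invention, using only the standard library – that groups documents by vocabulary overlap, the crudest possible form of document clustering:

```python
import math
from collections import Counter

# Invented snippets standing in for article titles or abstracts.
docs = [
    "thermal conductivity of copper alloys",
    "conductivity measurements in copper",
    "mining seventeenth century newsbooks",
    "newsbooks of the seventeenth century",
]

def vector(text):
    # Bag-of-words term frequencies; real systems use richer features.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

vecs = [vector(d) for d in docs]

# Greedy grouping: join the first cluster whose seed document is
# similar enough, otherwise start a new cluster. The threshold is an
# arbitrary choice for this demo.
THRESHOLD = 0.3
clusters = []
for i, v in enumerate(vecs):
    for cluster in clusters:
        if cosine(v, vecs[cluster[0]]) >= THRESHOLD:
            cluster.append(i)
            break
    else:
        clusters.append([i])

for n, cluster in enumerate(clusters):
    print(f"cluster {n}:", [docs[i] for i in cluster])
```

On these four snippets it correctly separates the materials-science pair from the newsbooks pair. Done properly, over millions of articles, that sort of grouping is what surfaces the patterns no human reader could spot one article at a time.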