The first Internet search engine I used, back in about 1990, was Archie. This was an index of content hosted across the internet on FTP servers; mostly software, but there were documents and databases too. Archie didn’t feel much like an information tool; it was more something for computer specialists. Then came Veronica – an index to content hosted on Gopher servers (kind of forerunners of the web). This did feel more like a way to search for information, though its content was still limited – a very small niche. Once the web came along we saw a succession of web search engines. Each came into being in a blaze of superlatives (“bigger and better”), trumpeted as the solution to searching the web, but each lasted just a couple of years and then slowly faded as the next new thing took over (where are you, Hotbot, Lycos, Alta Vista?). I never imagined back then that one of these search tools would grow to become an absolutely key part of the academic information environment with a major presence in every part of the information world.
Google has achieved that position. Beyond its dominant presence as a general internet search engine and software development company, the existence of Google Scholar, the Google book digitization project and the recently-launched Google ebook service make it a core part of the library and information landscape. My theme today is Google Scholar but I will come back to the book projects in another post.
A recent article you may have missed, in the International Journal of Cultural Studies, affirms that Google has become an integral part of everyday life, not least in the academic world. But Google’s instincts are not those of the academic world – it has a tendency to secrecy born of its commercial mission. The press release about the article states:
One of the key points about search engines’ ranking and profiling systems is that these are not open to the same rules as traditional library scholarship methods in the public domain. Automated search systems developed by commercial Internet giants like Google tap into public values scaffolding the library system and yet, when looking beneath this surface, core values such as transparency and openness are hard to find.
Inexperienced users tend to trust proprietary engines as neutral knowledge mediators [but] engine operators use meta-data to interpret collective profiles of groups of searchers.
Another article, in Serials Review, is entitled Google Scholar’s Dramatic Coverage Improvement Five Years after Debut. The author finds that over the five years from 2005 to 2010 Google Scholar has improved its coverage of scholarly journals. Coverage varied between subject fields, but in 2005 was between 30% and 88%; in 2010 between 98% and 100%.
Librarians criticised Google Scholar in its early days for its very patchy coverage, and also for its lack of openness – it was very hard to find out exactly what it did cover. The coverage problem seems to have been overcome, though worries over accuracy remain. In an article in Issues in Science and Technology Librarianship, science researchers at the University of California Santa Cruz were surveyed about their article database use and preferences. Web of Science was the single most used database, selected by 41.6%. Statistically there was no difference between PubMed (21.5%) and Google Scholar (18.7%) as the second most popular database. 83% of those surveyed had used Google Scholar, and a further 13% had not used it but would like to try it; of those who had used it, almost three quarters (73%) found it useful. While Google Scholar is favoured for its ease of use and speed, those who prefer Web of Science feel more confident about the quality of their results than do those who prefer Google Scholar. Librarians and faculty alike often assert that “all researchers use Google Scholar”, and based on this study that is essentially correct.
In this context I was interested to see that Richard Wintle, one of the guest bloggers on this network, wrote recently about his experience of PubMed, suggesting that sometimes Google Scholar performed better than PubMed. I think every tool has occasional weaknesses, so it is good to have multiple search tools available.
Peter Jacso, who has followed Google Scholar for some years, wrote in Library Journal about “Google Scholar’s ghost authors” and in Online Information Review about the “Metadata mega mess in Google Scholar”. He highlights a key problem:
Google’s algorithms create phantom authors for millions of papers. They derive false names from options listed on the search menu, such as P Login (for Please Login). Very often, the real authors are relegated to ghost authors deprived of their authorship along with publication and citation counts.
Jacso says therefore that Google Scholar is inappropriate for bibliometric searches, for evaluating the publishing performance and impact of researchers and journals. One of the problems is that Google’s secrecy means that we don’t know how many records are in Google Scholar, and can only guess at the frequency of these errors.
Google Scholar is five years old, so it is still a young child when compared to PubMed (fully launched in 1997) or PubMed’s progenitor Index Medicus (started 1879). But Google Scholar no longer has a “beta” label, so clearly Google think it is a finished product or at least “good enough”.
My advice – be a little cautious whichever search tool you are using, but especially so with Google Scholar.
Interesting post Frank, and thanks for the trip back down (search engine) memory lane. I remember the days of wondering whether Lycos or the WWWW (World Wide Web Worm) was better – then deciding that AltaVista was the cream of the crop! I can still remember a friend first mentioning Google to me… and telling me its name came from “Go Ogle”(!).*
As for Scholar… thanks for the pingback to my post. If you dig deeply enough you’ll find a link in the post it links to (following all that?) to yet another earlier post I made, which might be vaguely interesting:
Scholarly Googles, Foibles and FAILs, in which I complain about its lack of sub-year date ranges, among other things.
Another issue I’ve noticed, in line perhaps with Peter Jacso’s articles, is that (a) it often picks up the same article more than once, if it’s found in multiple collections, (b) it sometimes seems to hot-link to copies of an article that might be intended to live inside a subscription-only collection, and (c) its listing of citations is completely out of whack with, say, ISI’s Science Citation Index. This latter observation isn’t really surprising, but it does call into question how accurate either source for citation information might be, I think.
*Blatantly untrue.
Suppose I’m showing my age again, but have never used Google Scholar for lit searches – PubMed is my default setting, with Web of Sci mostly only for citation counting. Have only used Google Scholar very occasionally when either:
(i) trying to find stuff the others can’t find but which I know exists;
(ii) trying to track down full text archives (e.g. of journal content) where other methods have failed; or
(iii) trying to hunt down things in JSTOR, which has to be the most impenetrable e-archive I’ve ever had to try and find things in.
I’m convinced that Google are going to take over the world. That’s why I named my cat after them: to please my new overlords (well, and she also has “goggle” marks around her eyes, and spent her first couple of days in our house searching the place from top to bottom. I’m just sorry she wasn’t sitting at her usual spot in the front window when the Google Streetview cameras came past). However, I don’t use Google Scholar much – I use PubMed for everyday use and WOS when I need to see which papers have cited a paper of interest. Maybe I should give Scholar another chance now that it’s improved from its early days though!
Google Scholar is great for quickly finding a known paper online (especially since my university populates search results with links to subscribed databases). If I want to discover papers, however, I search a particular journal or specialized database such as ISI Web of Knowledge.
When reading around a subject in order to make decisions on a manuscript, PubMed is usually first, though I do use Google Scholar. The bottom line is that I’ll use whatever gets me what I want.
For some reason I have gone back to front in responding to comments.
@Henry – yes, I think that is the only sensible strategy. Use what works (for you), but be aware of alternatives to and potential weaknesses in what you use. It’s nice to have redundancy in the system as no service is 100% perfect.
@DrFriction – that is a good illustration: finding known papers versus “resource discovery”, as we info-types call it. Different tools for different purposes.
@Cath – I like the story of your searching cat, Google! Does she have a chrome nameplate? And a small humanoid automaton to play with? I agree that Google seem set to take over the world, but then so did IBM and Microsoft before them. They succeed for a bit but circumstances change.
@Austin – see above, re. using what works. You may not be (just!) showing your age. It could be that you are a discerning and discriminating searcher? Then again …
@Richard – I do not fully understand how Google Scholar populates its database, but yes it does need better duplicate detection I agree. Re. hot links to subscription-only content, the existence of a reference and link in Google Scholar is no guarantee that you can access the fulltext. The same is true of PubMed and other indexes. Various comparisons have been done between Google Scholar, Scopus and Web of Science citation data. I suspect that it varies between disciplines, but generally Google Scholar brings fewer citations. Web of Science seems to outperform Scopus partly because its citation coverage goes back further in time.
I don’t follow all the literature about this, but a presentation at last year’s SLA conference by Lokman Meho gives a useful summary.
Here at Nature towers we have a handy tool that adds PubMed and CrossRef links to every citation in a paper (only for production though, sorry Henry!). If a paper isn’t found, I jump to vanilla Google or Scholar, depending on how hard it is to find. If those two don’t work I give up fairly quickly and ask the author to check it carefully.
For books the British Library is excellent.
On the other hand: Beel, J. Academic Search Engine Spam and Google Scholar’s Resilience Against it. Journal of Electronic Publishing 13(3), December 2010. DOI: 10.3998/3336451.0013.305 http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0013.305
Oh, that is very interesting! It cuts to the heart of what (if anything) makes a scholarly document different from any other old document. Is a web page a scholarly document? Is a technical report (unrefereed) a scholarly document? Google Scholar has some means of identifying scholarly documents, thereby hoping to include a higher proportion of the literature. But (as the first article I mention above says) it is not open about the means by which it identifies documents that it deems worthy of indexing. Trying to spam it by getting it to include made-up articles is one way to test its resilience.
When searching there is always a trade-off between precision and recall – do you want a small number of highly-relevant documents or do you want everything remotely relevant? This trade-off operates at a macro level too – do you want a search tool that includes only well-regarded academic sources (Web of Science) or do you want everything that might have academic content (Google Scholar)? I suppose we shouldn’t complain that Web of Science doesn’t have everything, or that Google Scholar includes some crap – that is how they have been designed.
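For readers who haven’t met the terms before, precision and recall have simple set-based definitions. Here is a minimal sketch in Python; the document names and result sets are entirely hypothetical, just to illustrate how a narrow, curated index and a broad, inclusive one trade the two measures off against each other:

```python
# Precision = fraction of retrieved documents that are relevant.
# Recall    = fraction of relevant documents that were retrieved.

def precision(retrieved, relevant):
    """Of what we found, how much was worth finding?"""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Of what was worth finding, how much did we find?"""
    return len(retrieved & relevant) / len(relevant)

# Hypothetical example: four documents are actually relevant.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# A narrow, curated search returns few results, all relevant.
narrow_results = {"doc1", "doc2"}

# A broad, inclusive search returns more results, with more noise.
broad_results = {"doc1", "doc2", "doc3", "doc5", "doc6", "doc7"}

print(precision(narrow_results, relevant), recall(narrow_results, relevant))  # 1.0 0.5
print(precision(broad_results, relevant), recall(broad_results, relevant))    # 0.5 0.75
```

The narrow search scores perfectly on precision but misses half the relevant literature; the broad search finds more of it at the cost of wading through irrelevant hits – the same trade-off, in miniature, as Web of Science versus Google Scholar.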
Regarding the border between scholarly documents and the rest, we librarians used to talk about “grey literature” – things like reports, standards, patents, legislation and all those things that are not from mainstream publishers. Grey literature was often hard to find out about and hard to get hold of. Since the invention of the web much of the grey literature has been published electronically, usually free of charge. Bringing this material into mainstream finding tools is helpful, though not everyone will be interested in it.
If Cameron Neylon’s recent piece in the Times Higher is correct about the future of scholarly publishing, I wonder what implications there will be for search tools? (Sorry – no time to think this through just now, but interested to hear opinions).
So, Frank, what do you think about referencing to parts of electronic books and how it could be done?
Heather –
Sorry your comment was held up in moderation.
My first reaction was “Oh, that’s easy. CrossRef have been issuing DOIs for book chapters for a while now”. But then I read through the post you linked to and realised it’s not so easy when you want to get down to a precise location within a book. I had already observed, when reading a book on my iPad, that the page numbers changed when I switched from portrait to landscape orientation, so they cease to have any absolute value.
OTOH, when you cite something in a journal article you only link to the article as a whole, which could easily be 10 pages or (for a big review article) even 100 pages long. Hence, linking at the level of a book chapter is probably no worse than that.
I agree it does seem unsatisfactory though. I don’t know enough about the technicalities of ebooks to know what could be possible. I think ePub format is a variety of XML, so in principle one would think that anchors and links should be feasible. I will keep my eyes open for more on this topic.