On Journal Disambiguation

Posted on 24 June 2009 by rpg

Never mind ~~author~~ contributor ID, or DOIs for articles, or whatever (I can’t be bothered looking up the links): I’m currently trying to find correct names for and de-duplicate entire journals.

ouch
there must be a better way

I have to match up all occurrences of a journal’s name, including misspellings and tyops, in our database and correct them to the canonical abbreviation. For further enjoyment I’d like the URL of the journal’s main page, where one exists.

PubMed, frankly, is a bit crap at finding journal names and their homepages. Anyone know of a good resource? Preferably one with an API or at least a script-friendly interface.

In the meantime, my favourite journal so far is

Meded Rijksuniv Gent Fak Landbouwkd Toegep Biol Wet

closely followed by the laconic

Pain.

About rpg

Scientist, poet, gadfly


35 Responses to On Journal Disambiguation

Henry Gee says:

24 June 2009 at 09:53

You mean, you have to do this by hand?
I see it all, now – ‘Information Architect’ is one of those euphemistic job titles, like ‘Recycling Aggregation Engineer’ (dustman) or ‘Imperial Grand Mekon, Galactic Emperor And Absolute Ruler Of All Living Things’ (_Nature_ editor).
Richard P. Grant says:

24 June 2009 at 09:57

Seriously.
How would a computer know, for example, that
Nat Struct Biol
and
Nat Struct Mol Biol
are the same journal? I only know because I published in it and watched the name change.
And in the example above, you might be able to write a program that identified the three instances of J App Cryst, but would the same program be able to tell that Mol Cell and Mol Cells are different journals? I thought that was a mistake until I looked them up.
Hence the request for an online resource.
Richard P. Grant says:

24 June 2009 at 10:01

Or even that ‘Neuorn’ is a tyop of ‘Neuron’?
The Information Architect’s job is to make sure these mistakes do not occur in the rebuild of the input tool, but I have to fix the fubars that already exist, too.
Jennifer Rohn says:

24 June 2009 at 10:25

Like all Dutch journals, that one you mention sounds like something a Dutch person might yell after hammering his thumb during DIY.
What you need is a sort of Google-esque “did you mean…” spelling approximator built into your dedupe routine. Then it can compile all the similar ones and ask for human input at the very end. Can’t one of your techy people set up a macro or something? I got one to do something like that when I was text-mining.
Journals are hard to find sometimes. I find Wiley InterScience to be the worst: Google invariably leads you to the WIS pseudo-homepage that doesn’t let you do very much, and it’s very difficult to find the link to the real journal homepage within all the corporate mumbo-jumbo.
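A rough sketch of the “did you mean…” idea, assuming a canonical list already exists. The list below is a hypothetical stand-in built from names mentioned in this thread, and `suggest` and `triage` are illustrative names, not anything from the actual rebuild. It uses `difflib` from the Python standard library to collect near-misses and queue them for human review at the end:

```python
import difflib

# Hypothetical canonical abbreviations; a real list would come from a curated source.
CANONICAL = ["Neuron", "J Mol Biol", "Mol Cell", "Mol Cells",
             "Nat Struct Mol Biol", "Pain"]


def suggest(name, canonical=CANONICAL, cutoff=0.8):
    """Return up to three canonical names that closely resemble `name`."""
    return difflib.get_close_matches(name, canonical, n=3, cutoff=cutoff)


def triage(entries, canonical=CANONICAL):
    """Split entries into exact matches and a queue for human review.

    The review queue holds (entry, suggestions) pairs; nothing is
    corrected automatically.
    """
    known = set(canonical)
    accepted, review = [], []
    for entry in entries:
        if entry in known:
            accepted.append(entry)
        else:
            review.append((entry, suggest(entry, canonical)))
    return accepted, review


if __name__ == "__main__":
    accepted, review = triage(["Mol Cell", "Mol Cells", "Neuorn"])
    print(accepted)
    print(review)
```

Exact matches are accepted untouched, so Mol Cell and Mol Cells stay separate as long as both are in the canonical list. Everything else still needs a person to confirm the suggestion, and a name change such as Nat Struct Biol to Nat Struct Mol Biol would still need an explicit alias entry.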
Richard P. Grant says:

24 June 2009 at 10:35

Isn’t that more or less the definition of Dutch?
You’re right, that is exactly what I need. However, not only are the techs already flat-out working towards the site relaunch, but you’d have to populate the dictionary, and then still check everything to see if the suggestion is really what we mean. We did this morning discuss a type-ahead type thing for the next iteration of the site (and not allow people to enter anything that’s not canonical) but again, this task will have to be completed first.
Best get to it, then!
Maria Wolters says:

24 June 2009 at 11:41

I have a very similar problem in my journal database, which features input from many different bibliographical sources. (Incidentally, do you know who made the decision to UPPERCASE ALL JOURNAL TITLES ON WEB OF SCIENCE? There is a special circle of hell for them.) My policy is to keep all journal titles in their full form and abbreviate for medical publications using a giant search-and-replace script.
Some links I found useful:
Biological journals and abbreviations
Medical journal list, very script friendly
How friendly are F1000 with Thomson? They should maintain lists of who merged with whom for all the glamourmags for which they compute impact factors.
Richard P. Grant says:

24 June 2009 at 11:50

Heh. Thanks Maria, those links look shiny.
Steve Roughley says:

24 June 2009 at 12:01

You could try the CODEN or ISSN. CAS administer CODENs from here – along with ISSNs and ‘official’ abbreviations. (It’s all described here on Wikipedia.)
There is also a short list (~1500) at the CAS website.
Richard P. Grant says:

24 June 2009 at 12:06

Ah… we’ve got a list of abbreviations, not useful things like ISSNs. And two and a half thousand non-chemical journals.
Thanks for the thought, though…
Frank Norman says:

24 June 2009 at 12:12

Maybe talk to someone at Suncat?
Duncan Hull says:

24 June 2009 at 12:36

Hi Richard, sounds like you need Named Entity Recognition. The stuff that text-miners get excited about. It’s an “active area of research” – which, as you probably know, means most of the available software isn’t very useful just yet…
Richard P. Grant says:

24 June 2009 at 12:37

snort
Yeah. I’m the named entity, and I don’t recognize a bloody thing.
Richard P. Grant says:

24 June 2009 at 14:18

Oh! Just realized that Nature PG has a lot of these guys, with lots of lovely URLs: http://www.nature.com/siteindex/index.html
Ian Brooks says:

24 June 2009 at 14:20

Wish I could help. I always redux to Google et al.
This kind of issue is exactly why, as we populate our database, or define ontology underlying metadata, users are given drop down menus for data entry.

Thou Shalt Not Enter Free Text

is my mantra.
Richard P. Grant says:

24 June 2009 at 14:22

I’ve just had that conversation with my head Developer. He’s a good bloke. I have to say that, my job depends on him.
Raf Aerts says:

24 June 2009 at 15:07

If this can be of any help:
Meded Rijksuniv Gent Fak Landbouwkd Toegep Biol Wet
=
Comm Agr Appl Biol Sci Ghent Univ
And Jenny, you’re absolutely right. The Dutch version does sound like something we would shout when hitting our thumb with a hammer:)
Richard P. Grant says:

24 June 2009 at 15:09

You’re not helping.
Frank Norman says:

24 June 2009 at 15:54

Sorry for my cryptic comment above – done in haste on the move.
Suncat is the serials union catalogue for the UK, with serials records from all major UK research libraries (and NIMR!).
I guess though that you are not just after a source of data, but a matching algorithm too? Can’t help there.
Richard P. Grant says:

24 June 2009 at 15:55

It’s actually turning out to be reasonably doable, if tedious. Got a tech to hit google and return the first hit for each abbreviation, which is helping populate my URL list sensibly.
When I’ve made this list, I’m flogging it.
Steve Roughley says:

24 June 2009 at 20:23

Apparently, CAS, as the administrator of the CODENs list, assigns them to just about anything that looks vaguely like a journal, even if it’s not chemical, and not abstracted by them… apparently… although I’ve not tried it. Still, no-one reads anything that hasn’t got ‘chem.’ in its title somewhere, do they???? [JOKE!!!]
Richard P. Grant says:

25 June 2009 at 06:18

That’s interesting, because the first few I looked for weren’t there.
Frank Norman says:

25 June 2009 at 06:34

CAS does have a very wide coverage – I think about 13,000 serials – but it doesn’t cover everything.
Steve Roughley says:

25 June 2009 at 08:14

Thinking about it, CAS, in the guise of Scifinder, must have done something similar, as they have a ‘locate article’ feature, which in the journal field will figure things like BMCL, Bioorg Med Chem Lett etc to all mean Bioorganic & (or is that ‘and’?) Medicinal Chemistry Letters, for example – but not sure how well it deals with ‘common’ typos.
And no, I’m not on any sort of commission with CAS – it’s just that they happen to be the ones I interact with most!
Sabbi Lall says:

25 June 2009 at 23:07

Yes, others must have had to solve this problem at some point (so, for example, ISI and Scopus had to deal with NSB and NSMB). If you asked it to search on the first several characters of the names (e.g. Nature Struc*), you would find both names plus any tyops, which might narrow things down a bit?
My comment’s useless, so here’s a URL to help: http://www.nature.com/nsmb
Life is Pain
Heather Etchevers says:

25 June 2009 at 23:20

Life = Pain
Time = Life
ergo…
Cath Ennis says:

25 June 2009 at 23:29

I always thought Gut was a good journal name.
It would be fun to cite Pain in Gut.
Austin Elliott says:

26 June 2009 at 00:32

I’m oddly proud of my solitary paper in Gut – mainly because Gut is the only scientific journal I’ve ever seen feature as “guest publication” (for the missing words in headlines round) in the TV show Have I Got News For You.
PS “Neuorn” sounds like a kind of being in one of Tolkien’s books to me. Just thought I’d say that before Henry did.
Cath Ennis says:

26 June 2009 at 00:35

Austin, your paper looks from the abstract like it might involve the release of calcium from intracellular stores…?
Sabbi Lall says:

26 June 2009 at 00:36

or Pain in Blood
Richard, Pain is here
Richard P. Grant says:

26 June 2009 at 05:55

Thanks guys. Only another 2392 to go.
Actually, the NPG pages were really helpful, I didn’t realize so many journals were theirs, and they have a lovely page of them all.
Sarbjit, good plan, but when you have things like
Mol Cell and
Mol Cells
(two different journals)
J Mol Biol and
J Mol Biol or even
J MoL Biol
(same journal, two misspellings)
then it gets tricky.
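One way to take the easy cases off the pile, sketched under assumptions: fold case, full stops and stray whitespace before comparing, so that variants which differ only in those respects collapse onto one key while genuinely different names stay apart. The function names here are illustrative, and real misspellings would still need a separate fuzzy or manual pass.

```python
import re
from collections import defaultdict


def normalise(name):
    """Case-fold, turn full stops into spaces, and collapse whitespace."""
    name = name.replace(".", " ")
    return re.sub(r"\s+", " ", name).strip().casefold()


def group_variants(entries):
    """Group raw entries that share the same normalised form."""
    groups = defaultdict(set)
    for entry in entries:
        groups[normalise(entry)].add(entry)
    return dict(groups)


if __name__ == "__main__":
    raw = ["J Mol Biol", "J MoL Biol", "J. Mol. Biol.", "Mol Cell", "Mol Cells"]
    for key, variants in group_variants(raw).items():
        print(key, sorted(variants))
```

Mol Cell and Mol Cells end up under different keys, because normalisation never adds or removes letters.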
Austin Elliott says:

26 June 2009 at 10:39

Cath: yep, among other things. I am a calcium signaler / microscopy geek by scientific trade.
Re. journal names, you can’t get it confused with anything, but after sitting through another 40 minute seminar of myriad incomprehensible abbreviations, slides of unlabelled 20-lane Western blots, or handle-turning mutagenesis of every residue in a protein, I often think it is no accident that there is a journal whose abbreviation is Anal Biochem…
Frank Norman says:

26 June 2009 at 11:46

Don’t forget Biochemistry and Biochemistry. Yes, you can find two journals with the exact same title, so then we add on the place of publication to disambiguate, so the second of those becomes Biochemistry (Moscow). Strictly speaking the first one should be Biochemistry (Washington) but we tend to omit the qualifier for the more familiar title.
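The qualifier convention can be modelled by keeping the place of publication as a separate, optional field rather than as part of the title string. This is only a sketch: the `Journal` type and `parse` helper are hypothetical, and it assumes the qualifier always appears in trailing parentheses.

```python
import re
from typing import NamedTuple, Optional


class Journal(NamedTuple):
    title: str
    place: Optional[str] = None  # place of publication, used only to disambiguate

    def display(self):
        """Render the title, appending the qualifier only when one is set."""
        return f"{self.title} ({self.place})" if self.place else self.title


# A title followed by a single trailing parenthesised qualifier.
_QUALIFIED = re.compile(r"^(?P<title>.+?)\s*\((?P<place>[^()]+)\)$")


def parse(label):
    """Split a label such as 'Biochemistry (Moscow)' into title and place."""
    label = label.strip()
    match = _QUALIFIED.match(label)
    if match:
        return Journal(match.group("title"), match.group("place"))
    return Journal(label)


if __name__ == "__main__":
    for label in ["Biochemistry", "Biochemistry (Moscow)"]:
        journal = parse(label)
        print(journal, "->", journal.display())
```

Because the two records compare unequal, a lookup keyed on `Journal` cannot silently merge them, and the more familiar title can still be shown without its qualifier.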
Richard P. Grant says:

26 June 2009 at 13:05

Yah, I have a couple like that, too.
Pamela Arroues says:

22 July 2009 at 16:01

My favorite site for figuring this stuff out is Genamics JournalSeek, at http://journalseek.net/index.htm (most of the abbreviations are in there).
I too am searching for the translation to Meded.Rijksuniv.Gent Fak.Landbouwkd.Toegep.Biol.Wet. What a pain

Comments are closed.
