The ways and means of science are changing. It’s true: I can feel the tide tugging at me. I’m that waterlogged bit of dead tree mired in beach shingle; the last few passes of the surf have caused me to start sliding in. As the tide continues to turn, I will soon be flowing out into the grey deeps, liberated from gravity and on my way – whether I want to be or not.
Too much information: Scientific datasets no longer color between the lines
What triggered this idea today was Excel spreadsheets. Like them or loathe them, it’s not really possible to analyze a genome-wide screen without a large number of them. In the past I have got round my antipathy towards the output of this hateful Microsoft product by printing the damn things out at the first opportunity, impaling them spitefully with holes and filing them in a tidy binder with colourful tabs. Soon, the printed spreadsheet would acquire scribbles, notes, a rainbow’s worth of highlighter pen marks. Thumbed through until the corners were ragged, stained with coffee, I would know exactly where my experiments were and what I had to do next. I might feel the need to update or correct the electronic version, but it was never the working copy.
All well and good, but what to do when your spreadsheet has thousands of rows and more than fifty columns? No amount of column narrowing and font reduction can force one of these babies onto a piece of A4. Print it out and your machine will spew out a monster collage that would need to be pieced together like the Dead Sea Scrolls (along with about a hundred superfluous blank pages for good measure). But try as I might, I cannot seem to think when facing a small computer screen with multiple windows of information that I need to compare. Click one open and the other is immediately forgotten; click back and you forget why you left in the first place.
But there is hope: I liken this difficulty to the mental shift I had to make, in the 1980s, when we all had to start composing words with a keyboard instead of a pen. Remember that, those of you of a certain age? I have a distinct recollection of sitting at a shiny Canon electric typewriter in my university dorm room, trying to force my creative juices to flow without a pen between my fingers. I felt disarmed, almost crippled. The typing movements of my fingers could not seem to stimulate the same neuronal pathways. Now, of course, my handwritten journals are what is rough and artless – only with a keyboard can I produce quality material. My brain, it seems, has adapted. And I have no doubt that the next generation will be able to perform these mental acrobatics, to think in virtual space, as naturally as breathing.
In the meantime, you’ll have to excuse me: I have a tide to catch.
> Like them or loathe them, it’s not really possible to
> analyze a genome-wide screen without a large number of
> them.
Oh please, please, please, no, don’t do that with Excel, please :)
http://www.mysql.com/
http://www.r-project.org/
http://www.geol.lsu.edu/jlorenzo/linux_commands/linux_commands.htm
Greetings, Lindenbaum – I don’t even know you, but I somehow sense that you are mounted atop a rearing white steed.
Can these items do better when the relationships are not like-to-like? (as in cross-species comparison when there is one homologue in one organism and multiple ones in the other?)
Hi Jennifer,
I only read this post by coincidence, and haven’t much to say about the topic, but I really like your writing style. Best,
B.
Thanks B.! I don’t know anything about the topic either, which is a large part of the problem.
That’s pretty hardcore, Lindenbaum, at least for ‘wet’ biologists like Jenny and myself.
It is an emerging problem: I’m doing exon microarrays. The commercial software is (a) expensive and (b) crap, so we’re cobbling stuff together with Filemaker Pro and bush Perl scripts.
It’s a – um – learning experience.
I don’t want to taint the rest of my lab by my luddite association. Lots of my labmates are comfortable with this tool, our bespoke RNAi database. But it’s still online, and the comparative presentation is, for me, far less cumbersome when exported into a tab-delimited file.
Now, of course, my handwritten journals are what is rough and artless – only with a keyboard can I produce quality material.
Don’t knock the power of the pen. Perhaps it’s because I have been through the person-of-a-certain-age barrier, but I find a pen and a big notebook (better still, a pencil) really helpful for getting ideas down quickly, without having to wait to boot things up. Sure, all the ‘quality’ work gets done on the keyboard, but I find pencil-and-paper by far the best for rough sketching — of plotlines, and so on.
Of course you’re right, Henry. I am a copious list maker and middle-of-the-night-notebook-by-the-bed scribbler, and I find it valuable. But as far as creating art – I used to be able to do it with a pen, and now can no longer. And you read stories of Jane Austen penning Pride and Prejudice in one draft…it’s astounding. Word processing has become synonymous with writing.
Oh boy, there’s a lot to get through here. Starting with “what Pierre said”. It’s entirely possible and indeed preferable to analyse genome-wide screens without Excel spreadsheets.
It’s natural to look at data and try to force it into the tools with which we’re familiar. You see rows and columns, you think “spreadsheet”. Other people see rows and columns, they think “database table”, or “R data frame”. First point then: if you know that there’s a better way to perform a task, don’t you owe it to yourself to at least try and find out about it?
To answer Jennifer’s questions: getting your data into a database table (such as MySQL) is incredibly beneficial, because it allows you to query it in all sorts of ways. So yes, you would be able to ask questions such as “show me all orthologs of gene X from species Y and list the paralogs too”. If your data are structured, designing relevant queries is easier.
Which brings me to “that’s easy for you to say but I’m a bench biologist”. Well – so are/were an awful lot of people who now call themselves bioinformaticians. You don’t learn this stuff overnight – it takes some commitment and most of all, a belief that it’s worthwhile.
I realise that many hard-working bench scientists only have time to dabble in bioinformatics – and that’s fine. At least try to network with your friendly, local bioinformatician or statistician (or network socially on the web). You’ll find a lot of people more than willing to help and advise – but you have to meet them halfway. We’re not impressed by this “but I’m just a little ol’ wetlab biologist” thing. You’re a scientist, I’m a scientist, we’re both interested in learning new stuff, new technologies and analysing data in the very best way that we can.
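For anyone wondering what the ortholog/paralog question above might look like in practice, here is a minimal sketch using Python's built-in `sqlite3` module rather than MySQL, simply so it runs without a server. The table layout, the gene symbols and the function names are all invented for illustration and are not taken from anyone's real screen. It also shows the one-to-many case raised earlier in the thread: one fly gene with two human orthologs, one of which has a paralog.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with made-up genes (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (
            gene_id INTEGER PRIMARY KEY,
            symbol  TEXT NOT NULL,
            species TEXT NOT NULL
        );
        -- One row per relationship, so one gene can have many homologs.
        CREATE TABLE homolog (
            gene_a INTEGER NOT NULL REFERENCES gene(gene_id),
            gene_b INTEGER NOT NULL REFERENCES gene(gene_id),
            kind   TEXT NOT NULL CHECK (kind IN ('ortholog', 'paralog'))
        );
    """)
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", [
        (1, "geneX", "fly"),
        (2, "geneX1", "human"),
        (3, "geneX2", "human"),
        (4, "geneX3", "human"),
    ])
    conn.executemany("INSERT INTO homolog VALUES (?, ?, ?)", [
        (1, 2, "ortholog"),  # fly geneX -> human geneX1
        (1, 3, "ortholog"),  # fly geneX -> human geneX2
        (2, 4, "paralog"),   # human geneX1 -> human geneX3
    ])
    return conn


def orthologs_with_paralogs(conn, symbol, species, target_species):
    """List each ortholog of a gene in target_species, with any paralogs."""
    query = """
        SELECT o.symbol AS ortholog, p.symbol AS paralog
        FROM gene AS g
        JOIN homolog AS ho ON ho.gene_a = g.gene_id AND ho.kind = 'ortholog'
        JOIN gene AS o ON o.gene_id = ho.gene_b AND o.species = ?
        -- LEFT JOIN keeps orthologs that have no recorded paralog.
        LEFT JOIN homolog AS hp ON hp.gene_a = o.gene_id AND hp.kind = 'paralog'
        LEFT JOIN gene AS p ON p.gene_id = hp.gene_b
        WHERE g.symbol = ? AND g.species = ?
        ORDER BY o.symbol, p.symbol
    """
    return conn.execute(query, (target_species, symbol, species)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for ortholog, paralog in orthologs_with_paralogs(conn, "geneX", "fly", "human"):
        print(ortholog, paralog)
```

The point of the separate `homolog` table is that a relationship is a row, not a column, so "one gene here, several over there" needs no special handling. A real schema would almost certainly need more than this (identifiers from a public database, evidence codes and so on).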
hahahaha! Hello Neil. I might have expected such a reply from you :)
Re-reading the thread, and the comments, I think you’re being a little unfair in your penultimate sentence. We (JR, me, quite a few others) are trying, we are learning. It’s just a little daunting when someone says to you “Oh, use MySQL” as if it were the most natural thing in the world.
It’s just a little daunting when someone says to you “Oh, use MySQL” as if it were the most natural thing in the world.
No more so than someone saying, “Oh, just clone, tag, and IP it” or, “just do a genetic screen!”
@Richard – if I came over as unfair, I really didn’t mean to! I do appreciate that a lot of biologists are making the effort.
I just look at computer literacy as another useful skillset in the problem-solving arsenal of the biologist. If you need it, you learn it. It’s the ones who say “all that computer stuff is irrelevant to me” simply because the terminology is unfamiliar who really annoy me :)
Eric: touché.
Neil: No worries mate. If you need it, you learn it is a mantra to live by, but it doesn’t stop it being daunting, or difficult.
Hi again Jennifer,
Neil has nicely written what I quickly wanted to say when I suggested using a database such as MySQL to analyse your data. Of course, Eric is right: you cannot ask someone to use such an engine “as if it were the most natural thing in the world”. But I guess you might find someone who has those skills (You might have some bioinformaticians in your Nature Network), describe what you want, and start a wonderful scientific collaboration. :)
@Eric: No more so than someone saying, “Oh, just clone, tag, and IP it”
Yes, but I would never ask a bioinformaticist to take five years to learn how to do this well for a project that’s only funded for four. One could argue that one should play to one’s strengths as a scientist. (Maybe this is my industry experience showing: when in doubt, outsource.)
@Lindenbaum: (You might have some bioinformaticians in your Nature Network)
Hey, Lindenbaum, wanna join my Network? :)
Thanks for all your comments and suggestions, and just to assure you that I am happy to learn, I’m good with computers, we do commune with bioinformaticists, and I have even recently told my boss that I’d like to go on a bioinformatics course. But in some ways we are a bit off-track. The phenomenon I’ve posted about originally is a failure to grasp complex relationships in virtual space – whether in spreadsheet or queried database table, I’m not convinced that I will be able to easily get my head around the dataset without something on paper. This is not because paper is the best way to do it, but because at the moment, I don’t seem to be wired to see data that way. This, I know, will change…hence my metaphor about the tides.
I love the text in the graphic.
Neil has mentioned trying to find a tame bioinformatician or statistician (promises of beer and blueberry pie have worked on me in the past). There are a couple of groups here at NN that you could try posting queries on. I think they need someone to start posting before they take off.
Well, Bob, time and commitment are an issue. If someone were going to help me out properly, it would be a significant amount of time at the moment. In previous places where I worked, there’d be a dedicated bioinformatics facility – one could just make an appointment to chat with someone, and they’d work with you for days, weeks, months, whatever it took. The first thing I did when I moved to UCL was look up the Bioinformatics Department – and found out that it was a paid service. At that time, our finances were pretty tight: so quite disappointing.
So you get what you pay for. When it’s a casual collaboration, I find the expert never has enough time: this is my full-time project, but only one of dozens for the expert. You don’t really feel comfortable pressuring them to work faster or to prioritize certain things; even if they’re a co-author, you can’t really expect someone doing you a friendly favor to put in the hours required. The RNAi screening group I recently visited in Heidelberg had an astounding 2:1 ratio of full-time bioinformaticians to wet biologists. Which tells you something about the time requirements.
And a linguistical question:
bioinformaticist or bioinformatician? (I like the latter, with its flavor of magic)
I will not mention the computational biology group at one department I was in…
damn. Too late.
Neil has also written about this on his blog.
Seems to me that rather than being adversarial about bench biologists learning computer skills or bioinformaticians learning lab work, there is scope for collaboration here, for most efficient use of everyone’s skills?
Of course. But as I mentioned above, in certain projects, the computational needs far outweigh the biological needs; it’s not as easy as Bob’s blueberry pie to persuade very busy, sought-after people into a hefty collaboration. Sometimes, paying them could be the only option – and then you need the readies.
Jennifer’s point about the lack of time etc. is a good one – that’s why you need to tame bioinformaticians and statisticians. A lot of the queries I deal with are fairly easy to handle, particularly after I’ve trained the biologists properly.
The stats support in our biology faculty is awful. There are a few people with the skills, but not enough of us. The people doing the work know they need the support, but that takes resources, and the message somehow doesn’t filter up. I provide the support because I enjoy it, and I like the people (and the blueberry pie!), but I’m not paid to do it.
Hmm. This might need a blog post to explore fully. I already have one half-written responding to some of the issues raised here. And I have to get back to the sheep horns….
Bob, alert me when you finish that post – my network snapshot has become unusable because I have too many contacts!
Yes, I think paid support is going to be the most reliable for really hardcore tasks…so it comes down to the budget in the end. Otherwise your support is always going to be sporadic, and depending on the whims and schedules of others. And the price of blueberries when out of season! (£3 for a box containing about 30 berries, currently)
I’m carefully trying to avoid an ‘us and them’ thing here. I’m not entirely successful, but please keep my intent in mind as you read this :)
My experience, having desperately wanted bioinformatical help, is that the bioinformaticians are too interested in proving the fossil record[0], or something equally fatuous, to lend their skills to an interesting biological problem. Certain people from the group I’m thinking of have come all the way to Australia to give seminars, and people have walked away saying “What was the point of that?”
And before you bite me, Neil, that’s because what they were doing was biologically useless, not because we didn’t understand it.
I would love to have bioinformaticians on site, who wanted to collaborate. And I know such beasts exist – Neil himself has given me useful pointers (he’s 600 miles from me, so a more formal collaboration is not impossible, just more tiresome) and I appreciate that. But my experience is that Neil is unusual in this respect. (Having said that, my DPhil supervisor was a skilled BI and loves turning computers loose on interesting biology, too.)
Now I’m expecting that a lot of bioinformaticians will now crawl out of the silicon-work and say “Hey! I want to help you!” and that’s brilliant, really. But many of us more ‘wet’ biologists (including those of us who can hack a little bit of silico stuff) have been burned by uninterested comp biol people.
[0] We know it’s true. What’s the point of saying this group of proteins is related to that group by this much, if there’s no functional information?
Wow! Not sure there is much more I can add here of substance but I wanted to tease out a couple of the issues. A real problem is we don’t have these systems literate people embedded in our groups. Two reasons for this; the funding systems don’t support them, and we’re not training people properly so that new people don’t assume they need these skills. But as someone said, if there isn’t someone in your lab then there will be someone out there who does have the skills. And this is exactly the right place to find them (and nice helpful ones like Neil as well!). You don’t have to share a cup of coffee with someone to collaborate with them (although it really does help).
Actually I do have something substantive to offer: dabbledb (dabbledb.com) is a not-too-bad point-and-click, web-based database system that can upload Excel spreadsheets and then do some simple search and relationship-type queries on the data. You might find that a reasonable halfway house en route. We use it for our stocks and strains database (neylonlaboratory.dabbledb.com – can give you access to the underlying database if you want a look).
@Richard: you wrote “what they were doing was biologically useless, not because we didn’t understand it” and my way around this recently, was to commit myself to testing one of these probably biologically useless tools on a grant application the computer folks put in, in exchange for goodwill and calling on them when I am stuck in a hard place.
@Pierre: thanks for the real-life links.
@Jennifer: just as an example, I’m about as wet as you or Richard, but a postdoc and I gritted our teeth and used R for microarray data; for our SAGE data we graduated from Excel to Access and sideways to FileMaker Pro and next round will use MySQL; it’s possible. Wrapping my mind around the use of queries was embarrassingly difficult but extremely useful, and now we have all sorts of cool databases in the lab. And it’s just the tip of the iceberg (I hope)…
So I nearly never use Excel anymore, although sometimes I like to see all entries in the databases just to remember the volumes of data we generated… and I tend to use “find” with a wildcard more than a formal query.
Heather, thanks for the encouragement. I already know how to use FileMaker and Access for other purposes, so I might give that a whirl.
Cameron, dabbledb sounds interesting as well!
@Jennifer just to know if I/we could be of any help (no warranty, I’ve got my own work to do :) ) and if you’re OK, could you describe (here or by mail?) your dataset and the common operations you’re doing in Excel with those data?
@cameron thank you for the link. I also like http://services.alphaworks.ibm.com/manyeyes/home and Google has just released an SQL-like API which allows you to analyse a remote dataset: http://code.google.com/apis/visualization/.
Galaxy is another tool for (genomic) data integration http://main.g2.bx.psu.edu/
@Richard the worst part of (bio)informatics are the users :)
Thanks, LP, I will contact you offline about this…
And before you bite me, Neil, that’s because what they were doing was biologically useless, not because we didn’t understand it.
Oh, there’s an awful lot of biologically-irrelevant bioinformatics for sure. Just read – well, any issue of a bioinformatics journal :)
If it isn’t helping biologists with their data, it ain’t worth squat, IMHO.
There are probably some BI’s out there who don’t agree…
@Neil There’s an awful lot of biologically-irrelevant copy-and-paste-your-favorite-field-here for sure.
:)
A real stumbling block to initiating collaborations of the sort we’re talking about here is failure to communicate. I’ve had many people come to me and ask if I could do “stuff” with their data – which they generally don’t explain very well (assuming I’m biologically illiterate, no doubt).
I would suggest that making a list of questions you want answered, and a list of data available, both of which you can present to your FNB* will get you a long way towards fruitful discussions. And please, when they suggest you don’t have enough data to make a statistical argument, don’t hit them – ask why?
you might also want some advice on asking geeks questions
@JR – third the R recommendation – that’s how I got into microarray analysis.
* friendly neighbourhood bioinformatician
"Now I’m expecting that a lot of bioinformaticians will now crawl out of the silicon-work and say “Hey! I want to help you!” and that’s brilliant, really."
In my case I never turned down an opportunity to help, and most of them never involved a paid-or-published collaboration. That’s why we blog, and why we have communities and forums here and elsewhere.
Paolo, would you have turned it down if it were going to take you six months of full-time work? All I’m saying is that some projects need more than just occasional favors, I think, and when a wet lab hits this problem, it’s difficult to know what to do.
Thanks for the blog and forums links!
I was thinking more in terms of co-authorship than payment….
Either way, Maxine, you are asking someone who is probably employed full time by someone else to sacrifice their own ongoing projects for something outside of their group. I’m not going to mention names, but I am aware of more than one ‘wet’ colleague whose friendly neighborhood BIs are getting flack from their supervisors for spending too much time off-project, leading to tension between the groups. A younger lab member, say a postdoc BI, might be pleased to help out a biologist colleague in their peer group, but this will have ramifications for the BI’s home lab/service department, which authorship may not fully recompense.
I think I have suggested this before but this thread looks like a good place to repeat it. It would be great if Nature Network had some way to more easily find collaborators. Each person could define the set of topics/tools/time they would be willing to provide, and when someone placed a call for collaboration on their profile it would be propagated to the right people.
We could also think of the opposite case that happens a lot in bioinformatics. Computational predictions should be tested as often as possible and in some cases testing the predictions would take almost no extra time for someone with the right reagents in hand.
Pedro, this is an excellent suggestion. It could be a unique selling point for NN. Scientists don’t really need another social networking site for social stuff, but for really practical concerns, NN should take the lead.
Count me in. I can give excellent advice on homebrew.
Yes, I suppose Microbiology qualifies…
Richard, can I collaborate? I’ve got several fruit trees but haven’t made the plunge into making plum beer and fig wine yet, despite much encouragement from friends who want to participate in the spoils.
Sure thing Cath.
I’ve given Jenny my chutney recipe, but you know what it’s like with collaborators: They’re all keen at first and then 3 years down the line you have to hassle them for the data.
I’m more of a home-made pie sort of girl. (American sweet, not British savo(u)ry)
mmmm pies.
I agree with Pedro too. I guess the problem is making it work.
OK, I’ve created a group to discuss these issues – it seems more appropriate than filling Jennifer’s comments (not that she’s complaining, I’m sure.).
I think it’s interesting to speculate whether this sort of online networking will ever replace the ties people make physically, through people they have met at conferences and at seminars. The other day a colleague of mine decided to email a perfect stranger to ask whether he could collaborate, and possibly even visit the lab for an extended period to learn a technique. This felt, to me, a bit abrupt. But then, more than one sort of tide is currently turning…
I quote from that article Chris cited, because I have been guilty as charged, and it might prevent any of you from the same:
“If you are trying to find out how to do something, begin by describing the goal. Only then describe the particular step towards it that you are blocked on. Often, people who need technical help have a high-level goal in mind and get stuck on what they think is one particular path towards the goal. They come for help with the step, but don’t realize that the path is wrong. It can take substantial effort to get past this.”
This is where we biologists have to be careful not to talk down to the biostatisticians – they nearly always understand the big goal, and when they don’t, we’re not explaining well enough. In one instance, we talked at cross-purposes for a month.
Online networking might work best with bioinformatics; inviting someone physically to your lab sight unseen for a month is riskier from a merely human point of view but it’s pretty much the same gamble if you’ve met them once at a conference coffee break. That is, they can still break your equipment and keep incompatible hours and be sexist or whatever else you might worry about as a host.
Actually, it was the other way round – he was inviting himself to a mini-sabbatical in the stranger’s lab! Cheeky, as the Brits say.
“they nearly always understand the big goal, and when they don’t, we’re not explaining well enough.”
I think this is an interesting assumption, and not one I think I agree with when reflecting on my own experiences. Why should the BIs have sole custody of perfect communication and understanding skills? To be honest, I found the ‘How to talk to geeks’ link offered above to be a bit patronizing, which for me almost overshadowed the bulk of very good advice. If that document were all I knew about geeks, I’d never even want to try. I don’t believe great expertise gives one the right to be scathing. Sometimes people don’t know enough to even know if their question is stupid – and if they’re slapped down, they might never come back. I think treating people politely and with respect should be something the scientific community should always strive for.
Bioinformatics is just one area where most biologists need help in the form of a collaboration of some sort. Scientific writing is another example, especially for those of us not speaking English as a first language. I do cancer research, and a good collaboration between basic and clinical research is essential and should go in both directions. Unfortunately, I have rarely found good examples of this in the 15 years I have been doing cancer research.
The Collaboration Group that Bob started is a good step forward. (Do all intensive blog discussions on Nature Network result in the start of a NN Group?) Some general questions that I have include
Can a collaboration work when started in a social network? Do we trust someone we never met in person?
What are the forms of compensation? Is there only coauthorship and money?
Can the Nature Network software help in creating collaborations? You can list interests, projects and publications on your profile, why not also services you are willing to provide.
Martin –
I have had a collaboration with a guy in Mexico that was done purely over the web – he posted a question to a mailing list, and it went on from there. I’ve never actually met the guy.
As for compensation, blueberry pie (see above) is another alternative.
The NN software might be able to help generate collaboration, but I don’t know what the best way of implementing it would be. Why not deposit your thoughts at the collaboration group? As Mrs. Merton would say “Let’s have a heated debate!”
I’ve also finally put up some of the thoughts as I had threatened.
It’s odd, Bob, Martin…despite all the interactions I have online with people I’ve never met, I find it much easier to feel comfortable about collaborating when I finally meet them in the flesh. This, too, is probably a defect of my generation that will not be passed on to the next.
Wow, what a fascinating thread! Does this break some kind of record for the most-commented-on-nature-network-post I wonder?
It does now :)
Hi Duncan
Thanks for your kind words. I was gratified to see so many bioinformaticists crawl out of the woodwork on this one. I think a lot of people are grappling with this problem on both sides, so I was also pleased to see so many solutions mooted.
What format is the original data in?
and
What questions are you trying to answer?
I’m sure someone can rustle up a quick (and re-usable) interface, if you make the data available.
Anyone who can animate their avatar has my admiration!
Thanks for the offer – I am amazed by how helpful everyone is being.
I’m a little late to this thread… To add to the list of suggestions: SQL is part of the core curriculum for undergraduate Computer Science majors at most universities. I don’t know whether you are at a university with a CS department, but I’m sure that if you put up a poster/flyer in the department asking for help with your database design/querying, send an email to the CS email list, or approach a CS professor for recommendations for students, you’d easily be able to find people to help out. They’d learn about practical (and interesting!) applications for their skills, and you’d (hopefully) get another way to look at your data.
Hilary –
I like the idea of ~~exploiting~~ edifying students! It’s a good idea, thanks.
I have been working with CpG island microarray data in Excel for some time; initially it was hellish work, now I am getting used to that mess, no other go….
Yes, I think you can get used to anything. That had more or less been my strategy, before hearing some of the ideas in this thread…
IMO the challenge is always authoring tools. People don’t necessarily need to learn SQL, but they do need to learn to understand how to conceptualize queries and relationships.
Can life scientists get by using Excel? No, it is far too limited, designed for finance people and not complex scientific data mining. Now on the other hand, with tools like dabbledb and Blist you have the ability to use Excel-like front ends, i.e. familiar UIs with relational backends that permit querying. These tools are still young, but you get the picture.
Of course, we could do a much better job of building applications that we can mine through simple query builders. That’s where Cameron’s notions of collaboration come into play. You don’t have to have a formal collaboration. There are enough people out there who will do it for you.
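To make the "spreadsheet in front, relational backend behind" idea a little more concrete, here is a rough sketch of loading a tab-delimited export into SQLite and asking one question of it, again using only the Python standard library. The column names (`gene`, `plate`, `zscore`), the sample values and the notion of a "hit" are assumptions made up for this example; a real screen export will look different.

```python
import csv
import io
import sqlite3

# Stand-in for a tab-delimited export from a spreadsheet (hypothetical values).
EXPORT = (
    "gene\tplate\tzscore\n"
    "ABC1\tP01\t-2.7\n"
    "XYZ2\tP01\t0.3\n"
    "ABC1\tP02\t-3.1\n"
    "QRS3\tP02\t-0.4\n"
)


def load_tsv(conn, handle):
    """Read tab-delimited rows from an open file handle into a 'screen' table."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [(r["gene"], r["plate"], float(r["zscore"])) for r in reader]
    conn.execute("CREATE TABLE screen (gene TEXT, plate TEXT, zscore REAL)")
    conn.executemany("INSERT INTO screen VALUES (?, ?, ?)", rows)
    return len(rows)


def reproducible_hits(conn, threshold, min_plates):
    """Genes scoring at or below threshold on at least min_plates distinct plates."""
    query = """
        SELECT gene, COUNT(DISTINCT plate) AS n_plates, AVG(zscore) AS mean_z
        FROM screen
        WHERE zscore <= ?
        GROUP BY gene
        HAVING COUNT(DISTINCT plate) >= ?
        ORDER BY mean_z
    """
    return conn.execute(query, (threshold, min_plates)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_tsv(conn, io.StringIO(EXPORT))
    for gene, n_plates, mean_z in reproducible_hits(conn, -2.0, 2):
        print(gene, n_plates, round(mean_z, 2))
```

With a real file, `io.StringIO(EXPORT)` would be replaced by something like `open("export.txt", newline="")`. The question stays the same however many thousand rows the sheet has, which is the part that no amount of column narrowing on A4 can match.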
OK this calls for a blog post :)
I have an invitation to view a dabbledb site but sadly life has prevented my exploration thus far. I plan an expedition very soon…
As long as you let us know when you’re expected back, so that we can alert the Coast Guard if you go missing.