Enter the data scientist

I was interested to see a report issued by the JISC last week entitled Skills, Role & Career Structure of Data Scientists & Curators: Assessment of Current Practice & Future Needs. Written by Alma Swan and Sheridan Brown from Key Perspectives, well known for their work on Open Access and Publishing, it makes recommendations on “the role and career development of data scientists and the associated supply of specialist data curation skills to the research community”.
What are data scientists? Are they scientists who look after data? Or IT people who look after data? Or other people who look after data? Do all data scientists do the same kind of “looking after”? Do we know exactly what they do? Do we care?
I do care, though I certainly don’t know the answers to these questions. The report goes some way to providing answers, defining four varieties of data person: Data Creator, Data Scientist, Data Manager, Data Librarian. I guess these represent points along a continuum, with the focus gradually changing from “making the data do something” at one end to “ensuring the data survives” at the other end.
I was intrigued by the suggestion that research funders should require at least one member of the project team to be nominated as the project’s data scientist. I wonder how popular that will be? However, my main interest is of course in the contribution that libraries and librarians may or may not have to play in this field.
I can see similarities between some work that librarians do and this new field of data curation. I’ve been wondering about data for some years. As a librarian I tried to engage with biological databanks just as sources of information, making users aware of what was available and helping them to make use of the resources. That worked for a bit, until the resources became so numerous and complex that I could not provide any meaningful assistance. Then a few years later the question “who should look after data?” came up. Well, libraries “look after” stuff, so should we also look after data? Three years ago I went along to the first Digital Curation Conference to ask the question but didn’t really get an answer and I am still not clear. I certainly wouldn’t feel ready to put on a data librarian hat any time soon. It seems I am not alone – there are apparently only five data librarians in the whole of the UK.
The report is quite encouraging, saying:

The role of the library in data-intensive research is important and a strategic repositioning of the library with respect to research support is now appropriate. We see three main potential roles for the library: increasing data-awareness amongst researchers; providing archiving and preservation services for data within the institution through institutional repositories; and developing a new professional strand of practice in the form of data librarianship.

That seems reasonable to me, bearing in mind that the data librarians would be working within a framework of data scientists and data managers. I’d be interested to hear what the data scientists and data managers think. Do you need a new breed of data librarian?
As is the way with such reports, it also sets out a series of further studies that need to be completed:

  • A description of the role played by data scientists and the value of the contribution they make to research
  • Examples of data science careers
  • The development of a set of practices that represent good practice in data science

For those interested, there is also a workshop next month on Roles and Responsibilities for Effective Data Management, organised by the Research Data Management Forum. It’s not related directly to this report.

About Frank Norman

I am a librarian in a biomedical research institute. I've been around a few years, long enough to know that exciting new things fall into the same familiar patterns. I'm interested in navigating a path for libraries as we move further from print to electronic resources to open research, and become more embedded in research workflows.

5 Responses to Enter the data scientist

  1. Sara Fletcher says:

    It’s a really interesting subject. As Diamond ramps up, the amount of data produced annually by the facility is expected to be in the PetaByte realm. We’re also expecting to generate over 2,000 scientific papers. Given that both factors will be key in determining success, how data is curated is vitally important. So far the emphasis has been on technological solutions, but I think the value of such a “data scientist” is seriously underestimated…

  2. Richard P. Grant says:

    “As Diamond ramps up, the amount of data produced annually by the facility is expected to be in the PetaByte realm. We’re also expecting to generate over 2,000 scientific papers.”
    So come on Stephen, they’re just waiting for your crystals!

  3. Bob O'Hara says:

    Intriguing. I hadn’t thought about bioinformatics as a subset of library work, but I guess you’re right. I totally agree that “data scientists” are necessary – I work as a statistician with biologists, and I don’t want to become a database manager, and they don’t either. Which means we all miss out on the possibilities that are available.
    I’m curious to know what you think is involved in “looking after data” – I’m guessing it’ll be different to what the rest of us think.

  4. Frank Norman says:

    Bob – I would never say that bioinformatics is a subset of library work! No more than are printing, or grammar. But bioinformaticians produce information resources that are more or less easy to use, and new users may need assistance with using them. This is where it moves into library work. See the course for librarians that NCBI provide.
    I am not sure what is involved in “looking after data”. The report suggests that the Data Librarian is mostly concerned with curation, preservation and archiving. That certainly includes a big chunk of metadata work, and also some routine work on the preservation side. I like the report’s highlighting of the need for awareness raising – this is something that fits with our current roles I think.
    One problem in many of these discussions is the use of the term “data” to cover various different kinds of data – highly curated collections, large scale dataset gathering, lab-based data generation. These all have different patterns of generation and use and different lifetimes.

  5. Frank Norman says:

    Sara – I come over all faint at the thought of a Petabyte! Maybe I’m not cut out to be a data librarian after all!