Just came across a great MIT Sloan Research Paper from Victoria Stodden, entitled The Scientific Method in Practice: Reproducibility in the Computational Sciences. The highlights include the results of a survey of computational scientists working in machine learning. Many different reasons for sharing both data and code are explored, but below I’ve just listed the five most common reasons given for sharing and for not sharing code (percentage of respondents in brackets). How do they fit with your own views?
| # | Reason for sharing code | Reason for not sharing code |
|---|---|---|
| 1 | Encouraging scientific advancement (91%) | The time it takes to clean up and document for release (78%) |
| 2 | Encouraging sharing and having others share with you (90%) | Dealing with questions from users about the code (56%) |
| 3 | Being a good community member (87%) | The possibility that your code may be used without citation (45%) |
| 4 | Increase in publicity (85%) | The possibility of patents or other IP constraints (40%) |
| 5 | Improvement in the caliber of research (84%) | Legal barriers, such as copyright (34%) |
So the big barriers to sharing more code are the overhead of getting it ready for release and the overhead of dealing with user queries once it is released. As Victoria goes on to say, “this is interesting because it speaks to an incentive misalignment in the reward structure for scientific structure… [and] … suggests a strong need to account for code and data release directly in the research review process.” The qualitative research we did at our community engagement workshop certainly chimes with this.
These overheads, and the issues around IP, are also likely to apply to other types of, and approaches to, software preservation. Part of our project will be to make recommendations on how to create the right sort of incentives for sharing and preservation where they are appropriate.