Making Research Software Count: One Month Into Our Open Innovation Sprint, Track #2

Research software is the engine driving modern science and discovery. From analyzing complex datasets to simulating intricate phenomena, it's an indispensable part of the research lifecycle. Yet, understanding and quantifying its impact remains a significant challenge for researchers, project contributors, institutions, and funders alike.

How do we effectively measure the reach and influence of these vital tools? How can we incentivize better practices around software citation, metadata, and sustainability?

To tackle these shared ecosystem challenges head-on, NumFOCUS’s Open Source Science Initiative launched the 2025 “Impact of Research Software” Open Innovation Sprint.

What is the Open Innovation Sprint?

Running from March through late 2025, this sprint is a collaborative, fast-paced initiative bringing together researchers, engineers, designers, community organizers, and users. Our goal is to produce actionable, community-driven outputs – open source tools ready for immediate use and contribution.

This inaugural sprint, focused on measuring research software impact, is inspired by efforts such as the CZI 2023 Software Impact Hackathon. It is led by NumFOCUS’s Open Source Science Initiative and made possible through collaboration with the Research Software Alliance (ReSA), Open Source Collective, ecosyste.ms, and the Sprint’s contributors and their organizations.

Our Sprint Tracks

We've structured the sprint into two dedicated tracks, each tackling a different facet of the software impact puzzle:

Track 1: Linking Software Packages to Academic Citations & Mentions

Track 2: Improving Automated Workflows with Extended Software Metadata for Simplified Storytelling

What's Been Happening in Track 2?

Thanks to lively discussions and contributions, Track 2 is abuzz with activity! Here's a glimpse:

  • Metadata Standards Discussions: We're exploring how best to represent software using standards like CodeMeta and Schema.org. This includes experiments to ensure metadata files are valid and recognized by search engines – crucial for FAIR principles. We're discussing how to map between different formats (like CITATION.cff, .zenodo.json, and codemeta.json; see the first sketch after this list) as well as how to reconcile differences between metadata files within a single repository. Finally, we’re discussing which critical fields are absent from existing metadata standards, and how we might integrate them into software development practices.

  • Tooling Landscape Analysis: We're mapping out existing tools for generating, validating, converting, and analyzing software metadata. You can see and contribute to our list here! We're identifying gaps and determining how to extend, combine, or support existing tools.

  • Research Software Metadata Analysis Tooling: First, we leveraged ecosyste.ms to identify 1,354 unique repositories with codemeta.json, 49,080 repositories with CITATION.cff (or .bib, .txt), and 1,156 repositories with .zenodo.json. We will extend this search to more file types in the near future. We then began developing the Research Software Metadata Analyzer (RSMA), a command-line tool to fetch metadata files from GitHub repos, validate them against their schemas, and assess their completeness (see the second sketch after this list). This helps us understand the current state and use of metadata in research software, and identify areas for improvement, collaboration, and consolidation.

  • Automating Metadata Extraction, Validation, and Generation Tooling: Where does good metadata come from, and how can we automate the metadata generation process? We're investigating how to reliably extract information from READMEs (like badges for DOIs or repository status), GitHub APIs (contributors, releases, licenses), DOI metadata (authors, affiliations, funding), and files like pyproject.toml or CITATION.cff (see the third sketch after this list). We have already experimented with direct identification and generation, as well as with fine-tuning LLMs to fill the gaps. We expect to extend an existing solution (perhaps a tool emerging from SciCodes) with these methodologies in the near future, and to soon have code on our repo to which you can contribute!

  • Incentives and Ease of Use Explorations: A core theme of Track #2 is making metadata management easier and more valuable for contributors. How can we build tools that generate multiple metadata files from a single input (see the fourth sketch after this list)? How can we demonstrate the payoff (e.g., better discovery, easier citation, potential recognition) for maintaining rich metadata? How can we generate metadata from the many identifier-bearing and metadata-adjacent files that naturally come with a software project?
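
To make the mapping question concrete, here is a minimal Python sketch of the kind of format crosswalk we’re discussing: it reads a few CITATION.cff fields and emits their CodeMeta equivalents. The field mapping below is a simplified illustration under our own assumptions, not a crosswalk the sprint has finalized.

```python
"""Illustrative sketch: map a handful of CITATION.cff fields onto
CodeMeta terms. The mapping is a simplified assumption for this post."""
import json

import yaml  # pip install pyyaml


def cff_to_codemeta(cff_text: str) -> dict:
    cff = yaml.safe_load(cff_text)
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": cff.get("title"),
        "version": cff.get("version"),
        "license": cff.get("license"),
        "codeRepository": cff.get("repository-code"),
        "author": [
            {
                "@type": "Person",
                "givenName": a.get("given-names"),
                "familyName": a.get("family-names"),
            }
            for a in cff.get("authors", [])
        ],
    }
    # Drop fields the CFF file didn't provide.
    return {k: v for k, v in codemeta.items() if v not in (None, [])}


example_cff = """\
cff-version: 1.2.0
title: Example Analyzer
version: 0.3.1
license: BSD-3-Clause
repository-code: https://github.com/example/analyzer
authors:
  - given-names: Ada
    family-names: Lovelace
"""
print(json.dumps(cff_to_codemeta(example_cff), indent=2))
```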
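
RSMA itself is still taking shape, so the second sketch shows only the general idea behind a completeness check: fetch a repository's codemeta.json and report which recommended fields are present. The field list, the raw-file URL pattern, and the example repository name are illustrative assumptions, not RSMA's actual interface.

```python
"""Minimal sketch of an RSMA-style completeness check. The RECOMMENDED
field list below is our own assumption, not an official standard."""
import json
import urllib.request

RECOMMENDED = ["name", "description", "version", "license", "author",
               "codeRepository", "identifier"]


def fetch_codemeta(owner: str, repo: str, branch: str = "main") -> dict:
    # Pull codemeta.json straight from the repository's default branch.
    url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/codemeta.json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def completeness(codemeta: dict) -> float:
    present = [f for f in RECOMMENDED if codemeta.get(f)]
    missing = [f for f in RECOMMENDED if not codemeta.get(f)]
    print("present:", present)
    print("missing:", missing)
    return len(present) / len(RECOMMENDED)


# Hypothetical repository name, for illustration only:
# score = completeness(fetch_codemeta("example-org", "example-repo"))
```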
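
As a taste of the extraction work, this third sketch pulls a Zenodo DOI out of a README badge and basic project facts out of pyproject.toml. The regex and the fields harvested are assumptions chosen for illustration.

```python
"""Sketch of the extraction idea: harvest metadata hints from files a
project already has. Patterns and paths here are illustrative."""
import re
import tomllib  # Python 3.11+; use the tomli package on older versions

# A Zenodo DOI badge in a README is a strong signal the project is archived.
DOI_BADGE = re.compile(r"zenodo\.org/badge/DOI/(10\.\d{4,9}/zenodo\.\d+)")


def doi_from_readme(readme_text: str) -> str | None:
    match = DOI_BADGE.search(readme_text)
    return match.group(1) if match else None


def basics_from_pyproject(path: str = "pyproject.toml") -> dict:
    with open(path, "rb") as fh:
        project = tomllib.load(fh).get("project", {})
    return {
        "name": project.get("name"),
        "description": project.get("description"),
        "version": project.get("version"),
        "authors": project.get("authors", []),
    }


readme = ("[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1234567.svg)]"
          "(https://doi.org/10.5281/zenodo.1234567)")
print(doi_from_readme(readme))  # -> 10.5281/zenodo.1234567
```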
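
Finally, to illustrate the "single input, many outputs" idea, this fourth sketch keeps one project record and writes both a CITATION.cff and a codemeta.json from it. The record layout and its values are hypothetical.

```python
"""Sketch of 'one record in, several metadata files out'. The record
layout is an assumption for illustration, not a proposed standard."""
import json

import yaml  # pip install pyyaml

record = {
    "name": "Example Analyzer",
    "version": "0.3.1",
    "license": "BSD-3-Clause",
    "repository": "https://github.com/example/analyzer",
    "authors": [{"given": "Ada", "family": "Lovelace"}],
}

citation_cff = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it.",
    "title": record["name"],
    "version": record["version"],
    "license": record["license"],
    "repository-code": record["repository"],
    "authors": [{"given-names": a["given"], "family-names": a["family"]}
                for a in record["authors"]],
}

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": record["name"],
    "version": record["version"],
    "license": record["license"],
    "codeRepository": record["repository"],
    "author": [{"@type": "Person", "givenName": a["given"],
                "familyName": a["family"]} for a in record["authors"]],
}

# Emit both files from the single source of truth.
with open("CITATION.cff", "w") as fh:
    yaml.safe_dump(citation_cff, fh, sort_keys=False)
with open("codemeta.json", "w") as fh:
    json.dump(codemeta, fh, indent=2)
print("wrote CITATION.cff and codemeta.json")
```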

If any of this is of interest to you, join us! This is an open innovation sprint. Register for the calls here, leave questions, comments, or ideas on the repo here, or contribute directly to the ongoing development of the tools we’ve started building!

Hope to see you soon… for science!
