We run this as an open source software project - if you have an idea for an improvement, please propose it via pull request, issue tracker, or mailing list.
A serious challenge for this project (and others like it) is the lack of 'ground truth'. If we knew ahead of time what the right answers were, we'd just use them :-). If we knew the right answers for a large data set, we could use it as a training set for statistical analysis and/or a learning algorithm.
I see. That makes sense.
One thing I'm trying to get a sense of (and I still need to read the paper very thoroughly to find out) is what exactly the "risk" you are measuring is a risk of. That would make it easier to identify ground truth, or proxies for it, in existing data.
For example, 'being vulnerable to SQL injection' is a very different kind of risk from 'having a low bus factor'.
Identifying when projects have died because of bus factor issues might be possible from observational data of open source communities.
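As a rough illustration of what such an observational heuristic might look like (this is my own sketch, not anything from the paper; the thresholds and function names are made up), one could flag projects where commit authorship is concentrated in one person and activity then stopped:

```python
from collections import Counter
from datetime import date

def bus_factor(commit_authors):
    """Smallest number of authors who together account for
    more than half of all commits (lower = riskier)."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered * 2 > total:
            return rank
    return len(counts)

def possible_bus_factor_death(commit_authors, last_commit, today,
                              max_bus_factor=1, stale_days=365):
    """Heuristic flag: one dominant author, then no commits for
    over a year. Thresholds are illustrative, not validated."""
    return (bus_factor(commit_authors) <= max_bus_factor
            and (today - last_commit).days > stale_days)

# A project where one author wrote 3 of 4 commits and nothing
# has landed for two years would be flagged.
flagged = possible_bus_factor_death(
    ["alice", "alice", "alice", "bob"],
    last_commit=date(2020, 1, 1), today=date(2022, 1, 1))
```

Such a flag would only be a starting point; one would still need to manually check whether the project died *because* of the bus factor rather than, say, being finished or superseded.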
Since we lack ground truth, we did what was documented in the paper. Here's a quick summary. We surveyed past efforts, selected a plausible set of metrics based on that, and heuristically developed a way to combine the metrics. We then had experts (hi!) look at the results (and WHY they were the results), look for anomalies, and adjust the algorithm until the results appeared reasonable.
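For readers unfamiliar with this kind of heuristic combination, a minimal sketch might look like the following. The metric names and weights here are purely illustrative assumptions, not the project's actual metrics: each metric is normalized to [0, 1] (1 = riskier) and combined via a weighted average, with the weights being exactly the knobs that expert review would adjust.

```python
def risk_score(metrics, weights):
    """Combine normalized metrics (0 = low risk, 1 = high risk)
    into one score via a weighted average. Weights are the tunable
    part that expert review iterates on."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight

# Hypothetical metric values for one project (names are invented):
metrics = {"contributor_concentration": 0.9,
           "recent_activity_gap": 0.4,
           "unpatched_vulns": 0.1}
weights = {"contributor_concentration": 2.0,
           "recent_activity_gap": 1.0,
           "unpatched_vulns": 3.0}

score = risk_score(metrics, weights)  # (1.8 + 0.4 + 0.3) / 6.0
```

The expert-review loop described above then amounts to inspecting projects whose scores look anomalous, asking *which* terms drove the score, and adjusting the weights (or the metric set) until results appear reasonable.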
This is great.
Is there a record of the anomalies and the adjustments?
Is there any sort of formal procedure for further expert review?
I would be interested in designing such a procedure if there isn't one.
We also published everything as OSS, so others could propose improvements. We presume that humans will review the final results, and that helps too.
Thanks! and understood :)