repo_uri type is not in the data


Hanno Böck
 

Hi,

I wanted to try some automated analysis on the CII badge project repos,
but it seems that the current data doesn't contain any info on the
repo type. While this can often be guessed (e.g. everything containing
"git" in its name is likely a git repo), that's not always the case.
This could probably be a field that only needs to hold around 5 different
values. (Looking at the data I see git, svn, bzr and cvs. Just a
proposal, but we could simply use the command name as the field
content.)

Also looking at the data there seem to be some projects that have a
wrong URL in there (e.g. having a reference to a github username and
not the repo itself). It should be relatively easy to check that
automatically.
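
A minimal sketch of such a check, in Python. It is purely syntactic and only covers the github.com case mentioned above; the function name is made up for illustration, and a mistyped repository name would still need a network lookup to catch.

```python
from urllib.parse import urlparse


def github_url_problem(repo_url):
    """Return a short description if a github.com URL is not a repo URL.

    Returns None when the URL is not on github.com or when it has at
    least the two path segments (owner and repository) a repo URL needs.
    """
    parsed = urlparse(repo_url)
    if (parsed.hostname or "").lower() != "github.com":
        return None  # not a GitHub URL; this check does not apply
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2:
        return "points at a user or organization, not a repository"
    return None


if __name__ == "__main__":
    for url in ("http://github.com/atom", "https://github.com/vmware/photon"):
        print(url, "->", github_url_problem(url))
```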

--
Hanno Böck
https://hboeck.de/

mail/jabber: hanno@hboeck.de
GPG: BBB51E42


David A. Wheeler
 

Hanno Böck:
> I wanted to try some automated analysis on the CII badge project repos, however it seems that the current data doesn't contain any info on the repo type. While this can often be guessed (e.g. everything containing git in its name is likely a git repo), that's not always the case.
> This can probably be a field that only needs to hold around 5 different values. (Looking at the data I see git, svn, bzr, cvs - just a proposal, but we could simply use the command name as the field content).

That's obviously possible. However, I'm trying to limit the amount of information users have to provide - every question increases their effort. It seems to me that automated analysis should be nearly perfect, since as you note there are relatively few options. If it's on GitHub it's always git; if it ends in ".git" it's git.
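
As a rough illustration, the two rules above could be written like this in Python. This is only a sketch: the function name is hypothetical, it encodes nothing beyond those two rules, and it returns None whenever neither applies.

```python
from urllib.parse import urlparse


def guess_repo_type(repo_url):
    """Guess the VCS command name from the URL alone; None if unsure."""
    parsed = urlparse(repo_url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.rstrip("/").lower()
    # Rule 1: anything hosted on GitHub is git.
    if host == "github.com":
        return "git"
    # Rule 2: a URL ending in ".git" is git.
    if path.endswith(".git"):
        return "git"
    return None


if __name__ == "__main__":
    for url in ("https://github.com/vmware/photon",
                "http://trousers.sourceforge.net"):
        print(url, "->", guess_repo_type(url))
```

URLs that fall through to None would need either more rules or an actual probe of the repository.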

> Also looking at the data there seem to be some projects that have a wrong URL in there (e.g. having a reference to a github username and not the repo itself). It should be relatively easy to check that automatically.

That is such a weird mistake. You'd think people would know what their project or repo URLs are :-).

Please identify the projects so we can fix them.

--- David A. Wheeler


Hanno Böck
 

On Sat, 30 Jul 2016 17:05:21 -0400
"Wheeler, David A" <dwheeler@ida.org> wrote:

> That's obviously possible. However, I'm trying to limit the amount
> of information users have to provide - every question increases their
> effort. It seems to me that automated analysis should be nearly
> perfect, since as you note there are relatively few options. If it's
> on GitHub it's always git, if ends in ".git" it's git.

I was thinking more of an automated way to check the repos themselves;
URL-based guessing may be unreliable. I attached a quick and dirty script.
Alternatively you can have a dropdown box that defaults to git. Given
the monoculture of git almost nobody would have to change it :-)
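
For readers without the attachment, here is a minimal Python sketch of what probing a repo could look like. It is not the attached script, only an assumption about one way to do it: try each version control client against the URL and report the first that succeeds. The probe table, function names and timeout are illustrative; cvs is left out because it has no comparable repo URL.

```python
import os
import subprocess

# Client commands that contact a remote repository without checking it out.
PROBES = {
    "git": ["git", "ls-remote", "--heads"],
    "svn": ["svn", "info", "--non-interactive"],
    "bzr": ["bzr", "info"],
}


def run_quietly(command, timeout=60):
    """Run a command silently; True only if it exits with status 0."""
    # GIT_TERMINAL_PROMPT=0 stops git from asking for credentials.
    env = dict(os.environ, GIT_TERMINAL_PROMPT="0")
    try:
        result = subprocess.run(
            command, env=env, timeout=timeout,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # client not installed, or the server hung
    return result.returncode == 0


def detect_repo_type(repo_url, run=run_quietly):
    """Return 'nourl', a key of PROBES, or 'UNKNOWN'."""
    if not repo_url:
        return "nourl"
    if repo_url.startswith("-"):
        return "UNKNOWN"  # never pass something that looks like an option
    for name, base_command in PROBES.items():
        if run(base_command + [repo_url]):
            return name
    return "UNKNOWN"
```

The `run` parameter exists so the probing logic can be exercised without network access or installed clients.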

> Please identify the projects so we can fix them.

Here is the output from my script, grepped for UNKNOWN (meaning no repo
was identified):

json/112.json http://cvsweb.netbsd.org/bsdweb.cgi/pkgsrc/ UNKNOWN
json/114.json https://github.com/open-infrastructure/container-tools UNKNOWN
json/164.json https://git.opnfv.org/ UNKNOWN
json/197.json https://git.gnupg.org UNKNOWN
json/211.json https://www.tinc-vpn.org/git/ UNKNOWN
json/212.json https://github.com/vmware/phon UNKNOWN
json/232.json http://github.com/atom UNKNOWN
json/234.json https://git.videolan.org/ UNKNOWN
json/246.json https://github.com/openstack UNKNOWN
json/26.json http://trousers.sourceforge.net UNKNOWN
json/34.json https://git.kernel.org UNKNOWN
json/54.json https://git.openssl.org/ UNKNOWN
json/74.json https://gerrit.zephyrproject.org/ UNKNOWN
json/98.json http://kea.isc.org/wiki UNKNOWN

The cvs one is a challenge: it seems cvs lacks the concept of a repo
URL.
container-tools has removed its repo.
vmware/phon seems to be a typo for vmware/photon.
The rest mostly reference git overview pages rather than repo URLs.

This also allows for some easy statistics:
2 bzr
3 nourl
4 svn
14 UNKNOWN
154 git

The git dominance is huge :-)
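
A tally like the one above can be produced with a few lines of Python. This sketch assumes one record per line with the detected type as the last whitespace-separated field, matching the listing format shown earlier; the function name and the `example.org` record are made up for illustration.

```python
from collections import Counter


def tally_repo_types(lines):
    """Count the last whitespace-separated field of each non-empty line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[-1]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "json/232.json http://github.com/atom UNKNOWN",
        "json/34.json https://git.kernel.org UNKNOWN",
        "json/1.json https://example.org/project.git git",  # hypothetical
    ]
    # Print the least common type first, as in the statistics above.
    for label, count in sorted(tally_repo_types(sample).items(),
                               key=lambda item: item[1]):
        print(count, label)
```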

--
Hanno Böck
https://hboeck.de/

mail/jabber: hanno@hboeck.de
GPG: BBB51E42


Mark Rader
 

Humm

After seeing your statistics and looking at the 14 unknowns it looks like a binary option might be better. It could be git or other, with maybe a comment field for other.

<badgeparse>
<badge-repo_url.txt.xz>
_______________________________________________
CII-badges mailing list
CII-badges@lists.coreinfrastructure.org
https://lists.coreinfrastructure.org/mailman/listinfo/cii-badges


David A. Wheeler
 

Hanno Böck:
> The git dominance is huge :-)

Indeed. CVS hasn't had a release since 2008, and has had no commits in years:
http://cvs.savannah.gnu.org/viewvc/cvs/ccvs/src/?sortby=date#dirlist
To me, using CVS is itself an indicator of concern, because of its lack of maintenance.

Most people prefer distributed version control today (me too). But if you want centralized version control, at least subversion (svn) is actively maintained & improved (latest release April 2016):
https://subversion.apache.org/

Thanks for the script! To me, that suggests that for almost all cases it's possible to figure out what version control system is in use. I really want to minimize what we ask users to do.

--- David A. Wheeler