[Request for discussion] Alpha code.json Project Inventory Schema,about gsa/code-gov-web

Comments (21)

stvnrlly commented on May 23, 2024

Cool! Very nice to see this happening, and it looks great. Overall, I agree with Emanuel's comments (especially when he says that he thinks that I'm right). Here are some additional thoughts:

Government-specific Elements

I'd like to encourage you to think about how this could be useful outside of the government, too. With the possible exception of the exemptions, I don't think that there's anything necessarily government-specific about this schema, so choosing a term besides agency could allow organizations like Code for DC to move to this standard and get one step closer to a shared standard.

Agency Info

Right now it's just a name, but in my opinion a URL is even better, as it provides some disambiguation and context. As such, making that an object with name and url could be helpful. There are probably some fun OMB codes that could go in there, too, but that doesn't seem useful right now (and might be a job for a separate API at a later date).

Multiple Projects

Making that an array is a great idea.

Binary States

I'd recommend using true and false instead of 0 and 1, as it may make more sense to a non-techie.

Exemptions

Since this is related to Government-Wide Reuse, why not combine them into a single object? Additionally, if multiple exemptions are possible, an array may fit better, and it may help to link directly to a URL for the exemption instead of a number proxy. If a URL isn't possible, naming the exemptions and including that along with the number (e.g. 1 - Law or Regulation) could help future-proof it.

License

The URL requirement in civic.json was a nice forcing function that made OSS projects think about licensing, but that doesn't work as well for the government. In my opinion, this field exists mostly to answer the question of "Can I use this?" If I saw that a project didn't have a license, I wouldn't necessarily know if that's because it was public domain or because it was under copyright and unlicensed.

So, at the risk of making this much too complicated, I'd propose something like this:

"copyright": {
    "licensed": true,
    "license": [SPDX identifier],
    "copyrighted": true
}

Public domain status could then be indicated with "copyrighted": false, while still allowing for the project to also be marked as CC0.
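As a rough sketch of how a consumer might answer "Can I use this?" from the proposed object: the function name `reuse_status` and the wording of the returned strings are assumptions made for illustration, not part of any schema.

```python
def reuse_status(copyright_info):
    """Summarize reuse status from the proposed "copyright" object."""
    copyrighted = copyright_info.get("copyrighted")
    licensed = copyright_info.get("licensed")
    license_id = copyright_info.get("license")

    if copyrighted is False:
        # Public domain; a dedication such as CC0 may still be attached.
        if license_id:
            return f"public domain (also marked {license_id})"
        return "public domain"
    if copyrighted and licensed and license_id:
        return f"copyrighted, licensed under {license_id}"
    if copyrighted:
        # Under copyright with no usable license information.
        return "copyrighted, no license given"
    return "unknown"


examples = [
    {"licensed": True, "license": "MIT", "copyrighted": True},
    {"licensed": True, "license": "CC0-1.0", "copyrighted": False},
    {"licensed": False, "license": None, "copyrighted": True},
]
for entry in examples:
    print(reuse_status(entry))
```

The point of the separate `copyrighted` flag shows up in the last two cases: a missing license no longer has to be guessed as either public domain or unlicensed.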

Required Fields

There should be at least one required field that points to a location to learn more about the project, be it a homepage or a repository.

Fields that Don't Need a Human

There are a few things—like description, languages, and last commit—that could also be pulled from the GitHub API. In those situations, I think it's better to leave it out of the schema and let people pull that information directly from the source to reduce the number of places where that information needs to be maintained.

Here, it seems like there's a non-zero probability of non-GitHub projects being tracked, so there may be a good argument for keeping them in.

Ability to Complete

Just as a side note, I don't see anything in here that a project member isn't likely to know, which is great. That seems like an obvious thing, but I've definitely dealt with standards that stump me, and then it doesn't get filled out well.

from code-gov-web.

emanuelfeld commented on May 23, 2024

Initial thoughts:

License

I'm reminded of this recent discussion relating to whether 18F's non-standard license was actually required.

Could additionally allowing an SPDX license identifier (which is what npm's package.json does) prod agencies to use standard licenses? This is something that could be baked into a code.json form (and probably auto-filled once given a license URL).
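The auto-fill idea could look something like the sketch below. The lookup table is a tiny illustrative sample and `spdx_from_url` is a hypothetical helper; a real form would need a fuller, maintained list of license URLs.

```python
from urllib.parse import urlparse

# Illustrative sample only: normalized "host/path" -> SPDX identifier.
KNOWN_LICENSE_URLS = {
    "opensource.org/licenses/mit": "MIT",
    "www.apache.org/licenses/license-2.0": "Apache-2.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
}


def spdx_from_url(license_url):
    """Return the SPDX identifier for a known license URL, else None."""
    parsed = urlparse(license_url)
    # Ignore scheme, letter case, and any trailing slash when matching.
    key = (parsed.netloc + parsed.path).lower().rstrip("/")
    return KNOWN_LICENSE_URLS.get(key)


print(spdx_from_url("https://opensource.org/licenses/MIT"))
print(spdx_from_url("https://example.gov/our-custom-license"))
```

A URL that does not match any standard license returns `None`, which is exactly the case where a form could nudge the agency toward a standard license.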

Agency

Looking at the top-level agency field, I'm concerned that there may be duplication/conflicts in cases where more than one agency is involved in a code project. In DC civic.json the partners field includes all parties involved, including the principal one.

Updated

@stvnrlly is (I think rightly) biased against fields that will be frequently updated. I believe all of these fields should be automatically generated. These are easy to neglect/mess up.

Contact

You may want to allow for additional contact URLs, outside of Twitter. DC Civic.json's contact object has a freeform URL attribute.

Tags

Defining taxonomies is a pain. I don't believe anyone can predict in advance the specific tags that would prove useful. On the other hand, there may be room for guidance on what makes a good tag. You often see people including every conjugation/singular/plural/geographic format/etc.

Unique IDs

If implemented with a view toward an API, I would like a way to discover new projects and monitor changes in existing ones (e.g. new partners, updated license, new repository URL). Could any of the required code.json fields serve as a unique ID?

ctubbsii commented on May 23, 2024

I still have no idea what "built for government-wide reuse" means. All open source projects would be "world-wide reuse", so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that, provided whatever project-specific prerequisites are met. Further, "for" implies intent. I'm not sure why intent matters. The whole point of open sourcing and inner sourcing is reuse. Aren't all software projects potentially reusable, regardless of initial intent? It seems to me that the whole point of this effort is to make it easier to share and reuse government-produced software, regardless of intent. As written, the field is not clear at all and is prone to confusion.

If that field is kept, it really needs better documentation. That documentation should specifically address the circumstances under which that field has a particular value, and should explain why that value is necessary because it could not be deduced by other attributes (like, open source status/license/exemption/etc.).

jcastle-zz commented on May 23, 2024

@theresaanna schema looks good for a start. Think the optional element of URL should be mandatory. What's the point in identifying repos by name and not by location?

We need to finalize the schema soon because agencies will have to collect the metadata. They first have to consider where the code libraries are stored.

Does anyone know of a GitHub API that collects all org repo metadata and outputs it in a JSON format (or similar)? That would help jumpstart the metadata collection process.
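For what it's worth, the GitHub REST API's "list organization repositories" endpoint (`GET /orgs/{org}/repos`) returns this kind of metadata as JSON. Below is a minimal sketch of turning one repository object into an inventory entry. The output key names are assumptions for illustration, not the actual code.json fields, and `fetch_org_repos` only fetches a single page without authentication.

```python
import json
import urllib.request


def fetch_org_repos(org):
    """Fetch one page (up to 100) of public repos for an org from the GitHub REST API."""
    url = f"https://api.github.com/orgs/{org}/repos?per_page=100"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_inventory_entry(repo):
    """Map a GitHub repository object to a hypothetical inventory entry."""
    # "license" is null in the API response when GitHub detects no license.
    license_info = repo.get("license") or {}
    return {
        "name": repo["name"],
        "description": repo.get("description"),
        "repository": repo["html_url"],
        "language": repo.get("language"),
        "license": license_info.get("spdx_id"),
    }


# Hand-written sample shaped like an API response, so no network call is needed.
sample_repo = {
    "name": "llnl.github.io",
    "description": None,
    "html_url": "https://github.com/LLNL/llnl.github.io",
    "language": "JavaScript",
    "license": {"spdx_id": "MIT"},
}
print(json.dumps(to_inventory_entry(sample_repo), indent=2))
```

Calling `[to_inventory_entry(r) for r in fetch_org_repos("LLNL")]` would produce a first draft that a person could then correct, rather than a finished inventory. The sample values above are placeholders, not a statement about that repository's actual license.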

IanLee1521 commented on May 23, 2024

@jcastle -- I have some JavaScript code that does this to visualize our (@LLNL) orgs on our http://software.llnl.gov page. You can find that here: https://github.com/LLNL/llnl.github.io/blob/master/js/github-dynamic.js

I also have some Python scripts I'll get pushed up today.

IanLee1521 commented on May 23, 2024

@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put Lawrence Livermore National Laboratory and / or DOE ?

IanLee1521 commented on May 23, 2024

A sample I just did for a single project can be found here: https://github.com/LLNL/llnl.github.io/blob/master/_data/code.json

I'll work on a script to get it more fleshed out shortly.

david-a-wheeler commented on May 23, 2024

I would suggest adding "release date", that is, the date it was initially released to the public. This is interesting information for many reasons, and isn't always obvious from the version control information.

This is one of the fields captured here: http://www.dwheeler.com/government-oss-released/

theresaanna commented on May 23, 2024

@IanLee1521 Thanks so much for digging into this and trying it out!

@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put Lawrence Livermore National Laboratory and / or DOE ?

That's an excellent question. My thinking is that we may want to add another field that would accommodate LLNL. I'm not sure what to call it, though. Do you think this is a good solution, and do you have an idea of what would be a suitable key?

IanLee1521 commented on May 23, 2024

@theresaanna -- Perhaps something like organization? I imagine that other agencies will have the same issue. Certainly DOE with the national labs. But also I would expect DOD would want to get subdivided to Army, Navy, Marines, Air Force, etc. Another example would be GSA -> 18F.

Another option would be to have that all included in a single field, something like:

"agency": "DOE // LLNL"
"agency": "GSA // 18F"

etc.

okamanda commented on May 23, 2024

Hello folks,

Thanks for keeping up the lively discussion on the alpha version of the schema. The specification for version 1.0 of the metadata schema is now available here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/schema/specification_v1_0.md. Sample JSON files to be included soon.

mikecharles commented on May 23, 2024

With an organization you could even drill down to a specific level. For example, my organization would be:

DOC/NOAA/NWS/NCEP/CPC

If the org is parsed into levels, one could query a specific level to see how much code is being produced at that level:

noaa_code = find(organization[1] == 'NOAA')
doc_code = find(organization[0] == 'DOC')

Something like that...
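A runnable version of that idea might look like the following sketch. The `organization` key, the project names, and the `find_by_level` helper are all hypothetical, and the slash-separated path format is just the one used in the example above.

```python
def parse_organization(path):
    """Split a path like 'DOC/NOAA/NWS' into its levels."""
    return [level.strip() for level in path.split("/") if level.strip()]


def find_by_level(projects, level, name):
    """Return projects whose organization path has `name` at position `level`."""
    matches = []
    for project in projects:
        levels = parse_organization(project["organization"])
        # Shorter paths simply have no entry at deeper levels.
        if len(levels) > level and levels[level] == name:
            matches.append(project)
    return matches


projects = [
    {"name": "project-a", "organization": "DOC/NOAA/NWS/NCEP/CPC"},
    {"name": "project-b", "organization": "DOC/NIST"},
    {"name": "project-c", "organization": "DOE/LLNL"},
]

noaa_code = find_by_level(projects, 1, "NOAA")
doc_code = find_by_level(projects, 0, "DOC")
print([p["name"] for p in noaa_code])
print([p["name"] for p in doc_code])
```

Storing the levels as a JSON array instead of a delimited string would avoid the parsing step entirely, at the cost of a slightly more verbose file.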

bondsbw commented on May 23, 2024

I agree that true and false are much better than 1 and 0.

thecapacity commented on May 23, 2024

I particularly like @stvnrlly 's comments and wanted to record a few (some overlapping with the other commenters too) to weigh in.

Apologies for the quick list;

- I feel like the "vcs" key/value should be dropped. GitHub lets one pull over SSH, Git, or SVN, so it feels confusing (even in the example, I think someone might be confused about whether it should be "github" or "git", especially with other companies like Microsoft incorporating git into their tools).
  - This could also potentially be inferred from the URL.
- I feel like the "language" key/value should just be part of the tags, e.g. "python" might just be a tag rather than a "sub field tag", which would allow easier filtering, e.g. search: python+connecting (which is of course possible with multiple fields, but harder to implement).
- The "partners" field feels like it's destined to be underutilized, such as when agencies don't know or care who's using their code and maybe won't maintain this. Ideally the VCS will also track "forking", so this could be queried in real time rather than from a static snapshot.
  - The Gov-wide reuse field also feels similarly "undefined", e.g. an agency might not know if it's reused.
- The "openSourceProject" field as a binary flag seems confusing (to me): if the software is released as part of the Open Source release, then isn't this always True?
- I would change "updated" to "schema_updated" to make it clear what's being updated (per the earlier discussions).
- Lastly, and I know it was discussed before, I don't feel like "License" should be permitted to be null, and in fact I think it might be better to require this to be a file within the repository.
  - e.g. I think the License field could just be a "tag" vs. a specific field.

MikePulsiferDOL commented on May 23, 2024

I think it's important to think about scalability. How much of this can be automated as @stvnrlly suggested for a few fields? Maintaining the data.json file for DOL has been a nightmare of manual labor, especially when there are schema updates. Even CKAN can be hours and hours of clicking the days away.

IanLee1521 commented on May 23, 2024

@MikePulsiferDOL -- I'm in the process of getting some code released that would help with generating this JSON. It's making its way through our release process, which is taking its time... I'll update once I can push it to @LLNL.

IanLee1521 commented on May 23, 2024

@ctubbsii / @thecapacity -- I think that the two fields "openSourceProject" and "governmentWideReuseProject" are meant to encode three possible states of being for Code that is to be listed on Code.gov:

Open Source
Government-wide Reuse
Closed Source / Exempted / etc.

To @ctubbsii 's comment:

... so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that ...

While I agree with the sentiment that all software should allow that, the fact is that until the Federal Source Code Policy there hasn't been any hard requirement for it to be available broadly across government, and the default has been "Closed Source". The policy establishes the requirement for government-wide reuse.

mchogan commented on May 23, 2024

Consider adding a schema version identifier so that future parsers will know what fields to expect.

"openSourceProject": 1

Consider eliminating attributes that can be calculated. It should be possible to calculate openSourceProject using the license URL.

"governmentWideReuseProject": 0

Consider minimizing the number of attributes in the schema. Instead of trying to get the right answer on the first try, include an attribute called something like "optional": that accepts an array of JSON objects so that experimental attributes can be tested before graduating to required schema fields. With a schema version attribute a parser would know which fields are expected and that introspection is required for the optional fields.

"status": "Alpha",

Instead of status, consider asking for the release number. Usually a release or version number indicates alpha, beta, 1.0.0, etc. It might be worth recommending a standard like semver, used by Angular2 and other projects.
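To make the schema version and "optional" suggestions concrete, here is a rough sketch of what a version-aware parser could do. The version numbers and the per-version field lists are invented for illustration; nothing in this thread defines them.

```python
# Hypothetical required fields per schema version (not defined anywhere in this thread).
REQUIRED_FIELDS = {
    "1.0": {"name", "repository"},
    "1.1": {"name", "repository", "license"},
}


def check_project(project, schema_version):
    """Return (missing required fields, experimental field names) for one project."""
    if schema_version not in REQUIRED_FIELDS:
        raise ValueError(f"unknown schema version: {schema_version}")
    missing = sorted(REQUIRED_FIELDS[schema_version] - set(project))
    # Experimental attributes live in the "optional" array and need introspection.
    experimental = sorted(
        key for item in project.get("optional", []) for key in item
    )
    return missing, experimental


project = {
    "name": "example-project",
    "repository": "https://example.gov/example-project",
    "optional": [{"releaseDate": "2016-01-01"}],
}
print(check_project(project, "1.0"))
print(check_project(project, "1.1"))
```

An attribute that proves useful in "optional" could later graduate into the required set of the next schema version without breaking parsers that check the version first.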

mchogan commented on May 23, 2024

It might be worth extending an existing package manager schema rather than creating a new one. For example...

JJediny commented on May 23, 2024

Comments on Current/Proposed as of 10/13

openSourceProject

Seems redundant and confusing: if the project uses an accepted open source license then it is true/1; if it doesn't, then false/0. Suggest removing.
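A sketch of deriving the flag instead of storing it, assuming the license is recorded as an SPDX identifier. The set below is a small illustrative subset of OSI-approved licenses, and `is_open_source` is a hypothetical helper.

```python
# Illustrative subset of SPDX identifiers for OSI-approved licenses.
OPEN_SOURCE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only"}


def is_open_source(project):
    """Derive the open source flag from the project's license identifier."""
    return project.get("license") in OPEN_SOURCE_LICENSES


print(is_open_source({"name": "a", "license": "Apache-2.0"}))
print(is_open_source({"name": "b", "license": None}))
```

Whether public domain works should also count as open source is a policy question this sketch does not settle.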

agency

Using an agency acronym is dangerous, as some agencies can't even agree internally on their own (e.g. USFWS or FWS, USACE or ArmyCorps, etc.). While program/bureau codes are 'safer' for data quality, they are not intuitive. We have previously recommended that the Government can and should create a reference mapping of agency domain names (e.g. @gsa.gov, @usfws.doi.gov, etc.) to their bureau/program. Using an agency's domain is far more stable and less likely to create a data cleaning nightmare, and frankly speaking, the people doing the data entry likely already know their email address.
https://project-open-data.cio.gov/v1.1/schema/#bureauCode

license/language

Both of these attributes should implement/reference a controlled vocabulary to ensure consistency.

Comments on what's missing as of 10/13

Globally Unique Identifiers (GUID/UUID)

These are critical to establishing provenance to the canonical source of data. The whole point of them is that they can be generated in a distributed way yet remain statistically unique: the chance of anyone generating a duplicate GUID/UUID is realistically nil. Not using one (1) makes any parent/child relationships impossible, and (2) since no other unique identifiers are used, makes it impossible to tell "is this project really that project" once titles change, and so impossible to avoid or test for redundant/duplicative entries. See #56
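For illustration, Python's standard `uuid` module covers both flavors: a random UUID generated once and stored with the project, or a name-based UUID derived from a stable string. The helper names and the choice of the repository URL as the name are assumptions for this sketch.

```python
import uuid


def random_project_id():
    """Random (version 4) UUID: generate once and store it with the project."""
    return str(uuid.uuid4())


def stable_project_id(repository_url):
    """Name-based (version 5) UUID: the same URL always yields the same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, repository_url))


repo_url = "https://github.com/LLNL/llnl.github.io"
print(random_project_id())
print(stable_project_id(repo_url))
print(stable_project_id(repo_url) == stable_project_id(repo_url))
```

The name-based variant needs no coordination between agencies, but the ID changes if the repository moves. A random ID stored in the file survives renames and moves, which fits the provenance use case better.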

isPartOf (Parent/Child relationships)

As we have discovered in implementing data.json, the concept of a collection (i.e. the ability for one component of a project to reference its parent project) is critically important.
https://project-open-data.cio.gov/v1.1/schema/#isPartOf

contact.role

The contact field should allow/encourage multiple entries, but currently there is no concept of a contact's role (e.g. project manager, development lead, etc.). This has been a concern in Project Open Data, given personnel turnover and/or the want/need to direct people to a generic inbox for a program/team to complement the specific employee/POC for the project.

General Comments
JSON is great for having one file that contains a series of entries (more than one open source project). But it is less human-readable than YAML. Given that multiple YAML documents can be compiled into one JSON document, IMHO it is more practical for those responsible for data entry to use YAML, either as a code.yml file in the root dir of the repo and/or as an enhanced README: a single README.md file with YAML front matter to better manage structured data (this is the exact file format GitHub Pages uses to structure content for static websites in lieu of a CMS/database). It is then easy enough to generate a JSON file from all of those files as a crawl/collect/transform within a GitHub organization, for instance.
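A minimal sketch of the collect/transform step, using only the standard library. The parser below is deliberately not a real YAML parser: it assumes flat `key: value` front matter with plain string values, and the top-level `projects` key is an assumption. A real implementation would more likely use a YAML library such as PyYAML.

```python
import json


def parse_front_matter(text):
    """Parse flat `key: value` lines between leading '---' markers.

    Simplified on purpose: no nesting, lists, or quoting rules.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if ":" in line:
            # Split on the first colon only, so URLs in values stay intact.
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def compile_inventory(documents):
    """Combine the front matter of several READMEs into one JSON document."""
    projects = [parse_front_matter(doc) for doc in documents]
    return json.dumps({"projects": projects}, indent=2)


readme = """---
name: example-project
repository: https://example.gov/example-project
license: MIT
---
# Example Project
"""
print(compile_inventory([readme]))
```

In a crawl, `documents` would be the README.md contents gathered from each repository in an organization.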

mattbailey0 commented on May 23, 2024

Thanks everyone. We've had a bit of a proliferation of issues related to the schema. Let's move this conversation to #41
