Comments (21)
Cool! Very nice to see this happening, and it looks great. Overall, I agree with Emanuel's comments (especially when he says that he thinks that I'm right). Here are some additional thoughts:
Government-specific Elements
I'd like to encourage you to think about how this could be useful outside of the government, too. With the possible exception of the exemptions, I don't think that there's anything necessarily government-specific about this schema, so choosing a term besides "agency" could allow organizations like Code for DC to move to this standard and get one step closer to a shared standard.
Agency Info
Right now it's just a name, but in my opinion a URL is even better, as it provides some disambiguation and context. As such, making that an object with "name" and "url" could be helpful. There are probably some fun OMB codes that could go in there, too, but that doesn't seem useful right now (and might be a job for a separate API at a later date).
Multiple Projects
Making that an array is a great idea.
Binary States
I'd recommend using true and false instead of 0 and 1, as it may make more sense to a non-techie.
Tags
I'm strongly in favor of defining the tags or dropping it altogether, as people are often too creative for their own good. However, it may make sense to leave it freeform for an initial period and then reevaluate (1) if it's useful and (2) if certain tag themes are emerging.
Exemptions
Since this is related to Government-Wide Reuse, why not combine them into a single object? Additionally, if multiple exemptions are possible, an array may fit better, and it may help to link directly to a URL for the exemption instead of a number proxy. If a URL isn't possible, naming the exemptions and including that along with the number (e.g. "1 - Law or Regulation") could help future-proof it.
License
The URL requirement in civic.json was a nice forcing function that made OSS projects think about licensing, but that doesn't work as well for the government. In my opinion, this field exists mostly to answer the question "Can I use this?" If I saw that a project didn't have a license, I wouldn't necessarily know if that's because it was public domain or because it was under copyright and unlicensed.
So, at the risk of making this much too complicated, I'd propose something like this:
    "copyright": {
      "licensed": true,
      "license": [SPDX identifier],
      "copyrighted": true
    }
Public domain status could then be indicated with "copyrighted": false, while still allowing for the project to also be marked as CC0.
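As a rough illustration of how a consumer might answer "Can I use this?" from such an object, here is a minimal Python sketch. The function name `reuse_status` and the wording of the returned strings are assumptions made for this example; only the three keys come from the proposal above.

```python
def reuse_status(copyright_info):
    """Summarise reuse rights from the proposed "copyright" object."""
    copyrighted = copyright_info.get("copyrighted")
    licensed = copyright_info.get("licensed")
    license_id = copyright_info.get("license")

    if copyrighted is False:
        # Public domain; a CC0 marking may still be present alongside it.
        if license_id:
            return f"public domain (also marked {license_id})"
        return "public domain"
    if copyrighted and licensed and license_id:
        return f"licensed under {license_id}"
    if copyrighted and not licensed:
        # The ambiguous case the comment worries about, now made explicit.
        return "copyrighted and unlicensed: ask before reusing"
    return "unknown: incomplete copyright information"


print(reuse_status({"copyrighted": True, "licensed": True, "license": "MIT"}))
print(reuse_status({"copyrighted": False}))
```

The point of the separate "copyrighted" flag is visible in the branches: a missing license no longer has to be guessed as either public domain or unlicensed.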
Required Fields
There should be at least one required field that points to a location to learn more about the project, be it a homepage or a repository.
Fields that Don't Need a Human
There are a few things (like description, languages, and last commit) that could also be pulled from the GitHub API. In those situations, I think it's better to leave it out of the schema and let people pull that information directly from the source to reduce the number of places where that information needs to be maintained.
Here, it seems like there's a non-zero probability of non-GitHub projects being tracked, so there may be a good argument for keeping them in.
Ability to Complete
Just as a side note, I don't see anything in here that a project member isn't likely to know, which is great. That seems like an obvious thing, but I've definitely dealt with standards that stump me, and then it doesn't get filled out well.
from code-gov-web.
Initial thoughts:
License
I'm reminded of this recent discussion relating to whether 18F's non-standard license was actually required.
Could additionally allowing an SPDX license identifier (which is what npm's package.json does) prod agencies to use standard licenses? This is something that could be baked into a code.json form (and probably auto-filled once given a license URL).
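The auto-fill idea could look roughly like the Python sketch below. The lookup table `KNOWN_LICENSE_URLS` and the function `spdx_from_url` are hypothetical and cover only three licenses; a real form would need a much fuller mapping.

```python
from urllib.parse import urlparse

# Hypothetical lookup table (host + path, lowercased) -> SPDX identifier.
KNOWN_LICENSE_URLS = {
    "opensource.org/licenses/mit": "MIT",
    "www.apache.org/licenses/license-2.0": "Apache-2.0",
    "creativecommons.org/publicdomain/zero/1.0": "CC0-1.0",
}


def spdx_from_url(license_url):
    """Return the SPDX identifier for a known license URL, or None."""
    parsed = urlparse(license_url.strip())
    # Ignore scheme, letter case, and a trailing slash when matching.
    key = (parsed.netloc + parsed.path).lower().rstrip("/")
    return KNOWN_LICENSE_URLS.get(key)


print(spdx_from_url("https://opensource.org/licenses/MIT"))
print(spdx_from_url("https://example.gov/custom-license"))
```

A URL that is not in the table returns None, which a form could treat as a prompt to ask whether a standard license would do.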
Agency
Looking at the top-level agency field, I'm concerned that there may be duplication/conflicts in cases where more than one agency is involved in a code project. In DC civic.json the partners field includes all parties involved, including the principal one.
Updated
@stvnrlly is (I think rightly) biased against fields that will be frequently updated. I believe all of these fields should be automatically generated. These are easy to neglect/mess up.
Contact
You may want to allow for additional contact URLs, outside of Twitter. DC Civic.json's contact object has a freeform URL attribute.
Tags
Defining taxonomies is a pain. I don't believe anyone can predict in advance the specific tags that would prove useful. On the other hand, there may be room for guidance on what makes a good tag. You often see people including every conjugation/singular/plural/geographic format/etc.
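One modest form of guidance is to normalize tags mechanically before storing them. The Python sketch below is an assumption about what that could mean (lowercasing, collapsing whitespace into hyphens, dropping duplicates); it does not attempt singular/plural or conjugation merging, which would need a stemmer or a curated list.

```python
def normalize_tags(tags):
    """Lowercase, hyphenate, and de-duplicate tags, keeping first-seen order."""
    seen = set()
    result = []
    for tag in tags:
        # "Open  Data" -> "open-data"
        cleaned = "-".join(tag.strip().lower().split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


print(normalize_tags(["GIS", " gis ", "Open Data", "open  data", ""]))
```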
Unique IDs
If implemented with a view toward an API, I would like a way to discover new projects and monitor changes in existing ones (e.g. new partners, updated license, new repository URL). Could any of the required code.json fields serve as a unique ID?
from code-gov-web.
I still have no idea what "built for government-wide reuse" means. All open source projects would be "world-wide reuse", so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that, provided whatever project-specific prerequisites are met. Further, "for" implies intent. I'm not sure why intent matters. The whole point of open sourcing and inner sourcing is reuse. Aren't all software projects potentially reusable, regardless of initial intent? It seems to me that the whole point of this effort is to make it easier to share and reuse government-produced software, regardless of intent. This is not clear at all, and is prone to confusion.
If that field is kept, it really needs better documentation. That documentation should specifically address the circumstances under which that field has a particular value, and should explain why that value is necessary because it could not be deduced by other attributes (like, open source status/license/exemption/etc.).
from code-gov-web.
@theresaanna schema looks good for a start. Think the optional element of URL should be mandatory. What's the point in identifying repos by name and not by location?
We need to finalize the schema soon because agencies will have to collect the metadata. They first have to consider where the code libraries are stored.
Does anyone know of a GitHub API that collects all org repo metadata and outputs it in a JSON format (or similar)? That would help jumpstart the metadata collection process.
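For what it's worth, the GitHub REST API's "list organization repositories" endpoint (`GET /orgs/{org}/repos`) returns this metadata as JSON. Below is a minimal Python sketch using only the standard library. The output field names in `to_project_entry` are placeholders, not the schema's final keys, and unauthenticated requests are subject to GitHub's rate limits.

```python
import json
import urllib.request


def fetch_org_repos(org, per_page=100):
    """Yield repository objects for a GitHub organization, page by page."""
    page = 1
    while True:
        url = (
            f"https://api.github.com/orgs/{org}/repos"
            f"?per_page={per_page}&page={page}"
        )
        request = urllib.request.Request(
            url, headers={"Accept": "application/vnd.github+json"}
        )
        with urllib.request.urlopen(request) as response:
            repos = json.load(response)
        if not repos:  # An empty page means there is nothing left to fetch.
            return
        yield from repos
        page += 1


def to_project_entry(repo):
    """Map one repository object to a hypothetical code.json-style entry."""
    return {
        "name": repo["name"],
        "description": repo.get("description") or "",
        "repository": repo["html_url"],
        "languages": [repo["language"]] if repo.get("language") else [],
        "updated": repo.get("pushed_at"),
    }


if __name__ == "__main__":
    # "LLNL" is used only because it is mentioned elsewhere in this thread.
    entries = [to_project_entry(repo) for repo in fetch_org_repos("LLNL")]
    print(json.dumps(entries, indent=2))
```

Fields that need a human (exemptions, reuse status, contact) would still have to be filled in afterwards.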
from code-gov-web.
@jcastle -- I have some JavaScript code that does this to visualize our (@LLNL) orgs on our http://software.llnl.gov page. You can find that here: https://github.com/LLNL/llnl.github.io/blob/master/js/github-dynamic.js
I also have some Python scripts I'll get pushed up today.
from code-gov-web.
@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put "Lawrence Livermore National Laboratory" and / or "DOE"?
from code-gov-web.
A sample I just did for a single project can be found here: https://github.com/LLNL/llnl.github.io/blob/master/_data/code.json
I'll work on a script to get it more fleshed out shortly.
from code-gov-web.
I would suggest adding "release date", that is, the date it was initially released to the public. This is interesting information for many reasons, and isn't always obvious from the version control information.
This is one of the fields captured here: http://www.dwheeler.com/government-oss-released/
from code-gov-web.
@IanLee1521 Thanks so much for digging into this and trying it out!
@theresaanna -- Is there a suggestion on what to put when the code is not directly from an agency? In my case would I put Lawrence Livermore National Laboratory and / or DOE ?
That's an excellent question. My thinking is that we may want to add another field that would accommodate LLNL. I'm not sure what to call it, though. Do you think this is a good solution, and do you have an idea of what would be a suitable key?
from code-gov-web.
@theresaanna -- Perhaps something like "organization"? I imagine that other agencies will have the same issue. Certainly DOE with the national labs. But also I would expect DOD would want to get subdivided to Army, Navy, Marines, Air Force, etc. Another example would be GSA -> 18F.
Another option would be to have that all included in a single field, something like:
- "agency": "DOE // LLNL"
- "agency": "GSA // 18F"
etc.
from code-gov-web.
Hello folks,
Thanks for keeping up the lively discussion on the alpha version of the schema. The specification for version 1.0 of the metadata schema is now available here: https://github.com/presidential-innovation-fellows/code-gov-web/blob/master/_draft_content/schema/specification_v1_0.md. Sample JSON files to be included soon.
from code-gov-web.
With an organization you could even drill down to a specific level. For example, my organization would be:
DOC/NOAA/NWS/NCEP/CPC
If the org is parsed into levels, one could query a specific level to see how much code is being produced at that level:
noaa_code = find(organization[1] == 'NOAA')
doc_code = find(organization[0] == 'DOC')
Something like that...
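A runnable Python version of that idea might look like the sketch below. The "organization" key and the helper names are assumptions for illustration; the default "/" separator also happens to cope with the "DOE // LLNL" style suggested earlier, because empty segments are dropped.

```python
def parse_organization(path, separator="/"):
    """Split 'DOC/NOAA/NWS' (or 'DOE // LLNL') into a list of levels."""
    return [part.strip() for part in path.split(separator) if part.strip()]


def find_by_level(projects, level, name):
    """Return projects whose organization has `name` at the given level."""
    matches = []
    for project in projects:
        levels = parse_organization(project["organization"])
        if len(levels) > level and levels[level] == name:
            matches.append(project)
    return matches


projects = [
    {"name": "forecast-tools", "organization": "DOC/NOAA/NWS/NCEP/CPC"},
    {"name": "census-api", "organization": "DOC/Census"},
    {"name": "spack", "organization": "DOE // LLNL"},
]

noaa_code = find_by_level(projects, 1, "NOAA")
doc_code = find_by_level(projects, 0, "DOC")
print([p["name"] for p in noaa_code])
print([p["name"] for p in doc_code])
```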
from code-gov-web.
I agree that true and false are much better than 1 and 0.
from code-gov-web.
I particularly like @stvnrlly 's comments and wanted to record a few (some overlapping with the other commenters too) to weigh in.
Apologies for the quick list;
- I feel like the "vcs" key/value should be dropped, e.g. GitHub lets one pull from SSH, Git, or SVN, so it feels confusing (e.g. even in the example I think someone might be confused about whether it should be "github" or "git", especially with other companies like Microsoft incorporating git into their tools).
- This could also potentially be inferred from the URL.
- I feel like the "language" key/value should just be part of the tags, e.g. "python" might just be a tag vs. a "sub field tag", which would allow easier filtering, e.g. search: python+connecting (which is of course possible with multiple fields but harder to implement).
- The "partners" field feels like it's destined to be underutilized, such as when agencies don't know / care who's using their code and maybe won't maintain this... also ideally the VCS will track "forking", so this could ideally be queried in real time vs. a static snapshot.
- The Gov-wide reuse field also feels similarly "undefined", e.g. an agency might not know if it's reused.
- The "openSourceProject" field as a binary flag seems confusing (to me): if the software is released as part of the Open Source release then isn't this always True?
- I would change "updated" to "schema_updated" to make it clear "what's being updated" (per the earlier discussions).
- Lastly, and I know it was discussed before, I don't feel like "License" should be permitted to be null, and in fact I think it might be better to require this to be a file within the repository.
- e.g. I think the License field could just be a "tag" vs. a specific field.
from code-gov-web.
I think it's important to think about scalability. How much of this can be automated as @stvnrlly suggested for a few fields? Maintaining the data.json file for DOL has been a nightmare of manual labor, especially when there are schema updates. Even CKAN can be hours and hours of clicking the days away.
from code-gov-web.
@MikePulsiferDOL -- I'm in the process of getting some code released that would help with generating this JSON. It's making its way through our release process, which is taking its time... I'll update once I can push it to @LLNL.
from code-gov-web.
@ctubbsii / @thecapacity -- I think that the two fields "openSourceProject" and "governmentWideReuseProject" are meant to encode three possible states of being for Code that is to be listed on Code.gov:
- Open Source
- Government-wide Reuse
- Closed Source / Exempted / etc.
To @ctubbsii 's comment:
... so does "government-wide reuse" refer to closed-source, but government-shared projects? It seems to me that all government closed-source software should allow that ...
While I agree with the sentiment that all software should allow that, the fact is that until the Federal Source Code Policy there hasn't been any hard requirement for it to be available broadly across government, and the default has been "Closed Source". The policy makes government-wide reuse a requirement.
from code-gov-web.
Consider adding a schema version identifier so that future parsers will know what fields to expect.
"openSourceProject": 1
Consider eliminating attributes that can be calculated. It should be possible to calculate openSourceProject using the license URL.
"governmentWideReuseProject": 0
Consider minimizing the number of attributes in the schema. Instead of trying to get the right answer on the first try, include an attribute called something like "optional": that accepts an array of JSON objects so that experimental attributes can be tested before graduating to required schema fields. With a schema version attribute a parser would know which fields are expected and that introspection is required for the optional fields.
"status": "Alpha",
Instead of status, consider asking for the release number. Usually a release or version number indicates alpha, beta, 1.0.0, etc. It might be worth recommending a standard like semver, used by Angular2 and other projects.
from code-gov-web.
It might be worth extending an existing package manager schema rather than creating a new one. For example...
from code-gov-web.
Comments on Current/Proposed as of 10/13
openSourceProject
Seems redundant and confusing: if the project uses an accepted open source license then it is true/1; if it doesn't, then false/0. Suggest removing.
agency
Using an agency acronym is dangerous, as some agencies can't even agree internally on their own (e.g. USFWS or FWS, USACE or ArmyCorps, etc.). While program/bureau codes are 'safer' for data quality, they are not intuitive. We have previously recommended that the Government create a reference mapping of agency domain names (e.g. @gsa.gov, @usfws.doi.gov, etc.) to their bureau/program. An agency's domain is far more stable and less likely to create a data-cleaning nightmare, and, frankly speaking, the people doing the data entry likely already know their email address.
https://project-open-data.cio.gov/v1.1/schema/#bureauCode
license/language
Both of these attributes should implement/reference a controlled vocabulary to ensure consistency.
Comments on what's missing as of 10/13
Globally Unique Identifiers (GUID/UUID)
These are critical to establishing provenance to the canonical source of data. The whole point of them is that they can be generated in a distributed way yet remain statistically unique, so the chance of anyone generating a duplicate GUID/UUID is realistically nil. Not using one (1) makes any parent/child relationships impossible and (2) leaves no other unique identifier, so when titles change there is no way to answer "is this project really that project", making it impossible to avoid or test for redundant or duplicate entries. See #56
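For reference, Python's standard `uuid` module covers both ways of minting such an identifier. The helper names below are hypothetical, and deriving an ID from the repository URL is only one possible convention, not something proposed in the schema.

```python
import uuid


def new_project_id():
    """Random (version 4) UUID: generate once, store it, never recompute."""
    return str(uuid.uuid4())


def project_id_from_url(repository_url):
    """Name-based (version 5) UUID derived from a normalized repository URL."""
    normalized = repository_url.strip().lower().rstrip("/")
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalized))


print(new_project_id())
print(project_id_from_url("https://github.com/example/demo"))
```

A random ID survives title and URL changes as long as it is stored with the project; a URL-derived ID can be regenerated by anyone, but it changes if the repository moves.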
isPartof (Parent/Child relationships)
As we have discovered in implementing data.json, the concept of a collection (i.e. the ability for one component of a project to reference its parent project) is critically important.
https://project-open-data.cio.gov/v1.1/schema/#isPartOf
contact.role
The contact field should allow/encourage multiple entries, but currently there is no concept of a contact's role (e.g. project manager, development lead, etc.). This has been a concern in Project Open Data because of personnel turnover and the want/need to direct people to a generic inbox for a program/team to complement the specific employee/POC for the project.
General Comments
JSON is great for having one file that contains a series of entries (more than one open source project), but it is less human-readable than YAML. Given that multiple YAML documents can be compiled into one JSON document, IMHO it is more practical for those responsible for data entry to use YAML, either as a code.yml file in the root dir of the repo and/or as an enhanced README: a single README.md file with YAML frontmatter to better manage structured data (this is the exact file format GitHub Pages uses to structure content for static websites in lieu of a CMS/database). It is then easy enough to generate a JSON file from all of those files as a crawl/collect/transform step within a GitHub organization, for instance.
from code-gov-web.
Thanks everyone. We've had a bit of a proliferation of issues related to the schema. Let's move this conversation to #41
from code-gov-web.