whatwg / html-build
Build scripts for https://github.com/whatwg/html
License: Other
Currently, we depend on both `curl` and `wget`. This is redundant, so we should pick just one. Personally, I'd prefer we drop `wget`, since `curl` ships by default with OS X.
I'll take a stab at this later this week.
As noted in #83 (comment), the problem @annevk ran into when trying to determine which wattsi was being called on his system would be easier to troubleshoot if we had a way to report which version of wattsi is in use.
I've lost track a bit about how the various pieces fit together, but it seems whatwg/html already uses Travis to build the HTML Standard. Why not add some rsync/scp at the end of that and remove all scripts from the server? Having the server just host static resources seems much better.
Should the wattsi installer not take care of this?
We already include legacy bugs: whatwg/html#619
The simplest possible implementation would be to check for links to the spec in open issues, as the bug filing tool already includes a link. If that becomes too error-prone, we could limit it to either URLs in the first comment, or have a format like `loc:https://html.spec.whatwg.org/#html-vs-xhtml:mime-type`.
That's only for the bug scraping script.
Seems like it would work: https://unix.stackexchange.com/questions/17949/what-is-the-difference-between-grep-egrep-and-fgrep
It would be lovely if we could put together something like a Makefile, so that there was a single command that would ensure that dependencies (like wattsi) were downloaded/updated/built, and would execute the various build commands to produce the generated documents.
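A minimal sketch of what such a Makefile could look like. This is illustrative only: the target names, the sibling `../wattsi` checkout, and the build steps are assumptions, not the repo's actual layout.

```make
# Hypothetical sketch; paths and targets are assumptions.
SOURCE ?= ../html/source

.PHONY: all deps spec

all: deps spec

# Ensure dependencies (like wattsi) are downloaded/updated/built.
deps:
	cd ../wattsi && git pull && ./build.sh

# Run the existing build to produce the generated documents.
spec:
	./build.sh
```

Make would also give us incremental rebuilds for free once the individual steps have real file targets.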
A few things are currently less than ideal:
Wouldn't it be nice if building were just blazing fast by default, and rebuilding dependencies was an option that should rarely be used?
It looks like the files that are eventually used are only these 6:
Together they are only 1.7 MB, or 282 kB gzipped. That's a lot of room for savings.
Rough proposal:
If you do a clean clone and build.sh, then cd into the html subdirectory and do `git branch -r`, there are no remote branches. `git fetch --all` does not help. You can't do things like `git checkout fetch` to get the fetch branch.
Not really sure how to fix this. Presumably it's a result of the `--depth 1`? But even doing the unshallow doesn't fix it.
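One possible explanation and fix, shown as a self-contained demo: a `--depth 1` clone also narrows `remote.origin.fetch` to the single cloned branch, which is why fetching (and even unshallowing) doesn't surface the other branches; widening the refspec first does. The throwaway repo here stands in for whatwg/html.

```shell
set -e
tmp=$(mktemp -d)

# Set up an "upstream" repo with a second branch, standing in for whatwg/html.
git init -q "$tmp/src"
git -C "$tmp/src" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m initial
git -C "$tmp/src" branch fetchdemo

# Shallow clone, as build.sh does; this narrows remote.origin.fetch.
git clone -q --depth 1 "file://$tmp/src" "$tmp/html"

# The fix: widen the fetch refspec to all branches, then fetch.
git -C "$tmp/html" remote set-branches origin '*'
git -C "$tmp/html" fetch -q --all

git -C "$tmp/html" branch -r   # origin/fetchdemo now appears
```

So in build.sh terms, a `git remote set-branches origin '*'` before fetching (plus `git fetch --unshallow` for full history) should make `git checkout fetch` work again.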
html5lib/html5lib-tests#71 showed up the fact that MathML defines ⃛, ⃛, ⃜, and ̑ differently from HTML.
This seems to be because we don't match the behaviour of https://github.com/w3c/xml-entities/blob/gh-pages/entities.xsl#L174 (the template starting with `<xsl:template match="entity">`; note this is XSLT 2, so it isn't supported in that many places!) in .entity-processor.py and .entity-processor-json.py (why oh why do we have two different files with so much duplicated code?).
Once #103 is fixed we should have another look at integrating with @tobie's tool to provide diffs for changes to the HTML Standard. One way we could do this is offer only diffs for the multipage documents that changed. That will require a somewhat custom setup unfortunately, but I don't think there's a way around that for the HTML Standard at this point.
Some was introduced recently. It would be good to avoid it.
https://html.spec.whatwg.org/multipage/indices.html#all-interfaces has "INSERT INTERFACES HERE" rather than a list of interfaces. Works fine single-page.
I bet it was pattern matching looking for `/* sealed */`.
If the first character in `<dfn>` content is `a`, `e`, `i`, or `o`, then linting fails with a "Possible article problems" error. I guess this is due to the `[aeio]` part of this regexp. I wonder if it's intentional and, if so, why it's required?
This seems like a nice potential savings. @sideshowbarker, are you familiar enough with the various things Wattsi can output to write a regex to detect such line numbers? I recall them being in parentheses; maybe just `\(\d+\)`?
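A quick sanity check of that pattern, assuming the line numbers really do appear as bare parenthesized integers in Wattsi's output (the sample lines below are made up):

```shell
# Strip parenthesized line numbers so output is stable across edits.
printf 'Parsing source (4021)\nCross-referencing (4021)\n' |
    sed -E 's/ ?\([0-9]+\)//g'
# Prints:
# Parsing source
# Cross-referencing
```

If the real output uses a different shape (e.g. `line 4021:`), the sed expression would need adjusting to match.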
IRC chatter says so, at least.
PRs to https://github.com/domenic/wattsi-server welcome. I'll probably work on this tomorrow otherwise. Currently I'm thinking the best approach is to include an `output.txt` in the zip file produced by the server?
Given https://lists.w3.org/Archives/Public/www-archive/2015Aug/0013.html this might actually be doable, but would probably be quite a bit of work.
The dev edition subdfns are taking over the main IDs. See e.g. https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#dom-fae-form-2.
Ugh, I can tell this is not going to be fun to fix.
I once saw some advice that really stuck with me. It went something like, "when writing commands meant to be read later by others, use the spelled out version. Your future readers, including yourself, will better be able to understand it, and it costs nothing." The idea being that the short versions are to save you time when typing manually on the command line, but not as appropriate when writing a script.
This would be good to keep in mind as we edit the build script.
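A tiny illustration in shell (using grep here, though the same applies to the curl/wget invocations in build.sh):

```shell
# The same check written both ways; both print "1", but the long-form
# invocation documents itself when read later in a script.
f=$(mktemp)
printf 'alpha\nbeta\n' > "$f"

grep -c 'beta' "$f"        # terse: fine when typed interactively
grep --count 'beta' "$f"   # spelled out: clearer in build scripts
```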
I'm trying to translate whatwg/html here: https://whatwg-cn.github.io/html/multipage/
To keep this fork in sync (especially the not-yet-translated sections), that repo also uses the build tools (html-build, wattsi). I found that the hard-coded match strings (in .pre-process-annotate-attributes.pl, .pre-process-tag-omission.pl, and maybe others) break the build process once the source is translated; for example:
<dt><span data-x=\"concept-element-attributes\">Content attributes</span>:</dt>
https://github.com/whatwg/html-build/blob/master/.pre-process-annotate-attributes.pl#L18
For now I have also translated these Perl source files locally. Could there be a better solution to make this tool language-independent? Or should I push the zh-Hans version to this project, which may require a localization mechanism to be implemented?
If we at least validate the output when merging pull requests, we would not accumulate small issues like in whatwg/html#649
Would make a lot of sense together with #46
Serious errors will be caught quickly anyway, this is mostly a matter of appearances.
As an option for faster compilation, get wattsi installed, something like that.
Inspired by #82 (comment).
I'd like `./build.sh --prime-cache` or similar to download the w3cbugs.csv and caniuse.json, then bail. This helps the docker use case in complicated ways, but you can imagine it being useful, e.g., before you get on a plane or similar.
An alternate approach: allow `--no-update` to be truly no-update, so that if you use that option with an empty cache, it will generate empty caniuse and w3cbugs files to use.
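A rough sketch of how the flag could look in build.sh; the function name and cache layout are made up, and the w3cbugs URL is the one the build already fetches:

```shell
# Hypothetical sketch; prime_cache() and the cache layout are assumptions.
CACHE_DIR=".cache"
mkdir -p "$CACHE_DIR"

prime_cache() {
  curl --fail --silent --show-error --location \
      --output "$CACHE_DIR/caniuse.json" \
      "https://raw.githubusercontent.com/Fyrd/caniuse/master/data.json"
  curl --fail --silent --show-error --location \
      --output "$CACHE_DIR/w3cbugs.csv" \
      "https://www.w3.org/Bugs/Public/buglist.cgi?columnlist=bug_file_loc,short_desc&query_format=advanced&resolution=---&ctype=csv"
}

case "$1" in
  --prime-cache)
    prime_cache    # fetch the remote data files, then bail
    exit 0
    ;;
esac
```

The truly-no-update variant would instead have `--no-update` write empty caniuse/w3cbugs files into `$CACHE_DIR` when they are missing.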
@bzbarsky @dakami and I had a hallway discussion at the end of TPAC about the possibility of adding location.ancestorOrigins to Firefox. bz has had longstanding concerns about the information this leaks to child frames. We arrived at a local consensus that any leakage is roughly equivalent to what happens already with referrer, so it would make sense to redact ancestorOrigins according to referrer policy. (and this could resolve that objection to a Mozilla implementation of ancestorOrigins)
/cc @smaug---- @annevk
Opening this so we don't forget.
`HTML_*` environment variables: pass arguments instead (a natural part of "stop copying")... anything else?
See whatwg/html#324 for details. I guess the problem is that when we make fixes to the build script we don't clear the cache. So maybe this is just a simple fix on the server.
Error: Could not find ID telephone-state-%28type=tel%29 for annotation that uses URLs: http://caniuse.com/#feat=input-email-tel-url
@Hixie how can we fix this?
In whatwg/html#2751 @sideshowbarker proposes adding client-side HTML syntax highlighting. We may want to merge that sooner instead of blocking on what I propose below. But the below proposal avoids some of the issues there and has some other benefits, so we should do it eventually.
The proposal is to have @tabatkins extract his syntax highlighter from Bikeshed and then the html-build process and/or wattsi can shell out to it. The exact shape of this is TBD, see below.
Bikeshed's syntax highlighter consists of:
`<mark>`, `<ins>`, `<del>`, or `<a>`
The benefits of this over the client-side solution are:
Also, I think we'd want to have this easily disabled during the build process, to get faster local builds. For deploys/in CI we would enable it of course.
This would probably all work best if we can shell out to a script extracted from Bikeshed. It would presumably be written in Python, Bikeshed/Pygments's language. There are a few possibilities for the overall workflow:
`[tagname, {attrs}, ...contents]`-style tree instead of HTML, I believe so that he then doesn't have to include an HTML parser.
After writing this, I am leaning toward (2) right now, although that didn't align with @tabatkins's thoughts in IRC (he was thinking more along the lines of (3)), so I am curious what the right approach is.
https://github.com/w3c/xml-entities
Probably best to keep a checkout of that repo in .cache, cd in, git pull, then use it. Alternately we could continue curling, using https://raw.githubusercontent.com/w3c/xml-entities/gh-pages/unicode.xml as the source now.
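A self-contained demo of that clone-or-update pattern, using a throwaway local repo in place of w3c/xml-entities (the `clone_or_update` name and cache path are made up):

```shell
set -e
work=$(mktemp -d)
upstream="$work/xml-entities-upstream"   # stand-in for the GitHub repo
checkout="$work/cache/xml-entities"      # stand-in for .cache/xml-entities
mkdir -p "$work/cache"

git init -q "$upstream"
git -C "$upstream" -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m 'unicode.xml v1'

clone_or_update() {
  if [ -d "$checkout/.git" ]; then
    git -C "$checkout" pull -q          # later runs: incremental update
  else
    git clone -q "$upstream" "$checkout"  # first run: full clone
  fi
}

clone_or_update
clone_or_update
```

With the real repo, the clone would point at https://github.com/w3c/xml-entities.git and unicode.xml would then be read from the checkout.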
Lines 339 to 348 in 594dd34
`./lint.sh html/source` or whatever.
With the latest changes, a ">" resource is created that contains the following:
/dev/null: Scheme missing.
--2015-09-05 06:30:45-- https://www.w3.org/Bugs/Public/buglist.cgi?columnlist=bug_file_loc,short_desc&query_format=advanced&resolution=---&ctype=csv
Resolving www.w3.org (www.w3.org)... 128.30.52.100
Connecting to www.w3.org (www.w3.org)|128.30.52.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘w3cbugs.csv’
0K .......... .......... .......... .......... .......... 115K
50K .......... .......... .......... .......... .......... 274K
100K .......... .......... .......... .......... .......... 155K
150K .......... .......... .......... .......... .......... 182K
200K .......... .......... .......... .......... .......... 186K
250K ..... 4.72M=1.5s
2015-09-05 06:30:50 (172 KB/s) - ‘w3cbugs.csv’ saved [261414]
FINISHED --2015-09-05 06:30:50--
Total wall clock time: 4.7s
Downloaded: 1 files, 255K in 1.5s (172 KB/s)
I think this plan is good:
Add an endpoint to wattsi-server, called cldr.inc, which when pinged does the svn checkout of cldr and the `.cldr-processor.pl` step and returns the result. This will generally be fast except the very first time.
We then add a guard in the build script (not sure how) so that if we find XML::Parser is not installed, we skip the cldr checkout and the .cldr-processor.pl step, instead just downloading the result from the server.
@sideshowbarker, anything I'm missing? Seems like it should work.
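The guard could be as simple as probing for the module; the `cldr_strategy` variable here is hypothetical, and the server fallback is the cldr.inc endpoint proposed above:

```shell
# Probe for XML::Parser; pick the cldr strategy accordingly.
if perl -MXML::Parser -e 1 2>/dev/null; then
  cldr_strategy=local    # svn checkout + .cldr-processor.pl
else
  cldr_strategy=server   # fetch pre-generated cldr.inc from wattsi-server
fi
echo "cldr strategy: $cldr_strategy"
```

`perl -MXML::Parser -e 1` exits nonzero when the module (or perl itself) is missing, which is exactly the case we want to route to the server.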
Given all the recent changes, I am a bit worried about people with outdated build tools running into problems.
If not run with --no-update, I think we should check for updates. I see a few options on how to implement this:
`git fetch`, then check against `origin/master`'s HEAD revision. Potential minor cons: it doesn't work if you checked out the build tools with a different remote name (like `whatwg/master` instead of `origin/master`), and it does modify your local git checkout, which might be unexpected.
If we find an update or any other mismatch with origin/master's HEAD, I think we should warn and give instructions. (Not error, and not auto-update.)
I've set up https://github.com/foolip/cldr-data and a cron job using https://github.com/foolip/cldr-data-updater.
Edit: These repos have been removed, let me know if you want them back for some reason.
For me, checking out using svn takes 55 seconds, while a `--depth 1` clone using git takes 32 seconds. Not amazing, but depending on how we do it, the incremental updating should be faster.
Does everyone hate git submodules? I think it'd be kind of nice to get all dependencies into Git and explicitly update dependencies, even if it's done by a roll bot.
It seems strange to me that the build scripts are separate from the only source file that will ever use them. I'd suggest integrating them with the main HTML repository (perhaps in a `build/` subdirectory) so that symlinking or copying of source files is no longer a necessary step in the process.
Given that we now use rsync we'd have to exclude the .cgi specifically or maybe we can exclude *.cgi to avoid revealing the filename? And I guess after rsync we should wget/curl the relevant remote URL? Is that okay to be public?
(And sorry for breaking this again without an upfront plan.)
It seems like having an html-build Docker image could help solve problems for some contributors.
Using wattsi-server is probably the easiest solution for most contributors, but for contributors with less-stable or less-reliable Internet connections, a better solution might be to have the ability to run the build locally—but without also needing to deal with the not-so-easy steps of needing to build fpc and wattsi from the sources.
So my limited understanding of Docker makes me believe that it may provide a good solution in this case.
Currently, `build.sh` does everything every time. It would be lovely if we could at least split component updates (caniuse, unicode, etc.) from the actual spec generation process, such that they could be executed independently. There's no reason to require network access for spec generation, and hitting the network significantly slows things down.
We should align HTML's commit snapshots with the ones for Bikeshed-produced specs, such as https://streams.spec.whatwg.org/commit-snapshots/e75b9841572ae6153a167eea471433c55d06258e/.
Regarding `<title>`: see whatwg/html#882 and #86. It should instead do something similar to detecting a `<!-- not obsolete -->` comment, but @zcorpan says that such comments are stripped out at an earlier stage in the pipeline, so this needs some investigating.
https://github.com/whatwg/html-build/blob/master/build.sh#L49 creates a .htaccess for /multipage/ which overlaps a bit with https://github.com/whatwg/html/blob/master/.htaccess. Also, seems weird to generate it like this rather than do the same as with multipage-404...
We kind of mentioned this in #3 and @sideshowbarker is working on it. But let's outline what I envision a bit more explicitly.
Build script parameters:
Cache should contain:
`<!-- BOILERPLATE $filename -->` comments.
All other files in https://github.com/whatwg/html-build/blob/master/.gitignore seem to be intermediate build files, and should ideally go in a temporary directory, not in the cache folder (these are distinct concepts).
In particular it would be interesting if we could get alerts somehow when something is checked in that results in new errors or new XXX comments.
I just hit a parse error and had to modify the build script to stop looking for the 65 thing to get the accurate location of the error.
I had just updated and built Wattsi, so something else seems amiss.
Because it clones whatwg/html by default, if I directed a first-time contributor here, the instructions in the readme would get them stuck, unable to push after making changes.
I'd suggest a prompt in build.sh that asks them where to check out from. I am thinking:
Didn't find the HTML source on your system...
Enter where you would like to clone it from (GitHub username or URL):
(URL detection can be done by looking for `:`.)
Alternately a quick fix is just to update the readme to suggest checking out your fork in a sibling directory before building for the first time.
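A sketch of the `:`-based detection; `resolve_clone_source` is a made-up helper name, and in build.sh its argument would come from the `read` prompt above:

```shell
# Map the prompt answer to a clone URL: anything containing ':' is
# treated as a full URL, otherwise it's taken as a GitHub username.
resolve_clone_source() {
  case "$1" in
    *:*) printf '%s\n' "$1" ;;
    *)   printf 'https://github.com/%s/html.git\n' "$1" ;;
  esac
}

resolve_clone_source "annevk"
# → https://github.com/annevk/html.git
resolve_clone_source "git@github.com:annevk/html.git"
# → git@github.com:annevk/html.git
```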
But it can be installed via `brew install coreutils`.
@zcorpan discovered that we have multiple 404.html and .htaccess resources, despite us having removed those from the build script a while back.
It seems this is due to rsync in htmlbuild/update-spec.sh being run without the --delete argument.
@domenic, would it be safe to add that argument? Do you want to do that?
Also, should we have htmlbuild/ in version control somewhere?
Due to
    ln -s ../images $HTML_TEMP/wattsi-output/multipage-html/
    ln -s ../link-fixup.js $HTML_TEMP/wattsi-output/multipage-html/
    ln -s ../entities.json $HTML_TEMP/wattsi-output/multipage-html/
these files end up duplicated on our server. I would prefer we stop doing that and just refer to `/{resource}` instead, as we already do for `/404.html`.
`entities.json` seems to require a change in whatwg/html; `images/` too. `link-fixup.js` might require a change in wattsi, since it's only used by multipage.
Thoughts?
Our current CI/CD situation works surprisingly well, but is rather hacked together. It consists basically of using GitHub webhooks against some hand-crafted CGI which then git pulls from master, does a build and deploy.
What I would like to improve:
I think to do this properly we are probably going to want to learn about Jenkins or TeamCity and use one of those. I would love to use Travis CI, because the UI is great and I'm familiar with it, but we have so much tooling to install on each build that it doesn't seem terribly feasible, and my desire to e.g. deploy output snapshots for PRs seems beyond Travis's capabilities. Maybe their paid plan has this flexibility, but it's costly, if I recall correctly.
If people in the community are knowledgeable about this kind of CI/CD work and would like to help guide us, or even help us get it set up, we'd be very grateful.