
html-build's People

Contributors

abacabadabacaba, annevk, cvrebert, dbaron, defunctzombie, domenic, foolip, hixie, hober, jeremyroman, leobalter, marti1125, sideshowbarker, stephenmcgruer, surma, takenspc, tawandamoyo, wakaba, xhmikosr, zcorpan


html-build's Issues

Drop wget dependency

Currently, we depend on both curl and wget. This is redundant, so we should pick just one. Personally, I'd prefer we drop wget since curl ships by default with OS X.

I'll take a stab at this later this week.
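For illustration, here is what a curl-only fetch helper might look like; the flags shown are standard curl options that cover what wget does by default (the actual invocations in build.sh may differ):

```shell
# fetch URL DEST: download with curl only, no wget.
# --fail exits non-zero on HTTP errors (like wget does),
# --location follows redirects (wget's default behavior),
# --show-error still reports failures despite --silent.
fetch() {
  curl --fail --location --silent --show-error --output "$2" "$1"
}
# e.g. fetch "$bugs_url" w3cbugs.csv
```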

Can we move the whole HTML build process to Travis

I've lost track a bit about how the various pieces fit together, but it seems whatwg/html already uses Travis to build the HTML Standard. Why not add some rsync/scp at the end of that and remove all scripts from the server? Having the server just host static resources seems much better.

Find and link to GitHub issues inline in the spec

We already include legacy bugs: whatwg/html#619

The simplest possible implementation would be to check for links to the spec in open issues, as the bug filing tool already includes a link. If that becomes too error-prone, we could limit it to either URLs in the first comment, or have a format like loc:https://html.spec.whatwg.org/#html-vs-xhtml:mime-type that's only for the bug scraping script.
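A minimal sketch of the scraping side, assuming the standard GitHub issues API; extract_spec_links is a hypothetical helper, and pagination/rate limits are left out:

```shell
# extract_spec_links: pull unique spec fragment URLs out of whatever
# text is piped in (e.g. the JSON body of the issues API response).
extract_spec_links() {
  grep -o 'https://html\.spec\.whatwg\.org/[^"<> ]*#[^"<> ]*' | sort -u
}
# Against the real API (first page of open issues only):
# curl --silent 'https://api.github.com/repos/whatwg/html/issues?state=open' \
#   | extract_spec_links
```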

Add a one-step build script.

It would be lovely if we could put together something like a Makefile so that there was a single command that would ensure that dependencies (like wattsi) were downloaded/updated/built, and would execute the various build commands to produce the generated documents.
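As a sketch of the dependency-checking part of such a script (the actual wattsi build steps are elided, since they live in wattsi's own repo):

```shell
# ensure_tool NAME INSTALL_CMD: run the install command only when the
# tool is missing from PATH, so repeat builds skip straight to the spec.
ensure_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: already installed"
  else
    eval "$2"
  fi
}
# ensure_tool wattsi 'git clone https://github.com/whatwg/wattsi.git && ...'
# ...followed by the usual build commands.
```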

Cut down on build dependencies (data files and binaries)

A few things are currently less than ideal:

  • A lot of data needs to be downloaded, and it's all thrown away when "Build tools have been updated since last run; clearing the cache."
  • A comment in .entity-processor.py says "this uses 658 MB", and in fact I cannot run it on my VPS.
  • Users need to install Subversion and Perl's XML::Parser, which are likely not pre-installed.
  • (minor) We don't track dependencies, so builds are not reproducible.

Wouldn't it be nice if building were just blazing fast by default, and rebuilding dependencies was an option that should rarely be used?

It looks like the files that are eventually used are only these 6:

  • caniuse.json
  • cldr.inc
  • entities-dtd.url
  • entities.inc
  • entities.json
  • w3cbugs.csv

Together they are only 1.7 MB, or 282 kB gzipped. That's a lot of room for savings.

Rough proposal:

  • Separate out the scripts for building these dependencies so that they can easily be built without also building the spec.
  • Set up an automatically updated html-build-deps repo that has the output.
  • In build.sh, by default use the html-build-deps repo, but have an option to generate from scratch.
  • (maybe) Track the exact html-build-deps commit to use, using either submodules or a DEPS file.

Related issues:
#24
#38
#55
#60 (would be made obsolete)

After the build script clones, I cannot get any remote branches

If you do a clean clone and run build.sh, then cd into the html subdirectory and run git branch -r, there are no remote branches. git fetch --all does not help, and you can't do things like git checkout fetch to check out the fetch branch.

Not really sure how to fix this. Presumably it's a result of the --depth 1? But even unshallowing the clone doesn't fix it.
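One likely explanation: --depth 1 implies --single-branch, which narrows the clone's fetch refspec to just the cloned branch, so git fetch --all has nothing else to fetch. Widening the refspec (a sketch, assuming the remote is named origin) makes the other branches appear:

```shell
# fix_shallow_clone DIR: widen a shallow, single-branch clone's fetch
# refspec so every remote branch becomes visible to `git branch -r`.
fix_shallow_clone() (
  cd "$1" || return 1
  git remote set-branches origin '*'
  git fetch --quiet --depth 1 origin
)
# Usage: fix_shallow_clone html && git -C html branch -r
```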

Fix MathML/HTML entity divergence

html5lib/html5lib-tests#71 surfaced the fact that MathML defines ⃛, ⃛, ⃜, and ̑ differently from HTML.

This seems to be because we don't match the behaviour of https://github.com/w3c/xml-entities/blob/gh-pages/entities.xsl#L174 (the template starting with <xsl:template match="entity">; note this is XSLT 2 so isn't supported in that many places!) in .entity-processor.py and .entity-processor-json.py (why oh why do we have two different files with so much duplicated code?).

/cc @fred-wang @davidcarlisle

Look into HTML diffs

Once #103 is fixed we should have another look at integrating with @tobie's tool to provide diffs for changes to the HTML Standard. One way we could do this is offer only diffs for the multipage documents that changed. That will require a somewhat custom setup unfortunately, but I don't think there's a way around that for the HTML Standard at this point.

Spell out all command line flags

I once saw some advice that really stuck with me. It went something like: "when writing commands meant to be read later by others, use the spelled-out version. Your future readers, including yourself, will be better able to understand it, and it costs nothing." The idea is that the short versions save you time when typing manually on the command line, but are not as appropriate when writing a script.

This would be good to keep in mind as we edit the build script.
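For example, with flags build.sh already uses (the long spellings are GNU grep extensions, though BSD grep accepts these particular ones too):

```shell
# The same check, spelled both ways. The terse form is fine at an
# interactive prompt; the long form documents itself in a script.
printf 'fine\nXXX todo\n' > /tmp/demo-source
grep -ni 'xxx' /tmp/demo-source
grep --line-number --ignore-case 'xxx' /tmp/demo-source
```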

hard-coded match strings make this project language dependent

I'm trying to translate whatwg/html here: https://whatwg-cn.github.io/html/multipage/

To keep that fork in sync (especially the not-yet-translated sections), the build tools (html-build, wattsi) are also used in that repo. I found that the hard-coded match strings (in .pre-process-annotate-attributes.pl, .pre-process-tag-omission.pl, and maybe others) break the build process, for example:

<dt><span data-x=\"concept-element-attributes\">Content attributes</span>:</dt>

https://github.com/whatwg/html-build/blob/master/.pre-process-annotate-attributes.pl#L18

For now, I have translated these Perl source files locally. Could there be a better solution to make this tool language-independent? Or should I push the zh-Hans version to this project, which may require a localization mechanism to be implemented?

Optionally validate the build output

If we at least validate the output when merging pull requests, we would not accumulate small issues like in whatwg/html#649

Would make a lot of sense together with #46

Serious errors will be caught quickly anyway, this is mostly a matter of appearances.

Add option to prime the cache (and do nothing else)

Inspired by #82 (comment).

I'd like ./build.sh --prime-cache or similar to download the w3cbugs.csv and caniuse.json, then bail. This helps the docker use case in complicated ways, but you can imagine it being useful e.g. before you get on a plane or similar.

An alternate approach: allow --no-update to be truly no-update, so that if you use that option with an empty cache, it will generate empty caniuse and w3cbugs files to use.
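A sketch of what --prime-cache could do; the flag name and cache layout are assumptions, and the URLs would be whatever build.sh already fetches:

```shell
# prime_cache DIR CANIUSE_URL W3CBUGS_URL: fetch the two network
# resources into the cache, then stop without doing any build work.
prime_cache() {
  mkdir -p "$1"
  curl --fail --location --silent --output "$1/caniuse.json" "$2"
  curl --fail --location --silent --output "$1/w3cbugs.csv" "$3"
  echo "Cache primed."
}
# In build.sh, roughly:
#   [ "$1" = --prime-cache ] && { prime_cache "$HTML_CACHE" "$caniuse_url" "$bugs_url"; exit 0; }
```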

redact location.ancestorOrigins according to Referrer Policy

@bzbarsky @dakami and I had a hallway discussion at the end of TPAC about the possibility of adding location.ancestorOrigins to Firefox. bz has had longstanding concerns about the information this leaks to child frames. We arrived at a local consensus that any leakage is roughly equivalent to what happens already with referrer, so it would make sense to redact ancestorOrigins according to referrer policy. (and this could resolve that objection to a Mozilla implementation of ancestorOrigins)

/cc @smaug---- @annevk

Minor cleanups after #62

Opening this so we don't forget.

  • Stop copying things into $HTML_CACHE, and just use them directly
  • Add a small section to the top-level README.md explaining what the quotes/ and entities/ directories are about.
  • Stop using HTML_* environment variables, pass arguments instead. (a natural part of "stop copying")
  • Figure out how to monitor for changes to import to quotes/ and entities/

... anything else?

Add built-time syntax highlighting

In whatwg/html#2751 @sideshowbarker proposes adding client-side HTML syntax highlighting. We may want to merge that sooner instead of blocking on what I propose below. But the below proposal avoids some of the issues there and has some other benefits, so we should do it eventually.

The proposal is to have @tabatkins extract his syntax highlighter from Bikeshed and then the html-build process and/or wattsi can shell out to it. The exact shape of this is TBD, see below.

Bikeshed's syntax highlighter consists of:

  • Pygments as the base
  • Support for highlighting even code that has interspersed markup, which we use a decent amount in HTML---such as <mark>, <ins>, <del>, or <a>
  • Web IDL syntax highlighting, as that is not a Pygments-supported language
  • Line numbering/highlighting (not relevant to us)

The benefits of this over the client-side solution are:

  • No potential startup jank for users
  • Consistency with other WHATWG specs (which use Bikeshed directly)
  • Allows interspersed markup as described above
  • Web IDL syntax highlighting

Also, I think we'd want to have this easily disabled during the build process, to get faster local builds. For deploys/in CI we would enable it of course.

This would probably all work best if we can shell out to a script extracted from Bikeshed. It would presumably be written in Python, Bikeshed/Pygments's language. There are a few possibilities for the overall workflow:

  1. Preprocess the spec before feeding it to wattsi; the syntax highlighter is responsible for finding all code blocks
    • Probably won't work: Wattsi input source is not real HTML
  2. Postprocess each page of the spec after building it; the syntax highlighter is responsible for finding all code blocks
    • Probably will work, although a second pass might be slow
    • Might be more work for @tabatkins
  3. Shell out each code fragment to be highlighted to the syntax highlighter tool
    • Would require Wattsi integration, not html-build integration
    • Would require a format for passing the data; @tabatkins prefers a [tagname, {attrs}, ...contents]-style tree instead of HTML, I believe so that he doesn't have to include an HTML parser

After writing this, I am leaning toward (2) right now, although that didn't align with @tabatkins's thoughts in IRC (he was thinking more along the lines of (3)), so I am curious what the right approach is.
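Option (2) is easy to sketch at the html-build level. Here the FILTER arguments stand in for the hypothetical highlighter script extracted from Bikeshed (which does not exist yet):

```shell
# postprocess_pages DIR FILTER...: run every built page in DIR through
# FILTER in place, after wattsi has produced its output.
postprocess_pages() {
  dir=$1
  shift
  for page in "$dir"/*.html; do
    "$@" < "$page" > "$page.tmp" && mv "$page.tmp" "$page"
  done
}
# e.g. postprocess_pages "$HTML_TEMP/wattsi-output/multipage-html" highlight-spec
```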

Move source linting to a separate script, and run it *before* compilation

html-build/build.sh

Lines 339 to 348 in 594dd34

$QUIET || echo
$QUIET || echo "Linting the output..."
# show potential problems
# note - would be nice if the ones with \s+ patterns actually cross lines, but, they don't...
grep -ni 'xxx' $HTML_SOURCE/source| perl -lpe 'print "\nPossible incomplete sections:" if $. == 1'
egrep -ni '( (code|span|var)(>| data-x=)|[^<;]/(code|span|var)>)' $HTML_SOURCE/source| perl -lpe 'print "\nPossible copypasta:" if $. == 1'
grep -ni 'chosing\|approprate\|occured\|elemenst\|\bteh\b\|\blabelled\b\|\blabelling\b\|\bhte\b\|taht\|linx\b\|speciication\|attribue\|kestern\|horiontal\|\battribute\s\+attribute\b\|\bthe\s\+the\b\|\bthe\s\+there\b\|\bfor\s\+for\b\|\bor\s\+or\b\|\bany\s\+any\b\|\bbe |be\b\|\bwith\s\+with\b\|\bis\s\+is\b' $HTML_SOURCE/source| perl -lpe 'print "\nPossible typos:" if $. == 1'
perl -ne 'print "$.: $_" if (/\ban (<[^>]*>)*(?!(L\b|http|https|href|hgroup|rb|rp|rt|rtc|li|xml|svg|svgmatrix|hour|hr|xhtml|xslt|xbl|nntp|mpeg|m[ions]|mtext|merror|h[1-6]|xmlns|xpath|s|x|sgml|huang|srgb|rsa|only|option|optgroup)\b|html)[b-df-hj-np-tv-z]/i or /\b(?<![<\/;])a (?!<!--grammar-check-override-->)(<[^>]*>)*(?!&gt|one)(?:(L\b|http|https|href|hgroup|rt|rp|li|xml|svg|svgmatrix|hour|hr|xhtml|xslt|xbl|nntp|mpeg|m[ions]|mtext|merror|h[1-6]|xmlns|xpath|s|x|sgml|huang|srgb|rsa|only|option|optgroup)\b|html|[aeio])/i)' $HTML_SOURCE/source| perl -lpe 'print "\nPossible article problems:" if $. == 1'
grep -ni 'and/or' $HTML_SOURCE/source| perl -lpe 'print "\nOccurrences of making Ms2ger unhappy and/or annoyed:" if $. == 1'
grep -ni 'throw\s\+an\?\s\+<span' $HTML_SOURCE/source| perl -lpe 'print "\nException marked using <span> rather than <code>:" if $. == 1'
All of this operates on the source file, which I never noticed until today. It should probably run before we try to compile. Also, it would be good to have it as a standalone script, so I could do ./lint.sh html/source or whatever.

Mysterious ">" resource

With the latest changes a ">" resource is created that contains the following:

/dev/null: Scheme missing.
--2015-09-05 06:30:45--  https://www.w3.org/Bugs/Public/buglist.cgi?columnlist=bug_file_loc,short_desc&query_format=advanced&resolution=---&ctype=csv
Resolving www.w3.org (www.w3.org)... 128.30.52.100
Connecting to www.w3.org (www.w3.org)|128.30.52.100|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘w3cbugs.csv’

     0K .......... .......... .......... .......... ..........  115K
    50K .......... .......... .......... .......... ..........  274K
   100K .......... .......... .......... .......... ..........  155K
   150K .......... .......... .......... .......... ..........  182K
   200K .......... .......... .......... .......... ..........  186K
   250K .....                                                  4.72M=1.5s

2015-09-05 06:30:50 (172 KB/s) - ‘w3cbugs.csv’ saved [261414]

FINISHED --2015-09-05 06:30:50--
Total wall clock time: 4.7s
Downloaded: 1 files, 255K in 1.5s (172 KB/s)
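One speculative reading of that log (not confirmed against the actual script): the ">" file contains wget's own log, with /dev/null rejected as a URL, which is exactly what happens when a redirection stored in an unquoted shell variable gets word-split into literal arguments:

```shell
# Hypothetical reconstruction: LOG was meant to be a redirection...
LOG='> /dev/null'
# ...but an unquoted expansion like
#   wget -o $LOG "$url"
# word-splits into:  wget -o '>' '/dev/null' "$url"
# so wget logs to a file literally named ">", treats /dev/null as a
# URL ("Scheme missing."), and then fetches the real URL, matching
# the contents of the mysterious ">" file above.
set -- $LOG
echo "argument 1: $1"
echo "argument 2: $2"
```

Quoting the expansion (or using a redirection directly) would avoid this class of bug.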

Getting rid of the Perl libxml prerequisite

I think this plan is good:

Add an endpoint to wattsi-server, called cldr.inc, which when pinged does the svn checkout of CLDR and the .cldr-processor.pl step and returns the result. This will generally be fast except the very first time.

We then add a guard in the build script (not sure how) so that if we find that XML::Parser is not installed, we skip the CLDR checkout and the .cldr-processor.pl step, and instead just download the result from the server.
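The guard itself is straightforward; a sketch, where the server URL is a placeholder:

```shell
# have_perl_module NAME: true if perl can load the module.
have_perl_module() {
  perl -M"$1" -e 1 2>/dev/null
}

if have_perl_module XML::Parser; then
  echo "building cldr.inc locally"
  # svn checkout + .cldr-processor.pl, as today
else
  echo "downloading prebuilt cldr.inc"
  # curl --fail --silent --output "$HTML_CACHE/cldr.inc" "$SERVER/cldr.inc"
fi
```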

@sideshowbarker, anything I'm missing? Seems like it should work.

Build tools should check to see if they are outdated before running

Given all the recent changes, I am a bit worried about people with outdated build tools running into problems.

If not run with --no-update, I think we should check for updates. I see a few options on how to implement this:

  • Find some GitHub website or API endpoint that will tell us the latest commit hash. Pro: doesn't modify the user's local checkout at all. Con: I haven't found one in 30 seconds of searching so maybe it doesn't exist.
  • Do git fetch then check against origin/master's HEAD revision. Potential minor cons: doesn't work if you checked out the build tools with a different remote name (like whatwg/master instead of origin/master), and does modify your local git checkout, which might be unexpected.

If we find an update or any other mismatch with origin/master's HEAD I think we should warn and give instructions. (Not error, and not auto-update.)
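A sketch of the second option, comparing HEAD against its tracking branch rather than hardcoding origin/master, which sidesteps the different-remote-name con (it does still modify the local checkout via the fetch):

```shell
# check_up_to_date DIR: fetch, compare HEAD to its upstream, and warn
# (not error, not auto-update) when they differ.
check_up_to_date() (
  cd "$1" || return 1
  git fetch --quiet origin
  if [ "$(git rev-parse HEAD)" != "$(git rev-parse '@{upstream}')" ]; then
    echo "Warning: build tools differ from upstream; consider git pull." >&2
    return 1
  fi
)
```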

Use a Git mirror of CLDR to get rid of Subversion dependency

I've set up https://github.com/foolip/cldr-data and a cron job using https://github.com/foolip/cldr-data-updater.

Edit: These repos have been removed, let me know if you want them back for some reason.

For me, checking out using svn takes 55 seconds, while a --depth 1 clone using git takes 32 seconds. Not amazing, but depending on how we do it, the incremental updating should be faster.

Does everyone hate git submodules? I think it'd be kind of nice to get all dependencies into Git and explicitly update dependencies, even if it's done by a roll bot.

Integrate build scripts with whatwg/html.

It seems strange to me that the build scripts are separate from the only source file that will ever use them. I'd suggest integrating them with the main HTML repository (perhaps in a build/ subdirectory) so that symlinks or copying of source files is no longer a necessary step in the process.

Restoring print.pdf

Given that we now use rsync we'd have to exclude the .cgi specifically or maybe we can exclude *.cgi to avoid revealing the filename? And I guess after rsync we should wget/curl the relevant remote URL? Is that okay to be public?

(And sorry for breaking this again without an upfront plan.)

cc @domenic @izh1979

Explore creating a Docker image with wattsi and other build dependencies

It seems like having an html-build Docker image could help solve problems for some contributors.

Using wattsi-server is probably the easiest solution for most contributors, but for contributors with less-stable or less-reliable Internet connections, a better solution might be to have the ability to run the build locally—but without also needing to deal with the not-so-easy steps of needing to build fpc and wattsi from the sources.

So my limited understanding of Docker makes me believe that it may provide a good solution in this case.

Split `build.sh` into separately-executable steps.

Currently, build.sh does everything every time. It would be lovely if we could at least split component updates (caniuse, unicode, etc) from the actual spec generation process such that they could be executed independently. There's no reason to require network access for spec generation, and hitting the network significantly slows things down.

Tweak commit snapshot production

We should align HTML's commit snapshots with the ones for Bikeshed-produced specs, such as https://streams.spec.whatwg.org/commit-snapshots/e75b9841572ae6153a167eea471433c55d06258e/.

  • Commit snapshots should have a big scary warning
  • Commit snapshots should say "Commit Snapshot" instead of "Living Standard" as their subtitle
  • Commit snapshots should have an appropriately-modified <title>
  • Commit snapshots should have a link back to the living spec in their header somewhere
  • The living standard should have a link to the current commit snapshot in its header somewhere
  • The commit-snapshots-shortcut-key.js file should be introduced to both the commit snapshot and the living standard

Design for output/input/etc.-splitting

We kind of mentioned this in #3 and @sideshowbarker is working on it. But let's outline what I envision a bit more explicitly.

Build script parameters:

  • source, corresponding to https://github.com/whatwg/html
  • cache, a cache directory where we store cached things (see below) to make incremental builds faster and to avoid network access
  • output, where it will put the final output files (currently listed in the readme)

Cache should contain:

  • cldr checkout. Currently in .cldr-data
  • w3cbugs.csv. Currently re-fetched every time as w3cbugs.csv
  • caniuse.json. Currently re-fetched every time as caniuse.json
  • entities.inc, entities-dtd.url, cldr.inc: currently created via fun scripts. These are included in the spec via <!-- BOILERPLATE $filename --> comments.
  • entities.json: Currently created via fun scripts. This is one of the output files.

All other files in https://github.com/whatwg/html-build/blob/master/.gitignore seem to be intermediate build files, and should ideally go in a temporary directory, not in the cache folder (these are distinct concepts).
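The cache/temp distinction can be encoded directly in the build script; a sketch, with all names as placeholders:

```shell
# setup_dirs SOURCE CACHE OUTPUT: the cache persists across builds to
# avoid network access; intermediate build files go to a temp dir that
# is removed on exit, keeping the two concepts distinct.
setup_dirs() {
  html_source=$1
  html_cache=$2
  html_output=$3
  html_temp=$(mktemp -d)
  trap 'rm -rf "$html_temp"' EXIT
  mkdir -p "$html_cache" "$html_output"
}
```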

Record the log somewhere visible

In particular it would be interesting if we could get alerts somehow when something is checked in that results in new errors or new XXX comments.

Wattsi is not re-run

I just hit a parse error and had to modify the build script to stop looking for the 65 thing to get the accurate location of the error.

I had just updated and built Wattsi, so something else seems amiss.

Workflow is not good for people without push access to whatwg/html who want to do PRs

Because it clones whatwg/html by default, if I directed a first-time contributor here, the instructions in the readme would get them stuck, unable to push after making changes.

I'd suggest a prompt in build.sh that asks them where to check out from. I am thinking:

Didn't find the HTML source on your system...
Enter where you would like to clone it from (GitHub username or URL):

(URL detection can be done by looking for :.)

Alternately a quick fix is just to update the readme to suggest checking out your fork in a sibling directory before building for the first time.
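The colon-based detection is a one-liner; a sketch, where the https URL form for bare usernames is an assumption:

```shell
# resolve_clone_url ANSWER: a ":" means a full URL (or scp-style
# remote); anything else is treated as a GitHub username.
resolve_clone_url() {
  case $1 in
    *:*) echo "$1" ;;
    *)   echo "https://github.com/$1/html.git" ;;
  esac
}
# read -r answer; git clone "$(resolve_clone_url "$answer")" html
```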

Server folders are not properly cleared

@zcorpan discovered that we have multiple 404.html and .htaccess resources, despite us having removed those from the build script a while back.

It seems this is due to htmlbuild/update-spec.sh invoking rsync without the --delete argument.

@domenic would it be safe to add that argument? Do you want to do that?

Also, should we have htmlbuild/ in version control somewhere?

Reduce file duplication

Due to

ln -s ../images $HTML_TEMP/wattsi-output/multipage-html/
ln -s ../link-fixup.js $HTML_TEMP/wattsi-output/multipage-html/
ln -s ../entities.json $HTML_TEMP/wattsi-output/multipage-html/

these files end up duplicated on our server. I would prefer we stop doing that and just refer to /{resource} instead as we already do for /404.html.

entities.json seems to require a change in whatwg/html. images/ too. link-fixup.js might require a change in wattsi since it's only used by multipage.

Thoughts?

Ideas for better continuous integration and deployment

Our current CI/CD situation works surprisingly well, but is rather hacked together. It consists basically of GitHub webhooks hitting some hand-crafted CGI, which then git-pulls from master, builds, and deploys.

What I would like to improve:

  • Contributors should be able to see the build results, including any errors, and the output html file. This should be linked in the GitHub interface, with a green checkmark or red X, like Travis does.
  • If the build/deployment gets broken on master, the team (= editors + MikeSmith + anyone else interested) should get an email.
  • It should be easy for the team to view the build status over time in a dashboard, similar to Travis CI.
  • Built results for master should be committed to a separate repository (whatwg/html-output?) for easy change-tracking, as has been requested a few times.
  • (Stretch goal) commit snapshots should be uploaded to html.spec.whatwg.org/commit-snapshots/, similar to https://streams.spec.whatwg.org/commit-snapshots/.

I think to do this properly we are probably going to want to learn about Jenkins or TeamCity and use one of those. I would love to use Travis CI, because the UI is great and I'm familiar with it, but we have so much tooling to install on each build that it doesn't seem terribly feasible, and my desire to e.g. deploy output snapshots for PRs seems beyond Travis's capabilities. Maybe their paid plan has this flexibility, but if I recall correctly it's costly.

If people in the community are knowledgeable about this kind of CI/CD work and would like to help guide us, or even help us get it set up, we'd be very grateful.
