manubot / rootstock

Clone me to create your Manubot manuscript

Home Page: https://manubot.github.io/rootstock/

License: Other

markdown manuscript publishing manubot doi citation continuous-publication

rootstock's Introduction

Python utilities for Manubot: Manuscripts, open and automated

Manubot is a workflow and set of tools for the next generation of scholarly publishing. This repository contains a Python package with several Manubot-related utilities, as described in the usage section below. Package documentation is available at https://manubot.github.io/manubot (auto-generated from the Python source code).

The manubot cite command-line interface retrieves and formats bibliographic metadata for user-supplied persistent identifiers like DOIs or PubMed IDs. The manubot process command-line interface prepares scholarly manuscripts for Pandoc consumption. The manubot process command is used by Manubot manuscripts, which are based on the Rootstock template, to automate several aspects of manuscript generation. The manubot ai-revision command revises a manuscript using suggestions generated by AI language models. See Rootstock's manuscript usage guide for more information.

Note: If you want to experience Manubot by editing an existing manuscript, see https://github.com/manubot/try-manubot. If you want to create a new manuscript, see https://github.com/manubot/rootstock.

To cite the Manubot project or for more information on its design and history, see:

Open collaborative writing with Manubot
Daniel S. Himmelstein, Vincent Rubinetti, David R. Slochower, Dongbo Hu, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
PLOS Computational Biology (2019-06-24) https://doi.org/c7np
DOI: 10.1371/journal.pcbi.1007128 · PMID: 31233491 · PMCID: PMC6611653

The Manubot version of this manuscript is available at https://greenelab.github.io/meta-review/.

Installation

If you are using the manubot Python package as part of a manuscript repository, installation of this package is handled through Rootstock's environment specification. For other use cases, this package can be installed via pip.

Install the latest release version from PyPI:

pip install --upgrade manubot

Or install from the source code on GitHub, using the version specified by a commit hash:

COMMIT=d2160151e52750895571079a6e257beb6e0b1278
pip install --upgrade git+https://github.com/manubot/manubot@$COMMIT

The --upgrade argument ensures pip updates an existing manubot installation if present.

Some functions in this package require Pandoc, which must be installed separately on the system. The pandoc-manubot-cite filter depends on Pandoc as well as panflute (a Python package). Users must install a compatible version of panflute based on their Pandoc version. For example, on a system with Pandoc 2.9, install the appropriate panflute like pip install panflute==1.12.5.

Usage

Installing the Python package creates the manubot command-line program. Here is the usage information as per manubot --help:

usage: manubot [-h] [--version] {process,cite,webpage,ai-revision} ...

Manubot: the manuscript bot for scholarly writing

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

subcommands:
  All operations are done through subcommands:

  {process,cite,webpage,ai-revision}
    process             process manuscript content
    cite                citekey to CSL JSON command line utility
    webpage             deploy Manubot outputs to a webpage directory tree
    ai-revision         revise manuscript content with language models

Note that all operations are done through the following sub-commands.

Process

The manubot process program is the primary interface to using Manubot. There are two required arguments: --content-directory and --output-directory, which specify the respective paths to the content and output directories. The content directory stores the manuscript source files. Files generated by Manubot are saved to the output directory.

One common setup is to create a directory for a manuscript that contains both the content and output directories. Under this setup, you can run Manubot using:

manubot process \
  --skip-citations \
  --content-directory=content \
  --output-directory=output

See manubot process --help for documentation of all command line arguments:

usage: manubot process [-h] --content-directory CONTENT_DIRECTORY
                       --output-directory OUTPUT_DIRECTORY
                       [--template-variables-path TEMPLATE_VARIABLES_PATH]
                       --skip-citations [--cache-directory CACHE_DIRECTORY]
                       [--clear-requests-cache] [--skip-remote]
                       [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Process manuscript content to create outputs for Pandoc consumption. Performs
bibliographic processing and templating.

options:
  -h, --help            show this help message and exit
  --content-directory CONTENT_DIRECTORY
                        Directory where manuscript content files are located.
  --output-directory OUTPUT_DIRECTORY
                        Directory to output files generated by this script.
  --template-variables-path TEMPLATE_VARIABLES_PATH
                        Path or URL of a file containing template variables
                        for jinja2. Serialization format is inferred from the
                        file extension, with support for JSON, YAML, and TOML.
                        If the format cannot be detected, the parser assumes
                        JSON. Specify this argument multiple times to read
                        multiple files. Variables can be applied to a
                        namespace (i.e. stored under a dictionary key) like
                        `--template-variables-path=namespace=path_or_url`.
                        Namespaces must match the regex `[a-zA-
                        Z_][a-zA-Z0-9_]*`.
  --skip-citations      Skip citation and reference processing. Support for
                        citation and reference processing has been moved from
                        `manubot process` to the pandoc-manubot-cite filter.
                        Therefore this argument is now required. If citation-
                        tags.tsv is found in content, these tags will be
                        inserted in the markdown output using the reference-
                        link syntax for citekey aliases. Appends
                        content/manual-references*.* paths to Pandoc's
                        metadata.bibliography field.
  --cache-directory CACHE_DIRECTORY
                        Custom cache directory. If not specified, caches to
                        output-directory.
  --clear-requests-cache
  --skip-remote         Do not add the rootstock repository to the local git
                        repository remotes.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging
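
The namespace syntax accepted by --template-variables-path can be illustrated with a short Python sketch. The split_namespace helper below is hypothetical and is not Manubot's implementation; it only applies the namespace regex quoted in the help text to decide whether an argument carries a namespace= prefix.

```python
import re

# Namespace pattern quoted in the --template-variables-path help text,
# followed by "=" and the remaining path or URL.
NAMESPACE_PATTERN = re.compile(r"([a-zA-Z_][a-zA-Z0-9_]*)=(.+)")


def split_namespace(argument):
    """Return (namespace, path_or_url); namespace is None when absent.

    Hypothetical helper for illustration, not part of the manubot package.
    """
    match = NAMESPACE_PATTERN.fullmatch(argument)
    if match:
        return match.group(1), match.group(2)
    return None, argument


print(split_namespace("zenodo=https://example.org/variables.json"))
print(split_namespace("content/variables.yaml"))
```

Because the pattern requires a valid identifier immediately before the first equals sign, a plain URL that happens to contain a query string is left without a namespace in this sketch.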

Manual references

Manubot has the ability to rely on user-provided reference metadata rather than generating it. manubot process searches the content directory for files containing manually-provided reference metadata that match the glob manual-references*.*. These files are stored in the Pandoc metadata bibliography field, such that they can be loaded by pandoc-manubot-cite.
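
To see which files the manual-references*.* glob would pick up, the following Python sketch may help. The find_manual_reference_files helper and the filenames are made up for illustration; Manubot's own search may differ, for example in ordering or recursion.

```python
import tempfile
from pathlib import Path


def find_manual_reference_files(content_directory):
    """List files in content_directory matching manual-references*.*.

    Hypothetical helper for illustration, not part of the manubot package.
    """
    return sorted(Path(content_directory).glob("manual-references*.*"))


# Demonstrate on a throwaway directory with made-up filenames.
with tempfile.TemporaryDirectory() as directory:
    for name in ["manual-references.json", "manual-references-extra.bib", "metadata.yaml"]:
        Path(directory, name).touch()
    matched_names = [path.name for path in find_manual_reference_files(directory)]

print(matched_names)
```

In this sketch metadata.yaml is skipped because it does not start with manual-references, while both the JSON and the BibTeX file match.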

Cite

manubot cite is a command line utility to produce bibliographic metadata for citation keys. By default the utility outputs metadata as CSL JSON items; it produces formatted references when --format selects a rendered format such as markdown or html.

Citation keys should be in the format prefix:accession. For example, the following command generates Markdown-formatted references for four persistent identifiers:

manubot cite --format=markdown \
  doi:10.1098/rsif.2017.0387 pubmed:29424689 pmc:PMC5640425 arxiv:1806.05726
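
The prefix:accession structure of the citation keys above can be sketched in Python. The split_citekey function is a hypothetical helper, not Manubot's parser, and it deliberately ignores the prefix inference that manubot cite performs unless --no-infer-prefix is given.

```python
def split_citekey(citekey):
    """Split 'prefix:accession' at the first colon.

    Hypothetical helper for illustration, not part of the manubot package.
    """
    prefix, separator, accession = citekey.partition(":")
    if not separator or not prefix or not accession:
        raise ValueError(f"expected prefix:accession, got {citekey!r}")
    return prefix, accession


citekeys = [
    "doi:10.1098/rsif.2017.0387",
    "pubmed:29424689",
    "pmc:PMC5640425",
    "arxiv:1806.05726",
]
for citekey in citekeys:
    print(split_citekey(citekey))
```

Splitting only at the first colon keeps accessions that themselves contain colons, such as URLs, intact.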

The following terminal recording demonstrates the main features of manubot cite (for a slightly outdated version):

Additional usage information is available from manubot cite --help:

usage: manubot cite [-h] [--output OUTPUT]
                    [--format {csljson,cslyaml,plain,markdown,docx,html,jats} | --yml | --txt | --md]
                    [--csl CSL] [--bibliography BIBLIOGRAPHY]
                    [--no-infer-prefix] [--allow-invalid-csl-data]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    citekeys [citekeys ...]

Generate bibliographic metadata in CSL JSON format for one or more citation
keys. Optionally, render metadata into formatted references using Pandoc. Text
outputs are UTF-8 encoded.

positional arguments:
  citekeys              One or more (space separated) citation keys to
                        generate bibliographic metadata for.

options:
  -h, --help            show this help message and exit
  --output OUTPUT       Specify a file to write output, otherwise default to
                        stdout.
  --format {csljson,cslyaml,plain,markdown,docx,html,jats}
                        Format to use for output file. csljson and cslyaml
                        output the CSL data. All other choices render the
                        references using Pandoc. If not specified, attempt to
                        infer this from the --output filename extension.
                        Otherwise, default to csljson.
  --yml                 Short for --format=cslyaml.
  --txt                 Short for --format=plain.
  --md                  Short for --format=markdown.
  --csl CSL             URL or path with CSL XML style used to style
                        references (i.e. Pandoc's --csl option). Defaults to
                        Manubot's style.
  --bibliography BIBLIOGRAPHY
                        File to read manual reference metadata. Specify
                        multiple times to load multiple files. Similar to
                        pandoc --bibliography.
  --no-infer-prefix     Do not attempt to infer the prefix for citekeys
                        without a known prefix.
  --allow-invalid-csl-data
                        Allow CSL Items that do not conform to the JSON
                        Schema. Skips CSL pruning.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

Pandoc filter

This package creates the pandoc-manubot-cite Pandoc filter, providing access to Manubot's cite-by-ID functionality from within a Pandoc workflow.

Options are set via Pandoc metadata fields listed in the docs.

usage: pandoc-manubot-cite [-h] [--input [INPUT]] [--output [OUTPUT]]
                           target_format

Pandoc filter for citation by persistent identifier. Filters are command-line
programs that read and write a JSON-encoded abstract syntax tree for Pandoc.
Unless you are debugging, run this filter as part of a pandoc command by
specifying --filter=pandoc-manubot-cite.

positional arguments:
  target_format      output format of the pandoc command, as per Pandoc's --to
                     option

options:
  -h, --help         show this help message and exit
  --input [INPUT]    path read JSON input (defaults to stdin)
  --output [OUTPUT]  path to write JSON output (defaults to stdout)

Other Pandoc filters exist that do something similar: pandoc-url2cite, pandoc-url2cite-hs, & pwcite. Currently, pandoc-manubot-cite supports the most types of persistent identifiers. We're interested in creating as much compatibility as possible between these filters and their syntaxes.

Manual references

Manual references are loaded from the references and bibliography Pandoc metadata fields. If a manual reference filename ends with .json or .yaml, it's assumed to contain CSL Data (i.e. Citation Style Language JSON). Otherwise, the format is inferred from the extension and converted to CSL JSON using the pandoc-citeproc --bib2json utility. The standard citation key for manual references is inferred from the CSL JSON id or note field. When no prefix is provided, such as doi:, url:, or raw:, a raw: prefix is automatically added. If multiple manual reference files load metadata for the same standard citation id, precedence is assigned according to descending filename order.

Webpage

The manubot webpage command populates a webpage directory with Manubot output files.

usage: manubot webpage [-h] [--checkout [CHECKOUT]] [--version VERSION]
                       [--timestamp] [--no-ots-cache | --ots-cache OTS_CACHE]
                       [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Update the webpage directory tree with Manubot output files. This command
should be run from the root directory of a Manubot manuscript that follows the
Rootstock layout, containing `output` and `webpage` directories. HTML and PDF
outputs are copied to the webpage directory, which is structured as static
source files for website hosting.

options:
  -h, --help            show this help message and exit
  --checkout [CHECKOUT]
                        branch to checkout /v directory contents from. For
                        example, --checkout=upstream/gh-pages. --checkout is
                        equivalent to --checkout=gh-pages. If --checkout is
                        ommitted, no checkout is performed.
  --version VERSION     Used to create webpage/v/{version} directory.
                        Generally a commit hash, tag, or 'local'. When
                        omitted, version defaults to the commit hash on CI
                        builds and 'local' elsewhere.
  --timestamp           timestamp versioned manuscripts in webpage/v using
                        OpenTimestamps. Specify this flag to create timestamps
                        for the current HTML and PDF outputs and upgrade any
                        timestamps from past manuscript versions.
  --no-ots-cache        disable the timestamp cache.
  --ots-cache OTS_CACHE
                        location for the timestamp cache (default:
                        ci/cache/ots).
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

AI-assisted academic authoring

The manubot ai-revision command uses large language models from OpenAI to automatically revise a manuscript and suggest text improvements.

usage: manubot ai-revision [-h] --content-directory CONTENT_DIRECTORY
                           [--config-directory CONFIG_DIRECTORY]
                           [--model-type MODEL_TYPE]
                           [--model-kwargs key=value [key=value ...]]
                           [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Revise manuscript content using AI models to suggest text improvements.

options:
  -h, --help            show this help message and exit
  --content-directory CONTENT_DIRECTORY
                        Directory where manuscript content files are located.
  --config-directory CONFIG_DIRECTORY
                        Directory where AI revision configuration files are
                        located. If unspecified, disables custom
                        configuration.
  --model-type MODEL_TYPE
                        Model type used to revise the manuscript. Default is
                        GPT3CompletionModel. It can be any subclass of
                        manubot_ai_editor.models.ManuscriptRevisionModel
  --model-kwargs key=value [key=value ...]
                        Keyword arguments for the revision model (--model-
                        type), with format key=value.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

The usual call is:

manubot ai-revision --content-directory content/

The parameters --model-type and --model-kwargs are used for debugging purposes. For example, since the tool splits the text into paragraphs, you might want to see if paragraphs were detected correctly. The tool incurs a cost when using the OpenAI API, so this could be important to check for text with complicated structure.

manubot ai-revision \
  --content-directory content/ \
  --model-type DummyManuscriptRevisionModel \
  --model-kwargs add_paragraph_marks=true
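
How key=value pairs such as add_paragraph_marks=true could be turned into keyword arguments is sketched below in Python. This is an assumption-laden illustration with argparse, not the code used by manubot ai-revision: the parser name and the parse_model_kwargs helper are hypothetical, and values are kept as strings.

```python
import argparse


def parse_model_kwargs(pairs):
    """Convert ['key=value', ...] into a dictionary of strings."""
    kwargs = {}
    for pair in pairs:
        key, separator, value = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        kwargs[key] = value
    return kwargs


# Illustrative parser that mirrors only the two options discussed above.
parser = argparse.ArgumentParser(prog="ai-revision-sketch")
parser.add_argument("--model-type", default="GPT3CompletionModel")
parser.add_argument("--model-kwargs", nargs="+", default=[], metavar="key=value")

args = parser.parse_args(
    [
        "--model-type", "DummyManuscriptRevisionModel",
        "--model-kwargs", "add_paragraph_marks=true",
    ]
)
model_kwargs = parse_model_kwargs(args.model_kwargs)
print(args.model_type, model_kwargs)
```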

Development

Environment

Create a development environment using:

conda create --name manubot-dev --channel conda-forge \
  python=3.11 pandoc=2.11.3.1
conda activate manubot-dev  # assumes conda >= 4.4
pip install --editable ".[webpage,dev]"

Commands

Below are some common commands used for development. They assume the working directory is set to the repository's root, and the conda environment is activated.

# run the test suite
pytest

# install pre-commit git hooks (once per local clone).
# The pre-commit checks declared in .pre-commit-config.yaml will now
# run on changed files during git commits.
pre-commit install

# run the pre-commit checks (required to pass CI)
pre-commit run --all-files

# commit despite failing pre-commit checks (will fail CI)
git commit --no-verify

# regenerate the README codeblocks for --help messages
python manubot/tests/test_readme.py

# generate the docs
portray as_html --overwrite --output_dir=docs

# process the example testing manuscript
manubot process \
  --content-directory=manubot/process/tests/manuscripts/example/content \
  --output-directory=manubot/process/tests/manuscripts/example/output \
  --skip-citations \
  --log-level=INFO

Release instructions

This section is only relevant for project maintainers. GitHub Actions deploys releases to PyPI.

To create a new release, bump the __version__ in manubot/__init__.py. Then, set the TAG and OLD_TAG environment variables:

TAG=v$(python setup.py --version)

# fetch tags from the upstream remote
# (assumes upstream is the manubot organization remote)
git fetch --tags upstream main

# get previous release tag, can hardcode like OLD_TAG=v0.3.1
OLD_TAG=$(git describe --tags --abbrev=0)

The following commands can help draft release notes:

# check out a branch for a pull request as needed
git checkout -b "release-$TAG"

# create release notes file if it doesn't exist
touch "release-notes/$TAG.md"

# commit list since previous tag
echo $'\n\nCommits\n-------\n' >> "release-notes/$TAG.md"
git log --oneline --decorate=no --reverse $OLD_TAG..HEAD >> "release-notes/$TAG.md"

# commit authors since previous tag
echo $'\n\nCode authors\n------------\n' >> "release-notes/$TAG.md"
git log $OLD_TAG..HEAD --format='%aN <%aE>' | sort --unique >> "release-notes/$TAG.md"

After a commit with the above updates is part of upstream:main, for example after a PR is merged, use the GitHub interface to create a release with the new "Tag version". Monitor GitHub Actions and PyPI for successful deployment of the release.

Goals & Acknowledgments

Our goal is to create scholarly infrastructure that encourages open science and assists reproducibility. Accordingly, we hope for the Manubot software and philosophy to be adopted widely, by both academic and commercial entities. As such, Manubot is free/libre and open source software (see LICENSE.md).

We would like to thank the contributors and funders whose support makes this project possible. Specifically, Manubot development has been financially supported by:

the Alfred P. Sloan Foundation in Grant G-2018-11163 to @dhimmel.
the Gordon & Betty Moore Foundation (@DDD-Moore) in Grant GBMF4552 to @cgreene.

rootstock's Issues

Simplifying authors.tsv to manuscript conversion

Currently author parsing is disabled in this repo. I'm thinking of simplifying the TSV format and how it gets added to the manuscript. Basically, here would be the columns:

github_username
full_name
initials (possibly)
orcid
affiliations
funding
email
symbols (superscript symbols to add next to name). Could be a symbol for corresponding or contributed equally, or anything else. The symbols would be manually defined in the markdown doc.

I was thinking of removing the approve column, and going for each author submits a PR to add their name, hence approving.

Unlike the system for the deep review, the build system would not try to condense affiliations or funding across authors. In other words, each author would get their details printed next to their name. There would be more duplication of text, but this system will be more reliable. Additionally, we may eventually move to putting much of this info in tooltips for the HTML version.
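
A rough Python sketch of this per-author rendering follows. The column names come from the list above; the sample rows, the output layout, and the format_authors helper are made up for illustration.

```python
import csv
import io

# Made-up rows using the proposed columns; all values are placeholders.
AUTHORS_TSV = (
    "github_username\tfull_name\tinitials\torcid\taffiliations\tfunding\temail\tsymbols\n"
    "jdoe\tJane Doe\tJD\t0000-0000-0000-0000\tExample University\tGrant 123\tjane@example.org\t*\n"
    "rroe\tRichard Roe\tRR\t\tExample Institute\t\t\t\n"
)


def format_authors(tsv_text):
    """Render one line per author without merging details across authors."""
    lines = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        details = [row[key] for key in ("affiliations", "funding") if row[key]]
        line = row["full_name"] + row["symbols"]
        if details:
            line += " (" + "; ".join(details) + ")"
        lines.append(line)
    return lines


author_lines = format_authors(AUTHORS_TSV)
print("\n".join(author_lines))
```

Each author is handled independently, so shared affiliations are repeated rather than condensed, which is the trade-off described above.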

@agitter what do you think? Feel free to disagree!

Adopting the pandoc-citeproc markdown citation format

Currently we cite multiple documents like:

Several groups [@doi:10.1371/journal.pone.0032235 @doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated

Prior to pandoc, this gets converted to:

Several groups [@1AlhRKQbe; @ZzaRyGuJ; @UpFrhdJf] initiated

Then post pandoc conversion, it will look like:

Several groups [30,192,193] initiated

Note how we have to add semicolons to separate each reference. We figured this out at lierdakil/pandoc-crossref#110. It would be nice to align our format with the pandoc-citeproc format. This presumably would also allow us to make non-bracketed citations like:

@doi:10.1371/journal.pone.0032235 was the first group

This would presumably render to

Qi et al 2012 was the first group

However, I haven't found the actual docs for the markdown citation formatting supported by pandoc-citeproc (docs). Tagging @lierdakil and @slochower in case they have any insights.
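
The semicolon insertion described above can be sketched in Python. The add_semicolons function is a hypothetical illustration and not Manubot's conversion code; in particular it does not shorten the citation keys the way the real preprocessing step does.

```python
import re

# A bracketed run of @citekeys separated only by whitespace.
CITATION_GROUP = re.compile(r"\[(@[^\s\];]+(?:\s+@[^\s\];]+)*)\]")


def add_semicolons(text):
    """Rewrite [@a @b] as [@a; @b]; groups that already use semicolons are untouched."""
    return CITATION_GROUP.sub(
        lambda match: "[" + "; ".join(match.group(1).split()) + "]", text
    )


before = (
    "Several groups [@doi:10.1371/journal.pone.0032235 "
    "@doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated"
)
after = add_semicolons(before)
print(after)
```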

Renaming repository from manubot-rootstock

I don't think manubot-rootstock is a terrible repo name, but it may not be the best. The two biggest problems I see are that it's:

long
confusing since many users may not be familiar with rootstocks

Here are some other names I jotted down:

manubot-init
manubot-model
manubot-stemcell
manubot-template
manucat
manucross
manumorph
manuroot
manusource
manustart

Alphabetically sorted so I don't bias others with my ranking. Since this would be a slightly disruptive change, we should only make it if we feel any of these names is much better.

CC @cgreene @agitter @slochower. I do like the name "Manubot" for the overall system and the python package. That's why all of these names stick to the manu* theme.

Replace OpenTimestamps submodule with pip install

OpenTimestamps is now on PyPI (announcement). Install with:

pip install opentimestamps-client

We should also update python-bitcoinlib to v0.8.0.

Setting up manubot-rootstock for GitLab with GitLab-CI

Just a suggestion: https://about.gitlab.com/features/gitlab-ci-cd/ :)

Update CSS to left justify table captions

I suggest changing the CSS to left justify table captions, since the main text is left justified and the figure captions are as well. For example, only the table caption in this document is center justified: https://greenelab.github.io/meta-review/v/b8eeea542ce238bbcaf2023add2aecb86ef726bd/

It's not immediately obvious where to change the CSS to accomplish this, but I didn't look thoroughly.

Preserving old versions of files on the gh-pages branch via directories

The gh-pages branch is responsible for the GitHub Pages site and contains output HTML, PDF, CSS, image, and OTS files. Currently, new manuscript builds overwrite the files, which are in the root directory of this branch:

https://github.com/greenelab/manubot-rootstock/blob/f165f609f33b11fdf71a0db6435d4dd159f23973/ci/deploy.sh#L62-L68

I propose instead creating a directory structure, so all past outputs on gh-pages are preserved through versioned directories. The version would be the master commit that the build was based on (i.e. $TRAVIS_COMMIT). For example, I commit f165f60 to master. The outputs that currently go to the root directory of gh-pages would instead go to the v/f165f609f33b11fdf71a0db6435d4dd159f23973 directory (v for version). The latest HTML and PDF manuscript would stay available at their current URLs, probably via symbolic links (see here for how symlinks act with GitHub Pages).

We could use redirects, so v/freeze redirects to the latest versioned directory.

The benefits of this change are twofold:

You can view outdated versions of the HTML manuscript. Right now, you can only see the rendered HTML for the latest version.
The OpenTimestamp .ots files need to be upgraded. Until they're upgraded, they depend on a calendar service for verification. Currently, we haven't upgraded timestamps, which creates the possibility that we may be unable to prove existence if the calendar goes down. Note that the timestamps can only be upgraded after the bitcoin transaction confirms, which could be days. That's why we don't specify --wait in our builds. Anyways, previously I was planning on rewriting the gh-pages history to upgrade timestamps in past commits. However, rewriting history is dangerous. It would be preferable to be able to upgrade past timestamps without rewriting history, which this proposal would enable.

The main disadvantage I can think of is repository size, since more files are being tracked. However, I'm not sure it'd be any bigger, since all files are currently in the git history at some point. According to this:

even if you have multiple files with the same contents but different names or in different locations or from different commits only one copy would ever be saved but with several pointers to it in each commit tree.

Shallow cloning would lose its savings, but I'm not sure we care.

One final point to consider is that a single commit will sometimes be deployed multiple times (say if the CI build is rerun). They will not always be the same. For the same source commit, I think we'd use the latest build.
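
The proposed layout can be sketched in Python under a few assumptions: outputs are flat files, root-level symbolic links point at the newest version, and a rebuild of the same commit replaces its directory. The deploy_versioned function and the second commit hash are hypothetical; this is not the deploy script.

```python
import os
import shutil
import tempfile
from pathlib import Path


def deploy_versioned(output_dir, pages_dir, commit):
    """Copy build outputs to pages_dir/v/<commit> and link them from the root."""
    version_dir = Path(pages_dir) / "v" / commit
    # Rebuilding the same commit replaces its directory, so the latest build wins.
    if version_dir.exists():
        shutil.rmtree(version_dir)
    shutil.copytree(output_dir, version_dir)
    for path in version_dir.iterdir():
        link = Path(pages_dir) / path.name
        if link.is_symlink() or link.exists():
            link.unlink()
        # A relative target keeps links valid wherever the branch is checked out.
        os.symlink(os.path.join("v", commit, path.name), link)
    return version_dir


with tempfile.TemporaryDirectory() as tmp:
    output = Path(tmp, "output")
    output.mkdir()
    pages = Path(tmp, "gh-pages")
    for commit in ["f165f60", "0a1b2c3"]:  # second hash is made up
        (output / "index.html").write_text(f"build of {commit}")
        deploy_versioned(output, pages, commit)
    latest_text = (pages / "index.html").read_text()
    versions = sorted(path.name for path in (pages / "v").iterdir())

print(latest_text, versions)
```

After two deployments both versioned directories remain, while the root-level index.html resolves to the most recent one.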

Toggle Annotate/Highlight Popup // Add Unhighlight Ability

I highlight text with my mouse as I'm reading this. Apparently, there are lots of us who do this.

This is really annoying on Manubot HTML outputs because the highlight popup comes up every time. One time I clicked it by mistake and now there's no way for me to get rid of my highlight and I feel like a jackass who highlighted some unimportant text.

It'd be great if I could a) toggle the highlight-popup and b) un-highlight.

SVG images fail to export to PDF

Refs jgm/pandoc#265

Setup commands fail on macOS

Some of the commands in SETUP.md fail on macOS. IIRC, these commands are:

TRAVIS_ENCRYPT_ID=`grep \
  --only-matching --perl-regexp \
  --regexp='(?<=encrypted_)[a-zA-Z0-9]+(?=_key)' \
  travis-encrypt-file.log`
sed --in-place "s/f2f00aaf6402/$TRAVIS_ENCRYPT_ID/g" deploy.sh

sed --in-place "s/greenelab/$OWNER/g" README.md
sed --in-place "s/manubot-rootstock/$REPO/g" README.md

The issue is likely that the mac versions of these utilities don't support the same long arguments. What a shame.
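
One platform-independent alternative would be to do the same extraction and substitution in Python, which avoids the GNU-only long options. The sketch below is an illustration rather than a tested replacement for SETUP.md: the helper names and file contents are made up, the first match is used where grep would print every match, and the replacement is a literal string swap rather than a sed regex.

```python
import re
import tempfile
from pathlib import Path


def extract_encrypt_id(log_text):
    """Return the first identifier found between 'encrypted_' and '_key'."""
    match = re.search(r"(?<=encrypted_)[a-zA-Z0-9]+(?=_key)", log_text)
    if match is None:
        raise ValueError("no encrypted_<id>_key variable found")
    return match.group(0)


def replace_in_file(path, old, new):
    """Replace every literal occurrence of old with new, rewriting the file."""
    path = Path(path)
    path.write_text(path.read_text().replace(old, new))


with tempfile.TemporaryDirectory() as directory:
    # Made-up contents standing in for travis-encrypt-file.log and deploy.sh.
    log = Path(directory, "travis-encrypt-file.log")
    log.write_text(
        "openssl aes-256-cbc -K $encrypted_0123456789ab_key "
        "-iv $encrypted_0123456789ab_iv\n"
    )
    deploy = Path(directory, "deploy.sh")
    deploy.write_text("KEY=$encrypted_f2f00aaf6402_key\n")

    encrypt_id = extract_encrypt_id(log.read_text())
    replace_in_file(deploy, "f2f00aaf6402", encrypt_id)
    deploy_text = deploy.read_text()

print(encrypt_id)
print(deploy_text)
```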

Page numbers in the print / PDF output

Some consider the lack of page numbers to be disturbing.

Creating a diff between two manuscript versions

Oftentimes, it's important (and required in scholarly publishing) to show the changes between two versions of a manuscript. It would be ideal if Manubot users could "track changes" between two manuscript versions.

Pandoc doesn't have builtin support for diffs: jgm/pandoc#2374. Other options would be:

Exporting to latex and using latexdiff
Exporting to docx and using LibreOffice's Compare Document feature, which is currently not accessible via the command line
Exporting to ODT and using oodiff
Diffing manuscript.md as a text file (perhaps using diff, prettydiff, or rich-text-diff)
Use GitHub's rich diff view preview or react-rich-diff
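
As a minimal illustration of the plain-text option, the following Python sketch diffs two versions of manuscript.md with the standard library's difflib. The sample text and the manuscript_diff helper are made up, and a line-based diff like this would not be aware of Markdown structure or rendered output.

```python
import difflib


def manuscript_diff(old_text, new_text):
    """Return unified-diff lines between two versions of manuscript.md."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="manuscript.md (old)",
            tofile="manuscript.md (new)",
            lineterm="",
        )
    )


old_version = "# Title\n\nManubot is a tool.\n"
new_version = "# Title\n\nManubot is a workflow and a set of tools.\n"
diff_lines = manuscript_diff(old_version, new_version)
print("\n".join(diff_lines))
```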

ReScience

The ReScience journal could be a potential use case for manubot-rootstock. From https://arxiv.org/abs/1707.04393:

The main inconvenience of the GitHub platform is its almost complete lack of support for the publishing steps, once a submission has successfully passed the reviewing process. At this point, the submission consists of an article text in Markdown format plus a set of code and data files in a git repository. The desired archival form is an article in PDF format plus a permanent archive of the submitted code and data, with a Digital Object Identifier (DOI) providing a permanent reference. The Zenodo platform allows straightforward archiving of snapshots of a repository hosted on GitHub, and issues a DOI for the archive. This leaves the task of producing a PDF version of the article, which is currently handled by the managing editor of the submission, in order to ease the technical burden on our authors

Show context for references

Building on @dhimmel's post on author versus numeric citation styles, another advantage of author-based citations in the current version of Manubot is that it is easier to find where a reference is cited. I can search for Pantcheva, 2018 more easily than 13, for instance, especially if 13 is cited as 12-14 or appears in numeric parts of the text.

A nice feature for numeric citations might be a form of "show context" that some journals use. https://www.nature.com/articles/ncomms12989#references is an arbitrary example. The context consists of snippets of the manuscript where the reference was used plus links back to those locations.

This would also give us one way to address #117. We could assert that the reference number is an increasing function of the reference's first context.
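
The ordering assertion can be sketched in Python as follows. The index_citations helper and the citation keys are hypothetical; the sketch assumes citations have already been extracted from the manuscript in reading order.

```python
def index_citations(citekeys_in_order):
    """Number references by first use and record every position where each is cited."""
    numbers = {}
    contexts = {}
    for position, citekey in enumerate(citekeys_in_order):
        # setdefault leaves the number unchanged when the key was seen before.
        numbers.setdefault(citekey, len(numbers) + 1)
        contexts.setdefault(citekey, []).append(position)
    return numbers, contexts


cited = ["smith2018", "lee2017", "smith2018", "pantcheva2018"]  # made-up keys
numbers, contexts = index_citations(cited)

# Reference numbers should increase with the position of the first context.
first_uses = [contexts[key][0] for key in sorted(numbers, key=numbers.get)]
assert first_uses == sorted(first_uses)
print(numbers, contexts)
```

The contexts mapping is the raw material for a "show context" feature: each position could link back to the passage where the reference was cited.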

Consider wkhtmltopdf alternatives for HTML to PDF export

I was browsing recent pandoc commits and saw jgm/pandoc@c7e3c1e, refs jgm/pandoc#3909 and jgm/pandoc#3906.

We should look into WeasyPrint and Prince.

This could help with the lack of SVG image export in wkhtmltopdf as well as some of the aesthetic issues. In addition, our conda install of wkhtmltopdf is Linux only.

Integrating Manubot and Idyll

Idyll stands for Interactive Document Language and is a "markup language for interactive documents." The current description reads:

Idyll extends the ubiquitous Markdown format to enable the creation of dynamic, interactive narratives for the web. The language and toolchain aim to empower journalists, researchers, and technical experts to create compelling content using familiar tools and processes.

Idyll can be used to create explorable explanations, to power blog engines and content management systems, and to generate dynamic technical reports. The tool can generate standalone webpages or be embedded inside of your existing site.

Taking a look at an example was helpful. See Idyll on GitHub at idyll-lang/idyll.

@cgreene met the Idyll folks recently and wondered whether it'd be helpful for the Deep Review in greenelab/deep-review#842.

This issue is for discussing whether there is synergy between Idyll and Manubot, and whether there's an opportunity to integrate them in some form.

CC @mathisonian @AndrewGYork @marciovm.

@AndrewGYork is also working on interactive papers hosted via GitHub (example).

Instructions for manual references

It would be helpful to describe the usage of manual-references.json in references/README.md. I can make a pull request myself (eventually).

Enable more advanced math rendering by default

The current default math used in our pandoc build command is severely limited: see the "TeX math in HTML" section of the pandoc demos. Pandoc has support for several more advanced methods for math rendering in HTML.

The question is which one to choose. I've seen MathJax used before in scholarly publishing. However, KaTeX is faster to render. There are also several more options.

@slochower did you look into the math options at all for b03e1c3?

Update manual reference guidelines with link to examples

This page provides some nice examples of the CSL metadata for different document types. Would be nice to add to docs.

Autogenerate DOIs (with Zenodo?) based on releases/tags

At the moment, PDFs get pushed to PeerJ. But you could use the GitHub-Zenodo integration to snapshot the whole repo and give it a DOI.

Allow additional information in metadata.yaml

Based on some of the points already discussed on deep review and greenelab/meta-review#75 (comment), I think adding a few additional variables to the metadata would help Manubot be a little more flexible. Some ideas of what we might want to allow:

Address information (in addition to affiliation)
Corresponding author status
First/co-first author status
Specification of author symbol

For the last three, I think we could implement sensible defaults in the jinja template to use if not specified. For example, corresponding author status may be set to "no" unless it is explicitly set to "yes."
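As a rough illustration of the defaults idea, here is a minimal Python sketch that fills in unset optional author fields before templating. The key names (`corresponding`, `cofirst`, `symbol`) are hypothetical and not taken from the current metadata.yaml layout.

```python
# Hypothetical defaults for optional author fields; the key names are
# illustrative and not part of any existing metadata.yaml schema.
AUTHOR_DEFAULTS = {
    "corresponding": "no",
    "cofirst": "no",
    "symbol": None,
}


def with_defaults(author):
    """Return a copy of an author record with unset optional fields filled in."""
    return {**AUTHOR_DEFAULTS, **author}


authors = [
    {"name": "Author One", "corresponding": "yes"},
    {"name": "Author Two"},
]
resolved = [with_defaults(author) for author in authors]
```

Explicit values win over the defaults, so only authors who set "yes" are treated as corresponding.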

Forking a new manuscript documentation

I started thinking about more detailed documentation for someone who wanted to create a new manuscript using this repository as a template. They could fork through GitHub, but that would only support a single manuscript per user.

The process I'm trying is roughly:

Clone https://github.com/greenelab/manubot-rootstock locally and rename manubot-rootstock to my desired new manuscript name
Create a new repository on GitHub with that name
Set my new GitHub remote as origin and manubot-rootstock as upstream
Enable Travis CI and change the readme links
Change the github.io links
Execute some (or all?) of initialize.sh to create the necessary remote branches
Generate and configure the GitHub SSH deploy key for deploy.sh

Is there a more streamlined process we could recommend? Am I missing any steps?

Increase top page margin in print media

The printed page margin was a bit too small on the top for the Sci-Hub manuscript. PeerJ applied their own banner which overlaps with some of the text. See https://peerj.com/preprints/3100v1.pdf

For example,

The other margins looked fine.

Add functionality to add banner in HTML template

It could be useful to have a simple way to add info text in a highly visible banner like "Work in progress" or "Published, peer-reviewed version at [...]" to the head of the HTML file with some simple config.

(From #127)

SETUP.md repo sed substitution is failing

At the OpenCon do-a-thon, we've had 2 users experience potentially faulty substitutions. Rather than rebranding their README to USER/REPO, their README.md is rebranded to USER/USER. Possibly introduced in #84?

The two examples are https://github.com/zambujo/manubot/commit/10397d6a05235c3517ac981b9b3c67920c226b9a and broadwym/manu1@64954e5.

Interestingly, one user did not have the issue: https://github.com/schliebs/open_manuscript/commit/77da6c844ac061061c03b93721e7eade90fabd99, making me wonder whether it's user error or not.

SETUP.md currently uses:

sed "s/greenelab/$OWNER/g" README.md > tmp && mv -f tmp README.md
sed "s/manubot-rootstock/$REPO/g" README.md > tmp && mv -f tmp README.md
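For checking the expected output, the same two substitutions can be mirrored in a few lines of Python. This is only a sketch for reasoning about the commands, not a diagnosis of the failures; the sample text and the `rebrand` helper are made up for illustration.

```python
def rebrand(text, owner, repo):
    """Apply the two SETUP.md substitutions in the same order as the sed commands."""
    text = text.replace("greenelab", owner)
    text = text.replace("manubot-rootstock", repo)
    return text


sample = "https://github.com/greenelab/manubot-rootstock"
expected = rebrand(sample, owner="USER", repo="REPO")

# One way to end up with USER/USER is for the repo variable to hold the
# owner's name when the second substitution runs (an assumption, not confirmed).
suspicious = rebrand(sample, owner="USER", repo="USER")
```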

@vsmalladi any ideas what could be happening?

Deploy with windows generated keys fails

Ran into a deploy error when setting up a manuscript at the OpenCon doathon:

bad decrypt
140040671200928:error:0606506D:digital envelope routines:EVP_DecryptFinal_ex:wrong final block length:evp_enc.c:520:

Bitcoin sign (₿, U+20BF) doesn't render in PDF and some browsers

As commented by @arielsvn in greenelab/scihub-manuscript#51 (comment):

there seems to be an encoding issue with the bitcoin symbol on the Discussion section. I noticed it on the pdf, and the same happens with the markdown file, at least on my computer.

This is likely due to the Unicode character (₿, U+20BF) being a recent addition as part of Unicode 10.0, released June 2017. Note this release has other important symbols/emojis such as 🧟 (Zombie) and 🧖 (Person in Steamy Room).

For me, on Chrome on Ubuntu 17.10, the bitcoin sign renders in the HTML but not the PDF. I'm assuming the PDF gets a certain font embedded on Travis CI, which doesn't have the latest characters. Note that when I generate the PDF locally, the bitcoin signs do render.

So @arielsvn, I think we may want to look into the following solutions:

Updating the font used by the Travis CI build
Specifying a font to use that is up to date

@arielsvn you probably know best what to do here.

Prepended file numbers

It seems like it would be better to specify the ordering of the markdown files by having a separate file.

As it is now it looks like people would have to rename several files if they wanted to change the ordering or add some content in the middle.
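One possible shape for this, sketched in Python: read the desired order from a separate list and fall back to alphabetical order for anything not listed. The file names and the `order_sections` helper are hypothetical.

```python
def order_sections(available, listed):
    """Order section files by an explicit list, then append unlisted files alphabetically."""
    listed_present = [name for name in listed if name in available]
    unlisted = sorted(set(available) - set(listed))
    return listed_present + unlisted


available = ["abstract.md", "methods.md", "intro.md", "results.md"]
listed = ["abstract.md", "intro.md", "methods.md"]
ordered = order_sections(available, listed)
```

Reordering or inserting a section would then mean editing one list rather than renaming several files.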

Markdown proofer

This CircleCI blog describes Markdown Proofer for validating YAML blocks in Markdown files. It is written in Go, which we could get from conda, but it may not cleanly integrate into our test environment. I'm also uncertain whether it could be applied directly to YAML files like metadata.yaml.

Nevertheless I thought it was worth monitoring.

macOS PDF build issues: long arguments not accepted & missing fonts in PDF

sh build/build.sh fails on macOS as follows:

ln --symbolic and rm --recursive do not work. When I changed them to ln -s and rm -r, respectively, they are fine.

However, then it complains about pango. I manually installed it using homebrew and pango was not an issue anymore.

Then the build was completed with no errors but warnings:

WARNING: Ignored `-ms-text-size-adjust: 100%` at 78:5, unknown property.
WARNING: Ignored `-webkit-text-size-adjust: 100%` at 79:5, unknown property.
WARNING: Ignored `-moz-box-sizing: content-box` at 204:5, unknown property.
WARNING: Ignored `-webkit-appearance: button` at 379:5, unknown property.
WARNING: Ignored `cursor: pointer` at 380:5, the property does not apply for the print media.
WARNING: Ignored `cursor: default` at 389:5, the property does not apply for the print media.
WARNING: Ignored `-webkit-appearance: textfield` at 410:5, unknown property.
WARNING: Ignored `-moz-box-sizing: content-box` at 411:5, unknown property.
WARNING: Ignored `-webkit-box-sizing: content-box` at 412:5, unknown property.
WARNING: Ignored `-webkit-appearance: none` at 423:5, unknown property.
WARNING: Invalid or unsupported selector 'button::-moz-focus-inner,
input::-moz-focus-inner ', Unknown pseudo-element: -moz-focus-inner
WARNING: Invalid or unsupported selector '*:not("#mkdbuttons") ', (<FunctionBlock not( … )>, ':not() only accepts a simple selector')
WARNING: Ignored `-webkit-font-smoothing: subpixel-antialiased` at 486:5, unknown property.
WARNING: Ignored `-moz-border-radius: 3px` at 491:5, unknown property.
WARNING: Ignored `-webkit-border-radius: 3px
` at 492:5, unknown property.
WARNING: Ignored `-webkit-font-smoothing: subpixel-antialiased` at 528:5, unknown property.
WARNING: Ignored `cursor: text
` at 529:5, the property does not apply for the print media.
WARNING: Ignored `word-break: break-all` at 733:5, unknown property.
WARNING: Ignored `word-break: break-word` at 734:5, unknown property.
WARNING: Ignored `-webkit-hyphens: auto` at 735:5, unknown property.
WARNING: Ignored `-moz-hyphens: auto` at 736:5, unknown property.

And the generated PDF has squares only.

Do you have any idea on why this might be happening?

PDF formatting is not ideal

In several places, the PDF rendering looks (subjectively) worse than the HTML output. (I'm not sure if I'll have time to work on this during the week, but I wanted to drop this here in case someone else has time before me.)

Overall, I think the margins of the PDF could be adjusted. The relatively short title already wraps in the PDF.
There are places where the HTML has spaces between the text and the references, but the PDF output does not. I'm not sure why this happens.
Code style could be formatted as monospaced in the PDF output.
Tables look much better in HTML than PDF (shading and banding).
The SVG example figure is missing (known problem: #14).

Automated figure & table numbering

@slochower welcome to manubot-rootstock... which is meant to be forked when creating a new manuscript. Still a work in progress.

See previous discussions at greenelab/deep-review#354 (comment) and greenelab/deep-review#558.

It seems like the best way to number and reference tables and figures will be with pandoc-tablenos and pandoc-fignos, which are both python packages by @tomduck that we can add to the environment:

They can be enabled in the pandoc conversion script with:

--filter pandoc-fignos
--filter pandoc-tablenos

Since we're also using jinja2 templating, we could do the conversion prior to pandoc if there is a compelling reason.

@slochower do you want to submit the PR? I'm thinking the initial use case we should target is markdown tables and figures embedded via absolute URL (let's save the relative image path case for later).

Also @slochower, any idea how figure and table captions work?

CC @agitter.

MIT licensed JS code

The anchor script (https://github.com/greenelab/manubot-rootstock/blob/master/build/assets/anchors.js) is not yet listed in the licensing section of the Readme.

I think it would be good to mention this somehow, maybe with "unless noted explicitly" or something similar?

Manubot vs. alternatives

I just found out about Manubot, can you tell the differences between Manubot and alternatives, like:

Authorea
Gitlab
Overleaf
Google docs...

Add "Send Feedback" button that creates an issue

Jake VDP wrote an astronomy paper (github source) that published to gh-pages (http://jakevdp.github.io/multiband_LS/) via gh-publisher. While each of those steps is a little clunky, one awesome feature of this page is that it has a "Send Feedback" button which then opens up a GitHub issue! This is a great way to create a dialogue with the manuscript authors and readers.

EDIT: Added link to gh-publisher

Generic URL ––> archive.org persistent ID/URL ––> Manubot

When you cite a news or blog URL, you might want to reference the archive.org snapshot of the URL.

Can the @url: identifier send a request to archive.org and get that URL to cite in Manubot?
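A minimal sketch of what such a lookup could look like, using the Wayback Machine availability endpoint. The helper names are made up, and how this would plug into the @url: handling is left open.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the Wayback Machine availability query for a URL."""
    return f"{WAYBACK_API}?{urlencode({'url': url})}"


def closest_snapshot(payload):
    """Return the closest snapshot URL from an API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(url):
    """Query archive.org over the network (defined here but not called)."""
    with urlopen(availability_query(url)) as response:
        return closest_snapshot(json.load(response))
```

When no snapshot exists, `closest_snapshot` returns None, so the caller could fall back to citing the original URL.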

See blog post: https://medium.com/@RaoOfPhysics/89bd3f2ce0fd

Symlink CSS to output directory for local viewing of the HTML

I propose we symlink github-pandoc.css to the output/ directory so that local building and viewing of webpage/index.html or output/manuscript.html (I know those are symlinks of each other) loads the CSS. Viewing the HTML from either webpage/ or output/ currently can't find the CSS because the browser follows the symlink into output/ and therefore doesn't find webpage/github-pandoc.css. Does that make sense? A simple ln -s ../webpage/github-pandoc.css in output/ fixes the issue.
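The same relative symlink can be sketched in Python, here exercised in a throwaway directory so nothing in a real checkout is touched. The `link_css` helper is hypothetical, and symlink creation may need extra permissions on Windows.

```python
import os
import tempfile
from pathlib import Path


def link_css(root):
    """Create output/github-pandoc.css pointing at ../webpage/github-pandoc.css."""
    link = Path(root) / "output" / "github-pandoc.css"
    if not link.is_symlink():
        os.symlink(os.path.join("..", "webpage", "github-pandoc.css"), link)
    return link


# Demonstrate in a temporary directory that mimics the two folders.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "webpage").mkdir()
    (Path(root) / "output").mkdir()
    (Path(root) / "webpage" / "github-pandoc.css").write_text("body {}")
    css_text = link_css(root).read_text()
```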

Creating a Manubot CSL that perfects the format of bibliographic entries

Currently, Manubot uses style.csl, a slightly modified version of proceedings-of-the-royal-society-b.csl. While this style is decent, I have some ideas for an optimal style. And of course, authors can always switch the style to that of whatever journal they'd like.

The style I envision uses numbers for citations, i.e. renders like blah blah [1-5,7]. Non-bracketed citations could show the author name, like: Pippi, Hippi, et al [7] wrote.

Bibliographic entries would look something like:

Sci-Hub provides access to nearly all scholarly literature
_{^{Daniel S Himmelstein, Ariel R Romero, Stephen R McLaughlin, Bastian Greshake Tzovaras, Casey S Greene}}
PeerJ Preprints (2017-07-20) _{^{DOI: 10.7287/peerj.preprints.3100v1}}

Ideally, author names would be in smaller text and hyperlink to ORCID records when available. The smallness of text here is an exaggeration (limited formatting options).

Compared to historical bibliographic formats, the following points are stressed:

Unique identification is the most important aspect of a reference. A hyperlink or DOI is the single most important piece of information.
The title is the most salient human-readable piece of information
Just having a year for the date is too imprecise. The month and day are important for placing works in the proper historical context.
Authorship information is important, but often takes up too much of a reference. Having authorship information in smaller or lighter text would be nice.
Unless a reference is only available in a physical format, the volume / issue / page information is irrelevant.
Historical reference styles adopt vastly different styles based on the type of record (article, interview, etcetera). This is largely unnecessary. If anything, a badge can display the type of record.

There's a webapp to generate a custom CSL style. I've found it a bit difficult to use, but it's probably the way to go.

One question is whether to print out the URL rather than hyperlink the title. The benefit of showing the URL would be for readers who have printed the PDF. However, if a reader is at a computer, they could always go back to the digital version with the hyperlink.

Suggestions welcome.

Support Alternate Themes

It is difficult to read a long manuscript with the current style settings.

It might be useful to build on the work of other projects which convert Markdown into the usual academic style:

https://github.com/ickc/markdown-latex-css
https://github.com/thomaspark/pubcss/ // https://thomaspark.co/project/pubcss/demo/acm-sig-sample-web-theme.html
https://gist.github.com/killercup/5917178
etc

Webpage prints to A4 dimensions rather than Letter

See for example this Sci-Hub Manuscript PDF. The Paper Size according to the PDF's properties is A4, Portrait (8.26 × 11.68 inch). This caused an issue when I printed the PDF where some final lines on a page were omitted.

This StackOverflow notes how to change the page to Letter (8.5 × 11). I just want to confirm this is a change we want to make. I didn't realize there were multiple paper sizes, both prevalent, in this unstandardized world!

gh-publisher: lessons to learn?

Have you seen https://github.com/ewanmellor/gh-publisher? What lessons can we learn from them?

EDIT. Example: http://drphilmarshall.github.io/Ideas-for-Citizen-Science-in-Astronomy/

Tracking Manubot usage

It may be nice in the future to produce statistics about how many documents have been authored with Manubot and this rootstock, or to refer to more examples. @dhimmel has https://github.com/dhimmel/rephetio-manuscript/ and there were examples listed in #62.

I haven't been able to think of a non-invasive way to track this. Does anyone else have ideas? Is this worthwhile?

Add BUILD_PDF flag

To work around PDF build issues (#120) and for quicker local development, a BUILD_PDF flag like BUILD_DOCX might be useful.

This would require skipping "manuscript.pdf" in webpage.py. Would that be a problem?
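A rough sketch of how such a flag could be read and used to decide which files to expect. The default values and the "true"/"false" convention are assumptions for illustration, and may not match how BUILD_DOCX is handled in build.sh.

```python
import os


def flag_enabled(name, default="true", environ=None):
    """Interpret an environment variable as a boolean build flag."""
    environ = os.environ if environ is None else environ
    return environ.get(name, default).strip().lower() == "true"


def output_files(environ=None):
    """List the manuscript files a build would produce under the given flags."""
    files = ["manuscript.html"]
    # Assumed default: build the PDF unless BUILD_PDF is explicitly "false".
    if flag_enabled("BUILD_PDF", environ=environ):
        files.append("manuscript.pdf")
    # Assumed default: skip the DOCX unless BUILD_DOCX is explicitly "true".
    if flag_enabled("BUILD_DOCX", default="false", environ=environ):
        files.append("manuscript.docx")
    return files
```

A script like webpage.py could then iterate over this list instead of assuming the PDF always exists.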

Archiving metadata (issues, pull request, etc.)

In deep review, the issues and pull requests were a critical part of the manuscript. I'd like to discuss strategies for archiving some of this metadata.

One initial thought would be to have the build script take a snapshot of the issues and pull requests at the time of the build, ideally with some caching. The deploy script could push them to a new branch, perhaps adding a timestamp. I haven't thought through the technical aspects of this. I expect it is feasible using some of the tools or APIs here.
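To make the snapshot idea a little more concrete, here is a minimal sketch against the GitHub REST API issues endpoint, which lists pull requests alongside issues. The helper names are hypothetical, caching and pagination are left out, and unauthenticated requests are rate limited.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode
from urllib.request import urlopen


def issues_url(owner, repo, page=1):
    """Build a GitHub REST API URL listing issues and pull requests."""
    query = urlencode({"state": "all", "per_page": 100, "page": page})
    return f"https://api.github.com/repos/{owner}/{repo}/issues?{query}"


def fetch_page(owner, repo, page=1):
    """Download one page of issues (network call, not run in this sketch)."""
    with urlopen(issues_url(owner, repo, page)) as response:
        return json.load(response)


def snapshot(issues, taken_at=None):
    """Wrap issue records with a UTC timestamp, ready to be written to a branch."""
    taken_at = taken_at or datetime.now(timezone.utc)
    return json.dumps({"taken_at": taken_at.isoformat(), "issues": issues}, indent=2)
```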

cc @cgreene

Reference numbering with misspecified citation

In deep review (greenelab/deep-review#845 ), we had a pair of citations without a ; separator [@url:https://eprint.iacr.org/2017/281.pdf @tag:Papernot2017_pate]. The second paper was numbered in the reference list but not actually cited in text, which led to inconsistent reference numbering:

The skipped reference number 161 is @tag:Papernot2017_pate. See the permalink for more context. As a reader, I would expect that @tag:Papernot2017_pate is numbered based on the first appearance in the text.
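The expected behavior can be sketched as follows: scan the text and assign each citation a number the first time it appears. The regular expression is a simplified stand-in written for this example, not the pattern Manubot actually uses.

```python
import re

# Simplified citation pattern: "@source:identifier" up to whitespace, ";" or "]".
CITATION_PATTERN = re.compile(r"@[a-z]+:[^\s;\]]+")


def number_by_first_appearance(text):
    """Map each citation to a number assigned in order of first appearance."""
    numbers = {}
    for citation in CITATION_PATTERN.findall(text):
        numbers.setdefault(citation, len(numbers) + 1)
    return numbers


text = (
    "Prior work [@doi:10.1126/science.1127344] and "
    "[@url:https://eprint.iacr.org/2017/281.pdf @tag:Papernot2017_pate]."
)
numbers = number_by_first_appearance(text)
```

With this approach the tag citation gets the next number even though the ; separator is missing, so no number would be skipped in the text.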

nan (missing) author fields

Quoting from https://greenelab.github.io/scihub-manuscript/

0000-0002-9925-9623 · Department of Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt · Funded by nan

Would be better to have jinja2 omit blank fields entirely. In other words remove " · Funded by nan"
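A small sketch of the filtering step, assuming the "nan" comes from a float NaN standing in for a blank field. The function names are hypothetical, and the same check could instead live in the jinja2 template.

```python
import math


def is_missing(value):
    """True for None, empty strings, and float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def author_line(orcid, affiliation, funding):
    """Join the available author details, skipping missing ones."""
    parts = [orcid, affiliation]
    if not is_missing(funding):
        parts.append(f"Funded by {funding}")
    return " · ".join(part for part in parts if not is_missing(part))


line = author_line("0000-0002-9925-9623", "Goethe University Frankfurt", float("nan"))
```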

Bake hypothesis into the HTML versions of the manuscripts

Add:

<script src="https://hypothes.is/embed.js" async></script>

Doesn't work natively with the PDF files, sadly.

Error during build if there are zero references

I've been playing around with manually building a manuscript based off this template and noticed that if I have absolutely zero references in my document, I get a build error. If I add a reference in any section (e.g., putting [@doi:10.1126/science.1127344] as a placeholder in my abstract), then the error goes away.

$ bash build/build.sh 
Retrieving and processing reference metadata
Using metadata cache: True
Traceback (most recent call last):
  File "references.py", line 111, in <module>
    ref_df['standard_citation'], ref_df['citation_id'] = zip(*result)
ValueError: not enough values to unpack (expected 2, got 0)

I haven't debugged the code, but I think result (calculated on line 109, just above the error) is empty when there are no references. Would a simple check like if result is not None: ... before line 111 be a workaround?

result = ref_df.citation.apply(
    get_standard_citatation, cache=metadata_cache, override=overrides)

(FWIW, I do get the "potentially misformatted references" error in any case, but the build continues successfully after I add the placeholder. The warning comes from the templates in the front matter.)
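For what it's worth, the error can be reproduced without pandas: unpacking zip(*result) into two names fails whenever result is empty. If result is an empty collection rather than None, a None check would not catch it, whereas a length check would. A minimal sketch, with a hypothetical helper name and plain lists standing in for the DataFrame columns:

```python
def split_results(result):
    """Split (standard_citation, citation_id) pairs into two lists."""
    if len(result) == 0:
        # zip(*[]) yields nothing, so unpacking it into two names raises
        # "ValueError: not enough values to unpack (expected 2, got 0)".
        return [], []
    standard_citations, citation_ids = zip(*result)
    return list(standard_citations), list(citation_ids)


empty = split_results([])
one = split_results([("doi:10.1126/science.1127344", "hypothetical-id")])
```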

Journal compatibility

I'm excited to see this standalone manuscript repository!

I have a general question in regards to journal submissions. Many journals require Word or LaTeX formats for submission. Have you thought about how manuscripts written in this markdown format can be submitted to a journal with those requirements? Would one use pandoc outside of the automatic build to do a one-time conversion to Word or LaTeX?

Check out Pandoc Scholar

Described in Formatting Open Science: agilely creating multiple document formats for academic manuscripts with Pandoc Scholar:

In this article we demonstrate the feasibility of writing scientific manuscripts in plain markdown (MD) text files, which can be easily converted into common publication formats, such as PDF, HTML or EPUB, using Pandoc. The simple syntax of Markdown assures the long-term readability of raw files and the development of software and workflows. We show the implementation of typical elements of scientific manuscripts—formulas, tables, code blocks and citations—and present tools for editing, collaborative writing and version control. We give an example on how to prepare a manuscript with distinct output formats, a DOCX file for submission to a journal, and a LATEX/PDF version for deposition as a PeerJ preprint. Further, we implemented new features for supporting ‘semantic web’ applications, such as the ‘journal article tag suite’—JATS, and the ‘citation typing ontology’—CiTO standard.

The GitHub repo for this project is pandoc-scholar/pandoc-scholar. Created by @tarleb.

Let's see if there's anything from Pandoc Scholar we should incorporate here or learn from.
