
climate's Introduction

petermr repositories

Many of these repos are widely used in collaborative projects and include:

  • code
  • data
  • projects

This special repo is to coordinate navigation and discussion

discussion lists

The "Discussions" for this repo (https://github.com/petermr/petermr/discussions) include discussions for the other repos and are indicated by their names. They may replace our (private) Slack for all public-facing material (private project management will remain on Slack).

active repos

active Python projects:

For context: we have 4 packages (if that's the right word). They are largely standalone but can share useful library routines. They all use a common data structure on disk (simply named directories). This means that state is less important and is often held on the filesystem. It also means that data can be further manipulated by Unix tools and other utilities. This is very fluid as we are constantly adding new data substructures. (I developed much of this in Java - https://github.com/petermr/ami3/blob/master/README.md). The top directory is a CProject and its document children are called CTrees, as they are usefully split into many subdirectory trees.
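The on-disk convention can be sketched as follows; the layout and function names here are illustrative only, not the actual py4ami API:

```python
from pathlib import Path
import tempfile

# A CProject is just a directory; each child directory is a CTree
# holding the files for one document (illustrative layout).
def make_demo_cproject(root: Path) -> Path:
    cproject = root / "climate_project"
    for pmcid in ["PMC5264177", "PMC5299408"]:
        ctree = cproject / pmcid
        ctree.mkdir(parents=True)
        (ctree / "fulltext.xml").write_text("<article/>")
    return cproject

def list_ctrees(cproject: Path):
    # State lives on the filesystem: a CTree is any subdirectory.
    return sorted(p.name for p in cproject.iterdir() if p.is_dir())

root = Path(tempfile.mkdtemp())
cproject = make_demo_cproject(root)
print(list_ctrees(cproject))  # ['PMC5264177', 'PMC5299408']
```

Because the state is just directories and files, `ls`, `grep`, `find` and friends work on a CProject as well as any of the Python tools.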

Each package has a maintainer. These are all volunteers; their Python is self-taught. There are also interns - a mixture of compsci/engineering/plant-science people who have a 3-month stay. They test the tools, develop resources, and explore text-mining, NLP, image analysis, machine learning, etc. They are encouraged to use the packages and link them into Python scripts or Notebooks, but don't have time for serious development. (They might add readers or exporters.)

  • pygetpapers, Ayush Garg. https://github.com/petermr/pygetpapers . Searches and downloads articles from repositories. Standalone, but the results may be used by docanalysis or possibly imageanalysis. Can be called from other tools.

  • docanalysis. Shweata Hegde. https://github.com/petermr/docanalysis . Ingests CProjects and carries out text-analysis of documents, including sectioning, NLP/text-mining, vocabulary generation. Uses NLTK and other Python tools for many operations, and spaCy, scispaCy for annotation of entities. Outputs summary data, correlations, word-dictionaries. Links entities to Wikidata.

  • pyamiimage, Anuv Chakroborty + PMR. https://github.com/petermr/pyamiimage . Ingests Figures/images, applies many image-processing techniques (erode-dilate, colour quantization, skeletons, etc.), extracts words (Tesseract), extracts lines and symbols (uses sknw/NetworkX), and recreates semantic diagrams (not finished).

  • py4ami, PMR. https://github.com/petermr/pyami . Translation of ami3 (Java) to Python. Processes CProjects to extract and combine primitives into semantic objects. Some functionality overlaps with docanalysis and imageanalysis. Includes libraries (e.g. for Wikimedia), a prototype GUI in tkinter, and a complex structure of word-dictionaries covering science and related disciplines. (Note: the project is called pyami locally, but there is already a PyAMI project, so elsewhere it is called py4ami.)

All packages aim to have a common command-line approach, use config files, and generate and process CProjects (e.g. iterating over CTrees and applying filters, transformers, map/reduce, etc.). All 4 packages have been uploaded to PyPI.
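The shared iteration pattern - walk the CTrees in a CProject, filter them, apply a transformer - can be sketched like this (all names are hypothetical, not the API of any of the four packages):

```python
from pathlib import Path
import tempfile

# Hypothetical sketch of the shared pattern: iterate CTrees in a
# CProject, filter them, and map a transformer over the survivors.
def process_cproject(cproject: Path, ctree_filter, transformer):
    results = {}
    for ctree in sorted(p for p in cproject.iterdir() if p.is_dir()):
        if ctree_filter(ctree):
            results[ctree.name] = transformer(ctree)
    return results

# Demo CProject with two CTrees, only one containing fulltext.xml.
root = Path(tempfile.mkdtemp())
for name, has_xml in [("PMC111", True), ("PMC222", False)]:
    ctree = root / name
    ctree.mkdir(parents=True)
    if has_xml:
        (ctree / "fulltext.xml").write_text("<article>climate</article>")

out = process_cproject(
    root,
    ctree_filter=lambda t: (t / "fulltext.xml").exists(),
    transformer=lambda t: len((t / "fulltext.xml").read_text()),
)
print(out)  # {'PMC111': 26}
```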

basicTest

Checks that the Python environment works (independently of the applications) https://github.com/petermr/basicTest/blob/main/README.md

presentations

Some presentations about the software, many from collaborators/interns

pygetpapers

notebook

docanalysis

wikidata

climate's People

Contributors

katrinleinweber, mrchristian, petermr


climate's Issues

Organising project comms

My suggestion for my next contribution to the Open Climate Knowledge project is to work out a plan for the next steps in communicating the project, in an agile, bite-size way - so just a few next steps, then rinse and repeat.

I'll write out some ideas here https://demo.codimd.org/VrRq3-_QQ2eNVQiVAYNgbA?view

In no particular order:

  • Create website with Jekyll GitHub pages
  • Short project description: about, attribution
  • Populate repositories with standard open project docs: CoC, Contribute, licence, etc.
  • Short presentation
  • Guide of how to use OpenNotebook: which also means understanding the OpenNotebook functionality, outputs, etc. :-)
  • Guide for groups wanting to use OCK to add to or create new dictionaries
  • Roadmap
  • Define and document a process for a dictionary/search so we have a solid base to work from
  • Confirming our initial short term goals for this stage: build project comms; dictionary on 'runaway climate change'; invite others to use.
  • Produce a paper

That's more than enough...

Easier software install Qs

Make CM software easier to install, e.g. have all the Java tools on Maven Central, etc.

If this is felt necessary it would be good to look at what needs to be done.

PDF processing

Can you point me to the part of ContentMine, or the instructions, for processing and extracting PDF parts? Also, is there an example of a source document and its outputs?

I am asking as some colleagues have a PDF document set that they need to extract and enrich components from.

Force11 WG setup

I'm setting up the Force11 WG ready to make an announcement, either today or tomorrow, depending on how simple or complicated things get.

I'm going to follow how the Software Citation Principles WG have done things as they have been successful with their WGs and have experience running a couple. See: https://github.com/force11/force11-sciwg

Like them I will create an information WG repo, which seems like a good idea as document updates will get in the way of the software code repo. And of course I will do this over on the OCKProject area and move to Issue tracking there at some point.

download crossref metadata

General idea: download Crossref metadata and see how much of the climate literature is open.

$ getpapers -q "climate change" --api crossref -k 100 --outdir crossref
info: Searching using crossref API
info: Found 689472 results
info: limiting hits
info: Saving result metadata
info: Full CrossRef result metadata written to crossref_results.json
info: Individual CrossRef result metadata records written
MacBook-Pro-3:climate pm286$ tree crossref/
crossref/
├── 10.1002_9781118279380.ch2
│   └── crossref_result.json
├── 10.1002_9781119974178.ch3
│   └── crossref_result.json
├── 10.1002_wcc.158
│   └── crossref_result.json

The metadata is added by publishers and is highly variable
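One way to see how variable it is: walk the getpapers output above and count how many records carry any licensing metadata at all. Crossref work records *may* include a `license` list, but publisher coverage is patchy, so absence means "unknown" rather than "closed". A minimal sketch:

```python
import json
import tempfile
from pathlib import Path

# Sketch: walk getpapers' crossref/ output and count how many records
# carry any license metadata. Treat missing fields as "unknown".
def count_licensed(crossref_dir: Path):
    licensed = total = 0
    for meta in crossref_dir.glob("*/crossref_result.json"):
        record = json.loads(meta.read_text())
        total += 1
        if record.get("license"):
            licensed += 1
    return licensed, total

# Tiny demo tree mimicking getpapers' layout.
root = Path(tempfile.mkdtemp()) / "crossref"
for doi, rec in [
    ("10.1002_wcc.158",
     {"license": [{"URL": "http://creativecommons.org/licenses/by/4.0/"}]}),
    ("10.1002_9781118279380.ch2", {}),
]:
    d = root / doi
    d.mkdir(parents=True)
    (d / "crossref_result.json").write_text(json.dumps(rec))

print(count_licensed(root))  # (1, 2)
```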

Ideas for help that OCK needs to recruit

As mentioned in the accompanying Issues #29 'OCK next steps tech' I want to list out areas where OCK needs help as it may be that TIB colleagues have suggestions or ideas about how to plug the gaps.

This is my list of help needed, in no particular order, please add, etc:

  • Software support documentation
  • Climate Change specialists as advisors on use of Content Mine, issues in their field, uses of ContentMine OCK, etc
  • Data science software developers, users to carry out searches, experiments
  • Members to join a Force11 working group for OCK, contribute on research, papers, WG duties, OA stats, contribute to recommendations and plans for transition to open research/OA
  • OA experts to help on informing OCK on existing research on OA rates, stats and how OCK can deal with speeding up OA in Climate Change
  • Wikidata wranglers
  • knowledge graph expertise
  • content curation and repository building
  • RDM
  • community development and open project strategy implementation

That's all folks

S

Rendering JATS/XML as HTML5

You want to have some JATS/XML rendered as HTML5 for the Oxford XML Summer School. Can you point me to the type of source content that needs rendering, or an example, so I can try some things out? Preferably the GitHub Pages Jekyll framework could just use the JATS as-is, but we will have to see.

I take it we would want either to concatenate a series of papers from directories into one big HTML output, or to create a mini website linking to papers?
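The "one big HTML output" option can be sketched as below. This is only a shape-of-the-pipeline illustration: real JATS needs a proper transform (e.g. the NISO JATS-to-HTML stylesheets), and the element paths used here cover only a toy subset.

```python
import xml.etree.ElementTree as ET

# Pull the title and body paragraphs out of each JATS file and stack
# them into one HTML5 page (toy subset of JATS only).
def jats_to_section(jats_xml: str) -> str:
    article = ET.fromstring(jats_xml)
    title = article.findtext(".//article-title", default="(untitled)")
    body = article.find("body")
    paras = body.findall(".//p") if body is not None else []
    text = " ".join(p.text or "" for p in paras)
    return f"<article><h1>{title}</h1><p>{text}</p></article>"

def concatenate(papers) -> str:
    sections = "\n".join(jats_to_section(p) for p in papers)
    return f"<!DOCTYPE html>\n<html><body>\n{sections}\n</body></html>"

demo = """<article><front><article-meta>
<title-group><article-title>Tipping points</article-title></title-group>
</article-meta></front><body><p>Runaway feedback.</p></body></article>"""
html = concatenate([demo])
print(html)
```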

help with lists and knowledgebase

To have a speedier list- and knowledgebase-building process to feed into ContentMine use, I have made a page on the GenR repository for collecting contributions. This can also help connect into the larger project of climate-change OA liberation. See https://github.com/Gen-R/open-climate/

These pages can then be merged in this repository.

Frequently Asked Questions (FAQs)

Introduction

This is a growing assembly of questions that newcomers to the project and its software might ask. If you join the project you can ask or answer more.

Themes

purpose of the project

how to run the software

raw material

dictionaries

people

GitHub Organisation for OCK

To facilitate the new Force11 Working Group we need to set up a GitHub organisation; this way we can have members and other group functions.

If you, @petermr, could set up an organisation and add @mrchristian as an admin, then I will fork the climate repo there and we can use the forked repo as the new working place for the WG.

The GitHub organisation should be an individual account and be named 'Open Climate Knowledge'

created 200 scoping set for runaway climate

A quick search to see how many papers relate to "runaway" or "tipping".

MacBook-Pro-3:climate pm286$ getpapers -q "((climate change) AND ((runaway) OR (feedback) OR (tipping)))" -k 500 -x -o runaway500
info: Searching using eupmc API
info: Found 9650 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 5.3.2 vs. 6.1 reported by api
info: Limiting to 500 hits
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Duplicate records found: 998 unique results identified
info: limiting hits
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Got XML URLs for 500 out of 500 results
info: Downloading fulltext XML files
Downloading files [==============----------------] 46% (232/500) [77.8s elapsed, eta 89.9]^C 

stopped after 40%, got 222
ami-search

MacBook-Pro-3:climate pm286$ ami-search -p runaway222/ --dictionary compound species country funders 

Generic values (AMISearchTool)
================================
-v to see generic values
oldstyle            true

Specific values (AMISearchTool)
================================
oldstyle             true
strip numbers        false
wordCountRange       (20,1000000)
wordLengthRange      (1,20)

dictionaryList       [compound, species, country, funders]
dictionaryTop        null
dictionarySuffix     [xml]

0    [main] DEBUG org.contentmine.ami.tools.AbstractAMISearchTool  - old style search command); change
cProject: runaway222
legacy cmd> word(frequencies)xpath:@count>20~w.stopwords:pmcstop.txt_stopwords.txt
legacy cmd> search(compound)
legacy cmd> species(binomial)
legacy cmd> search(country)
legacy cmd> search(funders)
!PMC5264177 .!PMC5299408 !PMC5459990 !PMC5472773 !PMC5551099 !PMC5577139 !PMC5578963 !PMC5593823 !PMC5595922 !PMC5651905 !PMC5678106 .!PMC5719437 !PMC5734744 !PMC5770443 PMC5789925 !PMC5795745 !PMC5798756 !PMC5820313 ...
PMC6536552 PMC6538627 !PMC6539176 .PMC6539203 !PMC6540656 PMC6540663 !PMC6541288 PMC6541573 !PMC6541581 PMC6541717 !PMC6542552 !PMC6542844 !PMC6543642 .PMC6544233 PMC6545051 PMC6545231 UNKNOWN nlm tag: city
UNKNOWN nlm tag: city
!PMC6547168 !PMC6549952 PMC6550257 PMC6553685 !PMC6555712 PMC6556101 UNKNOWN nlm tag: city
UNKNOWN nlm tag: city
UNKNOWN nlm tag: version
UNKNOWN nlm tag: version
UNKNOWN nlm tag: version
!PMC6556939 .PMC6558283 !PMC6559081 !PMC6559268 !PMC6559292 !PMC6561295 !PMC6562896 !PMC6563524 PMC6565653 !PMC6566821 PMC6566967 .PMC65679
...
PMC6723259 PMC6724111 !PMC6724177 !PMC6724306 PMC6724339 !PMC6726645 !PMC6727426 PMC5264177 97035 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
.PMC5299408 97036 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
(PMR: there seem to be a lot of these)
PMC5459990 97036 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
...

.PMC6706196 PMC6706372 PMC6706434 PMC6708170 PMC6708426 PMC6709546 105060 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6709957 105060 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6710573 105060 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6711539 PMC6712833 105149 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
.PMC6712961 105149 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6714084 105149 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6714099 PMC6716414 PMC6716840 PMC6717165 PMC6717645 105225 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6718425 105225 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6718993 PMC6720849 .PMC6721090 PMC6721118 105351 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6723259 PMC6724111 PMC6724177 105406 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6724306 105406 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6724339 PMC6726645 105438 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
PMC6727426 105438 [main] DEBUG org.contentmine.ami.plugins.word.WordCollectionFactory  - no words found to extract
....................................................................................................cannot run command: search([compound])[]; cannot process argument: --sr.search (RuntimeException: cannot read inputStream for dictionary: /org/contentmine/ami/plugins/dictionary/compound.xml)
SP: runaway222..................................................................................................................................................................................................................................................................................................................................................................................................................................................................
create data tables
rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrMacBook-Pro-3:climate pm286$ 

SECTIONS

New tool to find sections. Its value depends on publisher consistency.

MacBook-Pro-3:climate pm286$ ami-section -p runaway222/ --sections ALL

Generic values (AMISectionTool)
================================
-v to see generic values
oldstyle            true

Specific values (AMISectionTool)
================================
sectionList             [ABBREVIATION, ABSTRACT, ACK_FUND, APPENDIX, ARTICLE_META, ARTICLE_TITLE, CONTRIB, AUTH_CONT, BACK, BODY, CASE, CONCL, COMP_INT, DISCUSS, FINANCIAL, FIG, FRONT, INTRO, JOURNAL_META, JOURNAL_TITLE, PUBLISHER_NAME, KEYWORD, METHODS, OTHER, PMCID, REF, RESULTS, SUPPL, TABLE, SUBTITLE, TITLE]
write                   true

AMISectionTool cTree: PMC5264177
AMISectionTool cTree: PMC5299408
AMISectionTool cTree: PMC5459990
AMISectionTool cTree: PMC5472773
...

creates a section/ dir for each CTree

This is new...
The title of each section depends on the subtitles from the publisher.

Comments useful.
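A quick way to check publisher consistency is to tally how many section files each CTree gained. The directory and file names below are illustrative; the real layout depends on the ami version:

```python
from pathlib import Path
import tempfile

# Sketch: after ami-section runs, tally section files per CTree.
# Names ("sections", numbered subdirs) are illustrative only.
def section_counts(cproject: Path):
    counts = {}
    for ctree in sorted(p for p in cproject.iterdir() if p.is_dir()):
        sections = ctree / "sections"
        counts[ctree.name] = (
            sum(1 for _ in sections.rglob("*.xml")) if sections.is_dir() else 0
        )
    return counts

# Demo: one CTree with two extracted sections, one with none.
root = Path(tempfile.mkdtemp())
front = root / "PMC5264177" / "sections" / "0_front"
front.mkdir(parents=True)
(front / "0_article-meta.xml").write_text("<front/>")
(front / "1_journal-meta.xml").write_text("<front/>")
(root / "PMC5299408").mkdir()

print(section_counts(root))  # {'PMC5264177': 2, 'PMC5299408': 0}
```

CTrees with a count of 0 are candidates for "publisher markup the sectioner doesn't yet recognise".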

Energy Modeling Search

Meeting with Ludwig Hülk @Ludee on Monday at the Reiner Lemoine Institut https://github.com/rl-institut in Berlin to talk about Open Energy Modeling.

I discussed creating a dictionary for a search on Energy Modeling with Ludwig and his colleague, who are experts in the field.

There are two resources we can build this dictionary from: first, a Glossary from the Open Energy Modeling Initiative https://wiki.openmod-initiative.org/wiki/Category:Glossary and, second, an ontology that RLI have made https://github.com/OpenEnergyPlatform/ontology

I'll consult with Ludwig and co. about how we can collate useful terms from these sources, connect them to Wikidata, and arrange a handover and then refinement as we carry out searches.
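Turning collated terms into a dictionary could look like this. The `<dictionary>`/`<entry>` shape follows published ContentMine dictionaries, but the attribute names (e.g. `wikidataID`) should be checked against the current schema, and the Wikidata IDs below are placeholders:

```python
import xml.etree.ElementTree as ET

# Sketch: build an ami-style dictionary from a {term: wikidata_id}
# mapping. Attribute names approximate published ContentMine
# dictionaries; verify against the current schema before use.
def make_dictionary(title: str, terms: dict) -> str:
    root = ET.Element("dictionary", title=title)
    for term, wikidata_id in terms.items():
        ET.SubElement(root, "entry", term=term, name=term,
                      wikidataID=wikidata_id)
    return ET.tostring(root, encoding="unicode")

# Placeholder Wikidata IDs - look up the real items before publishing.
xml = make_dictionary(
    "energy_modeling",
    {"load profile": "Q000001", "capacity factor": "Q000002"},
)
print(xml)
```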

Scholarly HTML

Do the scholarly HTML files use this Scholarly HTML markup? I know it sounds like a silly question, but I just need a reality check. The W3C group seems inactive, though; see here https://vivliostyle.github.io/vivliostyle_doc/samples/scholarly/index.html

I'm seeing if I can render the HTML outputs from a 'mining session' using the paginated CSS setup from the lovely Vivliostyle people https://vivliostyle.org/

Like so https://vivliostyle.github.io/vivliostyle.js/viewer/vivliostyle-viewer.html#x=https://vivliostyle.github.io/vivliostyle_doc/samples/scholarly/index.html

I should be able to make an inventory of the papers somehow, then add some custom CSS, and it might work :-) Outputting as a standalone website vs. MD for internal GitHub viewing will be different.

technologies for OCK next steps

Hi,

I have a presentation for TIB colleagues tomorrow, 29th Oct, and I need to ask a couple of questions about the thinking on technology routes for OCK's proposed next steps - and whether there are existing systems in place, choices already made, routes contemplated, or explorations being made.

  • Building an open metadata and/or document repository - how could we present the data collated by OCK to the public in a usable form?
  • Knowledge Graph creation?

I would like to be able to present the current view on these two parts of the project; if we need input, I can see what people can offer or have ideas about.

Thanks

Simon

Pull request waiting for merge

Pull request with some lists information is waiting for merge. #8

It covers text files from contributing and lists files in a new directory 'lists'.

Some file changes seem to have appeared in the directory 'clim107'; this wasn't intentional and I'm not sure how or why it happened.
