
scienceaccessibility's Introduction

This project has moved to this repository

A live version of the app is here

Build Status

Binder

Example Screen Shot

First Step

git clone https://github.com/russelljjarvis/ScienceAccess.git
cd ScienceAccess

If you don't have python3:

sudo bash install_python3.sh

Installation Apple

sudo bash apple_setup.sh

Installation Linux

sudo bash setup.sh

Run

streamlit run app.py

Manuscript

Overview

Understanding a big word is hard, so when big ideas are written down with lots of big words, the large pile of big words is also hard to understand.

We used a computer to quickly visit and read many different websites, to see how hard each piece of writing was to understand. People may avoid learning hard ideas only because they run into too many hard words along the way. We think we can help by explaining the problem with smaller words, and by creating tools to address it.

Why Are We Doing This?

We want to promote clearer and simpler writing in science by encouraging scientists in the same field to compete with each other over writing more clearly.

How Are We Doing This?

Machine Estimation of Writing Complexity:

The accessibility of the written word can be approximated by a computer program that reads over a text and estimates the mental difficulty of comprehending it. The program maps reading difficulty onto a quantity informed by the cognitive load of the writing and the number of years of schooling needed to decode the language in the document. For convenience, we refer to this difficulty as the 'complexity' of the document.
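
A minimal sketch of this kind of estimation in Python, assuming the textstat package (the app's actual scoring may combine several such metrics):

import textstat

def complexity(text):
    # Flesch-Kincaid maps sentence length and syllable counts onto the
    # approximate years of schooling needed to decode the text.
    return textstat.flesch_kincaid_grade(text)

print(complexity("The cat sat on the mat."))  # low grade level
print(complexity("Phytochromobilin isomerization may proceed via a concerted or stepwise mechanism."))  # higher grade level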

How do some well-known texts do?

First, we sample some extremes in writing style and tabulate the results, so that we have some reference points to help us make sense of other results. On the lower and upper limits we have XKCD: pushing the limits of extremely readable science, and, for some comparison, we wanted to check some machine-generated postmodern nonsense.

Higher is worse:

complexity texts
6.0 upgoer5
9.0 readability of science declining
14.0 science of writing
14.9 mean wikipedia
16.5 mean post modern essay generator

Some particular cases:

complexity texts
13.0 this readme.md
17.0 The number of olfactory stimuli that humans can discriminate is still unknown
18.68 Intermittent dynamics and hyper-aging in dense colloidal gels
37.0 Phytochromobilin C15-Z,syn - C15-E,anti isomerization: concerted or stepwise?

Proposed Remedies:

  • 1 Previously I mentioned creating tools to remedy inaccessible academic research. One tool, a natural extension of this work, would enable 'clear writing' tournaments between prominent academic researchers, for example:
mean complexity author
28.85 professor R Gerkin
29.8 [other_author]
30.58 [other_author]

Example code for the proposed tool would allow you to select academic authors, who then compete on demand, and to use their writing contributions in a tournament where members compete to write simpler text. A more recently maintained version of that file is also available. A sketch of such a leaderboard appears after this list.

  • 2 A different proposed remedy is to run each text through a text-simplification tool and evaluate its complexity before and after simplification. How different are the scores?
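
For remedy 1, a minimal leaderboard sketch, assuming per-document complexity scores have already been computed (the author names and numbers below are placeholders):

from statistics import mean

# Placeholder complexity scores per document, keyed by author.
scores = {
    "author_a": [28.1, 29.6, 28.9],
    "author_b": [30.2, 31.0, 30.5],
}

# Lower mean complexity ranks higher: clearer writing wins the tournament.
leaderboard = sorted((mean(docs), author) for author, docs in scores.items())
for rank, (avg, author) in enumerate(leaderboard, start=1):
    print(rank, author, round(avg, 2))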

The following is a plot of the distribution of complexity for science writing versus non-science writing in the ART Science corpus:

[figure: complexity distributions for science and non-science writing]

The science writing niche is characterized by a mean reading grade level of 18, neutral to negatively polarized sentiment, and a near-complete absence of subjectivity. Science writing is also more resistant to file compression, meaning that information entropy is high, due to concise, coded language. These statistical features give us quite a lot to go on when using language style to predict the scientific status of a randomly selected web document. The notion that entropy is generally higher in science is corroborated by the perplexity measure, which quantifies how improbable the particular frequency distribution of words observed in a document is.
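
A minimal sketch of the compression-ratio proxy for entropy, using zlib (not necessarily the compressor used in the actual analysis):

import zlib

def compression_ratio(text):
    # Text that compresses poorly (ratio near 1) carries more information
    # per character, which the analysis associates with science writing.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

print(compression_ratio("the cat sat on the mat " * 20))  # very compressible
print(compression_ratio("Intermittent dynamics and hyper-aging in dense colloidal gels"))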

Developer Overview

Non-scientific writing typically exceeds genuine scientific writing in two important respects: in contrast to genuine science, non-science is often expressed in a less complex and more engaging writing style. We believe non-science writing occupies a more accessible niche that academic science writing should also occupy.

Unfortunately, writing styles intended for different audiences are predictably different. We show that computers can learn to guess the type of a written document (blog, Wikipedia, opinion, or traditional science) by first sampling a large variety of web documents and then classifying them using sentiment, complexity, and other variables. By predicting which of several niches a document occupies, we are able to characterize the different writing types and to describe strategies for remedying writing complexity.
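
A minimal sketch of that classification step, assuming each document has already been reduced to a feature vector of grade level, sentiment polarity, subjectivity, and compression ratio (the real pipeline may use different features and a different model; the numbers below are placeholders):

from sklearn.ensemble import RandomForestClassifier

# Placeholder feature vectors: [grade_level, polarity, subjectivity, compression_ratio]
X = [
    [18.0, -0.05, 0.05, 0.95],  # traditional science article
    [10.0, 0.30, 0.60, 0.80],   # blog post
    [13.0, 0.05, 0.20, 0.88],   # Wikipedia article
]
y = ["science", "blog", "wikipedia"]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[17.5, 0.0, 0.08, 0.93]]))  # expected: 'science'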

Multiple stakeholders benefit when science is communicated with lower-complexity expression of ideas. With lower-complexity science writing, knowledge would be more readily transferred into public awareness. Additionally, the digital organization of facts derived from journal articles would occur more readily, as successful machine comprehension of documented science would likely require less human intervention.

The impact of science on society is likely proportional to the accessibility of written work. Objectively describing the character of the different writing styles will allow us to prescribe how to shift academic science writing into a more accessible niche, where science can compete more aggressively with pseudo-science and blogs.

Similar projects.

scienceaccessibility's People

Contributors

mcgurrgurr, russelljjarvis


scienceaccessibility's Issues

Impose a word limit on scraped texts?

If the compression/decompression ratio is deemed a valuable metric, note that texts below a fixed word length compress extremely efficiently, and they bias this metric such that a small amount of low-entropy text has a deceptively small decompression ratio.

Small texts seemed to unfavorably bias a lot of the other text-statistics metrics too, so it's (arguably) in our interest to impose a word limit.
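
A minimal sketch of such a filter; the 200-word threshold and the sample texts are placeholders:

MIN_WORDS = 200  # placeholder threshold

def long_enough(text, min_words=MIN_WORDS):
    return len(text.split()) >= min_words

scraped_texts = ["short snippet", "a much longer scraped document " * 60]  # placeholders
corpus = [t for t in scraped_texts if long_enough(t)]
print(len(corpus))  # only the longer text survives the filter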

This is a GH issue

Hi @karlamoel,

GH issues could be a good, structured place to put critiques and guidance about the analysis/visualization of results.

I.e., you can link to a rendered version of a notebook (with embedded graphs), and then you can also read off the cell number of a figure you want to change, etc.

https://github.com/russelljjarvis/WordComplexityPython/blob/dev/Visualisation_search_terms_reading_levelGS.ipynb

You can add labels, assign people, set milestones, etc. (it's good as a project-management tool, regardless of the code-oriented context).

Execution environment may now be much simpler than described in README.md

When this project was initially conceived, much of it had complicated dependencies: the core program was intended to function as a web crawler/scraper, and there was also an intention to create an Octave-Python communication bridge. Luckily, since then GoogleScraper has been found to be a very powerful and convenient web scraper, and the attempt to communicate with MATLAB was abandoned.

It's possible that installing either
https://github.com/NikolaiT/GoogleScraper
or
https://github.com/russelljjarvis/GoogleScraper
(the only difference between these)

in BASH sufficiently resolves dependencies, and that Docker is no longer required. I have not bothered investigating this possibility.

Integration.

@danbroz, it might be possible to integrate this with some of your code.

I think it then should be integrated with this:

https://trello.com/c/YwzMk2dU/34-create-an-automatic-coa-generator-for-nsf-grants-based-on-pubmed-or-google-scholar

Finished at https://github.com/rgerkin/nsf-coa (set to private, I will need to invite you to see it).

Currently, it is a Python module and notebook with everything you need, provided you can pip install one thing and execute a notebook. We can discuss possible ways to deploy it to a website running e.g. Django.

Run text through a simplifier, then evaluate complexity before and after simplification. How different are the scores?

http://nlpprogress.com/english/simplification.html?fbclid=IwAR0B8G7zEmxVYbFWJMOyVTaHWkv4o9tTTFvVpsOcWrUQ777SXpM6KuM-8QI
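
A minimal sketch of that before/after comparison; the simplify function is a placeholder, since no particular simplification model from the page above is specified here:

import textstat

def simplify(text):
    return text  # placeholder: substitute a real text-simplification model here

original = "Phytochromobilin isomerization may proceed via a concerted or stepwise mechanism."
simplified = simplify(original)

print("before:", textstat.flesch_kincaid_grade(original))
print("after:", textstat.flesch_kincaid_grade(simplified))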

WComplexity application: web-based 'leaderboard' for a simple academic writing competition

Hi WCTeam

I realized that one cool application of the WC project could be to try to foster competition between scientific writers. It's possible to use WC metrics to evaluate existing scientific writing. Some other fields are advocating for such competitions in order to promote best practices.

To this end, I mined my PI and co-PI on Google Scholar citations and compared them to two benchmark references: 'The readability of scientific texts is decreasing over time', and the 'http://splasho.com/upgoer5/library.php' xkcd upgoer5 editor library corpus (the editor only permits the 1,000 most common English words). For good measure, I also added in one of Peter Marting's recent publication abstracts.

https://scholar.google.com/citations?user=GzG5kRAAAAAJ&hl=en&oi=sra
https://scholar.google.com/citations?user=xnsDhO4AAAAJ&hl=en&oe=ASCII&oi=sra

Results are at the bottom.

I then scored all 5 authors based on a metric that takes into account reading grade level, concision/redundancy, and objectivity.

The competition results confirmed a priori beliefs about text quality: 'The readability of scientific texts is decreasing over time' came first, the upgoer5 library came second, and then there were marginal differences between Rick and Sharon, as they are co-authors; however, Sharon beat Rick. Peter came last, and I am not sure why that is.

I also felt that the limited vocabulary of upgoer5 might harm concision, as it may need more simple words to convey the same information, so I also made a metric based on the ratio of unique words to total word count.
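
A minimal sketch of that unique-word ratio (the exact formula in the notebook may differ):

def unique_word_ratio(text):
    # Lower values suggest the text repeats a small vocabulary, as
    # limited-vocabulary prose like upgoer5 output tends to.
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

print(unique_word_ratio("the cat sat on the mat and the cat sat"))
print(unique_word_ratio("intermittent dynamics and hyper aging in dense colloidal gels"))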

Let me know if either of you guys have any resource links for authors and content that you think should join the competition.

The same metric might do a good job of identifying high-quality texts among those mined with the original broad search queries.

Create Travis CI tests that work

Edit .travis.yml until it works.

The idea is that the scraper shouldn't run on Travis CI, as this would violate its TOS.

Travis should download data from the OSF link:
https://osf.io/fuzgh/
and then run the analysis code over it, to demonstrate that the analysis software comes from a reproducible build that passes tests.
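
A sketch of the download step that could run in CI, assuming the OSF link resolves to a directly downloadable file and that an analysis entry point exists (the URL suffix and file name below are assumptions):

import requests

# Fetch the pre-scraped corpus from OSF instead of scraping on Travis.
resp = requests.get("https://osf.io/fuzgh/download", timeout=60)
resp.raise_for_status()
with open("scraped_corpus.p", "wb") as fh:  # placeholder file name
    fh.write(resp.content)

# ...followed by running the analysis code over scraped_corpus.p (placeholder entry point).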

Re-write calls to google scrape, employing the real module, as opposed to ...

...using self-introspective Python calls back into Python.

In other words, the file scrape.py is very hackish, as it was only intended as a quick and dirty proof of concept.

Using python os.system('python ...') etc. is a frowned-upon habit. Better to properly import the GoogleScraper library and modify the configuration dictionaries that define the web search queries instead.
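
A sketch of the intended replacement, following GoogleScraper's documented scrape_with_config interface; the config keys below are taken from its examples and are assumptions that may need adjusting for the installed version:

from GoogleScraper import scrape_with_config, GoogleSearchError

# Configuration dictionary defining the web search, instead of shelling out.
config = {
    "use_own_ip": True,
    "keyword": "how big is the universe",  # placeholder search query
    "search_engines": ["google"],
    "num_pages_for_keyword": 1,
    "scrape_method": "http",
}

try:
    search = scrape_with_config(config)
    for serp in search.serps:
        for link in serp.links:
            print(link.link)
except GoogleSearchError as err:
    print(err)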

Increase GS snippet word length.

GoogleScraper by default only collects 200-word snippets.
Extend GoogleScraper to cache more words by retrieving the full text resources behind the result URLs.
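
A sketch of that extension, using requests and BeautifulSoup to pull the visible text behind a result link (the real scraper may extract text differently):

import requests
from bs4 import BeautifulSoup

def fetch_page_text(url, max_words=2000):
    # Fetch the full page behind a search-result link and keep its visible
    # text, so more than the 200-word snippet is available for scoring.
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    words = soup.get_text(separator=" ").split()
    return " ".join(words[:max_words])

print(fetch_page_text("https://en.wikipedia.org/wiki/Readability")[:200])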

Benchmark texts

The analysis would be aided by benchmark data points (baseline values).

Without these, we can't know how well our language analysis tools are doing.

Ridiculous texts should also be included.

Issues

@mcgurrgurr

  1. Maybe connect plots one and two so they pop up together in one window as subplots, mainly for convenience of review more than anything.

  2. Any other final coding steps to ensure robustness and usability. I'm not sure exactly what this means, but basically feeling OK about the code being seen and used, and that nothing odd will come out of it.

Many Search Engines accessed by using DuckDuckGo's bang expansion trick.

I don't think GoogleScraper supports Yahoo, Twitter, and some other engines.

Support for other engines was hacked in by forming query strings prepended with '!y search_query' etc.

This hack might have unintended consequences, but it's better than trying to write a crawler from scratch.
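
A sketch of the bang-prefix trick; the prefixes below are DuckDuckGo's standard bangs, and how the scraper consumes the resulting strings is unchanged:

# Route queries to other engines through DuckDuckGo's "bang" syntax by
# prepending the engine's bang to the query string.
BANGS = {"yahoo": "!y", "twitter": "!twitter", "bing": "!b"}

def bang_query(engine, query):
    return BANGS[engine] + " " + query

print(bang_query("yahoo", "readability of scientific texts"))
# -> '!y readability of scientific texts', submitted to DuckDuckGo as-is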

Scraper is transiently broken

But getting text from HTML and PDF works when I use the file get_benchmark_corpus.py.

Rebuild the scraper so it borrows methods from get_benchmark_corpus instead.
