alan-turing-institute / defoe

Code to analyse books and newspapers data using Apache Spark.

License: MIT License
Combine the many duplicated query-specific helper functions (e.g. tuple unpackers) into a single helper module.
defoe/spark_utils.py open_stream creates and connects to a new instance of azure.storage.blob.BlobService for every "blob". This should be refactored so it's only done once per Azure container within which the blobs reside.
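A minimal sketch of the kind of caching the refactoring could use, assuming the legacy azure.storage.blob.BlobService API that defoe currently imports; the helper name and credential handling are illustrative, not existing defoe code. One BlobService can serve every blob in a container, so it only needs to be created once:
# Cache one BlobService per storage account rather than creating a new
# connection for every blob (names below are illustrative only).
from azure.storage.blob import BlobService

_services = {}  # account_name -> BlobService

def get_blob_service(account_name, account_key):
    """Return a cached BlobService for this account, creating it on first use."""
    if account_name not in _services:
        _services[account_name] = BlobService(account_name=account_name,
                                              account_key=account_key)
    return _services[account_name]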
defoe/alto|books|fmp/archive.py each define Archive, though Archive is not part of the data model (books/pages) but part of how the data is bundled. It should be possible, for example, to run queries over ALTO-compliant books that are not in ZIP files too. Complementarily, it should be possible to run queries over British Library Newspapers which are in ZIP files.
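One way to decouple the two concerns, sketched below with assumed class names rather than defoe's current API: the bundling layer only needs to list documents and open them as file-like objects, whether they sit in a ZIP file or loose on disk, and the books/pages data model is then built on top of whichever source is appropriate:
import os
import zipfile

class ZipArchive(object):
    """Documents bundled in a ZIP file."""
    def __init__(self, path):
        self.zip = zipfile.ZipFile(path)

    def names(self):
        return self.zip.namelist()

    def open(self, name):
        return self.zip.open(name)

class DirectoryArchive(object):
    """Documents stored loose in a directory."""
    def __init__(self, path):
        self.path = path

    def names(self):
        return os.listdir(self.path)

    def open(self, name):
        return open(os.path.join(self.path, name), 'rb')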
On CentOS 7:
$ python --version
Python 2.7.15 :: Anaconda, Inc.
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...
Remove azure packages and reinstall, using conda-forge:
$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService
OK
$ pip freeze | grep azure
azure==1.0.3
azure-common==1.1.16
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==3.0.2
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.20.3
On Urika:
$ python --version
Python 2.7.14 :: Anaconda custom (64-bit)
$ conda install -c anaconda --file requirements.txt
$ python
> from azure.storage.blob import BlobService
...
ImportError: No module named azure.storage.blob
...
$ pip freeze | grep azure > azure.txt
$ pip uninstall -y -r azure.txt
$ conda install -c conda-forge azure
$ python
> from azure.storage.blob import BlobService
ImportError: No module named azure.storage.blob
$ conda install -c conda-forge azure-storage
$ python
> from azure.storage.blob import BlobService
ImportError: cannot import name BlobService
$ pip freeze | grep azure
azure-common==1.0.0
azure-mgmt==0.20.1
azure-mgmt-common==0.20.0
azure-mgmt-compute==0.20.1
azure-mgmt-network==0.20.1
azure-mgmt-nspkg==1.0.0
azure-mgmt-resource==0.20.1
azure-mgmt-storage==0.20.0
azure-nspkg==1.0.0
azure-servicebus==0.20.1
azure-servicemanagement-legacy==0.20.1
azure-storage==0.36.0
See also #12
Clean up and standardise logging.
Relates to #5.
Allow credentials for multiple containers to be provided, both in code and in documentation.
The following queries use PreprocessWordType.LEMMATIZE by default:
Changing to another preprocessing type currently requires editing the source code. Allow the preprocessing type to be specified via the query configuration file instead.
Extend support for configurable preprocessing to all other queries across all data models.
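A sketch of how a query might read the preprocessing type from its configuration, assuming the configuration is handed to the query as a dict; the "preprocess" key, the import path and the lookup-by-name are assumptions, not confirmed defoe behaviour:
from defoe.query_utils import PreprocessWordType  # import path assumed

def get_preprocess_type(config):
    """Return the configured PreprocessWordType, defaulting to LEMMATIZE."""
    name = config.get("preprocess", "lemmatize").upper()
    return PreprocessWordType[name]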
Decide how errors are to be caught and handled e.g.
Identify opportunities to abstract a common data model out of books and newspapers. A model that, at minimum, allows the same query code to be run across both books and newspapers for a set of common queries would be a good start.
Currently defoe/papers/queries/target_concordance_collocation_by_date.py returns:
{
<YEAR>:
[
[<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
[<FILENAME>, <WORD>, <CONCORDANCE>, <OCR>],
...
],
<YEAR>:
...
}
Turn the lists into dicts with filename, keyword, concordance and ocr keys.
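A sketch of the proposed structure, using the same placeholders as above:
{
    <YEAR>:
    [
        {"filename": <FILENAME>, "keyword": <WORD>,
         "concordance": <CONCORDANCE>, "ocr": <OCR>},
        ...
    ],
    <YEAR>:
    ...
}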
Some queries create a single regexp containing every word being searched for and try to match this regexp against a string containing every word in the article/page.
defoe.papers.queries.articles_containing_words creates a single regexp with all the words being searched for:
interesting_words = [re.compile(r'\b' + word.strip() + r'\b', re.I | re.U)
                     for word in list(open(interesting_words_file))]
This regexp is then matched against all the words in the article:
interest = articles.flatMap(lambda (year, article):
    [((year, regex.pattern), 1)
     for regex in interesting_words
     if regex.findall(article.words_string)])
Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave results that diverged from running the non-Spark versions of these queries in alan-turing-institute/cluster-code (epcc-master), for example diseases.py. Inspecting the raw XML data and manually searching for words showed matches that the above approach misses, e.g. Small-Pox and Small-pox.
diseases.py uses a function:
lowalpha = re.compile('[^a-z]')

def normalize(word):
    return re.sub(lowalpha, '', word.lower())
and matches against each word, rather than the entire text of a page:
for page, word in book.scan_words():
    normalised = normalize(word)
    if normalised in diseases:
        finds[normalised].append(page.code)
Adopting this approach in defoe.books.queries.find_words_group_by_year|word gave the same results as for cluster-code (epcc-sparkrods).
Update word searchers to match on individual normalized words.
Move defoe.books.alto.utils, which has normalize, to defoe.utils.
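A sketch of the per-word approach applied to a defoe query; the article.words attribute and the function names are illustrative assumptions based on the excerpts above, not defoe's confirmed API:
import re

lowalpha = re.compile('[^a-z]')

def normalize(word):
    """Lower-case a word and strip everything outside a-z."""
    return re.sub(lowalpha, '', word.lower())

def matches(year, article, interesting_words):
    """Return ((year, word), 1) for each interesting word found in the article."""
    found = set()
    for word in article.words:  # attribute name assumed
        normalised = normalize(word)
        if normalised in interesting_words:
            found.add(normalised)
    return [((year, word), 1) for word in found]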
Run 2to3 to get recommendations for changes.
The key challenge will be handling Unicode. For example:
from cStringIO import StringIO
can be replaced by either of:
from io import BytesIO # for handling byte strings
from io import StringIO # for handling unicode strings
2to3 recommends changing:
try:
    return str(result[0])
except UnicodeEncodeError:
    return unicode(result[0])
to:
try:
    return str(result[0])
except UnicodeEncodeError:
    return str(result[0])
and
self.bwords = list(map(unicode, self.query(Page.words_path)))
to:
self.bwords = list(map(str, self.query(Page.words_path)))
https://python-future.org/compatible_idioms.html#unicode suggests something like:
from builtins import str as text
try:
    return str(result[0])
except UnicodeEncodeError:
    return text(result[0])
Returning concordance within a window of N words around each match would provide more user control over the volume of data returned for each match.
Update:
defoe/alto/queries/keyword_concordance_by_word.py
defoe/alto/queries/keyword_concordance_by_year.py
defoe/papers/queries/keyword_concordance_by_date.py
defoe/nzpp/queries/keyword_concordance_by_date.py
The following query:
defoe/papers/queries/target_concordance_collocation_by_date.py
provides a more complex example of returning concordance within a window of N words. It could be copied and simplified (to remove the use of a target word) and then ported to alto too.
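A sketch of extracting a window of words around a match, which is the core of what would need to be ported; WINDOW_SIZE and the argument names are illustrative:
WINDOW_SIZE = 5

def concordance(words, match_index, window=WINDOW_SIZE):
    """Return the matched word plus up to `window` words on either side."""
    start = max(0, match_index - window)
    end = min(len(words), match_index + window + 1)
    return words[start:end]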
Review all XPath queries and Python attribute names in object models.
Check that the Python developers' interpretation of what the XML elements and attributes mean is correct.
defoe/papers/article.py has:
@property
def words_string(self):
    """
    Return the full text of the article as a string. Remove all hyphens.
    This merges hyphenated words but may cause problems with subordinate
    clauses (The sheep - the really loud one - had just entered my office).
    """
    return ' '.join(self.words).replace(' - ', '')
Investigate whether it is possible to differentiate hyphenated words from dashes marking subordinate clauses.
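One possible heuristic to investigate, sketched here as an assumption rather than a confirmed fix: only rejoin fragments where the hyphen is attached to the preceding word, and leave free-standing " - " dashes, which usually mark subordinate clauses, untouched:
import re

# 'exam- ple' -> 'example', but 'sheep - the' is left alone.
HYPHEN_BREAK = re.compile(r'(\w)- (\w)')

def join_hyphenated(text):
    """Rejoin words split across a line break without touching dashes."""
    return HYPHEN_BREAK.sub(r'\1\2', text)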
Improve commenting in, and documentation about, the "lda_topics" query.
defoe.papers.queries.target_concordance_collocation_by_date has a hard-coded WINDOW_SIZE = 5 by default. Changing to another window size currently requires editing the source code. Allow the window size to be specified via the query configuration file instead.
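In the same spirit as the preprocessing sketch above, the window size could be read from the query configuration (the "window_size" key is an assumption):
WINDOW_SIZE_DEFAULT = 5

def get_window_size(config):
    """Return the configured window size, falling back to the default of 5."""
    return int(config.get("window_size", WINDOW_SIZE_DEFAULT))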
If defoe.papers.issue.Issue fails to parse an XML document, an object is still built with empty strings, lists etc. as fields and datetime.now() as the date. This can give misleading results when running queries over datasets containing such documents, e.g.
nohup spark-submit --py-files defoe.zip defoe/run_query.py ~/data/papers/files.all.txt papers defoe.papers.queries.articles_per_year -n 144 > log.txt &
cat results.yml
{1714: 115, 1715: 302, ..., 1950: 7155, 2019: 0}
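One possible direction, sketched with a hypothetical parse_ok flag that is not an existing defoe attribute: record whether parsing succeeded and filter failed issues out before counting, instead of letting the datetime.now() default surface as a bogus year such as 2019 above:
def parsed_ok(issue):
    """Keep only issues that parsed cleanly; 'parse_ok' is hypothetical."""
    return getattr(issue, 'parse_ok', True)

# e.g. issues = issues.filter(parsed_ok) before counting articles per year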
Notify developers of UCL-dataspring/cluster-code and UCL/i_newspaper_rods of "defoe".
Ingest stranger_danger_group.py