oduwsdl / aiu Goto Github PK


A library for interacting with web archive collections at Archive-It, Trove, Pandora, and more.

License: MIT License

Python 100.00%

archiveit metadata metadata-extraction webarchives


aiu's Issues

Provide CDX/C API support for Archive-It

This may provide information that we cannot get through our screen scraping and other API access.

Add support for Croatian Web Archive (HAW) Collections

HAW has collections that can, at a minimum, be scraped by this library to produce a list of URI-Ms and metadata.

The convert_LinkTimeMap_to_dict function does not handle unquoted relations

There are cases when non-memento relations show up in the Link header. The convert_LinkTimeMap_to_dict function does not need to handle these relations, and currently ignores them.

For https://ianmilligan.ca/, the Link header contains an unquoted relation that causes this function to raise an exception. The same problem occurs with https://homerfan.wordpress.com. Both appear to be WordPress sites.

In the example below, the relation "shortlink" is unquoted, which breaks the parser in this function.

# curl -I https://ianmilligan.ca/
HTTP/2 200
server: nginx
date: Thu, 06 Sep 2018 22:11:55 GMT
content-type: text/html; charset=UTF-8
strict-transport-security: max-age=86400
vary: Accept-Encoding
vary: Cookie
x-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
link: <https://wp.me/4cEB>; rel=shortlink
x-ac: 3.bur _bur
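
A minimal sketch of a Link header parser that accepts both quoted and unquoted parameter values may help illustrate one way to handle this case. The function name parse_link_header and both regular expressions are assumptions made for illustration; this is not the code in convert_LinkTimeMap_to_dict.

```python
import re

# One link-value: <URI> followed by zero or more ;name=value parameters.
# A value is either a quoted string (which may contain commas, as Memento
# datetimes do) or an unquoted token.
LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)'
)
PARAM = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_link_header(value):
    """Return a list of (uri, params) tuples from a Link header value."""
    links = []
    for link_match in LINK_VALUE.finditer(value):
        params = {}
        for param_match in PARAM.finditer(link_match.group(2)):
            name = param_match.group(1).lower()
            quoted, unquoted = param_match.group(2), param_match.group(3)
            params[name] = quoted if quoted is not None else unquoted
        links.append((link_match.group(1), params))
    return links


if __name__ == "__main__":
    print(parse_link_header('<https://wp.me/4cEB>; rel=shortlink'))
```

With this approach, calling code could keep only the entries whose rel value contains a memento-related relation and ignore the rest, such as shortlink.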

URIs returned from some PandoraCollection and PandoraSubject methods have an extra slash

There is an extra slash before the identifier in some URIs returned from the code.

For subjects:

In [1]: ps = PandoraSubject(83)

In [2]: ps.subject_uri
Out[2]: 'http://pandora.nla.gov.au/subject//83'

For collections:

In [1]: pc = PandoraCollection(12022)

In [2]: pc.collection_uri
Out[2]: 'http://pandora.nla.gov.au/col//12022'

Some web clients and web servers will not redirect these correctly and calling code should not have to create a workaround.
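
One way to avoid the doubled slash is to normalize both parts before joining them. The sketch below assumes the extra slash comes from a prefix that ends in a slash being joined to an identifier with another slash; the helper name build_uri is hypothetical and not part of aiu.

```python
def build_uri(prefix, identifier):
    """Join a URI prefix and an identifier with exactly one slash between them."""
    return prefix.rstrip("/") + "/" + str(identifier).lstrip("/")


if __name__ == "__main__":
    print(build_uri("http://pandora.nla.gov.au/subject/", 83))
    print(build_uri("http://pandora.nla.gov.au/col", 12022))
```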

get_seed_metadata fails if called in the wrong order

The following function assumes that a user will have called function load_seed_metadata before calling get_seed_metadata.

https://github.com/oduwsdl/archiveit_utilities/blob/bdaa349623705dc2c70e1dc17316d8b433327e5b/aiu/archiveit_collection.py#L507-L513

If called in the wrong order, then get_seed_metadata throws the following exception.

Traceback (most recent call last):
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/services/errors.py", line 27, in handle_errors
    return function_name(urim, preferences)
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/services/memento.py", line 195, in seeddata
    output['metadata'] = sr.seed_metadata()
  File "/Users/smj/Unsynced-Projects/MementoEmbed/mementoembed/seedresource.py", line 152, in seed_metadata
    metadata = self.aic.get_seed_metadata(self.urir)['collection_web_pages']
  File "/Users/smj/.virtualenvs/MementoEmbed/lib/python3.7/site-packages/aiu/archiveit_collection.py", line 506, in get_seed_metadata
    d = self.seed_metadata["seeds"][uri]
KeyError: 'seeds'
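
A common way to remove this kind of ordering dependency is to have the getter trigger the load itself. The sketch below illustrates that pattern under stated assumptions: SeedMetadataStore and its loader argument are hypothetical stand-ins, not the ArchiveItCollection implementation.

```python
class SeedMetadataStore:
    """Hypothetical stand-in for a collection class that loads seed metadata lazily."""

    def __init__(self, loader):
        # loader is any callable returning {"seeds": {uri: metadata}}
        self._loader = loader
        self._seed_metadata = None

    def load_seed_metadata(self):
        """Load the metadata once; later calls do nothing."""
        if self._seed_metadata is None:
            self._seed_metadata = self._loader()

    def get_seed_metadata(self, uri):
        """Return metadata for a seed URI, loading it first if needed."""
        self.load_seed_metadata()
        return self._seed_metadata["seeds"][uri]


if __name__ == "__main__":
    store = SeedMetadataStore(
        lambda: {"seeds": {"http://example.com/": {"title": "Example"}}}
    )
    # No explicit load_seed_metadata() call is required before this.
    print(store.get_seed_metadata("http://example.com/"))
```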

Provide OAI-PMH support for Archive-It collections

I'm not sure which collections would be a good test. We would need to plan this before implementation.

Update AIU to provide NLA collection metadata

Collections at NLA are stored at https://webarchive.nla.gov.au/collection

They can contain mementos or sub-collections. This makes getting metadata more complicated than getting it from Archive-It.

Each collection has a numeric identifier, such as 15003. The collection identifier exists at the end of the URL for that collection, e.g., https://webarchive.nla.gov.au/collection/15003. NLA collection 15003 has nine sub-collections.

Collection 13742 contains mementos instead of sub-collections. Our solution needs to be able to paginate and load additional content from the collection page. The "Show 10 More" button loads more content, but, to save on resources like RAM, we want to access the content without having to use a headless browser.

Fortunately, after some analysis with Chrome's developer tools, I've discovered that we can acquire a JSON representation of the collection via a URL like https://webarchive.nla.gov.au/bamboo-service/collection/13742 where we replace the last part of the path with the collection identifier.
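
As a rough sketch, the JSON URL can be built from the collection identifier and fetched with the standard library. The function names below are hypothetical, and nothing is assumed about the structure of the JSON that the service returns.

```python
import json
from urllib.request import urlopen

BAMBOO_COLLECTION_PREFIX = "https://webarchive.nla.gov.au/bamboo-service/collection/"


def collection_json_uri(collection_id):
    """Build the JSON URL for an NLA collection identifier."""
    return BAMBOO_COLLECTION_PREFIX + str(collection_id)


def fetch_collection_json(collection_id, timeout=30):
    """Fetch and decode the JSON representation of a collection."""
    with urlopen(collection_json_uri(collection_id), timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    print(collection_json_uri(13742))
```

Calling fetch_collection_json(13742) performs a network request, so it is left out of the example run above.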

Using ArchiveItCollection as an example, we will need to create another class that allows anyone to acquire this content via Python. Once this is done, it can be called from Hypercane or MementoEmbed, as needed.

Implement get_collectedby for PandoraCollection and PandoraSubject

I did not see a get_collectedby (or equivalent) for PandoraCollection or PandoraSubject. Such a function would need to descend into TEPs and acquire the organization names that created these collections. This way a resulting story would have access to this metadata for visualization.

AIU cannot process Pandora Subject 22

In [1]: from aiu import PandoraSubject

In [2]: ps = PandoraSubject(22)

In [3]: memento_list = ps.list_memento_urims()
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in get_metadata_from_tep(res, data)
    106     try:
--> 107         json_data = json.loads(res.text)
    108         #print(json_data)

/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/__init__.py in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    356             parse_constant is None and object_pairs_hook is None and not kw):
--> 357         return _default_decoder.decode(s)
    358     if cls is None:

/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py in decode(self, s, _w)
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()

/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/json/decoder.py in raw_decode(self, s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-69a1a780eaf3> in <module>
----> 1 memento_list = ps.list_memento_urims()

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in list_memento_urims(self)
    390         """Lists the memento URIMs of an NLA Trove collection."""
    391
--> 392         self.load_subject_metadata()
    393         #print(self.metadata["main"]["urims"])
    394         return self.metadata["main"]["urims"]

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in load_subject_metadata(self)
    358         if not self.metadata_loaded:
    359             soup = BeautifulSoup(self.firstpage_response.text, 'html5lib')
--> 360             self.metadata["main"] = extract_main_subject_data(soup,self.subject_id)
    361             #self.metadata["optional"] = extract_optional_collection_data(self.session.get(self.collection_json_uri))
    362             self.metadata_loaded = True

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in extract_main_subject_data(soup, subject_id)
    307         #print(tep_id)
    308         tep_json_uri = tep_json_prefix + tep_id
--> 309         tep_dic = get_metadata_from_tep(requests.get(tep_json_uri),data)
    310         #print(tep_json_uri)
    311         #Some tep urls give server error subject 12, https://webarchive.nla.gov.au/bamboo-service/tep/75101

~/.virtualenvs/hypercane-clean2/lib/python3.8/site-packages/aiu/pandora_collection.py in get_metadata_from_tep(res, data)
    115                 return tep
    116             else:
--> 117                 raise ValueError("Could not find 'Problem accessing /bamboo-service/collection/' string in response")
    118
    119         except IndexError as e:

ValueError: Could not find 'Problem accessing /bamboo-service/collection/' string in response
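
The traceback shows a TEP response that is neither valid JSON nor the expected error page. One possible way to tolerate such responses, sketched below, is to skip any body that does not decode instead of raising. The helper names are hypothetical, and whether skipping is the right behavior for aiu is an assumption rather than a confirmed fix.

```python
import json


def parse_tep_json(text):
    """Return the decoded JSON value, or None if the body is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def collect_tep_metadata(bodies):
    """Decode each response body, counting the ones that had to be skipped."""
    parsed, skipped = [], 0
    for body in bodies:
        data = parse_tep_json(body)
        if data is None:
            skipped += 1
        else:
            parsed.append(data)
    return parsed, skipped


if __name__ == "__main__":
    print(collect_tep_metadata(['{"id": 1}', "<html>Server error</html>", ""]))
```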


Add support for Internet Archive collections

Document the source code with readthedocs

AIU cannot process Pandora Subject 22
