Comments (8)
(I'm replying to one aspect at a time, to structure this!)
As I understand it, mods4pandas currently requires the XML files in some directory and cannot retrieve those via API request (like the query above) directly.
The purpose of mods4pandas is to transform METS XML files into a table (or, more specifically, a pandas DataFrame; Excel and CSV output were added later as a byproduct) for analysis, as far as that can be done with a hierarchical format like METS/MODS. Here is the latest result for the Digitized Collection of the SBB: https://zenodo.org/record/7716032
Querying an API is not really the purpose of the program.
That being said, I could try to figure out which API your above query is using - I honestly don't know at the moment.
from mods4pandas.
Counting TYPE="annotation" would be great but even just a total number of all logical elements (that mostly will be pages) for a bunch of books (i.e. multiple METS) would be nice to extract (or have as dataframe).
Logical elements are not pages at all and don't even map to pages 1:1. For example, in the METS you linked, LOG_0003 alone maps to over 400 pages in the mets:structLink:
<mets:structLink>
<mets:smLink xlink:to="PHYS_0001" xlink:from="LOG_0003" ></mets:smLink>
<mets:smLink xlink:to="PHYS_0002" xlink:from="LOG_0003" ></mets:smLink>
[... and so on...]
Sorry if this seems pedantic. I'm just trying to point out potential misunderstandings, because I want to understand the problem better :-)
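To make the one-to-many mapping concrete, here is a minimal sketch (not part of mods4pandas) that counts how many mets:smLink entries point away from each logical element. The sample document and the function name `pages_per_logical_element` are made up for illustration; only the METS and XLink namespaces and the smLink attributes come from the format itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Tiny made-up METS fragment; a real file has many more links.
SAMPLE_STRUCTLINK = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                       xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structLink>
    <mets:smLink xlink:to="PHYS_0001" xlink:from="LOG_0003"/>
    <mets:smLink xlink:to="PHYS_0002" xlink:from="LOG_0003"/>
    <mets:smLink xlink:to="PHYS_0002" xlink:from="LOG_0004"/>
  </mets:structLink>
</mets:mets>"""


def pages_per_logical_element(root):
    """Count smLink targets per logical element ID (hypothetical helper)."""
    xlink_from = "{%s}from" % NS["xlink"]
    counts = Counter()
    for sm_link in root.iterfind(".//mets:structLink/mets:smLink", NS):
        counts[sm_link.get(xlink_from)] += 1
    return counts


link_counts = pages_per_logical_element(ET.fromstring(SAMPLE_STRUCTLINK))
print(link_counts)
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(...)`.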
from mods4pandas.
TODOs for me:
- We currently don't have any statistics from the structMap implemented. It would be conceivable to have these, e.g. a count of all mets:div grouped by TYPE
- Figure out which API @ch-sander's original query is using (if any public API)
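The first TODO could look roughly like the following sketch. It is not the mods4pandas implementation, just one way to count mets:div elements per TYPE in the logical structMap with the standard library; the sample document and the name `count_logical_div_types` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Made-up example document with one logical and one physical structMap.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="chapter"/>
      <mets:div ID="LOG_0002" TYPE="annotation"/>
      <mets:div ID="LOG_0003" TYPE="annotation"/>
    </mets:div>
  </mets:structMap>
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def count_logical_div_types(root):
    """Count mets:div elements by TYPE, logical structMap only."""
    counts = Counter()
    for struct_map in root.iterfind("mets:structMap[@TYPE='LOGICAL']", METS_NS):
        for div in struct_map.iterfind(".//mets:div", METS_NS):
            counts[div.get("TYPE")] += 1
    return counts


div_type_counts = count_logical_div_types(ET.fromstring(SAMPLE_METS))
print(div_type_counts)
```

Each count could then become one column per work, which is the shape a DataFrame export would need.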
from mods4pandas.
These types come from the METS TYPE="LOGICAL" structMap, e.g.:
<mets:div ID="LOG_0008" TYPE="annotation"/>
We currently don't have any statistics from the structMap implemented. It would be conceivable to have these, e.g. a count of all mets:div grouped by TYPE. Then, given a complete export, you could
- narrow it down to your year range by the column originInfo-publication0_dateIssued
- sum by the future column mets_structMap-LOGICAL-div-annotation-count (name not final)
to sum up the count of annotation elements. These are logical elements, not necessarily physical pages - mapping those would be outside the scope of this project, I believe.
Would this answer the question?
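As a rough sketch of those two steps in pandas: the tiny DataFrame below is invented, the annotation count column does not exist yet (its name is the tentative one from above), and treating dateIssued as a plain year string is an assumption about the export.

```python
import pandas as pd

DATE_COL = "originInfo-publication0_dateIssued"
# Tentative column name from the comment above; not yet in any export.
ANNOTATION_COL = "mets_structMap-LOGICAL-div-annotation-count"

# Invented sample rows standing in for a complete export.
df = pd.DataFrame({
    DATE_COL: ["1650", "1666", "1701", None],
    ANNOTATION_COL: [3, 0, 5, 2],
})

# Assumption: the date column holds year strings; anything else becomes NaN.
years = pd.to_numeric(df[DATE_COL], errors="coerce")

# Step 1: narrow down to the year range (rows without a usable year drop out).
in_range = years.between(1600, 1700)

# Step 2: sum the annotation counts of the remaining works.
annotation_total = int(df.loc[in_range, ANNOTATION_COL].sum())
print(annotation_total)
```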
from mods4pandas.
Thanks @mikegerber. @cneud proposed something similar. Counting TYPE="annotation" would be great, but even just a total number of all logical elements (which will mostly be pages) for a bunch of books (i.e. multiple METS) would be nice to extract (or have as a dataframe). As I understand it, mods4pandas currently requires the XML files in some directory and cannot retrieve them directly via an API request (like the query above).
I assume there is no structured data (i.e. an integer number) for <mods:extent>416 ungezählte Seiten</mods:extent> like in this METS (this example seems to have all logical pages twice in the mets:structMap)? This would be ideal, but it depends on the catalogue and not on the XML etc.
Sorry for the confusion.
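For extent strings as simple as the one above ("416 ungezählte Seiten", i.e. 416 unnumbered pages), a number can be pulled out with a regular expression. This is only a sketch: the helper name `leading_page_count` is made up, and it deliberately gives up on anything that does not start with a plain integer, since cataloguing conventions for extent vary.

```python
import re


def leading_page_count(extent):
    """Return the integer an extent string starts with, or None.

    Hypothetical helper: only handles strings like "416 ungezählte Seiten".
    """
    match = re.match(r"\s*(\d+)\b", extent)
    return int(match.group(1)) if match else None


simple_extent = leading_page_count("416 ungezählte Seiten")
complex_extent = leading_page_count("[2] Bl., 30 S.")
print(simple_extent, complex_extent)
```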
from mods4pandas.
Logical elements are not pages at all and don't even map to pages 1:1.
Sorry for the confusion. So, although I'm interested in the physical pages of books, a digitized book might contain more images (or canvases) than pages: e.g. there's an image of the binding (which is not a page), or the same physical page was scanned twice by mistake (so 4 digital images represent only 2 physical pages), etc. I could live with this discrepancy, but I'd like to have a guess about the amount of "digital pages" (in the broad data-constrained sense sketched above) of, say, all books in the SBB digi collection printed in 1666.
from mods4pandas.
I'd like to have a guess about the amount of "digital pages" (in the broad data-constrained sense sketched above) of, say, all books in the SBB digi collection printed in 1666.
That is easily answered with the file above, here with pandas in Python:
https://gist.github.com/mikegerber/72e57c847486163f46de94a71987ef5c
But it's a bit unclear what numbers you really want?
a. Number of annotations? - This can be implemented and is on the TODO above
b. Number of pages for certain works? - This can already be done using the existing implementation and data (see example .ipynb)
c. Number of pages with annotations? - This is somewhat outside of the scope of this tool and the data it exports as it would require interpreting the METS structMap in a less straightforward way (follow the links in the structMap to resolve pages for annotations (=logical elements))
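To illustrate what option c. would involve, here is a sketch (outside of what mods4pandas does) that follows the mets:smLink entries from annotation elements to the physical pages they point to. The sample document and the function name `pages_with_annotations` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}
XLINK_FROM = "{%s}from" % NS["xlink"]
XLINK_TO = "{%s}to" % NS["xlink"]

# Made-up document: two annotations that share one physical page.
SAMPLE_ANNOTATED = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="chapter"/>
      <mets:div ID="LOG_0002" TYPE="annotation"/>
      <mets:div ID="LOG_0003" TYPE="annotation"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:to="PHYS_0001" xlink:from="LOG_0001"/>
    <mets:smLink xlink:to="PHYS_0005" xlink:from="LOG_0002"/>
    <mets:smLink xlink:to="PHYS_0005" xlink:from="LOG_0003"/>
    <mets:smLink xlink:to="PHYS_0006" xlink:from="LOG_0003"/>
  </mets:structLink>
</mets:mets>"""


def pages_with_annotations(root):
    """Return the set of physical IDs linked from annotation divs."""
    annotation_ids = set()
    for struct_map in root.iterfind("mets:structMap[@TYPE='LOGICAL']", NS):
        for div in struct_map.iterfind(".//mets:div[@TYPE='annotation']", NS):
            annotation_ids.add(div.get("ID"))
    return {
        link.get(XLINK_TO)
        for link in root.iterfind(".//mets:structLink/mets:smLink", NS)
        if link.get(XLINK_FROM) in annotation_ids
    }


annotated_pages = pages_with_annotations(ET.fromstring(SAMPLE_ANNOTATED))
print(sorted(annotated_pages))
```

Using a set means a page carrying several annotations is counted once.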
from mods4pandas.
a. Number of annotations? - This can be implemented and is on the TODO above
b. Number of pages for certain works? - This can already be done using the existing implementation and data (see example .ipynb)
c. Number of pages with annotations? - This is somewhat outside of the scope of this tool and the data it exports as it would require interpreting the METS structMap in a less straightforward way (follow the links in the structMap to resolve pages for annotations (=logical elements))
"b." was the most pressing issue and the notebook is already very useful. I will see if I can manage to filter in such a way that it returns those works like in my query above. Publication date 1666 was just an example. What I actually need is the total number of pages of all works that have at least one "annotation" recorded, belong to certain subject groups (e.g. "Naturwissenschaft..."), and were published before 1800. The desired count is for all pages of those works, not only the pages that carry an annotation.
Numbers for "a." and "c." would be even better, but in practice the librarians have not recorded all "annotations" but only the first (<10) to occur in a book, I was told. So, with the current data it's not overly meaningful (for me) but as the data grows, this would be fantastic.
Again, thank you so much!
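A sketch of such a combined filter in pandas, under stated assumptions: the rows are invented, `subject_group` and `page_count` are hypothetical column names (the real export may name them differently), and the annotation count column is the tentative, not yet implemented one mentioned earlier in the thread.

```python
import pandas as pd

DATE_COL = "originInfo-publication0_dateIssued"
ANNOTATION_COL = "mets_structMap-LOGICAL-div-annotation-count"  # tentative name
SUBJECT_COL = "subject_group"  # hypothetical column name
PAGES_COL = "page_count"       # hypothetical column name

# Invented sample rows, one per work.
works = pd.DataFrame({
    DATE_COL: ["1666", "1750", "1820", "1790"],
    SUBJECT_COL: ["Naturwissenschaften", "Theologie",
                  "Naturwissenschaften", "Naturwissenschaften"],
    ANNOTATION_COL: [2, 1, 4, 0],
    PAGES_COL: [416, 120, 300, 88],
})

year = pd.to_numeric(works[DATE_COL], errors="coerce")

selected = works[
    (year < 1800)
    & works[SUBJECT_COL].str.startswith("Naturwissenschaft")
    & (works[ANNOTATION_COL] >= 1)
]

# Total pages of the selected works, not just the annotated pages.
total_pages = int(selected[PAGES_COL].sum())
print(total_pages)
```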
from mods4pandas.