unt-libraries / portal-leading

The Portal to Texas History projects to support the IMLS-funded project, LIS Education and Data Science Integrated Network Group (LEADING)
This is a ticket to investigate and then start on a sample implementation of Frictionless Data for the datasets that we have been creating so far for this project. This will take a bit of digging to understand what this is and how it can be used for the project, but there is a ton of documentation and examples on GitHub.
https://www.youtube.com/watch?v=lWHKVXxuci0
https://carpentries-incubator.github.io/frictionless-data-agriculture/
https://github.com/Swiss-Polar-Institute/frictionless-data-packages
Those are a few links that are probably a good start. Feel free to put additional links in the comments of this ticket if you find some that are useful.
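If it looks promising, the frictionless-py package gives a quick way to try it against one of our existing files. A minimal sketch, assuming frictionless-py is installed and using a placeholder file name:

```python
# A minimal sketch using frictionless-py (pip install frictionless).
# "counties.csv" is a placeholder for one of our project datasets.
from frictionless import describe, validate

# Infer a schema (field names and types) and save it as a descriptor file
resource = describe("counties.csv")
resource.to_json("counties.resource.json")

# Validate the file against the inferred schema and report structural errors
report = validate("counties.csv")
print(report.valid)
```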
What I need is, for each UNTL-BS term, a range of keyword counts across all Portal records that share the browse term (e.g. for all the records that share the term "Agriculture," how many have 1 keyword, how many have 2 keywords, etc.). I have two possible avenues in mind for accomplishing this:
https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)+AND+dc_rights_access:public
&fq=dc_subject.UNTL-BS_facet:%22Agriculture%22
&facet=true
&facet.field=dc_subject.UNTL-BS_count
&facet.field=dc_subject.KWD_count
&facet.limit=-1
&facet.mincount=1
&rows=0
&wt=xml
In this case, a script would be needed to iterate through all 1,076 UNTL-BS terms, and extract and export the relevant count ranges.
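A rough sketch of that script, assuming the endpoint will return JSON when asked (wt=json; the query above uses xml) and with a placeholder term list standing in for the full set:

```python
# Sketch of option 1: one facet query per UNTL-BS term, collecting the
# keyword-count distribution for each. Assumes the endpoint honors wt=json;
# the term list here is a placeholder for all 1,076 UNTL-BS terms.
import csv
import time
import requests

SOLR = "https://digital.library.unt.edu/solrparse/raw/"
terms = ["Agriculture"]  # placeholder; load the full list from the untlbs json

def kwd_counts_for_term(term):
    params = {
        "q": "*:*",
        "fq": [
            "(aubrey_system:PTH OR untl_institution:UNTA) AND dc_rights_access:public",
            'dc_subject.UNTL-BS_facet:"%s"' % term,
        ],
        "facet": "true",
        "facet.field": "dc_subject.KWD_count",
        "facet.limit": -1,
        "facet.mincount": 1,
        "rows": 0,
        "wt": "json",
    }
    data = requests.get(SOLR, params=params).json()
    # Solr returns facet values as a flat [value, count, value, count, ...] list
    flat = data["facet_counts"]["facet_fields"]["dc_subject.KWD_count"]
    return dict(zip(flat[::2], flat[1::2]))

with open("untlbs-kwd-counts.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["untlbs_term", "kwd_count", "records"])
    for term in terms:
        for kwd_count, records in sorted(kwd_counts_for_term(term).items()):
            writer.writerow([term, kwd_count, records])
        time.sleep(1)  # pause between requests to be polite to the server
```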
https://digital.library.unt.edu/solrparse/raw/?q=*:*
&fq=(aubrey_system:PTH+OR+untl_institution:UNTA)
&fq=dc_rights_access:public
&facet.pivot=dc_subject.UNTL-BS_facet,dc_subject.KWD_count
&facet=true
&facet.field=dc_subject.UNTL-BS_facet
&rows=0
&wt=xml
&indent=true
In this case, a script would be needed only to extract and export the relevant information - but it might be more difficult because the layout of the data in this query's output is more complicated than the single-term query.
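To get a feel for the extraction, here is a sketch of flattening the pivot output, assuming the query above was rerun with wt=json and the response saved to a file:

```python
# Sketch of flattening option 2's pivot facets into rows. Assumes the query
# above was run with wt=json and saved to pivot.json; the pivot key mirrors
# the facet.pivot parameter.
import csv
import json

with open("pivot.json") as f:
    data = json.load(f)

pivots = data["facet_counts"]["facet_pivot"]["dc_subject.UNTL-BS_facet,dc_subject.KWD_count"]

with open("untlbs-kwd-pivot.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(["untlbs_term", "kwd_count", "records"])
    for term_node in pivots:
        for kwd_node in term_node.get("pivot", []):
            writer.writerow([term_node["value"], kwd_node["value"], kwd_node["count"]])
```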
Should be pretty straightforward.
We need a tabular (tsv or csv) file listing the 254 counties in Texas, along with a few other bits of data.
I suggest we include the following:
Note: You can verify against this page - https://texashistory.unt.edu/search/?q=&t=fulltext&sort=added_d&fq= then go to the county facet and type "TX". That should show you how Texas counties are represented in this system (Aubrey).
This can be added to the GitHub repository.
In the untl-bs folder let's go ahead and create a data directory and move the json files we are creating into that directory. This way we will have a code and a data directory to better sort things.
You can do this in the web-based GitHub interface, but it might be easier with GitHub Desktop or the command line.
If you haven't used git on the command line, I suggest working through this Software Carpentry Lesson - https://swcarpentry.github.io/git-novice/
Would be interesting to take a look at this tool and see if it could be helpful for visualizing some of this collection data in different ways.
Converting this pdf into a tabular (csv or tsv) dataset that can be used in something like R or Pandas would be useful.
Texas County Population every ten years - https://texasalmanac.com/sites/default/files/images/topics/ctypophistweb2010.pdf
It can be added to this github repository and we will need to add a citation to the source of the data.
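One possible route is the tabula-py wrapper (an assumption on my part; camelot or hand transcription would also work), sketched here:

```python
# Sketch: extract the population tables from the PDF with tabula-py
# (pip install tabula-py; needs Java). Whether the tables extract cleanly
# is an assumption; the output will likely need manual cleanup.
import pandas as pd
import tabula

tables = tabula.read_pdf("ctypophistweb2010.pdf", pages="all")

# Stitch the per-page tables together and write a single tsv for R/Pandas
combined = pd.concat(tables, ignore_index=True)
combined.to_csv("texas-county-population.tsv", sep="\t", index=False)
```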
Create a test Solr query that grabs all subject terms (with whatever qualifier) collocated with one specific UNTL-BS term.
Create a script (or modify an existing script?) to run the Solr query and export the resulting terms as individual documents in a corpus.
[prep for clustering/topic modeling, Question 3 Approach C]
I've already done this through edit.texashistory.unt.edu, but the numbers were not entirely accurate (the Portal filter excludes some records that do display on the Portal). I should replicate that same query using Solr.
I should be able to use pivot facets in Solr to examine the range of unique values in several key fields (i.e. dc_type, untl_collection, and untl_institution) across records that share specific UNTL-BS terms of interest. This will help me identify any terms that have a high actual or proportional occurrence count because of improper or imbalanced usage.
I think it makes sense to use the 2019-All dataset for our work.
This is from the TSLAC Public Library Accreditation and Statistics program - https://www.tsl.texas.gov/ldn/statistics
I suggest we create a thinner tabular dataset (tsv or csv) for use with R or Pandas.
From a quick look, a few fields stood out as the most important, with County being the field that we will use to match this against our other datasets.
This can go into this repository with a citation to the datasets that we used.
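A sketch of that thinning step; the file and column names are guesses, so check them against the actual 2019-All download first:

```python
# Sketch: trim the TSLAC 2019-All data down to the columns we need.
# File name and column names are assumptions; verify against the real
# download from https://www.tsl.texas.gov/ldn/statistics before running.
import pandas as pd

df = pd.read_excel("2019-All.xlsx")

keep = ["Library Name", "County", "Population Served"]  # hypothetical columns
thin = df[keep].copy()
thin["County"] = thin["County"].str.strip().str.title()  # normalize the join key
thin.to_csv("tslac-2019-thin.tsv", sep="\t", index=False)
```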
Review the json structure and load a copy into a repository folder called untl-bs, in a file with the date in the format untlbs-yyyy-mm-dd.json.
You can download the existing json data from this url.
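A small sketch of that step; since the url isn't shown here, the one in the code is a placeholder:

```python
# Sketch: save a dated copy of the untlbs json into the untl-bs folder.
# UNTLBS_URL is a placeholder for the url mentioned above.
from datetime import date
import requests

UNTLBS_URL = "https://example.org/untlbs.json"  # placeholder url

resp = requests.get(UNTLBS_URL)
resp.raise_for_status()
with open("untl-bs/untlbs-%s.json" % date.today().isoformat(), "wb") as f:
    f.write(resp.content)
```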
Similar to #5, this URL provides the Solr output from the Portal related to photographs held.
This should be converted into a tabular data format that can be used for subsequent analysis.
It should be noted that there will be non-Texas counties represented in this data, so they will need to be removed from the final dataset because we are focusing on Texas.
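A sketch of that cleanup, assuming the 254-county tsv from the earlier issue is available to check against (file and column names are placeholders):

```python
# Sketch: drop non-Texas counties from the parsed photographs data, using
# the 254-county list as the authority. File/column names are placeholders.
import pandas as pd

counties = pd.read_csv("texas-counties.tsv", sep="\t")
photos = pd.read_csv("photographs-by-county.tsv", sep="\t")

texas_only = photos[photos["county"].isin(set(counties["county"]))]
texas_only.to_csv("photographs-texas-only.tsv", sep="\t", index=False)
```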
Is there a way for me to generate collocation counts as percentages of a term's total occurrence count, rather than as straight numbers? That would help me identify the top most-collocated pairs across the entire Portal much more easily.
(Does the fact that, when querying in Solr, the UNTL-BS term being queried always appears as the top value in the facet list help at all? Is there a way of automatically dividing the collocation count for each other term by the count on the first term in the list?)
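I don't believe Solr will do the division itself, but it is trivial downstream; a sketch, assuming the facet list has been flattened to (term, count) pairs with the queried term first, as described above:

```python
# Sketch: turn collocation counts into percentages of the queried term's
# total occurrence count. The pairs below are made-up sample values; the
# queried term's own count always tops the facet list.
facets = [("Agriculture", 5200), ("Farms", 1300), ("Ranching", 910)]  # sample

total = facets[0][1]  # the queried term's own count
for term, count in facets[1:]:
    print("%s\t%.1f%%" % (term, count / total * 100))
```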
Chronicling America maintains a list of a large number of newspaper titles from around the US:
https://chroniclingamerica.loc.gov/search/titles/
You can limit to the 5,863 titles that they have records of from Texas - https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas
By adding format=json to the URL, you can get a json dataset that we can try to use for our work:
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&format=json
We can also adjust the number of rows in the results and the page that we are on.
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=1
https://chroniclingamerica.loc.gov/search/titles/results/?state=Texas&rows=100&format=json&page=2
One of the things to keep in mind when working with these kinds of systems is to "not be greedy." For example, we have 5,863 titles to grab here. A greedy option would be to set rows=5863 and try to grab everything in one request. That might be just fine, but it may also be an unplanned sort of request for their system.
I will generally break this kind of request into a shorter series of requests, say 100 or maybe 250 per request, with a few seconds of waiting between them. It might take a few minutes on my end, but it usually won't make the data providers upset because someone is hammering on their system when they don't need to.
It will be good to have the output finally end up in a tabular (tsv or csv) form that we can use in R or Pandas. We will need to do a bit of processing to get the info we want. From a quick look, I think we might want
You will also want to do some work with the place field in the records to try to extract the "County" for each title so that things align.
I suggest creating a directory in the repository called newspaper-titles and including both the dataset and any scripts used in making the final tabular dataset from the json.
One final suggestion: download the data as .json files and then write any tools/scripts so they work off of those files as input, instead of having the script make the requests to the database each time.
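Putting those suggestions together, here is a sketch of the polite, paged download that saves each page as its own .json file; I'm assuming the json response reports the full result count as totalItems:

```python
# Sketch: page through the Chronicling America Texas titles search politely,
# 100 rows at a time with a pause between requests, saving each raw page
# to its own .json file for later processing.
import time
import requests

BASE = "https://chroniclingamerica.loc.gov/search/titles/results/"
ROWS = 100
page = 1

while True:
    resp = requests.get(BASE, params={"state": "Texas", "rows": ROWS,
                                      "format": "json", "page": page})
    resp.raise_for_status()
    with open("newspaper-titles/titles-page-%03d.json" % page, "w") as f:
        f.write(resp.text)
    # totalItems is assumed to be the full result count in the response
    if page * ROWS >= resp.json()["totalItems"]:
        break
    page += 1
    time.sleep(3)  # a few seconds between requests so we don't hammer the API
```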
The following URL shows the Solr output of the Texas Digital Newspaper Program (TDNP)
This can be parsed into a tabular dataset that we can use for further analysis.
While most of the values are going to be Texas counties, they should be checked because there is also the possibility that non-Texas county information could be included.
Once I've got the list of occurrence counts from issue #13 I can run a basic statistical analysis: min/max, mean, median, and mode; standard deviation; interquartile range; and/or a five-number summary displayed as a box-and-whisker plot.
[Question 1, Approach A]
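A quick sketch of those summary statistics over the issue #13 counts, assuming they were exported one number per line:

```python
# Sketch: basic statistics over the occurrence counts from issue #13,
# assuming they were exported one count per line to occurrence-counts.txt.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.read_csv("occurrence-counts.txt", header=None)[0]

print(counts.describe())  # count, mean, std, min, quartiles, max
print("mode:", counts.mode().tolist())
print("IQR:", counts.quantile(0.75) - counts.quantile(0.25))

counts.plot.box()  # five-number summary as a box-and-whisker plot
plt.savefig("occurrence-counts-boxplot.png")
```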
Starting off with Jupyter Notebooks as a framework for building, documenting, and sharing Python code, let's take a look at building some choropleth maps of the counties in Texas using the different datasets we have created.
Some example Python for this:
https://docs.bokeh.org/en/latest/docs/gallery/texas.html
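A sketch patterned on that gallery example; it needs bokeh plus its sample county shapes (installed via the bokeh sampledata command), and the values dict is a placeholder for one of our per-county datasets:

```python
# Sketch of a Texas county choropleth, patterned on the linked Bokeh gallery
# example. Requires bokeh's sample data for the county outlines; the values
# dict is a placeholder to be replaced by one of our real datasets.
from bokeh.models import LinearColorMapper
from bokeh.palettes import Viridis6
from bokeh.plotting import figure, show
from bokeh.sampledata.us_counties import data as counties

tx = [c for c in counties.values() if c["state"] == "tx"]
values = {c["name"]: 1.0 for c in tx}  # placeholder per-county values

source = dict(
    x=[c["lons"] for c in tx],
    y=[c["lats"] for c in tx],
    name=[c["name"] for c in tx],
    value=[values.get(c["name"], 0.0) for c in tx],
)

p = figure(title="Texas counties", tools="hover",
           tooltips=[("County", "@name"), ("Value", "@value")])
p.patches("x", "y", source=source, line_color="white", line_width=0.5,
          fill_color={"field": "value",
                      "transform": LinearColorMapper(palette=Viridis6)})
show(p)
```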
Might be useful when we get to sharing: https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/01-sharing-github/ (the first part is about setting up an account/repo, which we have already done).
For each of the 16 main headings, and for any additional terms on the terms of interest list, run a Solr query to determine how many of these terms owe their high occurrence counts to collocation with broader or narrower terms within the same branch. This may help eliminate some terms from later analysis.
Once I've got the list of counts from issue #13, I can use that to redo my original spread vs. usage analysis:
For each of the 14 main UNTL-BS categories:
Calculate the occurrence count and number of narrower terms as a percentage of the total and compare the two
Starting on the first level of narrower terms:
Determine which "branches" of the UNTL Browse Structure are "top-heavy" (wider terms have higher occurrence counts than narrower terms) or "off-balance" (one narrower term has a high occurrence count relative to its siblings).
For now, I'll have to do the analysis in Excel.
[Question 1, Approach A]
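The percentage comparison itself is simple enough to script once we move off Excel; a sketch with made-up placeholder counts:

```python
# Sketch of the category-level comparison. The counts here are made-up
# placeholders, not real Portal numbers.
categories = {
    # category: (occurrence_count, narrower_term_count)
    "Agriculture": (5200, 48),
    "Education": (3100, 61),
}

total_occ = sum(o for o, _ in categories.values())
total_nar = sum(n for _, n in categories.values())

for name, (occ, nar) in categories.items():
    # A big gap between the two percentages flags a top-heavy branch
    print("%s\toccurrence %.1f%%\tnarrower terms %.1f%%"
          % (name, occ / total_occ * 100, nar / total_nar * 100))
```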
We need to create datasets for the following UNTL-BS fields (supplied below by @50jonesh).
We want the output tab delimited with:
aubrey_identifier
display_title
dc_description
dc_subject
We know that there will be an uneven number of columns per row across the dataset, and that shouldn't cause an issue for the topic modeling.
We will give Mallet a shot first for topic modeling - https://mimno.github.io/Mallet/topics.html
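A sketch of producing that file from a saved Solr JSON response; the input file name is a placeholder, and multi-valued dc_subject entries become the extra trailing columns noted above:

```python
# Sketch: write the four fields above to a tab-delimited file for Mallet.
# solr-docs.json is a placeholder for a saved wt=json Solr response; rows
# get one extra column per dc_subject value, hence the uneven widths.
import csv
import json

with open("solr-docs.json") as f:
    docs = json.load(f)["response"]["docs"]

with open("untlbs-topics.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for doc in docs:
        subjects = doc.get("dc_subject", [])
        if isinstance(subjects, str):
            subjects = [subjects]
        writer.writerow([doc.get("aubrey_identifier", ""),
                         doc.get("display_title", ""),
                         doc.get("dc_description", "")] + subjects)
```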
Write and run a Solr query to capture a list of all values visible on the Portal coming from a field we think is likely to have a lot of term redundancy with UNTL-BS terms (e.g. Coverage). Export the list of terms for comparison with the UNTL-BS terms.