

mag_sample

This repository is an extract from a project that uses data from Microsoft Academic Graph (MAG) and ProQuest. The programs run on a remote computer where the data are stored outside the repository. The paths to the data are defined in src/analysis/setup.R and src/dataprep/helpers/variables.py.

Directory structure

  • src/: data preparation and record linking; analysis of the publication careers of scientists.
  • output/: destination for tables and figures generated in src/.
  • snapshots/: contains files to reproduce the environments (for now, yml files for conda).

Contents

directory src/dataprep

The pipeline.sh script consecutively calls the scripts for

  • setting up the sqlite database with MAG data
  • preparing additional tables for analysis
  • loading the ProQuest data into the database
  • linking records between MAG and ProQuest
  • making some reports about the data and linking quality

directory src/analysis

The pipeline.sh script consecutively calls the scripts for

  • comparing graduation trends in ProQuest data with official statistics
  • assessing the quality of current links
  • plotting publication dynamics over the career
  • analyzing career duration and becoming an advisor later in the career

Supporting scripts are in setup.R and in helpers/.

Notes

See open issues for some features that are currently lacking.

  • The Python helpers are stored in a different location; we need a clean and simple way to make them accessible across the repository.
  • There are some redundancies between the R scripts in analysis and dataprep because the scripts originally come from two different repositories.

mag_sample's People

Contributors: f-hafner, chrished, monadap

Forkers: qcx201

mag_sample's Issues

improve first/last name split by using language-specific rules

For example, for Spanish we currently have:

firstname : juan
lastname : gonzalez
middlename : eugenio iglesias

However, the main last name is "iglesias" (the first last name).

Proposal: use https://nationalize.io to predict which country/language a name is from and implement specific rules for those.

Caveat: for Spanish names, sometimes people give just the first last name and sometimes both, so it is not obvious how to handle this automatically.
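A minimal sketch of what such a rule could look like once a name is predicted to be Spanish. The function name and the assumption that the last two tokens are the paternal and maternal surnames are mine, not from the codebase:

```python
def split_spanish_name(full_name):
    """Split a Spanish name, treating the first surname as the main last name.

    Assumes the final two tokens are the paternal and maternal surnames;
    real data would first need the language prediction step described above.
    """
    parts = full_name.lower().split()
    if len(parts) < 3:
        return {"firstname": parts[0], "lastname": parts[-1], "middlename": ""}
    return {
        "firstname": parts[0],
        "lastname": parts[-2],  # first (paternal) surname is the main last name
        "middlename": " ".join(parts[1:-2] + [parts[-1]]),
    }

print(split_spanish_name("Juan Eugenio Iglesias Gonzalez"))
```

For the example above this yields lastname "iglesias" instead of "gonzalez". It does not solve the caveat that some people report only one surname.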

Adapt to new modular structure

pipeline.sh will not work as-is: calling a Python script that sits in a subdirectory but imports a module outside that subdirectory will fail.
To fix this, execute the scripts as modules:

  1. add a file __init__.py to all directories where there is a python script
  2. in the pipeline, adjust the code for calling scripts from "python3 /some/path/script.py" to "python3 -m some.path.script"

See also https://peps.python.org/pep-0338/
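The steps above can be demonstrated end to end; this sketch builds a throwaway package (all names made up) and runs a nested script with `-m`, which is what makes the top-level imports resolve:

```python
import os
import subprocess
import sys
import tempfile

# Build a minimal package layout: pkg/__init__.py, pkg/helpers.py,
# pkg/sub/__init__.py, pkg/sub/script.py (names are illustrative only)
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "pkg", "sub"))
for d in ("pkg", os.path.join("pkg", "sub")):
    open(os.path.join(tmp, d, "__init__.py"), "w").close()
with open(os.path.join(tmp, "pkg", "helpers.py"), "w") as f:
    f.write("GREETING = 'hello from helpers'\n")
with open(os.path.join(tmp, "pkg", "sub", "script.py"), "w") as f:
    f.write("from pkg.helpers import GREETING\nprint(GREETING)\n")

# Run as a module from the project root: "python3 -m pkg.sub.script" puts the
# root on sys.path, so the cross-directory import works
result = subprocess.run([sys.executable, "-m", "pkg.sub.script"],
                        cwd=tmp, capture_output=True, text=True)
print(result.stdout.strip())
```

Running the same file as `python3 pkg/sub/script.py` would instead fail with `ModuleNotFoundError`, which is the failure mode pipeline.sh currently hits.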

Affiliation info: cites, pubs and fields

I think we can build the tables for paper/citation counts at the affiliation-year-field0 level, as well as the keyword list at the same level, in one go:

  1. Create temporary table with unique PaperId, Year, AffiliationId
  2. summarise paper outcomes at AffiliationId-Field0-Year level
  3. summarise paper keywords at AffiliationId-Field0-Year level

To consider/note

  • We will replace the table affiliation_outcomes, which is only at the AffiliationId level, with the output from step 2 above. I don't know offhand where we use this old table as an input; we should check.
  • There are two ways of assigning papers to "departments" at institutions: from the author's main field, or from the paper's main field. The latter is probably more accurate.
  • The output from step 3 additionally contains the columns FieldOfStudyId and Score. We could use the floor of the score as a frequency weight to calculate tf-idf (but it is not clear how exactly to implement frequency weights).
  • Not sure how to add the count of researchers in this query.

Here are the queries, which we can use to replace the query in affiliation_outcomes.py.

-- ## 1. create temp table 
CREATE TEMP TABLE paper_affiliation_year AS 
SELECT DISTINCT AffiliationId, Year, PaperId
FROM (
    SELECT a.AuthorId, a.AffiliationId, a.Year, b.PaperId
    FROM AuthorAffiliation a -- ## if an author has 2 main affiliations in the same year, we count their papers at both institutions
    INNER JOIN (
        SELECT PaperId, AuthorId, Year
        FROM PaperAuthorUnique
        INNER JOIN (
            SELECT PaperId, Year
            FROM Papers
        ) USING(PaperId)
    ) b
    ON a.AuthorId=b.AuthorId AND a.Year=b.Year
    -- reduces size of the data set 
    INNER JOIN (
        SELECT PaperId
        FROM paper_outcomes
    ) USING(PaperId)
);

CREATE INDEX idx_paper_temp ON paper_affiliation_year (PaperId);

-- ## 2. create table with citation/paper counts
CREATE TABLE affiliation_outcomes AS  -- this is already defined, where? can we replace it? where do we use it?? 
SELECT COUNT(PaperId) AS PaperCount
    , SUM(CitationCount_y10) AS CitationCount_y10
    , AffiliationId
    , Year 
    , Field0
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, CitationCount_y10 
    FROM paper_outcomes 
) USING(PaperId)
INNER JOIN ( -- each field is unique per paper, so it is ok to join only here 
    SELECT PaperId, Field0 
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year;

CREATE UNIQUE INDEX idx_affo_AffilYearField ON affiliation_outcomes (AffiliationId, Year, Field0);


-- ## 3. table with keywords
CREATE TABLE affiliation_fields AS 
SELECT SUM(Score) AS Score
    , FieldOfStudyId
    , AffiliationId 
    , Field0
    , Year 
FROM paper_affiliation_year 
INNER JOIN (
    SELECT PaperId, FieldOfStudyId, Score
    FROM PaperFieldsOfStudy 
    INNER JOIN (
        SELECT FieldOfStudyId 
        FROM FieldsOfStudy 
        WHERE level < 2 -- choose appropriate level 
    ) USING(FieldOfStudyId)
) USING(PaperId)
INNER JOIN ( -- Field0 from the paper's main field, as in step 2
    SELECT PaperId, Field0
    FROM PaperMainFieldsOfStudy
) USING(PaperId)
GROUP BY AffiliationId, Field0, Year, FieldOfStudyId;

-- uniqueness must include FieldOfStudyId: the table has one row per field
CREATE UNIQUE INDEX idx_afff_AffilYearField ON affiliation_fields (AffiliationId, Year, Field0, FieldOfStudyId);

improve advisor linking

  • use comparator functions as for grants
  • restrict the MAG sample to be linked to people at US institutions

Try one field first, assess the difference in performance, then do the rest.

understand low advisor links in some fields, and fix

see output/quality_linking_advisors.pdf

For instance, goid 2449294030: its first advisor only has an affiliation in 2009; among others, he has a science paper without any affiliation reported.

Check some more examples of non-linked theses in these fields.

Package src/helpers

(This is a copy from chrished/MAG/#2)

The helpers/ directory is used in most of the scripts above. helpers/ is easy to access when it is a subdirectory, but once the scripts live in separate subdirectories, the proper solution is to package the scripts in helpers/. There are two options:

  • use PyPI, but then the package is public
  • host a private PyPI server, for instance with devpi. devpi seems straightforward to use, but I don't know whether sharing packages across contributors will be easy.

Main benefits:

  • tidier code, cleanly separated and probably easier to maintain
  • the packaged helpers could be used as well in other projects (e.g., researcher_gender)

Clean MAG FOS for NA etc

Some abstracts contain text like "abstract not available", and currently a MAG field of study is predicted for those. We need to filter them out on the fly, or make a separate list of the records whose field is NA.
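One way to do the filtering is a predicate applied before field prediction. A sketch; the exact placeholder phrases are an assumption, not an exhaustive list from the data:

```python
import re

# Placeholder phrases seen in abstracts; extend this pattern as more show up
PLACEHOLDER = re.compile(r"abstract\s+not\s+available|no\s+abstract",
                         re.IGNORECASE)

def is_missing_abstract(text):
    """True if the abstract is empty or only a placeholder phrase."""
    return text is None or not text.strip() or bool(PLACEHOLDER.search(text))
```

Records flagged by this predicate would be skipped for field prediction and collected in the NA list instead.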

Correct type of AuthorId in links_nsf_mag

GrantID|AuthorId|Position|firstname_similarity|lastname_similarity
0000050|2440992798.0|0|1.0|1.0
0000135|686086669.0|1|1.0|1.0
0000433|327948395.0|1|1.0|1.0

AuthorId has a decimal point. This is wrong; it should be a (big) integer.

I will now check where this happens and fix it.
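The trailing `.0` typically appears when a merge introduces missing values and pandas upcasts the integer column to float. If that is the cause here, a sketch of the fix with pandas' nullable integer dtype (values taken from the snippet above):

```python
import pandas as pd

# AuthorId became float (note the .0), most likely because a merge introduced
# NaN; the nullable Int64 dtype restores big integers while keeping missing values
df = pd.DataFrame({"AuthorId": [2440992798.0, 686086669.0, None]})
df["AuthorId"] = df["AuthorId"].astype("Int64")
print(df["AuthorId"].tolist())  # [2440992798, 686086669, <NA>]
```

Writing this column back to SQLite then stores proper integers instead of floats.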

main field per author-year?

Do we need to calculate the main field1 and field0 for each year an author is active?

It may be helpful for #27, but I am not sure it is worth the time right now.

rename "institution" column in author_info_linking

d1c3076 changes the column names in table author_info_linking (created by the script of the same name in main/prep_mag) to the various meanings of institution. We should also rename the "institution" column to "first_main_institution" for consistency, and change this in the linking scripts.

fix DeprecationWarning: escape sequences for normalize_string()

DeprecationWarning for normalize_string():

src/dataprep/helpers/functions.py:41
  /home/flavio/projects/mag_sample/src/dataprep/helpers/functions.py:41: DeprecationWarning: invalid escape sequence \w
    s = s.replace("[^\w\d]", " ", regex = True)

src/dataprep/helpers/functions.py:42
  /home/flavio/projects/mag_sample/src/dataprep/helpers/functions.py:42: DeprecationWarning: invalid escape sequence \s
    s = s.replace("\s+", " ", regex = True)

these should be replaced with r"[^\w\d]" and with r"\s+" respectively.
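A sketch of the corrected lines, assuming `s` is a pandas Series; the surrounding function body is reconstructed from the warning output, not copied from functions.py:

```python
import pandas as pd

def normalize_string(s):
    """Lowercase, strip punctuation, and collapse whitespace in a Series."""
    s = s.str.lower()
    s = s.replace(r"[^\w\d]", " ", regex=True)  # raw string: no invalid \w escape
    s = s.replace(r"\s+", " ", regex=True)      # raw string: no invalid \s escape
    return s.str.strip()

print(normalize_string(pd.Series(["Foo--Bar  Baz"])).iloc[0])  # foo bar baz
```

The behavior is identical; the raw-string prefix only stops Python from interpreting `\w` and `\s` as (invalid) string escapes before the regex engine sees them.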


in proquest, use new mag fields at level 0

Currently we select based on pq_fields_mag, which is imputed in load_proquest/correspond_fieldofstudy.py. Now that we have the predicted MAG fields, we could use them for selecting on field level 0.

This query computes a table similar to pq_fields_mag:

create table pq_fields_mag_future as 
select goid, ParentFieldOfStudyId AS FieldOfStudyId, sum(score) as score  
from (
    select goid, fieldofstudyid, ParentFieldOfStudyId, score 
    from pq_magfos 
    -- join parents at level 0
    inner join (
        select childfieldofstudyid, ParentFieldOfStudyId
        from crosswalk_fields
        where parentlevel = 0
    ) as crosswalk 
    on (pq_magfos.fieldofstudyid = crosswalk.childfieldofstudyid)
)
group by goid, parentfieldofstudyid

It could replace pq_fields_mag if we use the rank by score in place of the position column in the current table.
It would break some things: setup_linking may need to be adjusted, as well as link/topic_similarity_functions.py.

data loading from db for dedupe: tuples not as expected?

Not all records have their year_range as the expected tuple. This gives errors in the training and linking steps for NSF-MAG: we expect the year_range from NSF to be (year, ) and the year_range from MAG to be (startyear, endyear). But sometimes the tuples do not have the expected length, and sometimes other types are passed (see the traceback below).

Traceback (most recent call last):
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/core.py", line 137, in __call__
    self.fieldDistance(record_pairs)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/core.py", line 148, in fieldDistance
    distances = self.data_model.distances(records)
  File "/home/christoph/anaconda3/envs/science-career-tempenv/lib/python3.9/site-packages/dedupe/datamodel.py", line 84, in distances
    distances[i, start:stop] = compare(record_1[field],
  File "/home/christoph/mag_sample/src/dataprep/helpers/comparator_functions.py", line 167, in compare_range_from_tuple
    elif len(b) == 1:
TypeError: object of type 'int' has no len()

Most probably this is a problem with loading the data from the database into the form dedupe expects.
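One defensive option is to normalize year_range at load time, before records reach the comparator. A sketch; the helper name is mine, and the expected shapes are the ones described above:

```python
def as_year_range(value):
    """Coerce a DB value into the tuple shape the comparator expects.

    NSF records should yield (year,), MAG records (startyear, endyear);
    bare ints and lists coming from the DB layer are normalized here.
    """
    if value is None:
        return None
    if isinstance(value, int):
        return (value,)      # a bare int caused the TypeError in the traceback
    return tuple(value)      # lists and tuples pass through as tuples
```

Applying this to every year_range during data loading would prevent the `object of type 'int' has no len()` failure in compare_range_from_tuple.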

compute topic similarity more efficiently?

Currently we iterate over each graduation year, and in each iteration we load a window of data spanning +/- 5 years into memory. If we computed the similarity for two or more neighboring graduation years at once, we would only have to add the data for two additional years. This could speed up the calculations; the trade-off is higher memory use.
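The batching idea can be sketched as follows; the generator groups graduation years in pairs and returns the single year range that covers both +/- 5 year windows (the function and its signature are illustrative, not from the codebase):

```python
def windows_in_pairs(years, half_width=5):
    """Yield (pair_of_years, years_to_load) so two neighboring graduation
    years share one in-memory window instead of loading two overlapping ones."""
    for i in range(0, len(years), 2):
        pair = years[i:i + 2]                 # one or two graduation years
        lo = pair[0] - half_width
        hi = pair[-1] + half_width
        yield pair, range(lo, hi + 1)         # covers both windows at once
```

For consecutive years the combined window is only two years wider than a single one (12 vs. 11 years of data for half_width=5), which is the memory cost of halving the number of loads.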

update conda env

on main branch:

  • patch yml file from flavio/link_grants
  • add the following dependencies: geopy, xlrd, ipykernel, openpyxl, pandoc

Biology write links process stops: split sample automation

Biology linking stopped during the create-links script; to solve this, we split the data into two subsamples. This works mechanically, but it seems to create a break in the linking rate.

  • Check the cause of the break in the linking rate (compare splits at different years), and check whether selecting the highest link score is the issue.
  • Tidy up the biology split and make the whole process automatic.

improve graduates linking

  1. extend the time range up to 2015
  2. compare the similarity of document titles: https://stackoverflow.com/questions/8897593/how-to-compute-the-similarity-between-two-text-documents. Idea: by definition, the people we link publish something, and it is likely related to the topic of their dissertation.
  3. condition the MAG sample on people employed in the US at some point / at the beginning of their career? This is faster and likely more precise (less noise from unrelated entities). It may reduce the size of the linked sample, but we are interested in the US anyway.
  4. save the training and settings files in a folder specific to graduates
  5. create benchmarking as described below
  6. keywords comparator: count the number of overlaps / dummy for at least one overlap
  7. title comparator: larger n-grams? stemming before tf-idf? -- use the benchmark for intuition
  8. extend the list of first papers: currently those in the first 7 years of publication; at most 10
  9. increase the sampled fields on the MAG side for people whose field can fall into different major fields (biology vs. chemistry)
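For point 2, a toy version of the tf-idf plus cosine-similarity approach from the linked answer, kept pure-Python so it is self-contained (in the pipeline one would use a library such as scikit-learn):

```python
import math
from collections import Counter

def tfidf_cosine(doc_a, doc_b, corpus):
    """Toy tf-idf cosine similarity between two titles (illustrative only)."""
    def tokens(doc):
        return doc.lower().split()

    n = len(corpus)
    df = Counter()                    # document frequency of each token
    for doc in corpus:
        df.update(set(tokens(doc)))

    def vector(doc):
        tf = Counter(tokens(doc))     # smoothed idf avoids division by zero
        return {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}

    va, vb = vector(doc_a), vector(doc_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0
```

A dissertation title and the title of a linked candidate's first papers sharing topical words would score above unrelated pairs, which is the signal point 2 wants to exploit.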

PaperCount in author_fields_detailed

Something I came across when calculating topic similarity: what is PaperCount in author_fields_detailed created and used for? Should it be dropped because it is calculated elsewhere?
