Comments (13)

jacquerie commented on July 17, 2024

The original issue is fixed by inspirehep/inspire-next#1984. The new issue that @fschwenn brought up should probably go in its own issue in ...hepcrawl?

from inspire.

jacquerie commented on July 17, 2024

Forgot to cc @kaplun, but maybe @tsgit knows the answer too.

tsgit commented on July 17, 2024

woah -- that's a bad bibupload from

1094156.20140106152404 375595 hilu mode:replace file:/opt/cds-invenio/var/tmp-shared/bibedit-cache/bibedit_record_eV_TNm_1094156_32.xml

in 2014. It is not limited to the xme format but is also publicly visible in the xm format, and those are supposed to be hidden files.

how many are there?
@jacquerie @kaplun

tsgit commented on July 17, 2024

hilu is a DESY cataloguer -- but not a developer, so something in the toolchain must have been wrong at the time

In [7]: from invenio.search_engine import get_collection_reclist

In [8]: heprecs = get_collection_reclist('HEP')

In [9]: badfft = set()

In [10]: for r in heprecs:
    ...:     xm = decompress(run_sql('select value from bibfmt where format="xm" and id_bibrec=%s' % r)[0][0])
    ...:     if xm.find('datafield tag="FFT" ind1="%" ind2="%"') > -1:
    ...:         badfft.add(r)
    ...:         
        
In [11]: len(badfft)
Out[11]: 139

In [12]: badfft
Out[12]: 
{1090369,
 1094156,
 1115831,
 1115876,
 1116124,
 1119996,
 1120518,
 1123359,
 1123523,
 1123802,
 1124579,
 1179996,
 1184387,
 1185409,
 1186735,
 1189002,
 1191014,
 1192965,
 1198033,
 1201900,
 1202269,
 1202491,
 1203072,
 1203155,
 1203366,
 1203846,
 1203875,
 1204492,
 1204547,
 1204945,
 1206327,
 1206352,
 1206843,
 1206884,
 1207442,
 1207630,
 1207641,
 1207869,
 1208106,
 1208623,
 1208733,
 1208807,
 1208884,
 1209405,
 1209447,
 1209840,
 1209910,
 1210054,
 1210064,
 1210447,
 1210689,
 1210692,
 1211366,
 1215306,
 1215337,
 1215587,
 1215612,
 1215782,
 1216303,
 1216535,
 1216603,
 1216672,
 1216887,
 1217117,
 1217362,
 1217696,
 1217710,
 1217741,
 1217858,
 1217862,
 1217981,
 1218030,
 1218290,
 1218345,
 1218357,
 1218393,
 1218995,
 1219065,
 1219075,
 1219249,
 1219311,
 1219343,
 1219346,
 1219970,
 1220252,
 1220253,
 1221009,
 1221061,
 1221062,
 1221074,
 1222146,
 1222686,
 1222841,
 1223359,
 1223860,
 1223990,
 1224160,
 1225546,
 1226021,
 1227658,
 1230983,
 1236870,
 1239650,
 1268877,
 1268878,
 1268879,
 1268880,
 1268881,
 1268882,
 1268883,
 1268884,
 1268885,
 1268886,
 1268887,
 1268888,
 1268889,
 1268890,
 1268891,
 1268893,
 1268894,
 1268895,
 1268896,
 1268897,
 1268898,
 1268899,
 1268900,
 1268901,
 1268902,
 1268903,
 1269029,
 1269030,
 1269031,
 1269032,
 1269033,
 1269034,
 1269035,
 1269036,
 1269037,
 1269038}
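
The substring match above depends on the exact attribute order and spacing of the serialized MARCXML. A minimal sketch of the same check done with an XML parser, assuming Python 3 and un-namespaced MARCXML (`has_literal_percent_fft` and the sample record content are hypothetical, not part of Invenio):

```python
import xml.etree.ElementTree as ET


def has_literal_percent_fft(marcxml):
    """Return True if any FFT datafield has literal '%' in both indicators."""
    root = ET.fromstring(marcxml)
    for field in root.iter("datafield"):
        if (field.get("tag") == "FFT"
                and field.get("ind1") == "%"
                and field.get("ind2") == "%"):
            return True
    return False


# Hypothetical sample record; the subfield path is made up for illustration.
sample = """<record>
  <controlfield tag="001">1094156</controlfield>
  <datafield tag="FFT" ind1="%" ind2="%">
    <subfield code="a">/hypothetical/path/fulltext.xml</subfield>
  </datafield>
</record>"""

print(has_literal_percent_fft(sample))
```

If the stored XML declares the MARC21 slim namespace, the tag name passed to `iter` would need the namespace prefix, so this is only a starting point.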

tsgit commented on July 17, 2024

@fschwenn any comments?

jacquerie commented on July 17, 2024

> how many are there?

Sentry has 284 events of this kind. Since two full migrations have happened so far, this means 142 records.
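
The estimate is a plain division. A small sketch, assuming every affected record raises exactly one Sentry event per full migration (that assumption is not stated in the thread):

```python
sentry_events = 284    # events reported by Sentry
full_migrations = 2    # full migrations run so far

# Assumption: one event per affected record per full migration.
records, leftover = divmod(sentry_events, full_migrations)
print(records, leftover)  # 142 0
```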

kaplun commented on July 17, 2024

Yeah, poor bibupload treated the % literally, which is IMHO correct when it is provided in the input MARCXML. Had it been well designed, though, it would have reported an invalid-character error.
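
A minimal sketch of the kind of check meant here, not bibupload's actual code. It assumes an indicator must be a single character that is either blank (a space, or `_` as a blank placeholder) or alphanumeric; `validate_indicator` and `ALLOWED_INDICATORS` are hypothetical names:

```python
import string

# Assumption: blank (' ' or '_') or a single ASCII letter or digit is acceptable.
ALLOWED_INDICATORS = set(string.ascii_letters + string.digits + " _")


def validate_indicator(value, tag):
    """Return the indicator unchanged, or raise ValueError if it is not allowed."""
    if len(value) != 1 or value not in ALLOWED_INDICATORS:
        raise ValueError("invalid indicator %r in field %s" % (value, tag))
    return value
```

With a check like this, `validate_indicator("%", "FFT")` raises `ValueError` instead of letting the `%` through as a literal indicator.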

tsgit commented on July 17, 2024

So, to answer @jacquerie's question about what to do with them: ignore in migration.

The affected records should have had the publisher XML attached as hidden files, but because of the incorrect MARCXML passed to bibupload, it wasn't.
@fschwenn might know if the original source is still available and, if so, could attach the publisher XML correctly and remove the bad FFT%% tags from the records.
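
Removing the bad tags could look roughly like the sketch below. It works on a MARCXML string with the Python 3 standard library and is independent of the actual Invenio tooling; `strip_percent_fft` and the sample record are hypothetical, and un-namespaced MARCXML is assumed:

```python
import xml.etree.ElementTree as ET


def strip_percent_fft(marcxml):
    """Return (cleaned MARCXML string, number of removed FFT %% datafields)."""
    root = ET.fromstring(marcxml)
    removed = 0
    # ElementTree has no parent pointers, so visit every element as a parent.
    for parent in list(root.iter()):
        for field in list(parent):
            if (field.tag == "datafield"
                    and field.get("tag") == "FFT"
                    and field.get("ind1") == "%"
                    and field.get("ind2") == "%"):
                parent.remove(field)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical sample record; title and path are made up for illustration.
sample = """<record>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">Hypothetical title</subfield>
  </datafield>
  <datafield tag="FFT" ind1="%" ind2="%">
    <subfield code="a">/hypothetical/path/fulltext.xml</subfield>
  </datafield>
</record>"""

cleaned, removed = strip_percent_fft(sample)
print(removed)
```

The cleaned XML would still have to be uploaded through the normal record-correction workflow; this only shows the filtering step.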

I counted 139 affected records above, @jacquerie estimates 142 records.

jacquerie commented on July 17, 2024

> ignore in migration

👍

The small discrepancy in the numbers can be explained by the fact that Nightly currently works on an old prodsync dump from September, so stuff might have changed in the meantime.

fschwenn commented on July 17, 2024

It seems the APS harvesting code had a bug in the early days. We at DESY did not notice immediately, as the hidden files are not of interest during selection and curation anyway. The original fulltext.xml files at /afs/cern.ch/project/inspire/uploads/aps only start at 2014.06.05. I do not know why.
Concerning reharvesting for a list of DOIs, I fear the status is still that of one year ago, when Jan answered: "APS has a new API (v2) that should work better than the current one and return metadata directly, see e.g. http://harvest.aps.org/docs/harvest-api#general, but we do not have access yet (I just asked for it). Using the old/current API is a bit more tricky since the metadata (e.g. the abstract) is not so easily available - only as part of the XML in a Bagit archive. I could show you how, but I would recommend to wait and see if we can get access to the new API first."

kaplun commented on July 17, 2024

It's partly a hepcrawl issue and partly not code related, i.e. we need to investigate how to obtain access to the API.

tsgit commented on July 17, 2024

can this be closed?

meanwhile I also removed all the bad MARC FFT entries

jacquerie commented on July 17, 2024

Yeah, this can be closed.
