Comments (13)
The original issue is fixed by inspirehep/inspire-next#1984. The new issue that @fschwenn brought up should probably go in its own issue in ...hepcrawl?
from inspire.
Forgot to cc @kaplun, but maybe @tsgit knows the answer too.
from inspire.
Whoa, that's a bad bibupload from 2014:
1094156.20140106152404 375595 hilu mode:replace file:/opt/cds-invenio/var/tmp-shared/bibedit-cache/bibedit_record_eV_TNm_1094156_32.xml
It is not limited to the xme format but is also publicly visible in the xm format, and those are supposed to be hidden files.
how many are there?
@jacquerie @kaplun
from inspire.
hilu is a DESY cataloguer -- but not a developer, so something in the toolchain must have been wrong at the time
In [7]: from invenio.search_engine import get_collection_reclist
In [8]: heprecs = get_collection_reclist('HEP')
In [9]: badfft = set()
In [10]: for r in heprecs:
    ...:     xm = decompress(run_sql('select value from bibfmt where format="xm" and id_bibrec=%s' % r)[0][0])
    ...:     if xm.find('datafield tag="FFT" ind1="%" ind2="%"') > -1:
    ...:         badfft.add(r)
    ...:
In [11]: len(badfft)
Out[11]: 139
In [12]: badfft
Out[12]:
{1090369,
1094156,
1115831,
1115876,
1116124,
1119996,
1120518,
1123359,
1123523,
1123802,
1124579,
1179996,
1184387,
1185409,
1186735,
1189002,
1191014,
1192965,
1198033,
1201900,
1202269,
1202491,
1203072,
1203155,
1203366,
1203846,
1203875,
1204492,
1204547,
1204945,
1206327,
1206352,
1206843,
1206884,
1207442,
1207630,
1207641,
1207869,
1208106,
1208623,
1208733,
1208807,
1208884,
1209405,
1209447,
1209840,
1209910,
1210054,
1210064,
1210447,
1210689,
1210692,
1211366,
1215306,
1215337,
1215587,
1215612,
1215782,
1216303,
1216535,
1216603,
1216672,
1216887,
1217117,
1217362,
1217696,
1217710,
1217741,
1217858,
1217862,
1217981,
1218030,
1218290,
1218345,
1218357,
1218393,
1218995,
1219065,
1219075,
1219249,
1219311,
1219343,
1219346,
1219970,
1220252,
1220253,
1221009,
1221061,
1221062,
1221074,
1222146,
1222686,
1222841,
1223359,
1223860,
1223990,
1224160,
1225546,
1226021,
1227658,
1230983,
1236870,
1239650,
1268877,
1268878,
1268879,
1268880,
1268881,
1268882,
1268883,
1268884,
1268885,
1268886,
1268887,
1268888,
1268889,
1268890,
1268891,
1268893,
1268894,
1268895,
1268896,
1268897,
1268898,
1268899,
1268900,
1268901,
1268902,
1268903,
1269029,
1269030,
1269031,
1269032,
1269033,
1269034,
1269035,
1269036,
1269037,
1269038}
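For anyone who wants to repeat the check on exported MARCXML rather than on the bibfmt table, here is a minimal sketch that parses the XML instead of doing a substring search. The function name and the sample records (including the subfield content) are made up for illustration and are not taken from the affected records.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARC21slim documents; plain, un-namespaced MARCXML is handled too.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def has_literal_percent_fft(marcxml):
    """Return True if the record has an FFT datafield with ind1="%" and ind2="%"."""
    root = ET.fromstring(marcxml)
    for field in root.iter():
        # Compare local names so namespaced and plain MARCXML both match.
        if field.tag.replace(MARC_NS, "") != "datafield":
            continue
        key = (field.get("tag"), field.get("ind1"), field.get("ind2"))
        if key == ("FFT", "%", "%"):
            return True
    return False


# Hypothetical sample records, for illustration only.
SAMPLE_BAD = """<record>
  <controlfield tag="001">1094156</controlfield>
  <datafield tag="FFT" ind1="%" ind2="%">
    <subfield code="a">/hypothetical/path/fulltext.xml</subfield>
  </datafield>
</record>"""

SAMPLE_OK = """<record>
  <controlfield tag="001">1</controlfield>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">A title</subfield>
  </datafield>
</record>"""

print(has_literal_percent_fft(SAMPLE_BAD))
print(has_literal_percent_fft(SAMPLE_OK))
```

Parsing is slower than `xm.find(...)` but does not depend on the exact attribute order or spacing in the serialized XML.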
from inspire.
@fschwenn any comments?
from inspire.
> how many are there?
Sentry has 284 events of this kind. Since two full migrations have happened so far, this means 142 records.
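The estimate is just a division, spelled out here as a small sketch. It assumes every affected record raises exactly one event per full migration, which is an assumption and not something Sentry guarantees; the function name is made up.

```python
def affected_records(total_events, full_migrations):
    """Estimate distinct records, assuming one event per record per migration."""
    records, remainder = divmod(total_events, full_migrations)
    if remainder:
        # An uneven split means the one-event-per-migration assumption is off.
        raise ValueError("events do not divide evenly across migrations")
    return records


estimate = affected_records(284, 2)
print(estimate)
```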
from inspire.
Yeah, poor bibupload treated the % literally, which is IMHO correct when this is provided in the input MARCXML. If it were well designed, though, it should have raised an invalid-character error.
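To illustrate the kind of check meant here, a minimal sketch of an indicator validator follows. It is not bibupload code. It assumes un-namespaced MARCXML and assumes that a valid indicator is empty or a single digit, lowercase letter, blank, or underscore; `InvalidIndicatorError` and `check_indicators` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an indicator is empty, or one digit, lowercase letter, blank, or "_".
_VALID_INDICATOR = re.compile(r"[0-9a-z _]?")


class InvalidIndicatorError(ValueError):
    """Raised when a datafield carries an indicator outside the allowed set."""


def check_indicators(marcxml):
    """Raise InvalidIndicatorError on the first datafield with a bad indicator."""
    root = ET.fromstring(marcxml)
    for field in root.iter("datafield"):
        for name in ("ind1", "ind2"):
            value = field.get(name, " ")
            if not _VALID_INDICATOR.fullmatch(value):
                raise InvalidIndicatorError(
                    "datafield %s has invalid %s=%r" % (field.get("tag"), name, value)
                )


# Hypothetical input resembling the bad upload.
bad = (
    '<record><datafield tag="FFT" ind1="%" ind2="%">'
    '<subfield code="a">x</subfield></datafield></record>'
)
try:
    check_indicators(bad)
except InvalidIndicatorError as exc:
    print("rejected:", exc)
```

Rejecting the upload at this stage would have surfaced the harvesting bug in 2014 instead of storing `FFT%%` as a regular field.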
from inspire.
So, to answer @jacquerie's question about what to do with them: ignore them in the migration.
The affected records should have had the publisher XML attached as hidden files, but because of the incorrect MARCXML bibupload it was not attached.
@fschwenn might know whether the original source is still available; if so, the publisher XML can be attached correctly and the bad FFT%% tags removed from the records.
I counted 139 affected records above; @jacquerie estimates 142 records.
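As a rough sketch of the cleanup step, the snippet below drops the bad `FFT%%` datafields from an exported record. It assumes un-namespaced MARCXML with a `<record>` root whose datafields are direct children; `strip_bad_fft` and the sample record are made up for illustration. On a real installation the change would presumably go through the usual record-editing tools rather than ad hoc XML rewriting.

```python
import xml.etree.ElementTree as ET


def strip_bad_fft(marcxml):
    """Return (cleaned MARCXML, number of removed FFT%% datafields)."""
    record = ET.fromstring(marcxml)
    bad = [
        field
        for field in record.findall("datafield")
        if (field.get("tag"), field.get("ind1"), field.get("ind2")) == ("FFT", "%", "%")
    ]
    for field in bad:
        record.remove(field)
    return ET.tostring(record, encoding="unicode"), len(bad)


# Hypothetical record with one good field and one bad FFT%% field.
sample = (
    "<record>"
    '<datafield tag="245" ind1=" " ind2=" "><subfield code="a">A title</subfield></datafield>'
    '<datafield tag="FFT" ind1="%" ind2="%"><subfield code="a">x.xml</subfield></datafield>'
    "</record>"
)
cleaned, removed = strip_bad_fft(sample)
print(removed)
print(cleaned)
```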
from inspire.
> ignore in migration
👍
The small discrepancy in the numbers can be explained by the fact that Nightly currently works on an old prodsync dump from September, so things might have changed in the meantime.
from inspire.
It seems the APS harvesting code had a bug in its early days. We at DESY did not notice immediately because the hidden files are not of interest during selection and curation anyway. The original fulltext.xml files at /afs/cern.ch/project/inspire/uploads/aps start at 2014.06.05. I do not know why.
Concerning reharvesting for a list of DOIs, I fear the status is still the same as one year ago, when Jan answered: "APS has a new API (v2) that should work better than the current one and return metadata directly, see e.g. http://harvest.aps.org/docs/harvest-api#general, but we do not have access yet (I just asked for it). Using the old/current API is a bit more tricky since the metadata (e.g. the abstract) is not so easily available - only as part of the XML in a Bagit archive. I could show you how, but I would recommend to wait and see if we can get access to the new API first."
from inspire.
It's partly a hepcrawl issue and partly not code-related, i.e. we need to investigate how to obtain access to the API.
from inspire.
Can this be closed?
Meanwhile, I also removed all the bad MARC FFT entries.
from inspire.
Yeah, this can be closed.
from inspire.