hisparc / datastore
Data storage solution
License: GNU General Public License v3.0
Stations 102 and 202 were using PySPARC during a period in which it generated bad timestamps.
This data is unrecoverable: the correct timestamp can never be reconstructed for most of the events (wrong units were used for the quantization error).
102: 1 Sep 2014 - 21 Sep 2015
202: 19 Dec 2014 - 21 Sep 2015
For a GPS offset test, stations 501 and 502 were triggered simultaneously using a pulse generator. This data is polluting the real cosmic-ray data. It is easily identified by the trace signals (no pulses, external trigger) and by the interval between events (250 ms). Moreover, we performed the tests ourselves, so we know the dates: from 2011/10/21 up to and including 2011/10/31.
We should move this data away from these stations and store it under test stations. Stations 94 and 95 are obvious candidates, since those are used for similar tests and started taking data after the 501-502 tests. We should ensure that 94/95 also receive configs from 501/502 to get the right GPS coordinates.
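The 250 ms interval makes this data easy to select programmatically. A minimal sketch (pure Python; the timestamps, tolerance and function names are hypothetical illustrations, not actual datastore code):

```python
# Sketch: flag events whose spacing matches the ~250 ms pulse-generator
# interval. Timestamps below are made-up example values in nanoseconds;
# a real selection would read (ext_)timestamps from the datastore.
INTERVAL_NS = 250_000_000  # 250 ms
TOLERANCE_NS = 1_000_000   # allow 1 ms of jitter (assumed)

def is_pulse_generator_pair(t1, t2):
    """Return True if two consecutive event timestamps are ~250 ms apart."""
    return abs((t2 - t1) - INTERVAL_NS) <= TOLERANCE_NS

def flag_test_events(timestamps):
    """Flag events that are part of a ~250 ms interval chain."""
    flags = [False] * len(timestamps)
    for i in range(len(timestamps) - 1):
        if is_pulse_generator_pair(timestamps[i], timestamps[i + 1]):
            flags[i] = flags[i + 1] = True
    return flags

timestamps = [0, 250_000_100, 500_000_050, 1_700_000_000]
print(flag_test_events(timestamps))  # [True, True, True, False]
```

Combined with the date range (2011/10/21 to 2011/10/31) and the flat traces this should identify the test events unambiguously.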
Docs have been added in #25, but not everything is fully documented.
Look at the docs for TODOs and things that are missing:
http://docs.hisparc.nl/datastore/
In /databases/kascade there is still about two years' worth of event data from the station at KASCADE. This should be migrated to the datastore, which already contains some data from the KASCADE station. Note that the current cluster name (karlsruhe) and station number (70001) differ from how they are stored in the unmigrated KASCADE files (kascade, 601).
The directory also contains the KASCADE experiment data (`HiSparc-new.dat.gz`); I think that at least an HDF5 version should also be stored for easier access (using `sapphire.kascade.StoreKascadeData`).
I added two new stations in the publicdb: 521 and 522.
When I converted 501 to 521 it sometimes got a 206 return code, which means "unknown station number". This error was also logged on frome, in a thread with a process ID different from the other threads running at the time, meaning it was not properly killed when `hisparc-datastore` tried to reload httpd.
I am going to update the datastore on frome to the latest master.
And after that merge the lightning branch and update again.
`config.ini` and `writer_app.py` are in the top level of the datastore repo. The `application.wsgi` is renamed to `datastore.wsgi` and is one level higher (outside the repo). The files in the top level are now ignored by git (51dc9e6).
From a `diff -r unmaintained_datastore datastore`, of note was the eventwarehouse migration log `mig_ew.log`, which I moved to /var/log/hisparc/.
The update procedure:

```
git remote set-url origin https://github.com/HiSPARC/datastore.git
git fetch
service httpd stop
```

Stop the writer by attaching to the corresponding screen and interrupting it (Ctrl+C). Then:

```
git reset --hard origin/master
service httpd start
```

Start the writer again in the screen using:

```
sudo -u www python /var/www/wsgi-bin/datastore/writer_app.py
```

For the lightning branch, repeat with:

```
git reset --hard origin/lightning
```
On 2012-05-16 station 502 transitioned to HiSPARC III electronics.
The baseline was automatically calibrated to 30 ADC by the DAQ.
It was soon found that the thresholds used in the DAQ were incorrect:
the threshold conversion from mV to ADC assumed a baseline of 200 ADC.
On 2012-06-08 the thresholds were changed to be close to the normal trigger levels.
The thresholds are not properly accounted for in the ESD.
In the period mentioned in the title about 70% of the trigger time reconstructions failed.
I'm uncertain whether this data was reprocessed after updating SAPPHiRE to properly take different trigger settings into account.
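To illustrate the baseline mistake, a hypothetical sketch of a mV-to-ADC threshold conversion (`MV_PER_ADC` is an assumed example factor, not the documented hardware value, and this is not the actual DAQ code):

```python
# Illustrative sketch: converting a threshold in mV to an absolute ADC
# count depends on which baseline is added. MV_PER_ADC is an assumed
# example conversion factor.
MV_PER_ADC = 0.57

def threshold_adc(threshold_mv, baseline_adc):
    """Convert a threshold in mV to an absolute ADC count above baseline."""
    return baseline_adc + round(threshold_mv / MV_PER_ADC)

# The conversion assumed a 200 ADC baseline while HiSPARC III calibrates
# the baseline to 30 ADC, so every converted threshold ended up
# 200 - 30 = 170 ADC counts too high relative to the actual baseline.
wrong = threshold_adc(30, 200)
right = threshold_adc(30, 30)
print(wrong - right)  # 170
```

This is why the effective thresholds were far above the intended trigger levels until the 2012-06-08 change.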
```
$ h5ls /databases/frome/2017/4/2017_4_15.h5/hisparc/cluster_amsterdam/station_4
blobs Dataset {799871/Inf}
config Dataset {248013/Inf}
…
$ h5ls /databases/frome/2017/4/2017_4_12.h5/hisparc/cluster_amsterdam/station_91
blobs Dataset {399080/Inf}
config Dataset {121744/Inf}
…
```
The public database seems to crash while processing these configs (hisparc-update), or while trying to render config data (uwsgi). Those configs should be removed (keeping maybe one: the first and/or last?).
The blobs could also be updated, but they do not seem to take a lot of extra space. Removing the config blob data entries would mean that other blob indexes need to be updated as well, so it is easier to simply leave that data.
Given the occurrences of bad data (HiSPARC/publicdb#63), some of which I suspect are due to data corruption, while others may just be due to migration issues:
We could activate the fletcher32 option in PyTables; this means that checksums will be computed for all data, which will help ensure data integrity.
Additionally, we could enable Blosc data compression to possibly save some space and speed up data readout (making I/O less of a bottleneck).
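A minimal sketch of enabling both options via a PyTables `Filters` object (`complevel=5` is an arbitrary middle-of-the-road example, and the node name is made up):

```python
import tables

# Filters enabling Fletcher-32 checksums (data integrity) and Blosc
# compression (smaller files, often faster I/O).
filters = tables.Filters(complevel=5, complib='blosc', fletcher32=True)

# In-memory HDF5 file so this sketch leaves nothing on disk.
with tables.open_file('demo.h5', 'w', driver='H5FD_CORE',
                      driver_core_backing_store=0) as f:
    arr = f.create_earray(f.root, 'events', tables.Int32Atom(),
                          shape=(0,), filters=filters)
    arr.append([1, 2, 3])
    print(arr.filters.fletcher32, arr.filters.complib)
```

Note that filters only apply to newly created nodes, so existing raw data files would need to be rewritten (e.g. with `ptrepack`) to benefit.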
The day you knew would come...
Apr 7, 2019 is the second GPS WNRO (week number rollover). On this day the 10-bit GPS week number overflowed from 1023 back to 0. GPS week 0 started on Jan 6, 1980. This was the second rollover; the full GPS week number of Apr 7, 2019 is 2048.
Our Trimble Resolution T GPS clocks use the start of week 2048 (Apr 7, 2019 0:00) as the default time after a GPS cold start (no signal acquired). The DAQ will happily send events to the datastore even when no GPS signal has been acquired. Events dated Apr 7, 2019 are thus very suspicious. Now that Apr 7, 2019 has passed, we have verified that the GPS clocks still use Apr 7, 2019 as the default time after a cold start.
Apr 7, 2019 used to be in the future... Not anymore.
Soon we must mark events from Apr 7, 2019 as suspicious and not import them into the raw datastore, because they were most probably caused by a missing/bad GPS signal.
(Last week I already moved the "old" Apr 2019 suspicious data in the raw datastore.)
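The rollover arithmetic above can be checked with a few lines of stdlib Python:

```python
from datetime import date, timedelta

GPS_EPOCH = date(1980, 1, 6)  # start of GPS week 0

def gps_week_start(week):
    """Date on which the given full (non-truncated) GPS week begins."""
    return GPS_EPOCH + timedelta(weeks=week)

# Week 2048 is the second 10-bit rollover: 2048 % 1024 == 0, so a
# receiver that keeps only 10 week bits wraps back to week 0 here.
print(gps_week_start(2048))  # 2019-04-07
```

This confirms why a cold-started Resolution T defaulting to the start of week 2048 stamps events with Apr 7, 2019.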
On 28 April 2011 station 3201 started using HiSPARC electronics previously used by 3202.
The wrong GPS position was submitted by 3201 (it actually belonged to 3202).
Later the correct position was sent in a config, but the wrong one was not removed from the datastore.
Today we discovered the writer had been erroneously running as root lately, thus creating raw datastore HDF5 files owned by `root:root`. A few days ago frome was physically moved to a new location and the server was restarted. The writer was restarted as user `www` (as specified in the docs).
The writer running as user `www` could not write to the raw datastore. All data was dropped. From `/var/log/hisparc/hisparc-log.writer`:
```
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)
2018-09-24 00:00:05,473 writer.store_events[4758].store_event_list.ERROR: Cannot process event, discarding event (station: 8006)
```
The code that generates this error:
https://github.com/HiSPARC/datastore/blob/master/writer/store_events.py#L127-L148
When `store_events.store_event_list` is unsuccessful, we still remove the incoming pickled data from the `partial` folder!
Solution: only remove the pickle if `process_data` is successful: https://github.com/HiSPARC/datastore/blob/master/writer/writer_app.py#L73
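A minimal sketch of the proposed fix (hypothetical function names and data; this is not the actual writer_app code):

```python
import os
import pickle
import tempfile

def process_data(data):
    """Stand-in for the writer's processing step; the real one stores events."""
    return data.get('station_id') is not None

def handle_pickle(path):
    """Delete the incoming pickle only when processing succeeded,
    so failed uploads stay in the partial folder for a later retry."""
    with open(path, 'rb') as f:
        data = pickle.load(f)
    if process_data(data):
        os.remove(path)
        return True
    return False  # pickle is kept for inspection/retry

# Demo with a throwaway pickle file:
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    pickle.dump({'station_id': 8006}, f)
print(handle_pickle(path), os.path.exists(path))  # True False
```

With this structure a permission error (like the `www`/root mixup above) would leave the pickles in place instead of silently dropping a day of data.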
Data cannot be processed for the ESD.
HDF5 errors occur when reading its data from the raw datastore.
Perhaps all data for that station needs to be removed.
Errors for stations 10, 303, 504.
Current env:
Python: 2.6.4 -> Latest version 2.7.11
PyTables: 2.1.2 -> Latest version 3.2.2
Current -> minimum required for PyTables 3.1.1:
NumPy: 1.6.2 -> at least 1.7.1
Numexpr: None -> at least 2.4 (Do not use 2.4.6 or 2.5)
Cython: None -> at least 0.21
HDF5: 1.8.3 -> at least 1.8.7 (1.8.4 "HDF5 versions < 1.8.7 are supported with some limitations...")
(This is a proposal, not an issue)
Currently there is no real way to test the datastore (for example when commissioning a new frome) before it is put "live".
Risk:
We need a semi-automated test that we can run in a VM, but we cannot spend days and days creating it.
Proposal for a test:
What do I want to test:
How to create test data (for example using `sapphire.tests`).
It would be best to use a test station for all of this, but we lack one; keeping in mind that we cannot afford to spend much time on this, we can use one or more SPA stations instead.
Python 2 will be end-of-life in a few years.
The datastore should be transitioned to Python 3.
The datastore does not need to remain backwards compatible with Python 2.
See discussion at: HiSPARC/sapphire#154 (comment)
This might not be a datastore issue but perhaps a station software issue; since it affects raw data, however, I'm reporting it here.
Singles counts for 2-detector stations (no slave) are apparently reported as zero (`0`) instead of `-1` or `-999` for the missing detectors.
IMHO, this is wrong.
Sample TSV downloaded from my publicdb test VM:

```
# Event Summary Data - Singles
#
# Station: (203) College Hageveld
#
(...)
#
2017-01-02 00:00:00 1483315200 162 90 122 67 0 0 0 0
2017-01-02 00:00:01 1483315201 139 80 151 98 0 0 0 0
2017-01-02 00:00:02 1483315202 145 86 164 104 0 0 0 0
```
I can fix this at import into the ESD by simply setting the missing channels to -999, but that still leaves the raw data ambiguous.
Then again, for singles data we might be safe in assuming 0 means "not connected" or "missing detector".
Possible solutions:
Any thoughts?
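For reference, the ESD-import workaround could be sketched like this (the column order is inferred from the TSV sample above and the function name is hypothetical):

```python
# Sketch: for a 2-detector station, overwrite the singles columns of the
# missing detectors with -999. Assumed column layout per row:
# [low1, high1, low2, high2, low3, high3, low4, high4]
MISSING = -999

def fix_singles_row(row, n_detectors):
    """Replace singles values of non-existent detectors with -999."""
    fixed = list(row)
    for det in range(n_detectors, 4):
        fixed[2 * det] = MISSING      # low-threshold rate
        fixed[2 * det + 1] = MISSING  # high-threshold rate
    return fixed

print(fix_singles_row([162, 90, 122, 67, 0, 0, 0, 0], 2))
# [162, 90, 122, 67, -999, -999, -999, -999]
```

This only disambiguates the ESD; the raw data would still need a convention of its own.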
New Nikhef station has test data for the period 19-22 May 2018
http://data.hisparc.nl/show/stations/514/2018/5/22/
The GPS location is probably also not final.
frome now has the log level set to DEBUG for the writer and uwsgi processes. With the new PySPARC stations, which upload every event individually, this causes a lot of extra log messages. We should set the log level to INFO. Alternatively, instead of reducing the log level, we could reduce the number of logs kept by the TimedRotatingFileHandler.
Recently frome was having issues accepting new events due to a full partition, caused by the large logs.
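A sketch of both options using the stdlib `logging` module (the path and `backupCount` are illustrative choices, not the values currently configured on frome):

```python
import logging
from logging.handlers import TimedRotatingFileHandler

# Rotate at midnight and keep only 14 old log files; raise level to INFO
# so per-event DEBUG messages from PySPARC stations are dropped.
handler = TimedRotatingFileHandler('/tmp/hisparc-log.writer',
                                   when='midnight', backupCount=14)
handler.setLevel(logging.INFO)

logger = logging.getLogger('writer')
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.debug('dropped at INFO level')  # filtered out
logger.info('kept')                    # written to the log
```

Either knob alone would have prevented the full-partition incident; together they bound both log volume and retention.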
It would be very useful to be able to set up the datastore server with Vagrant.
That would help in case of a server crash or upgrade.
It would also allow us to easily test improvements to how it runs, such as adding supervisor to keep certain processes running.
frome is not a very complex server, so it should be relatively easy to set up (compared to publicdb).
Some notes on how it was originally set up can be found here: http://docs.hisparc.nl/servers/frome.html
Station 1102 was tested at Nikhef before deployment.
The test data was unfortunately uploaded to the actual station.
Most importantly, the GPS location of Nikhef is now attached to the station.
This needs to be removed/migrated to a test station.
Example test data: http://data.hisparc.nl/show/stations/1102/2016/5/23/
Once the station is properly installed, the date for real/good data will be established and the test data can easily be removed.
Are these issues still relevant?
We could use some logging in https://github.com/HiSPARC/datastore/blob/master/examples/datastore_config_server.py
(this is just an example; the real one lives in the publicdb/provisioning repo).