
OSEkit's Introduction




OSEkit is an open source suite of tools written in Python and dedicated to the management and analysis of data in underwater passive acoustics.


Presentation

Among other key features, our toolkit has been adapted to be deployed on a cluster infrastructure; in OSmOSE, our production version runs on DATARMOR. Here are a few pointers to help you find your way through our documentation.


Getting into it

All the details to start using our toolkit and make the most of it are given in our documentation; you will then be redirected to more specific parts of it depending on your profile and intentions :)


Acknowledgements

  • A great part of our processing code is based on well-established Python packages for data analysis, including scipy, pandas, and numpy.

© OSmOSE team, 2023-present

OSEkit's People

Contributors

cazaudo, elodieensta, gabrieldubus, gmanatole, ixio, maellettrt, mathieudpnt, paulcarvaillo, pcarvaillo, pylrr, qouagga, rumengol


OSEkit's Issues

Compute the expected RAM for reshaping task

Idea: use the size of the largest available audio file multiplied by the number of files to be processed, with an upper bound. It is the most straightforward method, but more accurate and complex ones exist.
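
A minimal sketch of that heuristic, assuming the batch is given as a list of file paths; the function name and the 16 GiB cap are illustrative, not part of the package:

```python
from pathlib import Path


def estimate_reshape_ram(audio_files: list[Path], upper_bound: int = 16 * 2**30) -> int:
    """Estimate the RAM (in bytes) a reshaping batch may need.

    Heuristic from this issue: size of the largest audio file times the
    number of files to process, capped by an upper bound (16 GiB here).
    """
    largest = max(f.stat().st_size for f in audio_files)
    return min(largest * len(audio_files), upper_bound)
```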

Lighten the dependency relations between jobs

Currently, between the resampling, audio normalization, reshaping, and spectrogram generation steps, each job depends on the execution of every job of the previous step. This puts stress on the cluster, and while DATARMOR may handle it (especially when there are not many concurrent jobs), other clusters might not.

Eventually, it would be good to have 1:1 dependencies between the jobs.
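
A hedged sketch of how 1:1 dependencies could be expressed with PBS's depend attribute (the function is hypothetical; on Slurm, sbatch --dependency=afterok:<id> plays the same role):

```python
import subprocess


def submit_with_dependency(script: str, parent_job_id: str | None = None) -> str:
    """Submit a PBS job that starts only after its single parent succeeds."""
    cmd = ["qsub"]
    if parent_job_id is not None:
        cmd += ["-W", f"depend=afterok:{parent_job_id}"]
    cmd.append(script)
    # qsub prints the id of the submitted job on stdout
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
```

Chaining each reshaping job on the id returned for its matching resampling job gives the 1:1 pattern instead of the current all-to-all one.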

timestamps of reshaping 'classic' are not right

It seems that continuous recordings are assumed, forgetting about the duty cycle.

Bug observed on dataset test_sample_DC when reshaping 30 s files to 15 s; there is a duty cycle: 30 s of recording every 5 minutes. The last timestamp after reshaping is then 2001-12-21T17-14-45_000.wav, while it should match the original ones: 20211201_201000.wav.
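
A minimal sketch of duty-cycle-aware timestamping, assuming the start times of the original files are known; chunks are generated inside each original file instead of on a continuous timeline:

```python
from datetime import datetime, timedelta


def reshaped_timestamps(
    original_starts: list[datetime], original_duration: float, chunk_size: float
) -> list[datetime]:
    """Start times of reshaped chunks when recordings are duty-cycled."""
    timestamps = []
    for start in original_starts:
        offset = 0.0
        while offset < original_duration:  # stay inside the original file
            timestamps.append(start + timedelta(seconds=offset))
            offset += chunk_size
    return timestamps
```

With 30 s files recorded every 5 minutes and a 15 s chunk size, each file yields two chunks, and the last timestamp stays aligned with the last original file, as expected.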

Add CLI

While notebooks are an easy and inherently reproducible way to use OSmOSE, they may not be suited to every use case. We can already import any function like a regular OSmOSE module, but a complete CLI would cover more needs, and ease the quick execution of some features.
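
A sketch of what such an entry point could look like; subcommands and options here are illustrative, not the package's actual API:

```python
import argparse


def main() -> None:
    """Hypothetical `osekit` command-line entry point."""
    parser = argparse.ArgumentParser(
        prog="osekit", description="OSmOSE underwater passive acoustics toolkit"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    build = sub.add_parser("build", help="build a dataset from raw audio")
    build.add_argument("dataset_path")

    spectro = sub.add_parser("spectrogram", help="generate spectrograms")
    spectro.add_argument("dataset_path")
    spectro.add_argument("--sr", type=int, default=48000, help="analysis sample rate")

    args = parser.parse_args()
    print(f"would run {args.command} on {args.dataset_path}")


if __name__ == "__main__":
    main()
```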

Add an energy detector

To adjust the spectrogram parameters in the best way possible, an energy detector able to flag spectrograms with supposed acoustic activity would be a great improvement.
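
A minimal first-pass sketch of such a detector, thresholding per-bin energy against the median (the parameters and function name are illustrative):

```python
import numpy as np
from scipy import signal


def energy_activity(audio: np.ndarray, sr: int, threshold_db: float = 6.0) -> np.ndarray:
    """Flag time bins whose energy exceeds the median by `threshold_db`.

    Active bins hint at where acoustic activity lies, which can then
    guide the choice of spectrogram parameters.
    """
    _, _, sxx = signal.spectrogram(audio, fs=sr, nperseg=1024)
    energy_db = 10 * np.log10(sxx.sum(axis=0) + 1e-12)  # per-bin energy in dB
    return energy_db > np.median(energy_db) + threshold_db
```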

saved matrices not indexed correctly

Currently, only *_4_3.npz files are saved; it should be *_4_0.npz, *_4_1.npz, *_4_2.npz, and so on, as the listing of the output directory below shows:

/home/datawork-osmose/dataset/MPSU_ForestouHuella/processed/spectrogram/30_48000/1024_1024_90/matrix/
20211201_235000_4_3.npz 20211204_094000_4_3.npz 20211208_014500_4_3.npz
20211202_142500_4_3.npz 20211204_131500_4_3.npz 20211208_172000_4_3.npz
20211202_192000_4_3.npz 20211204_184000_4_3.npz 20211209_061500_4_3.npz
20211203_010500_4_3.npz 20211204_204000_4_3.npz 20211210_062500_4_3.npz
20211203_064500_4_3.npz 20211204_223500_4_3.npz 20220110_012000_4_3.npz
20211203_110000_4_3.npz 20211205_101500_4_3.npz 20220110_151500_4_3.npz
20211203_114500_4_3.npz 20211206_072500_4_3.npz 20220110_182000_4_3.npz
20211203_142000_4_3.npz 20211206_102500_4_3.npz 20220110_191500_4_3.npz
20211203_180500_4_3.npz 20211206_124500_4_3.npz
20211203_203500_4_3.npz 20211207_054500_4_3.npz
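
A hedged sketch of the expected indexing, assuming the matrices of one audio file arrive as a list and the first suffix (4 here) is the tile count at that zoom level; the names are hypothetical:

```python
import numpy as np


def save_matrices(matrices: list[np.ndarray], base_name: str, n_tiles: int) -> None:
    """Save every matrix under its own index instead of only the last one.

    Writes `<base>_<n_tiles>_0.npz`, `<base>_<n_tiles>_1.npz`, ... rather
    than repeatedly overwriting `<base>_<n_tiles>_3.npz`.
    """
    for i, matrix in enumerate(matrices):
        np.savez(f"{base_name}_{n_tiles}_{i}.npz", matrix=matrix)
```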

metadata description at dataset uploading

After reading the entire dataset, what metadata do we save, and how? Add cells to the dataset_upload notebook to visualize summary statistics of the metadata (e.g. precise timelines of recordings).

Create an audio_metadata.csv file with the following columns (ideally, get back to the Tethys database to normalize field names); a sketch of the extraction follows the list:

  • filename
  • timestamp
  • status (read_header failed or succeeded)
  • format
  • duration
  • sample rate
  • duty cycle (timedelta between the current timestamp and the previous one)
  • volume
  • sampwidth
  • subtype (e.g. PCM16, see rumengol)
  • number of channels (stereo or mono)
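
A sketch of the extraction with soundfile and pandas; the timestamp, duty cycle, and volume columns are left to the caller, since they do not come from the file headers alone:

```python
from pathlib import Path

import pandas as pd
import soundfile as sf


def build_audio_metadata(audio_dir: Path) -> pd.DataFrame:
    """Collect per-file header metadata into an audio_metadata.csv."""
    rows = []
    for wav in sorted(audio_dir.glob("*.wav")):
        try:
            info = sf.info(wav)
            rows.append(
                {
                    "filename": wav.name,
                    "status": "read_header succeeded",
                    "format": info.format,
                    "duration": info.duration,
                    "sample_rate": info.samplerate,
                    "subtype": info.subtype,  # e.g. PCM_16
                    "channel_count": info.channels,
                }
            )
        except RuntimeError:
            rows.append({"filename": wav.name, "status": "read_header failed"})
    df = pd.DataFrame(rows)
    df.to_csv(audio_dir / "audio_metadata.csv", index=False)
    return df
```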

bug in bash getopts

On DATARMOR at least, we get the following error with the legacy reshaping:

File "/home/datawork-osmose/osmose_package_dev/src/OSmOSE/cluster/reshaper.sh", line 5
d)
^
SyntaxError: unmatched ')'
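
The traceback is Python's parser choking on a shell script: a SyntaxError on a `)` from a getopts case means reshaper.sh is being handed to the Python interpreter instead of bash. A hedged sketch of the likely fix, invoking the script explicitly through bash (the -d flag is inferred from the `d)` case shown in the error):

```python
import subprocess

# Run the reshaper script through bash explicitly, so it is never
# parsed by the Python interpreter (which trips over the getopts cases).
subprocess.run(
    ["bash", "/home/datawork-osmose/osmose_package_dev/src/OSmOSE/cluster/reshaper.sh",
     "-d", "/path/to/dataset"],
    check=True,
)
```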

Reduce pure python latency

This is especially true on the first call to the package: some dependencies take a long time to load. Importing the OSmOSE package and using basic functions should not have a visible impact on the duration of code execution.
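
One common way to get there is to defer the heavy imports to call time; a sketch with a hypothetical function:

```python
def compute_welch(audio, sr: int):
    """Compute a Welch periodogram with the heavy import deferred.

    Moving scipy (and similar) imports inside the functions that need
    them keeps `import OSmOSE` itself fast.
    """
    from scipy import signal  # loaded on first call, not at package import

    return signal.welch(audio, fs=sr)
```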

Delete job in Job_builder()

Currently, the regular way to cancel a job is through qdel (or scancel). However, doing so prevents the removal of the pbs (or sh) job file and confuses the Job builder. A delete_job(jobname) function should launch the qdel/scancel command and remove the job from its ongoing_jobs list to handle things in a cleaner way.
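
A sketch of what that could look like, assuming ongoing_jobs stores dicts with "id" and "file" keys (the real internals of Job_builder may differ):

```python
import subprocess
from pathlib import Path


class Job_builder:
    def __init__(self) -> None:
        # assumed shape: [{"id": "1234.datarmor0", "file": "job.pbs"}, ...]
        self.ongoing_jobs: list[dict] = []

    def delete_job(self, job_id: str) -> None:
        """Cancel a job cleanly: qdel it, remove its job file, and drop
        it from ongoing_jobs (use scancel instead of qdel on Slurm)."""
        subprocess.run(["qdel", job_id], check=True)
        for job in list(self.ongoing_jobs):
            if job["id"] == job_id:
                Path(job["file"]).unlink(missing_ok=True)
                self.ongoing_jobs.remove(job)
```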

forcing the initialization of the origin dataset?

I am not sure we should be able to set Force = True for the origin dataset; I have just done it by mistake, and two resample and reshape jobs were launched (I do not know what they are doing anyway). I think we should just block the execution and warn the user to set Force = False.

last update with os.setegid(gid) causing a bug on DATARMOR, see error below


PermissionError Traceback (most recent call last)
Cell In[2], line 7
4 local_execution = False
5 date_template = "" # strftime format, used to build the dataset from scratch (ignore if the dataset is already built)
----> 7 dataset = Spectrogram(dataset_path =Path(path_osmose_dataset, dataset_name), sr_analysis=sr_analysis, owner_group="gosmose", local=local_execution)
9 print(dataset)

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Spectrogram.py:91, in Spectrogram.init(self, dataset_path, sr_analysis, gps_coordinates, owner_group, analysis_params, batch_number, local)
30 def init(
31 self,
32 dataset_path: str,
(...)
39 local: bool = False,
40 ) -> None:
41 """Instanciates a spectrogram object.
42
43 The characteristics of the dataset are essential to input for the generation of the spectrograms. There is three ways to input them:
(...)
89 alone. The default is False.
90 """
---> 91 super().init(
92 dataset_path=dataset_path,
93 gps_coordinates=gps_coordinates,
94 owner_group=owner_group,
95 )
97 self.__local = local
99 processed_path = self.path.joinpath(OSMOSE_PATH.spectrogram)

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Dataset.py:67, in Dataset.init(self, dataset_path, gps_coordinates, owner_group, original_folder)
65 self.__path = Path(dataset_path)
66 self.__name = self.__path.stem
---> 67 self.owner_group = owner_group
68 self.__gps_coordinates = []
69 if gps_coordinates is not None:

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Dataset.py:186, in Dataset.owner_group(self, value)
184 try:
185 gid = grp.getgrnam(value).gr_gid
--> 186 os.setegid(gid)
187 except KeyError as e:
188 raise KeyError(
189 f"The group {value} does not exist on the system. Full error trace: {e}"
190 )

PermissionError: [Errno 1] Operation not permitted

Adding auxiliary_variables class in Features directory

Create a class that loads and/or computes auxiliary data for a dataset.
The class becomes an option of other features, e.g. osmose.features.spectro(..., aux=True).
It takes the same timestamps as the created spectrograms/SPL/Welch outputs.

reshape should be made dependent on resample

Currently there is a bug when you do resample + reshape, because the jobs run simultaneously. If I understand the parameters of reshape properly, it reshapes the resampled wav files moved to the output directory, so the reshape job should be made dependent on the resample job.

Parameters : Namespace(input_files='/home/datawork-osmose/dataset/test_sample_DC/data/audio/10_2150', chunk_size=10, output_dir='/home/datawork-osmose/dataset/test_sample_DC/data/audio/10_2150', batch_ind_min=0, batch_ind_max=39, offset_beginning=0, offset_end=0, max_delta_interval=5, verbose=True, overwrite=True, force=True, last_file_behavior='pad')

Reorganize the log folder

Currently, all job files and logs are saved in the same folder, which makes things messy when multiple analyses are conducted on the same dataset over time.

The new organization will put the log and job files in a subdirectory named after the day the analysis is started, in the YYYY_MM_DD format.

Do not regenerate existing spectrogram

When spectrograms have been generated for a dataset and another set of spectrograms with the exact same parameters is requested, it should raise a warning and stop the execution.
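
A minimal sketch of such a guard, assuming the generation parameters of a previous run are saved next to its output (the parameters.json name is hypothetical):

```python
import json
from pathlib import Path


def check_existing_spectrograms(output_dir: Path, params: dict) -> None:
    """Refuse to regenerate spectrograms with identical parameters."""
    params_file = output_dir / "parameters.json"
    if params_file.exists() and json.loads(params_file.read_text()) == params:
        raise RuntimeError(
            f"Spectrograms with these exact parameters already exist in {output_dir}."
        )
```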

anomaly tests, message printing, and "block or not" uploading

  1. Tests for strong anomalies (= nothing can be done with the files), which block the upload, with no force_upload possible:
  • malformed header / unreadable or corrupted file
  • unsupported extension
    → to develop: print(strong anomaly, no upload possible) + store the offending files and propose to delete them automatically (see the delete_abnormal_files function)

  Print that the upload is blocked.

  2. Tests for weak anomalies, with force upload possible (a sketch of these checks follows this list):
  • as soon as len(np.unique) > 1 on duration (rounded to the second), sample_rate, or inter_duration
  • the check_n_files threshold is exceeded

  Print that the upload is blocked.

  3. If there is no anomaly, just print summary metadata (average duration, sample rate) and say that everything is OK.
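
A sketch of the weak-anomaly checks, assuming a metadata table with duration, sample_rate, and inter_duration columns (as in the audio_metadata.csv proposal above):

```python
import numpy as np
import pandas as pd


def weak_anomalies(metadata: pd.DataFrame) -> list[str]:
    """List weak anomalies (upload blocked unless force_upload is set)."""
    checks = {
        "duration": metadata["duration"].round(),  # rounded to the second
        "sample_rate": metadata["sample_rate"],
        "inter_duration": metadata["inter_duration"],
    }
    return [
        f"{name} is not constant across files"
        for name, values in checks.items()
        if len(np.unique(values.dropna())) > 1
    ]
```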

add timezone to timestamp.csv

The timestamp csv does not provide the timezone of the data, which is necessary for some data processing; it must be user-defined during the building of the dataset.
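
A sketch of the localization step with pandas, assuming timestamp.csv exposes a "timestamp" column and the timezone is asked for at build time:

```python
import pandas as pd


def localize_timestamps(csv_path: str, timezone: str) -> pd.DataFrame:
    """Attach a user-defined timezone (e.g. "Europe/Paris") to timestamp.csv."""
    df = pd.read_csv(csv_path)  # assumes a "timestamp" column
    df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.tz_localize(timezone)
    return df
```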

Create spectrograms without generating new wavs

The datasets can be very large, and if we only need the images, it is superfluous to write all the wav files.
It should be possible to load and transform the wav in RAM and only write the resulting spectrogram.
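
A minimal sketch of the in-memory pipeline (the resampling method and spectrogram parameters are illustrative):

```python
import matplotlib

matplotlib.use("Agg")  # headless: no display needed on a cluster
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
from scipy import signal


def spectrogram_without_wav(wav_path: str, target_sr: int, png_path: str) -> None:
    """Resample and transform entirely in RAM, writing only the PNG."""
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # mix down to mono
    if sr != target_sr:
        audio = signal.resample(audio, int(len(audio) * target_sr / sr))
    f, t, sxx = signal.spectrogram(audio, fs=target_sr, nperseg=1024)
    plt.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12))
    plt.savefig(png_path)
    plt.close()
```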

`force_init = True` not clear how to use it during adjustment...

In the scenario where you see a first set of adjustment spectrograms, you are not satisfied, and you change for example the dataset.dynamic_max parameter: do you have to rerun dataset.initialize? Currently I could not see the effect of changing this parameter on my adjustment spectrograms.
