
OSEkit's Introduction




OSEkit is an open source suite of tools written in Python and dedicated to the management and analysis of data in underwater passive acoustics.


Presentation

Among other key features, our toolkit has been adapted to be deployed on a cluster infrastructure; in OSmOSE, our production version runs on DATARMOR. Here are a few pointers to help you find your way through our documentation.


Getting into it

All the details to start using our toolkit and make the most of it are given in our documentation; you will then be redirected to more specific parts of it depending on your profile and intentions :)


Acknowledgements

  • A great part of our processing code is based on well-established Python packages for data analysis, including scipy, pandas, and numpy.

© OSmOSE team, 2023-present

OSEkit's People

Contributors

cazaudo, elodieensta, gabrieldubus, gmanatole, ixio, maellettrt, mathieudpnt, paulcarvaillo, pcarvaillo, pylrr, qouagga, rumengol


OSEkit's Issues

Compute the expected RAM for reshaping task

Idea: use the size of the largest available audio file multiplied by the number of files to be processed, with an upper bound. It is the most straightforward method, but more accurate and complex ones exist.
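
A minimal sketch of that heuristic, assuming the batch is given as a list of file paths; the function name and the 16 GiB cap are illustrative, not part of the package:

```python
from pathlib import Path


def estimate_reshape_ram(audio_files: list[Path], upper_bound: int = 16 * 2**30) -> int:
    """Estimate the RAM (in bytes) a reshaping batch may need.

    Heuristic from this issue: size of the largest audio file times the
    number of files to process, capped by an upper bound (16 GiB here).
    """
    largest = max(f.stat().st_size for f in audio_files)
    return min(largest * len(audio_files), upper_bound)
```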

Lighten the dependency relations between jobs

Currently, between the resampling, audio normalization, reshaping, and spectrogram generation steps, each job depends on the execution of every job of the previous step. This puts stress on the cluster, and while DATARMOR may handle it (especially when there are not many concurrent jobs), other clusters might not.

Eventually, it would be good to have 1:1 dependencies between the jobs.
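
A hedged sketch of how 1:1 dependencies could be expressed with PBS's depend attribute (the function is hypothetical; on Slurm, sbatch --dependency=afterok:<id> plays the same role):

```python
import subprocess


def submit_with_dependency(script: str, parent_job_id: str | None = None) -> str:
    """Submit a PBS job that starts only after its single parent succeeds."""
    cmd = ["qsub"]
    if parent_job_id is not None:
        cmd += ["-W", f"depend=afterok:{parent_job_id}"]
    cmd.append(script)
    # qsub prints the id of the submitted job on stdout
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout.strip()
```

Chaining each reshaping job on the id returned for its matching resampling job gives the 1:1 pattern instead of the current all-to-all one.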

timestamps of reshaping 'classic' are not right

It seems that continuous recordings are assumed, forgetting about the duty cycle.

Bug observed on dataset test_sample_DC when reshaping 30 s files to 15 s; there is a duty cycle: 30 s of recording every 5 minutes. The last timestamp after reshaping is then 2001-12-21T17-14-45_000.wav, while it should match the original ones: 20211201_201000.wav.
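
A minimal sketch of duty-cycle-aware timestamping, assuming the start times of the original files are known; chunks are generated inside each original file instead of on a continuous timeline:

```python
from datetime import datetime, timedelta


def reshaped_timestamps(
    original_starts: list[datetime], original_duration: float, chunk_size: float
) -> list[datetime]:
    """Start times of reshaped chunks when recordings are duty-cycled."""
    timestamps = []
    for start in original_starts:
        offset = 0.0
        while offset < original_duration:  # stay inside the original file
            timestamps.append(start + timedelta(seconds=offset))
            offset += chunk_size
    return timestamps
```

With 30 s files recorded every 5 minutes and a 15 s chunk size, each file yields two chunks, and the last timestamp stays aligned with the last original file, as expected.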

Add CLI

While notebooks are an easy and inherently reproducible way to use OSmOSE, they may not be suited to every use case. We can already import any function like a regular OSmOSE module, but a complete CLI would cover more needs, and ease the quick execution of some features.
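
A sketch of what such an entry point could look like; subcommands and options here are illustrative, not the package's actual API:

```python
import argparse


def main() -> None:
    """Hypothetical `osekit` command-line entry point."""
    parser = argparse.ArgumentParser(
        prog="osekit", description="OSmOSE underwater passive acoustics toolkit"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    build = sub.add_parser("build", help="build a dataset from raw audio")
    build.add_argument("dataset_path")

    spectro = sub.add_parser("spectrogram", help="generate spectrograms")
    spectro.add_argument("dataset_path")
    spectro.add_argument("--sr", type=int, default=48000, help="analysis sample rate")

    args = parser.parse_args()
    print(f"would run {args.command} on {args.dataset_path}")


if __name__ == "__main__":
    main()
```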

Add an energy detector

To adjust the spectrogram parameters in the best way possible, an energy detector able to flag spectrograms with supposed acoustic activity would be a great improvement.
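
A minimal first-pass sketch of such a detector, thresholding per-bin energy against the median (the parameters and function name are illustrative):

```python
import numpy as np
from scipy import signal


def energy_activity(audio: np.ndarray, sr: int, threshold_db: float = 6.0) -> np.ndarray:
    """Flag time bins whose energy exceeds the median by `threshold_db`.

    Active bins hint at where acoustic activity lies, which can then
    guide the choice of spectrogram parameters.
    """
    _, _, sxx = signal.spectrogram(audio, fs=sr, nperseg=1024)
    energy_db = 10 * np.log10(sxx.sum(axis=0) + 1e-12)  # per-bin energy in dB
    return energy_db > np.median(energy_db) + threshold_db
```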

saved matrices not indexed correctly

Currently, only *_4_3.npz files are saved; it should be *_4_0.npz, *_4_1.npz, *_4_2.npz, and so on, as the listing of the output directory below shows:

/home/datawork-osmose/dataset/MPSU_ForestouHuella/processed/spectrogram/30_48000/1024_1024_90/matrix/
20211201_235000_4_3.npz 20211204_094000_4_3.npz 20211208_014500_4_3.npz
20211202_142500_4_3.npz 20211204_131500_4_3.npz 20211208_172000_4_3.npz
20211202_192000_4_3.npz 20211204_184000_4_3.npz 20211209_061500_4_3.npz
20211203_010500_4_3.npz 20211204_204000_4_3.npz 20211210_062500_4_3.npz
20211203_064500_4_3.npz 20211204_223500_4_3.npz 20220110_012000_4_3.npz
20211203_110000_4_3.npz 20211205_101500_4_3.npz 20220110_151500_4_3.npz
20211203_114500_4_3.npz 20211206_072500_4_3.npz 20220110_182000_4_3.npz
20211203_142000_4_3.npz 20211206_102500_4_3.npz 20220110_191500_4_3.npz
20211203_180500_4_3.npz 20211206_124500_4_3.npz
20211203_203500_4_3.npz 20211207_054500_4_3.npz
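
A hedged sketch of the expected indexing, assuming the matrices of one audio file arrive as a list and the first suffix (4 here) is the tile count at that zoom level; the names are hypothetical:

```python
import numpy as np


def save_matrices(matrices: list[np.ndarray], base_name: str, n_tiles: int) -> None:
    """Save every matrix under its own index instead of only the last one.

    Writes `<base>_<n_tiles>_0.npz`, `<base>_<n_tiles>_1.npz`, ... rather
    than repeatedly overwriting `<base>_<n_tiles>_3.npz`.
    """
    for i, matrix in enumerate(matrices):
        np.savez(f"{base_name}_{n_tiles}_{i}.npz", matrix=matrix)
```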

metadata description at dataset uploading

After reading the entire dataset, what metadata do we save, and how? Add cells to the dataset_upload notebook to visualize summary statistics of the metadata (e.g. precise timelines of recordings).

Create an audio_metadata.csv file with the following columns (ideally, get back to the Tethys database to normalize field names); a sketch of the extraction follows the list:

  • filename
  • timestamp
  • status (read_header failed or succeeded)
  • format
  • duration
  • sample rate
  • duty cycle (timedelta between the current timestamp and the previous one)
  • volume
  • sampwidth
  • subtype (e.g. PCM16, see rumengol)
  • number of channels (stereo or mono)
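
A sketch of the extraction with soundfile and pandas; the timestamp, duty cycle, and volume columns are left to the caller, since they do not come from the file headers alone:

```python
from pathlib import Path

import pandas as pd
import soundfile as sf


def build_audio_metadata(audio_dir: Path) -> pd.DataFrame:
    """Collect per-file header metadata into an audio_metadata.csv."""
    rows = []
    for wav in sorted(audio_dir.glob("*.wav")):
        try:
            info = sf.info(wav)
            rows.append(
                {
                    "filename": wav.name,
                    "status": "read_header succeeded",
                    "format": info.format,
                    "duration": info.duration,
                    "sample_rate": info.samplerate,
                    "subtype": info.subtype,  # e.g. PCM_16
                    "channel_count": info.channels,
                }
            )
        except RuntimeError:
            rows.append({"filename": wav.name, "status": "read_header failed"})
    df = pd.DataFrame(rows)
    df.to_csv(audio_dir / "audio_metadata.csv", index=False)
    return df
```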

bug in bash getopts

On DATARMOR at least, we get the following error with the legacy reshaping:

File "/home/datawork-osmose/osmose_package_dev/src/OSmOSE/cluster/reshaper.sh", line 5
d)
^
SyntaxError: unmatched ')'
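
The traceback is Python's parser choking on a shell script: a SyntaxError on a `)` from a getopts case means reshaper.sh is being handed to the Python interpreter instead of bash. A hedged sketch of the likely fix, invoking the script explicitly through bash (the -d flag is inferred from the `d)` case shown in the error):

```python
import subprocess

# Run the reshaper script through bash explicitly, so it is never
# parsed by the Python interpreter (which trips over the getopts cases).
subprocess.run(
    ["bash", "/home/datawork-osmose/osmose_package_dev/src/OSmOSE/cluster/reshaper.sh",
     "-d", "/path/to/dataset"],
    check=True,
)
```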

Reduce pure python latency

This is especially true on the first call to the package: some dependencies take a long time to load. Importing the OSmOSE package and using basic functions should not have a visible impact on the duration of code execution.
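
One common way to get there is to defer the heavy imports to call time; a sketch with a hypothetical function:

```python
def compute_welch(audio, sr: int):
    """Compute a Welch periodogram with the heavy import deferred.

    Moving scipy (and similar) imports inside the functions that need
    them keeps `import OSmOSE` itself fast.
    """
    from scipy import signal  # loaded on first call, not at package import

    return signal.welch(audio, fs=sr)
```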

Delete job in Job_builder()

Currently, the regular way to cancel a job is through qdel (or scancel). However, doing so prevents the removal of the pbs (or sh) job file and confuses the Job builder. A delete_job(jobname) function should launch the qdel/scancel command and remove the job from its ongoing_jobs list to handle things in a cleaner way.
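
A sketch of what that could look like, assuming ongoing_jobs stores dicts with "id" and "file" keys (the real internals of Job_builder may differ):

```python
import subprocess
from pathlib import Path


class Job_builder:
    def __init__(self) -> None:
        # assumed shape: [{"id": "1234.datarmor0", "file": "job.pbs"}, ...]
        self.ongoing_jobs: list[dict] = []

    def delete_job(self, job_id: str) -> None:
        """Cancel a job cleanly: qdel it, remove its job file, and drop
        it from ongoing_jobs (use scancel instead of qdel on Slurm)."""
        subprocess.run(["qdel", job_id], check=True)
        for job in list(self.ongoing_jobs):
            if job["id"] == job_id:
                Path(job["file"]).unlink(missing_ok=True)
                self.ongoing_jobs.remove(job)
```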

forcing the initialization of the origin dataset?

I am not sure we should be able to set Force = True for the origin dataset; I have just done it by mistake, and two resample and reshape jobs were launched (I do not know what they are doing anyway). I think we should just block the execution and warn the user to set Force = False.

last update with os.setegid(gid) causing a bug on DATARMOR, see error below


PermissionError Traceback (most recent call last)
Cell In[2], line 7
4 local_execution = False
5 date_template = "" # strftime format, used to build the dataset from scratch (ignore if the dataset is already built)
----> 7 dataset = Spectrogram(dataset_path =Path(path_osmose_dataset, dataset_name), sr_analysis=sr_analysis, owner_group="gosmose", local=local_execution)
9 print(dataset)

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Spectrogram.py:91, in Spectrogram.init(self, dataset_path, sr_analysis, gps_coordinates, owner_group, analysis_params, batch_number, local)
30 def init(
31 self,
32 dataset_path: str,
(...)
39 local: bool = False,
40 ) -> None:
41 """Instanciates a spectrogram object.
42
43 The characteristics of the dataset are essential to input for the generation of the spectrograms. There is three ways to input them:
(...)
89 alone. The default is False.
90 """
---> 91 super().init(
92 dataset_path=dataset_path,
93 gps_coordinates=gps_coordinates,
94 owner_group=owner_group,
95 )
97 self.__local = local
99 processed_path = self.path.joinpath(OSMOSE_PATH.spectrogram)

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Dataset.py:67, in Dataset.init(self, dataset_path, gps_coordinates, owner_group, original_folder)
65 self.__path = Path(dataset_path)
66 self.__name = self.__path.stem
---> 67 self.owner_group = owner_group
68 self.__gps_coordinates = []
69 if gps_coordinates is not None:

File /home/datawork-osmose/osmose_package_dev/src/OSmOSE/Dataset.py:186, in Dataset.owner_group(self, value)
184 try:
185 gid = grp.getgrnam(value).gr_gid
--> 186 os.setegid(gid)
187 except KeyError as e:
188 raise KeyError(
189 f"The group {value} does not exist on the system. Full error trace: {e}"
190 )

PermissionError: [Errno 1] Operation not permitted

Adding auxiliary_variables class in Features directory

Create a class that loads and/or computes auxiliary data for a dataset.
The class becomes an option of other features, e.g. osmose.features.spectro(..., aux=True).
It takes the same timestamps as the created spectrograms/SPL/Welch outputs.

reshape should be made dependent on resample

Currently there is a bug when you do resample + reshape, because the jobs run simultaneously. If I understand the parameters of reshape properly, it reshapes the resampled wav files moved to the output directory, so the reshape job should be made dependent on the resample job.

Parameters : Namespace(input_files='/home/datawork-osmose/dataset/test_sample_DC/data/audio/10_2150', chunk_size=10, output_dir='/home/datawork-osmose/dataset/test_sample_DC/data/audio/10_2150', batch_ind_min=0, batch_ind_max=39, offset_beginning=0, offset_end=0, max_delta_interval=5, verbose=True, overwrite=True, force=True, last_file_behavior='pad')

Reorganize the log folder

Currently, all job files and logs are saved in the same folder, which makes things messy when multiple analyses are conducted on the same dataset over time.

The new organization will put the log and job files in a subdirectory named after the day the analysis is started, in the YYYY_MM_DD format.

Do not regenerate existing spectrogram

When spectrograms have been generated for a dataset and another set of spectrograms with the exact same parameters is requested, it should raise a warning and stop the execution.
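
A minimal sketch of such a guard, assuming the generation parameters of a previous run are saved next to its output (the parameters.json name is hypothetical):

```python
import json
from pathlib import Path


def check_existing_spectrograms(output_dir: Path, params: dict) -> None:
    """Refuse to regenerate spectrograms with identical parameters."""
    params_file = output_dir / "parameters.json"
    if params_file.exists() and json.loads(params_file.read_text()) == params:
        raise RuntimeError(
            f"Spectrograms with these exact parameters already exist in {output_dir}."
        )
```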

anomaly tests, message printing, and "block or not" uploading

  1. Tests for strong anomalies (= nothing can be done with the files), which block the upload, with no force_upload possible:
  • malformed header / unreadable or corrupted file
  • unsupported extension
    → to develop: print(strong anomaly, no upload possible) + store the offending files and propose to delete them automatically (see the delete_abnormal_files function)

  Print that the upload is blocked.

  2. Tests for weak anomalies, with force upload possible (a sketch of these checks follows this list):
  • as soon as len(np.unique) > 1 on duration (rounded to the second), sample_rate, or inter_duration
  • the check_n_files threshold is exceeded

  Print that the upload is blocked.

  3. If there is no anomaly, just print summary metadata (average duration, sample rate) and say that everything is OK.
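
A sketch of the weak-anomaly checks, assuming a metadata table with duration, sample_rate, and inter_duration columns (as in the audio_metadata.csv proposal above):

```python
import numpy as np
import pandas as pd


def weak_anomalies(metadata: pd.DataFrame) -> list[str]:
    """List weak anomalies (upload blocked unless force_upload is set)."""
    checks = {
        "duration": metadata["duration"].round(),  # rounded to the second
        "sample_rate": metadata["sample_rate"],
        "inter_duration": metadata["inter_duration"],
    }
    return [
        f"{name} is not constant across files"
        for name, values in checks.items()
        if len(np.unique(values.dropna())) > 1
    ]
```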

add timezone to timestamp.csv

The timestamp csv does not provide the timezone of the data, which is necessary for some data processing; it must be user-defined during the building of the dataset.
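
A sketch of the localization step with pandas, assuming timestamp.csv exposes a "timestamp" column and the timezone is asked for at build time:

```python
import pandas as pd


def localize_timestamps(csv_path: str, timezone: str) -> pd.DataFrame:
    """Attach a user-defined timezone (e.g. "Europe/Paris") to timestamp.csv."""
    df = pd.read_csv(csv_path)  # assumes a "timestamp" column
    df["timestamp"] = pd.to_datetime(df["timestamp"]).dt.tz_localize(timezone)
    return df
```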

Create spectrograms without generating new wavs

The datasets can be very large, and if we only need the images, it is superfluous to write all the wav files.
It should be possible to load and transform the wav in RAM and only write the resulting spectrogram.
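
A minimal sketch of the in-memory pipeline (the resampling method and spectrogram parameters are illustrative):

```python
import matplotlib

matplotlib.use("Agg")  # headless: no display needed on a cluster
import matplotlib.pyplot as plt
import numpy as np
import soundfile as sf
from scipy import signal


def spectrogram_without_wav(wav_path: str, target_sr: int, png_path: str) -> None:
    """Resample and transform entirely in RAM, writing only the PNG."""
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)  # mix down to mono
    if sr != target_sr:
        audio = signal.resample(audio, int(len(audio) * target_sr / sr))
    f, t, sxx = signal.spectrogram(audio, fs=target_sr, nperseg=1024)
    plt.pcolormesh(t, f, 10 * np.log10(sxx + 1e-12))
    plt.savefig(png_path)
    plt.close()
```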

`force_init = True` not clear how to use it during adjustment...

In the scenario where you see a first set of adjustment spectrograms, you are not satisfied, and you change for example the dataset.dynamic_max parameter: do you have to rerun dataset.initialize? Currently I could not see the effect of changing this parameter on my adjustment spectrograms.
