riggsd / guano-py
Python reference implementation of the GUANO bat acoustics metadata specification
Home Page: http://guano-md.org
License: MIT License
A small thing, but I think the type of Loc Accuracy should be float, not int, per the spec here: https://www.wildlifeacoustics.com/SCHEMA/GUANO.html. That's at guano.py line 170.
This discrepancy means that in strict mode, the parser fails to parse this field, and throws an exception. This became an issue in real life when I couldn't parse out a real wav file with this guano data:
GUANO|Version: 1.0
Species Manual ID: MyoBra/MyoMys,MyoDau,MyoNat
Loc Position: (60.805893333, 24.600683333)
Loc Accuracy: 40.02338521414699
Anabat|CallType: Matala
Perhaps a further refinement would be to have the parser fail open even in strict mode, so that a single bad field doesn't prevent other fields from being parsed? And possibly, make the default mode lenient? I can't imagine that clients typically want the parser to choke on a minor issue such as this.
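The fail-open behaviour suggested above could look roughly like the following. This is a minimal sketch, not guano-py's actual parser: the COERCE table and the parse_fields helper are hypothetical names, and the only thing taken from the report is that Loc Accuracy should be read as a float.

```python
# Hypothetical coercion table; only 'Loc Accuracy' -> float comes from the report.
COERCE = {"Loc Accuracy": float}


def parse_fields(text, rules=COERCE, strict=False):
    """Parse 'Key: value' lines, coercing the fields listed in rules.

    In lenient mode, a field that fails coercion keeps its raw string value
    instead of aborting the whole parse. Returns (parsed, errors).
    """
    parsed, errors = {}, {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        key, value = key.strip(), value.strip()
        try:
            parsed[key] = rules.get(key, str)(value)
        except ValueError as exc:
            if strict:
                raise
            parsed[key] = value  # fall back to the raw string
            errors[key] = exc    # remember what went wrong
    return parsed, errors
```

With this shape, a single bad field is reported in errors while every other field is still available to the caller.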
Clarify the spec to require that the GUANO|Version field be the first to appear in the metadata block, and that all GUANO namespace fields appear before top-level fields or vendor fields.
When I try to write metadata to files stored on a drive different from the one used by NamedTemporaryFile, I get an OSError: [WinError 17] The system cannot move the file to a different disk drive.
Could https://github.com/riggsd/guano-py/blob/master/guano.py#L489 be changed to use shutil.move?
This corrected the issue for me. I'd be happy to submit a PR.
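For illustration, a minimal sketch of the write-then-move pattern using shutil.move, which falls back to copy-and-delete when a plain rename fails across drives. The replace_file helper is a hypothetical name and does not mirror guano-py's own write code.

```python
import os
import shutil
import tempfile


def replace_file(dest_path, data):
    """Write data to a temp file, then move it over dest_path.

    shutil.move() copies and deletes when source and destination are on
    different filesystems, which is where a plain os.rename() raises.
    """
    # delete=False so the file survives the context manager and can be moved
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        shutil.move(tmp_path, dest_path)
    finally:
        # Clean up only if the move failed and the temp file is still around
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```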
Due to buggy timestamp creation, Elekon BatExplorer 2.1 (https://www.batlogger.com/en/downloads/batexplorer/software/be_2.1/) creates GUANO wave files with 7-digit fractional seconds. strptime() in parse_timestamp() accepts at most 6 digits for %f:
timestamp = datetime.strptime(s, '%Y-%m-%dT%H:%M:%S.%f')
(line 126)
The function can't parse the timestamp and raises a ValueError, so guano-py as shipped cannot be used with Elekon wave files.
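One possible workaround is to trim the fractional part to microsecond precision before calling strptime(). The sketch below assumes the same format string as quoted above; parse_long_fraction is a hypothetical helper, not part of guano-py.

```python
import re
from datetime import datetime

# Matches a fractional-seconds run longer than 6 digits and captures the first 6
_EXTRA_FRACTION = re.compile(r"(\.\d{6})\d+")


def parse_long_fraction(s):
    """Parse an ISO-like timestamp, truncating fractional seconds to microseconds."""
    trimmed = _EXTRA_FRACTION.sub(r"\1", s)
    return datetime.strptime(trimmed, "%Y-%m-%dT%H:%M:%S.%f")
```

Truncating rather than rounding keeps the helper simple and never pushes the value into the next second.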
I'm trying to use guano_edit.py to add Loc Position to a WAV file that already contains other GUANO metadata, but I'm getting errors. It is probably a Python newbie issue, but the following command produces an error:
guano_edit.py 'Loc Position: (1, 1)' 20231102_210205_NoID.wav
{'Loc Position': '(1, 1)'}
20231102_210205_NoID.wav
Traceback (most recent call last):
File "/home/user/.local/bin/guano_edit.py", line 100, in <module>
main()
File "/home/user/.local/bin/guano_edit.py", line 96, in main
update(gfile, md, dry_run=DRY_RUN)
File "/home/user/.local/bin/guano_edit.py", line 75, in update
print(gfile.to_string())
File "/home/user/.local/lib/python3.10/site-packages/guano.py", line 418, in to_string
v = self._serialize(k, v)
File "/home/user/.local/lib/python3.10/site-packages/guano.py", line 228, in _serialize
return serialize(value)
File "/home/user/.local/lib/python3.10/site-packages/guano.py", line 177, in <lambda>
'Loc Position': lambda value: '%f %f' % value,
TypeError: must be real number, not str
I have tried other types of Position formats with the same error.
What is the correct format please?
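The traceback shows the serializer applying '%f %f' % value, which needs a pair of numbers rather than the string passed on the command line. One possible workaround, sketched below, is to convert the text into a tuple of floats before it is assigned. The parse_position helper is hypothetical and is not part of guano-py or guano_edit.py.

```python
import re


def parse_position(text):
    """Turn '(1, 1)', '1, 1' or '1 1' into a (lat, lon) tuple of floats."""
    numbers = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    if len(numbers) != 2:
        raise ValueError("expected two numbers, got %r" % (text,))
    return float(numbers[0]), float(numbers[1])


position = parse_position("(1, 1)")
serialized = "%f %f" % position  # same format string as in the traceback
```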
Devices targeted at active recording typically allow the user to record an audible-range voice note to accompany the ultrasonic bat recording. The spec currently doesn't define a top-level field for voice note; should we do so?
While voice notes are likely of a lower samplerate (e.g. 44.1kHz), they may be even longer in duration than the actual bat recording, so the voice note could easily exceed a few MB in size (60 seconds of 16-bit 44.1kHz mono .WAV is ~5MB in size).
Should voice notes be embedded as a base64 binary field value?
Should we instead define a second chunk gbin for housing large binary "attachments", then reference them with a "pointer" inside the main guan chunk? With this strategy, reading implementations won't need to allocate memory and resources for reading these potentially large attachments unless they recognize that they want to. Additionally, by storing pure binary data in gbin a writing implementation won't need to base64 encode the data.
This issue applies not only to voice notes, but also thumbnail images (of rendered spectrogram, etc.), or any other "large" metadata value.
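To put numbers on the trade-off, here is a short worked calculation of the raw size of the 60-second voice note mentioned above and what base64 encoding would add. It ignores the WAV header, and the constant names are only illustrative.

```python
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2   # 16-bit
CHANNELS = 1           # mono
DURATION_S = 60

raw_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * DURATION_S
# base64 emits 4 output characters for every 3 input bytes (rounded up)
base64_bytes = 4 * ((raw_bytes + 2) // 3)

print(raw_bytes)     # 5292000
print(base64_bytes)  # 7056000
```

The base64 form is about a third larger, which is the overhead a separate binary chunk would avoid.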
I've played around with this a bit, but it seems fairly complex and maybe something someone else has already solved. The code from the Zcant package appears related but doesn't quite seem to work for this for me.
I'd be happy to work on this if it makes sense in the current package.
Timestamp from Wildlife Acoustics Echo Meter Touch 2 (iPhone app) is shown as None in guano_dump output:
(venv) root@be7e71f34666:/var/www/webroot/media/sessions/Session 20190430_210036# guano_dump.py PIPPYG_20190430_210244.wav | grep Timestamp
Timestamp: None
(venv) root@be7e71f34666:/var/www/webroot/media/sessions/Session 20190430_210036# tail -c800 PIPPYG_20190430_210244.wav
[BINARY STRIPPED]guan/GUANO|Version: 1.0
Firmware Version: App 2.7.7
Length: 15.00
Loc Position: 51.41691 -0.0671
Loc Elevation: 55.62311
Make: Wildlife Acoustics
Model: Echo Meter Touch 2
Original Filename: 20190430_210244.wav
Samplerate: 256000
Serial: E2B02356
Species Auto ID: PIPPYG
Species Manual ID:
Timestamp: 2019-04-30 21:02:44+0100
Note:
WA|Echo Meter|Auto ID: PIPPYG
WA|Song Meter|Audio settings: [{"prefix":"Yamato","trig max len":"15.00","rate":"256000","trig window":"3.00","trig min freq":"30000.00","trig level":"2.00","trig max freq":"128000.00","gain":"0.00"}]wamd�Echo Meter Touch E2B02356 App 2.7.72019-04-30 21:02:44+0100WGS84,51.41691,-0.0671,55.62311
Version info from Pipfile.lock:
"guano": {
"hashes": [
"sha256:a913804a1844866d74eab3b38ec5b7d7759d8ad6238e6d90b88bfa4fd4e4b235",
"sha256:fff5b8a63e713d47c0d8bb438bee989f2afb15c2a5aa95abd6e664f64786279d"
],
"index": "pypi",
"version": "==1.0.12"
},
I note while processing this timestamp programmatically in GuanoFile(filepath).item() that it is read from the guano data, but appears to have no timezone set, which causes a crash if you then call datetime.isoformat() on it. I would guess that parsing of this field is not setting a valid tz.
Happy to provide the wav file for testing; can also look at patching this myself at a later date once I'm more familiar with the library
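The timestamp in this file uses a space instead of "T" and an offset without a colon. A tolerant parser could try a few candidate formats in turn, as in this sketch. The format list and the parse_timestamp_lenient name are assumptions for illustration, not guano-py's implementation.

```python
from datetime import datetime

# Candidate formats, tried in order; the second matches the value shown above
FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%d %H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
)


def parse_timestamp_lenient(s):
    """Return a datetime (timezone-aware when an offset is present), or None."""
    s = s.strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None


ts = parse_timestamp_lenient("2019-04-30 21:02:44+0100")
print(ts.isoformat())  # 2019-04-30T21:02:44+01:00
```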
Using a namespaced field or a field containing a space character within a template substitution causes guano_edit.py to fail with the following exception:
Traceback (most recent call last):
File "./bin/guano_edit.py", line 95, in <module>
main()
File "./bin/guano_edit.py", line 91, in main
update(gfile, md, dry_run=DRY_RUN)
File "./bin/guano_edit.py", line 69, in update
gfile[k] = Template(v).substitute(gfile)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/string.py", line 176, in substitute
return self.pattern.sub(convert, self.template)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/string.py", line 173, in convert
self._invalid(mo)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/string.py", line 146, in _invalid
(lineno, colno))
ValueError: Invalid placeholder in string: line 1, col 1
Examples:
$> guano_edit.py 'Species Auto ID: ${SB|AutoID}' foo/bar/*.wav
$> guano_edit.py 'Note: Recorded at ${Loc Position} by me.' foo/bar/*.wav
The class attribute string.Template.idpattern is a regular expression responsible for validating template patterns.
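A subclass can loosen that pattern. The sketch below uses braceidpattern, which exists only on Python 3.7 and later (the traceback above is from Python 2.7, where idpattern would have to be overridden instead). GuanoTemplate is a hypothetical name and the field values are made up.

```python
from string import Template


class GuanoTemplate(Template):
    # Allow anything except a closing brace inside ${...}, so namespaced keys
    # ("SB|AutoID") and keys with spaces ("Loc Position") are accepted.
    # braceidpattern requires Python 3.7+.
    braceidpattern = r"[^}]+"


fields = {"Loc Position": "51.41691 -0.0671", "SB|AutoID": "Mylu"}
note = GuanoTemplate("Recorded at ${Loc Position} by me.").substitute(fields)
label = GuanoTemplate("${SB|AutoID}").substitute(fields)
```

Only the braced form is widened here; bare $name placeholders keep the default identifier rules.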
The GUANO|Size field is at best redundant, and could possibly be in conflict with reality. Remove it from the spec.
The original use case is that implementations may wish to pre-allocate a large fixed size metadata block so that metadata may be edited without having to rewrite the other RIFF chunks in the file.
The recommended way to do this is to:
Write initial metadata, but pad it out to a total of 2048 bytes (for example) with whitespace.
Whenever editing the file, read the size of the guan RIFF chunk. If the UTF-8 rendered metadata block is less than 2048 bytes, go ahead and write it starting at chunk offset 0x00, and continue padding with whitespace out to 2048 total bytes written.
If the UTF-8 rendered metadata block is greater than 2048 bytes, you must rewrite the entire .WAV file since your edited metadata will not fit in the preallocated "guan" chunk. Consider increasing its size by double so that subsequent edits won't overflow this newly allocated block.
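The steps above can be sketched as a small helper that either pads the metadata to the existing chunk size or reports that a larger chunk (and therefore a full rewrite) is needed. The function name and return shape are hypothetical, and a block of exactly the chunk size is treated as fitting, a case the text above leaves unspecified.

```python
PREALLOCATED_SIZE = 2048  # example size from the text above


def render_padded(metadata_text, chunk_size=PREALLOCATED_SIZE):
    """Return (payload, new_chunk_size, fits_in_place).

    If the UTF-8 metadata fits in the existing chunk, pad it with spaces so
    the chunk can be overwritten in place. Otherwise double the allocation
    until it fits; the caller then has to rewrite the whole .WAV file.
    """
    encoded = metadata_text.encode("utf-8")
    fits = len(encoded) <= chunk_size
    while len(encoded) > chunk_size:
        chunk_size *= 2
    payload = encoded + b" " * (chunk_size - len(encoded))
    return payload, chunk_size, fits
```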
The GUANO 1.0 specification defines Species Auto ID and Species Manual ID fields, which allow specifying a single species label. The abstraction that a single recording equals a single "bat pass" doesn't match reality, and we frequently record multiple individuals, including multiple species, in a single recording.
How should GUANO metadata handle the case of multiple species present within a single recording?
This might be well out of scope for the guano-py package but I'll share anyway. In working with bat wav files and introspecting them with guano-py I wanted to have some way to visualize their contents.
Would a plot function for a GuanoFile object be useful to others? I'm kinda thinking something like this:
g = GuanoFile("bat.wav")
g.plot()
Which would produce the following matplotlib figure:
This would require additional dependencies (these could be optional) such as matplotlib, numpy and scipy. As such the additional complexity might not be desirable. At any rate I'll share the code I came up with for this in case others find it useful.
https://github.com/talbertc-usgs/Notebooks/blob/master/bats/VisualizingGuanoData.ipynb
I'd be happy to submit a PR for this. I'd also be happy to either start or contribute to a different package that had a collection of useful bat data utilities or applications.
Also, I'm relatively new to visualizing sound files and bat data, so suggestions on the plot are welcome.
Would it be possible to add support for the Wildlife Acoustics W4V format? It uses GUANO metadata out of the box, but it's not possible to edit it with guano-py.
I have some WAV files from an EchoMeter TouchPro that fail to load as a GuanoFile (see: EPTFUS_20210617_213005.zip). Any help would be appreciated.
from guano import GuanoFile
path = r"\\nas3\NAS8_13Jan20\Sounds\read_metadata_test\WILDLIFE ACOUSTICS EM-TouchPRO\EPTFUS_20210617_213005.wav"
gf = GuanoFile(path)
Traceback (most recent call last):
File "", line 1, in
GuanoFile(path)
File "", line 215, in __init__
self._load()
File "", line 285, in _load
self._parse(metadata_buf)
File "", line 307, in _parse
self._md[namespace][key] = self._coerce(full_key, val)
File "", line 221, in _coerce
return self._coersion_rules[key](value)
ValueError: could not convert string to float:
GuanoFile.write() currently writes the fmt_, data, and guan chunks to an output file, but any other chunks are excluded.
We should (optionally?) persist all original chunks in the saved file.
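A rough sketch of what persisting all chunks could involve: walk every top-level RIFF chunk, copy each one through unchanged, and replace only the guan chunk. Both function names are hypothetical, the sketch works on an in-memory buffer for brevity, and it is not how GuanoFile.write() is implemented.

```python
import struct


def iter_riff_chunks(data):
    """Yield (chunk_id, payload) for every top-level chunk in a RIFF/WAVE buffer."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE buffer")
    offset = 12
    while offset + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, offset)
        yield chunk_id, data[offset + 8: offset + 8 + size]
        # Chunks are word-aligned: an odd-sized payload is followed by a pad byte
        offset += 8 + size + (size & 1)


def rebuild_with_guano(data, guano_bytes):
    """Copy every original chunk, replacing (or appending) the 'guan' chunk."""
    chunks = [(cid, body) for cid, body in iter_riff_chunks(data) if cid != b"guan"]
    chunks.append((b"guan", guano_bytes))
    body = b"WAVE"
    for cid, payload in chunks:
        body += cid + struct.pack("<I", len(payload)) + payload
        if len(payload) & 1:
            body += b"\x00"  # pad byte for word alignment
    # The RIFF size field counts everything after the first 8 bytes
    return b"RIFF" + struct.pack("<I", len(body)) + body
```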
The GUANO spec says that Species Auto ID and Species Manual ID are a "list of strings".
The serialized value should be comma-separated:
Species Manual ID: Mylu, Epfu
The Python library currently returns a str value, which in the above case would be "Mylu, Epfu".
This means that library users would need to split it and strip it to know how many and which species labels were present.
It means that they might need to parse it and reassemble it in order to append a new value. (The spec doesn't actually state that values must be unique, but for most use cases that's probably desired.)
Therefore I think we should set up coercion/serialization for these Species fields so that they return a Python list, as follows:
Species field not present: md.get("Species Auto ID") -> None and "Species Auto ID" in md == False
Species field empty/blank: md["Species Auto ID"] -> []
Species field has a single species label: md["Species Auto ID"] -> ["Mylu"]
Species field has multiple species labels: md["Species Auto ID"] -> ["Mylu", "Epfu"]
The spec defines these fields as optional, so the way to check whether a recording is a Mylu recording would be:
if "Mylu" in md.get("Species Auto ID", []): ...
It would be cleaner and more intuitive if these two fields were required, but Timestamp is the only required field at this time.
Another approach to making it slightly cleaner would be adding explicit accessor methods for all of the well-known fields, where the "in" operator could still check whether the field was present, and def species_auto_id(self) -> list would always return a list, possibly empty.
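A minimal sketch of the coercion and serialization pair described above. The function names are hypothetical and a plain dict stands in for the metadata object.

```python
def coerce_species(value):
    """'Mylu, Epfu' -> ['Mylu', 'Epfu']; a blank value -> []."""
    return [label.strip() for label in value.split(",") if label.strip()]


def serialize_species(labels):
    """['Mylu', 'Epfu'] -> 'Mylu, Epfu'."""
    return ", ".join(labels)


md = {"Species Auto ID": coerce_species("Mylu, Epfu")}
is_mylu = "Mylu" in md.get("Species Auto ID", [])    # field present
missing = "Mylu" in md.get("Species Manual ID", [])  # field absent, default list used
```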
The WAC namespace was removed in d2e65ac; re-add it to the specification doc.
Unfortunately, Guano module (1.0.14) no longer seems to be able to properly parse the files generated by Wildlife Acoustics devices such as SM4BAT.
Error:
Traceback (most recent call last):
File "/test.py", line 11, in <module>
print(g['Prefix'])
File "/usr/local/lib/python3.10/dist-packages/guano.py", line 349, in __getitem__
return self._md[namespace][key]
KeyError: ''
Attaching a test case for testing:
testcase.zip
Module Version: 1.0.14
Python: 3.10
Ensure that we clearly state the GUANO specification version in the actual spec and changelog, and clearly denote that the Python reference implementation's versioning is independent.
Idea: SQLite3 database or fast key-value database like berkeleydb named .guano.py.cache with index of (filename, filesize, timestamp, hash).
Hash should be a fast function, since cryptographic strength isn't needed here (e.g. crc32, md5, sha1, xxHash).
If we determine that the file hasn't changed, load metadata from cache.
Would this be significantly faster given that we'd need to do full file reads to compute hash? Is (filename, filesize, timestamp) sufficient without a hash?
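As a starting point for that comparison, here is a sketch of the hash-free variant keyed on (filename, filesize, timestamp) using sqlite3. The table layout, the function names, and the loader callback (whatever actually reads the GUANO metadata and returns a JSON-serializable dict) are all assumptions for illustration.

```python
import json
import os
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the cache; the table layout is an illustrative assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS guano_cache ("
        "filename TEXT PRIMARY KEY, filesize INTEGER, mtime_ns INTEGER, metadata TEXT)"
    )
    return db


def cached_metadata(db, filename, loader):
    """Return metadata from cache if size and mtime are unchanged, else reload."""
    st = os.stat(filename)
    key = os.path.abspath(filename)
    row = db.execute(
        "SELECT metadata FROM guano_cache WHERE filename=? AND filesize=? AND mtime_ns=?",
        (key, st.st_size, st.st_mtime_ns),
    ).fetchone()
    if row:
        return json.loads(row[0])
    metadata = loader(filename)  # slow path: actually read the file
    db.execute(
        "INSERT OR REPLACE INTO guano_cache VALUES (?, ?, ?, ?)",
        (key, st.st_size, st.st_mtime_ns, json.dumps(metadata)),
    )
    db.commit()
    return metadata
```

Only os.stat() is needed on a cache hit, so the file contents are never read unless the size or modification time has changed.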
Because the '\n' character is used as a field delimiter, the spec defines the two-character string "\n" (a backslash followed by the letter n) as an embedded newline within a field value. Without further escaping rules, this makes it impossible to intentionally use those two characters together.
For example, a field may have a Windows filesystem path as value, C:\bat_calls\new\, but a newline would inadvertently be inserted between C:\bat_calls and ew\.
Should we then define the two-character string "\\" to represent the character '\'? The above example would then need to be encoded as C:\\bat_calls\\new\\.
It should also be decided if these special escape characters apply to all string values, or exclusively to fields which specify "multi-line string" values?
For what it's worth, escaping a single token like "\n" is simple; simultaneously escaping several tokens, especially tokens containing the '\' character (which itself escapes special characters in C-based languages), is non-trivial.
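If the proposed "\\" rule were adopted, encoding and decoding could be done as in the sketch below. It illustrates the proposal, not the current spec or guano-py behaviour, and the function names are hypothetical. Decoding uses a single left-to-right regex pass so that an escaped backslash followed by the letter n is not mistaken for a newline.

```python
import re


def escape_value(value):
    """Encode backslashes first, then newlines, so the two never collide."""
    return value.replace("\\", "\\\\").replace("\n", "\\n")


# Maps each two-character escape sequence to the character it represents
_ESCAPES = {"\\\\": "\\", "\\n": "\n"}


def unescape_value(encoded):
    """Decode in a single left-to-right pass; other sequences are left untouched."""
    return re.sub(r"\\[\\n]", lambda m: _ESCAPES[m.group(0)], encoded)
```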
Failure when parsing raw Pettersson D1000X recordings:
Traceback (most recent call last):
gfile = GuanoFile(fname, strict=False)
File "/Users/driggs/workspace/guano-py/guano.py", line 202, in __init__
self._load()
File "/Users/driggs/workspace/guano-py/guano.py", line 256, in _load
raise ValueError(e)
ValueError: unpack requires a string argument of length 4
This may be because the RIFF chunk size at offset 0x04 is incorrect: it lists the entire filesize in bytes, when the value should be filesize - 8. No... it looks like the files include dozens of kilobytes of null padding after the correctly-declared data chunk. Is this even legal for a RIFF file?? Our chunk parser is reading \00\00\00\00 as the chunkid with chunksize 0, over and over and over again.