d47crunch's Issues

`indep_sessions` standardization is not working

Hi @mdaeron,
I hadn't noticed so far since we almost entirely use the pooled approach, but self.standardize(method='indep_sessions') is not working for me after updating D47crunch.

Python v3.9.6
D47crunch v2.3.2

import D47crunch
mydata = D47crunch.D47data()
mydata.read('rawdata.csv')
mydata.wg()
mydata.crunch()
mydata.standardize(method = 'indep_sessions')
Traceback (most recent call last):
  File ".../TESTING_D47crunch/test.py", line 14, in <module>
    mydata.standardize(
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 957, in newfun
    out = oldfun(*args, **kwargs)
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 1757, in standardize
    self.consolidate(tables = consolidate_tables, plots = consolidate_plots)
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 957, in newfun
    out = oldfun(*args, **kwargs)
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 2322, in consolidate
    self.consolidate_sessions()
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 2215, in consolidate_sessions
    self.sessions[session][f'r_D{self._4x}'] = self.compute_r(f'D{self._4x}', sessions = [session])
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 957, in newfun
    out = oldfun(*args, **kwargs)
  File ".../TESTING_D47crunch/venv/lib/python3.9/site-packages/D47crunch/__init__.py", line 2399, in compute_r
    _ for _ in self.standardization.params
AttributeError: 'D47data' object has no attribute 'standardization'

Implementation of Δ48 standardization

Clearly, D47crunch needs to implement Δ48 standardization methods, the sooner the better. Two obvious approaches are:

  1. just copy-paste the relevant methods and replace 47 by 48 in the new functions, as you did. With a bit of house-cleaning (docs, etc.), this will work in the short term;
  2. alternatively, rewrite everything so that a single D47data.standardize() method does Δ47, Δ48, or both.

Option (2) is harder to implement and I am not sure what we would gain from doing things in a very general way (it's not like there are many other standardizations that we will be doing in the near future beyond perhaps one day Δ49). What's more, it is entirely possible that many labs would use slightly different methods for D47 and D48 standardization (e.g. process Δ47 in the I-CDES RF but process Δ48 using H/E gas standards), so it is perhaps more natural to implement separate functions, potentially used with different arguments.

There are additional, minor issues to deal with, such as how the code should take into account that some samples will be anchors for Δ48 but not for Δ47, or vice versa.

`crunch` throws an error when `d45` has the wrong type

Hi Mathieu!

I've been trying out D47crunch for our data!

The first issue I had was that our UIDs and Sessions were integers and datetimes respectively, so I had to cast them to strings to get mydata.crunch() to work. Otherwise it would throw this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pygzCIH1", line 9, in <module>
  File "<string>", line 1, in <module>
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1274, in crunch
    self.compute_bulk_and_clumping_deltas(r)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1349, in compute_bulk_and_clumping_deltas
    R45 = (1 + r['d45'] / 1000) * R45_wg
TypeError: unsupported operand type(s) for /: 'str' and 'int'

I think I managed to fix this by first casting them to strings and then running it again.
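A workaround along those lines can be sketched as a small pre-processing helper, assuming each record is a plain dict with the column names from the csv layout used elsewhere in these issues (the helper itself is hypothetical, not part of D47crunch):

```python
# Hypothetical helper: coerce each record so that UID, Session and
# Sample are strings and the delta columns are floats, avoiding
# str/int arithmetic errors inside crunch().
DELTA_COLS = ('d45', 'd46', 'd47', 'd48', 'd49')

def coerce(row):
    out = {k: str(row[k]) for k in ('UID', 'Session', 'Sample')}
    out.update({k: float(row[k]) for k in DELTA_COLS if k in row})
    return out
```

For example, `coerce({'UID': 21401, 'Session': '2021-05-12', 'Sample': 'ETH-2', 'd45': '-7.44'})` yields a record with `UID` as the string `'21401'` and `d45` as the float `-7.44`.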

UTF-8 encoding of csv

Just a suggestion regarding the documentation...

It seems that D47data.read() will fail if the input csv file is in UTF-8 encoding. This is not a problem with D47crunch, but may trip up some new users if they were to produce their csv file using the default csv format of certain spreadsheet programs.

Would it be worth putting a warning about this in section 1.2 of the documentation just in case?
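Pending such a warning, affected users could strip the likely culprit before calling D47data.read(). A hedged sketch, assuming the failure is caused by a UTF-8 byte-order mark as written by some spreadsheet programs (function name is illustrative):

```python
def strip_bom(src, dst):
    # The 'utf-8-sig' codec silently removes a leading BOM on read;
    # writing the text back out as plain UTF-8 yields a file that
    # a strict reader can ingest.
    with open(src, encoding='utf-8-sig') as f:
        text = f.read()
    with open(dst, 'w', encoding='utf-8') as f:
        f.write(text)
```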

clumpycrunch formatting

When using ClumpyCrunch v2.1, processing tab-delimited pasted data (e.g., "Daeron") returns a "WG computation failed for some reason" message.

A temporary fix is to sanitize the tab-delimited data by removing trailing tabs at the end of lines (e.g., "Daeron_sanitized").

Attachments: Daeron.txt, Daeron_sanitized.txt
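The sanitization step described above can be expressed as a one-liner; a sketch (function name is illustrative):

```python
def strip_trailing_tabs(text):
    # Hypothetical sanitizer: drop trailing tabs from every line of a
    # tab-delimited paste before feeding it to the WG computation.
    return '\n'.join(line.rstrip('\t') for line in text.splitlines())
```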

`standardize` is slow with large datasets

This seems to work! The standardize step is pretty slow (several minutes for 552 samples and 1959 anchors), but I have no idea how fast it should be, and computing session ETFs with fancy maths is also very slow in my R version.
It didn't throw an immediate error about the wrong Sample names this time and seems to have worked out okay!

Originally posted by @japhir in #6 (comment)

I'm now running it on my full dataset of samples and standards, which consists of about 19000 aliquots in total. It's been running for half an hour or so now, and I have no idea whether I should let it keep on chugging or if I should cancel it and run it only for those subsets of the data that I want to do this for.

Perhaps we can improve this by implementing parallelization? Or showing a progress bar so that users know how long the wait will likely last?
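As a stopgap, long-running loops can at least report progress to the user; a minimal stdlib sketch (a real implementation might use tqdm, or parallelize the fit itself; this wrapper is illustrative and not part of the D47crunch API):

```python
import sys

def with_progress(iterable, total, label='standardizing'):
    # Minimal progress indicator written to stderr, so it does not
    # interfere with normal output; purely illustrative.
    for i, item in enumerate(iterable, 1):
        sys.stderr.write(f'\r{label}: {i}/{total}')
        sys.stderr.flush()
        yield item
    sys.stderr.write('\n')
```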

standardize throws an error because of package lmfit

after having solved the first output errors in #5, I immediately ran into another issue when calling mydata.standardize():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/py2vPmgk", line 3, in <module>
  File "/tmp/babel-UvbDtD/python-jGCwX1", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1625, in standardize
    params.add(f'D{self._4x}_{pf(sample)}', value = 0.5)
  File "/home/japhir/.local/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/home/japhir/.local/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'D47_AU002_(2)' is not a valid Parameters name"

Note that the lines referring to /tmp/... etc. are there because I'm running python from an orgmode file in emacs.

I'm not sure how to debug this one, so I'll leave it as an issue here!
I'm sure it must be something in my dataset (cannot share unfortunately, latest measurements) because I didn't get this error when running the code on your example tiny dataset. My dataset is structured as follows:

UID,Session,Sample,d45,d46,d47,d48,d49
21401,2021-05-12,ETH-2,-7.442475902551424,-5.885893603532355,-13.981637618936443,-12.401052771990098,-19.14472530378544
...

I've made sure there are no NAs in there, so I'm a bit confused.

I'm running python 3.9.6 with D47crunch 2.0.2, which in my case relies on lmfit 1.0.2, matplotlib 3.4.3, scipy 1.7.1 and numpy 1.21.2
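The KeyError may come from the sample name itself: 'AU002 (2)' contains a space and parentheses, so the derived parameter name 'D47_AU002_(2)' is not a valid Python identifier, and lmfit parameter names must be usable as symbols in its expression evaluator. If that diagnosis is right, renaming samples before standardizing should avoid the error; a hypothetical sketch:

```python
import re

def sanitize(name):
    # Replace any character that is not a letter, digit or underscore,
    # so the resulting lmfit parameter name is a valid identifier.
    return re.sub(r'\W', '_', name)
```

For example, `sanitize('AU002 (2)')` gives `'AU002__2_'`, and `'D47_' + sanitize('AU002 (2)')` is then a valid identifier.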

Improve `plot_distribution_of_analyses` or throw warnings in case there are many sessions/samples

  1. Ok I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.

Agreed in principle, but such a plot could be pretty useful for anybody, there's nothing specific to Jens' study here. Again, if it works poorly for your use case, let me know and we can improve it if you feel it's worth it.

Originally posted by @mdaeron in #5 (comment)

I think the main issue with my output is that while I have only 3 sessions, I ran the script for ALL measurements ever, so the number of unique Samples was huge, meaning the figure had so many rows that it was illegible: basically a rectangle full of blue points.

add a **see also** section to readme

With the recent release of David Bajnai's isogeochem, people looking for clumped-related processing code might be getting a bit lost in the woods when searching github. It would be nice to add a short description to each of our README sections linking to each other's projects, specifying how the projects differ from each other. What do you think?

I've added the following short sentence to my R package clumpedr's dev branch:

  • D47crunch is Python code to process clumped-isotope data from raw δ47 values. Among other features, it implements a pooled regression approach to the empirical transfer function.

are you happy with the description?

missing dependency: rich?

Hi Mathieu,

I just installed the latest version of D47crunch in a new virtual env and encountered the following error when importing D47crunch:

AttributeError: module 'typer' has no attribute 'rich_utils'

It seems to have been resolved by installing the rich package.

Should this be added as a dependency in the .toml file?

issue with trying to set parameter b to 0 in standardize(constraints = ...) call

In an email (because the data are not published yet) you mentioned

  • If you want to force your "compositional non-linearity slopes" (parameter
    b in Daëron, 2021) to be zero, you can use the constraints argument of
    D47data.standardize(). I realize now that this is under-documented, but the
    way it works, if you want to set b=0 for sessions foo, bar and baz, is
    to use constraints = dict(b_foo = 0, b_bar = 0, b_baz = 0). I haven't
    extensively tested this option yet, however, so things may break somewhere
    else as a result. As always, please open an issue if that is the case.

I've tried this out but I couldn't get it to work. My sessions were named '2018-02-23', '2020-01-03', and '2021-05-12' and when I tried:

  mydata.standardize(
      constraints = dict(b_2018-02-23 = 0,
                         b_2020-01-03 = 0,
                         b_2021-05-12 = 0))

I got the following errors

Traceback (most recent call last):
  File "<string>", line 8, in __PYTHON_EL_eval
  File "/usr/lib/python3.10/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<string>", line 1
    dict(b_2018-02-23 = 0,
                ^
SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers

I then thought that the names should be passed as strings, so I wrapped them in quotes ('b_2018-02-23'), but this gave me the next error:

Traceback (most recent call last):
  File "<string>", line 8, in __PYTHON_EL_eval
  File "/usr/lib/python3.10/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "<string>", line 1
    dict('b_2018-02-23' = 0,
         ^^^^^^^^^^^^^^^^
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?

which made me think that the issue may be with the names of the sessions, because typically - is not allowed in a variable name (right?). So I renamed the sessions to session0, session1, and session2 (see, I'm getting the hang of Python's 0-indexing even though I'm an R user ;-))

Traceback (most recent call last):
  File "<string>", line 17, in __PYTHON_EL_eval
  File "<string>", line 3, in <module>
  File "/tmp/babel-qGlEVX/python-7afIAv", line 1, in <module>
    mydata.standardize(
  File "/home/japhir/SurfDrive/PhD/programming/D47crunch/D47crunch/__init__.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/D47crunch/D47crunch/__init__.py", line 1628, in standardize
    params[k].expr = constraints[k]
  File "/usr/lib/python3.10/site-packages/lmfit/parameter.py", line 845, in expr
    self.__set_expression(val)
  File "/usr/lib/python3.10/site-packages/lmfit/parameter.py", line 860, in __set_expression
    self._expr_ast = self._expr_eval.parse(val)
  File "/usr/lib/python3.10/site-packages/asteval/asteval.py", line 257, in parse
    if len(text) > self.max_statement_length:
TypeError: object of type 'int' has no len()

Note that calling mydata.standardize() without any constraints does work as expected (and doesn't result in very different results after limiting the anchors to ETH-1, ETH-2, and ETH-3).

Any ideas?
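Two separate problems seem to be at play here: hyphenated session names cannot be written as keyword arguments (hence the SyntaxErrors), and the final traceback (`params[k].expr = constraints[k]`, followed by asteval's `parse` choking on an int) suggests that constraint values are parsed as expression strings rather than numbers. A hypothetical workaround, untested against this dataset:

```python
# Hypothetical sketch: use a dict literal so hyphenated session names
# are legal keys, and pass each constraint value as the string '0',
# since lmfit parameter expressions appear to be parsed as text.
constraints = {
    'b_2018-02-23': '0',
    'b_2020-01-03': '0',
    'b_2021-05-12': '0',
}
# mydata.standardize(constraints = constraints)
```

If the hyphenated keys still break inside lmfit, the renamed sessions (session0 etc.) combined with string values may be the safer combination.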

Update module-level table functions for Δ49

These functions need to be updated at some point to also list Δ49 data:

  • D47crunch.table_of_samples()
  • D47crunch.table_of_sessions()
  • D47crunch.table_of_analyses()

Nothing complicated there, but requires a little bit of time investment.

D47crunch name

Mathieu, now that D47crunch also handles Δ48 crunching, would you consider changing the name to something mass-neutral, like clumpy? E.g.

import numpy as np
import clumpy as cp

Or would this be a terrible idea since D47crunch is mentioned quite a bit in your uncertainty propagation paper, and probably elsewhere?

Output temperature in Table of Samples?

Would it be possible / make any sense for D47data.table_of_samples() to accept temperature calibration constants as optional arguments and output D47 temperature in the Table of Samples?
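Such an option might simply invert a calibration of the form Δ47 = a/T² + b. A minimal sketch; the default constants are given only as an illustration (they follow Anderson et al., 2021, I-CDES) and are not something table_of_samples() currently ships with:

```python
import math

def D47_to_temperature(D47, a=0.0391e6, b=0.154):
    # Invert a calibration of the form D47 = a / T**2 + b.
    # Constants are illustrative defaults, to be supplied by the user.
    T_kelvin = math.sqrt(a / (D47 - b))
    return T_kelvin - 273.15  # degrees Celsius
```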

decide on how to make plotting interfaces consistent

Continuing the discussion started in #9:

I need to go back to the various plotting functions, but you're right that they lack consistency. The thing to decide first is what should be the default behavior?

* return a matplotlib `Figure` object by default, which means that most aspects of the figure may be modified before displaying/saving it; return `None` and save to disk with some optional parameter (e.g. `output = 'savefig'`)

* return `None` and save to disk by default; return a `Figure` object with some optional parameter (e.g. `output = 'fig'`)

In both cases, output = 'ax' could be used to ask for a matplotlib Axes object, which allows including the plot in a larger figure.

Any opinion on the best default behavior? Any change needs to be deliberate because it will (slightly) break backward compatibility.

I had the same problem with clumpedr's plotting functions, where I had to decide on whether plots were generated by default, whether they were returned as plot objects or printed or saved. In the end I decided to remove plotting from all the processing functions themselves, and to let the plots always return the plot object for further tweaking, letting the user decide how to print/save for themselves.

I'm not sure what is best though: are the users of D47crunch going to be seasoned Python users? In that case it's probably nicest to return the plot objects for further tweaking. If they're going to be Python novices (like me), it might be nicer to save PDFs by default, so that you don't need to do more in Python and can just look at the output.
I think in this case, the second option may actually be better, where advanced users can specify that they want the plotting function to return the plot object instead. What are your thoughts?
