
Comments (11)

mdaeron avatar mdaeron commented on July 20, 2024

Good point about the documentation.

We could use str() when reading the Session field, with the following results:

from datetime import date, datetime
import arrow # another popular option

str(date.today())   # returns '2021-09-10'
str(datetime.now()) # returns '2021-09-10 12:03:37.940927'
str(arrow.now())    # returns '2021-09-10T12:04:02.268544+02:00'
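
One caveat worth noting here (a standard-library sketch): str(datetime.now()) is naive local time, with no timezone attached, whereas arrow attaches the local UTC offset. An aware datetime is also available from the standard library:

```python
from datetime import datetime, timezone

naive = datetime.now()              # local wall-clock time, tzinfo is None
aware = datetime.now(timezone.utc)  # explicitly UTC, tzinfo is set

print(naive.tzinfo)  # None
print(str(aware))    # e.g. '2021-09-10 10:03:37.940927+00:00'
```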

Would that work for your use case?

from d47crunch.

japhir avatar japhir commented on July 20, 2024

Am I correct that arrow includes the timezone while datetime relies on UTC?

I am an absolute beginner in Python, so I have no idea what packages are out there, what they do, or how to modify the examples you listed in the docs to my liking ;-). I basically used R to generate the csv in the desired structure and then ran the following in sequence:

  import D47crunch
  mydata = D47crunch.D47data()
  mydata.read('/home/japhir/SurfDrive/PhD/programming/dataprocessing/out/rawdata.csv')
  mydata.wg()  # as an aside, where do I tell it the d13C and d18O of the working gas?
  mydata.plot_distribution_of_analyses() # gives an unusable plot if you have many measurements, in my case 1779 measurements only (that's a tiny subset of everything since may 2021)
  mydata.crunch() # this gave me the error if I didn't make sure UID and Session were characters
  mydata.standardize() # this still gives me the error in #6 
  mydata.summary(verbose = True, save_to_file = False) # can only reach this step with the simulated data

The #6 in the code block's comment doesn't create a link, so here it is.


mdaeron avatar mdaeron commented on July 20, 2024
  1. Going back to your first message above, the error you report has (I think) nothing to do with the Session field. Instead, it's complaining that r['d45'] is a string. After mydata.read(), what does type(mydata[0]['d45']) return?
  2. mydata.wg() is only used if you want to compute the WG composition based on your δ13C and δ18O standards. If you want to specify arbitrary values for the WG, add fields d13Cwg_VPDB and d18Owg_VSMOW to your csv file. This used to be included in the documentation but was dropped at some point; I'll put it back in. I personally advise using D47data.wg() rather than nominal WG values, but YMMV of course.
  3. plot_distribution_of_analyses(): maybe open another issue with an example of the output, and suggestions re: how to improve the plot. I've only used it once, for the data in Fiebig et al. (2021), but the plot could be heavily customized using function keywords if needed.
  4. crunch(): I don't understand how UID and Session end up not being strings, unless you used the 'session' keyword in D47data.read()? In that case, yes, the Session field needs to be a string. That doesn't explain why UID is not a string. The relevant code is below:
data = [{k: v if k in ['UID', 'Session', 'Sample'] else smart_type(v) for k,v in zip(txt[0], l) if v != ''} for l in txt[1:]]

if session != '':
	for r in data:
		r['Session'] = session
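
For reference, smart_type is not shown above; a hypothetical sketch of what such a converter typically does (try int, then float, else keep the string), not necessarily D47crunch's exact implementation:

```python
def smart_type(v):
    # Hypothetical sketch of a type-guessing converter: try int, then float,
    # otherwise return the string unchanged. Not necessarily the exact
    # D47crunch implementation.
    try:
        return int(v)
    except ValueError:
        try:
            return float(v)
        except ValueError:
            return v

smart_type('42')     # returns 42
smart_type('12.5')   # returns 12.5
smart_type('ETH-3')  # returns 'ETH-3'
```

This explains why UID and Session stay strings only when they are excluded from conversion: anything numeric-looking gets promoted to a number otherwise.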


japhir avatar japhir commented on July 20, 2024
  1. I was only able to get the code running if I took a small subset of my whole dataset. I thought some rows were forcing the whole d45 column to be a string; however, for one of the subsets that breaks (n = 1000) I get the same error even though type(mydata[0]['d45']) returns <class 'float'>.
  2. thanks for the explanation! I'll have a look at whether the wg computed value is very different from the one specified by our gas provider.
  3. Ok, I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.
  4. I think I'm mixing issues here, sorry about that. If I filter to rows where the above error doesn't take place, this error pops up with
>>>  mydata.standardize()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pyPjNI3W", line 3, in <module>
  File "/tmp/babel-yqvP9s/python-GXTFNT", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1618, in standardize
    params.add(f'a_{s}', value = 0.9)
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'a_2020_01_03T00:00:00Z' is not a valid Parameters name"

If I then re-apply my manual fix (converting Session to character prior to csv export) and re-run, I get the other issue, #6, where the Sample names cause problems:

>>>   mydata.standardize()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/pykIhOHv", line 3, in <module>
  File "/tmp/babel-yqvP9s/python-EUrohX", line 1, in <module>
    mydata.standardize()
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1006, in newfun
    out = oldfun(*args, **kwargs)
  File "/home/japhir/SurfDrive/PhD/programming/apply_D47crunch/D47crunch.py", line 1625, in standardize
    params.add(f'D{self._4x}_{pf(sample)}', value = 0.5)
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 373, in add
    self.__setitem__(name, Parameter(value=value, name=name, vary=vary,
  File "/usr/lib/python3.9/site-packages/lmfit/parameter.py", line 137, in __setitem__
    raise KeyError("'%s' is not a valid Parameters name" % key)
KeyError: "'D47_AU002_(2)' is not a valid Parameters name"
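
Both tracebacks point at the same constraint: lmfit Parameters names must be valid Python identifiers, so characters like :, (, and ) are rejected. A hypothetical sanitizer (not necessarily what D47crunch's pf does) could map every offending character to an underscore:

```python
import re

def sanitize(name):
    # Hypothetical helper: replace every character that is not a letter,
    # digit, or underscore with '_', so the result is a valid Python
    # identifier (assuming it does not start with a digit).
    return re.sub(r'\W', '_', name)

sanitize('a_2020_01_03T00:00:00Z')  # returns 'a_2020_01_03T00_00_00Z'
sanitize('D47_AU002_(2)')           # returns 'D47_AU002__2_'
```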


mdaeron avatar mdaeron commented on July 20, 2024
  1. I was only able to get the code running if I took a small subset of my whole dataset. I thought some rows were forcing the whole d45 column to be a string; however, for one of the subsets that breaks (n = 1000) I get the same error even though type(mydata[0]['d45']) returns <class 'float'>.

mydata[0] is only the first line. To find the offending lines, you can try:

for k, r in enumerate(mydata):
    try:
        r['d45'] / 1000  # fails if the value is not numeric
    except TypeError:
        print(k, r['d45'], type(r['d45']))

This will print out the index, r['d45'] and type for the offending lines.


mdaeron avatar mdaeron commented on July 20, 2024
  1. Ok, I'll file it if I get the example output working. I thought it might be because I'm putting waaaay too many unique Samples in there, but if you used it for the Fiebig study I guess they also had many replicates. I think in general it's best to keep code that isn't strictly necessary/useful to almost all users out of this repo and include it in the repository that accompanies the paper instead. This will make the codebase easier to maintain.

Agreed in principle, but such a plot could be pretty useful for anybody, there's nothing specific to Jens' study here. Again, if it works poorly for your use case, let me know and we can improve it if you feel it's worth it.


japhir avatar japhir commented on July 20, 2024

Aargh, now it's breaking again, but this time type(mydata[0]['d45']) is <class 'str'>.

I think it's because the Sample names in the subset that is causing issues are things like "ETM-2_690_5-83,85(Ntr)", which needs to be quoted, while some others are not quoted.

I've looked into how my R csv export decides to quote and/or escape things. By default it quotes "as needed", meaning that the above one with parentheses etc. gets quoted, but a simple ETH-3 doesn't.
Maybe the offender is "ETM-2_690_4-138,145(Ntr)[2]"? With quotes, parentheses AND square brackets this one is a winner in messing up computers ;-). I found this blogpost about reading/writing csv files with edge cases. For now I will simply transform these names manually?

After specifically filtering out any Sample with (, ), [, ] I think it's running smoothly now!
filter(!str_detect(identifier_1, "\\(|\\)|\\[|\\]")) %>%

mydata[0] is only the first line. To find the offending lines, you can try:

wait, so python's rows are by default lists that can contain different types for each row? o.O


mdaeron avatar mdaeron commented on July 20, 2024

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default, so those extra commas insert spurious columns into your data. Using tabs or semicolons instead should help a lot.
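
To illustrate the failure mode with a naive split (hypothetical header and values), an unquoted comma inside Sample shifts every later field by one column:

```python
header = ['UID', 'Session', 'Sample', 'd45']  # hypothetical columns
line = '1,S1,ETM-2_690_5-83,85(Ntr),10.2'     # Sample contains a comma
fields = line.split(',')                      # naive comma split
print(len(fields))                       # 5 (one field too many)
print(dict(zip(header, fields))['d45'])  # '85(Ntr)' instead of '10.2'
```

This also matches the earlier symptom of d45 unexpectedly containing a string.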

wait, so python's rows are by default lists that can contain different types for each row? o.O

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case, D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int or str), with each key associated with an arbitrary object (which could be a list or another dictionary, among other things). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (= analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses, and test for the presence of these fields using the following code:

for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')


japhir avatar japhir commented on July 20, 2024

I realized your problem is definitely caused by commas in sample names. The csv file uses commas as separators by default, so those extra commas insert spurious columns into your data. Using tabs or semicolons instead should help a lot.

ah yes, that must be it. Is there any escape syntax you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax, right? The link I shared before has some Python csv package code that specifies how to handle quote syntax etc.
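
Indeed, quoted commas are valid csv, and Python's standard csv module parses them correctly. A minimal sketch with an invented two-line file:

```python
import csv
import io

raw = 'UID,Session,Sample\n1,S1,"ETM-2_690_5-83,85(Ntr)"\n'  # invented example
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])  # ['1', 'S1', 'ETM-2_690_5-83,85(Ntr)']; the quoted comma survives
```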


japhir avatar japhir commented on July 20, 2024

Not always. Python has lists of items, where each item can be any object, even another list. There is no concept of rows for such lists. In this case, D4xdata objects are built as lists of dictionaries. A dictionary is a set of keys (e.g., of type int or str), with each key associated with an arbitrary object (which could be a list or another dictionary, among other things). So mydata[0] is the first item (a dictionary, in this case, which describes a single analysis) in the list mydata, and mydata[2]['d45'] is the value associated with key d45 in the third item (= analysis) of mydata.

There is no obligation that all items in mydata have the same fields. You could add arbitrary fields (e.g., mineralogy) to some analyses, and test for the presence of these fields using the following code:

for r in mydata:
    if 'mineralogy' in r:
        print(f'Analysis {r["UID"]}: {r["mineralogy"]}')

Thanks for the elaboration :)


mdaeron avatar mdaeron commented on July 20, 2024

ah yes, that must be it. Is there any escape syntax you can use during export to keep the original identifier_1 column? The R export has quoted the fields, so the commas are technically correct csv syntax, right? The link I shared before has some Python csv package code that specifies how to handle quote syntax etc.

I confess that I never took the time to use a real csv parser as you suggest. I'll add this as a todo item for the dev version.

