
Comments (7)

mdaeron commented on July 20, 2024

> To combine them, I first had to create a massive dict that contains all 1335 samples. I first tried to read in a csv that links the bin to a sample name, but struggled too much and in the end made a massive dict definition just using some search/replace on a csv export.

For the record, one simple way to do this kind of thing is:

csv_contents = '''
sample1, bin1
sample2, bin1
sample3, bin1
sample4, bin2
sample5, bin2
sample6, bin2'''[1:]

bindata = [l.split(', ') for l in csv_contents.split('\n')]
bins = {_[1] for _ in bindata} # NB this is a set, not a dict
bindict = {b: [sample for sample, bin in bindata if bin == b] for b in bins}

# this results in:
# 
# bindict = {
# 	'bin2': ['sample4', 'sample5', 'sample6'],
# 	'bin1': ['sample1', 'sample2', 'sample3'],
# 	}
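Since you mention struggling to read that mapping directly from a csv file, here is a minimal sketch doing the same thing with Python's standard csv module (the file name bins.csv is hypothetical, and I'm assuming two columns per row, sample name then bin name, with no header line):

# hypothetical sketch: build the same bindict straight from a csv file
# (assumes two columns per row: sample name, bin name; no header line)
import csv

bindict = {}
with open('bins.csv', newline='') as fid:  # 'bins.csv' is a hypothetical file name
    for sample, bin_name in csv.reader(fid, skipinitialspace=True):
        bindict.setdefault(bin_name, []).append(sample)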

> Now I managed to use combine_samples() with this dict, and it returns, as it says on the tin:
>
>   • the list of group names
>   • an array of the corresponding Δ4x values
>   • the corresponding (co)variance matrix
>
> but now I'm not sure how to combine these again so that I get a table of:
>
> | bin id | D47 average | D47 95% CI |
>
> Basically I don't know how to calculate the 95% CI from an NxN covariance matrix and how to export the stuff to a csv.

By default I estimate 95% confidence limits by multiplying the standard error for each bin (the square root of the corresponding diagonal element of the covariance matrix) by the relevant Student's t factor for the number of degrees of freedom in the standardization model. This t factor is conveniently stored in the D47data object's t95 property. So if your D47data instance is mydata, and mydata.combine_samples() returns mybins, D47_avg, D47_CM, you could do:

with open('binned_output.csv', 'w') as fid:
	fid.write('bin,D47,D47_SE,D47_95CL')
	for k in range(len(mybins)):
		SE = D47_CM[k,k]**0.5  # standard error = square root of the diagonal element
		fid.write(f'\n{mybins[k]},{D47_avg[k]},{SE},{mydata.t95 * SE}')

> I also noticed that the averages are different from the ones that I get in my R script, but this may be because of the ETF pooling that you do, or the offset correction that we do. Will have to look into that part later.

Also, remember that binned samples are weighted by number of replicate analyses, which is a reasonable default behavior but is not always optimal. If you want to use different weights, look at the code; it should be simple to add an option to weigh samples equally. If this is something you think is necessary, feel free to let me know / open an issue / submit a PR.

> Any help would be appreciated! :)

Hope this helps. I'm just happy someone is using this library apart from myself!


mdaeron commented on July 20, 2024

In that case a quick and dirty improvement would be to let users arbitrarily redefine the figure dimensions. Could you play with 6e71f67 and let me know if it helps?


japhir commented on July 20, 2024

That helps! And works.

But I think I misdiagnosed the problem: I was thinking of "Samples" as replicate measurements of the same exact sample (e.g. IODP Leg 208 Site 1264 Hole A 28H2 96 cm–97.5 cm), but for clumped we need to calculate averages over longer time scales to get good reproducibility, so a "sample" is now something else: a period of time over which you assume the material to be homogeneous enough to chuck it all together.

Obviously we're not at a stage yet where we need 100s of rows in this figure, because that would be a crazy well replicated study!

I guess I should have re-read your paper prior to messing around with the code and complaining that it wasn't working ;-)


mdaeron commented on July 20, 2024

FTR:

  • "sample" = an amount of presumably homogeneous carbonate material. Each sample should be uniquely identified by a sample name (field Sample in the csv file).
  • "analysis" or "replicate" = a single acid reaction followed by purification of the evolved CO2 and by a series of dual-inlet IRMS measurements. Each analysis is identified by a unique identifier (field UID in the csv file; if it's missing, a default series of UIDs will be generated).
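To make the relationship between the two concrete, here is a minimal illustration of how these fields might relate in the raw-data csv (the names are hypothetical placeholders, and the measured isotope-ratio columns are omitted): three analyses of one sample, and one analysis of another.

UID,Sample
A01,MYSAMPLE-1
A02,MYSAMPLE-1
A03,MYSAMPLE-1
A04,MYSAMPLE-2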


japhir commented on July 20, 2024

Thanks, just to clarify further:
If I specify UID to mean one acid reaction and Sample to mean the things that I want to calculate averages for, is there a way to account for other "block" effects, in this case a unique label for material that was sieved/picked/cleaned separately but may have multiple analyses? Or, say, two benthic species picked from the same sample that we want to average together but that might display some block effects, especially on d13C and d18O?


mdaeron commented on July 20, 2024

Just in case, beware that by defining a Sample (in D47crunch terminology), you postulate that it is isotopically homogeneous, and this assumption will play into your estimates of analytical repeatability. So you should not assign identical sample names to materials that you expect to have different isotopic compositions just for the sake of estimating their average Δ47 value.

Instead, even if you treat materials of different isotopic compositions as different samples, you may still average their Δ47 values after the standardization step, using D4xdata.sample_average():

# compute the unweighted average of Δ47(sample1), Δ47(sample2), and Δ47(sample3)
# and the corresponding standard error:
mydata.sample_average(samples = ['sample1', 'sample2', 'sample3'])

# compute the weighted average of Δ47(sample1), Δ47(sample2), and Δ47(sample3)
# with respective weights of 1, 2, and 3, and the corresponding SE:
mydata.sample_average(
    samples = ['sample1', 'sample2', 'sample3'],
    weights = [1, 2, 3])

# compute the difference between Δ47(sample2) and Δ47(sample1)
# and the corresponding SE:
mydata.sample_average(
    samples = ['sample1', 'sample2'],
    weights = [-1, 1])

Another option is to use D4xdata.combine_samples(), which is specifically designed to compute average Δ47 values for bins of samples. This function also returns the full covariance matrix for these binned Δ47 values. For example, if samples foram_10, foram_11, and foram_12 are all Holocene and samples foram_18 and foram_19 correspond to the LGM, calling combine_samples(dict(Holocene = ['foram_10', 'foram_11', 'foram_12'], LGM = ['foram_18', 'foram_19'])) will return the average Δ47 value for the Holocene group of samples, the average value for the LGM group, and the 2x2 covariance matrix of these two averages. This is how I would personally approach your benthic species example.
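For what it's worth, here is a sketch of how that call plugs into the 95% confidence-limit recipe from earlier in this thread (mydata is your D47data instance as before, and the return values are the group names, the binned Δ47 averages, and their covariance matrix):

groups, D47_avg, D47_CM = mydata.combine_samples(dict(
    Holocene = ['foram_10', 'foram_11', 'foram_12'],
    LGM = ['foram_18', 'foram_19'],
    ))

for k, group in enumerate(groups):
    SE = D47_CM[k,k]**0.5  # standard error of the binned average
    print(f'{group}: D47 = {D47_avg[k]:.4f} +/- {mydata.t95 * SE:.4f} (95% CL)')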


japhir commented on July 20, 2024

Hi again Mathieu!

After all this time I'm taking another look at what you proposed here. I've now managed to calculate clumped values for each 2-cm sample separately. To combine them, I first had to create a massive dict that contains all 1335 samples. I first tried to read in a csv that links the bin to a sample name, but struggled too much and in the end made a massive dict definition just using some search/replace on a csv export.

Now I managed to use combine_samples() with this dict, and it returns, as it says on the tin:

  • the list of group names
  • an array of the corresponding Δ4x values
  • the corresponding (co)variance matrix

but now I'm not sure how to combine these again so that I get a table of:

| bin id | D47 average | D47 95% CI |

Basically I don't know how to calculate the 95% CI from an NxN covariance matrix and how to export the stuff to a csv.

I also noticed that the averages are different from the ones that I get in my R script, but this may be because of the ETF pooling that you do, or the offset correction that we do. Will have to look into that part later.

Any help would be appreciated! :)

