Issues from the forc-db/forc repository

Reviewer comment on biomass variables

From a reviewer:
Biomass terms are used to label perhaps a majority of the records, but the labels used are highly ambiguous and need clarification. A significant fraction of the data records are labelled as “Biomass_ag” – what was measured? Just woody stems? Also understory? Grasses/mosses? “Biomass_total” – this is used for many records, but I doubt all components of aboveground and belowground biomass were actually assessed (particularly belowground, to max. soil depth). Again, while some additional info on this is given for a subset of the records in the supplementary METHODOLOGY table, I think instead the term in the MEASUREMENTS table used to describe the datum needs to be specific.

load/modify script to make histogram of dominant vegetation types

Please load the code used to make this figure:

Please make the following modifications:

Rearrange some of the categories:
-change grass or shrubs to just grass
-new category: 2TM, which includes 2TD and 2TE.
-woody other/unclassified- for dominant lifeform=woody and dominant veg = 2TREE, NAC, or anything else

Remove "dominant lifeform" label

Change labels as follows: 2TEN- evergreen needleleaf; 2TDB- broadleaf deciduous; 2TEB-broadleaf evergreen; 2TDN- needleleaf deciduous; 2TM- broadleaf- needleleaf mix
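
The recategorization requested above could be sketched roughly as follows. Python is used here only for illustration, since the actual figure script is not shown in this issue. The function name `plot_category` and its two arguments are hypothetical stand-ins for the dominant vegetation and dominant life form fields. The grass/shrub merge is left out because the codes involved are not listed here.

```python
# Labels requested in the issue, keyed by dominant vegetation code.
PFT_LABELS = {
    "2TEN": "evergreen needleleaf",
    "2TDB": "broadleaf deciduous",
    "2TEB": "broadleaf evergreen",
    "2TDN": "needleleaf deciduous",
    "2TM": "broadleaf-needleleaf mix",
}

# Codes folded into the new 2TM category.
MIXED_CODES = {"2TD", "2TE"}


def plot_category(dominant_veg, dominant_life_form):
    """Return the histogram category for one record (hypothetical helper)."""
    code = "2TM" if dominant_veg in MIXED_CODES else dominant_veg
    if code in PFT_LABELS:
        return PFT_LABELS[code]
    if dominant_life_form == "woody":
        # 2TREE, NAC, or anything else that is woody but not listed above.
        return "woody other/unclassified"
    # Non-woody categories pass through unchanged.
    return dominant_veg
```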

Add Regentype and Managementtype codes

@ValentineHerr, ForC does not currently contain Regentype and Managementtype (Measurements file) for extratropical sites. I have found a file where Maria had added these codes. Could you please merge these into the master? (I'm assuming this will be easy. If not, please alert me.)

Categorical variable check

@teixeirak ,

I am finally getting to that part of the checklist, but it is bigger than I thought. Not sure I can get it done by the end of the day...

I think the first step would be to fill in Variable.codes in the metadata tables with the list of all levels of categories used in the tables. Then I can pull that and make sure that the category levels in the (future) tables already exist.

But that means we might want to separate the variable codes from their descriptions. I am thinking of the metadata for HISTTYPE. We would have "Establishment", "Regrowth(_prior)", "Disturbance(_prior)", "Management", "No.disturbance", "No.info" in Variable.codes, and what they mean ("Disturbance - includes natural and anthropogenic disturbances that directly kill ...") in another field called something like Variable.codes.description.
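
A minimal sketch of that check, assuming the codes have been pulled out of the metadata into a mapping from code to description. The names `undefined_levels` and `MISSING_CODES` are hypothetical, the set of missing-data codes is an assumption, and the description strings are placeholders rather than the real metadata wording.

```python
MISSING_CODES = {"NA", "NAC", "NI"}  # assumed set of missing-data codes


def undefined_levels(values, variable_codes, missing_codes=MISSING_CODES):
    """Return the sorted levels that are neither defined codes nor missing codes."""
    allowed = set(variable_codes) | set(missing_codes)
    return sorted({v for v in values if v not in allowed})


# Keys play the role of Variable.codes, values the role of
# Variable.codes.description (placeholder text only).
histtype = {
    "Establishment": "(description text)",
    "Management": "(description text)",
    "No.info": "(description text)",
}

# "Burn" is a made-up level used only to show a failing value.
print(undefined_levels(["Management", "Burn", "NAC"], histtype))
```

Keeping the descriptions as values means the check only ever looks at the keys, so description text can be edited freely without affecting validation.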

Let me know what you think.

Update entity relationship diagram.

The entity relationship diagram is currently out of date and needs to be updated both for this site and our publication. An editable version is here. @ValentineHerr, it would be great if you could update this.

Mismatches between PLOTS and HISTORY

There are 0 plots with no corresponding history record
There are 0 history records with no corresponding plot record

👏

ensure that ORNL DAAC forest NPP data is in ForC

The ORNL DAAC has NPP estimates for boreal, temperate, and tropical forests.

At least some of these are already in ForC, and the files for tropical forests are here in our private repo. According to Maria's records, she has entered the data included in "(Clark et al., 2001, 2013) and https://daac.ornl.gov/" (see this doc); however, a spot check revealed that NPP data from Marafunga, Papua New Guinea is not included. Moreover, I do not believe we've ever downloaded temperate and boreal NPP data. It may be that she only loaded the subset in the Clark compilations.

Make the database into an R package?

There are some significant advantages to taking a collection of data and code, as here, and making it into an R package.

Easy for users. R users are very accustomed to doing library(xxx) as part of their analysis scripts, and packaging the ForC data in this way makes it super easy for people to use.
Reproducible science. In line with the above point, users can easily cite "package xxxx version yy".
Automated tests. To your point in #23, packages come with a variety of built-in (via R) and user-created automated tests. So when someone opens a PR, for example, the check script can run and verify whether their data are 'good' or not.
Better documentation. You can make data documentation in the R help system, and include vignettes on how to use the data.

Anyway, something to consider!

Database cleanup

Standardize variable names
Standardize field names (see #51 )
Break out references list
Link sites and plots by unique ID numbers rather than full name
headers in metadata need some work (not column names of tables, just headers of metadata)

Fix references

from a reviewer:

A previous data compilation that is listed as the source for nearly 10% of the records (Anderson et al. 2006) I could not find when I searched the Bibliography. This could be fixed simply by inserting the reference, or, more thoroughly, by breaking out the reference list (see issue #52).
Also, the Reference list given for “Measurements” in the Bibliography is in reverse alphabetical order. Simple fix.

SITES metadata missing loaded_from and loaded_by field description

@teixeirak,

There is no description of the loaded_from and loaded_by fields in the SITES table.

transfer vegetation notes from sites to measurements table

The notes field in the SITES table contains some notes describing the vegetation, and these are more appropriate for the vegnotes field in the MEASUREMENTS table.

update or remove Regentype field

Regentype (measurements table; populated based on plothistory data using @mmw590's R code) has not been populated for extratropical sites. Categories should be redefined to be more appropriate for extratropical sites.

update bibliography

Bibliography is currently out of date (includes only records from TropForC-db v. 1.0).

load script to make histogram of elevations

Please load the script used to make this figure

Minor issues

@teixeirak
I will use this issue to ask you a few things that I need your input with

Design a system to facilitate data entry

Data entry is more cumbersome than it needs to be, and understanding the system involves a learning curve.

One idea to improve this is to create a single data entry spreadsheet and R code that can draw from that to update the database. The spreadsheet could include field descriptions and instructions for data entry. It could also allow for multiple measurement records or history events to be recorded quickly without requiring entry of duplicate information.

update metadata

Metadata is currently out of date.

Regarding problematic GPP estimates

Reviewer comment:
annual GPP – many included records are for GPP (total forest photosynthesis), but this flux cannot be directly measured in the field (see Wehr et al. 2016 Nature; Clark et al. 2017 Biogeosciences). Estimating this flux is problematic. It is approached either by model simulations, by summing a subset of the field biometric components, or by complex inferences made from eddy-covariance estimates of forest net CO2 exchange. The explanations of methods for a subset of these GPP records that are in the Methodology table are useful in those cases, but for many (most?) of the GPP records, there is no information given on the underlying methods. It is clear, however, from the limited information, that many of these values are based purely on modelling, with no direct measurements made in the field.

Response:
Although estimation of GPP is problematic, we feel strongly that it is an important variable that should be included in this database (and, of course, it is not the only variable with problematic methods). However, ForC should not include modeled values, and we removed all records based on modeling (see issue #59).

check BDFFP plotarea

For BDFFP sites, plotarea appears to be fragment size. Depending on sampling design, this may or may not be appropriate. Specifically, if the entire area of each fragment was thoroughly sampled or systematically subsampled, it is appropriate that plotarea=fragment size. However, if a single plot of defined size was located somewhere within the fragment, that is the plot size that should be recorded.

Organize references

In order to improve database structure and reconstruct the record of which values have been checked against original publications, we want to:

Record which pdfs we have (in order to know which values have been checked):

move folders containing references and intermediary data sheets to ForC_private
copy all references to the references repository
generate list of references for which we have pdfs
mark the checked.ori.pub field (Measurements table) with '1' code for all references for which we have the pdf
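
The last step of this list might look like the sketch below. It assumes each measurement record is a dict with a `citation` key identifying its reference; that key name and the helper `mark_checked` are hypothetical, and only `checked.ori.pub` comes from the issue text.

```python
def mark_checked(measurements, citations_with_pdf):
    """Set checked.ori.pub to '1' for records whose reference has a pdf on file."""
    have_pdf = set(citations_with_pdf)
    updated = []
    for rec in measurements:
        new = dict(rec)  # copy so the input rows are left untouched
        if new["citation"] in have_pdf:
            new["checked.ori.pub"] = "1"
        updated.append(new)
    return updated
```

Records whose reference is not in the pdf list keep whatever value they already had.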

Break out references records, creating a separate data table for references

Create/ reinstate fields for management status and disturbance/regeneration category

Previous versions of this database contained "managementtype" and "regentype" fields, which broadly categorized forests based on their management status and disturbance/regeneration category (e.g., natural regeneration following agriculture, plantation).

These were deleted because they appeared to have become corrupted and because of lack of transparency/reproducibility in how they were calculated (produced by script, fixed by hand). For details, see issue #21.

It would be nice to reinstate or recreate these. The best option is to write new script to produce these based on the PLOTS and/or HISTORY files and plot names. Plot names often contain relevant information.

finish/load script to make histogram of stand ages

Please load the script used to make this figure.

Then, update as follows:

add legend (tropical, extratropical)
try grouping by 5-year age bins
add two bars at the left end: one for stands of known age >200 years, another for old-growth/undisturbed stands (coded 999). This will likely work best with a broken y-axis.
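
One possible way to assign records to the requested bars, sketched in Python for illustration. The helper `age_bin` and its parameter names are hypothetical. Bins are treated as half-open intervals, so the label "10-15" means ages from 10 up to but not including 15, and non-numeric ages are assumed to have been filtered out beforehand.

```python
from collections import Counter


def age_bin(stand_age, width=5, max_age=200, old_growth_code=999):
    """Return the histogram bar label for one numeric stand age."""
    if stand_age == old_growth_code:
        return "old-growth/undisturbed"
    if stand_age > max_age:
        return f">{max_age}"
    lower = int(stand_age // width) * width
    # Half-open bin [lower, lower + width).
    return f"{lower}-{lower + width}"


def bar_counts(stand_ages):
    """Count records per bar label."""
    return Counter(age_bin(age) for age in stand_ages)
```

Checking for the 999 code before the >200 test matters, since 999 would otherwise be counted as a stand of known age above 200.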

Distinguish between direct measurements and modeled values

From a reviewer:
Another important need is for each datum to be accompanied by a flag that indicates whether the value is (a) based on direct forest-based observations (e.g., NEE values from eddy covariance, fine-litterfall measurements), or is instead (b) partially or totally based on modelling or other non-direct methods (as is the case for a large number of the records here, per the METHODOLOGY table; examples there are “BGC-model” and “GPP-Re”).

Response:
In response to this comment, we seriously considered adding a field to indicate the measurement type, but decided against it. The reason is that it would be difficult to classify methods into discrete categories, as method types are really more of a spectrum. The two given examples of direct forest-based observations (NEE, fine litterfall) both generally require some type of gap filling or interpolation to produce an annual estimate. On the other end of the spectrum, there’s a wide variety of methods combining scaling/ gap-filling/ modeling techniques to extrapolate from empirically measured values to annual ecosystem carbon budgets. Rather than attempt to classify into discrete categories, we feel that it is most appropriate to continue presenting this information in the methodology table. It is our philosophy that responsible use of the database requires some familiarity with carbon cycle measurement methods, and when specific methodology is important, users can refer to the methodology table and original publications, if necessary. To ensure that the database documentation addresses the issue of variation in the directness of measurement vs relying on some type of modeling, we have inserted text in our discussion of major factors affecting data quality along with recommendations to users as to how to appropriately screen the data ().

We do agree that the database should not contain purely modeled values, and our general philosophy while compiling the database has been to not include modeled values. However, the Luyssaert database, which was incorporated to ForC, did include some modeled values, and these methods were recorded in the METHODOLOGY table. We therefore reviewed all of the records for which entries in the METHODOLOGY table indicated that they were modeled. Modeled values, or values whose methodology could not easily be traced (i.e., would require tracing back through multiple publications), were removed from the database. Those that were scaled or extrapolated from site-specific measurements were retained, and the methodology was clarified (METHODOLOGY table).

Checking records from intermediary databases against original publications.

Issue raised by a reviewer:
"The citation information given for the data records (MEASUREMENTS table) indicates that most of the data presented here were obtained by the data-set authors from secondary compilations, not from the original publication of each study. The Disclaimers section of the Metadata (lines 383-388) indicates that the many derived (secondarily-cited) data were not checked against the original source publications. The authors actually warn potential users here that there are likely to be errors and to please help in the quality control by alerting these authors of such errors.
Users of such a data compilation should be able to rely on the accuracy of the data presented. That is clearly not the current case. All the secondarily-cited data in this data set need to be checked by the data-set authors against the original source publications. Such work is tedious and demanding, and in some cases (e.g., Chinese productivity literature), language issues will make it highly challenging. I would argue, however, that no record should be included in a data compilation such as this one, unless the data-set authors have checked the numbers and associated information against the corresponding original publications. "

Response:
**We appreciate the concern for data quality; however, we respectfully disagree with the philosophy that all of these records must be checked against original data publications to be of value to the scientific community. ....

We recognize the highest data quality standards require double checking reported values against original publications. We previously did not have a field for recording when values had been checked against original publications. To provide this structure going forward, we have added the field checked.ori.pub to the Measurements table. Unfortunately, this field does not record records that were double checked prior to February 2018 (with the exception of a few whose verification had been noted in the notes field).

Regarding associated information, we have made a point of documenting the reason for incomplete data in the database, which is accomplished through the use of different missing data codes. The “NAC” code indicates data for which we have made no effort to obtain the information from the original publication. Thus, researchers who would benefit from certain information can easily identify records that should be checked in the original publication.**

load script to make map of ForC sites

Please load the code to make this figure:

It would also be good to have a simplified version with minimized height that leaves out the biogeographic zones and chart (for a grant proposal).

update R code to work with new database structure

Database structure has been modified a bit; @mmw590's R code may need to be modified to run correctly.

Review/ fix stand.age issue

from a reviewer: There are many oddities among the records included with this age designation, including monospecific plantations (e.g., 15 records from La Selva) and many logged S. American forests (diverse sites).

check for additional TropForC updates since original publication

Updates to TropForC measurements, sites, and plothistory files made after the original publication have now been posted in this public repository. However, there may be additional updates to supporting files (e.g., allometries) that have not yet been posted.

finish/load script to plot climate of ForC sites

Please load the script used to create this figure:

Modify as follows:

add a histogram of Köppen-Geiger climate zones represented in the database (Koeppen field in SITES). Use horizontal bars along the x axis to group climate zones into broader categories: A-Tropical; B-Arid; C-Temperate; D-Cold (continental); E- Polar.
annotate plots (a) and (b).
@bpbond suggests adding global climate space on (a); he may have a suggestion as to one that would work better than Whittaker (which we tried and looked bad).
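
The grouping of Köppen-Geiger zones into the broader categories listed above can be derived from the first letter of each zone code. This is a rough sketch under that assumption; `koeppen_group` and `zone_counts` are hypothetical helper names, and anything that does not start with A to E (for example a missing-data code) is reported as "unclassified".

```python
from collections import Counter

# Broad categories as listed in the issue, keyed by first letter of the zone.
KOEPPEN_GROUPS = {
    "A": "A-Tropical",
    "B": "B-Arid",
    "C": "C-Temperate",
    "D": "D-Cold (continental)",
    "E": "E-Polar",
}


def koeppen_group(zone):
    """Map a zone code such as 'Af' to its broad category."""
    return KOEPPEN_GROUPS.get(zone[:1].upper(), "unclassified")


def zone_counts(zones):
    """Count sites per zone, sorted so zones of one group sit side by side."""
    return sorted(Counter(zones).items())
```

Because the zones are sorted alphabetically, all zones sharing a first letter are contiguous on the x axis, which is what the horizontal grouping bars need.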

Check for and correct any records where different-aged stands are identified by a single plot

@ValentineHerr,

There may be some records where stands of different age have the same site and plot name. As every plot should be differentiated based on the site and plot names only, this would be incorrect.
Note, however, that when the same plot is repeatedly measured, site and plot name should stay the same while stand age changes. *Problematic records would be those with the same site name, plot name, and measurement date but different ages.* There may be some where the measurement date is unknown, in which case this can't be assessed. Let's assess whether the problem exists.

If we do find cases where this occurs, could you please redo all of those plot names by adding '_age##' to the end of the current name?

For example (note that this example is a case that might not be wrong--measurement date not acquired):

Site	Plot	Age
Honghuaerji	Pinus sylvestris var. mongolica	81
Honghuaerji	Pinus sylvestris var. mongolica	86

becomes:

Site	Plot	Age
Honghuaerji	Pinus sylvestris var. mongolica_age81	81
Honghuaerji	Pinus sylvestris var. mongolica_age86	86
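
A sketch of how the check and the renaming could be done, assuming each measurement record is a dict with `site`, `plot`, `date`, and `age` keys. Those key names, the helper names, and the set of missing-data codes are all assumptions for illustration. Records with an unknown date are skipped, since they cannot be assessed.

```python
from collections import defaultdict

MISSING_CODES = {"NA", "NAC", "NI"}  # assumed missing-data codes


def find_age_conflicts(records):
    """Return (site, plot, date) keys that carry more than one stand age."""
    ages = defaultdict(set)
    for rec in records:
        if rec["date"] in MISSING_CODES:
            continue  # cannot be assessed without a measurement date
        ages[(rec["site"], rec["plot"], rec["date"])].add(rec["age"])
    return {key for key, seen in ages.items() if len(seen) > 1}


def add_age_suffix(records, conflicts):
    """Return copies of the records with '_age##' appended where needed."""
    renamed = []
    for rec in records:
        new = dict(rec)
        if (rec["site"], rec["plot"], rec["date"]) in conflicts:
            new["plot"] = f"{rec['plot']}_age{rec['age']}"
        renamed.append(new)
    return renamed
```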

We'll need to add any new plots to ForC_history (and from there ForC_plots). If you confirm that there are some plots with this problem, please generate records for ForC_history as follows:

historyID	sites.sitename	plot.name	plotarea	event.sequence	date	dateloc	distcat	disttype	level	units	percent.mortality	distnotes	plothistoryID.v1	tropical_extratropical
NA	Honghuaerji	Pinus sylvestris var. mongolica_age81	NAC	1	NAC	9	No.Info	NAC	NA	NA	NA	NA	NA	extratropical

I will then compare with existing records and either merge by hand or provide instructions for merging via script.

ensure that tropical ANPP data from Taylor et al. 2017 is included

Taylor et al. 2017 compiled tropical ANPP estimates, and data are published here. This effort was independent of ours. There's probably a fair amount of overlap, but it looks like they have some records we don't.

standardize and polish column names

Standardize and polish column names across tables, metadata, scripts and paper.

I believe it can really improve the opinion of future users of the database who could otherwise think that the lack of attention to details in the database is a bad sign about its quality.

Maybe this can be done after submission of paper but before it gets published?

write script to make a histogram of number of records by measurement date

This will be Fig. 1 in the data paper.

Here's what we want:

Histogram showing years in which data were collected or originally published. For records lacking information on the year in which measurements were made, we substitute year of the data’s original publication. If this is unknown (## records), the record is excluded from this histogram.

Shade black when we have the year of measurement, grey for year of publication.

Please include records with start/end date (as opposed to just date)—use end date.
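
The year selection described above could be sketched like this. The field names `date`, `end.date`, and `citation.year` are assumptions, as is the idea that each holds a year or decimal year (or a missing-data code); `histogram_year` is a hypothetical helper.

```python
MISSING_CODES = {None, "", "NA", "NAC", "NI"}  # assumed missing-data codes

# Fields tried in order, with the shade used when that field supplies the year.
YEAR_SOURCES = (
    ("date", "black"),           # year of measurement
    ("end.date", "black"),       # records with start/end dates: use end date
    ("citation.year", "grey"),   # fall back to year of original publication
)


def histogram_year(rec):
    """Return (year, shade) for one record, or None if no year is known."""
    for field, shade in YEAR_SOURCES:
        value = rec.get(field)
        if value not in MISSING_CODES:
            return int(float(value)), shade
    return None  # record is excluded from the histogram
```

Counting the records for which this returns None gives the number to report in the caption.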

Create script to do some basic database checking

It would be good to have a script here (public database) that can be run whenever there is a substantial update to the database to check that the structure of the database remains correct and that there are no egregious errors in values.

Here's the start of a list of things to check:

For each site-plot combination in MEASUREMENTS, there is a corresponding site-plot record in PLOTS, and there are no records in PLOTS that lack corresponding records in MEASUREMENTS.
For each site in PLOTS, there is a corresponding record in SITES, and there are no sites in SITES that lack records in PLOTS.
For each site-plot combination in PLOTS, there is at least one corresponding record in HISTORY, and there are no records in HISTORY that lack corresponding records in PLOTS (the latter can be identified based on whether the site-plot combination and the historyID show up in PLOTS. see metadata to understand how historyID works in PLOTS).
All other fields that link across tables are represented once in the table where they are defined and 1+ times in the table where they are used.

Variable	Table defined	Field(s) used
PFTcode	PFT	MEASUREMENTS:dominantveg
distcat	DISTTYPE	HISTORY:distcat
disttype	DISTTYPE	HISTORY:distcat, PLOTS(various)

(THIS TABLE NEEDS TO BE COMPLETED BASED ON THE ENTITY RELATIONSHIP DIAGRAM, which needs to be updated.)

Within each field, all entries are either data of the type specified for that field (metadata tables) or a missing data code.
All new records in MEASUREMENTS:mean fall within the range reported for that variable in the VARIABLES table. (It is possible that valid new records will fall outside the range, but script should generate a warning: "Measurement falls outside the range of values that currently exist for this variable in the database. Check value, and if valid update range for this variable in VARIABLES.")
For all other numerical fields, all records fall within the range specified in the metadata tables. Any values that fall outside the current range are flagged for screening.
For all categorical variables, all records match one of the defined variable codes (metadata tables).
~~Check for records where stands of different age are represented by a single plot (see #22)~~
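
Two of the checks in this list, sketched in Python for illustration: the two-way key comparison between tables and the range warning. The helper names are hypothetical; the warning text is the one quoted in the list. The `ignore` argument is an assumption about how missing-data codes might be excluded from the comparison.

```python
RANGE_WARNING = (
    "Measurement falls outside the range of values that currently exist "
    "for this variable in the database. Check value, and if valid update "
    "range for this variable in VARIABLES."
)


def key_mismatches(used_keys, defined_keys, ignore=()):
    """Return (keys used but not defined, keys defined but never used)."""
    used = set(used_keys) - set(ignore)
    defined = set(defined_keys)
    return sorted(used - defined), sorted(defined - used)


def check_range(value, lower, upper):
    """Return None if the value is within [lower, upper], else the warning."""
    if lower <= value <= upper:
        return None
    return RANGE_WARNING
```

For the site-plot checks, the keys would be (site, plot) tuples taken from each table, for example `key_mismatches(measurement_pairs, plot_pairs)`.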

Mismatches between MEASUREMENTS and PFT

There are 16 measurement records with undefined PFTs
   measurementID dominantveg
           <int>       <chr>
1           1015         NAC
2           1016         NAC
3           1017         NAC
4           1677         NAC
5           2499         NAC
6           2503         NAC
7           2506         NAC
8           5524         NAC
9           5525         NAC
10          7821         NAC
11          7822         NAC
12          7823         NAC
13          7922         NAC
14          7923         NAC
15          8083         NAC
16          2654         NAC

Similar to #31 , should NAC and NI entries in MEASUREMENTS be ignored here?

write script to create analysis-ready table ("ForC-simplified")

As a first step for most data analyses and to facilitate use of the database, I'd like to create a script that does some preliminary data manipulation and outputs a single spreadsheet containing the fields that will most commonly be used in analysis.

Script should do the following:

convert all measurements to units of C (use IPCC default C=0.47*biomass).
resolve duplicate records (following guidelines of issue #68)
potentially, combine variables to calculate others according to a set of equations governing relationships among variables (issue #67).

Fields to be contained in this table will be defined through issue #69.
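
The unit conversion step is a single multiplication by the carbon fraction given above. A minimal sketch, assuming the caller already knows whether a record is expressed in biomass units (how that is determined from the database is not specified here, and `to_carbon` is a hypothetical name):

```python
CARBON_FRACTION = 0.47  # IPCC default cited in the issue: C = 0.47 * biomass


def to_carbon(value, in_biomass_units):
    """Convert a biomass value to carbon; values already in C pass through."""
    return value * CARBON_FRACTION if in_biomass_units else value
```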

Variables with no range because no data

@teixeirak would you like to fill in the range for the variables that are not used in MEASUREMENTS but that might be added in the future ? This would be so that they are not flagged when someone enters new records.

If you want to do that, you can do it here and I can add it myself in the VARIABLES file.

Here is the list of variables I am talking about:

NPP_4
NPP_5
NPP_understory
NPP_woody
ANPP_litterfall_2_C
ANPP_litterfall_3_C
trees(0)_or_stems(1)

date formatting

@teixeirak
weird date formatting in plothistory, perhaps due to autoformatting in Excel?
needs to be unambiguous, ideally in YYYY-MM-DD format.
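
A small sketch of how dates could be validated and converted. Because a string like 3/4/2005 is ambiguous on its own, the conversion below requires the source format to be stated explicitly rather than guessed. The helper names are hypothetical, and the formats present in plothistory are not known from this issue.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_iso_date(text):
    """True if text is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.match(text):
        return False
    try:
        datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a valid date (e.g. month 13)
    return True


def to_iso_date(text, source_format):
    """Convert text in a known strptime format to YYYY-MM-DD."""
    return datetime.strptime(text, source_format).strftime("%Y-%m-%d")
```

For example, `to_iso_date("3/4/2005", "%m/%d/%Y")` treats the value as month/day/year; passing `"%d/%m/%Y"` instead would read it as day/month/year.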

Mismatches between MEASUREMENTS and METHODOLOGY

There are 6474 measurement records with no corresponding methodology record

This is because three values in MEASUREMENTS$method_id don't appear in METHODOLOGY: NI, NAC, and 436.

I assume the first two are a string matching issue--how should this be handled? Do we ignore MEASUREMENT entries with method_id equal to NI or NAC?

The last one, 436, might be a genuine problem?
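
One way to separate the two cases is to tally unmatched method_id values into missing-data codes and everything else. This is only a sketch: the set of missing-data codes is an assumption, `unmatched_method_ids` is a hypothetical helper, and IDs are compared as strings to sidestep the string-versus-number matching issue.

```python
from collections import Counter

MISSING_CODES = {"NA", "NAC", "NI"}  # assumed missing-data codes


def unmatched_method_ids(measurement_ids, methodology_ids):
    """Tally unmatched IDs as (missing-code counts, genuinely unknown counts)."""
    defined = {str(m) for m in methodology_ids}
    missing, unknown = Counter(), Counter()
    for method_id in map(str, measurement_ids):
        if method_id in defined:
            continue
        if method_id in MISSING_CODES:
            missing[method_id] += 1
        else:
            unknown[method_id] += 1
    return missing, unknown
```

Anything left in the second counter, such as 436 here, would be a candidate for a genuine missing METHODOLOGY record.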

Variable cleanup

@ValentineHerr, I modified variable names and ordering according to some conventions that developed (see docs here). I'd like your help updating the master branch with these names.

First, please review the variable name conventions for compliance with database best practices. Let me know if there's anything you think we should change.
Update ForC_variables: I made changes in several fields, so please use this file as the new master, but first delete the field with old variable names. You'll also need to update variable names throughout the document, specifically in the equations, associate covariates, and notes fields.
Update ForC_measurements: update both variable names and associated covariate names
Update metadata and entity relationship diagram: changes include new field in ForC_variables (id number) and changes to categorical variables in variable.type ("primary" is replaced with "stock" and "flux")

issues in plothistory and measurements (plot.name, duplicate entries)

plot.names not matching between measurements and plothistory:

changed spelling: 'Terra' changed to 'terra'. 'firme' changed to 'firma'
plot.names in measurements manually changed to follow what was in plothistory.

duplicate entries in plothistory with no extra information (all NACs).

these entries were manually removed and attached here
TropForC_plothistory_duplicatesremoved.xlsx

Aragao 2009 records

@mmw590, In the measurements file, there are differences between Aragao 2009 records in this master version of TropForC and the original spreadsheet (2ForC, in the private repository). I'm not sure which is the correct version. Could you please take a look?

Re-run database consistency checks, metadata tables, figures, and stats in text

@ValentineHerr,

Quite a few records have been removed from the database (issue #59). We'll need to re-run:

the database checks (some plots and sites may go)
the numbers in the metadata tables
figures
statistics for the text.

Could you please work through as much of this as possible before your time off, in that order?

Update metadata and statistics in paper following some cleanup of MEASUREMENTS

@ValentineHerr, I just reviewed the ranges of variables and had to make some updates to measurements that will affect our statistics. Here's the commit: 1f5e1bd. Specifically, we will need to update measurements metadata, numbers of records in the variables table (min and max should be okay), and some of the statistics in the paper (only those pertaining to measurements file). Could you please rerun these?

FYI, the ranges for all other fields were fine.

Filling FAO ecozone and KOEPPEN for sites where it is missing

@teixeirak ,

1- All FAO ecozones to fill are for Klamath sites, right? (Temperate Mountain system)
2- There are 6 sites where I get a different FAO than the one already in the data base. I am guessing that was edited by hand at some point but I want to make sure.

Here is the list (sites.sitename: WAS --> I FOUND)

Vancouver Island unmanaged plantatio: Boreal coniferous forest -->Temperate oceanic forest
Barro Colorado Island chronosequence: Tropical moist forest --> Water
Canada - Western Taiga Shield: Boreal coniferous forest --> Water
Santa Rita Mesquite savanna: Subtropical humid forest --> Subtropical desert
LaBiche River CA-WP1: Subtropical desert --> Boreal coniferous forest
Barro Colorado Island CTFS-ForestGEO plot: Tropical moist forest --> Water

calculate dates of forest regeneration in plot history

@mmw590, I added entries in the plot history for all the previously missing sites/plots. There are quite a few that have age data in the measurements file. If you have code to calculate their regeneration dates based on age and date of measurement, it would be great to add those dates to the measurements file. See notes therein.
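
The calculation itself is a subtraction of stand age from the year of measurement. A minimal sketch, assuming both are numeric and that the 999 code (used for old-growth/undisturbed stands) should not produce a date; `regeneration_year` is a hypothetical name, not the existing code.

```python
OLD_GROWTH_CODE = 999  # stand age code for old-growth/undisturbed stands


def regeneration_year(measurement_year, stand_age):
    """Return the year regeneration began, or None for old-growth stands."""
    if stand_age == OLD_GROWTH_CODE:
        return None
    return measurement_year - stand_age
```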

Allometric equation #15 missing

@teixeirak ,
Allometric equation #15 is missing in ALLOMETRY but referred to in MEASUREMENTS (allometry column).

Mismatches between site-plot combinations in MEASUREMENTS and PLOTS

First, there are 30 sites.sitename values that appear in MEASUREMENTS that don't appear in PLOTS. These are all sites with apostrophes in the name, which sounds like it is due to a known bug in MATLAB or the script you're using.

Second, there are 58 (but only 5 unique) plot names in MEASUREMENTS that don't appear in PLOTS.

measurementID	plot.name
2654	Eucalypt open-forest savanna
6488	birch stand_age55
6489	birch stand_age55
6032	EMS Tower (HFR1) _age92
6045	EMS Tower (HFR1) _age92
1859	Mature lowland 'wet' perfmafrost-free spruce forest
(lots more)	Mature lowland 'dry' perfmafrost-free spruce forest
136	mature 'old growth' forest
(lots more)	mature 'old growth' forest

(Note I haven't run any trim whitespace or capitalization fixes on these data, so could be something extremely trivial that's causing the match failure.)
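
To see how many of these are trivial, names could be compared again after trimming whitespace and ignoring case. A rough sketch with hypothetical helper names; it only catches differences in surrounding or repeated whitespace and in capitalization, not spelling changes or a stray space inside a name.

```python
def normalize_name(name):
    """Trim, collapse runs of whitespace, and ignore case."""
    return " ".join(name.split()).casefold()


def trivial_mismatches(measurement_names, plot_names):
    """Names absent from PLOTS verbatim but present after normalization."""
    exact = set(plot_names)
    normalized = {normalize_name(p) for p in plot_names}
    return sorted(
        {
            name
            for name in measurement_names
            if name not in exact and normalize_name(name) in normalized
        }
    )
```

Names that still fail to match after this step are the ones worth inspecting by hand.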

In combination, there are 478 site/plot combinations in MEASUREMENTS that don't appear in PLOTS. I'd suggest let's first fix the problems above, and then worry about this. File attached though.
missing_site_plot_combos.txt

Mismatches between PLOTS and SITES records

There are 0 plots with no corresponding site record
There are 0 sites with no corresponding plot record

👍
