forc-db / forc
Global Forest Carbon Database
Home Page: https://forc-db.github.io/
License: Creative Commons Attribution 4.0 International
From a reviewer:
Biomass terms are used to label perhaps a majority of the records, but the labels used are highly ambiguous and need clarification. A significant fraction of the data records are labelled as “Biomass_ag” – what was measured? Just woody stems? Also understory? Grasses/mosses? “Biomass_total” – this is used for many records, but I doubt all components of aboveground and belowground biomass were actually assessed (particularly belowground, to max. soil depth). Again, while some additional info on this is given for a subset of the records in the supplementary METHODOLOGY table, I think instead the term in the MEASUREMENTS table used to describe the datum needs to be specific.
Please load the code used to make this figure:
Please make the following modifications:
Rearrange some of the categories:
- change "grass or shrubs" to just "grass"
- new category: 2TM, which includes 2TD and 2TE
- woody other/unclassified: for dominant lifeform = woody and dominant veg = 2TREE, NAC, or anything else
Remove the "dominant lifeform" label.
Change labels as follows: 2TEN: evergreen needleleaf; 2TDB: broadleaf deciduous; 2TEB: broadleaf evergreen; 2TDN: needleleaf deciduous; 2TM: broadleaf-needleleaf mix
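The relabeling rules above could be sketched as a small lookup. This is a Python illustration rather than the figure script itself (which is not shown here); the function name `figure_category` is hypothetical, and passing non-woody lifeforms through unchanged is an assumption. The "grass or shrubs" rename is left out because the exact value stored in the data is not given.

```python
# Labels requested above for the tree categories.
LABELS = {
    "2TEN": "evergreen needleleaf",
    "2TDB": "broadleaf deciduous",
    "2TEB": "broadleaf evergreen",
    "2TDN": "needleleaf deciduous",
    "2TM": "broadleaf-needleleaf mix",
}

# 2TD and 2TE are folded into the new 2TM category.
MERGED_INTO_2TM = {"2TD", "2TE"}


def figure_category(lifeform, dominantveg):
    """Map one record to the category label shown in the figure."""
    code = "2TM" if dominantveg in MERGED_INTO_2TM else dominantveg
    if code in LABELS:
        return LABELS[code]
    if lifeform == "woody":
        # 2TREE, NAC, or any other code on a woody record.
        return "woody other/unclassified"
    # Assumption: non-woody records keep their lifeform as the category.
    return lifeform
```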
@ValentineHerr, ForC does not currently contain Regentype and Managementtype (Measurements file) for extratropical sites. I have found a file where Maria had added these codes. Could you please merge these into the master? (I'm assuming this will be easy. If not, please alert me.)
I am finally getting to that part of the checklist, but it is bigger than I thought. Not sure I can get it done by the end of the day...
I think the first step would be to fill in Variable.codes in the metadata tables with the list of all levels of categories used in the tables. Then I can pull that and make sure that the category levels in the (future) tables already exist.
But that means that we might want to separate the descriptions of the variable codes from the codes themselves. I am thinking of the metadata of HISTTYPE. We would have "Establishment", "Regrowth(_prior)", "Disturbance(_prior)", "Management", "No.disturbance", "No.info" in Variable.codes and what they mean ("Disturbance - includes natural and anthropogenic disturbances that directly kill ...") in another field called something like Variable.codes.description.
Let me know what you think.
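A minimal sketch of the check proposed above, in Python rather than the project's R: once Variable.codes holds the allowed levels, any value in a table that is not in that list can be flagged. The HISTTYPE levels come from the comment above; reading "Regrowth(_prior)" and "Disturbance(_prior)" as two levels each is an assumption, and the dict layout, the function name `undefined_codes`, and the sample rows are hypothetical.

```python
# Allowed levels per field, as they might be pulled from Variable.codes.
VARIABLE_CODES = {
    "HISTTYPE": {
        "Establishment",
        "Regrowth", "Regrowth_prior",
        "Disturbance", "Disturbance_prior",
        "Management",
        "No.disturbance",
        "No.info",
    },
}


def undefined_codes(rows, field, allowed):
    """Return the sorted values of `field` in `rows` that are not in `allowed`."""
    return sorted({row[field] for row in rows} - set(allowed))


# Hypothetical rows; the second contains a typo that the check should catch.
sample_rows = [
    {"HISTTYPE": "Establishment"},
    {"HISTTYPE": "Distrubance"},
    {"HISTTYPE": "Management"},
]

flagged = undefined_codes(sample_rows, "HISTTYPE", VARIABLE_CODES["HISTTYPE"])
```

Keeping the codes in one field and their descriptions in another, as suggested, means the allowed set can be read directly without parsing free text.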
The entity relationship diagram is currently out of date and needs to be updated both for this site and our publication. An editable version is here. @ValentineHerr, it would be great if you could update this.
There are 0 plots with no corresponding history record
There are 0 history records with no corresponding plot record
👏
The ORNL DAAC has NPP estimates for boreal, temperate, and tropical forests.
At least some of these are already in ForC, and the files for tropical forests are here in our private repo. According to Maria's records, she has entered the data included in "(Clark et al., 2001, 2013) and https://daac.ornl.gov/" (see this doc); however, a spot check revealed that NPP data from Marafunga, Papua New Guinea are not included. Moreover, I do not believe we've ever downloaded temperate and boreal NPP data. It may be that she only loaded the subset in the Clark compilations.
There are some significant advantages to taking a collection of data and code, as here, and making it into an R package.
library(xxx)
as part of their analysis scripts, and packaging the ForC data in this way makes it super easy for people to use.
Anyway, something to consider!
Standardize variable names
Standardize field names (see #51)
Break out references list
Link sites and plots by unique ID numbers rather than full name
headers in metadata need some work (not column names of tables, just headers of metadata)
from a reviewer:
A previous data compilation that is listed as the source for nearly 10% of the records (Anderson et al. 2006) I could not find when I searched the Bibliography. This could be fixed simply by inserting the reference, or, more thoroughly, by breaking out the reference list (see issue #52).
Also, the Reference list given for “Measurements” in the Bibliography is in reverse alphabetical order. Simple fix.
There is no description of the fields loaded_from and loaded_by in the SITES table.
The notes field in the SITES table contains some notes describing the vegetation, and these are more appropriate for the vegnotes field in the MEASUREMENTS table.
Regentype (measurements table; populated based on plothistory data using @mmw590's R code) has not been populated for extratropical sites. Categories should be redefined to be more appropriate for extratropical sites.
Bibliography is currently out of date (includes only records from TropForC-db v. 1.0).
@teixeirak
I will use this issue to ask you a few things that I need your input with
In HISTORY percent.mortality there are 8 records with value "<<100%" and 184 with "<100%". I am ignoring them when calculating the range (in metadata), but I want to make sure you want to keep them (especially the "<<100%").
In HISTORY plothistoryID.v1 I get 5722 for the max and you had 2477. Is that normal?
In metadata of HISTORY levels I am replacing the range I find with NA. OK?
In MEASUREMENTS measurementID.v, I get max = 17172 while yours is 4797. Is that normal?
In MEASUREMENTS dupcode there is 1 "d". Should I replace it with "D"?
In MEASUREMENTS dupcode, should D/DC and DC/D be changed to be one or the other?
In MEASUREMENTS dupcode, should D/M and M/D be changed to be one or the other?
In metadata of PLOTS year.establishment.oldest.trees, why is the range "-" and not "##-##"?
Once all of the above is solved, can I remove the range column from all metadata files and keep only Min and Max?
Do we want to keep/rename the column indicates.site.that.lacks.info.direct.from.pub in SITES? It does not appear in the metadata.
In the paper, I get a different number of variables than you. Are you looking at primary only and "merging" Carbon and Biomass into individual variables for the count?
Can we really compare the number of geographic.area values between the new and old versions, since we are not using the same method? I get 249 for tropical sites (apparently it was 178 before).
I (and maybe the reader) need more guidance to fill in (and maybe retrieve) the numbers in the following sentence: "Of the 2,731 plot records, 1,206 (44%) contain records of dates of establishment of the oldest cohort of trees, ## (#%) contain records of initiation of a post-disturbance cohort, ## (#%) contain records for one or more major disturbances, ## (#%) contain records for one or more non-stand clearing disturbances, and ## (#%) contain records for one or more management events."
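For the first question, the "ignore non-numeric entries when computing the range" behaviour can be sketched as follows. This is a Python illustration, not the metadata script itself; `numeric_range` is a hypothetical name and the sample values are made up.

```python
import math


def numeric_range(values):
    """Return (min, max) over entries that parse as finite numbers.

    Entries such as "<100%", "<<100%", or missing-data codes are skipped.
    """
    numbers = []
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue
        if math.isfinite(number):
            numbers.append(number)
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))


# Hypothetical percent.mortality column mixing numbers and qualifiers.
mortality = ["0", "50", "<100%", "<<100%", "100", "NAC"]
mortality_range = numeric_range(mortality)
```

Skipping rather than converting the qualified values keeps Min and Max strictly numeric, which matters if the free-text range column is dropped later.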
Data entry is more cumbersome than it needs to be, and understanding the system involves a learning curve.
One idea to improve this is to create a single data entry spreadsheet and R code that can draw from that to update the database. The spreadsheet could include field descriptions and instructions for data entry. It could also allow for multiple measurement records or history events to be recorded quickly without requiring entry of duplicate information.
Metadata is currently out of date.
Reviewer comment:
annual GPP – many included records are for GPP (total forest photosynthesis), but this flux cannot be directly measured in the field (see Wehr et al. 2016 Nature; Clark et al. 2017 Biogeosciences). Estimating this flux is problematic. It is approached either by model simulations, by summing a subset of the field biometric components, or by complex inferences made from eddy-covariance estimates of forest net CO2 exchange. The explanations of methods for a subset of these GPP records that are in the Methodology table are useful in those cases, but for many (most?) of the GPP records, there is no information given on the underlying methods. It is clear, however, from the limited information, that many of these values are based purely on modelling, with no direct measurements made in the field.
Response:
Although estimation of GPP is problematic, we feel strongly that it is an important variable that should be included in this database (and, of course, it is not the only variable with problematic methods). However, ForC should not include modeled values, and we removed all records based on modeling (see issue #59).
For BDFFP sites, plotarea appears to be fragment size. Depending on sampling design, this may or may not be appropriate. Specifically, if the entire area of each fragment was thoroughly sampled or systematically subsampled, it is appropriate that plotarea=fragment size. However, if a single plot of defined size was located somewhere within the fragment, that is the plot size that should be recorded.
In order to improve database structure and reconstruct the record of which values have been checked against original publications, we want to:
Record which pdfs we have (in order to know which values have been checked):
Break out references records, creating a separate data table for references
Previous versions of this database contained "managementtype" and "regentype" fields, which broadly categorized forests based on their management status and disturbance/regeneration category (e.g., natural regeneration following agriculture, plantation).
These were deleted because they appeared to have become corrupted and because of lack of transparency/reproducibility in how they were calculated (produced by script, fixed by hand). For details, see issue #21.
It would be nice to reinstate or recreate these. The best option is to write new script to produce these based on the PLOTS and/or HISTORY files and plot names. Plot names often contain relevant information.
Please load the script used to make this figure.
Then, update as follows:
From a reviewer:
Another important need is for each datum to be accompanied by a flag that indicates whether the value is (a) based on direct forest-based observations (e.g., NEE values from eddy covariance, fine-litterfall measurements), or is instead (b) partially or totally based on modelling or other non-direct methods (as is the case for a large number of the records here, per the METHODOLOGY table; examples there are “BGC-model” and “GPP-Re”).
Response:
In response to this comment, we seriously considered adding a field to indicate the measurement type, but decided against it. The reason is that it would be difficult to classify methods into discrete categories, as method types are really more of a spectrum. The two given examples of direct forest-based observations (NEE, fine litterfall) both generally require some type of gap filling or interpolation to produce an annual estimate. On the other end of the spectrum, there’s a wide variety of methods combining scaling/gap-filling/modeling techniques to extrapolate from empirically measured values to annual ecosystem carbon budgets. Rather than attempt to classify into discrete categories, we feel that it is most appropriate to continue presenting this information in the methodology table. It is our philosophy that responsible use of the database requires some familiarity with carbon cycle measurement methods, and when specific methodology is important, users can refer to the methodology table and original publications, if necessary. To ensure that the database documentation addresses the issue of variation in the directness of measurement vs relying on some type of modeling, we have inserted text in our discussion of major factors affecting data quality along with recommendations to users as to how to appropriately screen the data ().
We do agree that the database should not contain purely modeled values, and our general philosophy while compiling the database has been to not include modeled values. However, the Luyssaert database, which was incorporated into ForC, did include some modeled values, and these methods were recorded in the METHODOLOGY table. We therefore reviewed all of the records for which entries in the METHODOLOGY table indicated that they were modeled. Modeled values, or values whose methodology could not easily be traced (i.e., would require tracing back through multiple publications), were removed from the database. Those that were scaled or extrapolated from site-specific measurements were retained, and the methodology was clarified (METHODOLOGY table).
Issue raised by a reviewer:
"The citation information given for the data records (MEASUREMENTS table) indicates that most of the data presented here were obtained by the data-set authors from secondary compilations, not from the original publication of each study. The Disclaimers section of the Metadata (lines 383-388) indicates that the many derived (secondarily-cited) data were not checked against the original source publications. The authors actually warn potential users here that there are likely to be errors and to please help in the quality control by alerting these authors of such errors.
Users of such a data compilation should be able to rely on the accuracy of the data presented. That is clearly not the current case. All the secondarily-cited data in this data set need to be checked by the data-set authors against the original source publications. Such work is tedious and demanding, and in some cases (e.g., Chinese productivity literature), language issues will make it highly challenging. I would argue, however, that no record should be included in a data compilation such as this one, unless the data-set authors have checked the numbers and associated information against the corresponding original publications. "
Response:
**We appreciate the concern for data quality; however, we respectfully disagree with the philosophy that all of these records must be checked against original data publications to be of value to the scientific community. ....
We recognize the highest data quality standards require double checking reported values against original publications. We previously did not have a field for recording when values had been checked against original publications. To provide this structure going forward, we have added the field checked.ori.pub to the Measurements table. Unfortunately, this field does not record records that were double checked prior to February 2018 (with the exception of a few whose verification had been noted in the notes field).
Regarding associated information, we have made a point of documenting the reason for incomplete data in the database, which is accomplished through the use of different missing data codes. The “NAC” code indicates data for which we have made no effort to obtain the information from the original publication. Thus, researchers who would benefit from certain information can easily identify records that should be checked in the original publication.**
from a reviewer: There are many oddities among the records included with this age designation, including monospecific plantations (e.g., 15 records from La Selva) and many logged S. American forests (diverse sites).
Updates to TropForC measurements, sites, and plothistory files made after the original publication have now been posted in this public repository. However, there may be additional updates to supporting files (e.g., allometries) that have not yet been posted.
Please load the script used to create this figure:
Modify as follows:
There may be some records where stands of different age have the same site and plot name. As every plot should be differentiated based on the site and plot names only, this would be incorrect.
Note, however, that when the same plot is repeatedly measured, site and plot name should stay the same while stand age changes. *Problematic records would be those with the same site name, plot name, and measurement date but different ages.* There may be some where the measurement date is unknown, in which case this can't be assessed. Let's assess whether the problem exists.
If we do find cases where this occurs, could you please redo all of those plot names by adding '_age##' to the end of the current name?
For example (note that this example is a case that might not be wrong--measurement date not acquired):
Site | Plot | Age |
---|---|---|
Honghuaerji | Pinus sylvestris var. mongolica | 81 |
Honghuaerji | Pinus sylvestris var. mongolica | 86 |
becomes:
Site | Plot | Age |
---|---|---|
Honghuaerji | Pinus sylvestris var. mongolica_age81 | 81 |
Honghuaerji | Pinus sylvestris var. mongolica_age86 | 86 |
We'll need to add any new plots to ForC_history (and from there ForC_plots). If you confirm that there are some plots with this problem, please generate records for ForC_history as follows:
historyID | sites.sitename | plot.name | plotarea | event.sequence | date | dateloc | distcat | disttype | level | units | percent.mortality | distnotes | plothistoryID.v1 | tropical_extratropical |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NA | Honghuaerji | Pinus sylvestris var. mongolica_age81 | NAC | 1 | NAC | 9 | No.Info | NAC | NA | NA | NA | NA | NA | extratropical |
I will then compare with existing records and either merge by hand or provide instructions for merging via script.
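The detection and renaming steps described above could be sketched like this. It is a Python illustration under stated assumptions: the record keys (`site`, `plot`, `date`, `age`), the set of codes treated as "date unknown", and the sample rows are all hypothetical, and records with an unknown date are skipped because they cannot be assessed.

```python
from collections import defaultdict

# Assumed placeholders meaning "measurement date unknown".
UNKNOWN_DATES = {"NAC", "NA", "NI", ""}


def conflicting_keys(records):
    """Find (site, plot, date) keys that carry more than one stand age."""
    ages = defaultdict(set)
    for record in records:
        if record["date"] in UNKNOWN_DATES:
            continue  # cannot be assessed without a measurement date
        key = (record["site"], record["plot"], record["date"])
        ages[key].add(record["age"])
    return {key for key, seen in ages.items() if len(seen) > 1}


def rename_conflicts(records):
    """Return copies of the records with '_age##' appended where needed."""
    bad = conflicting_keys(records)
    renamed = []
    for record in records:
        updated = dict(record)
        if (record["site"], record["plot"], record["date"]) in bad:
            updated["plot"] = f'{record["plot"]}_age{record["age"]}'
        renamed.append(updated)
    return renamed


# Hypothetical rows: the first two conflict, the third has no date.
rows = [
    {"site": "Site A", "plot": "pine stand", "date": "2005", "age": 81},
    {"site": "Site A", "plot": "pine stand", "date": "2005", "age": 86},
    {"site": "Site A", "plot": "pine stand", "date": "NAC", "age": 90},
]
fixed = rename_conflicts(rows)
```

The renamed plot names are the ones that would then need new ForC_history records.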
Taylor et al. 2017 compiled tropical ANPP estimates, and data are published here. This effort was independent of ours. There's probably a fair amount of overlap, but it looks like they have some records we don't.
Standardize and polish column names across tables, metadata, scripts and paper.
I believe this would really improve the impression the database makes on future users, who might otherwise take a lack of attention to detail as a bad sign about its quality.
Maybe this can be done after submission of paper but before it gets published?
This will be Fig. 1 in the data paper.
Here's what we want:
Histogram showing years in which data were collected or originally published. For records lacking information on the year in which measurements were made, we substitute year of the data’s original publication. If this is unknown (## records), the record is excluded from this histogram.
Shade black when we have the year of measurement, grey for year of publication.
Please include records with start/end date (as opposed to just date)—use end date.
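The year-selection rule for the histogram can be sketched as below, in Python rather than the project's plotting script. The field names `date`, `end_date`, and `pub_year`, the use of `None` for missing values, and the sample records are assumptions made for illustration.

```python
from collections import Counter


def histogram_year(record):
    """Return (year, shade) for one record, or None when no year is known."""
    # Prefer a single measurement date, then the end of a start/end range.
    for field in ("date", "end_date"):
        if record.get(field) is not None:
            return record[field], "black"
    # Fall back to the year of original publication.
    if record.get("pub_year") is not None:
        return record["pub_year"], "grey"
    return None


# Hypothetical records covering each case.
records = [
    {"date": 1998, "end_date": None, "pub_year": 2001},
    {"date": None, "end_date": 2004, "pub_year": 2006},
    {"date": None, "end_date": None, "pub_year": 1987},
    {"date": None, "end_date": None, "pub_year": None},
]

binned = Counter()
excluded = 0
for record in records:
    result = histogram_year(record)
    if result is None:
        excluded += 1  # this count fills the "## records" in the caption
    else:
        binned[result] += 1
```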
It would be good to have a script here (public database) that can be run whenever there is a substantial update to the database to check that the structure of the database remains correct and that there are no egregious errors in values.
Here's the start of a list of things to check:
Variable | Table defined | Field(s) used |
---|---|---|
PFTcode | PFT | MEASUREMENTS:dominantveg |
distcat | DISTTYPE | HISTORY:distcat |
disttype | DISTTYPE | HISTORY:distcat, PLOTS(various) |
(THIS TABLE NEEDS TO BE COMPLETED BASED ON THE ENTITY RELATIONSHIP DIAGRAM, which needs to be updated.)
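One way such a check script could be organised is to keep the relationships in a list and loop over them. The sketch below is in Python rather than R and uses only the first two rows of the table above; the in-memory `tables` layout, the function name `run_checks`, and the sample values are hypothetical.

```python
# (child table, child field, parent table, parent field)
CHECKS = [
    ("MEASUREMENTS", "dominantveg", "PFT", "PFTcode"),
    ("HISTORY", "distcat", "DISTTYPE", "distcat"),
]


def run_checks(tables, checks):
    """Report, per relationship, how many child records use undefined codes."""
    report = []
    for child, child_field, parent, parent_field in checks:
        defined = {row[parent_field] for row in tables[parent]}
        missing = [row for row in tables[child] if row[child_field] not in defined]
        report.append(
            f"There are {len(missing)} {child} records with undefined {child_field}"
        )
    return report


# Tiny placeholder tables; real data would be read from the ForC CSV files.
tables = {
    "PFT": [{"PFTcode": "2TEN"}, {"PFTcode": "2TDB"}],
    "DISTTYPE": [{"distcat": "Disturbance"}, {"distcat": "Management"}],
    "MEASUREMENTS": [
        {"measurementID": 1, "dominantveg": "2TEN"},
        {"measurementID": 2, "dominantveg": "NAC"},
    ],
    "HISTORY": [{"historyID": 1, "distcat": "Management"}],
}
report = run_checks(tables, CHECKS)
```

Adding a relationship then only requires a new entry in `CHECKS` once the diagram is updated.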
There are 16 measurement records with undefined PFTs
measurementID dominantveg
<int> <chr>
1 1015 NAC
2 1016 NAC
3 1017 NAC
4 1677 NAC
5 2499 NAC
6 2503 NAC
7 2506 NAC
8 5524 NAC
9 5525 NAC
10 7821 NAC
11 7822 NAC
12 7823 NAC
13 7922 NAC
14 7923 NAC
15 8083 NAC
16 2654 NAC
Similar to #31, should NAC and NI entries in MEASUREMENTS be ignored here?
As a first step for most data analyses and to facilitate use of the database, I'd like to create a script that does some preliminary data manipulation and outputs a single spreadsheet containing the fields that will most commonly be used in analysis.
Script should do the following:
Fields to be contained in this table will be defined through issue #69.
@teixeirak would you like to fill in the range for the variables that are not used in MEASUREMENTS but that might be added in the future? This would be so that they are not flagged when someone enters new records.
If you want to do that, you can do it here and I can add it myself to the VARIABLES file.
Here is the list of variables I am talking about:
@teixeirak
weird date formatting in plothistory, perhaps due to autoformatting in Excel?
needs to be unambiguous, ideally in YYYY-MM-DD format.
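A small validator can list the entries that are not in YYYY-MM-DD form so they can be corrected by hand. This is a Python sketch; treating NAC, NI, and NA as acceptable placeholders is an assumption, and the sample values are invented to resemble spreadsheet autoformatting.

```python
from datetime import datetime

# Assumed placeholders that should not be flagged as bad dates.
MISSING_CODES = {"NAC", "NI", "NA"}


def is_iso_date(text):
    """True only for a real calendar date written exactly as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        return False
    # Round-trip to reject unpadded forms such as 2001-7-4.
    return parsed.strftime("%Y-%m-%d") == text


def badly_formatted(dates):
    """Return the entries that are neither ISO dates nor missing-data codes."""
    return [d for d in dates if d not in MISSING_CODES and not is_iso_date(d)]


sample_dates = ["2001-07-04", "7/4/01", "Jul-01", "2001-02-30", "NAC", "2001-7-4"]
flagged_dates = badly_formatted(sample_dates)
```

The sketch only flags problems; it does not guess what an ambiguous value such as "7/4/01" was meant to be.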
There are 6474 measurement records with no corresponding methodology record
This is because three values in MEASUREMENTS$method_id don't appear in METHODOLOGY: NI, NAC, and 436.
I assume the first two are a string matching issue--how should this be handled? Do we ignore MEASUREMENTS entries with method_id equal to NI or NAC?
The last one, 436, might be a genuine problem?
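If the answer is to ignore the missing-data codes, the check could exclude them and tally what remains, so that genuine gaps such as 436 stand out. A Python sketch, with a hypothetical function name and made-up rows:

```python
from collections import Counter

# Missing-data codes that would be excluded from the check (an assumption
# pending the answer to the question above).
MISSING_CODES = {"NI", "NAC"}


def unmatched_method_ids(measurements, methodology, ignore=MISSING_CODES):
    """Count measurement records per method_id that METHODOLOGY lacks."""
    known = {row["method_id"] for row in methodology}
    return Counter(
        row["method_id"]
        for row in measurements
        if row["method_id"] not in known and row["method_id"] not in ignore
    )


# Hypothetical rows.
methodology = [{"method_id": "12"}]
measurements = [
    {"method_id": "12"},
    {"method_id": "NI"},
    {"method_id": "NAC"},
    {"method_id": "436"},
    {"method_id": "436"},
]
unmatched = unmatched_method_ids(measurements, methodology)
```

Comparing IDs as strings avoids a mismatch between a numeric 436 in one table and a text "436" in the other.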
@ValentineHerr, I modified variable names and ordering according to some conventions that developed (see docs here). I'd like your help updating the master branch with these names.
First, please review the variable name conventions for compliance with database best practices. Let me know if there's anything you think we should change.
Update ForC_variables: I made changes in several fields, so please use this file as the new master, but first delete the field with old variable names. You'll also need to update variable names throughout the document, specifically in the equations, associate covariates, and notes fields.
Update ForC_measurements: update both variable names and associated covariate names
Update metadata and entity relationship diagram: changes include a new field in ForC_variables (id number) and changes to categorical variables in variable.type ("primary" is replaced with "stock" and "flux")
@mmw590, In the measurements file, there are differences between Aragao 2009 records in this master version of TropForC and the original spreadsheet (2ForC, in the private repository). I'm not sure which is the correct version. Could you please take a look?
Quite a few records have been removed from the database (issue #59). We'll need to re-run:
Could you please work through as much of this as possible before your time off, in that order?
@ValentineHerr, I just reviewed the ranges of variables and had to make some updates to measurements that will affect our statistics. Here's the commit: 1f5e1bd. Specifically, we will need to update measurements metadata, numbers of records in the variables table (min and max should be okay), and some of the statistics in the paper (only those pertaining to measurements file). Could you please rerun these?
FYI, the ranges for all other fields were fine.
1- All FAO ecozones to fill are for Klamath sites, right? (Temperate Mountain system)
2- There are 6 sites where I get a different FAO than the one already in the data base. I am guessing that was edited by hand at some point but I want to make sure.
Here is the list (sites.sitename: WAS --> I FOUND)
@mmw590, I added entries in the plot history for all the previously missing sites/plots. There are quite a few that have age data in the measurements file. If you have code to calculate their regeneration dates based on age and date of measurement, it would be great to add those dates to the measurements file. See notes therein.
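The calculation referred to above is a subtraction of stand age from the year of measurement. A minimal Python sketch, assuming both are expressed in years and that missing values arrive as `None` (the function name and the example numbers are hypothetical):

```python
def regeneration_year(measurement_year, stand_age):
    """Estimate the year regeneration began, or None if either input is missing."""
    if measurement_year is None or stand_age is None:
        return None
    return measurement_year - stand_age


# A stand that was 81 years old when measured in 2005.
example = regeneration_year(2005, 81)
```

When the same plot has several age/date pairs, the estimates should agree; a disagreement would point to an entry error.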
@teixeirak,
Allometric equation #15 is missing in ALLOMETRY but referred to in MEASUREMENTS (allometry column).
First, there are 30 sites.sitename values that appear in MEASUREMENTS that don't appear in PLOTS. These are all sites with apostrophes in the name, which sounds like it is due to a known bug in MATLAB or the script you're using.
Second, there are 58 (but only 5 unique) plot names in MEASUREMENTS that don't appear in PLOTS.
measurementID | plot.name |
---|---|
2654 | Eucalypt open-forest savanna |
6488 | birch stand_age55 |
6489 | birch stand_age55 |
6032 | EMS Tower (HFR1) _age92 |
6045 | EMS Tower (HFR1) _age92 |
1859 | Mature lowland 'wet' perfmafrost-free spruce forest |
(lots more) | Mature lowland 'dry' perfmafrost-free spruce forest |
136 | mature 'old growth' forest |
(lots more) | mature 'old growth' forest |
(Note I haven't run any trim whitespace or capitalization fixes on these data, so could be something extremely trivial that's causing the match failure.)
In combination, there are 478 site/plot combinations in MEASUREMENTS that don't appear in PLOTS. I'd suggest we first fix the problems above, and then worry about this. File attached though.
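To tell trivial mismatches (whitespace, capitalization) from genuine ones, the names can be compared a second time after normalization. A Python sketch; the function names and the sample plot names are hypothetical.

```python
def normalize(name):
    """Collapse whitespace runs, trim, and lowercase a site or plot name."""
    return " ".join(name.split()).lower()


def classify_mismatches(measurement_names, plot_names):
    """Split unmatched names into (trivial, genuine) lists.

    A mismatch is trivial when it disappears after normalization.
    """
    exact = set(plot_names)
    loose = {normalize(name) for name in plot_names}
    trivial, genuine = [], []
    for name in sorted(set(measurement_names) - exact):
        if normalize(name) in loose:
            trivial.append(name)
        else:
            genuine.append(name)
    return trivial, genuine


# Hypothetical names: one differs only by case and spacing, one is absent.
plots = ["birch stand_age55", "old field"]
measured = ["Birch  stand_age55 ", "old field", "savanna plot"]
trivial, genuine = classify_mismatches(measured, plots)
```

Only the names in `genuine` would then need to be traced back to the apostrophe bug or to missing PLOTS records.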
missing_site_plot_combos.txt
There are 0 plots with no corresponding site record
There are 0 sites with no corresponding plot record
👍