Comments (7)
My concern with any alternate term is that users already have a concept of the term "duplicate". I agree that the way you would think about it depends on usage. In most statistical analyses (such as SDM), you do not want occurrence records weighted by 'duplicates'.
Again, the key issue here is that the test FLAGS a POTENTIAL duplicate and it is then up to the user to decide what if anything needs to be done depending on their purpose (FFU).
Our wording of the description would seem fitting...
"The record appears to be a duplicate based on a combination of taxon name and/or collector and/or collector number and/or locality and/or date/time".
I am therefore happy to leave this as is.
from bdq.
Context could/should sort this out. In the first case, records that duplicate a set of values for relevant fields, the "status" of being a duplicate can't be a property of the record independently of its context; it's a property of the record (and at least one other) within a data set, i.e., a subset. For that context (a particular dataset) then, would "duplicate_records" work? Is this necessary? Is the converse test more appropriate: all records in this dataset unique -- tested and passed?
For the duplicate specimen case, "duplicate_specimen of X" is something that could stay with the duplicate. But some additional details might be needed to nail down how far duplicate extends -- same "biological individual", same collection date?
Thanks Lee and Stan. You have convinced me that "duplicate" will work in context. Lee has explained that the description of the test in the Tests and Assertions places the test in context. I will check on the Vocabulary we are developing and make sure our definition uses context.
I will leave the issue open for a while to see if anyone else wants to comment.
Indeed, to determine that a record is duplicated (or unique - DQ Dimension, in the conceptual framework context) we have to consider the dataset context, as mentioned by Stan, but it also depends on the Use Case context, as mentioned by Lee. For example, in a given context (A), users may want to identify whether there is another record in the dataset that has the same species name and occurs on the same exact date at the same exact coordinates. But, in another context (B), users may want to identify the records in the dataset that have the same species name and occur in the same year and in the same area/cell of 10 km x 10 km (just naive examples, probably not realistic).
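To make contexts (A) and (B) concrete, here is a minimal Python sketch. It is only an illustration, not part of the test definition: the Darwin Core field names (scientificName, eventDate, decimalLatitude, decimalLongitude), the function names, the sample records, and the 0.1 degree grid used as a crude stand-in for a 10 km cell are all my own assumptions.

```python
import math
from collections import defaultdict

def key_context_a(rec):
    """Context (A): same species name, exact date, exact coordinates."""
    return (rec["scientificName"], rec["eventDate"],
            rec["decimalLatitude"], rec["decimalLongitude"])

def key_context_b(rec, cell_deg=0.1):
    """Context (B): same species name, same year, same grid cell.

    cell_deg=0.1 is an assumed, rough approximation of a 10 km cell;
    a real implementation would use a projected grid.
    """
    return (rec["scientificName"], rec["eventDate"][:4],
            math.floor(rec["decimalLatitude"] / cell_deg),
            math.floor(rec["decimalLongitude"] / cell_deg))

def group_potential_duplicates(records, key):
    """Return lists of record indexes that share the same key.

    Records are only FLAGGED as potential duplicates; nothing is removed.
    """
    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[key(rec)].append(i)
    return [idxs for idxs in groups.values() if len(idxs) > 1]

# Made-up sample records, for illustration only.
records = [
    {"scientificName": "Puma concolor", "eventDate": "2019-05-01",
     "decimalLatitude": -23.55, "decimalLongitude": -46.63},
    {"scientificName": "Puma concolor", "eventDate": "2019-05-01",
     "decimalLatitude": -23.55, "decimalLongitude": -46.63},
    {"scientificName": "Puma concolor", "eventDate": "2019-11-20",
     "decimalLatitude": -23.52, "decimalLongitude": -46.61},
    {"scientificName": "Puma concolor", "eventDate": "2020-01-15",
     "decimalLatitude": -10.0, "decimalLongitude": -50.0},
]

print(group_potential_duplicates(records, key_context_a))
print(group_potential_duplicates(records, key_context_b))
```

With these sample records, context (A) flags only the two identical records, while context (B) also pulls in the third record from the same year and cell. The same dataset gives different "duplicates" depending on the Use Case.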
One way to measure it, using the framework, would be 1/(D+1), where D is the number of other records in the dataset that are "duplicated/similar" according to definition (A) or (B). Ex.: if a record has another 3 similar records in the dataset, its "uniqueness" measure (specifically in this Use Case context) would equal 1/(3+1) = 0.25. But if the same context requires "uniqueness" = 1 for a record to be considered fit for use, for example, then this record is not fit for use in the context of the dataset and the Use Case.
This example relates to a DQ measure assigned to records, but based on those measures we could generate a DQ measure assigned to the dataset/multi records. Ex.: a dataset has uniqueness equal to 0.87, where this measure is the average of the "uniqueness" measures of the records in the dataset. So, if all the records have uniqueness equal to 1, the dataset would have uniqueness equal to 1 too; otherwise, DQ enhancement actions would be necessary to increase this measure, e.g. remove/filter records that don't have "uniqueness" equal to 1.
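A rough sketch of the two measures in Python, assuming each record has already been reduced to a comparison key under definition (A) or (B). The function names and the sample keys are hypothetical, chosen just for this illustration.

```python
from collections import Counter

def record_uniqueness(keys):
    """Per-record uniqueness 1/(D+1).

    D is the number of OTHER records sharing the same key,
    so D + 1 is simply the size of the record's group.
    """
    counts = Counter(keys)
    return [1 / counts[k] for k in keys]

def dataset_uniqueness(keys):
    """Dataset-level uniqueness: average of the per-record measures."""
    if not keys:
        raise ValueError("dataset has no records")
    scores = record_uniqueness(keys)
    return sum(scores) / len(scores)

# Made-up keys: four records share key "a", one record has key "b".
keys = ["a", "a", "a", "a", "b"]

print(record_uniqueness(keys))   # [0.25, 0.25, 0.25, 0.25, 1.0]
print(dataset_uniqueness(keys))  # 0.4
```

Each of the four "a" records has 3 similar records, so each gets 1/(3+1) = 0.25, matching the example above. A dataset where every key is distinct gets a dataset uniqueness of 1.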
At this stage it seems to be OK to use Duplicate; however, we must make sure that the definition mentions context. I will now close this discussion. Thank you all for participating.