Comments (7)

Tasilee commented on August 24, 2024

My concern with any alternate term is that users already have a concept attached to the term "duplicate". I agree that the way you think about it depends on usage. In most statistical applications (such as SDM), you do not want occurrence records weighted by 'duplicates'.

Again, the key issue here is that the test FLAGS a POTENTIAL duplicate and it is then up to the user to decide what, if anything, needs to be done, depending on their purpose (fitness for use).

Our wording of the description seems fitting:

"The record appears to be a duplicate based on a combination of taxon name and/or collector and/or collector number and/or locality and/or date/time".

I am therefore happy to leave this as is.
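Purely as an illustration of the kind of flagging described above (not the normative bdq test specification), here is a minimal Python sketch that groups records on a combination of those fields and flags every record whose key occurs more than once. The Darwin Core-style field names and the simple normalisation are assumptions of the example.

```python
# Illustrative sketch only: flag POTENTIAL duplicates by grouping records on a
# combination of taxon name, collector, collector number, locality and date.
# Field names follow Darwin Core conventions but are assumptions of this example,
# not the normative bdq test specification.
from collections import defaultdict

KEY_FIELDS = ("scientificName", "recordedBy", "recordNumber", "locality", "eventDate")

def _normalise(value):
    """Case-fold and strip whitespace so trivial formatting differences do not hide matches."""
    return (value or "").strip().lower()

def flag_potential_duplicates(records):
    """Return the set of record ids that share a key with at least one other record."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(_normalise(rec.get(f)) for f in KEY_FIELDS)
        groups[key].append(rec["id"])
    flagged = set()
    for ids in groups.values():
        if len(ids) > 1:          # more than one record with the same key
            flagged.update(ids)   # flag them all; the user decides what to do
    return flagged

# Example usage
records = [
    {"id": 1, "scientificName": "Acacia dealbata", "recordedBy": "L. Belbin",
     "recordNumber": "123", "locality": "Hobart", "eventDate": "2020-01-15"},
    {"id": 2, "scientificName": "acacia dealbata", "recordedBy": "L. Belbin",
     "recordNumber": "123", "locality": "Hobart", "eventDate": "2020-01-15"},
    {"id": 3, "scientificName": "Acacia dealbata", "recordedBy": "A. Chapman",
     "recordNumber": "77", "locality": "Ballan", "eventDate": "2019-05-02"},
]
print(flag_potential_duplicates(records))  # {1, 2}
```

Note that this only flags potential duplicates; deciding what, if anything, to do with them remains a user decision, as described above.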

stanblum commented on August 24, 2024

Context could/should sort this out. In the first case (records that duplicate a set of values for relevant fields), the "status" of being a duplicate can't be a property of the record independently of its context; it's a property of the record (and at least one other) within a dataset, i.e., a subset. For that context (a particular dataset), then, would "duplicate_records" work? Is this necessary? Or is the converse test more appropriate: all records in this dataset are unique -- tested and passed?

For the duplicate specimen case, "duplicate_specimen of X" is something that could stay with the duplicate. But some additional details might be needed to nail down how far duplicate extends -- same "biological individual", same collection date?
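As a sketch of the "converse test" suggested above (all records in this dataset are unique on the chosen fields, tested and passed), assuming the same illustrative key fields as in the earlier sketch; the function name is hypothetical.

```python
# Illustrative sketch of the converse test: PASS when no two records in the dataset
# share the same combination of key fields. Field names are assumptions of the example.
def all_records_unique(records, key_fields=("scientificName", "recordedBy",
                                            "recordNumber", "locality", "eventDate")):
    keys = [tuple((rec.get(f) or "").strip().lower() for f in key_fields)
            for rec in records]
    return len(keys) == len(set(keys))  # True -> "tested and passed" for this dataset
```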

ArthurChapman commented on August 24, 2024

Thanks Lee and Stan. You have convinced me that "duplicate" will work in context. Lee has explained that the description of the test in the Tests and Assertions places the test in context. I will check the Vocabulary we are developing and make sure our definition takes that context into account.

I will leave the issue open for a while to see if anyone else wants to comment.

ArthurChapman commented on August 24, 2024

allankv commented on August 24, 2024

Indeed, to determine that a record is duplicated (or unique, the DQ Dimension in the conceptual framework context), we have to consider the dataset context, as mentioned by Stan, but it also depends on the Use Case context, as mentioned by Lee. For example, in a given context (A), users may want to identify whether there is another record in the dataset that has the same species name and occurs on the exact same date and at the exact same coordinates. But in another context (B), users may want to identify the records in the dataset that have the same species name and occur in the same year and in the same area/cell of 10 km x 10 km (just naive examples, probably not realistic).

One way to measure it, using the framework, would be 1/(D+1), where D is the number of records "duplicated/similar" according to the (A) or (B) definition. E.g.: if a record has another 3 similar records in the dataset, the "uniqueness" measure (specifically in this Use Case context) would equal 1/(3+1) = 0.25. But if, in the same context, a "uniqueness" of 1 were required for the record to be considered fit for use, for example, then this record would not be fit for use in the context of that dataset and that Use Case.

This example concerned a DQ measure assigned to individual records, but based on those measures we could generate a DQ measure assigned to the dataset/multi-record level. E.g.: a dataset has a uniqueness of 0.87, obtained as the average of the "uniqueness" measure of each record in the dataset. So, if all the records have a uniqueness of 1, the dataset would have a uniqueness of 1 too; otherwise, DQ enhancement actions would be necessary to increase this measure, e.g. removing/filtering records that do not have a "uniqueness" of 1.
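A minimal sketch of the two measures described above, assuming a caller-supplied is_similar predicate that stands in for a Use Case-specific (A) or (B) definition of "duplicated/similar"; the function and parameter names are hypothetical, not part of the framework itself.

```python
# Illustrative sketch of the per-record "uniqueness" measure 1/(D+1) and the
# dataset-level average built from it. `is_similar` stands in for a Use Case-specific
# (A) or (B) definition of "duplicated/similar"; all names here are hypothetical.
def record_uniqueness(record, dataset, is_similar):
    """1/(D+1), where D counts the OTHER records judged similar to `record`."""
    d = sum(1 for other in dataset if other is not record and is_similar(record, other))
    return 1.0 / (d + 1)

def dataset_uniqueness(dataset, is_similar):
    """Dataset-level measure: the mean of the per-record uniqueness values."""
    if not dataset:
        return 1.0
    return sum(record_uniqueness(r, dataset, is_similar) for r in dataset) / len(dataset)

# A record with 3 similar records in the dataset gets 1/(3+1) = 0.25, as in the
# example above; filtering out records with uniqueness < 1 would raise the
# dataset-level measure toward 1.
```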

qgroom commented on August 24, 2024

ArthurChapman commented on August 24, 2024

At this stage it seems to be OK to use "duplicate"; however, we must be careful to make sure that the definition mentions context. I will now close this discussion. Thank you all for participating.
